{"title": "Extended Regularization Methods for Nonconvergent Model Selection", "book": "Advances in Neural Information Processing Systems", "page_first": 228, "page_last": 235, "abstract": "", "full_text": "Extended Regularization Methods for \n\nN onconvergent Model Selection \n\nW. Finnoff, F. Hergert and H.G. Zimmermann \nSiemens AG, Corporate Research and Development \n\nOtto-Hahn-Ring 6 \n\n8000 Munich 83, Fed. Rep. Germany \n\nAbstract \n\nMany techniques for model selection in the field of neural networks \ncorrespond to well established statistical methods. The method \nof 'stopped training', on the other hand, in which an oversized \nnetwork is trained until the error on a further validation set of ex(cid:173)\namples deteriorates, then training is stopped, is a true innovation, \nsince model selection doesn't require convergence of the training \nprocess. \nIn this paper we show that this performance can be significantly \nenhanced by extending the 'non convergent model selection method' \nof stopped training to include dynamic topology modifications \n(dynamic weight pruning) and modified complexity penalty term \nmethods in which the weighting of the penalty term is adjusted \nduring the training process. \n\n1 \n\nINTRODUCTION \n\nOne of the central topics in the field of neural networks is that of model selection. \nBoth the theoretical and practical side of this have been intensively investigated and \na vast array of methods have been suggested to perform this task. A widely used \nclass of techniques starts by choosing an 'oversized' network architecture then either \nremoving redundant elements based on some measure of saliency (pruning), adding a \nfurther term to the cost function penalizing complexity (penalty terms), and finally, \nobserving the error on a further validation set of examples, then stopping training \nas soon as this performance begins to deteriorate (stopped training). 
The first \ntwo methods can be viewed as variations of long-established statistical techniques, \ncorresponding in the case of pruning to specification searches, and with respect to \npenalty terms to regularization or biased regression. \n\nThe method of stopped training, on the other hand, seems to be one of the true \ninnovations to come out of neural network research. Here, the model chosen doesn't \nrequire the training process to converge; rather, the training process is used to perform a directed search of weight space to find a model with superior generalization \nperformance. Recent theoretical ([B,C,91], [F,91], [F,Z,91]) and empirical results \n([H,F,Z,92], [W,R,H,90]) have provided strong evidence for the efficiency of stopped \ntraining. In this paper we will show that generalization performance can be further enhanced by expanding the 'nonconvergent method' of stopped training to \ninclude dynamic topology modifications (dynamic pruning) and modified complexity penalty term methods in which the weighting of the penalty term is adjusted \nduring the training process. Here, the empirical results are based on an extensive sequence of simulation examples designed to reduce the effects of domain dependence \non the performance comparisons. \n\n2 CLASSICAL MODEL SELECTION \n\nClassical model selection methods are generally divided into a number of steps \nthat are performed independently. The first step consists of choosing a network \narchitecture; then either an objective function (possibly including a penalty term) \nis chosen directly, or, in a Bayesian setting, prior distributions on the elements of \nthe data generating process (noise, weights in the model, regularizers, etc.) are \nspecified from which an objective function is derived. 
Next, using the specified \nobjective function, the training process is started and continued until a convergence \ncriterion is fulfilled. The resulting parametrization of the given architecture is then \nplaced in a 'pool' from which a final model will be selected. \n\nThe next step can consist of a modification of the network architecture (for example by pruning weights/hidden-neurons/input-neurons), or of the penalty term (for \nexample by changing its weighting in the objective function) or of the Bayesian \nprior distributions. The last two modifications then result in a modification of \nthe objective function. This establishes a new framework for the training process, \nwhich is then restarted and continued until convergence, producing another model \nfor the pool. This process is iterated until the model builder is satisfied that the \npool contains a reasonable diversity of candidate models, which are then compared \nwith one another using some estimator of generalization ability (for example, the \nperformance on a validation set). \n\nStopped training, on the other hand, has a fundamentally different character. Although the choice of framework remains the same, the essential innovation consists \nof considering every parametrization of a given architecture as a potential model. \nThis contrasts with classical methods in which only those parametrizations corresponding to minima of the objective function are taken into consideration for the \nmodel pool. \n\nUnder the weight of accumulated empirical evidence (see [W,R,H,90], [H,F,Z,92]) \ntheorists have begun to investigate the properties of this technique and have been \nable to show that stopped training has the same sort of regularization effect (i.e. \nreduction of model variance at the cost of bias) that penalty terms provide (see \n[B,C,91], [F,91]). 
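The stopped-training search can be sketched as follows. This is a minimal illustration, not the authors' exact procedure; `train_epoch` and `val_error` are hypothetical stand-ins for one pass of gradient training and the validation-set error, and the stopping rule (repeated increase of the validation error) follows the definition used later in the paper:

```python
import numpy as np

def stopped_training(weights, train_epoch, val_error, max_epochs=100, patience=2):
    """Keep the best parametrization seen so far; stop after `patience`
    consecutive increases of the validation error (overtraining)."""
    best_w, best_err = weights.copy(), val_error(weights)
    prev_err, rises = best_err, 0
    for _ in range(max_epochs):
        weights = train_epoch(weights)          # one pass of gradient training
        err = val_error(weights)
        if err < best_err:                      # every parametrization is a candidate model
            best_w, best_err = weights.copy(), err
        rises = rises + 1 if err > prev_err else 0
        prev_err = err
        if rises >= patience:                   # repeated increase -> stop training
            break
    return best_w, best_err

# Toy usage: training moves the weight past the validation optimum at w = 1.
best_w, best_err = stopped_training(np.array([0.0]),
                                    lambda w: w + 0.5,
                                    lambda w: float((w[0] - 1.0) ** 2))
```

The point of the sketch is that the model returned is the best parametrization visited, not the converged one.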
Since the basic effect of pruning procedures is also to reduce \nnetwork complexity (and consequent model variance), one sees that there is a close \nrelationship in the instrumental effects of stopped training, pruning and regularization. The question remains whether (or under what circumstances) any one of these \nmethods, or some combination of them, produces superior results. \n\n3 THE METHODS TESTED \n\nIn our experiments a single hidden layer feedforward network with tanh activation \nfunctions and ten hidden units was used to fit data sets generated in such a manner \nthat network complexity had to be reduced or constrained to prevent overfitting. A \nvariety of both classical and nonconvergent methods were tested for this purpose. \nThe first we will discuss used weight pruning. To characterize the relevance of a \nweight in a given network, three different test variables were used. The first simply \nmeasures weight size, under the assumption that the training process naturally forces \nnonrelevant weights into a region around zero. The second test variable is that used \nin the Optimal Brain Damage (OBD) pruning procedure of Le Cun et al. (see \n[L,D,S,90]). The final test variables considered are those proposed by Finnoff and \nZimmermann in [F,Z,91], based on significance tests for deviations from zero in the \nweight update process. \n\nTwo pruning algorithms were used in the experiments, both of which attempt to \nemulate successful interactive methods. In the first algorithm, one removes a certain fixed percentage of weights in the network after a stopping criterion is reached. \nThe reduced network is then trained further until the stopping criterion is once \nagain fulfilled. This process is then repeated until performance breaks down completely. This method will be referred to in the following as auto-pruning and was \nimplemented using all three types of test variables to determine the weights to be \nremoved. 
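The test variables and the fixed-percentage pruning step can be sketched as below; a minimal sketch, not the exact procedures. The OBD saliency uses the diagonal second-derivative form of Le Cun et al.; `fz_test_variable` is only an assumed t-statistic-style stand-in for the significance tests of [F,Z,91]:

```python
import numpy as np

def obd_saliency(w, h_diag):
    """OBD saliency (Le Cun et al.): s_i = h_ii * w_i^2 / 2,
    with h_diag the diagonal of the Hessian of the error."""
    return 0.5 * h_diag * w ** 2

def fz_test_variable(updates):
    """Assumed form of a significance-test variable: mean weight update
    over an epoch relative to its standard error (rows = update steps)."""
    m, s = updates.mean(axis=0), updates.std(axis=0)
    return np.abs(m) / (s / np.sqrt(len(updates)) + 1e-12)

def auto_prune(w, test_values, frac=0.1):
    """Zero out the `frac` fraction of weights with the smallest
    test values, as in one auto-pruning step."""
    k = int(len(w) * frac)
    idx = np.argsort(test_values)[:k]
    w = w.copy()
    w[idx] = 0.0
    return w

# Usage: prune the half of the weights with the smallest magnitudes.
w = np.array([0.1, -2.0, 0.5, 3.0])
pruned = auto_prune(w, np.abs(w), frac=0.5)
```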
The only difference lay in the stopping criterion used. In the case of \nthe OBD test variables, training was stopped after the training process converged. \nIn the case of the statistical and small weight test variables, training was stopped \nwhenever overtraining (defined by a repeated increase in the error on a validation \nset) was observed. A final (restart) variant of auto-pruning using the statistical \ntest variables was also tested. This version of auto-pruning only differs in that the \nweights are reinitialized (on the reduced topology) after every pruning step. In \nthe tables of results presented in the appendix, the results for auto-pruning using \nthe statistical (resp. small weight, resp. OBD) test variables will be denoted by P* \n(resp. G*, resp. O*). The version of auto-pruning using restarts will be denoted by \np*. \n\nThe second method uses the statistical test variables to both remove and reactivate \nweights. As in auto-pruning, the network is trained until overfitting is observed after \na fixed number of epochs, then test values are calculated for all active and inactive \nweights. Here a fixed number ε > 0 is given, corresponding to some quantile value \nof a probability distribution. If the test variable for an active weight falls below \nε, the weight is pruned (deactivated). For weights that have already been set to \nzero, the values of the test variables are compared with ε, and if larger, the weight is \nreactivated with a small random value. Furthermore, the value of ε is increased by \nsome Δε > 0 after each pruning step until some value ε_max is reached. This method \nis referred to as epsi-pruning. Epsi-pruning was tested in versions both with (e*) \nand without restarts (E*). \n\nTwo complexity penalty terms were considered. 
These consist of a further term \nC_λ(w) added to the error function which forces the network to achieve a compromise between fit and network complexity during the training process; here, the \nparameter λ ∈ [0, ∞) controls the strength of the complexity penalty. The first is \nthe quadratic term, the first derivative of which leads to the so-called weight decay \nterm in the weight updates (see [H,P,89]). The second is the Weigend/Rumelhart \npenalty term (see [W,R,H,91]). The weight decay penalty term was tested using \ntwo techniques. In the first of these, (D*), λ was held constant throughout the \ntraining process. In the second, (d*), λ was set to zero until overtraining was observed, then turned on and held constant for the remainder of the training process. \nThe Weigend/Rumelhart penalty term was also tested using these two methods \n(denoted in the following tables by W*, resp. w*). Further, the algorithm suggested \nby A. Weigend in [W,R,H,91] in which the value of λ is varied during training was \nconsidered (wF). \n\nIn addition to the pruning and penalty term methods investigated, two (simple) \nversions of stopped training were tested, in one case (nN) with a constant learning \nstep throughout, and in the other (nF) with the step size reduced after overtraining \nwas observed. Finally, three benchmarks were included. All these involved training \na network until convergence, to emulate the situation when no precautions are taken \nto prevent overfitting other than varying the number of hidden units. The number \nof hidden units in these benchmark tests was set at three, six and ten (3#, 6#, \n##), this last network having the same topology as that used in the remaining \ntests. \n\n4 THE DATA GENERATION PROCESSES \n\nTo test the methods under consideration, a number of processes were used to \ngenerate data sets. 
By testing on a sufficiently wide range of controlled examples one hopes to reduce the domain dependence that might arise in the performance comparisons. The data used in our experiments was based on pairs (y_i, x_i), \ni = 1, ..., T, T ∈ N, with targets y_i ∈ R and inputs x_i = (x_i^1, ..., x_i^K) ∈ [-1,1]^K, \nwhere y_i = g(x_i^1, ..., x_i^j) + u_i, for j, K ∈ N. Here, g represents the structure in the \ndata, x^1, ..., x^j the relevant inputs, x^(j+1), ..., x^K the irrelevant or decoy inputs and \nu_i a stochastic disturbance term. \n\nThe first group of experiments was based on an additive structure g having the \nfollowing form with j = 5 and K = 10: g(x_i^1, ..., x_i^5) = Σ_{k=1}^5 f(a_k x_i^k), a_k ∈ R, \nand f either the identity on R or sin. The second class of models investigated had \na highly nonlinear product structure g with j = 3, K = 10 and g(x_i^1, ..., x_i^3) = \nΠ_{k=1}^3 f(a_k x_i^k), a_k ∈ R, and f once again either the identity on R or sin. The \nnext structure considered was constructed using sums of Radial Basis Functions \n(RBFs) as follows: g(x_i^1, ..., x_i^5) = Σ_{l=1}^8 exp(-Σ_{k=1}^5 (a^(k,l) - x_i^k)^2 / (2σ^2)), \nwith a^(k,l) ∈ R for k = 1, ..., 5, l = 1, ..., 8. Here, for every l = 1, ..., 8 the vector parameter \n(a^(1,l), ..., a^(5,l)) corresponds to the center of the RBF. The final group of experiments \nwas conducted using data generated by feedforward network activation functions. \nThe network used for this task had fifty input units, two hundred hidden units and \none output. In every experiment, the data was divided into three disjoint subsets \nD_t, D_v, D_g: the first set D_t was used for training, the second (validation) \nset D_v to test for overfitting and to steer the pruning algorithms and the third \n(generalization) set D_g to test the quality of the model selection process. 
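The additive data generating process can be sketched as follows; `make_additive_data` is a hypothetical helper, and the uniform distributions for the coefficients a_k and the inputs are our assumptions, since the paper does not state them:

```python
import numpy as np

def make_additive_data(T, j=5, K=10, noise=0.3, f=np.sin, seed=0):
    """Generate pairs (x_i, y_i) with y = sum_{k<=j} f(a_k * x^k) + u,
    inputs drawn from [-1, 1]^K; inputs j+1, ..., K are pure decoys."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(-1.0, 1.0, size=j)          # assumed coefficient distribution
    x = rng.uniform(-1.0, 1.0, size=(T, K))     # relevant plus decoy inputs
    y = np.array([f(a * xi[:j]).sum() for xi in x])
    y += noise * rng.standard_normal(T)         # stochastic disturbance u_i
    return x, y

# Usage: a noiseless sample of T = 50 pairs.
x, y = make_additive_data(50, noise=0.0)
```

The product and RBF structures differ only in the body of g; the split into D_t, D_v, D_g would then be taken from such a sample.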
\n\n5 DISCUSSION \n\nThe results of the experiments are given ~elow. Here we give a short review of the \nmost interesting phanomena observed. \n\nNotable in a general sense is a striking domain dependence in the performance, \nwhich illustrates the danger of basing a comparison of methods on tests using a \nsingle (particularly small) data set. Another valuable observation is that by testing \nat higher levels of significance, apparent performance differences can dwindle or even \ndisappear. Finally, one sees that even in the examples without noise that overfitting \noccurs, which contradicts the frequently stated conviction that overfitting is noise \nfitting. \n\nWith regard to specific methods, one sees that all the methods tested significantly \nimproved generalization performance when compared to the benchmarks. Further, \nthe results show that the extended non convergent methods are on average superior \n(sometimes dramatically so) than their classical counterparts. In particular, the \nperformance of penalty terms is greatly enhanced if they are first introduced in the \ntraining process after overtraining is observed. Further, dynamic pruning using the \nstatistical or even the small weight test variables produces significantly better results \nthan stopped training alone or using the Optimal Brain Damage (OBD) weight \nelimination method which requires training to minima of the objective function. A \nfinal notable observation is that the pruning methods (especially those using resarts) \ngenerally work better in the examples with a great deal of noise, while the penalty \nterm methods are superior when the structure is highly nonlinear. \n\n6 TABLES OF RESULTS \n\nThe experiments were performed as follows: First, each data generating process \nwas used to produce six independent sets of data and initial weights to increase the \nstatistical significance of observed effects and to help reduce the effects of any data \nset specific anomalies. 
In a second step, the parameters of the training processes \nwere optimized for each example by extensive testing, then a fixed value for each \nparameter was chosen for use across the entire range of experiments. With these \nparameters, each method was tested on all of the six data sets produced by one data \ngenerating process. Both the penalty terms and the pruning methods were tested \nwith different settings of the relevant parameters in each model. The parameter \nvalues used in the simulations and an overview of the methods tested are collected \nin the following two tables. \n\n6.1 \n\nParameter Settings of the Experiments \n\n[Table: for each data generating process (exp_0_n, exp_3_n, exp_6_n, id_7_n, id_8_n, id_9_n, n_0_id, n_1_id, n_2_id, n_0_sin, n_1_sin, n_2_sin, net_0_n, net_3_n, net_6_n, sin_0_n, sin_3_n, sin_6_n): the sizes of the training/validation/generalization sets D_t/D_v/D_g (400/200/1000, 200/100/1000 or 1400/600/1000), the learning step before/after overfitting (0.05/0.005 or 0.05/0.01) and the noise level (0.0 to 0.9).] \n\n6.2 Overview of Methods Tested \n\nThe following tables give categorical rankings of the results. The rankings were \ncalculated as follows: The method with the best performance was given ranking \n1, then the performance of each following method was compared with that of the \nmethod in the first position using a modified t-test statistic. 
The first method in \nthe list whose test results deviated from those of the method in the first position by at least the \nquantile value of the statistic given at the head of the table was then used to start \nthe second category. All those whose test results did not deviate by at least this \namount were given the same ranking as the leading method of the category (in this \ncase 1). Subsequent categories were then formed in an analogous fashion, with test \nresults measured against the performance of the leading method at the head of the \ncategory. \n\nThe results are presented in two tables. The first contains the results for the data \ngenerating processes without noise and the second for the models with noise. The \ncategorical rankings given were determined using the procedure described above at \na 0.9 level of significance. The ordering of the methods, listed in the first \ncolumn, is based on the average ranking over all the simulations listed in the table. \nThis average is given in the second column. 
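The category-forming procedure can be sketched as follows. This is a minimal sketch: the paper's 'modified t-test statistic' is not specified in detail, so a plain two-sample t statistic stands in, and the function and variable names are ours:

```python
import numpy as np

def t_stat(a, b):
    """Two-sample t statistic for the increase in mean error of b over a
    (stand-in for the paper's modified t-test statistic)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (b.mean() - a.mean()) / (se + 1e-12)

def categorical_ranking(results, t_quantile):
    """`results`: (method, errors) pairs, already ordered best first.
    A method opens a new category when its errors deviate from those of
    the current category leader by at least `t_quantile`; otherwise it
    shares the leader's ranking."""
    ranks, leader, rank = {}, results[0][1], 1
    ranks[results[0][0]] = rank
    for name, errs in results[1:]:
        if t_stat(leader, errs) >= t_quantile:
            rank += 1          # significant deviation: start a new category
            leader = errs      # this method leads the new category
        ranks[name] = rank
    return ranks

# Usage: B is statistically indistinguishable from A; C is clearly worse.
ranks = categorical_ranking([("A", [1.0, 1.1, 0.9]),
                             ("B", [1.05, 1.0, 1.0]),
                             ("C", [5.0, 5.1, 4.9])], t_quantile=2.0)
```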
\n\n6.2.1 Data Generating Processes without Noise \n\nClassification by objective function, ta = 0.9 \n\nav \nmethod \nd* \n1.6 \n1.8 \nP'\" \n2.0 \nw'\" \nwf' \n2.0 \nG* \n2.2 \n2.6 \nt;'\" \n2.6 \n0'\" \np* \n2.6 \nnF \n3.0 \n3.8 \ne'\" \nnN \n3.8 \n## 5.2 \nW* \n5.6 \nD* \n5.8 \n6.2 \n6# \n6.4 \n3# \n\nexp_O_n \n1 \n2 \n1 \n2 \n2 \n2 \n3 \n4 \n3 \n4 \n4 \n5 \n8 \n8 \n6 \n7 \n\nn_O.Jd \n3 \n2 \n2 \n1 \n3 \n5 \n4 \n5 \n5 \n6 \n7 \n8 \n10 \n11 \n9 \n12 \n\nn_O_sm \n1 \n2 \n3 \n3 \n3 \n3 \n2 \n2 \n3 \n4 \n4 \n4 \n7 \n7 \n6 \n5 \n\nneLO_n \n2 \n1 \n3 \n2 \n1 \n1 \n2 \n1 \n2 \n3 \n1 \n3 \n1 \n2 \n5 \n4 \n\nslD_O_n \n1 \n2 \n1 \n2 \n2 \n2 \n2 \n1 \n2 \n2 \n3 \n6 \n2 \n1 \n5 \n4 \n\n\fExtended Regularization Methods for Nonconvergent Model Selection \n\n235 \n\n6.2.2 Data Generating Processes with Noise \nClassification by objective function, ta = 0.9 \n\nmethod \n\nav \n\n2.1 \nP'\" \n2.2 \nd'\" \n2.2 \np'\" \n2.2 \nwI\" \n2.2 \ne'\" \n2.6 \n~'\" \n2.7 \n<1'\" \n2.8 \n0'\" \n2.8 \nw'\" \n2.9 \nnF \n3.5 \nnN \n3.7 \nll'\" \n4.1 \nW'\" \n## 5.2 \n5.3 \n3# \n6# \n5.4 \n\nexp \n_3_n \n3 \n5 \n2 \n4 \n1 \n3 \n4 \n5 \n5 \n5 \n5 \n5 \n5 \n6 \n7 \n8 \n\nexp \n_6_n \n1 \n5 \n1 \n5 \n2 \n3 \n4 \n5 \n5 \n5 \n5 \n4 \n4 \n6 \n7 \n8 \n\nid \n_9_n \n4 \n3 \n1 \n3 \n1 \n2 \n3 \n3 \n3 \n3 \n4 \n5 \n5 \n6 \n7 \n8 \n\nn_l \nJd \n2 \n1 \n3 \n1 \n4 \n4 \n3 \n3 \n3 \n3 \n4 \n1 \n5 \n6 \n7 \n5 \n\nn_l \n..sin \n2 \n1 \n5 \n4 \n5 \n4 \n3 \n5 \n5 \n5 \n5 \n5 \n6 \n5 \n5 \n4 \n\nn_2 \nJd \n1 \n2 \n3 \n1 \n3 \n3 \n3 \n4 \n3 \n3 \n3 \n7 \n7 \n5 \n'l \n6 \n\nn_2 \n-Bin \n3 \n1 \n2 \n2 \n4 \n3 \n3 \n1 \n3 \n3 \n3 \n5 \n5 \n5 \n6 \n3 \n\nnet \n_3_n \n3 \n1 \n1 \n2 \n2 \n2 \n2 \n1 \n1 \n1 \n3 \n2 \n2 \n5 \n4 \n4 \n\nnet \n_6_n \n1 \n1 \n2 \n1 \n2 \n1 \n1 \n1 \n1 \n1 \n1 \n2 \n2 \n4 \n3 \n4 \n\nsm \n_3_n \n2 \n2 \n1 \n2 \n1 \n2 \n2 \n2 \n2 \n2 \n3 \n1 \n1 \n5 \n4 \n5 \n\nsm \n_6_n \n2 \n2 \n1 \n1 \n1 \n2 \n2 \n1 \n2 \n2 \n3 \n1 \n1 \n5 \n4 \n5 \n\n7 REFERENCES \n\n[B,C,91] Baldi, P. 
and Chauvin, Y., Temporal evolution of generalization during \nlearning in linear networks, Neural Computation 3, 1991, pp. 589-603. \n\n[F,91] Finnoff, W., Complexity measures for classes of neural networks with variable \nweight bounds, in Proc. Int. Joint Conf. on Neural Networks, Singapore, \n1991. \n\n[F,Z,91] Finnoff, W. and Zimmermann, H.G., Detecting structure in small datasets by \nnetwork fitting under complexity constraints, to appear in Proc. of 2nd Ann. \nWorkshop on Computational Learning Theory and Natural Learning Systems, \nBerkeley, 1991. \n\n[H,P,89] Hanson, S. J. and Pratt, L. Y., Comparing biases for minimal network \nconstruction with back-propagation, in Advances in Neural Information Processing I, D. S. Touretzky, Ed., Morgan Kaufmann, 1989. \n\n[H,F,Z,92] Hergert, F., Finnoff, W. and Zimmermann, H.G., A comparison of weight \nelimination methods for reducing complexity in neural networks, to be presented at Int. Joint Conf. on Neural Networks, Baltimore, 1992. \n\n[L,D,S,90] Le Cun, Y., Denker, J. and Solla, S., Optimal Brain Damage, in Proceedings of Neural Information Processing Systems II, Denver, 1990. \n\n[W,R,H,91] Weigend, A., Rumelhart, D. and Huberman, B., Generalization by \nweight elimination with application to forecasting, in Advances in Neural Information Processing III, R. P. Lippmann and J. Moody, Eds., Morgan Kaufmann, \n1991. \n", "award": [], "sourceid": 643, "authors": [{"given_name": "W.", "family_name": "Finnoff", "institution": null}, {"given_name": "F.", "family_name": "Hergert", "institution": null}, {"given_name": "H. G.", "family_name": "Zimmermann", "institution": null}]}