\langle E(w^*) \rangle = \left( 1 + \frac{2(F(w^*) + 1)}{NL} \right) \langle E_{emp}(w^*) \rangle + o(1/N),   (3)

where \langle \cdot \rangle denotes the average over sets of training samples, o(1/N) is a small term satisfying N o(1/N) \to 0 as N \to \infty, and F(w^*), N, and L are respectively the numbers of effective parameters of w^*, training samples, and output units.

Although the average \langle \cdot \rangle cannot be calculated in actual applications, the model with the minimum prediction error can be found by choosing the model that minimizes the Akaike information criterion (AIC) [1],

J(w^*) = \left( 1 + \frac{2(F(w^*) + 1)}{NL} \right) E_{emp}(w^*).   (4)

This method was generalized to arbitrary distances [2]. The Bayes information criterion (BIC) [3] and the minimum description length (MDL) [4] were proposed to overcome the inconsistency problem of AIC, namely that the true model is not always chosen even when N \to \infty.

The above information criteria have been applied to the neural network model selection problem, where the maximum likelihood estimator w^* is calculated for each model, and the information criteria are then compared. Nevertheless, a practical problem arises from the fact that we cannot always find the maximum likelihood estimator for each model, and even when we can, it takes a long time to compute.

In order to improve such model selection procedures, this paper proposes a practical learning algorithm by which the optimal model and parameter can be found simultaneously. Let us consider a modified information criterion,

J_\alpha(w) = \left( 1 + \frac{2(F_\alpha(w) + 1)}{NL} \right) E_{emp}(w),   (5)

where \alpha > 0 is a parameter and F_\alpha(w) is a C^1-class function which converges to F(w) as \alpha \to 0. We minimize J_\alpha(w) while controlling \alpha so that \alpha \to 0. To show the effectiveness of this method, we present experimental results and discuss the theoretical background.
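As an illustration of how a criterion of the form of eq. (4) trades empirical error against model size, the following sketch compares candidate models by their criterion values. The model names, error values, parameter counts, and sample sizes are invented for the example, not taken from this paper.

```python
def aic_criterion(E_emp, F, N, L):
    """AIC-style criterion of eq. (4): J = (1 + 2(F + 1)/(N L)) * E_emp,
    where F is the number of effective parameters, N the number of
    training samples, and L the number of output units."""
    return (1.0 + 2.0 * (F + 1) / (N * L)) * E_emp

# Hypothetical candidates: model name -> (empirical error, parameter count).
# A larger model fits the training data better but pays a complexity penalty.
candidates = {"small": (0.120, 10), "medium": (0.100, 25), "large": (0.099, 80)}
N, L = 1000, 1  # illustrative sample size and number of output units

scores = {name: aic_criterion(e, f, N, L) for name, (e, f) in candidates.items()}
best = min(scores, key=scores.get)  # the model minimizing J is selected
```

Here the "large" model has the smallest empirical error, yet the penalty term makes "medium" the selected model, which is exactly the trade-off the criterion is designed to make.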
2 A Modified Information Criterion

2.1 A Formal Information Criterion

Let us consider a conditional probability distribution,

P(w, \sigma; y|x) = \frac{1}{(2\pi\sigma^2)^{L/2}} \exp\left( -\frac{\|y - \varphi(w; x)\|^2}{2\sigma^2} \right),   (6)

An Optimization Method of Layered Neural Networks

where the function \varphi(w; x) = \{\varphi_i(w; x)\} is given by the three-layered perceptron,

\varphi_i(w; x) = \rho\left( w_{i0} + \sum_{j=1}^{H} w_{ij} \, \rho\left( w_{j0} + \sum_{k=1}^{K} w_{jk} x_k \right) \right),   (7)

and w = \{w_{i0}, w_{ij}\} is the set of biases and weights and \rho(\cdot) is a sigmoidal function. Let M_max be the fully-connected neural network model with K input units, H hidden units, and L output units, and let \mathcal{M} be the family of all models made from M_max by pruning weights or eliminating biases. When a set of training samples \{(x_i, y_i)\}_{i=1}^{N} is given, we define the empirical loss and the prediction loss by

L_{emp}(w, \sigma) = -\frac{1}{N} \sum_{i=1}^{N} \log P(w, \sigma; y_i|x_i),   (8)

L(w, \sigma) = -\int\!\!\int Q(x, y) \log P(w, \sigma; y|x) \, dx \, dy.   (9)

Minimizing L_{emp}(w, \sigma) is equivalent to minimizing E_{emp}(w), and minimizing L(w, \sigma) is equivalent to minimizing E(w). We assume that there exists a parameter (w_M, \sigma_M) which minimizes L_{emp}(w, \sigma) in each model M \in \mathcal{M}. By the theory of AIC, we have the following formula,

\langle L(w_M, \sigma_M) \rangle = \langle L_{emp}(w_M, \sigma_M) \rangle + \frac{F(w_M) + 1}{N} + o(1/N).   (10)

Based on this property, let us define a formal information criterion I(M) for a model M by

I(M) = 2N L_{emp}(w_M, \sigma_M) + A (F_0(w_M) + 1),   (11)

where A is a constant and F_0(w) is the number of nonzero parameters in w,

F_0(w) = \sum_{i=1}^{L} \sum_{j=0}^{H} f_0(w_{ij}) + \sum_{j=1}^{H} \sum_{k=0}^{K} f_0(w_{jk}),   (12)

where f_0(x) is 0 if x = 0, and 1 otherwise. I(M) is formally equal to AIC if A = 2, or to MDL if A = \log N. Note that F(w) \le F_0(w) for arbitrary w, and that F(w_M) = F_0(w_M) if and only if the Fisher information matrix of the model M is positive definite.
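A minimal sketch of the formal criterion of eqs. (11)-(12): counting nonzero parameters with f_0 and comparing the AIC (A = 2) and MDL (A = log N) settings. The weight matrices, sample size, and loss value below are invented for illustration.

```python
import numpy as np

def F0(weight_arrays):
    """F_0(w) of eq. (12): the number of nonzero weights and biases,
    summed over all layers (f_0(x) = 0 if x == 0, else 1)."""
    return sum(int(np.count_nonzero(W)) for W in weight_arrays)

def formal_criterion(L_emp, weight_arrays, N, A):
    """I(M) = 2 N L_emp + A (F_0(w) + 1), eq. (11).
    A = 2 corresponds to AIC, A = log N to MDL."""
    return 2.0 * N * L_emp + A * (F0(weight_arrays) + 1)

# A pruned model: exact zeros stand for pruned weights / eliminated biases.
W_hidden = np.array([[0.5, 0.0, -1.2], [0.3, 0.8, 0.0]])  # hidden layer (with biases)
W_out = np.array([[1.0, -0.4, 0.2]])                      # output layer (with biases)
N, L_emp = 1000, 0.9                                      # illustrative values

I_aic = formal_criterion(L_emp, [W_hidden, W_out], N, A=2.0)
I_mdl = formal_criterion(L_emp, [W_hidden, W_out], N, A=np.log(N))
```

Since A = log N > 2 for N > e^2, the MDL setting penalizes each surviving parameter more heavily than AIC, which is why it tends to select smaller models.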
2.2 A Modified Information Criterion

In order to find the optimal model and parameter simultaneously, we define a modified information criterion. For \alpha > 0,

I_\alpha(w, \sigma) = 2N L_{emp}(w, \sigma) + A (F_\alpha(w) + 1),   (13)

F_\alpha(w) = \sum_{i=1}^{L} \sum_{j=0}^{H} f_\alpha(w_{ij}) + \sum_{j=1}^{H} \sum_{k=0}^{K} f_\alpha(w_{jk}),   (14)

where f_\alpha(x) satisfies the following two conditions.

Watanabe

(1) f_\alpha(x) \to f_0(x) as \alpha \to 0.
(2) If |x| \le |y|, then 0 \le f_\alpha(x) \le f_\alpha(y) \le 1.

For example, 1 - \exp(-x^2/\alpha^2) and 1 - 1/(1 + (x/\alpha)^2) satisfy these conditions. Based on these definitions, we have the following theorem.

Theorem

\min_{M \in \mathcal{M}} I(M) = \lim_{\alpha \to 0} \min_{w, \sigma} I_\alpha(w, \sigma).

This theorem shows that the optimal model and parameter can be found by minimizing I_\alpha(w, \sigma) while controlling \alpha so that \alpha \to 0 (the parameter \alpha plays the same role as the temperature in simulated annealing). Since the convergence F_\alpha(x) \to F_0(x) is not uniform, the theorem requires the second condition on f_\alpha(x). (For a proof of the theorem, see [5].)

If we choose a differentiable function for f_\alpha(w), then a local minimum can be found by the steepest descent method,

\frac{dw}{dt} = -\frac{\partial}{\partial w} I_\alpha(w, \sigma), \quad \frac{d\sigma}{dt} = -\frac{\partial}{\partial \sigma} I_\alpha(w, \sigma).   (15)

These equations result in a learning dynamics,

\Delta w = -\eta \sum_{i=1}^{N} \left\{ \frac{\partial}{\partial w} \|y_i - \varphi(w; x_i)\|^2 + \frac{A \hat{\sigma}^2}{N} \frac{\partial F_\alpha}{\partial w} \right\},   (16)

where \hat{\sigma}^2 = (1/NL) \sum_{i=1}^{N} \|y_i - \varphi(w; x_i)\|^2, and \alpha is slowly controlled so that \alpha \to 0. This dynamics can be understood as error backpropagation with an added penalty term.

3 Experimental Results

3.1 The true distribution is contained in the models

First, we consider the case where the true distribution is contained in the model family \mathcal{M}. Figure 1 (1) shows the true model from which the training samples were taken. One thousand input samples were taken from the uniform probability distribution on [-0.5, 0.5] x [-0.5, 0.5] x [-0.5, 0.5].
The output samples were calculated by the network in Figure 1 (1), and noise taken from a normal distribution with expectation 0 and variance 3.33 x 10^{-3} was added. Ten thousand testing samples were taken from the same distribution. We used f_\alpha(w) = 1 - \exp(-w^2/2\alpha^2) as a softener function, and the "annealing schedule" of \alpha was set as \alpha(n) = \alpha_0 (1 - n/n_{max}) + \epsilon, where n is the training cycle number, \alpha_0 = 3.0, n_{max} = 25000, and \epsilon = 0.01. Figure 1 (2) shows the fully-connected model M_max with 10 hidden units, which is the initial model. In the training, the learning speed \eta was set to 0.1.

We compared the empirical errors and the prediction errors for several values of A (Figure 1 (5), (6)). If A = 2, the criterion is AIC, and if A = \log N = 6.907, it is BIC or MDL. Figure 1 (3) and (4) show the models and parameters optimized under the criteria with A = 2 and A = 5. When A = 5, the true model could be found.

3.2 The true distribution is not contained

Second, let us consider a case where the true distribution is not contained in the model family. For the training samples and the testing samples, we used the same probability density as in the above case, except that the true function was given by eq. (17).

Figure 2 (1) and (2) show the training error and the prediction error, respectively. In this case, the best generalized model was found by AIC, shown in Figure 2 (3). In the optimized network, x1 and x2 were almost separated from x3, which means that the network could find the structure of the true model in eq. (17).

The practical application to ultrasonic image reconstruction is shown in Figure 3.
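The learning dynamics of eq. (16) with the annealing schedule above can be sketched as plain gradient descent on the squared error plus the penalty term A \hat{\sigma}^2 \partial F_\alpha/\partial w, using the softener f_\alpha(x) = 1 - \exp(-x^2/\alpha^2). To keep the sketch short, a linear model with one irrelevant input stands in for the layered network of eq. (7); the data, sizes, and constants are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def df_alpha(w, alpha):
    # Derivative of the softener f_alpha(x) = 1 - exp(-x^2 / alpha^2).
    return (2.0 * w / alpha**2) * np.exp(-((w / alpha) ** 2))

# Toy data: y depends only on the first input; the second is irrelevant,
# so its weight should stay near zero while the relevant weight is learned.
N, A = 200, 2.0
X = rng.normal(size=(N, 2))
y = 1.5 * X[:, 0] + 0.05 * rng.normal(size=N)

w = rng.normal(size=2)
eta, n_max, alpha0, eps = 0.01, 2000, 3.0, 0.01
for n in range(n_max):
    alpha = alpha0 * (1.0 - n / n_max) + eps   # annealing schedule, alpha -> eps
    resid = X @ w - y
    sigma2 = np.mean(resid**2)                 # hat{sigma}^2 as below eq. (16)
    # Error-backpropagation gradient plus the added penalty term of eq. (16),
    # here normalized per sample:
    grad = 2.0 * X.T @ resid / N + (A * sigma2 / N) * df_alpha(w, alpha)
    w -= eta * grad
```

In a full implementation the error-gradient term would come from backpropagation through the network of eq. (7), and the penalty derivative would be applied to every weight and bias counted by F_\alpha. Note that df_alpha vanishes for weights much larger than \alpha, so large, useful weights are not shrunk, unlike ordinary weight decay.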
4 Discussion

4.1 An information criterion and pruning weights

If P(w, \sigma; y|x) sufficiently approximates Q(y|x) and N is sufficiently large, we have

(18)

where Z_N = L_{emp}(\bar{w}_M) - L(\bar{w}_M) and \bar{w}_M is the parameter which minimizes L(w, \sigma) in the model M. Although \langle Z_N \rangle = 0, resulting in equation (10), the standard deviation of Z_N has the same order as 1/\sqrt{N}. However, if M_1 \subset M_2 or M_1 \supset M_2, then \bar{w}_{M_1} and \bar{w}_{M_2} are expected to be almost common, so this does not essentially affect the model selection problem [2].

The model family made by pruning weights or by eliminating biases is not a totally ordered set but a partially ordered set under the order "\subset". Therefore, if a model M \in \mathcal{M} is selected, it is the optimal model in the local model family \mathcal{M}' = \{M' \in \mathcal{M} ; M' \subset M or M' \supset M\}, but it may not be the optimal model in the global family \mathcal{M}. Artificial neural networks thus have the local minimum problem not only in the parameter space but also in the model family.

4.2 The degenerate Fisher information matrix

If the true probability is contained in the model and the number of hidden units is larger than necessary, then the Fisher information matrix is degenerate, and consequently the maximum likelihood estimator is not subject to the asymptotically normal distribution [6]. Therefore, the prediction error is not given by eq. (3), and AIC cannot be derived. However, with the proposed method, the selected model has a non-degenerate Fisher information matrix, because if it were degenerate, the modified information criterion would not be minimized.

Figure 1 (1): the true model, with additive output noise N(0, 3.33 x 10^{-3}). Figure 1 (2): the initial model for learning.
Figure 1 (3): the model optimized by AIC (A = 2), with E_{emp}(w^*) = 3.29 x 10^{-3} and E(w^*) = 3.39 x 10^{-3}. Figure 1 (4): the model optimized with A = 5, with E_{emp}(w^*) = 3.31 x 10^{-3} and E(w^*) = 3.37 x 10^{-3}. Figure 1 (5): the empirical error E_{emp}(w^*) against A for three initial conditions. Figure 1 (6): the prediction error E(w^*) against A for three initial conditions.

Figure 1: True distribution is contained in the models.

Figure 2 (1): the empirical error against A for three initial conditions. Figure 2 (2): the prediction error against A for three initial conditions. Figure 2 (3): the network optimized by AIC (A = 2), with inputs x1, x2, and x3; the empirical error was 3.31 x 10^{-3} and the prediction error was 3.41 x 10^{-3}.

Figure 2: True distribution is not contained in the models.

Figure 3 (1): an ultrasonic imaging system. Figure 3 (2): sample objects, with images for training and images for testing. Figure 3 (3): the neural network for image restoration, which maps a neighborhood of the 32 x 32 ultrasonic image through 15 hidden units to the restored image. Figure 3 (4): the original images and the images restored using LSM, AIC, and MDL.

Figure 3: Practical Application to Image Restoration

The proposed method was applied to ultrasonic image restoration. Figure 3 (1), (2), (3), and (4) respectively show an ultrasonic imaging system, the sample objects, a neural network for image restoration, and the original and restored images. The numbers of parameters optimized by LSM, AIC, and MDL were respectively 166, 138,
and 57. Rather noiseless images were obtained using the modified AIC or MDL. For example, the "tail of R" was clearly restored using AIC.

4.3 Relation to other generalization methods

In the neural information processing field, many methods have been proposed for preventing the over-fitting problem. One of the most famous is the weight decay method, in which we assume an a priori probability distribution on the parameter space and minimize

E_\lambda(w) = E_{emp}(w) + \lambda C(w),   (19)

where \lambda and C(w) are chosen by several heuristic methods [7]. The BIC is the information criterion for such a method [3], and the proposed method may be understood as a method for controlling \lambda and C(w).

5 Conclusion

An optimization method for layered neural networks was proposed based on the modified information criterion, and its effectiveness was discussed theoretically and experimentally.

Acknowledgements

The author would like to thank Prof. S. Amari, Prof. S. Yoshizawa, and Prof. K. Aihara of the University of Tokyo, and all members of the Amari seminar, for their active discussions about statistical methods in neural networks.

References

[1] H. Akaike. (1974) A New Look at the Statistical Model Identification. IEEE Trans. on Automatic Control, Vol. AC-19, No. 6, pp. 716-723.

[2] N. Murata, S. Yoshizawa, and S. Amari. (1992) Learning Curves, Model Selection and Complexity of Neural Networks. Advances in Neural Information Processing Systems 5, San Mateo, Morgan Kaufmann, pp. 607-614.

[3] G. Schwarz. (1978) Estimating the dimension of a model. Annals of Statistics, Vol. 6, pp. 461-464.

[4] J. Rissanen. (1984) Universal Coding, Information, Prediction, and Estimation. IEEE Trans. on Information Theory, Vol. 30, pp. 629-636.

[5] S. Watanabe. (1993) An Optimization Method of Artificial Neural Networks
based on a Modified Information Criterion. IEICE Technical Report, Vol. NC93-52, pp. 71-78.

[6] H. White. (1989) Learning in Artificial Neural Networks: A Statistical Perspective. Neural Computation, Vol. 1, pp. 425-464.

[7] A. S. Weigend, D. E. Rumelhart, and B. A. Huberman. (1991) Generalization of weight-elimination with application to forecasting. Advances in Neural Information Processing Systems, Vol. 3, pp. 875-882.