{"title": "Second-order Learning Algorithm with Squared Penalty Term", "book": "Advances in Neural Information Processing Systems", "page_first": 627, "page_last": 633, "abstract": null, "full_text": "Second-order Learning Algorithm with \n\nSquared Penalty Term \n\nRyohei Nakano \nKazumi Saito \nNTT Communication Science Laboratories \n\n2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02 Japan \n\n{saito,nakano }@cslab.kecl.ntt.jp \n\nAbstract \n\nThis paper compares three penalty terms with respect to the effi(cid:173)\nciency of supervised learning, by using first- and second-order learn(cid:173)\ning algorithms. Our experiments showed that for a reasonably ade(cid:173)\nquate penalty factor, the combination of the squared penalty term \nand the second-order learning algorithm drastically improves the \nconvergence performance more than 20 times over the other com(cid:173)\nbinations, at the same time bringing about a better generalization \nperformance. \n\n1 \n\nINTRODUCTION \n\nIt has been found empirically that adding some penalty term to an objective func(cid:173)\ntion in the learning of neural networks can lead to significant improvements in \nnetwork generalization. Such terms have been proposed on the basis of several \nviewpoints such as weight-decay (Hinton, 1987), regularization (Poggio & Girosi, \n1990), function-smoothing (Bishop, 1995), weight-pruning (Hanson & Pratt, 1989; \nIshikawa, 1990), and Bayesian priors (MacKay, 1992; Williams, 1995). Some are \ncalculated by using simple arithmetic operations, while others utilize higher-order \nderivatives. The most important evaluation criterion for these terms is how the gen(cid:173)\neralization performance improves, but the learning efficiency is also an important \ncriterion in large-scale practical problems; i.e., computationally demanding terms \nare hardly applicable to such problems. 
Here, it is naturally conceivable that the effects of penalty terms depend on the learning algorithm; thus, we need comparative evaluations.

This paper evaluates the efficiency of first- and second-order learning algorithms with three penalty terms. Section 2 explains the framework of the present learning and shows a second-order algorithm with the penalty terms. Section 3 shows experimental results for a regression problem, a graphical evaluation, and a penalty factor determination using cross-validation.

2 LEARNING WITH PENALTY TERM

2.1 Framework

Let {(x_1, y_1), ..., (x_m, y_m)} be a set of examples, where x_t denotes an n-dimensional input vector and y_t a target value corresponding to x_t. In a three-layer neural network, let h be the number of hidden units, w_j (j = 1, ..., h) be the weight vector between all the input units and hidden unit j, and w_0 = (w_{00}, ..., w_{0h})^T be the weight vector between all the hidden units and the output unit; w_{j0} means a bias term and x_{t0} is set to 1. Note that a^T denotes the transposed vector of a. Hereafter, a vector consisting of all parameters, (w_0^T, ..., w_h^T)^T, is simply expressed as \Phi = (\phi_1, ..., \phi_N)^T.

For the normalized penalty term, the curvature along a search direction \Delta\Phi is

\Delta\Phi^T \nabla^2 \Omega(\Phi) \Delta\Phi = \sum_{k=1}^{N} (1 - 3\phi_k^2) \Delta\phi_k^2 / (1 + \phi_k^2)^3.   (5)

Note that, in the step-length calculation, \Delta\Phi^T \nabla^2 F_\mu(\Phi) \Delta\Phi is basically assumed to be positive. The three terms have a different effect on it, i.e., the squared penalty term always adds a non-negative value; the absolute penalty term has no effect; the normalized penalty term may add a negative value if many weight values are larger than \sqrt{1/3}. This indicates that the squared penalty term has a desirable feature. Incidentally, we could employ other second-order learning algorithms such as SCG (Møller, 1993) or OSS (Battiti, 1992), but BPQ worked the most efficiently among them in our own experience (Saito & Nakano, 1997).
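The sign behavior of the three penalty terms in the step-length calculation can be checked numerically. Below is a minimal sketch of our own; the function name, the test values, and the 1/2 scaling of each penalty term (which makes the normalized term's second derivative exactly (1 - 3φ_k²)/(1 + φ_k²)³) are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def curvature_term(phi, dphi, kind):
    """Return dPhi^T nabla^2 Omega(Phi) dPhi for one penalty term.

    Assumed scalings (ours): squared Omega = (1/2) sum phi^2,
    absolute Omega = sum |phi|, normalized Omega = (1/2) sum phi^2/(1+phi^2).
    """
    if kind == "squared":
        # Hessian diagonal is the constant 1: always a non-negative contribution.
        diag = np.ones_like(phi)
    elif kind == "absolute":
        # |phi| has zero second derivative almost everywhere: no contribution.
        diag = np.zeros_like(phi)
    elif kind == "normalized":
        # Negative wherever phi_k^2 > 1/3, so the sum can become negative.
        diag = (1.0 - 3.0 * phi**2) / (1.0 + phi**2) ** 3
    else:
        raise ValueError(kind)
    return float(np.sum(diag * dphi**2))

phi = np.array([2.0, -1.5, 1.0])   # all weights larger in magnitude than sqrt(1/3)
dphi = np.array([0.5, 0.5, 0.5])
print(curvature_term(phi, dphi, "squared"))     # non-negative
print(curvature_term(phi, dphi, "absolute"))    # zero
print(curvature_term(phi, dphi, "normalized"))  # negative here
```

With these weight values the normalized term subtracts curvature, which is exactly the situation that can break the positivity assumption in the step-length calculation.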
3 EVALUATION BY EXPERIMENTS

3.1 Regression Problem

By using a regression problem for the function y = (1 - x + 2x^2) e^{-0.5x^2}, the learning performance of adding a penalty term was evaluated. In the experiment, a value of x was randomly generated in the range [-4, 4], and the corresponding value of y was calculated from x; each value of y was corrupted by adding Gaussian noise with a mean of 0 and a standard deviation of 0.2. The total number of training examples was set to 30. The number of hidden units was set to 5, where the initial values for the weights between the input and hidden units were independently generated according to a normal distribution with a mean of 0 and a standard deviation of 1; the initial values for the weights between the hidden and output units were set to 0, but the bias value at the output unit was initially set to the average output value of all training examples. The iteration was terminated when the gradient vector was sufficiently small (i.e., ||\nabla F_\mu(\Phi)||^2 / N < 10^{-12}) or the total processing time exceeded 100 seconds. The penalty factor \mu was decreased to 2^{-19} by successively multiplying by 2^{-1}; trials were performed 20 times for each penalty factor.

Figure 1 shows the training examples, the true function, and a function obtained after learning without a penalty term. We can see that such learning over-fitted the training examples to some degree.

3.2 Evaluation using Second-order Algorithm

By using BPQ, an evaluation was made after adding each penalty term. Figure 2(a) compares the generalization performance, which was evaluated by using the average RMSE (root mean squared error) for a set of 5,000 test examples. The best possible RMSE level is 0.2 because each test example includes the same amount of Gaussian noise given to each training example.
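The experimental setup above, and the claim that 0.2 is the best achievable RMSE, can be sketched as follows. This is our own reconstruction; the random seed and all variable names are ours, chosen only for reproducibility of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an assumption, for reproducibility

def true_func(x):
    # Target function from the regression problem: y = (1 - x + 2x^2) exp(-0.5 x^2)
    return (1.0 - x + 2.0 * x**2) * np.exp(-0.5 * x**2)

# 30 training examples on [-4, 4], corrupted by N(0, 0.2^2) noise.
m = 30
x_train = rng.uniform(-4.0, 4.0, size=m)
y_train = true_func(x_train) + rng.normal(0.0, 0.2, size=m)

# 5,000 test examples with the same noise level. Even a perfect model
# (predicting true_func exactly) cannot beat the noise floor:
x_test = rng.uniform(-4.0, 4.0, size=5000)
y_test = true_func(x_test) + rng.normal(0.0, 0.2, size=5000)
rmse_floor = np.sqrt(np.mean((y_test - true_func(x_test)) ** 2))
print(rmse_floor)  # close to 0.2, the noise standard deviation
```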
Figure 1: Learning problem (training examples, true function, and learning result)

Figure 2: Comparison using second-order algorithm BPQ; (a) generalization performance (average RMSE vs. \mu), (b) CPU time until convergence (vs. \mu)

For each penalty term, the generalization performance was improved when \mu was set adequately, but the normalized penalty term was the most unstable among the three, because it frequently got stuck in undesirable local minima. Figure 2(b) compares the processing time until convergence. In comparison to the learning without a penalty term, the squared penalty term drastically decreased the processing time, especially when \mu was large, while the absolute penalty term did not converge when \mu was large; the normalized penalty term generally required a larger processing time. Thus, only the squared penalty term improved the convergence performance more than 2 ~ 100 times, keeping a better generalization performance for an adequate penalty factor.

3.3 Evaluation using First-order Algorithm

By using BP, a similar evaluation was made after adding each penalty term. Here, we adopted Silva and Almeida's learning rate adaptation rule (Silva & Almeida, 1990), i.e., learning rate \eta_k for each weight
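Silva and Almeida's rule maintains a separate learning rate per weight: the rate is increased while the corresponding gradient component keeps its sign across iterations, and decreased when the sign flips (a symptom of oscillation). A minimal sketch of that idea follows; the increase/decrease factors u and d are typical choices of ours, not values from the paper.

```python
import numpy as np

def adapt_rates(eta, grad, prev_grad, u=1.2, d=0.8):
    """Per-weight learning-rate adaptation in the style of Silva & Almeida (1990).

    Grow eta_k when the k-th gradient component keeps its sign,
    shrink it when the sign flips; leave it unchanged when the
    product is zero.
    """
    prod = grad * prev_grad
    return np.where(prod > 0, eta * u, np.where(prod < 0, eta * d, eta))

eta = np.full(3, 0.1)
prev_grad = np.array([1.0, -2.0, 0.5])
grad = np.array([0.8, 2.0, 0.0])
# Index 0 keeps its sign (rate grows), index 1 flips (rate shrinks),
# index 2 has a zero product (rate unchanged).
print(adapt_rates(eta, grad, prev_grad))
```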