{"title": "Networks with Learned Unit Response Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 1048, "page_last": 1055, "abstract": null, "full_text": "Networks with Learned Unit Response Functions \n\nJohn Moody and Norman Yarvin \nYale Computer Science, 51 Prospect St. \nP.O. Box 2158 Yale Station, New Haven, CT 06520-2158 \n\nAbstract \n\nFeedforward networks composed of units which compute a sigmoidal function of a weighted sum of their inputs have been much investigated. We tested the approximation and estimation capabilities of networks using functions more complex than sigmoids. Three classes of functions were tested: polynomials, rational functions, and flexible Fourier series. Unlike sigmoids, these classes can fit non-monotonic functions. They were compared on three problems: prediction of Boston housing prices, the sunspot count, and robot arm inverse dynamics. The complex units attained clearly superior performance on the robot arm problem, which is a highly non-monotonic, pure approximation problem. On the noisy and only mildly nonlinear Boston housing and sunspot problems, differences among the complex units were revealed; polynomials did poorly, whereas rationals and flexible Fourier series were comparable to sigmoids. \n\n1 Introduction \n\nA commonly studied neural architecture is the feedforward network in which each unit of the network computes a nonlinear function g(x) of a weighted sum of its inputs, x = w^T u. Generally this function is a sigmoid, such as g(x) = tanh x or g(x) = 1/(1 + e^(x - θ)). To these we compared units of a substantially different type: they also compute a nonlinear function of a weighted sum of their inputs, but the unit response function is able to fit a much higher degree of nonlinearity than can a sigmoid. 
The nonlinearities we considered were polynomials, rational functions (ratios of polynomials), and flexible Fourier series (sums of cosines). Our comparisons were done in the context of two-layer networks consisting of one hidden layer of complex units and an output layer of a single linear unit. \n\nThis network architecture is similar to that built by projection pursuit regression (PPR) [1, 2], another technique for function approximation. The one difference is that in PPR the nonlinear function of the units of the hidden layer is a nonparametric smooth. This nonparametric smooth has two disadvantages for neural modeling: it has many parameters, and, as a smooth, it is easily trained only if desired output values are available for that particular unit. The latter property makes the use of smooths in multilayer networks inconvenient. If a parametrized function of a type suitable for one-dimensional function approximation is used instead of the nonparametric smooth, then these disadvantages do not apply. The functions we used are all suitable for one-dimensional function approximation. \n\n2 Representation \n\nA few details of the representation of the unit response functions are worth noting. \n\nPolynomials: Each polynomial unit computed the function \n\ng(x) = a_1 x + a_2 x^2 + ... + a_n x^n \n\nwith x = w^T u being the weighted sum of the input. A zeroth order term was not included in the above formula, since it would have been redundant among all the units. The zeroth order term was dealt with separately and only stored in one location. \n\nRationals: A rational function representation was adopted which could not have zeros in the denominator. This representation used a sum of squares of polynomials, as follows: \n\ng(x) = (a_0 + a_1 x + ... + a_n x^n) / (1 + (b_0 + b_1 x)^2 + (b_2 x + b_3 x^2)^2 + (b_4 x + b_5 x^2 + b_6 x^3 + b_7 x^4)^2 + ...) 
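Concretely, a rational unit of this form could be evaluated as below. This is our sketch, not the authors' code, and it assumes the denominator parameters are stored as one flat list in the block order shown above.

```python
def rational_unit(x, a, b):
    # Numerator: a_0 + a_1 x + ... + a_n x^n.
    num = sum(c * x ** k for k, c in enumerate(a))
    # Denominator: 1 plus squared polynomial blocks of the form
    # (b_0 + b_1 x)^2, (b_2 x + b_3 x^2)^2, (b_4 x + ... + b_7 x^4)^2, ...
    den = 1.0
    i, lo, width = 0, 0, 2  # only the first block has a constant term
    while i < len(b):
        p = sum(c * x ** (lo + j) for j, c in enumerate(b[i:i + width]))
        den += p * p        # squared, so the denominator never drops below 1
        i += width
        if lo == 0:
            lo = 1          # later blocks start at degree 1 ...
        else:
            width *= 2      # ... and double in width: 2, 4, 8, ...
    return num / den
```

With b empty the unit reduces to a plain polynomial of the projection; each squared block can only add to the denominator, which is what rules out zeros.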
\n\nThis representation has the qualities that the denominator is never less than 1, and that n parameters are used to produce a denominator of degree n. If the above formula were continued, the next terms in the denominator would be of degrees eight, sixteen, and thirty-two. This powers-of-two sequence was used for the following reason: of the 2(n - m) terms in the square of a polynomial p = a_m x^m + ... + a_n x^n, it is possible by manipulating a_m ... a_n to determine the n - m highest coefficients, with the exception that the very highest coefficient must be non-negative. Thus if we consider the coefficients of the polynomial that results from squaring and adding together the terms of the denominator of the above formula, the highest degree squared polynomial may be regarded as determining the highest half of the coefficients, the second highest degree polynomial may be regarded as determining the highest half of the rest of the coefficients, and so forth. This process cannot set all the coefficients arbitrarily; some must be non-negative. \n\nFlexible Fourier series: The flexible Fourier series units computed \n\ng(x) = sum_{i=0}^{n} a_i cos(b_i x + c_i) \n\nwhere the amplitudes a_i, frequencies b_i and phases c_i were unconstrained and could assume any value. \n\nSigmoids: We used the standard logistic function: \n\ng(x) = 1/(1 + e^(x - θ)) \n\n3 Training Method \n\nAll the results presented here were trained with the Levenberg-Marquardt modification of the Gauss-Newton nonlinear least squares algorithm. Stochastic gradient descent was also tried at first, but on the problems where the two were compared, Levenberg-Marquardt was much superior both in convergence time and in quality of result. Levenberg-Marquardt required substantially fewer iterations than stochastic gradient descent to converge. 
However, it needs O(p^2) space and O(p^2 n) time per iteration in a network with p parameters and n input examples, as compared to O(p) space and O(pn) time per epoch for stochastic gradient descent. Further details of the training method will be discussed in a longer paper. \n\nWith some data sets, a weight decay term was added to the energy function to be optimized. The added term was of the form λ sum_{i=1}^{p} w_i^2. When weight decay was used, a range of values of λ was tried for every network trained. \n\nBefore training, all the data was normalized: each input variable was scaled so that its range was (-1, 1), then scaled so that the maximum sum of squares of input variables for any example was 1. The output variable was scaled to have mean zero and mean absolute value 1. This helped the training algorithm, especially in the case of stochastic gradient descent. \n\n4 Results \n\nWe present results of training our networks on three data sets: robot arm inverse dynamics, Boston housing data, and sunspot count prediction. The Boston and sunspot data sets are noisy, but have only mild nonlinearity. The robot arm inverse dynamics data has no noise, but a high degree of nonlinearity. Noise-free problems have low estimation error. Models for linear or mildly nonlinear problems typically have low approximation error. The robot arm inverse dynamics problem is thus a pure approximation problem, while performance on the noisy Boston and sunspot problems is limited more by estimation error than by approximation error. \n\nFigure 1a is a graph, like those used in PPR, of the unit response function of a one-unit network trained on the Boston housing data. The x axis is a projection (a weighted sum of inputs w^T u) of the 13-dimensional input space onto 1 dimension, using those weights chosen by the unit in training. The y axis is the fit to data. The response function of the unit is a sum of three cosines. 
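A flexible Fourier unit of this kind, applied to the projection x = w^T u, can be sketched as follows (our illustration; the function names are ours, not the authors'):

```python
import math

def fourier_unit(x, amps, freqs, phases):
    # Flexible Fourier response: g(x) = sum_i a_i * cos(b_i * x + c_i),
    # with amplitudes, frequencies and phases all unconstrained.
    return sum(a * math.cos(b * x + c) for a, b, c in zip(amps, freqs, phases))

def unit_output(u, w, amps, freqs, phases):
    # Project the input vector u onto the weight vector w, then apply g.
    x = sum(wi * ui for wi, ui in zip(w, u))  # x = w^T u
    return fourier_unit(x, amps, freqs, phases)
```

Unlike a sigmoid, a sum of cosines is non-monotonic along the projection direction, which is what lets a single unit fit the kind of response shown in Figure 1.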
Figure 1b is the superposition of five graphs of the five unit response functions used in a five-unit rational function solution (RMS error less than 2%) of the robot arm inverse dynamics problem. The domain for each curve lies along a different direction in the six-dimensional input space. Four of the five fits along the projection directions are non-monotonic, and thus can be fit only poorly by a sigmoid. \n\nTwo different error measures are used in the following. The first is the RMS error, normalized so that an error of 1 corresponds to no training. The second measure is the square of the normalized RMS error, otherwise known as the fraction of unexplained variance. We used whichever error measure was used in earlier work on that data set. \n\nFigure 1: (a) the unit response function fit to the Boston housing data; (b) the robot arm fit to data. \n\n4.1 Robot arm inverse dynamics \n\nThis problem is the determination of the torque necessary at the joints of a two-joint robot arm required to achieve a given acceleration of each segment of the arm, given each segment's velocity and position. There are six input variables to the network, and two output variables. This problem was treated as two separate estimation problems, one for the shoulder torque and one for the elbow torque. The shoulder torque was a slightly more difficult problem for almost all networks. The 1000 points in the training set covered the input space relatively thoroughly. This, together with the fact that the problem had no noise, meant that there was little difference between training set error and test set error. 
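The two error measures used throughout the results can be written down directly; a minimal sketch (our code, with hypothetical function names):

```python
def normalized_rms(y_true, y_pred):
    # RMS error scaled so that an error of 1 corresponds to no training,
    # i.e. to a model that always predicts the mean of the targets.
    n = len(y_true)
    mean = sum(y_true) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    var = sum((t - mean) ** 2 for t in y_true) / n
    return (mse / var) ** 0.5

def squared_normalized_error(y_true, y_pred):
    # The second measure: the square of the normalized RMS error,
    # i.e. the fraction of the target variance left unexplained.
    return normalized_rms(y_true, y_pred) ** 2
```

Both measures equal 1 for an untrained (mean-predicting) model and 0 for a perfect fit, so lower is better in every table and figure that follows.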
\n\nPolynomial networks of limited degree are not universal approximators, and that is quite evident on this data set; polynomial networks of low degree reached their minimum error after a few units. Figure 2a shows this. If polynomial, cosine, rational, and sigmoid networks are compared as in Figure 2b, leaving out low degree polynomials, the sigmoids have relatively high approximation error even for networks with 20 units. As shown in the following table, the complex units have more parameters each, but still get better performance with fewer parameters total. \n\nType                 Units  Parameters  Error \ndegree 7 polynomial  5      65          .024 \ndegree 6 rational    5      95          .027 \n2 term cosine        6      73          .020 \nsigmoid              10     81          .139 \nsigmoid              20     161         .119 \n\nSince the training set is noise-free, these errors represent pure approximation error. \n\nFigure 2: error versus number of units on the robot arm problem: (a) polynomial networks of low degree; (b) polynomial, rational, cosine, and sigmoid networks. \n\nThe superior performance of the complex units on this problem is probably due to their ability to approximate non-monotonic functions. \n\n4.2 Boston housing \n\nThe second data set is a benchmark for statistical algorithms: the prediction of Boston housing prices from 13 factors [3]. This data set contains 506 exemplars and is relatively simple; it can be approximated well with only a single unit. Networks of between one and six units were trained on this problem. Figure 3a is a graph of training set performance from networks trained on the entire data set; the error measure used was the fraction of unexplained variance. From this graph it is apparent 
that training set performance does not vary greatly between different types of units, though networks with more units do better. \n\nOn the test set there is a large difference. This is shown in Figure 3b. Each point on the graph is the average performance of ten networks of that type. Each network was trained using a different permutation of the data into test and training sets, the test set being 1/3 of the examples and the training set 2/3. It can be seen that the cosine nets perform the best, the sigmoid nets a close second, the rationals third, and the polynomials worst (with the error increasing quite a bit with increasing polynomial degree). \n\nFigure 3: Boston housing error versus number of units: (a) training set; (b) test set. \n\nIt should be noted that the distribution of errors is far from normal, and that the training set error gives little clue as to the test set error. The following table of errors, for nine networks of four units using a degree 5 polynomial, is somewhat typical: \n\nSet       Error \ntraining  0.091 \ntest      0.395 \n\nOur speculation on the cause of these extremely high errors is that polynomial approximations do not extrapolate well; if the prediction of some data point results in a polynomial being evaluated slightly outside the region on which the polynomial was trained, the error may be extremely high. Rational functions where the numerator and denominator have equal degree have less of a problem with this, since asymptotically they are constant. However, over small intervals they can have the extrapolation characteristics of polynomials. 
Cosines are bounded, and so, though they may not extrapolate well if the function is not somewhat periodic, they at least do not reach large values like polynomials. \n\n4.3 Sunspots \n\nThe third problem was the prediction of the average monthly sunspot count in a given year from the values of the previous twelve years. We followed previous work in using as our error measure the fraction of unexplained variance, and in using as the training set the years 1700 through 1920 and as the test set the years 1921 through 1955. This was a relatively easy test set - every network of one unit which we trained (whether sigmoid, polynomial, rational, or cosine) had, in each of ten runs, a training set error between .147 and .153 and a test set error between .105 and .111. For comparison, the best test set error achieved by us or previous testers was about .085. A set of runs similar to those for the Boston housing data was done, but using at most four units; similar results were obtained. Figure 4a shows training set error and Figure 4b shows test set error on this problem. \n\nFigure 4: sunspot error versus number of units: (a) training set; (b) test set. \n\n4.4 Weight Decay \n\nThe performance of almost all networks was improved by some amount of weight decay. Figure 5 contains graphs of test set error for sigmoidal and polynomial units, using various values of the weight decay parameter λ. 
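These experiments amount to adding the decay term λ sum_{i=1}^{p} w_i^2 from Section 3 to the sum-of-squares error and scanning a range of λ per network; a minimal sketch of the penalized objective (our illustration; the helper name and the example λ values are ours):

```python
def penalized_error(residuals, weights, lam):
    # Sum-of-squares training error ...
    sse = sum(r * r for r in residuals)
    # ... plus the weight-decay term lam * sum of squared weights.
    return sse + lam * sum(w * w for w in weights)

# A range of decay values would be tried for every network trained,
# e.g. spanning several orders of magnitude:
decay_values = [0.0, 1e-4, 1e-3, 1e-2, 1e-1]
curves = [penalized_error([1.0, 2.0], [3.0, 0.5], lam) for lam in decay_values]
```

Larger λ pulls the weights toward zero, which is consistent with the observation below that very high decay makes a multi-unit network behave much like a one-unit network.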
For the sigmoids, very little weight decay seems to be needed to give good results, and there is an order of magnitude range (between .001 and .01) which produces close to optimal results. For polynomials of degree 5, more weight decay seems to be necessary for good results; in fact, the highest value of weight decay is the best. Since very high values of weight decay are needed, and at those values there is little improvement over using a single unit, it may be supposed that those values of weight decay restrict the multiple units to producing a solution very similar to the one-unit solution. Figure 6 contains the corresponding graphs for sunspots. Weight decay seems to help less here for the sigmoids, but for the polynomials, moderate amounts of weight decay produce an improvement over the one-unit solution. \n\nAcknowledgements \n\nThe authors would like to acknowledge support from ONR grant N00014-89-J-1228, AFOSR grant 89-0478, and a fellowship from the John and Fannie Hertz Foundation. The robot arm data set was provided by Chris Atkeson. \n\nReferences \n\n[1] J. H. Friedman and W. Stuetzle, \"Projection Pursuit Regression\", Journal of the American Statistical Association, December 1981, Volume 76, Number 376, 817-823. \n\n[2] P. J. Huber, \"Projection Pursuit\", The Annals of Statistics, 1985, Vol. 13, No. 2, 435-475. \n\n[3] L. 
Breiman et al., Classification and Regression Trees, Wadsworth and Brooks, 1984, pp. 217-220. \n\nFigure 5: Boston housing test error with various amounts of weight decay \n\nFigure 6: Sunspot test error with various amounts of weight decay \n", "award": [], "sourceid": 568, "authors": [{"given_name": "John", "family_name": "Moody", "institution": null}, {"given_name": "Norman", "family_name": "Yarvin", "institution": null}]}