{"title": "Adaptive Spline Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 675, "page_last": 683, "abstract": null, "full_text": "ADAPTIVE SPLINE NETWORKS \n\nJerome H. Friedman \nDepartment of Statistics and \nStanford Linear Accelerator Center \nStanford University \nStanford, CA 94305 \n\nAbstract \n\nA network based on splines is described. It automatically adapts the number of units, unit parameters, and the architecture of the network for each application. \n\n1 INTRODUCTION \n\nIn supervised learning one has a system under study that responds to a set of simultaneous input signals {x_1, ..., x_n}. The response is characterized by a set of output signals {y_1, y_2, ..., y_m}. The goal is to learn the relationship between the inputs and the outputs. This exercise generally has two purposes: prediction and understanding. With prediction one is given a set of input values and wishes to predict or forecast likely values of the corresponding outputs without having to actually run the system. Sometimes prediction is the only purpose. Often, however, one wishes to use the derived relationship to gain understanding of how the system works. Such knowledge is often useful in its own right, for example in science, or it may be used to help improve the characteristics of the system, as in industrial or engineering applications. \nThe learning is accomplished by taking training data. One observes the outputs produced by the system in response to varying sets of input values \n\n{y_{1i}, ..., y_{mi} | x_{1i}, ..., x_{ni}}_{i=1}^{N}.     (1) \n\nThese data (1) are then used to train an \"artificial\" system (usually a computer program) to learn the input/output relationship. The underlying framework or model is usually taken to be \n\ny_k = f_k(x_1, ..., x_n) + e_k,   k = 1, ..., m     (2) \n\n675 \n\nwith ave(e_k | x_1, ..., x_n) = 0. 
Here (2) y_k is the kth responding output signal, f_k is a single valued deterministic function of an n-dimensional argument (the inputs), and e_k is a random (stochastic) component that reflects the fact that (if nonzero) y_k is not completely specified by the observed inputs, but is also responding to other quantities that are neither controlled nor observed. In this framework the learning goal is to use the training data to derive a function f^_k(x_1, ..., x_n) that can serve as a reasonable approximation (estimate) of the true underlying (\"target\") function f_k (2). The supervised learning problem can in this way be viewed as one of function or surface approximation, usually in high dimensions (n >> 2). \n\n2 SPLINES \n\nThere is an extensive literature on the theory of function approximation (see Cheney [1986] and Chui [1988], and references therein). From this literature spline methods have emerged as being among the most successful (see de Boor [1978] for a nice introduction to spline methods). Loosely speaking, spline functions have the property that they are the smoothest for a given flexibility, and vice versa. This is important if one wishes to operate under the least restrictive assumptions concerning f_k(x_1, ..., x_n) (2), namely, that it is relatively smooth compared to the noise e_k but is otherwise arbitrary. A spline approximation is characterized by its order q [q = 1 (linear), q = 2 (quadratic), and q = 3 (cubic) are the most popular orders]. The procedure is to first partition the input variable space into a set of disjoint regions. The approximation f^(x_1, ..., x_n) is taken to be a separate n-dimensional polynomial in each region with maximum degree q in any one variable, constrained so that f^ and all of its derivatives to order q - 1 are continuous across all region boundaries. 
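The continuity constraint just described can be verified numerically. The following sketch (illustrative only, not from the paper; the weights and knot location are invented) builds a one-dimensional order q = 2 spline from truncated power functions and checks that its value and first derivative are continuous across the knot:

```python
def trunc_pow(x, t, q):
    # truncated power function (x - t)_+^q: zero to the left of knot t
    return (x - t) ** q if x > t else 0.0

def spline(x, q=2, knot=1.0, w=(0.5, -1.0, 0.25, 2.0)):
    # w0 + w1*x + w2*x^2 + w3*(x - knot)_+^2, with hypothetical weights w
    return w[0] + w[1] * x + w[2] * x ** 2 + w[3] * trunc_pow(x, knot, q)

eps = 1e-6
# value and slope (finite differences) agree on both sides of the knot,
# even though the curvature jumps there
value_jump = abs(spline(1.0 - eps) - spline(1.0 + eps))
slope_left = (spline(1.0) - spline(1.0 - eps)) / eps
slope_right = (spline(1.0 + eps) - spline(1.0)) / eps
print(value_jump < 1e-5, abs(slope_left - slope_right) < 1e-4)  # True True
```

Only the q-th derivative is discontinuous at the knot, which is exactly the order-(q - 1) smoothness stated above.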
\nThus, a particular spline approximation is determined by a choice for q, which tends not to be very important, and by the particular set of chosen regions, which tends to be crucial. The central problem associated with spline approximations is how to choose a good set of associated regions for the problem at hand. \n\n2.1 TENSOR-PRODUCT SPLINES \n\nThe most popular method for partitioning the input variable space is by the tensor or outer product of interval sets on each of the n axes. Each input axis is partitioned into K + 1 intervals delineated by K points (\"knots\"). The regions in the n-dimensional space are taken to be the (K + 1)^n intersections of all such intervals. Figure 1 illustrates this procedure for K = 4 knots on each of two axes, producing 25 regions in the corresponding two-dimensional space. \n\nOwing to the regularity of tensor-product representations, the corresponding spline approximation can be represented in a simple form as a basis function expansion. Let x = (x_1, ..., x_n). Then \n\nf^(x) = sum_t w_t B_t(x)     (3) \n\nwhere {w_t} are the coefficients (weights) for each respective basis function B_t(x), and the basis function set {B_t(x)} is obtained by taking the tensor product of the set of functions \n\n{1, x_j, x_j^2, ..., x_j^q, (x_j - t_{1j})_+^q, ..., (x_j - t_{Kj})_+^q}     (4) \n\nover all of the axes, j = 1, ..., n. That is, each of the K + q + 1 functions on each axis j (j = 1, ..., n) is multiplied by all of the functions (4) corresponding to all of the other axes k (k = 1, ..., n; k != j). As a result the total number of basis functions (3) defining the tensor-product spline approximation is \n\n(K + q + 1)^n.     (5) \n\nThe functions comprising the second set in (4) are known as the truncated power functions: \n\n(x_j - t_{kj})_+^q = 0 if x_j <= t_{kj}, and (x_j - t_{kj})^q if x_j > t_{kj},     (6) \n\nand there is one for each knot location t_{kj} (k = 1, ..., K) on each input axis j (j = 1, ..., n). 
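As a concrete illustration of (3)-(6) (a sketch, not from the paper; the knot placements below are invented), the tensor-product basis can be enumerated directly. For n = 2 axes, K = 4 knots per axis, and cubic splines (q = 3), this yields (K + q + 1)^n = 8^2 = 64 basis functions:

```python
from itertools import product
from math import prod

def axis_basis(knots, q):
    # the K + q + 1 one-axis functions of (4): 1, x, ..., x^q, then one
    # truncated power function (x - t_k)_+^q per knot t_k
    funcs = [lambda x, p=p: x ** p for p in range(q + 1)]
    funcs += [lambda x, t=t: max(x - t, 0.0) ** q for t in knots]
    return funcs

def tensor_basis(knots_per_axis, q):
    # one n-variate basis function per choice of one factor on each axis
    axes = [axis_basis(k, q) for k in knots_per_axis]
    return [lambda x, fs=fs: prod(f(xi) for f, xi in zip(fs, x))
            for fs in product(*axes)]

knots = [0.2, 0.4, 0.6, 0.8]            # hypothetical knot locations
B = tensor_basis([knots, knots], q=3)   # n = 2, K = 4, q = 3
print(len(B))  # 64, i.e. (K + q + 1)^n
```

The first element of `B` is the constant function 1 (the product of the two x^0 factors), and every other element is a product of one factor per axis.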
\nAlthough conceptually quite simple, tensor-product splines have severe limitations that preclude their use in high dimensional settings (n >> 2). These limitations stem from the exponentially large number of basis functions that are required (5). For cubic splines (q = 3) with five inputs (n = 5) and only five knots per axis (K = 5), 59049 basis functions are required. For n = 6 that number is 531441, and for n = 10 it is approximately 3.5 x 10^9. This poses severe statistical problems in fitting the corresponding number of weights unless the training sample is large compared to these numbers, and computational problems in any case, since the computation grows as the cube of the number of weights (basis functions). These are typical manifestations of the so-called \"curse-of-dimensionality\" (Bellman [1961]) that afflicts nearly all high-dimensional problems. \n\n3 ADAPTIVE SPLINES \n\nThis section gives a very brief overview of an adaptive strategy that attempts to overcome the limitations of the straightforward application of tensor-product splines, making practical their use in high-dimensional settings. This method, called MARS (multivariate adaptive regression splines), is described in detail in Friedman [1991] along with many examples of its use involving both real and artificially generated data. (A FORTRAN program implementing the method is available from the author.) \nThe method (conceptually) begins by generating a tensor-product partition of the input variable space using a large number of knots, K < N, on each axis. Here N (1) is the training sample size. This induces a very large number, (K + 1)^n, of regions. The procedure then uses the training data to select particular unions of these (initially large number of) regions to define a relatively small number of (larger) regions most suitable for the problem at hand. 
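The basis-function counts quoted above follow directly from (5); a quick check of (K + q + 1)^n for q = 3 and K = 5:

```python
def n_basis(n, K=5, q=3):
    # total number of tensor-product basis functions, equation (5)
    return (K + q + 1) ** n

print(n_basis(5), n_basis(6), n_basis(10))  # 59049 531441 3486784401
```

The last value, 3486784401, is the "approximately 3.5 x 10^9" quoted in the text.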
\n\nThis strategy is implemented through the basis function representation of spline approximations (3). The idea is to select a relatively small subset of basis functions \n\n{B_m(x)}_{small} ⊂ {B_l(x)}_{large}     (7) \n\nfrom the very large set (3) (4) (5) induced by the initial tensor-product partition. The particular subset for a problem at hand is obtained through standard statistical variable subset selection, treating the basis functions as the \"variables\". At the first step the best single basis function is chosen. The second step chooses the basis function that works best in conjunction with the first. At the mth step, the one that works best with the m - 1 already selected is chosen, and so on. The process stops when including additional basis functions fails to improve the approximation. \n\n3.1 ADAPTIVE SPLINE NETWORKS \n\nThis section describes a network implementation that approximates the adaptive spline strategy described in the previous section. The goal is to synthesize a good set of spline basis functions (7) to approximate a particular system's input/output relationship, using the training data. For the moment, consider only one output y; this is generalized later. The basic observation leading to this implementation is that the approximation takes the form of sums of products of very simple functions, namely the truncated power functions (6), each involving a single input variable, \n\nB_m(x) = prod_{k=1}^{K_m} (x_{j(k)} - t_{km})_+^q     (8) \n\nand \n\nf^(x) = sum_{m=0}^{M} w_m B_m(x).     (9) \n\nHere K_m is the number of truncated power factors comprising the mth basis function, and j(k) labels the input variable entering its kth factor. \n\nBy using q > 0 splines, continuous approximations are produced. This generally results in a dramatic increase in accuracy. In addition, all unit outputs are eligible to contribute to the final adder, not just the terminal ones; and finally, all previous unit outputs are eligible to be selected as inputs for new units, not just the currently terminal ones. 
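The product form (8) and weighted sum (9) can be sketched directly as follows (illustrative only; the variable choices, knots, and weights below are invented, and q = 1 is used for simplicity):

```python
from math import prod

def mars_basis(factors, q=1):
    # factors: list of (input index j(k), knot t_k) pairs; an empty list
    # gives the constant basis function B_0(x) = 1 (the empty product)
    def B(x):
        return prod(max(x[j] - t, 0.0) ** q for j, t in factors)
    return B

def mars_model(terms):
    # terms: list of (weight w_m, basis B_m) pairs, as in (9)
    def f(x):
        return sum(w * B(x) for w, B in terms)
    return f

f = mars_model([
    (3.0, mars_basis([])),                      # intercept term
    (2.0, mars_basis([(0, 0.5)])),              # single factor (x_1 - 0.5)_+
    (-1.0, mars_basis([(0, 0.5), (1, 0.2)])),   # two-factor interaction
])
print(f((1.0, 1.0)))  # 3.6
```

Each factor involves a single input, so a basis function with K_m factors represents an interaction of order K_m, and the constant term corresponds to m = 0.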
\n\nBoth additive and CART approximations have been highly successful in largely complementary situations: additive modeling when the true underlying function is close to additive, and CART when it predominantly involves high order interactions among the input variables. MARS unifies both into a single framework. This lends hope that MARS will be successful at both of these extremes, as well as over the broad spectrum of situations in between where neither works well. \n\nMultiple response outputs y_1, ..., y_m (1) (2) are incorporated in a straightforward manner. The internal units and their interconnections are the same as described above and shown in Figures 2 and 3. Only the final weighted adder unit (Figure 2) is modified to incorporate a set of weights \n\n{w_{mk}}_{m=0}^{M}     (14) \n\nfor each response output (k = 1, ..., m). The approximation for each output is \n\nf^_k(x) = sum_{m=0}^{M} w_{mk} B_m(x),   k = 1, ..., m. \n\nThe numerator in the GCV criterion (12) is replaced by \n\n(1/(mN)) sum_{k=1}^{m} sum_{i=1}^{N} (y_{ik} - f^_{ik})^2 \n\nand it is minimized with respect to the internal network parameters (10) and all of the weights (14). \n\n4 DISCUSSION \n\nThis section (briefly) compares and contrasts the MARS approach with radial basis functions and with sigmoid \"back-propagation\" networks. An important consequence of the MARS strategy is input variable subset selection. Each unit individually selects the best system input so that it can best contribute to the approximation. It is often the case that some or many of the inputs are never selected. These will be inputs that tend to have little or no effect on the output(s). In this case excluding them from the approximation will greatly increase statistical accuracy. It also aids in the interpretation of the produced model. In addition to global variable subset selection, MARS is able to do input variable subset selection locally in different regions of the input variable space. 
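The multiresponse lack-of-fit criterion just described can be sketched as follows (a sketch under assumptions: the full criterion (12) is not reproduced in this excerpt, so the denominator below uses the standard generalized cross-validation form of Craven and Wahba [1979], with an effective parameter count C(M) supplied by the caller):

```python
def gcv(y, y_hat, C_M):
    # y, y_hat: m sequences of N observed / fitted responses; the numerator
    # is the squared error averaged over all m outputs and N observations
    m, N = len(y), len(y[0])
    sse = sum((y[k][i] - y_hat[k][i]) ** 2 for k in range(m) for i in range(N))
    return (sse / (m * N)) / (1.0 - C_M / N) ** 2

# two outputs (m = 2), two observations (N = 2), made-up values
print(gcv([[1.0, 2.0], [0.0, 1.0]], [[1.1, 1.9], [0.2, 1.0]], C_M=1.0))
```

Minimizing this quantity jointly over the internal network parameters and all of the weights (14) trades training error against model size through C(M).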
This is a consequence of the restricted support (nonzero value) of the basis functions produced. Thus, if in any local region the target function (2) depends on only a few of the inputs, MARS is able to use this to advantage even if the relevant inputs are different in different local regions. Also, MARS is able to produce approximations of low interaction order even if the number of selected inputs is large. \n\nRadial basis functions are not able to do local (or usually even global) input variable subset selection as a part of the procedure. All basis functions involve all of the inputs at the same relative strength everywhere in the input variable space. If the target function (2) is of this nature they will perform well, in that no competing procedure will do better, or likely even as well. If this is not the case, radial basis functions are not able to take advantage of the situation to improve accuracy. Also, radially symmetric basis functions produce approximations of the highest possible interaction order (everywhere in the input space). This results in a marked disadvantage if the target function tends to predominantly involve interactions among at most a few of the inputs (such as the additive functions (13)). \n\nStandard networks based on sigmoidal units of linear combinations of inputs share the properties described above for radial basis functions. Including \"weight elimination\" (Rumelhart [1988]) provides an (important) ability to do global (but not local) input variable subset selection. The principal differences between MARS and this approach center on the use of splines rather than sigmoids, and of products rather than linear combinations of the input variables. Splines tend to be more flexible in that two spline functions can closely approximate any sigmoid, whereas it can take many sigmoids to approximate some splines. 
MARS' use of product expansions enables it to produce approximations that are local in nature. Local approximations have the property that if the target function is badly behaved in any local region of the input space, the quality of the approximation is not affected in the other regions. Also, as noted above, MARS can produce approximations of low interaction order. This is difficult for approximations based on linear combinations. \n\nBoth radial basis functions and sigmoidal networks produce approximations that are difficult to interpret. Even in situations where they produce high accuracy, they provide little information concerning the nature of the target function. MARS approximations, on the other hand, can often provide considerable interpretable information. Interpreting MARS models is discussed in detail in Friedman [1991]. Finally, training MARS networks tends to be computationally much faster than other types of learning procedures. \n\nReferences \n\nBellman, R. E. (1961). Adaptive Control Processes. Princeton University Press, Princeton, NJ. \n\nBreiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA. \n\nCheney, E. W. (1986). Multivariate Approximation Theory: Selected Topics. Monograph: SIAM CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 51. \n\nChui, C. K. (1988). Multivariate Splines. Monograph: SIAM CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 54. \n\nCraven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions. Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik 31, 377-403. \n\nde Boor, C. (1978). A Practical Guide to Splines. Springer-Verlag, New York, NY. \n\nFriedman, J. H. (1991). Multivariate adaptive regression splines (with discussion). Annals of Statistics, March. 
\nRumelhart, D. E. (1988). Learning and generalization. IEEE International Conference on Neural Networks, San Diego, plenary address. \n\n[Figures 1-4] \n", "award": [], "sourceid": 408, "authors": [{"given_name": "Jerome", "family_name": "Friedman", "institution": null}]}