{"title": "Basis-Function Trees as a Generalization of Local Variable Selection Methods for Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 700, "page_last": 706, "abstract": null, "full_text": "Basis-Function Trees as a Generalization of Local Variable Selection Methods for Function Approximation \n\nTerence D. Sanger \nDept. Electrical Engineering and Computer Science \nMassachusetts Institute of Technology, E25-534 \nCambridge, MA 02139 \n\nAbstract \n\nLocal variable selection has proven to be a powerful technique for approximating functions in high-dimensional spaces. It is used in several statistical methods, including CART, ID3, C4, MARS, and others (see the bibliography for references to these algorithms). In this paper I present a tree-structured network which is a generalization of these techniques. The network provides a framework for understanding the behavior of such algorithms and for modifying them to suit particular applications. \n\n1 INTRODUCTION \n\nFunction approximation on high-dimensional spaces is often thwarted by a lack of sufficient data to adequately \"fill\" the space, or by a lack of sufficient computational resources. The technique of local variable selection provides a partial solution to these problems by attempting to approximate functions locally using fewer than the complete set of input dimensions. \n\nSeveral algorithms currently exist which take advantage of local variable selection, including AID (Morgan and Sonquist, 1963, Sonquist et al., 1971), k-d Trees (Bentley, 1975), ID3 (Quinlan, 1983, Schlimmer and Fisher, 1986, Sun et al., 1988), CART (Breiman et al., 1984), C4 (Quinlan, 1987), and MARS (Friedman, 1988), as well as closely related algorithms such as GMDH (Ivakhnenko, 1971, Ikeda et al., 1976, Barron et al., 1984) and SONN (Tenorio and Lee, 1989). 
Most of these algorithms use tree structures to represent the sequential incorporation of increasing numbers of input variables. The differences between these techniques lie in the representational ability of the networks they generate, and in the methods used to grow and prune the trees. In the following I will show why trees are a natural structure for these techniques, and how all of these algorithms can be seen as special cases of a general method I call \"Basis Function Trees\". I will also propose a new algorithm called an \"LMS tree\" which has a simple and fast network implementation. \n\n2 SEPARABLE BASIS FUNCTIONS \n\nConsider approximating a scalar function f(x) of d-dimensional input x by \n\nf(x_1, ..., x_d) ≈ Σ_{i=1}^L c_i u_i(x_1, ..., x_d)    (1) \n\nwhere the u_i's are a finite set of nonlinear basis functions, and the c_i's are constant coefficients. If the u_i's are separable functions we can assume without loss of generality that there exists a finite set of scalar-input functions {φ_n}_{n=1}^N (which includes the constant function), such that we can write \n\nu_i(x_1, ..., x_d) = Π_{p=1}^d φ_{r_p^i}(x_p)    (2) \n\nwhere x_p is the p'th component of x, φ_{r_p^i}(x_p) is a scalar function of the scalar input x_p, and r_p^i is an integer from 1 to N specifying which function φ is chosen for the p'th dimension of the i'th basis function u_i. \n\nIf there are d input dimensions and N possible scalar functions φ_n, then there are N^d possible basis functions u_i. If d is large, then there will be a prohibitively large number of basis functions and coefficients to compute. This is one form of Bellman's \"curse of dimensionality\" (Bellman, 1961). The purpose of local variable selection methods is to find a small basis which uses products of fewer than d of the φ_n's. 
If the φ_n's are local functions, then this will select different subsets of the input variables for different ranges of their values. Most of these methods work by incrementally increasing both the number and the order of the separable basis functions until the approximation error is below some threshold. \n\n3 TREE STRUCTURES \n\nPolynomials have a natural representation as a tree structure. In this representation, the output of a subtree of a node determines the weight from that node to its parent. For example, in figure 1, the subtree computes its output by summing the weights a and b multiplied by the inputs x and y, and the result ax + by becomes the weight from the input x at the first layer. The depth of the tree gives the order of the polynomial, and a leaf at a particular depth p represents a monomial of order p which can be found by taking products of all inputs on the path back to the root. \n\nFigure 1: Tree representation of the polynomial ax² + bxy + cy + dz. \n\nNow, if we expand equation 1 to get \n\nf(x_1, ..., x_d) ≈ Σ_{i=1}^L c_i φ_{r_1^i}(x_1) ... φ_{r_d^i}(x_d)    (3) \n\nwe see that the approximation is a polynomial in the terms φ_{r_p^i}(x_p). So the approximation on separable basis functions can be described as a tree where the \"inputs\" are the one-dimensional functions φ_1(x_1), ..., φ_N(x_1). If the first-layer weights a_n are trained with the LMS rule (Widrow and Hoff, 1960), Δa_n = η(f − f̂)φ_n(x_1), then once the approximation has converged the expected value of the weight change E[Δa_n] will be zero. However, there may still be considerable variance in the weight changes, so that E[(Δa_n)²] ≠ 0. The weight change variance indicates that there is \"pressure\" to increase or decrease the weights for certain input values, and it is related to the output error by \n\nΣ_{n=1}^N E[(Δa_n)²] / (min_{x_1} Σ_{n=1}^N φ_n²(x_1)) ≥ E[(f − f̂)²] ≥ max_n E[(Δa_n)²] / E[φ_n(x_1)²]    (6) \n\n(Sanger, 1990b). So the output error will be zero if and only if E[(Δa_n)²] = 0 for all n. 
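To make the weight-change-variance idea concrete, here is a small illustrative sketch (mine, not the paper's released implementation): LMS training on an assumed fixed scalar basis {1, x, x²} with an assumed learning rate; the names `PHIS` and `lms_train` are hypothetical. A target inside the span of the basis drives the estimated E[(Δa_n)²] toward zero, while a target outside the span leaves it bounded away from zero, as the inequality above predicts.

```python
import random

random.seed(0)
PHIS = [lambda x: 1.0, lambda x: x, lambda x: x * x]  # scalar basis functions, incl. the constant

def lms_train(target, eta=0.05, steps=20000, decay=0.99):
    # Train f_hat(x) = sum_n a[n] * phi_n(x) with the LMS rule, tracking an
    # exponentially weighted estimate of the weight-change variance E[(da_n)^2].
    a = [0.0] * len(PHIS)
    var = [0.0] * len(PHIS)
    for _ in range(steps):
        x = random.uniform(-1.0, 1.0)
        u = [p(x) for p in PHIS]
        err = target(x) - sum(w * f for w, f in zip(a, u))
        for n in range(len(PHIS)):
            da = eta * err * u[n]                      # LMS weight change
            a[n] += da
            var[n] = decay * var[n] + (1 - decay) * da * da
    return a, var

# Target inside the span of the basis: weight-change variance collapses.
a_fit, var_fit = lms_train(lambda x: 2.0 + 3.0 * x)
# Target outside the span: residual error keeps the variance pressure alive.
a_bad, var_bad = lms_train(lambda x: abs(x) ** 0.5)
```

In the representable case the weights settle near (2, 3, 0) and the tracked variance decays toward zero; in the unrepresentable case the variance stays larger, signaling where a subtree should be grown.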
\n\nWe can decrease the weight change variance by using another network based on x_2 to add a variable term to the weight a_{r_1} with largest variance, so that the new network is given by \n\nf̂(x_1, x_2) = Σ_{n≠r_1} a_n φ_n(x_1) + (a_{r_1} + Σ_{m=1}^N a_{r_1,m} φ_m(x_2)) φ_{r_1}(x_1).    (7) \n\nΔa_{r_1} becomes the error term used to train the second-level weights a_{r_1,m}, so that Δa_{r_1,m} = Δa_{r_1} φ_m(x_2). In general, the weight change at any layer in the tree is the error term for the layer below, so that \n\nΔa_{r_1,...,r_k,m} = Δa_{r_1,...,r_k} φ_m(x_{k+1})    (8) \n\nwhere the root of the recursion is Δa_∅ = η(f(x_1, ..., x_d) − f̂), and a_∅ is a constant term associated with the root of the tree. \n\nAs described so far, the algorithm imposes an arbitrary ordering on the dimensions x_1, ..., x_d. This can be avoided by using all dimensions at once. The first-layer tree would be formed by the additive approximation \n\nf(x_1, ..., x_d) ≈ Σ_{p=1}^d Σ_{n=1}^N a_{(n,p)} φ_n(x_p).    (9) \n\nNew subtrees would include all dimensions and could be grown below any φ_n(x_p). Since this technique generates larger trees, tree pruning becomes very important. In practice, most of the weights in large trees are often close to zero, so after a network has been trained, weights below a threshold level can be set to zero and any leaf with a zero weight can be removed. \n\nLMS trees have the advantage of being extremely fast and easy to program. 
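The growth step of equations (7) and (8) can be sketched the same way. This is an illustrative toy under my own assumptions (two-element bases on each input, target f(x_1, x_2) = x_1·x_2, names `PHI1`, `PHI2`, `R1`, `predict` are hypothetical): a first layer on x_1 alone cannot fit the target, so a second-level subtree on x_2 is grown under the weight multiplying φ(x_1) = x_1, and the first-layer weight change is passed down as the second layer's error term.

```python
import random

random.seed(1)
PHI1 = [lambda x: 1.0, lambda x: x]   # first-layer basis on x1
PHI2 = [lambda x: 1.0, lambda x: x]   # second-layer basis on x2
R1 = 1                                # expanded weight: the one multiplying phi(x1) = x1

def predict(a, b, x1, x2):
    # Eq. (7): f_hat = sum_{n != r1} a[n] phi_n(x1)
    #                  + (a[r1] + sum_m b[m] phi_m(x2)) * phi_{r1}(x1)
    top = sum(a[n] * PHI1[n](x1) for n in range(len(PHI1)) if n != R1)
    inner = a[R1] + sum(b[m] * PHI2[m](x2) for m in range(len(PHI2)))
    return top + inner * PHI1[R1](x1)

a, b, eta = [0.0, 0.0], [0.0, 0.0], 0.05
for _ in range(40000):
    x1, x2 = random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)
    err = x1 * x2 - predict(a, b, x1, x2)  # target f(x1, x2) = x1 * x2
    for n in range(len(PHI1)):
        da = eta * err * PHI1[n](x1)       # first-layer LMS change
        a[n] += da
        if n == R1:                        # eq. (8): pass da down as the error term
            for m in range(len(PHI2)):
                b[m] += da * PHI2[m](x2)
```

After training, the two-level tree reproduces the product x_1·x_2 that no additive first layer could represent, with no storage of past data, consistent with the memoryless property claimed for LMS trees.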
(For example, a 49-input network was trained to a size of 20 subtrees on 40,000 data samples in approximately 30 minutes of elapsed time on a Sun-4 computer. The LMS tree algorithm required 22 lines of C code (Sanger, 1990b).) The LMS rule trains the weights and automatically provides the weight change variance which is used to grow new subtrees. The data set does not have to be stored, so no memory is required at the nodes. Because the weight learning and tree growing both use the recursive LMS rule, trees can adapt to slowly-varying nonstationary environments. \n\nMethod                           Basis Functions                   Tree Growing \nMARS                             Truncated cubic polynomials       Exhaustive search for the split which minimizes a cross-validation criterion \nCART (Regression), AID           Step functions                    Split the leaf with the largest mean-squared prediction error (= weight variance) \nCART (Classification), ID3, C4   Step functions                    Choose the split which maximizes an information criterion \nk-d Trees                        Step functions                    Split the leaf with the most data points \nGMDH, SONN                       Data dimensions                   Find the product of existing terms which maximizes correlation to the desired function \nLMS Trees                        Any; all dimensions               Split the leaf with the largest weight change variance \n                                 present at each level \n\nFigure 3: Existing tree algorithms. \n\n6 CONCLUSION \n\nFigure 3 shows how several of the existing tree algorithms fit into the framework presented here. Some aspects of these algorithms are not well described by this framework. For instance, in MARS the location of the spline functions can depend on the data, so the φ_n's do not form a fixed finite basis set. GMDH is not well described by a tree structure, since new leaves can be formed by taking products of existing leaves, and thus the approximation order can increase by more than 1 as each layer is added. 
However, it seems that the essential features of these algorithms, and the way in which they can help avoid the \"curse of dimensionality\", are well explained by this formulation. \n\nAcknowledgements \n\nThanks are due to John Moody for introducing me to MARS, to Chris Atkeson for introducing me to the other statistical methods, and to the many people at NIPS who gave useful comments and suggestions. The LMS Tree technique was inspired by a course at MIT taught by Chris Atkeson, Michael Jordan, and Marc Raibert. This report describes research done within the laboratory of Dr. Emilio Bizzi in the Department of Brain and Cognitive Sciences at MIT. The author was supported by an NDSEG fellowship from the U.S. Air Force. \n\nReferences \n\nBarron R. L., Mucciardi A. N., Cook F. J., Craig J. N., Barron A. R., 1984, Adaptive learning networks: Development and application in the United States of algorithms related to GMDH, In Farlow S. J., ed., Self-Organizing Methods in Modeling, Marcel Dekker, New York. \nBellman R. E., 1961, Adaptive Control Processes, Princeton Univ. Press, Princeton, NJ. \nBentley J. H., 1975, Multidimensional binary search trees used for associative searching, Communications ACM, 18(9):509-517. \nBreiman L., Friedman J., Olshen R., Stone C. J., 1984, Classification and Regression Trees, Wadsworth, Belmont, California. \nFriedman J. H., 1988, Multivariate adaptive regression splines, Technical Report 102, Stanford Univ. Lab for Computational Statistics. \nIkeda S., Ochiai M., Sawaragi Y., 1976, Sequential GMDH algorithm and its application to river flow prediction, IEEE Trans. Systems, Man, and Cybernetics, SMC-6(7):473-479. \nIvakhnenko A. G., 1971, Polynomial theory of complex systems, IEEE Trans. Systems, Man, and Cybernetics, SMC-1(4):364-378. \nLjung L., Soderstrom T., 1983, Theory and Practice of Recursive Identification, MIT Press, Cambridge, MA. \nMorgan J. N., Sonquist J. 
A., 1963, Problems in the analysis of survey data, and a proposal, J. Am. Statistical Assoc., 58:415-434. \nQuinlan J. R., 1983, Learning efficient classification procedures and their application to chess end games, In Michalski R. S., Carbonell J. G., Mitchell T. M., eds., Machine Learning: An Artificial Intelligence Approach, chapter 15, pages 463-482, Tioga P., Palo Alto. \nQuinlan J. R., 1987, Simplifying decision trees, Int. J. Man-Machine Studies, 27:221-234. \nSanger T. D., 1990a, Basis-function trees for approximation in high-dimensional spaces, In Touretzky D., Elman J., Sejnowski T., Hinton G., eds., Proceedings of the 1990 Connectionist Models Summer School, pages 145-151, Morgan Kaufmann, San Mateo, CA. \nSanger T. D., 1990b, A tree-structured algorithm for function approximation in high-dimensional spaces, IEEE Trans. Neural Networks, in press. \nSanger T. D., 1991, A tree-structured algorithm for reducing computation in networks with separable basis functions, Neural Computation, 3(1), in press. \nSchlimmer J. C., Fisher D., 1986, A case study of incremental concept induction, In Proc. AAAI-86, Fifth National Conference on AI, pages 496-501, Morgan Kaufmann, Los Altos. \nSonquist J. A., Baker E. L., Morgan J. N., 1971, Searching for Structure, Institute for Social Research, Univ. Michigan, Ann Arbor. \nSun G. Z., Lee Y. C., Chen H. H., 1988, A novel net that learns sequential decision process, In Anderson D. Z., ed., Neural Information Processing Systems, pages 760-766, American Institute of Physics, New York. \nTenorio M. F., Lee W.-T., 1989, Self organizing neural network for optimum supervised learning, Technical Report TR-EE 89-30, Purdue Univ. School of Elec. Eng. \nWidrow B., Hoff M. E., 1960, Adaptive switching circuits, In IRE WESCON Conv. Record, Part 4, pages 96-104. \n", "award": [], "sourceid": 340, "authors": [{"given_name": "Terence", "family_name": "Sanger", "institution": null}]}