{"title": "Data Analysis using G/SPLINES", "book": "Advances in Neural Information Processing Systems", "page_first": 1088, "page_last": 1095, "abstract": null, "full_text": "Data Analysis using G/SPLINES \n\nDavid Rogers* \n\nResearch Institute for Advanced Computer Science \nMS T041-5, NASA/Ames Research Center \nMoffett Field, CA 94035 \nINTERNET: drogers@riacs.edu \n\nAbstract \n\nG/SPLINES is an algorithm for building functional models of data. It uses genetic search to discover combinations of basis functions which are then used to build a least-squares regression model. Because it produces a population of models which evolve over time rather than a single model, it allows analysis not possible with other regression-based approaches. \n\n1 INTRODUCTION \nG/SPLINES is a hybrid of Friedman's Multivariate Adaptive Regression Splines (MARS) algorithm (Friedman, 1990) with Holland's Genetic Algorithm (Holland, 1975). G/SPLINES has advantages over MARS in that it requires fewer least-squares computations, is easily extendable to non-spline basis functions, may discover models inaccessible to local-variable selection algorithms, and allows significantly larger problems to be considered. These issues are discussed in (Rogers, 1991). \nThis paper begins with a discussion of linear regression models, followed by a description of the G/SPLINES algorithm, and finishes with a series of experiments illustrating its performance, robustness, and analysis capabilities. \n\n* Currently at Polygen/Molecular Simulations, Inc., 796 N. Pastoria Ave., Sunnyvale, CA 94086, INTERNET: drogers@msi.com. \n\n1088 \n\n2 LINEAR MODELS \nA common assumption used in data modeling is that the data samples are derived from an underlying function: \n\ny_i = f(X_i) + error_i = f(x_i1, ..., x_in) + error_i \n\nThe goal of analysis is to develop a model F(X) which minimizes the least-squares error: \n\nLSE(F) = (1/N) * SUM_{i=1..N} (y_i - F(X_i))^2 \n\nThe function F(X) can then be used to estimate the underlying function f at previously seen data samples (recall) or at new data samples (prediction). Samples used to construct the function F(X) are in the training set; samples used to test prediction are in the test set. \nIn constructing F(X), if we assume the model F can be written as a linear combination of basis functions {phi_k}: \n\nF(X) = a_0 + SUM_{k=1..M} a_k * phi_k(X) \n\nthen standard least-squares regression can find the optimal coefficients {a_k}. However, selecting an appropriate set of basis functions for high-dimensional models can be difficult. G/SPLINES is primarily a method for selecting this set. \n\n3 G/SPLINES \nMany techniques develop a regression model by incremental addition or deletion of basis functions to a single model. The primary idea of G/SPLINES is to keep a collection of models, and use the genetic algorithm to recombine among these models. \nG/SPLINES begins with a collection of models containing randomly-generated basis functions: \n\nF_1: {phi_1 phi_2 phi_3 ... phi_14} \nF_2: {theta_1 theta_2 theta_3 ... theta_11} \n... \nF_K: {psi_1 psi_2 psi_3 ... psi_12} \n\nThe basis functions are functions which use a small number of the variables in the data set, such as SIN(X_2 - 1) or (X_4 - A)(X_5 - .1). The model coefficients {a_k} are determined using least-squares regression. \nEach model is scored using Friedman's \"lack of fit\" (LOF) measure, which is a penalized least-squares measure for goodness of fit; this measure takes into account factors such as the number of data samples, the least-squares error, and the number of model parameters. 
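The fit-then-score step above can be sketched in a few lines of Python. The GCV-style form of the penalty and the value of the smoothing parameter d below are assumptions modeled on Friedman's MARS criterion, not details taken from this paper:

```python
import numpy as np

def fit_and_score(basis, X, y, d=3.0):
    """Fit coefficients {a_k} by least squares over a set of basis
    functions and return a GCV-style penalized lack-of-fit score.
    `basis` is a list of callables phi_k(X) -> (N,) arrays; the
    penalty form and d are assumptions modeled on MARS."""
    N = len(y)
    # Design matrix: constant term a_0 plus one column per basis function.
    B = np.column_stack([np.ones(N)] + [phi(X) for phi in basis])
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    lse = np.mean((y - B @ coef) ** 2)
    # Effective parameter count grows with model size M.
    M = len(basis)
    C = (M + 1) + d * M
    lof = lse / (1.0 - C / N) ** 2
    return coef, lse, lof
```

A lower LOF is better; the penalty inflates the raw least-squares error as the model grows relative to the number of training samples.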
\n\nAt this point, we repeatedly perform the genetic crossover operation: \n\n\u2022 Two good models are probabilistically selected as \"parents\". The likelihood of being chosen is inversely proportional to a model's LOF score. \n\u2022 Each parent is randomly \"cut\" into two sections, and a new model is created using a piece from each parent. \n\u2022 Optional mutation operators may alter the newly-created model. \n\u2022 The model with the worst LOF score is replaced by this new model. \n\nThis process ends when the average fitness of the population stops improving. \nSome features of the G/SPLINES algorithm are significantly different from MARS: \nUnlike incremental search, full-sized models are tested at every step. \nThe algorithm automatically determines the proper size for models. \nMany fewer models are tested than with MARS. \nA population of models offers information not available from single-model methods. \n\n4 MUTATION OPERATORS \nAdditional mutation operators were added to the system to counteract some negative tendencies of a purely crossover-based algorithm. \n\nProblem: genetic diversity is reduced as the process proceeds (fewer distinct basis functions in the population). \nNEW: creates a new basis function by randomly choosing a basis function type and then randomly filling in its parameters. \n\nProblem: a process is needed for constructing useful multidimensional basis functions. \nMERGE: takes a random basis function from each parent, and creates a new basis function by multiplying them together. \n\nProblem: models contain \"hitchhiking\" basis functions which contribute little. \nDELETION: ranks the basis functions in order of maximum contribution to the approximation, and removes one or more of the least-contributing basis functions. 
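A minimal sketch of the crossover and mutation operators, representing a model simply as a Python list of basis-function callables. The list representation, the closure used for MERGE, and the exact DELETION ranking criterion are illustrative assumptions, not the paper's implementation:

```python
import random
import numpy as np

def crossover(parent_a, parent_b):
    """Cut each parent's basis-function list at a random point and
    splice a piece of each into a new model."""
    cut_a = random.randint(0, len(parent_a))
    cut_b = random.randint(0, len(parent_b))
    return parent_a[:cut_a] + parent_b[cut_b:]

def merge_mutation(parent_a, parent_b):
    """MERGE: multiply one random basis function from each parent
    into a new multidimensional basis function."""
    f, g = random.choice(parent_a), random.choice(parent_b)
    return lambda X: f(X) * g(X)

def deletion_mutation(basis, coef, X, keep_frac=0.9):
    """DELETION: rank basis functions by their maximum contribution
    |a_k * phi_k(x)| over the training set (an assumed reading of the
    paper's criterion) and drop the least-contributing ones."""
    contrib = [np.max(np.abs(a * phi(X))) for a, phi in zip(coef, basis)]
    order = np.argsort(contrib)  # least-contributing first
    n_drop = max(1, len(basis) - int(keep_frac * len(basis)))
    drop = set(order[:n_drop])
    return [phi for k, phi in enumerate(basis) if k not in drop]
```

After each crossover, the child can be refit by least squares and scored with the LOF measure before it replaces the worst model in the population.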
\n\n5 EXPERIMENTAL \nExperiments were conducted on data derived from a function used by Friedman (1988): \n\nf(X) = SIN(pi * X_1 * X_2) + 20 * (X_3 - 1/2)^2 + 10 * X_4 + 5 * X_5 \n\nStandard experimental conditions are as follows. Experiments used a training set containing 200 samples, and a test set containing 200 samples. Each sample contained 10 predictor variables (5 informative, 5 noninformative) and a response. Sample points were randomly selected from within the unit hypercube. The signal/noise ratio was 4.8/1.0. \nThe G/SPLINES population consisted of 100 models. Linear truncated-power splines were used as basis functions. After each crossover, a model had a 50% chance of getting a new basis function created by operator NEW or MERGE and the least-contributing 10% of its basis functions deleted using operator DELETION. \nThe standard training phase involved 10,000 crossover operations. After training, the models were tested against a set of 200 previously-unseen test samples. \n\n5.1 G/SPLINES VS. MARS \nQuestion: is G/SPLINES competitive with MARS? \n\nFigure 1. Test least-squares scores versus number of least-squares regressions for G/SPLINES and MARS. \n\nThe MARS algorithm was close to convergence after 50,000 least-squares regressions, and showed no further improvement after 80,000. The G/SPLINES algorithm was close to convergence after 4,000 least-squares regressions, and showed no further improvement after 10,000. [Note: the number of least-squares regressions is not a direct measure of the computational efficiency of the algorithms, as MARS uses a technique (applicable only to linear truncated-power splines) to greatly reduce the cost of doing least-squares regression.] 
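The experimental data described above can be generated with a short sketch. The absolute noise level below is an assumption, since the paper specifies a 4.8/1.0 signal/noise ratio rather than a noise standard deviation:

```python
import numpy as np

def friedman_data(n_samples=200, n_vars=10, noise_sd=1.0, seed=0):
    """Generate samples from the Friedman-style test function used in
    the experiments: predictors drawn uniformly from the unit
    hypercube, with only the first five informative."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_samples, n_vars))
    f = (np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3]
         + 5 * X[:, 4])
    y = f + rng.normal(scale=noise_sd, size=n_samples)
    return X, y
```

Drawing one 200-sample set for training and a second for testing reproduces the standard conditions used throughout Section 5.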
\nTo complete the comparison, we need results on the quality of the discovered models: \n\nFinal average least-squares error of the best 4 G/SPLINES models: ~1.17 \nFinal least-squares error of the MARS model: ~1.12 \nLeast-squares error of the \"best\" model (from the added noise alone): ~1.08 \n\nUsing only linear truncated-power splines, G/SPLINES builds models comparable (though slightly inferior) to MARS. However, by using basis functions other than linear truncated-power splines, G/SPLINES can build improved models. If we repeat the experiment with additional basis function types of step functions, linear splines, and quadratic splines, we get improved results: \n\nWith additional basis functions, the final average least-squares error was: ~1.095 \n\nI suggest that by including basis functions which reflect the underlying structure of f, the quality of the discovered models is improved. \n\n5.2 VARIABLE ELIMINATION \nQuestion: does variable usage in the population reflect the underlying function? (Recall that the data samples contained 10 variables; only the first 5 were used to calculate f.) \n\nFigure 2. # of basis functions using a variable vs. # of crossover operations. \n\nG/SPLINES correctly focuses on basis functions which use the first five variables. The relative usage of these five variables reflects the complexity of the relationship between an input variable and the response in a given dimension. \nQuestion: is the rate of elimination of variables affected by sample size? 
\n\nFigure 3. Close-up of Figure 2, showing the five variables not affecting the response. The left graph is the standard experiment; the right is from a training with 50 samples. \n\nThe left graph plots the number of basis functions containing a variable versus the number of genetic operations for the five noninformative variables in the standard experiment. The variables are slowly eliminated from consideration. The right graph plots the same information, using a training set size of 50 samples. The variables are rapidly eliminated. \nSmaller training sets force the algorithm to work with the most predictive variables, causing a faster elimination of less predictive variables. \nQuestion: is variable elimination effective with increased numbers of noninformative variables? \nThis experiment used the standard conditions but increased the number of predictor variables in the training and test sets to 100 (5 informative, 95 noninformative). \n\nFigure 4. Number of basis functions which used a variable vs. variable index, after 10,000 genetic operations. \n\nFigure 4 shows that elimination behavior was still apparent in this high-dimensional data set. The five informative variables were the first five in order of use. \n\n5.3 MODEL SIZE \nQuestion: What is the effect of the genetic algorithm on model size? \n\nFigure 5. Model scores on training set and average function length. \n\nThe left graph plots the best and average LOF score for the training set versus the number of genetic operations. The right graph plots the average number of basis functions in a model versus the number of genetic operations. \nEven after the LOF error is minimized, the average model length continues to decrease. This is likely due to pressure from the genetic algorithm; a compact representation is more likely to survive the crossover operation without loss. (In fact, due to the nature of the LOF function, the least-squares errors of the best models are slightly increased by this procedure. The system considers the increase a fair trade-off for smaller model size.) \n\n5.4 RESISTANCE TO OVERFITTING \nQuestion: Does Friedman's LOF function resist overfitting with small training sets? \n\nTraining was conducted with data sets of two sizes: 200 and 50. \n\nFigure 6. LS error vs. # of operations for training with 200 and 50 samples. \n\nThe left graph in Figure 6 plots the population average least-squares error for the training set and the test set versus the number of genetic operations, using a training set size of 200 samples. The right graph plots the same information, but for a system using a training set size of 50 samples. 
\nIn both cases, little overfitting is seen, even when the algorithm is allowed to run long after the point where improvement ceases. Training with a small number of samples still leads to models which resist overfitting. \nQuestion: What is the effect of additive noise on overfitting? \n\nFigure 7. LS error vs. # of operations for low and high noise data sets. \n\nTraining was conducted with training sets having a signal/noise ratio of 1.0/1.0. The left graph plots the least-squares error for the training and test set versus the number of genetic operations. The right graph plots the same information, but with a higher setting of Friedman's smoothing parameter. \nNoisy data results in a higher risk of overfitting. However, this can be accommodated if we set a higher value for Friedman's smoothing parameter. \n\n5.5 ADDITIONAL BASIS FUNCTION TYPES AND TRAINING SET SIZES \nQuestion: What is the effect of changes in training set size on the type of basis functions selected? \nThe experiment in Figure 8 used the standard conditions, but with many additional basis function types. The left graph plots the use of different types of basis functions using a training set of size 50. The right graph plots the same information using a training set size of 200. Simply put, different training set sizes lead to significant changes in preferences among function types. A detailed analysis of these graphs can give insight into the nature of the data and the best components for model construction. \n\nFigure 8. # of basis functions of a given type vs. # of genetic operations, for training sets of 50 and 200 samples. \n\n6 CONCLUSIONS \nG/SPLINES is a new algorithm related to state-of-the-art statistical modeling techniques such as MARS. The strengths of this algorithm are that G/SPLINES builds models that are comparable in quality to MARS, with a greatly reduced number of intermediate model constructions; is capable of building models from data sets that are too large for the MARS algorithm; and is easily extendable to basis functions that are not spline-based. \nWeaknesses of this algorithm include the ad-hoc nature of the mutation operators; the lack of studies of the real-time performance of G/SPLINES vs. other model builders such as MARS; the need for theoretical analysis of the algorithm's convergence behavior; and the need to change the LOF function to reflect additional basis function types. \nThe WOLF program source code, which implements G/SPLINES, is available free to other researchers in either Macintosh or UNIX/C formats. Contact the author (drogers@riacs.edu) for information. \n\nAcknowledgments \nThis work was supported in part by Cooperative Agreements NCC 2-387 and NCC 2-408 between the National Aeronautics and Space Administration (NASA) and the Universities Space Research Association (USRA). Special thanks to my domestic partner Doug Brockman, who shared my enthusiasm even though he didn't know what the hell I was up to; and my father, Philip, who made me want to become a scientist. 
\n\nReferences \nFriedman, J., \"Multivariate Adaptive Regression Splines,\" Technical Report No. 102, \nLaboratory for Computational Statistics, Department of Statistics, Stanford University, \nNovember 1988 (revised August 1990). \nHolland, J., Adaptation in Artificial and Natural Systems, University of Michigan Press, \nAnn Arbor, MI, 1975. \nRogers, David, \"G/SPLINES: A Hybrid of Friedman's Multivariate Adaptive Splines \n(MARS) Algorithm with Holland's Genetic Algorithm,\" in Proceedings of the Fourth \nInternational Conference on Genetic Algorithms, San Diego, July, 1991. \n\n\f", "award": [], "sourceid": 572, "authors": [{"given_name": "David", "family_name": "Rogers", "institution": null}]}