{"title": "A Comparison between Neural Networks and other Statistical Techniques for Modeling the Relationship between Tobacco and Alcohol and Cancer", "book": "Advances in Neural Information Processing Systems", "page_first": 967, "page_last": 973, "abstract": null, "full_text": "A comparison between neural networks \nand other statistical techniques for \nmodeling the relationship between \ntobacco and alcohol and cancer \n\nTony Plate \nBC Cancer Agency \n601 West 10th Ave, Epidemiology \nVancouver BC Canada V5Z 1L3 \ntap@comp.vuw.ac.nz \n\nPierre Band \nBC Cancer Agency \n601 West 10th Ave, Epidemiology \nVancouver BC Canada V5Z 1L3 \n\nJoel Bert \nDept of Chemical Engineering \nUniversity of British Columbia \n2216 Main Mall \nVancouver BC Canada V6T 1Z4 \n\nJohn Grace \nDept of Chemical Engineering \nUniversity of British Columbia \n2216 Main Mall \nVancouver BC Canada V6T 1Z4 \n\nAbstract \n\nEpidemiological data is traditionally analyzed with very simple techniques. Flexible models, such as neural networks, have the potential to discover unanticipated features in the data. However, to be useful, flexible models must have effective control on overfitting. This paper reports on a comparative study of the predictive quality of neural networks and other flexible models applied to real and artificial epidemiological data. The results suggest that there are no major unanticipated complex features in the real data, and also demonstrate that MacKay's [1995] Bayesian neural network methodology provides effective control on overfitting while retaining the ability to discover complex features in the artificial data. \n\n1 Introduction \n\nTraditionally, very simple statistical techniques are used in the analysis of epidemiological studies. The predominant technique is logistic regression, in which the effects of predictors are linear (or categorical) and additive on the log-odds scale. 
\n\f968 \n\nT. Plate, P. Band, J. Bert and J. Grace \n\nAn important virtue of logistic regression is that the relationships identified in the data can be interpreted and explained in simple terms, such as \"the odds of developing lung cancer for males who smoke between 20 and 29 cigarettes per day are increased by a factor of 11.5 over males who do not smoke\". However, because of their simplicity, it is difficult to use these models to discover unanticipated complex relationships, i.e., non-linearities in the effect of a predictor or interactions between predictors. Interactions and non-linearities can of course be introduced into logistic regressions, but they must be pre-specified, which tends to be impractical unless there are only a few variables or there are a priori reasons to test for particular effects. \nNeural networks have the potential to automatically discover complex relationships. There has been much interest in using neural networks in biomedical applications; witness the recent series of articles in The Lancet, e.g., Wyatt [1995] and Baxt [1995]. However, there are not yet sufficient comparisons or theory to come to firm conclusions about the utility of neural networks in biomedical data analysis. To date, comparison studies, e.g., those by Michie, Spiegelhalter, and Taylor [1994], Burke, Rosen, and Goodman [1995], and Lippmann, Lee, and Shahian [1995], have had mixed results, and Jefferson et al.'s [1995] complaint that many \"successful\" applications of neural networks are not compared against standard techniques appears to be justified. The intent of this paper is to contribute to the body of useful comparisons by reporting a study of various neural-network and statistical modeling techniques applied to an epidemiological data analysis problem. 
\n2 The data \n\nThe original data set consisted of information on 15,463 subjects from a study conducted by the Division of Epidemiology and Cancer Prevention at the BC Cancer Agency. In this study, a detailed questionnaire reported personal information, lifetime tobacco and alcohol use, and lifetime employment history for each subject. The subjects were cancer patients in BC with diagnosis dates between 1983 and 1989, as ascertained by the population-based registry at the BC Cancer Agency. Six different tobacco and alcohol habits were included: cigarette (C), cigar (G), and pipe (P) smoking, and beer (B), wine (W), and spirit (S) drinking. The models reported in this paper used up to 27 predictor variables: age at first diagnosis (AGE), and 26 variables related to alcohol and tobacco consumption. These included four variables for each habit: total years of consumption (CYR etc.), consumption per day or week (CDAY, BWK etc.), years since quitting (CYQUIT etc.), and a binary variable indicating any indulgence (CSMOKE, BDRINK etc.). The remaining two binary variables indicated whether the subject had ever smoked tobacco or drunk alcohol. All the binary variables were non-linear (threshold) transforms of the other variables. Variables not applicable to a particular subject were set to zero, e.g., number of years of smoking for a non-smoker, or years since quitting for a smoker who did not quit. \nOf the 15,463 records, 5,901 had missing information in some of the fields related to tobacco or alcohol use. These were not used, as there are no simple methods for dealing with missing data in neural networks. Of the 9,562 complete records, a randomly selected 3,195 were set aside for testing, leaving 6,367 complete records to be used in the modeling experiments. \nThere were 28 binary outcomes: the 28 sites at which a subject could have cancer (subjects had cancers at up to 3 different sites). 
The number of cases for each site varied, e.g., for LUNGSQ (Lung Squamous) there were 694 cases among the complete records, for ORAL (Oral Cavity and Pharynx) 306, and for MEL (Melanoma) 464. \nAll sites were modeled individually using carefully selected subjects as controls. This is common practice in cancer epidemiology studies, due to the difficulty of collecting an unbiased sample of non-cancer subjects for controls. \n\n\fNeural Networks in Cancer Epidemiology \n\n969 \n\nSubjects with cancers at a site suspected of being related to tobacco usage were not used as controls. This eliminated subjects with any sites other than COLON, RECTUM, MEL (Melanoma), NMSK (Non-melanoma skin), PROS (Prostate), NHL (Non-Hodgkin's lymphoma), and MMY (Multiple Myeloma), and resulted in between 2959 and 3694 controls for each site. For example, the model for LUNGSQ (lung squamous cell) cancer was fitted using subjects with LUNGSQ as the positive outcomes (694 cases), and subjects all of whose sites were among COLON, RECTUM, MEL, NMSK, PROS, NHL, or MMY as negative outcomes (3694 controls). \n\n3 Statistical methods \n\nA number of different types of statistical methods were used to model the data. These ranged from the non-flexible (logistic regression) through partially flexible (Generalized Additive Models, or GAMs) to completely flexible (classification trees and neural networks). Each site was modeled independently, using the log likelihood of the data under the binomial distribution as the fitting criterion. All of the modeling, except for the neural networks and ridge regression, was done using the S-plus statistical software package [StatSci 1995]. \nFor several methods, we used Breiman's [1996] bagging technique to control overfitting. To \"bag\" a model, one fits a set of models independently on bootstrap samples. The bagged prediction is then the average of the predictions of the models in the set. 
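The bagging procedure just described can be sketched as follows; this is a minimal illustration with a generic `fit` function, not the S-plus implementation used in the study.

```python
import random

def bag(fit, X, y, n_models=50, seed=0):
    """Fit n_models copies of a model on bootstrap samples of (X, y)
    and return a predictor that averages their predictions."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample (with replacement)
        models.append(fit([X[i] for i in idx], [y[i] for i in idx]))

    def predict(x):
        # The bagged prediction is the average of the individual predictions.
        return sum(m(x) for m in models) / n_models

    return predict
```

For example, `bag` can be used with any `fit(X, y)` that returns a callable model, such as a stepwise regression, a pruned tree, or a neural network trainer.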
Breiman suggests that bagging will give superior predictions for unstable models (such as stepwise selection, pruned trees, and neural networks). \nPreliminary analysis revealed that the predictive power of non-flexible models could be improved by including non-linear transforms of some variables, namely AGESQ and the binary indicator variables SMOKE, DRINK, CSMOKE, etc. Flexible models should be able to discover useful non-linear transforms for themselves, and so these derived variables were not included in the flexible models. In order to allow comparisons to test this, one of the non-flexible models (ONLYLIN-STEP) also did not use any of these derived variables. \nNull model: (NULL) The predictions of the null model are just the frequency of the outcome in the training set. \nLogistic regression: The FULL model used the full set of predictor variables, including a quadratic term for age: AGESQ. \nStepwise logistic regression: A number of stepwise regressions were fitted, differing in the set of variables considered. Outcome-balanced 10-fold cross-validation was used to select the model size giving best generalization. The models were as follows: AGE-STEP (AGE and AGESQ); CYR-AGE-STEP (CYR, AGE and AGESQ); ALC-CYR-AGE-STEP (all alcohol variables, CYR, AGE and AGESQ); FULL-STEP (all variables including AGESQ); and ONLYLIN-STEP (all variables except for the derived binary indicator variables SMOKE, CSMOKE, etc., and only a linear AGE term). \nRidge regression: (RIDGE) Ridge regression penalizes a logistic regression model by the sum of the squared parameter values in order to control overfitting. The evidence framework [MacKay 1995] was used to select seven shrinkage parameters: one for each of the six habits, and one for SMOKE, DRINK, AGE and AGESQ. \nGeneralized Additive Models: GAMs [Hastie and Tibshirani 1990] fit a smoothing spline to each parameter. 
GAMs can model non-linearities, but not interactions. A stepwise procedure was used to select the degree (0, 1, 2, or 4) of the smoothing spline for each parameter. The procedure started with a model having a smoothing spline of degree 2 for each parameter, and stopped when the AIC statistic could not be reduced any further. Two stepwise GAM models were fitted: GAM-FULL used the full set of variables, while GAM-CIG used the cigarette variables and AGE. \nClassification trees: [Breiman et al. 1984] The same cross-validation procedure as used with stepwise regression was used to select the best size for TREE, using the implementation in S-plus, and the function shrink.tree() for pruning. A bagged version with 50 replications, TREE-BAGGED, was also used. After constructing a tree for the data in a replication, it was pruned to perform optimally on the training data not included in that replication. \nOrdinary neural networks: The neural network models had a single hidden layer of tanh functions and a small weight penalty (0.01) to prevent parameters going to infinity. A conjugate-gradient procedure was used to optimize weights. For the NN-ORD-H2 model, which had no control on complexity, a network with two hidden units was trained three times from different small random starting weights. Of these three, the one with the best performance on the training data was selected as \"the model\". The NN-ORD-HCV model used a common method for controlling overfitting in neural networks: 10-fold CV for selecting the optimal number of hidden units. Three random starting points for each partition were used to calculate the average generalization error for networks with one, two, and three hidden units. Three networks with the best number of hidden units were trained on the entire set of training data, and the network having the lowest training error was chosen. 
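The cross-validated model-size selection used for NN-ORD-HCV (and for the stepwise and tree models) can be sketched generically as follows. This is a simplified illustration: the paper's folds were outcome-balanced and each partition used three random restarts, both omitted here.

```python
import random

def cv_select(train_and_score, X, y, sizes=(1, 2, 3), k=10, seed=0):
    """Pick the model size with the best average validation score
    across k folds. train_and_score(size, train_idx, val_idx) must
    return a generalization score (higher is better)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint validation folds

    def mean_score(size):
        scores = []
        for fold in folds:
            val = set(fold)
            train = [i for i in idx if i not in val]
            scores.append(train_and_score(size, train, sorted(val)))
        return sum(scores) / k

    return max(sizes, key=mean_score)
```

After `cv_select` picks a size, the final model is refit on the whole training set at that size, as the paper describes.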
\nBagged neural networks with early stopping: Bagging and early stopping (terminating training before reaching a minimum of training-set error in order to prevent overfitting) work naturally together. The training examples omitted from each bootstrap replication provide a validation set to decide when to stop, and with early stopping, training is fast enough to make bagging practical. 100 networks with two hidden units were trained on separate bootstrap replications, and the best 50 (by their performance on the omitted examples) were included in the final bagged model, NN-ESTOP-BAGGED. For comparison purposes, the mean individual performance of these early-stopped networks is reported as NN-ESTOP-AVG. \nNeural networks with Bayesian regularization: MacKay's [1995] Bayesian evidence framework was used to control overfitting in neural networks. Three random starts for networks with 1, 2, 3 or 4 hidden units and three different sets of regularization (penalty) parameters were used, giving a total of 36 networks for each site. The three possibilities for regularization parameters were: (a) three penalty parameters, one for each of the input-to-hidden, bias-to-hidden, and hidden-to-output weights; (b) partial Automatic Relevance Determination (ARD) [MacKay 1995] with seven penalty parameters controlling the input-to-hidden weights, one for each habit and one for AGE; and (c) full ARD, with one penalty parameter for each of the 19 inputs. The \"evidence\" for each network was evaluated and the best 18 networks were selected for the equally-weighted committee model NN-BAYES-CMTT. NN-BAYES-BEST was the single network with the maximum evidence. \n\n4 Results and Discussion \n\nModels were compared based on their performance on the held-out test data, so as to avoid overfitting bias in evaluation. While there are several ways to measure performance, e.g.
, 0-1 classification error, or area under the ROC curve (as in Burke, Rosen and Goodman [1995]), we used the test-set deviance, as it seems appropriate to compare models using the same criterion as was used for fitting. Reporting performance is complicated by the fact that there were 28 different modeling tasks (i.e., sites), and some models did better on some sites and worse on others. We report some overall performance figures and some pairwise comparisons of models. \n\n\fNeural Networks in Cancer Epidemiology \n\n971 \n\n[Figure 1 plot: one column per outcome (ALL, 6618/3299 positive outcomes; MEL, 322/142), one row per model from NULL to NN-BAYES-CMTT; horizontal axis from -10 to 20 percent.] \n\nFigure 1: Percent improvement in deviance on test data over the null model. \n\nFigure 1 shows aggregate deviances across sites (i.e., the sum of the test deviance for one model over the 28 sites) and deviances for selected sites. The horizontal scale in each column indicates the percentage reduction in deviance over the null model. Zero percent (the dotted line) is the same performance as the null model, and 100% would be perfect predictions. Numbers below the column labels are the number of positive outcomes in the training and test sets, respectively. 
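The evaluation measure can be made concrete with a short sketch. The formula below is the standard binomial deviance (minus twice the log-likelihood), which the paper's fitting criterion implies but does not spell out, together with the percent-improvement-over-null scaling used in Figure 1.

```python
import math

def deviance(y, p, eps=1e-12):
    """Binomial deviance: -2 * log-likelihood of binary outcomes y
    under predicted probabilities p."""
    return -2 * sum(
        yi * math.log(max(pi, eps)) + (1 - yi) * math.log(max(1 - pi, eps))
        for yi, pi in zip(y, p)
    )

def pct_improvement(y, p):
    """Percent reduction in deviance relative to the null model,
    which predicts the outcome frequency for every subject."""
    base = sum(y) / len(y)
    d_null = deviance(y, [base] * len(y))
    return 100 * (1 - deviance(y, p) / d_null)
```

Zero corresponds to null-model performance and 100 to perfect prediction, exactly as on the horizontal axes of Figure 1.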
The best predictions for LUNGSQ can reduce the null deviance by just over 25%. It is interesting to note that much of the information is contained in AGE and CYR: the CYR-AGE-STEP model achieved a 7.1% reduction in overall deviance, while the maximum reduction (achieved by NN-BAYES-CMTT) was only 8.3%. \nThere is no single threshold at which differences in test-set deviance are \"significant\", because of strong correlations between the predictions of different models. However, the general patterns of superiority apparent in Figure 1 were repeated across the other sites, and various other tests indicate they are reliable indicators of general performance. For example, the best five models, both in terms of aggregate deviance across all sites and median rank of performance on individual sites, were, in order, NN-BAYES-CMTT, RIDGE, NN-ESTOP-BAGGED, GAM-CIG, and FULL-STEP. The ONLYLIN-STEP model ranked sixth in median rank, and tenth in aggregate deviance. \nAlthough the differences between the best flexible models and the logistic models were slight, they were consistent. For example, NN-BAYES-CMTT did better than FULL-STEP on 21 sites, and better than ONLYLIN-STEP on 23 sites, while FULL-STEP drew with ONLYLIN-STEP on 14 sites and did better on 9. If the models had no effective difference, there was only a 1.25% chance of one model doing better than the other 21 or more times out of 28. Individual measures of performance were also consistent with these findings. For example, for LUNGSQ a bootstrap test of test-set deviance revealed that the predictions of NN-BAYES-CMTT were on average better than those of ONLYLIN-STEP in 99.82% of resampled test sets (out of 10,000), while the predictions of NN-BAYES-CMTT beat FULL-STEP in 93.75% of replications and FULL-STEP beat ONLYLIN-STEP in 98.48% of replications. 
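The 1.25% figure is consistent with a two-sided sign test: under the null hypothesis that neither model is better, the number of wins is Binomial(28, 0.5), and the two-sided probability of one model winning 21 or more of the 28 comparisons is about 1.25%. A quick check:

```python
from math import comb

# Two-sided sign test: P(X >= 21 or X <= 7) for X ~ Binomial(28, 0.5).
n, k = 28, 21
tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
print(round(2 * tail * 100, 2))  # prints 1.25 (percent)
```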
\nThese results demonstrate that good control on overfitting is essential for this task. Ordinary neural networks with no control on overfitting do worse than guessing (i.e., the null model). Even when the number of hidden units is chosen by cross-validation, the performance is still worse than a simple two-variable stepwise logistic regression (CYR-AGE-STEP). The inadequacy of the simple AIC-based stepwise procedure for choosing the complexity of GAMs is illustrated by the poor performance of the GAM-FULL model (the more restricted GAM-CIG model does quite well). \nThe effective methods for controlling overfitting were bagging and Bayesian regularization. Bagging improved the performance of trees and early-stopped neural networks to good levels. Bayesian regularization worked very well with neural networks and with ridge regression. Furthermore, examination of the performance of individual networks indicates that networks with fine-grained ARD were frequently superior to those with coarser control on regularization. \n\n5 Artificial sites with complex relationships \n\nThe very minor improvement achieved by neural networks and trees over logistic models provokes the following question: are complex relationships really relatively unimportant in this data, or is the strong control on overfitting preventing identification of complex relationships? In order to answer this question, we created six artificial \"sites\" for the subjects. These were designed to have very similar properties to the real sites, while possessing non-linear effects and interactions. 
\nThe risk models for the artificial sites possessed an underlying trend equal to half that of a good logistic model for LUNGSQ, plus one of three more complex effects: FREQ, a frequent non-linear (threshold) effect (BWK > 1) affecting 4,334 of the 9,562 subjects; RARE, a rare threshold effect (BWK > 10), affecting 1,550 subjects; and INTER, an interaction (BYR · GYR) affecting 482 subjects. For three of the artificial sites the complex effect was weak (LO), and for the other three it was strong (HI). For each subject and each artificial site, a random choice as to whether that subject was a positive case for that site was made, based on the probability given by the model for the artificial site. Models were fitted to these sites in the same way as to the other sites, and only subjects without cancer at a smoking-related site were used as controls. \n\n[Figure 2 plot: one column per artificial site (FREQ-LO, FREQ-HI, RARE-LO, RARE-HI, INTER-LO, INTER-HI), one row per model (NULL, FREQ-TRUE, RARE-TRUE, INTER-TRUE, ONLYLIN-STEP, FULL-STEP, TREE-BAGGED, NN-ESTOP-BAGGED, NN-BAYES-CMTT); horizontal axis from 0 to 60 percent.] \n\nFigure 2: Percent improvement in deviance on test data for the artificial sites. \n\nFor comparison purposes, logistic models containing the true set of variables, including non-linearities and interactions, were fitted to the artificial data. 
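The construction of the artificial sites described above can be sketched as follows. The paper specifies only the form of the risk model (an underlying trend plus one complex effect on the log-odds scale), so the trend function and the coefficient `beta` below are illustrative placeholders.

```python
import math
import random

def simulate_site(subjects, base_logodds, effect, beta, seed=0):
    """Draw a positive/negative outcome for each subject, with the
    probability given by a logistic risk model: an underlying trend
    plus one complex (threshold or interaction) effect."""
    rng = random.Random(seed)
    outcomes = []
    for s in subjects:
        logit = base_logodds(s) + beta * effect(s)
        p = 1.0 / (1.0 + math.exp(-logit))
        outcomes.append(1 if rng.random() < p else 0)
    return outcomes

# RARE-style effect: a threshold on beer consumption per week.
rare = lambda s: 1.0 if s["BWK"] > 10 else 0.0
# INTER-style effect: interaction of beer-years and cigar-years.
inter = lambda s: s["BYR"] * s["GYR"]
```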
For example, the model RARE-TRUE contained the continuous variables AGE, AGESQ, CDAY, CYR, and CYQUIT, and the binary variables SMOKE and BWK > 10. \nFigure 2 shows performance on the artificial data. The neural networks and bagged trees were very effective at detecting non-linearities and interactions. Their performance was at the same level as that of the appropriate true models, while the performance of simple models lacking the ability to fit the complexities (e.g., FULL-STEP) was considerably worse. \n\n6 Conclusions \n\nFor predicting the risk of cancer in our data, neural networks with Bayesian estimation of regularization parameters to control overfitting performed consistently but only slightly better than logistic regression models. This appeared to be due to the lack of complex relationships in the data: on artificial data with complex relationships they performed markedly better than logistic models. Good control of overfitting is essential for this task, as shown by the poor performance of neural networks with the number of hidden units chosen by cross-validation. \nGiven their ability to not overfit while still identifying complex relationships, we expect that neural networks could prove useful in epidemiological data analysis by providing a method for checking that a simple statistical model is not missing important complex relationships. \n\nAcknowledgments \n\nThis research was funded by grants from the Workers Compensation Board of British Columbia, NSERC, and IRIS, and conducted at the BC Cancer Agency. \n\nReferences \n\nBaxt, W. G. 1995. Application of artificial neural networks to clinical medicine. The Lancet, 346:1135-1138. \nBreiman, L. 1996. Bagging predictors. Machine Learning, 26(2):123-140. \nBreiman, L., Friedman, J., Olshen, R., and Stone, C. 1984. Classification and Regression Trees. Wadsworth, Belmont, CA. 
\nBurke, H., Rosen, D., and Goodman, P. 1995. Comparing the prediction accuracy of artificial neural networks and other statistical methods for breast cancer survival. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, pages 1063-1067, Cambridge, MA. MIT Press. \nHastie, T. J. and Tibshirani, R. J. 1990. Generalized Additive Models. Chapman and Hall, London. \nJefferson, M. F., Pendleton, N., Lucas, S., and Horan, M. A. 1995. Neural networks (letter). The Lancet, 346:1712. \nLippmann, R., Lee, Y., and Shahian, D. 1995. Predicting the risk of complications in coronary artery bypass operations using neural networks. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, pages 1055-1062, Cambridge, MA. MIT Press. \nMacKay, D. J. C. 1995. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6:469-505. \nMichie, D., Spiegelhalter, D., and Taylor, C. 1994. Machine Learning, Neural and Statistical Classification. Ellis Horwood, Hertfordshire, UK. \nStatSci 1995. S-Plus Guide to Statistical and Mathematical Analyses, Version 3.3. StatSci, a division of MathSoft, Inc, Seattle. \nWyatt, J. 1995. Nervous about artificial neural networks? (commentary). The Lancet, 346:1175-1177. \n", "award": [], "sourceid": 1252, "authors": [{"given_name": "Tony", "family_name": "Plate", "institution": null}, {"given_name": "Pierre", "family_name": "Band", "institution": null}, {"given_name": "Joel", "family_name": "Bert", "institution": null}, {"given_name": "John", "family_name": "Grace", "institution": null}]}