{"title": "Predicting the Risk of Complications in Coronary Artery Bypass Operations using Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1055, "page_last": 1062, "abstract": null, "full_text": "Predicting the Risk of Complications in Coronary Artery Bypass Operations using Neural Networks \n\nRichard P. Lippmann, Linda Kukolich \nMIT Lincoln Laboratory \n244 Wood Street \nLexington, MA 02173-0073 \n\nDr. David Shahian \nLahey Clinic \nBurlington, MA 01805 \n\nAbstract \n\nExperiments demonstrated that sigmoid multilayer perceptron (MLP) networks provide slightly better risk prediction than conventional logistic regression when used to predict the risk of death, stroke, and renal failure on 1257 patients who underwent coronary artery bypass operations at the Lahey Clinic. MLP networks with no hidden layer and networks with one hidden layer were trained using stochastic gradient descent with early stopping. MLP networks and logistic regression used the same input features and were evaluated using bootstrap sampling with 50 replications. ROC areas for predicting mortality using preoperative input features were 70.5% for logistic regression and 76.0% for MLP networks. Regularization provided by early stopping was an important component of improved performance. A simplified approach to generating confidence intervals for MLP risk predictions using an auxiliary \"confidence MLP\" was developed. The confidence MLP is trained to reproduce confidence intervals that were generated during training using the outputs of 50 MLP networks trained with different bootstrap samples. \n\n1 INTRODUCTION \n\nIn 1992 there were roughly 300,000 coronary artery bypass operations performed in the United States at a cost of roughly $44,000 per operation. The $13.2 billion total cost of these operations is a significant fraction of health care spending in the United States. 
This has led to recent interest in comparing the quality of cardiac surgery across hospitals using risk-adjusted procedures and large patient populations. It has also led to interest in better assessing risks for individual patients and in obtaining improved understanding of the patient and procedural characteristics that affect cardiac surgery outcomes. \n\n[Figure 1 block diagram: INPUT FEATURES -> SELECT FEATURES -> REPLACE MISSING FEATURES -> CLASSIFY / CONFIDENCE NETWORK -> RISK PROBABILITY and CONFIDENCE INTERVAL] \n\nFigure 1. Block diagram of a medical risk predictor. \n\nThis paper describes experiments that explore the use of neural networks to predict the risk of complications in coronary artery bypass graft (CABG) surgery. Previous approaches to risk prediction for bypass surgery used linear or logistic regression or a Bayesian approach which assumes input features used for risk prediction are independent (e.g. Edwards, 1994; Marshall, 1994; Higgins, 1992; O'Connor, 1992). Neural networks have the potential advantages of modeling complex interactions among input features, of allowing both categorical and continuous input features, and of allowing more flexibility in fitting the expected risk than a simple linear or logistic function. \n\n2 RISK PREDICTION AND FEATURE SELECTION \n\nA block diagram of the medical risk prediction system used in these experiments is shown in Figure 1. Input features from a patient's medical record are provided as 105 raw inputs, a smaller subset of these features is selected, missing features in this subset are replaced with their most likely values from training data, and a reduced input feature vector is fed to a classifier and to a \"confidence network\". 
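The front end described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the feature names, the `MISSING` sentinel, and the use of the per-feature modal training value as the "most likely" replacement are all assumptions.

```python
# Sketch of the Figure 1 front end: select a feature subset, replace
# missing values with the most common training value for that feature
# (an assumption; the paper only says "most likely values"), and build
# the reduced vector handed to the classifier.
MISSING = None

def most_common_values(rows):
    """Per-feature modal value over the training rows (ignoring missing)."""
    modes = []
    for j in range(len(rows[0])):
        seen = {}
        for r in rows:
            if r[j] is not MISSING:
                seen[r[j]] = seen.get(r[j], 0) + 1
        modes.append(max(seen, key=seen.get))
    return modes

def prepare(raw, selected, modes):
    """Keep only the selected feature indices, imputing missing entries."""
    return [raw[j] if raw[j] is not MISSING else modes[k]
            for k, j in enumerate(selected)]

train = [[63, 1, 0], [70, 0, 0], [58, 1, 1]]   # toy records, not study data
selected = [0, 2]                               # e.g. Age and a binary flag
modes = most_common_values([[r[j] for j in selected] for r in train])
print(prepare([66, MISSING, MISSING], selected, modes))
```

The same reduced vector would then feed both the risk classifier and the confidence network.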
The classifier provides outputs that estimate the probability or risk of one type of complication. The confidence network provides upper and lower bounds on these risk estimates. Both logistic regression and multilayer sigmoid neural network (MLP) classifiers were evaluated in this study. Logistic regression is the most common approach to risk prediction. It is structurally equivalent to a feed-forward network with linear inputs and one output unit with a sigmoidal nonlinearity. Weights and offsets are estimated using a maximum likelihood criterion and iterative \"batch\" training. The reference logistic regression classifier used in these experiments was implemented with the S-Plus glm function (MathSoft, 1993), which uses iteratively reweighted least squares for training and no extra regularization such as weight decay. Multilayer feed-forward neural networks with no hidden nodes (denoted single-layer MLPs) and with one hidden layer and from 1 to 10 hidden nodes were also evaluated as implemented using LNKnet pattern classification software (Lippmann, 1993). An MLP committee classifier containing eight members trained using different initial random weights was also evaluated. \n\nAll classifiers were evaluated using a data base of 1257 patients who underwent coronary artery bypass surgery from 1990 to 1994. Classifiers were used to predict mortality, postoperative strokes, and renal failure. Predictions were made after a patient's medical history was obtained (History), after pre-surgical tests had been performed (Post-test), immediately before the operation (Preop), and immediately after the operation (Postop). Bootstrap sampling (Efron, 1993) was used to assess risk prediction accuracy because there were so few patients with complications in this data base. 
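The bootstrap evaluation described above can be sketched as follows. This is a hedged reconstruction: the paper trains on each bootstrap sample and scores on the left-out patterns, but the tiny majority-class "classifier" below is only a stand-in for the MLP or logistic model, and the seed and scoring function are assumptions.

```python
import random

# Sketch of bootstrap evaluation (Efron, 1993): each replication trains
# on a sample drawn with replacement and is scored on the left-out
# (out-of-bag) patterns.
def bootstrap_eval(patterns, labels, fit, score, reps=50, seed=0):
    rng = random.Random(seed)
    n = len(patterns)
    results = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        in_bag = set(idx)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:
            continue
        model = fit([patterns[i] for i in idx], [labels[i] for i in idx])
        results.append(score(model, [patterns[i] for i in oob],
                             [labels[i] for i in oob]))
    return results

fit = lambda X, y: max(set(y), key=y.count)              # majority class
score = lambda m, X, y: sum(t == m for t in y) / len(y)  # accuracy
accs = bootstrap_eval(list(range(100)), [0] * 90 + [1] * 10, fit, score)
print(len(accs))
```

With so few complications, the spread of the 50 out-of-bag scores is itself informative, which is why the paper reports variability across replications.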
The number of patients with complications was 33 or 2.6% for mortality, 25 or 2.0% for stroke, and 21 or 1.7% for renal failure. All experiments were performed using 50 bootstrap training sets where a risk prediction technique is trained with a bootstrap training set and evaluated using left-out patterns. \n\nFeature                              NComplications/NHigh   % True Hits \nHISTORY \nAge                                  27/674                 4.0% \nCOPD (Chronic Obs. Pul. Disease)     7/126                  5.6% \nPOST-TEST \nPulmonary Ventricular Congestion     8/71                   11.3% \nX-ray Cardiomegaly                   6/105                  5.7% \nX-ray Pulmonary Edema                6/21                   26.6% \nPREOP \nNTG (Nitroglycerin)                  21/447                 4.7% \nIABP (Intraaortic Balloon Pump)      11/115                 6.6% \nUrgency Status                       10/127                 7.9% \nMI When                              7/64                   10.9% \nPOSTOP \nBlood Used (Packed Cells)            12/113                 10.6% \nPerfusion Time                       9/184                  4.9% \n\nFigure 2. Features selected to predict mortality. \n\nThe initial set of 105 raw input features included binary (e.g. Male/Female), categorical (e.g. MI When: none, old, recent, evolving), and continuous valued features (e.g. Perfusion Time, Age). There were many missing and irrelevant features and all features were only weakly predictive. Small sets of features were selected for each complication using the following procedure: (1) select those 10 to 40 features that experience and previous studies indicate are related to each complication, (2) omit features if a univariate contingency table analysis shows the feature is not important, (3) omit features that are missing for more than 5% of patients, (4) order features by number of true positives, (5) omit features that are similar to other features, keeping the most predictive, and (6) add features incrementally as a patient's hospital interaction progresses. This resulted in sets of from 3 to 11 features for the three complications. Figure 2 shows the 11 features selected to predict mortality. 
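The univariate screening behind the Figure 2 columns can be sketched as below: count complications among patients for whom the (thresholded) feature is "high" and report the fraction and percentage of true hits. The toy counts are illustrative, not the Lahey Clinic data, and the threshold step is assumed to have already been applied.

```python
# Sketch of the per-feature tally in Figure 2: NComplications / NHigh
# and the same fraction as a percentage ("% True Hits").
def true_hits(feature_high, complication):
    n_high = sum(feature_high)
    n_comp = sum(h and c for h, c in zip(feature_high, complication))
    return n_comp, n_high, 100.0 * n_comp / n_high

high = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # toy thresholded feature values
comp = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]   # toy complication outcomes
n_comp, n_high, pct = true_hits(high, comp)
print(f"{n_comp}/{n_high} = {pct:.1f}%")
```

Features would then be ordered by their number of true positives (n_comp) as in selection step (4).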
The first column lists the features, the second column presents a fraction equal to the number of complications when the feature was \"high\" divided by the number of times this feature was \"high\" (a threshold was assigned for continuous and categorical features that provided good separation), and the last column is the second column expressed as a percentage. Classifiers were provided identical sets of input features for all experiments. Continuous inputs to all classifiers were normalized to have zero mean and unit variance, categorical inputs ranged from -(D-1)/2 to (D-1)/2 in steps of 1.0, where D is the number of categories, and binary inputs were -0.5 or 0.5. \n\n3 PERFORMANCE COMPARISONS \n\nRisk prediction was evaluated by plotting and computing the area under receiver operating characteristic (ROC) curves and also by using chi-square tests to determine how accurately classifiers could stratify subjects into three risk categories. Automated experiments were performed using bootstrap sampling to explore the effect of varying the training step size from 0.005 to 0.1; of using squared-error, cross-entropy, and maximum likelihood cost functions; of varying the number of hidden nodes from 1 to 8; and of stopping training after from 5 to 40 epochs. ROC areas varied little as parameters were varied. \n\n[Figure 3 plots: % Hits (Sensitivity) versus % False Alarms (100 - Specificity)] \n\nFigure 3. Fifty preoperative bootstrap ROCs predicting mortality using an MLP classifier with two hidden nodes and the average ROC (left), and average ROCs for mortality using history, preoperative, and postoperative features (right). 
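The ROC-area evaluation used throughout this section can be sketched as follows. This is a minimal reconstruction, not the paper's code: it sweeps a decision threshold over the classifier outputs and integrates sensitivity against false-alarm rate with the trapezoidal rule; tied scores are not specially grouped, for brevity.

```python
# Sketch of ROC area: sort patterns by score, trace (false-alarm rate,
# hit rate) as the threshold drops, and accumulate trapezoids.
def roc_area(scores, labels):
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    area = prev_fpr = prev_tpr = 0.0
    for _, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        fpr, tpr = fp / neg, tp / pos
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0   # trapezoid
        prev_fpr, prev_tpr = fpr, tpr
    return area

print(roc_area([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))   # perfect ranking
```

Applied to each of the 50 bootstrap replications, this yields the distribution of ROC areas whose spread Figure 3 illustrates.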
Risk stratification, which measures how well classifier outputs approximate posterior probabilities, improved substantially with a cross-entropy cost function (instead of squared error), with a smaller step size (0.01 instead of 0.05 or 0.1), and with more training epochs (20 versus 5 or 10). An MLP classifier with two hidden nodes provided good overall performance across complications and patient stages with a cross-entropy cost function, a step size of 0.01, momentum of 0.6, and stochastic gradient descent stopping after 20 epochs. A single-layer MLP provided good performance with similar settings, but stopping after 5 epochs. These settings were used for all experiments. The left side of Figure 3 shows the 50 bootstrap ROCs created using these settings for a two-hidden-node MLP when predicting mortality with preoperative features and the ROC created by averaging these curves. There is a large variability in these ROCs due to the small amount of training data. The ROC area varies from 67% to 85% (\u03c3 = 4.7) and the sensitivity with 20% false alarms varies from 30% to 79%. Similar variability occurs for other complications. The right side of Figure 3 shows average ROCs for mortality created using this MLP with history, preoperative, and postoperative features. As can be seen, the ROC area and prediction accuracy increases from 68.6% to 79.2% as more input features become available. \n\nFigure 4 shows ROC areas across all complications and patient stages. Only three and two patient stages are shown for stroke and renal failure because no extra features were added at the missing stages for these complications. ROC areas are low for all complications and range from 62% to 80%. ROC areas are highest using postoperative features, lowest using only history features, and increase as more features are added. 
ROC areas are highest for mortality (68 to 80%) and lower for stroke (62 to 71%) and renal failure (62 to 67%). The MLP classifier with two hidden nodes (MLP) always provided slightly higher ROC areas than logistic regression. The average increase with the MLP classifier was 2.7 percentage \n\n[Figure 4 bar charts: ROC AREA (%) versus PATIENT STAGE for MORTALITY, STROKE, and RENAL FAILURE; bars for Logistic, Single-Layer MLP, MLP, and MLP-Committee] \n\nFigure 4. ROC areas across all complications and patient stages for logistic regression, single-layer MLP classifier, two-layer MLP classifier with two hidden nodes, and a committee classifier containing eight two-layer MLP classifiers trained using different random starting weights. \n\npoints (the increase ranged from 0.3 to 5.5 points). The single-layer MLP classifier also provided good performance. The average ROC area with the single-layer MLP was only 0.6 percentage points below that of the MLP with two hidden nodes. The committee using eight two-layer MLP classifiers performed no better than an individual two-layer MLP classifier. \n\nClassifier outputs were used to bin or stratify each patient into one of three risk levels (0-5%, 5-10%, and 10-100%) by treating the output as an estimate of the complication posterior probability. Figure 5 shows the accuracy of risk stratification for the MLP classifier for all complications. 
Each curve was obtained by averaging 50 individual curves obtained using bootstrap sampling as with the ROC curves. Individual curves were obtained by placing each patient into one of the three risk bins based on the MLP output. The x's represent the average MLP output for all patients in each bin. Open squares are the true percentage of patients in each bin who experienced a complication. The bars represent \u00b12 binomial standard deviations about the true patient percentages. Risk prediction is accurate if the x's are close to the squares and within the confidence intervals. As can be seen, risk prediction is accurate and close to the actual number of patients who experienced complications. It is difficult, however, to assess risk prediction given the limited numbers of patients in the two highest bins. For example, in Figure 5, the median number of patients with complications was only 2 out of 20 in the middle bin and 2 out of 13 in the upper bin. Good and similar risk stratification, as measured by a chi-square test, was provided by all classifiers. Differences between classifier predictions and true patient percentages were small and not statistically significant. \n\n[Figure 5 panels: patient counts (open squares) and MLP outputs (x's) versus BIN PROBABILITY RANGE (%)] \n\nFigure 5. Accuracy of MLP risk stratification for three complications using preoperative features. Open squares are true percentages of patients in each bin with a complication, x's are MLP predictions, bars represent \u00b12 binomial standard deviation confidence intervals. 
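The stratification check behind Figure 5 can be sketched as below. This is an illustrative reconstruction under assumptions: the bin edges follow the text (0-5%, 5-10%, 10-100%), and the interval is ±2 binomial standard deviations about the observed rate; the toy predictions and outcomes are not study data.

```python
import math

# Sketch of risk stratification: bin patients by predicted risk, compare
# the mean prediction per bin with the observed complication rate, and
# attach a +/- 2 binomial standard deviation interval.
def stratify(preds, outcomes, edges=(0.0, 0.05, 0.10, 1.0)):
    report = []
    for lo, hi in zip(edges, edges[1:]):
        idx = [i for i, p in enumerate(preds) if lo <= p < hi]
        if not idx:
            report.append(None)
            continue
        n = len(idx)
        mean_pred = sum(preds[i] for i in idx) / n
        true_rate = sum(outcomes[i] for i in idx) / n
        sd = math.sqrt(true_rate * (1 - true_rate) / n)   # binomial std. dev.
        report.append((mean_pred, true_rate,
                       true_rate - 2 * sd, true_rate + 2 * sd))
    return report

print(stratify([0.01, 0.02, 0.07, 0.08, 0.2, 0.3], [0, 0, 0, 1, 0, 1]))
```

Prediction is judged accurate when each bin's mean prediction falls inside the interval around its observed rate, which is exactly the visual test Figure 5 supports.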
\n\n4 CONFIDENCE MLP NETWORKS \nEstimating the confidence in the classification decision produced by a neural network is a \ncritical issue that has received relatively little study. Not being able to provide a confidence \nmeasure makes it difficult for physicians and other professionals to accept the use of com(cid:173)\nplex networks. Bootstrap sampling (Efron, 1993) was selected as an approach to generate \nconfidence intervals for medical risk prediction because 1) It can be applied to any type of \nclassifier, 2) It measures variability due to training algorithms, implementation differences, \nand limited training data, and 3) It is simple to implement and apply. As shown in the top \nhalf of Figure 6, 50 bootstrap sets of training data are created from the original training data \nby resampling with replacement. These bootstrap training sets are used to train 50 bootstrap \nMLP classifiers using the same architecture and training procedures that were selected for \nthe risk prediction MLP. When a pattern is fed into these classifiers, their outputs provide \nan estimate of the distribution of the output of the risk prediction MLP. Lower and upper \nconfidence bounds for any input are obtained by sorting these outputs and selecting the 10% \nand 90% cumulative levels. \n\nIt is computationally expensive to have to maintain and query 50 bootstrap MLPs whenever \nconfidence bounds are desired. A simpler approach is to train a single confidence MLP to \nreplicate the confidence bounds predicted by the 50 bootstrap MLPs, as shown in the bot-\n\n\fPredicting the Risk of Complications in Coronary Artery Bypass Operations \n\n1061 \n\nOUTPUT \n\nSTATISTICS \n\nCONFIDENCE \n\nMLP \n\nUPPER \nLIMIT \n\nLOWER \nLIMIT \n\nRISK \n\nPREDICTION \n\nMLP \n\n/ \n\nINPUT \n\nPATTERN \n\nFigure 6. A confidence MLP trained using 50 bootstrap MLPs produces upper and \nlower confidence bounds for a risk prediction MLP. \n\ntom half of Figure 6. 
The confidence MLP is fed the input pattern and the output of the risk prediction MLP and produces at its output the confidence intervals that would have been produced by 50 bootstrap MLPs. The confidence MLP is a mapping or regression network that replaces the 50 bootstrap networks. It was found that confidence networks with one hidden layer, two hidden nodes, and a linear output could accurately reproduce the upper and lower confidence intervals created by 50 bootstrap two-layer MLP networks. The confidence network outputs were almost always within \u00b115% of the actual bootstrap bounds. Upper and lower bounds produced by these confidence networks for all patients using preoperative features predicting mortality are shown in Figure 7. Bounds are high (\u00b110 percentage points) when the complication risk is near 20% and drop to lower values (\u00b10.4 percentage points) when the risk is near 1%. This relatively simple approach makes it possible to create and replicate confidence intervals for many types of classifiers. \n\n5 SUMMARY AND FUTURE PLANS \n\nMLP networks provided slightly better risk prediction than conventional logistic regression when used to predict the risk of death, stroke, and renal failure on 1257 patients who underwent coronary artery bypass operations. Bootstrap sampling was required to compare approaches and regularization provided by early stopping was an important component of improved performance. A simplified approach to generating confidence intervals for MLP risk predictions using an auxiliary \"confidence MLP\" was also developed. The confidence MLP is trained to reproduce the confidence bounds that were generated during training by 50 MLP networks trained using bootstrap samples. 
Current research is validating these results using larger data sets, exploring approaches to detect outlier patients who are so different from any training patient that accurate risk prediction is suspect, developing approaches to explaining which input features are important for an individual patient, and determining why MLP networks provide improved performance. \n\n[Figure 7: CONFIDENCE LIMIT (%) versus COMPLICATION RISK %, with UPPER and lower bound curves] \n\nFigure 7. Upper and lower confidence bounds for all patients and preoperative mortality risk predictions calculated using two MLP confidence networks. \n\nACKNOWLEDGMENT \n\nThis work was sponsored by the Department of the Air Force. The views expressed are those of the authors and do not reflect the official policy or position of the U.S. Government. We wish to thank Stephanie Moisakis and Anne Nilson at the Lahey Clinic and Yuchun Lee at Lincoln Laboratory for assistance in organizing and preprocessing the data. \n\nBIBLIOGRAPHY \n\nF. Edwards, R. Clark, and M. Schwartz. (1994) Coronary Artery Bypass Grafting: The Society of Thoracic Surgeons National Database Experience. In Annals of Thoracic Surgery, Vol. 57, 12-19. \n\nBradley Efron and Robert J. Tibshirani. (1993) An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability 57, New York: Chapman and Hall. \n\nT. Higgins, F. Estafanous, et al. (1992) Stratification of Morbidity and Mortality Outcome by Preoperative Risk Factors in Coronary Artery Bypass Patients. In Journal of the American Medical Association, Vol. 267, No. 17, 2344-2348. \n\nR. Lippmann, L. Kukolich, and E. Singer. 
(1993) LNKnet: Neural Network, Machine Learning, and Statistical Software for Pattern Classification. In Lincoln Laboratory Journal, Vol. 6, No. 2, 249-268. \n\nGuillermo Marshall, Laurie W. Shroyer, et al. (1994) Bayesian-Logit Model for Risk Assessment in Coronary Artery Bypass Grafting. In Annals of Thoracic Surgery, Vol. 57, 1492-1500. \n\nG. O'Connor, S. Plume, et al. (1992) Multivariate Prediction of In-Hospital Mortality Associated with Coronary Artery Bypass Surgery. In Circulation, Vol. 85, No. 6, 2110-2118. \n\nStatistical Sciences. (1993) S-PLUS Guide to Statistical and Mathematical Analyses, Version 3.2, Seattle: StatSci, a division of MathSoft, Inc. \n", "award": [], "sourceid": 956, "authors": [{"given_name": "Richard", "family_name": "Lippmann", "institution": null}, {"given_name": "Linda", "family_name": "Kukolich", "institution": null}, {"given_name": "David", "family_name": "Shahian", "institution": null}]}