{"title": "A Mixture Model System for Medical and Machine Diagnosis", "book": "Advances in Neural Information Processing Systems", "page_first": 1077, "page_last": 1084, "abstract": null, "full_text": "A Mixture Model System for Medical and Machine Diagnosis \n\nMagnus Stensmo \nTerrence J. Sejnowski \nComputational Neurobiology Laboratory \nThe Salk Institute for Biological Studies \n10010 North Torrey Pines Road \nLa Jolla, CA 92037, U.S.A. \n{magnus,terry}@salk.edu \n\nAbstract \n\nDiagnosis of human disease or machine fault is a missing data problem, since many variables are initially unknown and additional information must be obtained. The joint probability distribution of the data can be used to solve this problem. We model this distribution with mixture models whose parameters are estimated by the EM algorithm, which has the added benefit that missing data in the database itself are also handled correctly. Requests for new information to refine the diagnosis are made according to the maximum expected utility principle. Because the system is based on learning, it is domain independent and less labor intensive than expert systems or probabilistic networks. An example using a heart disease database is presented. \n\n1 INTRODUCTION \n\nDiagnosis is the process of identifying diseases in patients or disorders in machines by considering history, symptoms and other signs through examination. Diagnosis is a common and important problem that has proven hard to automate and formalize. A procedural description is often hard to attain since experts do not know exactly how they solve a problem. \n\nIn this paper we use the information about a specific problem that exists in a database of cases. The disorders or diseases are determined by variables from observations, and the goal is to find the probability distribution over the disorders, conditioned on what has been observed. 
The diagnosis is strong when one or a few of the possible outcomes are differentiated from the others; more information is needed if it is inconclusive. Initially there are only a few clues and the rest of the variables are unknown. Additional information is obtained by asking questions and performing tests. Since tests may be dangerous, time consuming and expensive, it is generally not possible or desirable to find the answer to every question, and unnecessary tests should be avoided. \n\nThere have been many attempts to automate diagnosis. Early work [Ledley & Lusted, 1959] recognized that the problem is not always tractable due to the large number of influences that can exist between symptoms and diseases. Expert systems, e.g. the INTERNIST system for internal medicine [Miller et al., 1982], have rule bases that are very hard and time consuming to build. Inconsistencies may arise when new rules are added to an existing rule base, and there is a strong domain dependence, so knowledge bases can rarely be reused for new applications. \n\nBayesian or probabilistic networks [Pearl, 1988] model a joint probability distribution by factoring it with the chain rule of probability theory. Although the models are very powerful once built, there are presently no general learning methods for their construction, and considerable effort is needed. In the Pathfinder system for lymph node pathology [Heckerman et al., 1992], about 14,000 conditional probabilities had to be assessed by an expert pathologist. It is inevitable that errors will occur when such large numbers of manual assessments are involved. \n\nApproaches to diagnosis that are based on domain-independent machine learning alleviate some of the problems of knowledge engineering. For decision trees [Quinlan, 1986], a piece of information can only be used if the appropriate question comes up while traversing the tree, which means that irrelevant questions cannot be avoided. 
Feedforward multilayer perceptrons for diagnosis [Baxt, 1990] can classify very well, but they need full information about a case. None of these methods has an adequate way to handle missing data during learning or classification. \n\nThe exponentially growing number of probabilities involved can make exact diagnosis intractable. Simple approximations, such as independence between all variables or conditional independence given the disease (naive Bayes), introduce errors, since there usually are dependencies between the symptoms. Even though systems based on these assumptions work surprisingly well, correct diagnosis is not guaranteed. This paper avoids these assumptions by using mixture models. \n\n2 MIXTURE MODELS \n\nDiagnosis can be formulated as a probability estimation problem with missing inputs: the probabilities of the disorders are conditioned on what has currently been observed. If we model the joint probability distribution, it is easy to marginalize to get any conditional probability. This is necessary in order to handle missing data in a principled way [Ahmad & Tresp, 1993]. Using mixture models [McLachlan & Basford, 1988], a simple closed-form solution to optimal regression with missing data can be formulated. The EM algorithm, a method from parametric statistics for parameter estimation, is especially interesting in this context since it can also be formulated to handle missing data in the training examples [Dempster et al., 1977; Ghahramani & Jordan, 1994]. \n\n2.1 THE EM ALGORITHM \n\nThe data underlying the model are assumed to be a set of N D-dimensional vectors X = {z_1, ..., z_N}. 
Each data point is assumed to have been generated independently from a mixture density with M components, \n\np(z) = Σ_{j=1}^{M} p(z, ω_j; θ_j) = Σ_{j=1}^{M} P(ω_j) p(z|ω_j; θ_j),   (1) \n\nwhere each mixture component is denoted by ω_j, P(ω_j) is the a priori probability of mixture component ω_j, and Θ = (θ_1, ..., θ_M) are the model parameters. \n\nTo estimate the parameters of the different mixture components, so that it is likely that their linear combination generated the set of data points, we use maximum likelihood estimation. A good method is the iterative Expectation-Maximization, or EM, algorithm [Dempster et al., 1977]. \n\nTwo steps are repeated. First a likelihood is formulated and its expectation is computed in the E-step. For the type of models that we use, this step calculates the probability that a certain mixture component generated the data point in question. The second step is the M-step, in which the parameters that maximize the expectation are found. They can be found analytically for models that can be written in exponential form, e.g. Gaussian functions. Equations can be derived for both batch and on-line learning. Update equations for Gaussian distributions with and without missing data are given here; other distributions are possible, e.g. binomial or multinomial [Stensmo & Sejnowski, 1994]. Details and derivations can be found in [Dempster et al., 1977; Nowlan, 1991; Ghahramani & Jordan, 1994; Stensmo & Sejnowski, 1994]. \n\nFrom (1) we form the log likelihood of the data, \n\nL(Θ|X) = Σ_{i=1}^{N} log p(z_i; Θ) = Σ_{i=1}^{N} log Σ_{j=1}^{M} P(ω_j) p(z_i|ω_j; θ_j). \n\nThere is unfortunately no analytic solution because of the logarithm of the sum on the right-hand side of the equation. However, if we knew which of the mixture components generated each data point, we could compute it. The EM algorithm solves this by introducing a set of binary indicator variables Z = {Z_ij}. 
Z_ij = 1 if and only if the data point z_i was generated by mixture component j. The log likelihood can then be manipulated into a form that does not contain the log of a sum. \n\nThe expectation of Z_ij under the current parameter values Θ^k is used, since Z_ij is not known directly. This is the E-step of the EM algorithm. The expected value is then maximized in the M-step. The two steps are iterated until convergence; the likelihood never decreases after an iteration [Dempster et al., 1977], and convergence is fast compared to gradient descent. \n\nOne of the main motivations for the EM algorithm was to handle missing values for variables in a data set in a principled way. In the complete-data case we introduced indicator variables that helped us solve the problem. With missing data, the missing input components are simply added to the hidden variables along with Z [Dempster et al., 1977; Ghahramani & Jordan, 1994]. \n\n2.2 GAUSSIAN MIXTURES \n\nWe specialize here the EM algorithm to the case where the mixture components are radial Gaussian distributions. For mixture component j with mean μ_j and covariance matrix Σ_j this is \n\nG_j(z) = (2π)^{-D/2} |Σ_j|^{-1/2} exp(-(z - μ_j)^T Σ_j^{-1} (z - μ_j)/2). \n\nThe form of the covariance matrix is often constrained to be diagonal or to have the same values on the diagonal, Σ_j = σ_j² I. This corresponds to axis-parallel oval-shaped and radially symmetric Gaussians, respectively. Radial and diagonal basis functions can function well in applications [Nowlan, 1991], since several Gaussians together can form complex shapes in the space, and with fewer parameters over-fitting is minimized. In the radial case, with variance σ_j², \n\nG_j(z) = (2πσ_j²)^{-D/2} exp(-||z - μ_j||²/(2σ_j²)). \n\nIn the E-step the expected value of the likelihood is computed. For the Gaussian case this becomes the probability that Gaussian j generated the data point, \n\np_j(z) = P(ω_j) G_j(z) / Σ_{k=1}^{M} P(ω_k) G_k(z). \n\nThe M-step finds the parameters that maximize the likelihood from the E-step. 
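\n\nThe E-step just described can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation; the function name, array shapes and the toy numbers below are our own. It evaluates the radial Gaussian density G_j(z) for each component and normalizes to obtain the responsibilities p_j(z): \n\n
```python
import numpy as np

def responsibilities(z, priors, means, variances):
    # z: (D,) data point; priors: (M,) mixture weights P(w_j)
    # means: (M, D) component means; variances: (M,) radial variances sigma_j^2
    d = z[None, :] - means                  # deviation of z from each mean
    sq = (d * d).sum(axis=1)                # squared distances ||z - mu_j||^2
    dim = z.shape[0]
    # radial Gaussian density G_j(z) for each component j
    g = (2 * np.pi * variances) ** (-dim / 2) * np.exp(-sq / (2 * variances))
    post = priors * g                       # unnormalized P(w_j) G_j(z)
    return post / post.sum()                # responsibilities p_j(z); sum to 1
```
\n\nA data point close to one component's mean receives a responsibility near 1 for that component, which is what weights the parameter updates in the M-step. \n\n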
For complete data the new estimates are \n\nP(ω_j) = S_j/N,   μ_j = (1/S_j) Σ_{i=1}^{N} p_j(z_i) z_i,   σ_j² = (1/(D S_j)) Σ_{i=1}^{N} p_j(z_i) ||z_i - μ_j||²,   (2) \n\nwhere S_j = Σ_{i=1}^{N} p_j(z_i). \n\nWhen input variables are missing, G_j(z) is evaluated only over the set of observed dimensions O; missing (unobserved) dimensions are denoted by U. The update equation for P(ω_j) is unchanged. To estimate μ_j we set the unobserved components z_i^U to their expected values μ_j^U and use (2). The variance becomes \n\nσ_j² = (1/(D S_j)) Σ_{i=1}^{N} p_j(z_i) (||z_i^O - μ_j^O||² + |U_i| σ_j²), \n\nwhere the old value of σ_j² is used on the right-hand side [Ghahramani & Jordan, 1994]. \n\nA least squares regression was used to fill in missing data values during classification. For missing variables and Gaussian mixtures this becomes the same approach used by [Ahmad & Tresp, 1993]. The result of the regression when the outcome variables are missing is a probability distribution over the disorders. This can be reduced to a classification, for comparison with other systems, by picking the outcome with the maximum estimated probability. \n\n3 REQUESTING MORE INFORMATION \n\nDuring the diagnosis process, the outcome probabilities are refined at each step based on newly acquired knowledge. It is important to select the questions that lead to the minimal number of necessary tests. There is generally a cost associated with each test, and the goal is to minimize the total cost. Early work on automated diagnosis [Ledley & Lusted, 1959] acknowledged the problem of asking as few questions as possible and suggested the use of decision analysis for the solution. An important idea from the field of decision theory is the maximum expected utility principle [von Neumann & Morgenstern, 1947]: a decision maker should always choose the alternative that maximizes some expected utility of the decision. For diagnosis the relevant utility is the cost of misclassification. Each pair of outcomes has a utility u(x, y) when the correct diagnosis is x but y has been determined instead. The expectation can be computed when we know the probabilities of the outcomes. 
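\n\nThe expected-utility calculation can be made concrete. The sketch below is ours (the outcome names and toy numbers are invented for illustration, not taken from the paper): each candidate diagnosis y is scored by the expectation Σ_x p(x) u(x, y), and the decision maker announces the maximizer, as the principle prescribes. \n\n
```python
def best_diagnosis(probs, utility):
    # probs: dict mapping outcome -> probability that it is the true outcome
    # utility: dict mapping (truth, announced) -> payoff u(x, y)
    outcomes = list(probs)
    def expected_utility(y):
        return sum(probs[x] * utility[(x, y)] for x in outcomes)
    # maximum expected utility principle: announce the best-scoring outcome
    return max(outcomes, key=expected_utility)

# toy example with a 0/1 utility: 1 for a correct diagnosis, 0 otherwise
probs = {'no disease': 0.7, 'disease': 0.3}
utility = {('no disease', 'no disease'): 1, ('disease', 'disease'): 1,
           ('no disease', 'disease'): 0, ('disease', 'no disease'): 0}
```
\n\nWith a 0/1 utility this reduces to picking the most probable outcome; an asymmetric utility (e.g. a large penalty for missing a disease) can shift the decision away from the most probable outcome. \n\n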
\n\nThe utility values have to be assessed manually, in what can be a lengthy and complicated process. For this reason a simplification of this function has been suggested by [Heckerman et al., 1992]: the utility u(x, y) is 1 when both x and y are benign or both are malignant, and 0 otherwise. This simplification has been found to work well in practice. Another complication with the maximum expected utility principle can also make it intractable: in the ideal case we would evaluate every possible sequence of future choices to see which is the best, but since the size of the search tree of possibilities grows exponentially this is often not possible. A simplification is to look ahead only one or a few steps at a time. This nearsighted or myopic approach has been tested in practice with good results [Gorry & Barnett, 1967; Heckerman et al., 1992]. \n\n4 THE DIAGNOSIS SYSTEM \n\nThe system we have developed has two phases. First there is a learning phase in which a probabilistic model is built. This model is then used for inference in the diagnosis phase. \n\nIn the learning phase, the joint probability distribution of the data is modeled using mixture models. Parameters are determined from a database of cases by the EM algorithm; the k-means algorithm is used for initialization. Input and output variables for each case are combined into one vector per case to form the set of training patterns. The outcomes and other nominal variables are coded as 1 of N. Continuous variables are interval coded. \n\nIn the diagnosis phase, myopic one-step look-ahead was used and utilities were simplified as above. The following steps were performed: \n\n1. Initial observations were entered. \n\n2. Conditional expectation regression was used to fill in unknown variables. \n\n3. The maximum expected utility principle was used to recommend the next observation to make. Stop if nothing would be gained by further observations. \n\n4. 
The user was asked to determine the correct value for the recommended observation. Any other observations could be made, instead of or in addition to this. \n\n5. Continue with step 2. \n\nTable 1: The Cleveland Heart Disease database. \n\n #  Variable  Description                                    Values \n 1  age       Age in years                                   continuous \n 2  sex       Sex of subject                                 male/female \n 3  cp        Chest pain                                     four types \n 4  trestbps  Resting blood pressure                         continuous \n 5  chol      Serum cholesterol                              continuous \n 6  fbs       Fasting blood sugar                            lt or gt 120 mg/dl \n 7  restecg   Resting electrocardiogr.                       five values \n 8  thalach   Max heart rate achieved                        continuous \n 9  exang     Exercise induced angina                        yes/no \n10  oldpeak   ST depr. induced by exercise relative to rest  continuous \n11  slope     Slope of peak exercise ST segment              up/flat/down \n12  ca        # major vess. col. fluorosc.                   0-3 \n13  thal      Defect type                                    normal/fixed/reversible \n14  num       Heart disease (the disorder)                   not present/4 types \n\n5 EXAMPLE \n\nThe Cleveland heart disease data set from UC Irvine has been used to test the system. It contains 303 examples of four types of heart disease and its absence. There are thirteen continuous- or nominally-valued variables (Table 1). The continuous variables were interval coded with one unit per standard deviation away from the mean value; this was chosen since they were approximately normally distributed. Nominal variables were coded with one unit per value. In total the 14 variables were coded with 55 units. The EM steps were repeated until convergence (60-150 iterations). A varying number of mixture components (20-120) were tried. \n\nPreviously reported results have used only presence or absence of the heart disease. The best of these has been a classification rate of 78.9% using a system that incrementally built prototypes [Gennari et al., 1989]. 
We have obtained 78.6% correct classification with 60 radial Gaussian mixture components as described above. Performance increased with the number of mixture components, and was not sensitive to the exact number used during training unless there were too few of them. Previous investigators have pointed out that there is not enough information in the thirteen variables in this data set to reach 100% [Gennari et al., 1989]. \n\nAn annotated transcript of a diagnosis session is shown in Figure 1. \n\n6 CONCLUSIONS AND FURTHER WORK \n\nSeveral properties of this model remain to be investigated. It should be tested on several more databases; unfortunately, databases are typically proprietary and difficult to obtain. Future prospects for medical databases should be good, since some hospitals are now using computerized record systems instead of traditional paper-based records. It should be fairly easy to generate data for machine diagnosis. \n\nAn alternative way to choose a new question is to evaluate the change in variance of the output variables when a variable is changed from missing to observed. The idea is that a variable known with certainty has zero variance. The variable with the largest resulting conditional variance could be selected as the query, similar to [Cohn et al., 1995]. \n\nOne important aspect of automated diagnosis is the accompanying explanation for the conclusion, a factor that is important for user acceptance. Since the basis functions have local support, and since we have estimates of the probability of each basis function having generated the observed data, explanations for the conclusions could be generated. \n\nInstead of using the simplified utilities with values 0 and 1 for the expected utility calculations, the utilities could be learned by reinforcement learning: a trained expert would evaluate the quality of the diagnosis performed by the system, followed by adjustment of the utilities. The 0 and 1 values can be used as starting values. \n\nThe leftmost of the five numbers in a line is the estimated probability of no heart disease, followed by the probabilities of the four types of heart disease. The entropy of the diagnoses, defined as -Σ_i p_i log p_i, is given at the same time as a measure of how decisive the current conclusion is. A completely determined diagnosis has entropy 0. Initially all of the variables are unknown and the starting diagnoses are the unconditional prior probabilities. \n\nDisorders (entropy = 1.85): \n0.541254 0.181518 0.118812 0.115512 0.042904 \nWhat is cp ? 3 \n\nThe first question is chest pain, and the answer changes the estimated probabilities. This variable is continuous; the answer is to be interpreted as how far the observation is from the mean, in standard deviations. As the decision becomes more conclusive, the entropy decreases. \n\nDisorders (entropy = 0.69): \n0.888209 0.060963 0.017322 0.021657 0.011848 \nWhat is age ? 0 \n\nDisorders (entropy = 0.57): \n0.91307619 0.00081289 0.02495360 0.03832095 0.02283637 \nWhat is oldpeak ? -2 \n\nDisorders (entropy = 0.38): \n0.94438718 0.00089016 0.02539957 0.02691099 0.00241210 \nWhat is chol ? -1 \n\nDisorders (entropy = 0.11): \n0.98848758 0.00028553 0.00321580 0.00507073 0.00294036 \n\nWe have now determined that the probability of no heart disease in this case is 98.8%. The remaining 1.2% is spread out over the other possibilities. \n\nFigure 1: Diagnosis example. \n\nAcknowledgements \n\nThe heart disease database is from the University of California, Irvine Repository of Machine Learning Databases and originates from R. Detrano, Cleveland Clinic Foundation. Peter Dayan provided helpful comments on an earlier version of this paper. \n\nReferences \n\nAhmad, S. & Tresp, V. (1993). 
Some solutions to the missing feature problem in vision. In Advances in Neural Information Processing Systems, vol. 5, pp. 393-400. Morgan Kaufmann, San Mateo, CA. \n\nBaxt, W. (1990). Use of an artificial neural network for data analysis in clinical decision-making: The diagnosis of acute coronary occlusion. Neural Computation, 2(4), 480-489. \n\nCohn, D. A., Ghahramani, Z. & Jordan, M. I. (1995). Active learning with statistical models. In Advances in Neural Information Processing Systems, vol. 7. Morgan Kaufmann, San Mateo, CA. \n\nDempster, A., Laird, N. & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38. \n\nGennari, J., Langley, P. & Fisher, D. (1989). Models of incremental concept formation. Artificial Intelligence, 40, 11-62. \n\nGhahramani, Z. & Jordan, M. (1994). Supervised learning from incomplete data via an EM approach. In Advances in Neural Information Processing Systems, vol. 6, pp. 120-127. Morgan Kaufmann, San Mateo, CA. \n\nGorry, G. A. & Barnett, G. O. (1967). Experience with a model of sequential diagnosis. Computers and Biomedical Research, 1, 490-507. \n\nHeckerman, D., Horvitz, E. & Nathwani, B. (1992). Toward normative expert systems: Part I. The Pathfinder project. Methods of Information in Medicine, 31, 90-105. \n\nLedley, R. S. & Lusted, L. B. (1959). Reasoning foundations of medical diagnosis. Science, 130(3366), 9-21. \n\nMcLachlan, G. J. & Basford, K. E. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker, Inc., New York, NY. \n\nMiller, R. A., Pople, H. E. & Myers, J. D. (1982). Internist-I: An experimental computer-based diagnostic consultant for general internal medicine. New England Journal of Medicine, 307, 468-476. \n\nNowlan, S. J. (1991). Soft Competitive Adaptation: Neural Network Learning Algorithms based on Fitting Statistical Mixtures. 
PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. \n\nPearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA. \n\nQuinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81-106. \n\nStensmo, M. & Sejnowski, T. J. (1994). A mixture model diagnosis system. Tech. Rep. INC-9401, Institute for Neural Computation, University of California, San Diego. \n\nvon Neumann, J. & Morgenstern, O. (1947). Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ. \n", "award": [], "sourceid": 1019, "authors": [{"given_name": "Magnus", "family_name": "Stensmo", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}