{"title": "Probabilistic Anomaly Detection in Dynamic Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 825, "page_last": 832, "abstract": null, "full_text": "Probabilistic Anomaly Detection \n\n\u2022 In \n\nDynamic Systems \n\nPadhraic Smyth \n\nJet Propulsion Laboratory 238-420 \nCalifornia Institute of Technology \n\n4800 Oak Grove Drive \nPasadena, CA 91109 \n\nAbstract \n\nThis paper describes probabilistic methods for novelty detection \nwhen using pattern recognition methods for fault monitoring of \ndynamic systems. The problem of novelty detection is particular(cid:173)\nly acute when prior knowledge and training data only allow one \nto construct an incomplete classification model. Allowance must \nbe made in model design so that the classifier will be robust to \ndata generated by classes not included in the training phase. For \ndiagnosis applications one practical approach is to construct both \nan input density model and a discriminative class model. Using \nBayes' rule and prior estimates of the relative likelihood of data \nof known and unknown origin the resulting classification equations \nare straightforward. The paper describes the application of this \nmethod in the context of hidden Markov models for online fault \nmonitoring of large ground antennas for spacecraft tracking, with \nparticular application to the detection of transient behaviour of \nunknown origin. \n\n1 PROBLEM BACKGROUND \n\nConventional control-theoretic models for fault detection typically rely on an accu(cid:173)\nrate model ofthe plant being monitored (Patton, Frank, and Clark, 1989). However, \nin practice it common that no such model exists for complex non-linear systems. 
\nThe large ground antennas used by JPL's Deep Space Network (DSN) to track planetary spacecraft fall into this category. \n\nFigure 1: Block diagram of typical Deep Space Network downlink \n\nQuite detailed analytical models exist for the electromechanical pointing systems. However, these models are primarily used for determining gross system characteristics such as resonant frequencies; they are known to be a poor fit for fault detection purposes. \n\nWe have previously described the application of adaptive pattern recognition methods to the problem of online health monitoring of DSN antennas (Smyth and Mellstrom, 1992; Smyth, in press). Rapid detection and identification of failures in the electromechanical antenna pointing systems is highly desirable in order to minimise antenna downtime and thus minimise telemetry data loss when communicating with remote spacecraft (see Figure 1). Fault detection based on manual monitoring of the various antenna sensors is neither reliable nor cost-effective. \n\nThe pattern-recognition monitoring system operates as follows. Sensor data such as motor current, position encoder, tachometer voltages, and so forth are synchronously sampled at 50Hz by a data acquisition system. The data are blocked off into disjoint windows (200 samples are used in practice) and various features (such as estimated autoregressive coefficients) are extracted; let the feature vector be x. The features are fed into a classification model (every 4 seconds) which in turn provides posterior probability estimates of the m possible states of the system given the estimated features from that window, p(w_i|x). w_1 corresponds to normal conditions, the other w_i's, 1 < i <= m, correspond to known fault conditions. 
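The windowed feature extraction described above can be sketched in a few lines. The sketch below uses a least-squares estimate of autoregressive coefficients plus a residual-variance term; the actual DSN feature pipeline is not specified in this paper, so the function names, AR order, and fitting method are illustrative assumptions only.

```python
import numpy as np

def ar_features(window, order=4):
    """Estimate AR(order) coefficients and residual variance for one
    window of samples via a least-squares fit of w[t] on w[t-1..t-order]."""
    w = np.asarray(window, dtype=float)
    y = w[order:]
    # Lagged design matrix: column k holds the series delayed by k+1 samples.
    X = np.column_stack([w[order - 1 - k: len(w) - 1 - k] for k in range(order)])
    coeffs, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coeffs
    return np.concatenate([coeffs, [resid.var()]])

def extract_features(signal, window_len=200, order=4):
    """Disjoint (non-overlapping) windows, as in the text: 200 samples
    at 50 Hz yields one feature vector every 4 seconds."""
    n = len(signal) // window_len
    return np.array([ar_features(signal[i * window_len:(i + 1) * window_len], order)
                     for i in range(n)])
```

In practice features from several sensor channels would be concatenated into the single vector x fed to the classifier.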
\n\nFinally, since the system has \"memory\" in the sense that it is more likely to remain in the current state than to change states, the posterior probabilities need to be correlated over time. This is achieved by a standard first-order hidden Markov model (HMM) which models the temporal state dependence. The hidden aspect of the model reflects the fact that while the features are directly observable, the underlying system states are not, i.e., they are in effect \"hidden.\" Hence, the purpose of the HMM is to provide a model from which the most likely sequence of system states can be inferred given the observed sequence of feature data. \n\nThe classifier portion of the model is trained using simulated hardware faults. The feed-forward neural network has been the model of choice for this application because of its discrimination ability, its posterior probability estimation properties (Richard and Lippmann, 1992; Miller, Goodman and Smyth, 1993) and its relatively simple implementation in software. It should be noted that unlike typical speech recognition HMM applications, the transition probabilities are not estimated from data but are designed into the system based on prior knowledge of the system mean time between failure (MTBF) and other specific knowledge of the system configuration (Smyth, in press). \n\n2 LIMITATIONS OF THE DISCRIMINATIVE MODEL \n\nThe model described above assumes that there are m known mutually exclusive and exhaustive states (or \"classes\") of the system, w_1, ..., w_m. The mutually exclusive assumption is reasonable in many applications where multiple simultaneous failures are highly unlikely. However, the exhaustive assumption is somewhat impractical. In particular, for fault detection in a complex system such as a large antenna, there are thousands of possible fault conditions which might occur. 
The probability of occurrence of any single condition is very small, but nonetheless there is a significant probability that at least one of these conditions will occur over some finite time. While the common faults can be directly modelled, it is not practical to assign model states to all the other minor faults which might occur. \n\nAs discussed in (Smyth and Mellstrom, 1992; Smyth, 1994), a discriminative model directly models p(w_i|x), the posterior probabilities of the classes given the feature data, and assumes that the classes w_1, ..., w_m are exhaustive. On the other hand, a generative model directly models the probability density function of the input data conditioned on each class, p(x|w_i), and then indirectly determines posterior class probabilities by application of Bayes' rule. Examples of generative classifiers include parametric models such as Gaussian classifiers and memory-based methods such as kernel density estimators. Generative models are by nature well suited to novelty detection, whereas discriminative models have no built-in mechanism for detecting data which are different from those on which the model was trained. However, there is a trade-off: because generative models typically are doing more modelling than just searching for a decision boundary, they can be less efficient (than discriminant methods) in their use of the data. For example, generative models typically scale poorly with input dimensionality for fixed training sample size. \n\n3 HYBRID MODELS \n\nA relatively simple and practical approach to the novelty detection problem is to use both a generative and a discriminative classifier (an idea originally suggested to the author by R. P. Lippmann). An extra \"(m+1)th\" state is added to the model to cover \"all other possible states\" not accounted for by the known m states. 
In this framework, the posterior estimates of the discriminative classifier are conditioned on the event that the data come from one of the m known classes. \n\nLet the symbol w_{1,...,m} denote the event that the true system state is one of the known states, let w_{m+1} be the unknown state, and let p(w_{m+1}|x) be the posterior probability that the system is in an unknown state given the data. Hence, one can estimate the posterior probability of individual known states as \n\np(w_i|x) = p_d(w_i|x, w_{1,...,m}) (1 - p(w_{m+1}|x)), 1 <= i <= m, (1) \n\nwhere p_d(w_i|x, w_{1,...,m}) is the posterior probability estimate of state i as provided by a discriminative model, i.e., given that the system is in one of the known states. \n\nThe calculation of p(w_{m+1}|x) can be obtained via the usual application of Bayes' rule if p(x|w_{m+1}), p(w_{m+1}), and p(x|w_{1,...,m}) are known: \n\np(w_{m+1}|x) = p(x|w_{m+1}) p(w_{m+1}) / [ p(x|w_{m+1}) p(w_{m+1}) + p(x|w_{1,...,m}) sum_{i=1}^{m} p(w_i) ]. (2) \n\nSpecifying the prior density p(x|w_{m+1}), the distribution of the features conditioned on the occurrence of the unknown state, can be problematic. In practice we have used non-informative Bayesian priors for p(x|w_{m+1}) over a bounded space of feature values (details are available in a technical report (Smyth and Mellstrom, 1993)), although the choosing of a prior density for data of unknown origin is basically ill-posed. The stronger the constraints which can be placed on the features, the narrower the resulting prior density and the better the ability of the overall model to detect novelty. If we only have very weak prior information, this will translate into a weaker criterion for accepting points which belong to the unknown category. The term p(w_{m+1}) (in Equation (2)) must be chosen based on the designer's prior belief of how often the system will be in an unknown state; a practical choice is that the system is at least as likely to be in an unknown failure state as any of the known failure states. 
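In code, Equations (1) and (2) amount to a few lines. The sketch below assumes the generative densities, priors, and discriminative posteriors for a single feature vector x are already available; all function and argument names are illustrative, not from the paper.

```python
import numpy as np

def hybrid_posteriors(p_x_given_unknown, p_unknown_prior,
                      p_x_given_known, p_known_priors, disc_posteriors):
    """Combine a generative novelty term with discriminative class
    posteriors for one feature vector x, per Equations (1) and (2)."""
    # Equation (2): posterior probability of the unknown (m+1)th state.
    num = p_x_given_unknown * p_unknown_prior
    den = num + p_x_given_known * np.sum(p_known_priors)
    p_unknown = num / den
    # Equation (1): discriminative posteriors scaled by P(known | x).
    p_known = (1.0 - p_unknown) * np.asarray(disc_posteriors, dtype=float)
    return p_unknown, p_known
```

Note that p_unknown plus the entries of p_known sum to one whenever the discriminative posteriors themselves sum to one.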
\n\nThe p(x|w_{1,...,m}) term in Equation (2) is provided directly by the generative model. Typically this can be a mixture of Gaussian component densities or a kernel density estimate over all of the training data (ignoring class labels). In practice, for simplicity of implementation we use a simple Gaussian mixture model. Furthermore, because of the aforementioned scaling problem with input dimensions, only a subset of relatively significant input features are used in the mixture model. A less heuristic approach to this aspect of the problem (with which we have not yet experimented) would be to use a method such as projection pursuit to project the data into a lower dimensional subspace and perform the input density estimation in this space. The main point is that the generative model need not necessarily work in the full dimensional space of the input features. \n\nIntegration of Equations (1) and (2) into the hidden Markov model scheme is straightforward and is not derived here; the HMM now has an extra state, \"unknown.\" The choice of transition probabilities between the unknown and other states is once again a matter of design choice. For the antenna application at least, many of the unknown states are believed to be relatively brief transient phenomena which last perhaps no longer than a few seconds: hence, the Markov matrix is designed to reflect these beliefs, since the expected duration of any state, d[w_i] (in units of sampling intervals), must obey \n\nd[w_i] = 1 / (1 - p_ii), (3) \n\nwhere p_ii is the self-transition probability of state w_i. \n\n4 EXPERIMENTAL RESULTS \n\nFor illustrative purposes the experimental results from two particular models are compared. Each was applied to monitoring the servo pointing system of a DSN 34m antenna at Goldstone, California. 
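When designing the Markov matrix, Equation (3) is simply inverted: given a desired mean state duration, the self-transition probability follows directly. A brief sketch (the helper name is illustrative):

```python
def self_transition_for_duration(mean_duration_intervals):
    """Invert Equation (3): choose p_ii so that the expected duration
    of state w_i, namely 1 / (1 - p_ii), equals the design value
    (in units of sampling intervals)."""
    return 1.0 - 1.0 / mean_duration_intervals
```

For example, if the design target were a 10 second mean duration at one window every 4 seconds, that is 2.5 sampling intervals, giving p_ii = 0.6.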
The models were implemented within LabVIEW data acquisition software running in real-time on a Macintosh II computer at the antenna site. The models had previously been trained off-line on data collected some months earlier. 12 input features were used, consisting of estimated autoregressive coefficients and variance terms from each window of 200 samples of multichannel data. For both models a discriminative feedforward neural network model (with 8 hidden units, sigmoidal hidden and output activation functions) was trained (using conjugate-gradient optimization) to discriminate between a normal state and 3 known and commonly occurring fault states (failed tachometer, noisy tachometer, and amplifier short circuit, also known as \"compensation loss\"). The network output activations were normalised to sum to 1 in order to provide posterior class probability estimates. \n\nModel (a) used no HMM and assumed that the 4 known states are exhaustive, i.e., it just used the feedforward network. Model (b) used a HMM with 5 states, where a generative model (a Gaussian mixture model) and a flat prior (with bounds on the feature values) were used to determine the probability of the 5th state (as described by Equations (1) and (2)). The same neural network as in model (a) was used as a discriminator for the other 4 known states. The generative mixture model had 10 components and used only 2 of the 12 input features, the 2 which were judged to be the most sensitive to system change. The parameters of the HMM were designed according to the guidelines described earlier. Known fault states were assumed to be equally likely with 1 hour MTBFs and with 1 hour mean duration. Unknown faults were assumed to have a 20 minute MTBF and a 10 second mean duration. 
\nBoth HMMs used 5-step backwards smoothing, i.e., the probability estimates at any time n are based on all past data up to time n and future data up to time n + 5 (using a larger number of backward steps was found empirically to produce no effect on the estimates). \n\nFigures 2(a) and 2(b) show each model's estimates (as a function of time) that the system is in the normal state. The experiment consisted of introducing known hardware faults into the system in a controlled manner after 15 minutes and 45 minutes, each of 15 minutes duration. \n\nFigure 2: Estimated posterior probability of normal state (a) using no HMM and the exhaustive assumption (normal + 3 fault states), (b) using a HMM with a hybrid model (normal + 3 faults + other state). \n\nModel (a)'s estimates are quite noisy and contain a significant number of potential false alarms (highly undesirable in an operational environment). Model (b) is much more stable due to the smoothing effect of the HMM. Nonetheless, we note that between the 8th and 10th minutes, there appear to be some possible false alarms: these data were classified into the unknown state (not shown). 
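The 5-step backwards smoothing used by both HMMs can be sketched as a fixed-lag forward-backward computation. The implementation below is a generic illustration, not the deployed monitoring code; the per-window state likelihoods p(x_t | state) are assumed to be supplied by the classifier stage.

```python
import numpy as np

def fixed_lag_smoother(likelihoods, A, pi, lag=5):
    """Fixed-lag HMM smoothing: the posterior at time n uses observations
    up to time n + lag, as in the 5-step backwards smoothing in the text.
    likelihoods: (T, S) array of p(x_t | state s); A: (S, S) transition
    matrix (rows sum to 1); pi: initial state distribution."""
    T, S = likelihoods.shape
    # Forward pass: filtered state probabilities, renormalised each step.
    alpha = np.zeros((T, S))
    alpha[0] = pi * likelihoods[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * likelihoods[t]
        alpha[t] /= alpha[t].sum()
    # For each n, run a short backward pass over the window (n, n + lag].
    smoothed = np.zeros((T, S))
    for n in range(T):
        end = min(n + lag, T - 1)
        beta = np.ones(S)
        for t in range(end, n, -1):
            beta = A @ (likelihoods[t] * beta)
            beta /= beta.sum()  # rescale for numerical stability
        post = alpha[n] * beta
        smoothed[n] = post / post.sum()
    return smoothed
```

Recomputing the short backward pass for every n costs O(T * lag * S^2), which is negligible for the 5-state, one-window-per-4-seconds setting described here.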
On later inspection it was found that large transients (of unknown origin) were in fact present in the original sensor data and that this was what the model had detected, confirming the classification provided by the model. It is worth pointing out that the model without a generative component (whether with or without the HMM) also detected a non-normal state at the same time, but incorrectly classified this state as one of the known fault states (these results are not shown). \n\nAlso not shown are the results from using a generative model alone, with no discriminative component. While its ability to detect unknown states was similar to the hybrid model, its ability to discriminate between known states was significantly worse than that of the hybrid model. \n\nThe hybrid model has been empirically tested on a variety of other conditions where various \"known\" faults are omitted from the discriminative training step and then presented to the model during testing: in all cases, the anomalous unknown state was detected by the model, i.e., classified as a state which the model had not seen before. \n\n5 APPLICATION ISSUES \n\nThe model described here is currently being integrated into an interactive antenna health monitoring software tool for use by operations personnel at all new DSN antennas. The first such antenna is currently being built at the Goldstone (California) DSN site and is scheduled for delivery to DSN operations in late 1994. Similar antennas, also equipped with fault detectors of the general nature described here, will be constructed at the DSN ground station complexes in Spain and Australia in the 1995-96 time-frame. 
\nThe ability to detect previously unseen transient behaviour has important practical consequences: as well as being used to warn operators of servo problems in real-time, the model will also be used as a filter to a data logger to record interesting and anomalous servo data on a continuous basis. Hence, potentially novel system characteristics can be recorded for correlation with other antenna-related events (such as maser problems, receiver lock drop during RF feedback tracking, etc.) for later analysis to uncover the true cause of the anomaly. A long-term goal is to develop an algorithm which can automatically analyse the data which have been classified into the unknown state and extract distinct sub-classes which can be added as new explicit states to the HMM monitoring system in a dynamic fashion. Stolcke and Omohundro (1993) have described an algorithm which dynamically creates a state model for HMMs for the case of discrete-valued features. The case of continuous-valued features is considerably more subtle and may not be solvable unless one makes significant prior assumptions regarding the nature of the data-generating mechanism. \n\n6 CONCLUSION \n\nA simple hybrid classifier was proposed for novelty detection within a probabilistic framework. Although presented in the context of hidden Markov models for fault detection, the proposed scheme is perfectly general for generic classification applications. For example, it would seem highly desirable that fielded automated medical diagnosis systems (such as various neural network models which have been proposed in the literature) should always contain a \"novelty-detection\" component in order that novel data are identified and appropriately classified by the system. \n\nThe primary weakness of the methodology proposed in this paper is the necessity for prior knowledge in the form of densities for the feature values given the unknown state. 
The alternative approach is not to explicitly model the data from the unknown state but to use some form of thresholding on the input densities from the known states (Aitchison, Habbema, and Kay, 1977; Dubuisson and Masson, 1993). However, direct specification of threshold levels is itself problematic. In this sense, the specification of prior densities can be viewed as a method for automatically determining the appropriate thresholds (via Equation (2)). \n\nAs a final general comment, it is worth noting that online learning systems must use some form of novelty detection. Hence, hybrid generative-discriminative models (a simple form of which has been proposed here) may be a useful framework for modelling online learning. \n\nAcknowledgements \n\nThe author would like to thank Jeff Mellstrom, Paul Scholtz, and Nancy Xiao for assistance in data acquisition and analysis. The research described in this paper was performed at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration and was supported in part by ARPA under grant number N00014-92-J-1860. \n\nReferences \n\nR. Patton, P. Frank, and R. Clark (eds.), Fault Diagnosis in Dynamic Systems: Theory and Application, New York, NY: Prentice Hall, 1989. \n\nP. Smyth and J. Mellstrom, 'Fault diagnosis of antenna pointing systems using hybrid neural networks and signal processing techniques,' in Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, R. P. Lippmann (eds.), San Mateo, CA: Morgan Kaufmann, pp. 667-674, 1992. \n\nP. Smyth, 'Hidden Markov models for fault detection in dynamic systems,' Pattern Recognition, vol. 27, no. 1, in press. \n\nM. D. Richard and R. P. Lippmann, 'Neural network classifiers estimate Bayesian a posteriori probabilities,' Neural Computation, 3(4), pp. 461-483, 1992. \n\nJ. Miller, R. Goodman, and P. 
Smyth, 'On loss functions which minimize to conditional expected values and posterior probabilities,' IEEE Transactions on Information Theory, vol. 39, no. 4, pp. 1404-1408, July 1993. \n\nP. Smyth, 'Probability density estimation and local basis function neural networks,' in Computational Learning Theory and Natural Learning Systems, T. Petsche, M. Kearns, S. Hanson, R. Rivest (eds.), Cambridge, MA: MIT Press, 1994. \n\nP. Smyth and J. Mellstrom, 'Failure detection in dynamic systems: model construction without fault training data,' Telecommunications and Data Acquisition Progress Report, vol. 112, pp. 37-49, Jet Propulsion Laboratory, Pasadena, CA, February 15th 1993. \n\nA. Stolcke and S. Omohundro, 'Hidden Markov model induction by Bayesian merging,' in Advances in Neural Information Processing Systems 5, C. L. Giles, S. J. Hanson and J. D. Cowan (eds.), San Mateo, CA: Morgan Kaufmann, pp. 11-18, 1993. \n\nJ. Aitchison, J. D. F. Habbema, and J. W. Kay, 'A critical comparison of two methods of statistical discrimination,' Applied Statistics, vol. 26, pp. 15-25, 1977. \n\nB. Dubuisson and M. Masson, 'A statistical decision rule with incomplete knowledge about the classes,' Pattern Recognition, vol. 26, no. 1, pp. 155-165, 1993. \n", "award": [], "sourceid": 805, "authors": [{"given_name": "Padhraic", "family_name": "Smyth", "institution": null}]}