{"title": "Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra", "book": "Advances in Neural Information Processing Systems", "page_first": 946, "page_last": 952, "abstract": null, "full_text": "Support Vector Novelty Detection \n\nApplied to Jet Engine Vibration Spectra \n\nPaul Hayton \n\nDepartment of Engineering Science \n\nUniversity of Oxford, UK \n\npmh@robots.ox.ac.uk \n\nBernhard SchOlkopf \nMicrosoft Research \n\n1 Guildhall Street, Cambridge, UK \n\nbsc@scientist.com \n\nLionel Tarassenko \n\nDepartment of Engineering Science \n\nUniversity of Oxford, UK \nlionel@robots.ox.ac.uk \n\nPaul Anuzis \n\nRolls-Royce Civil Aero-Engines \n\nDerby, UK \n\nAbstract \n\nA system has been developed to extract diagnostic information from jet \nengine carcass vibration data. Support Vector Machines applied to nov(cid:173)\nelty detection provide a measure of how unusual the shape of a vibra(cid:173)\ntion signature is, by learning a representation of normality. We describe \na novel method for Support Vector Machines of including information \nfrom a second class for novelty detection and give results from the appli(cid:173)\ncation to Jet Engine vibration analysis. \n\n1 \n\nIntroduction \n\nJet engines have a number of rigorous pass-off tests before they can be delivered to the \ncustomer. The main test is a vibration test over the full range of operating speeds. Vibration \ngauges are attached to the casing of the engine and the speed of each shaft is measured \nusing a tachometer. The engine on the test bed is slowly accelerated from idle to full \nspeed and then gradually decelerated back to idle. As the engine accelerates, the rotation \nfrequency of the two (or three) shafts increases and so does the frequency of the vibrations \ncaused by the shafts. A tracked order is the amplitude of the vibration signal in a narrow \nfrequency band centered on a harmonic of the rotation frequency of a shaft, measured as \na function of engine speed. 
It tracks the frequency response of the engine to the energy injected by the rotating shaft. Although there are usually some harmonics present, most of the energy in the vibration spectrum is concentrated in the fundamental tracked orders. These therefore constitute the \"vibration signature\" of the jet engine under test. It is very important to detect departures from the normal or expected shapes of these tracked orders, as this provides very useful diagnostic information (for example, for the identification of out-of-balance conditions).\n\nThe detection of such abnormalities is ideally suited to the novelty detection paradigm for several reasons. Usually, there are far fewer examples of abnormal shapes than normal ones, and often there may only be a single example of a particular type of abnormality in the available database. More importantly, the engine under test may show up a type of abnormality which has never been seen before but which should not be missed. This is especially important in our current work, where we are adapting the techniques developed for pass-off tests to in-flight monitoring.\n\nWith novelty detection, we first of all learn a description of normal vibration shapes by including only examples of normal tracked orders in the training data. Abnormal shapes in test engines are subsequently identified by testing for novelty against the description of normality.\n\nIn our previous work [2], we investigated the vibration spectra of a two-shaft jet engine, the Rolls-Royce Pegasus. In the available database, there were vibration spectra recorded from 52 normal engines (the training data) and from 33 engines with one or more unusual vibration features (the test data). The shape of the tracked orders was encoded as a low-dimensional vector by calculating a weighted average of the vibration amplitude over six different speed ranges (giving an 18-D vector for three tracked orders).
With so few engines available, the K-means clustering algorithm (with K = 4) was used to construct a very simple model of normality, following component-wise normalisation of the 18-D vectors.\n\nThe novelty of the vibration signature for a test engine was assessed as the shortest distance to one of the kernel centres in the clustering model of normality (each distance being normalised by the width associated with that kernel). When cumulative distributions of novelty scores were plotted both for normal (training) engines and test engines, little overlap was found between the two distributions [2]. A significant shortcoming of the method, however, is the inability to rank engines according to novelty, since the shortest normalised distance is evaluated with respect to different cluster centres for different engines. In this paper, we revisit the problem, but for a new engine, the RB211-535. We argue that the SVM paradigm is ideal for novelty detection, as it provides an elegant description of normality, a direct indication of the patterns on the boundary of normality (the support vectors) and, perhaps most importantly, a ranking of \"abnormality\" according to distance to the separating hyperplane in feature space.\n\n2 Support Vector Machines for Novelty Detection\n\nSuppose we are given a set of \"normal\" data points X = {x_1, ..., x_ℓ}. In most novelty detection problems, this is all we have; however, in the following we shall develop an algorithm that is slightly more general in that it can also take into account some examples of abnormality, Z = {z_1, ..., z_m}. Our goal is to construct a real-valued function which, given a previously unseen test point x, characterizes the \"X-ness\" of the point x, i.e. which takes large values for points similar to those in X.
The algorithm that we shall present below will return such a function, along with a threshold value, such that a prespecified fraction of X will lead to function values above threshold. In this sense we are estimating a region which captures a certain probability mass.\n\nThe present approach employs two ideas from Support Vector Machines [6] which are crucial for their good generalization performance even in high-dimensional tasks: maximizing a margin, and nonlinearly mapping the data into some feature space F endowed with a dot product. The input domain X itself need not be endowed with a dot product; it may be a general set. The connection between the input domain and the feature space is established by a feature map Φ : X → F, i.e. a map such that some simple kernel [1, 6]\n\nk(x, y) = (Φ(x) · Φ(y)),  (1)\n\nsuch as the Gaussian\n\nk(x, y) = exp(-||x - y||^2 / c),  (2)\n\nprovides a dot product in the image of Φ. Writing c_Z = (1/m) Σ_n Φ(z_n) for the centroid of the mapped abnormal examples, dot products between shifted points can likewise be evaluated via the kernel,\n\n((Φ(x) - c_Z) · (Φ(y) - c_Z)) = k(x, y) - (1/m) Σ_n k(x, z_n) - (1/m) Σ_n k(y, z_n) + (1/m^2) Σ_{n,p} k(z_n, z_p).  (3)\n\nTo separate the shifted data\n\nΦ(x_1) - c_Z, ..., Φ(x_ℓ) - c_Z  (4)\n\nfrom the origin with maximum margin, we solve the quadratic program\n\nmin over w, ξ, ρ of (1/2) ||w||^2 + (1/(νℓ)) Σ_i ξ_i - ρ,  (5)\n\nsubject to\n\n(w · (Φ(x_i) - c_Z)) ≥ ρ - ξ_i,  ξ_i ≥ 0,  (6)\n\nwhere ν ∈ (0, 1] is a parameter discussed below and the ξ_i are slack variables. The corresponding decision function is\n\nf(x) = sgn((w · (Φ(x) - c_Z)) - ρ).  (7)\n\nIntroducing a Lagrangian and setting its derivative with respect to w to zero yields the expansion\n\nw = Σ_i α_i (Φ(x_i) - c_Z).  (8)\n\nThose x_i with α_i > 0 are called Support Vectors. The expansion (8) turns the decision function (7) into a form which only depends on dot products, f(x) = sgn((Σ_i α_i (Φ(x_i) - c_Z)) · (Φ(x) - c_Z) - ρ). By multiplying out the dot products, we obtain a form that can be written as a nonlinear decision function on the input domain X in terms of the kernel (1) (cf. (3)). A short calculation yields f(x) = sgn(Σ_i α_i k(x_i, x) - (1/m) Σ_n k(z_n, x) + (1/m^2) Σ_{n,p} k(z_n, z_p) - (1/m) Σ_{i,n} α_i k(z_n, x_i) - ρ). In the argument of the sgn, only the first two terms depend on x; we may therefore absorb the remaining terms into the constant ρ, which we have not fixed yet. To compute ρ in the final form of the decision function\n\nf(x) = sgn(Σ_i α_i k(x_i, x) - (1/m) Σ_n k(z_n, x) - ρ),  (9)\n\nwe employ the Karush-Kuhn-Tucker (KKT) conditions of the optimization problem [6, e.g.].
They state that for points x_i where 0 < α_i < 1/(νℓ), the inequality constraints (6) become equalities (note that in general, α_i ∈ [0, 1/(νℓ)]), and the argument of the sgn in the decision function should equal 0, i.e. the corresponding x_i sits exactly on the hyperplane of separation.\n\nThe KKT conditions also imply that only those points x_i can have a nonzero α_i for which the first inequality constraint in (6) is precisely met; therefore the support vectors x_i with α_i > 0 will often form but a small subset of X.\n\nSubstituting (8) (obtained by setting the derivative of the Lagrangian with respect to w to zero) and the corresponding conditions for ξ and ρ into the Lagrangian, we can eliminate the primal variables to get the dual problem. A short calculation shows that it consists of minimizing the quadratic form\n\nW(α) = (1/2) Σ_{ij} α_i α_j (k(x_i, x_j) + q - q_i - q_j),  (10)\n\nwhere q = (1/m^2) Σ_{n,p} k(z_n, z_p) and q_j = (1/m) Σ_n k(x_j, z_n), subject to the constraints\n\n0 ≤ α_i ≤ 1/(νℓ),  Σ_i α_i = 1.  (11)\n\nThis convex quadratic program can be solved with standard quadratic programming tools. Alternatively, one can employ the SMO algorithm described in [3], which was found to scale approximately quadratically with the training set size.\n\nTo illustrate the idea presented in this section, figure 1 shows a 2-D example of separating the data from the mean of another data set in feature space.\n\nFigure 1: Separating one class of data from the mean of a second data set. The first class is a mixture of three Gaussians; the SVM algorithm is used to find the hyperplane in feature space that separates the data from the second set (another Gaussian - the black dots). The image intensity represents the SVM output value, which is the measure of novelty.\n\nWe next state a few theoretical results, beginning with a characterization of the influence of ν. To this end, first note that the constraints (11) rule out solutions where ν > 1, as in that case the α_i cannot sum up to 1.
Negative values of ν are ruled out, too, since they would amount to encouraging (rather than penalizing) training errors in (5). Therefore, in the primal problem (5) only ν ∈ (0, 1] makes sense. We shall now explain that ν actually characterizes how many points of X are allowed to lie outside the region where the decision function is positive. To this end, we introduce the term outlier to denote points x_i that have a nonzero slack variable ξ_i, i.e. points that lie outside of the estimated region. By the KKT conditions, all outliers are also support vectors; however, there can be support vectors (sitting exactly on the margin) that are not outliers.\n\nProposition 1 (ν-property) Assume the solution of (5) satisfies ρ ≠ 0. The following statements hold:\n(i) ν is an upper bound on the fraction of outliers.\n(ii) ν is a lower bound on the fraction of SVs.\n(iii) Suppose the data (4) were generated independently from a distribution P(x) which does not contain discrete components. Suppose, moreover, that the kernel is analytic and non-constant. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of outliers.\n\nThe proof can be found in [4]. We next state another desirable theoretical result:\n\nProposition 2 (Resistance [3]) Local movements of outliers parallel to w do not change the hyperplane.\n\nEssentially, this result is due to the fact that the errors ξ_i enter the objective function only linearly. To determine the hyperplane, we need to find the (constrained) extremum of the objective function, and in finding the extremum, the derivatives are what count. For the linear error term, however, those are constant, so they do not depend on how far away from the hyperplane an error point lies.
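The dual (10)-(11) is a small convex quadratic program. As an illustrative sketch (not the authors' implementation; they mention standard QP tools and the SMO algorithm of [3]), it can be solved with a generic constrained optimizer such as SciPy's SLSQP. The kernel width and the synthetic two-dimensional data below are arbitrary choices:

```python
import numpy as np
from scipy.optimize import minimize

def rbf(A, B, c=4.0):
    # Gaussian kernel of equation (2): k(x, y) = exp(-||x - y||^2 / c)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / c)

def fit_novelty(X, Z, nu=0.2, c=4.0):
    # Solve the dual (10)-(11) and return the argument of the sgn in (9),
    # i.e. a real-valued novelty score (positive = normal).
    l, m = len(X), len(Z)
    q = rbf(Z, Z, c).sum() / m ** 2        # q   = (1/m^2) sum_{n,p} k(z_n, z_p)
    qi = rbf(X, Z, c).sum(axis=1) / m      # q_i = (1/m) sum_n k(x_i, z_n)
    Kc = rbf(X, X, c) + q - qi[:, None] - qi[None, :]  # centred kernel, cf. (3)

    res = minimize(lambda a: 0.5 * a @ Kc @ a,             # objective (10)
                   np.full(l, 1.0 / l),
                   bounds=[(0.0, 1.0 / (nu * l))] * l,     # constraints (11)
                   constraints={'type': 'eq', 'fun': lambda a: a.sum() - 1.0})
    alpha = res.x

    # KKT conditions: a point with 0 < alpha_i < 1/(nu*l) sits exactly on
    # the hyperplane, so the decision function is zero there; this fixes rho.
    margin = (alpha > 1e-6) & (alpha < 1.0 / (nu * l) - 1e-6)
    i = int(np.argmax(margin)) if margin.any() else int(np.argmax(alpha))
    rho = alpha @ Kc[:, i]

    def decision(x):
        x = np.atleast_2d(x)
        ktilde = rbf(X, x, c)[:, 0] + q - qi - rbf(Z, x, c)[:, 0].mean()
        return alpha @ ktilde - rho

    return decision

rng = np.random.RandomState(0)
X = rng.randn(40, 2)                         # 'normal' class
Z = rng.randn(10, 2) + np.array([4.0, 0.0])  # examples of abnormality
f = fit_novelty(X, Z)                        # f(x) > 0 inside the region of normality
```

By Proposition 1, at most a fraction ν of the training points should come out with a negative score, while points drawn near Z should score below zero.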
\n\nWe conclude this section by noting that if Z is empty, the algorithm is trying to separate the data from the origin in F, and both the decision function and the optimization problem reduce to what is described in [5].\n\n3 Application of SVM to Jet Engine Pass-off Tests\n\nThe Support Vector Machine algorithm for novelty detection is applied to the pass-off data from a set of 162 Rolls-Royce jet engines. The shape of the tracked order of interest is encoded by calculating a weighted average of the vibration amplitude over ten speed ranges, thereby generating a 10-D shape vector. The available data was split into the following three sets:\n\n• 99 normal engines to be used as training data;\n• 40 normal engines to be used as validation data;\n• 23 engines labelled as having at least one abnormal aspect in their vibration signature (the \"test\" data).\n\nUsing the training dataset, the SVM algorithm finds the hyperplane that separates the normal data from the origin in feature space with the largest margin. The number of support vectors gives an indication of how well the algorithm is generalising (if all data points were support vectors, the algorithm would have memorized the data). A Gaussian kernel was used with a width c = 40.0 in equation (2), chosen by starting with a small kernel width (so that the algorithm memorizes the data), increasing the width, and stopping when similar results are obtained on the training and validation data.\n\nCumulative novelty distributions are plotted for two different values of ν and these are shown in figure 2. The curves show a slight overlap between the normal and test engines. Although it is not given here, a ranking of the engines according to their novelty is also provided to the Rolls-Royce test engineers.
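The shape encoding used above (a weighted average of vibration amplitude over ten speed ranges) can be sketched as follows. The weights are not given in the paper, so a plain mean per band is assumed, and the tracked order below is synthetic:

```python
import numpy as np

def shape_vector(speeds, amplitudes, n_bands=10):
    # Average the tracked-order amplitude over n_bands equal-width speed
    # ranges. (The paper uses a *weighted* average; the weights are not
    # specified, so a plain mean per band is assumed in this sketch.)
    edges = np.linspace(speeds.min(), speeds.max(), n_bands + 1)
    band = np.clip(np.searchsorted(edges, speeds, side='right') - 1, 0, n_bands - 1)
    return np.array([amplitudes[band == b].mean() for b in range(n_bands)])

# A synthetic tracked order sampled at 200 engine speeds
speeds = np.linspace(1000.0, 10000.0, 200)
amplitudes = np.sin(speeds / 2000.0) ** 2 + 0.1
v = shape_vector(speeds, amplitudes)   # 10-D shape vector
```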
\n\nFigure 2: Cumulative novelty distributions for two different values of ν: (a) ν = 0.1, (b) ν = 0.2. The curves show that there is a slight overlap in the data; for ν = 0.1, there are 11 validation engines over the SVM decision boundary and 2 test engines inside the boundary.\n\nSeparating the Normal Engines from the Test Engines. In a retrospective analysis such as described in this paper (for which the test engines with unusual vibration signatures have already been identified as such by the Rolls-Royce experts), the SVM algorithm can be re-run to find the hyperplane that separates the normal data from the mean of the test data in feature space with the largest margin (instead of separating from the origin). The algorithm is trained on the 99 training engines and 22 of the 23 test engines. Each test engine is left out in turn and the algorithm re-trained to compute its novelty. Cumulative distributions are again plotted (see figure 3) and these show an improved separation between the two sets of engines. It should be noted, however, that the improvement is less for the validation engines than for the training engines. Nevertheless, there is an improvement for the validation engines, seen from the higher intersection of the distribution with the axis.\n\nFigure 3: Cumulative novelty distributions showing the variation of novelty with number of engines for (a) the training data versus the test data (each test engine omitted from the training phase in turn to compute its novelty) and (b) the validation data versus the test data.
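As an aside, in the special case of a linear kernel, separating X from the centroid of Z is equivalent to running the origin-separating algorithm of [5] on data shifted by that centroid, so the leave-one-out scheme described above can be sketched with an off-the-shelf implementation of [5] (scikit-learn's OneClassSVM). The engine vectors below are synthetic stand-ins, not the Rolls-Royce data:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(1)
X_train = rng.randn(99, 10)         # stand-in for the 99 normal training engines
Z_test = rng.randn(23, 10) + 2.0    # stand-in for the 23 'test' engines

# With a linear kernel, separating X from the centroid of Z is the same as
# separating the shifted data X - mean(Z) from the origin.
scores = []
for i in range(len(Z_test)):
    Z_rest = np.delete(Z_test, i, axis=0)              # leave engine i out
    ctr = Z_rest.mean(axis=0)
    clf = OneClassSVM(kernel='linear', nu=0.1).fit(X_train - ctr)
    scores.append(clf.decision_function((Z_test[i] - ctr)[None, :])[0])

scores = np.array(scores)   # more negative = more novel
```

Sorting these scores gives the kind of novelty ranking supplied to the test engineers.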
\n\n4 Discussion\n\nThis paper has presented a novel application of Support Vector Machines and introduced a method for including information from a second data set when considering novelty detection. The results on the jet engine data show very good separation between normal and test engines. We believe Support Vector Machines are an ideal framework for novelty detection and indeed, we have obtained better results than with our previous clustering-based algorithms for detecting novel jet engine signatures.\n\nThe present work builds on a previous algorithm for estimating a distribution's support [5]. That algorithm, separating the data from the origin in feature space, suffered from the drawback that the origin played a special role. One way to think of it is as a prior on where, in a novelty detection context, the unknown \"other\" class lies. The present work alleviates this problem by allowing for the possibility to separate from a point inferred from the data, either from the same class, or from some other data.\n\nThere is a concern that one could put forward about one of the variants of the presently proposed approach, namely about the case where X and Z are disjoint, and we are separating X from Z's centroid: why not actually train a full binary classifier separating X from all examples of Z, rather than just from its mean? Indeed, there might be situations where this is appropriate. More specifically, whenever Z is representative of the instances of the other class that we expect to see in the future, then a binary classification is certainly preferable. However, there can be situations where Z is not representative of the other class, for instance due to nonstationarity. Z may even consist only of artificial examples. In this situation, the only real training examples are the positive ones.
In this case, separating the data from the mean of some artificial, or non-representative, examples provides a way of taking into account some information from the other class which might work better than simply separating the positive data from the origin.\n\nThe philosophy behind our approach is the one advocated by [6]: if you are trying to solve a learning problem, do it directly, rather than solving a more general problem along the way. Applied to the estimation of a distribution's support, this means: do not first estimate a density and then threshold it to get an estimate of the support.\n\nAcknowledgments. Thanks to John Platt, John Shawe-Taylor, Alex Smola and Bob Williamson for helpful discussions.\n\nReferences\n\n[1] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA, July 1992. ACM Press.\n\n[2] A. Nairac, N. Townsend, R. Carr, S. King, P. Cowley, and L. Tarassenko. A system for the analysis of jet engine vibration data. Integrated Computer-Aided Engineering, 6:53-65, 1999.\n\n[3] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. TR MSR 99-87, Microsoft Research, Redmond, WA, 1999.\n\n[4] B. Schölkopf, J. Platt, and A. J. Smola. Kernel method for percentile feature extraction. TR MSR 2000-22, Microsoft Research, Redmond, WA, 2000.\n\n[5] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. Support vector method for novelty detection. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 582-588. MIT Press, 2000.\n\n[6] V. Vapnik. The Nature of Statistical Learning Theory. Springer, N.Y., 1995.
\n\n\f", "award": [], "sourceid": 1887, "authors": [{"given_name": "Paul", "family_name": "Hayton", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}, {"given_name": "Lionel", "family_name": "Tarassenko", "institution": null}, {"given_name": "Paul", "family_name": "Anuzis", "institution": null}]}