{"title": "Support Vector Method for Novelty Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 582, "page_last": 588, "abstract": null, "full_text": "Support Vector Method for Novelty Detection \n\nBernhard Scholkopf*, Robert Williamson\u00a7, \n\nAlex Smola\u00a7, John Shawe-Taylort, John Platt* \n\n* Microsoft Research Ltd., 1 Guildhall Street, Cambridge, UK \n\n\u00a7 Department of Engineering, Australian National University, Canberra 0200 \n\nt Royal Holloway, University of London, Egham, UK \n* Microsoft, 1 Microsoft Way, Redmond, WA, USA \n\nbsc/jplatt@microsoft.com, Bob.WilliamsoniAlex.Smola@anu.edu.au, john@dcs.rhbnc.ac.uk \n\nAbstract \n\nSuppose you are given some dataset drawn from an underlying probabil(cid:173)\nity distribution P and you want to estimate a \"simple\" subset S of input \nspace such that the probability that a test point drawn from P lies outside \nof S equals some a priori specified l/ between 0 and 1. \nWe propose a method to approach this problem by trying to estimate a \nfunction f which is positive on S and negative on the complement. The \nfunctional form of f is given by a kernel expansion in terms of a poten(cid:173)\ntially small subset of the training data; it is regularized by controlling the \nlength of the weight vector in an associated feature space. We provide a \ntheoretical analysis of the statistical performance of our algorithm. \nThe algorithm is a natural extension of the support vector algorithm to \nthe case of unlabelled data. \n\n1 \n\nINTRODUCTION \n\nDuring recent years, a new set of kernel techniques for supervised learning has been devel(cid:173)\noped [8]. Specifically, support vector (SV) algorithms for pattern recognition, regression \nestimation and solution of inverse problems have received considerable attention. 
There have been a few attempts to transfer the idea of using kernels to compute inner products in feature spaces to the domain of unsupervised learning. The problems in that domain are, however, less precisely specified. Generally, they can be characterized as estimating functions of the data which tell you something interesting about the underlying distributions. For instance, kernel PCA can be characterized as computing functions which on the training data produce unit variance outputs while having minimum norm in feature space [4]. Another kernel-based unsupervised learning technique, regularized principal manifolds [6], computes functions which give a mapping onto a lower-dimensional manifold minimizing a regularized quantization error. Clustering algorithms are further examples of unsupervised learning techniques which can be kernelized [4]. \n\nAn extreme point of view is that unsupervised learning is about estimating densities. Clearly, knowledge of the density of P would then allow us to solve whatever problem can be solved on the basis of the data. The present work addresses an easier problem: it proposes an algorithm which computes a binary function which is supposed to capture regions in input space where the probability density lives (its support), i.e. a function such that most of the data will live in the region where the function is nonzero [5]. In doing so, it is in line with Vapnik's principle never to solve a problem which is more general than the one we actually need to solve. Moreover, it is applicable also in cases where the density of the data's distribution is not even well-defined, e.g. if there are singular components. Part of the motivation for the present work was the paper [1]. It turns out that there is a considerable amount of prior work in the statistical literature; for a discussion, cf. 
the full version of the present paper [3]. \n\n2 ALGORITHMS \n\nWe first introduce terminology and notation conventions. We consider training data x_1, …, x_ℓ ∈ X, where ℓ ∈ ℕ is the number of observations, and X is some set. For simplicity, we think of it as a compact subset of ℝ^N. Let Φ be a feature map X → F, i.e. a map into a dot product space F such that the dot product in the image of Φ can be computed by evaluating some simple kernel [8] \n\nk(x, y) = (Φ(x) · Φ(y)),   (1) \n\nsuch as the Gaussian kernel \n\nk(x, y) = exp(−‖x − y‖²/c).   (2) \n\nIndices i and j are understood to range over 1, …, ℓ (in compact notation: i, j ∈ [ℓ]). Bold face Greek letters denote ℓ-dimensional vectors whose components are labelled using normal face typeset. \n\nIn the remainder of this section, we shall develop an algorithm which returns a function f that takes the value +1 in a \"small\" region capturing most of the data points, and −1 elsewhere. Our strategy is to map the data into the feature space corresponding to the kernel, and to separate them from the origin with maximum margin. For a new point x, the value f(x) is determined by evaluating which side of the hyperplane it falls on, in feature space. Via the freedom to utilize different types of kernel functions, this simple geometric picture corresponds to a variety of nonlinear estimators in input space. \n\nTo separate the data set from the origin, we solve the following quadratic program: \n\nmin_{w ∈ F, ξ ∈ ℝ^ℓ, ρ ∈ ℝ}  (1/2)‖w‖² + (1/(νℓ)) Σ_i ξ_i − ρ   (3) \nsubject to  (w · Φ(x_i)) ≥ ρ − ξ_i,  ξ_i ≥ 0.   (4) \n\nHere, ν ∈ (0, 1) is a parameter whose meaning will become clear later. Since nonzero slack variables ξ_i are penalized in the objective function, we can expect that if w and ρ solve this problem, then the decision function f(x) = sgn((w · Φ(x)) − ρ) will be positive for most examples x_i contained in the training set, while the SV type regularization term ‖w‖ will still be small. 
The actual trade-off between these two goals is controlled by ν. Deriving the dual problem, and using (1), the solution can be shown to have an SV expansion \n\nf(x) = sgn(Σ_i α_i k(x_i, x) − ρ)   (5) \n\n(patterns x_i with nonzero α_i are called SVs), where the coefficients are found as the solution of the dual problem: \n\nmin_α  (1/2) Σ_ij α_i α_j k(x_i, x_j)  subject to  0 ≤ α_i ≤ 1/(νℓ),  Σ_i α_i = 1.   (6) \n\nThis problem can be solved with standard QP routines. It does, however, possess features that set it apart from generic QPs, most notably the simplicity of the constraints. This can be exploited by applying a variant of SMO developed for this purpose [3]. \n\nThe offset ρ can be recovered by exploiting that for any α_i which is not at the upper or lower bound, the corresponding pattern x_i satisfies ρ = (w · Φ(x_i)) = Σ_j α_j k(x_j, x_i). \n\nNote that if ν approaches 0, the upper boundaries on the Lagrange multipliers tend to infinity, i.e. the second inequality constraint in (6) becomes void. The problem then resembles the corresponding hard margin algorithm, since the penalization of errors becomes infinite, as can be seen from the primal objective function (3). It can be shown that if the data set is separable from the origin, then this algorithm will find the unique supporting hyperplane with the properties that it separates all data from the origin, and its distance to the origin is maximal among all such hyperplanes [3]. If, on the other hand, ν approaches 1, then the constraints alone only allow one solution, that where all α_i are at the upper bound 1/(νℓ). In this case, for kernels with integral 1, such as normalized versions of (2), the decision function corresponds to a thresholded Parzen windows estimator. 
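The algorithm above is a convex quadratic program whose ν parameter directly controls the trade-off between region size and training outliers. As a concrete illustration, here is a minimal sketch using scikit-learn's OneClassSVM, a standard implementation of this one-class formulation; scikit-learn itself, the toy data, and the parameter values are illustrative assumptions and not part of the paper, and scikit-learn's gamma parameter corresponds to 1/c in the Gaussian kernel (2):

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Toy training sample (hypothetical data, for illustration only).
rng = np.random.RandomState(0)
X = rng.randn(200, 2)

nu = 0.1   # the nu parameter of the quadratic program (3)
c = 2.0    # kernel width in the paper's parametrization k(x, y) = exp(-||x - y||^2 / c)

# gamma = 1/c translates the paper's kernel width into scikit-learn's convention.
clf = OneClassSVM(kernel="rbf", gamma=1.0 / c, nu=nu)
clf.fit(X)

# The decision function is sgn(sum_i alpha_i k(x_i, x) - rho), cf. (5):
# +1 inside the estimated region, -1 outside.
pred = clf.predict(X)
frac_outliers = float(np.mean(pred == -1))   # fraction of training points outside the region
frac_svs = len(clf.support_) / len(X)        # fraction of support vectors
```

On such a sample, the fraction of training outliers stays at or below ν and the fraction of SVs at or above ν, consistent with Proposition 1 in Section 3.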
\n\nTo conclude this section, we note that one can also use balls to describe the data in feature space, close in spirit to the algorithms of [2], with hard boundaries, and [7], with \"soft margins.\" For certain classes of kernels, such as Gaussian RBF ones, the corresponding algorithm can be shown to be equivalent to the above one [3]. \n\n3 THEORY \n\nIn this section, we show that the parameter ν characterizes the fractions of SVs and outliers (Proposition 1). Following that, we state a robustness result for the soft margin (Proposition 2) and error bounds (Theorem 5). Further results and proofs are reported in the full version of the present paper [3]. We will use italic letters to denote the feature space images of the corresponding patterns in input space, i.e. x_i := Φ(x_i). \n\nProposition 1 Assume the solution of (4) satisfies ρ ≠ 0. The following statements hold: \n(i) ν is an upper bound on the fraction of outliers. \n(ii) ν is a lower bound on the fraction of SVs. \n(iii) Suppose the data were generated independently from a distribution P(x) which does not contain discrete components. Suppose, moreover, that the kernel is analytic and non-constant. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of outliers. \n\nThe proof is based on the constraints of the dual problem, using the fact that outliers must have Lagrange multipliers at the upper bound. \n\nProposition 2 Local movements of outliers parallel to w do not change the hyperplane. \n\nWe now move on to the subject of generalization. Our goal is to bound the probability that a novel point drawn from the same underlying distribution lies outside of the estimated region by a certain margin. We start by introducing a common tool for measuring the capacity of a class ℱ of functions that map X to ℝ. \n\nDefinition 3 Let (X, d) be a pseudo-metric space,¹ let A be a subset of X and ε > 0. 
A set B ⊆ X is an ε-cover for A if, for every a ∈ A, there exists b ∈ B such that d(a, b) ≤ ε. The ε-covering number of A, N_d(ε, A), is the minimal cardinality of an ε-cover for A (if there is no such finite cover then it is defined to be ∞). \n\n¹ i.e. with a distance function that differs from a metric in that it is only semidefinite \n\nThe idea is that B should be finite but approximate all of A with respect to the pseudometric d. We will use the l∞ distance over a finite sample X = (x_1, …, x_ℓ) for the pseudo-metric in the space of functions, d_X(f, g) = max_{i ∈ [ℓ]} |f(x_i) − g(x_i)|. Let N(ε, ℱ, ℓ) = sup_{X ∈ X^ℓ} N_{d_X}(ε, ℱ). Below, logarithms are to base 2. \n\nTheorem 4 Consider any distribution P on X and any θ ∈ ℝ. Suppose x_1, …, x_ℓ are generated i.i.d. from P. Then with probability 1 − δ over such an ℓ-sample, if we find f ∈ ℱ such that f(x_i) ≥ θ + γ for all i ∈ [ℓ], \n\nP{x : f(x) < θ − γ} ≤ (2/ℓ)(k + log(ℓ²/δ)), \n\nwhere k = ⌈log N(γ, ℱ, 2ℓ)⌉. \n\nWe now consider the possibility that for a small number of points f(x_i) fails to exceed θ + γ. This corresponds to having a non-zero slack variable ξ_i in the algorithm, where we take θ + γ = ρ/‖w‖ and use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. Let f be a real valued function on a space X. Fix θ ∈ ℝ. For x ∈ X, define \n\nd(x, f, γ) = max{0, θ + γ − f(x)}. \n\nSimilarly for a training sequence X, we define D(X, f, γ) = Σ_{x ∈ X} d(x, f, γ). \n\nTheorem 5 Fix θ ∈ ℝ. Consider a fixed but unknown probability distribution P on the input space X and a class of real valued functions ℱ with range [a, b). 
Then with probability 1 − δ over randomly drawn training sequences X of size ℓ, for all γ > 0 and any f ∈ ℱ, \n\nP{x : f(x) < θ − γ and x ∉ X} ≤ (2/ℓ)(k + log(ℓ²/δ)), \n\nwhere k = ⌈log N(γ/2, ℱ, 2ℓ) + (64(b − a)D(X, f, γ)/γ²) log(eℓ/(8D(X, f, γ))) log(32ℓ(b − a)²/γ²)⌉. \n\nThe theorem bounds the probability of a new point falling in the region for which f(x) has value less than θ − γ, this being the complement of the estimate for the support of the distribution. The choice of γ gives a trade-off between the size of the region over which the bound holds (increasing γ increases the size of the region) and the size of the probability with which it holds (increasing γ decreases the size of the log covering numbers). \n\nThe result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the log covering numbers (which can be bounded by the fat shattering dimension at scale proportional to γ) and the number of training examples, plus a factor involving the 1-norm of the slack variables. It is stronger than related results given by [1], since their bound involves the square root of the ratio of the Pollard dimension (the fat shattering dimension when γ tends to 0) and the number of training examples. \n\nThe output of the algorithm described in Sec. 2 is a function f(x) = Σ_i α_i k(x_i, x) which is greater than or equal to ρ − ξ_i on example x_i. Though non-linear in the input space, this function is in fact linear in the feature space defined by the kernel k. At the same time the 2-norm of the weight vector is given by B = √(αᵀKα), and so we can apply the theorem with the function class ℱ being those linear functions in the feature space with 2-norm bounded by B. 
If we assume that θ is fixed, then γ = ρ − θ; hence the support of the distribution is the set {x : f(x) ≥ θ − γ = 2θ − ρ}, and the bound gives the probability of a randomly generated point falling outside this set, in terms of the log covering numbers of the function class ℱ and the sum of the slack variables ξ_i. Since the log covering numbers at scale γ/2 of the class ℱ can be bounded by O((B²/γ²) log² ℓ), this gives a bound in terms of the 2-norm of the weight vector. \n\nIdeally, one would like to allow θ to be chosen after the value of ρ has been determined, perhaps as a fixed fraction of that value. This could be obtained by another level of structural risk minimisation over the possible values of ρ or at least a mesh of some possible values. This result is beyond the scope of the current preliminary paper, but the form of the result would be similar to Theorem 5, with larger constants and log factors. \n\nWhilst it is premature to give specific theoretical recommendations for practical use yet, one thing is clear from the above bound. To generalize to novel data, the decision function to be used should employ a threshold η·ρ, where η < 1 (this corresponds to a nonzero γ). \n\n4 EXPERIMENTS \n\nWe apply the method to artificial and real-world data. Figure 1 displays 2-D toy examples, and shows how the parameter settings influence the solution. \n\nNext, we describe an experiment on the USPS dataset of handwritten digits. The database contains 9298 digit images of size 16 × 16 = 256; the last 2007 constitute the test set. We trained the algorithm, using a Gaussian kernel (2) of width c = 0.5 · 256 (a common value for SVM classifiers on that data set, cf. 
[2]), on the test set and used it to identify outliers; it is folklore in the community that the USPS test set contains a number of patterns which are hard or impossible to classify, due to segmentation errors or mislabelling. In the experiment, we augmented the input patterns by ten extra dimensions corresponding to the class labels of the digits. The rationale for this is that if we disregarded the labels, there would be no hope to identify mislabelled patterns as outliers. Fig. 2 shows the 20 worst outliers for the USPS test set. Note that the algorithm indeed extracts patterns which are very hard to assign to their respective classes. In the experiment, which took 36 seconds on a Pentium II running at 450 MHz, we used a ν value of 5%. \n\nFigure 1: First two pictures: A single-class SVM applied to two toy problems; ν = c = 0.5, domain: [−1, 1]². Note how in both cases, at least a fraction of ν of all examples is in the estimated region (cf. table). The large value of ν causes the additional data points in the upper left corner to have almost no influence on the decision function. For smaller values of ν, such as 0.1 (third picture), the points cannot be ignored anymore. Alternatively, one can force the algorithm to take these 'outliers' into account by changing the kernel width (2): in the fourth picture, using c = 0.1, ν = 0.5, the data is effectively analyzed on a different length scale which leads the algorithm to consider the outliers as meaningful points. \n\n[Figure 2: two rows of outlier digit images; image data not reproducible in text.] \n\nFigure 2: Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of the sgn in the decision function). 
The outputs (for convenience in units of 10⁻⁵) are written underneath each image in italics, the (alleged) class labels are given in bold face. Note that most of the examples are \"difficult\" in that they are either atypical or even mislabelled. \n\n5 DISCUSSION \n\nOne could view the present work as an attempt to provide an algorithm which is in line with Vapnik's principle never to solve a problem which is more general than the one that one is actually interested in. E.g., in situations where one is only interested in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects. \n\nMathematically speaking, a density will only exist if the underlying probability measure possesses an absolutely continuous distribution function. The general problem of estimating the measure for a large class of sets, say the sets measurable in Borel's sense, is not solvable (for a discussion, see e.g. [8]). Therefore we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator accomplishing this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region, and then estimates a region with the desired property. Often, there will be many such regions; the solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated to the kernel. This, of course, implies that the measure of smallness in this sense depends on the kernel used, in a way that is no different to any other method that regularizes in a feature space. 
A similar problem, however, appears in density estimation already when done in input space. Let p denote a density on X. If we perform a (nonlinear) coordinate transformation in the input domain X, then the density values will change; loosely speaking, what remains constant is p(x) · dx, while dx is transformed, too. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions automatically change accordingly. \n\nAn attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense which depends on the specific kernel used [3]. \n\nThe main inspiration for our approach stems from the earliest work of Vapnik and collaborators. They proposed an algorithm for characterizing a set of unlabelled data points by separating it from the origin using a hyperplane [9]. However, they quickly moved on to two-class classification problems, both in terms of algorithms and in the theoretical development of statistical learning theory which originated in those days. From an algorithmic point of view, we can identify two shortcomings of the original approach which may have caused research in this direction to stop for more than three decades. Firstly, the original algorithm was limited to linear decision rules in input space; secondly, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe: a generic dataset need not be separable from the origin by a hyperplane in input space. The two modifications that we have incorporated dispose of these shortcomings. 
Firstly, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space, and thereby increases the chances of separability from the origin. In particular, using a Gaussian kernel (2), such a separation exists for any data set x_1, …, x_ℓ: to see this, note that k(x_i, x_j) > 0 for all i, j, thus all dot products are positive, implying that all mapped patterns lie inside the same orthant. Moreover, since k(x_i, x_i) = 1 for all i, they have unit length. Hence they are separable from the origin. The second modification allows for the possibility of outliers. We have incorporated this 'softness' of the decision rule using the ν-trick and thus obtained a direct handle on the fraction of outliers. \n\nWe believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a Gaussian kernel) have to be tackled. It is our expectation that the theoretical results which we have briefly outlined in this paper will provide a foundation for this formidable task. \n\nAcknowledgement. Part of this work was supported by the ARC and the DFG (# Ja 37919-1), and done while BS was at the Australian National University and GMD FIRST. AS is supported by a grant of the Deutsche Forschungsgemeinschaft (Sm 62/1-1). Thanks to S. Ben-David, C. Bishop, C. Schnörr, and M. Tipping for helpful discussions. \n\nReferences \n\n[1] S. Ben-David and M. Lindenbaum. Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55:171-182, 1997. \n\n[2] B. Schölkopf, C. Burges, and V. Vapnik. 
Extracting support data for a given task. In U. M. Fayyad and R. Uthurusamy, editors, Proceedings, First International Conference on Knowledge Discovery & Data Mining. AAAI Press, Menlo Park, CA, 1995. \n\n[3] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. TR MSR 99-87, Microsoft Research, Redmond, WA, 1999. \n\n[4] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 327-352. MIT Press, Cambridge, MA, 1999. \n\n[5] B. Schölkopf, R. Williamson, A. Smola, and J. Shawe-Taylor. Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, and N. Tishby, editors, Unsupervised Learning, Dagstuhl-Seminar-Report 235, pages 19-20, 1999. \n\n[6] A. Smola, R. C. Williamson, S. Mika, and B. Schölkopf. Regularized principal manifolds. In Computational Learning Theory: 4th European Conference, volume 1572 of Lecture Notes in Artificial Intelligence, pages 214-229. Springer, 1999. \n\n[7] D. M. J. Tax and R. P. W. Duin. Data domain description by support vectors. In M. Verleysen, editor, Proceedings ESANN, pages 251-256, Brussels, 1999. D Facto. \n\n[8] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998. \n\n[9] V. Vapnik and A. Lerner. Pattern recognition using generalized portraits. Avtomatika i Telemekhanika, 24:774-780, 1963.", "award": [], "sourceid": 1723, "authors": [{"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}, {"given_name": "Robert", "family_name": "Williamson", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "John", "family_name": "Shawe-Taylor", "institution": null}, {"given_name": "John", "family_name": "Platt", "institution": null}]}