{"title": "Robust Novelty Detection with Single-Class MPM", "book": "Advances in Neural Information Processing Systems", "page_first": 929, "page_last": 936, "abstract": null, "full_text": "Robust Novelty Detection with \n\nSingle-Class MPM \n\nGert R.G. Lanckriet \nEECS, V.C. Berkeley \ngert@eecs.berkeley. edu \n\nLaurent EI Ghaoui \nEECS, V.C. Berkeley \n\nelghaoui@eecs.berkeley.edu \n\nMichael I. Jordan \nComputer Science and \nStatistics, V.C. Berkeley \njordan@cs. berkeley. edu \n\nAbstract \n\nthe \"single-class minimax probabil(cid:173)\n\nIn this paper we consider the problem of novelty detection, pre(cid:173)\nsenting an algorithm that aims to find a minimal region in input \nspace containing a fraction 0: of the probability mass underlying \na data set. This algorithm-\nity machine (MPM)\" -\nis built on a distribution-free methodology \nthat minimizes the worst-case probability of a data point falling \noutside of a convex set, given only the mean and covariance matrix \nof the distribution and making no further distributional assump(cid:173)\ntions. We present a robust approach to estimating the mean and \ncovariance matrix within the general two-class MPM setting, and \nshow how this approach specializes to the single-class problem. We \nprovide empirical results comparing the single-class MPM to the \nsingle-class SVM and a two-class SVM method. \n\n1 \n\nIntroduction \n\nNovelty detection is an important unsupervised learning problem in which test data \nare to be judged as having been generated from the same or a different process as \nthat which generated the training data. In essence, we wish to estimate a quantile \nof the distribution underlying the training data: for a fixed constant 0: E (0,1], \nwe attempt to find a (small) set Q such that Pr{y E Q} = 0:, where, for novelty \ndetection, 0: is typically chosen near one (Scholkopf and Smola, 2001 , Ben-David \nand Lindenbaum, 1997). 
This formulation of novelty detection in terms of quantile estimation is to be compared to the (costly) approach of estimating a density based on the training data and thresholding the estimated density.

Although of reduced complexity when compared to density estimation, multivariate quantile estimation is still a challenging problem, necessitating computationally efficient methods for representing and manipulating sets in high dimensions. A significant step forward in this regard was provided by Schölkopf and Smola (2001), who treated novelty detection as a "single-class" classification problem in which data are separated from the origin in feature space. This allowed them to invoke the computationally efficient technology of support vector machines.

In the current paper we adopt the "single-class" perspective of Schölkopf and Smola (2001), but make use of a different kernel-based technique for finding discriminant boundaries: the minimax probability machine (MPM) of Lanckriet et al. (2002). To see why the MPM should be particularly appropriate for quantile estimation, consider the following theorem, which lies at the core of the MPM. Given a random vector y with mean ȳ and covariance matrix Σ_y, and given arbitrary constants a ≠ 0 and b such that aᵀȳ ≤ b, we have (for a proof, see Lanckriet et al., 2002):

  inf_{y ~ (ȳ, Σ_y)} Pr{aᵀy ≤ b} ≥ α  ⇔  b − aᵀȳ ≥ κ(α) √(aᵀ Σ_y a),   (1)

where κ(α) = √(α/(1 − α)) and α ∈ [0, 1). Note that this is a "distribution-free" result: the infimum is taken over all distributions for y having mean ȳ and covariance matrix Σ_y (assumed to be positive definite for simplicity). While Lanckriet et al.
(2002) were able to exploit this theorem to design a binary classification algorithm, it is clear that the theorem provides even more direct leverage on the "single-class" problem: it directly bounds the probability of an observation falling outside of a given set.

There is one important aspect of the MPM formulation that needs further consideration, however, if we wish to apply the approach to the novelty detection problem. In particular, ȳ and Σ_y are usually unknown in practice and must be estimated from data. In the classification setting, Lanckriet et al. (2002) successfully made use of plug-in estimates of these quantities; in some sense the bias incurred by the use of plug-in estimates in the two classes appears to "cancel" and has diminished overall impact on the discriminant boundary. In the one-class setting, however, the uncertainty due to estimation of ȳ and Σ_y translates directly into movement of the discriminant boundary and cannot be neglected.

We begin in Section 2 by revisiting the MPM and showing how to account for uncertainty in the means and covariance matrices within the framework of robust estimation. Section 3 then applies this robust estimation approach to the single-class MPM problem. We present empirical results in Section 4 and our conclusions in Section 5.

2 Robust Minimax Probability Machine (R-MPM)

Let x, y ∈ ℝⁿ denote random vectors in a binary classification problem, modelling data from each of two classes, with means and covariance matrices given by x̄, ȳ ∈ ℝⁿ and Σ_x, Σ_y ∈ ℝⁿˣⁿ (both symmetric and positive semidefinite), respectively. We wish to determine a hyperplane H(a, b) = {z | aᵀz = b}, where a ∈ ℝⁿ\{0} and b ∈ ℝ, that maximizes the worst-case probability α that future data points are classified correctly with respect to all distributions having these means and covariance matrices:

  max_{α, a≠0, b} α  s.t.
  inf_{x ~ (x̄, Σ_x)} Pr{aᵀx ≥ b} ≥ α,   (2)
  inf_{y ~ (ȳ, Σ_y)} Pr{aᵀy ≤ b} ≥ α,

where x ~ (x̄, Σ_x) refers to the class of distributions that have mean x̄ and covariance Σ_x, but are otherwise arbitrary; likewise for y. The worst-case probability of misclassification is explicitly obtained and given by 1 − α.

Solving this optimization problem involves converting the probabilistic constraints in Eq. (2) into deterministic constraints, a step which is achieved via the theorem referred to earlier in Eq. (1). This eventually leads to the following convex optimization problem, whose solution determines an optimal hyperplane H(a, b) (Lanckriet et al., 2002):

  κ*⁻¹ := min_a √(aᵀ Σ_x a) + √(aᵀ Σ_y a)  s.t.  aᵀ(x̄ − ȳ) = 1,   (3)

where b is set to the value b* = a*ᵀx̄ − κ* √(a*ᵀ Σ_x a*), with a* an optimal solution of Eq. (3). The optimal worst-case misclassification probability is obtained via 1 − α* = 1/(1 + κ*²). Once an optimal hyperplane is found, classification of a new data point z_new is done by evaluating sign(a*ᵀ z_new − b*): if this is +1, z_new is classified as belonging to class x; otherwise z_new is classified as belonging to class y.

While in our earlier work we simply computed sample-based estimates of means and covariance matrices and plugged them into the MPM optimization problem in Eq. (3), we now show how to treat this estimation problem within the framework of robust optimization. Assume the mean and covariance matrix of each class are unknown but lie within specified convex sets: (x̄, Σ_x) ∈ 𝒳, with 𝒳 ⊂ ℝⁿ × {M ∈ ℝⁿˣⁿ | M = Mᵀ, M ⪰ 0}, and (ȳ, Σ_y) ∈ 𝒴, with 𝒴 ⊂ ℝⁿ × {M ∈ ℝⁿˣⁿ | M = Mᵀ, M ⪰ 0}. We now want the probabilistic guarantees in Eq. (2) to be robust against variations of the mean and covariance matrix within these sets:

  max_{α, a≠0, b} α  s.t.  inf_{x ~ (x̄, Σ_x)} Pr{aᵀx ≥ b} ≥ α  ∀(x̄, Σ_x) ∈ 𝒳,   (4)
                           inf_{y ~ (ȳ, Σ_y)} Pr{aᵀy ≤ b} ≥ α  ∀(ȳ, Σ_y) ∈ 𝒴.

In other words, we would like to guarantee a worst-case misclassification probability for all distributions which have unknown-but-bounded mean and covariance matrix, but which are otherwise arbitrary. The complexity of this problem obviously depends on the structure of the uncertainty sets 𝒳, 𝒴. We now consider a specific choice for 𝒳 and 𝒴, motivated both statistically and numerically:

  𝒳 = {(x̄, Σ_x) : (x̄ − x̄⁰)ᵀΣ_x⁻¹(x̄ − x̄⁰) ≤ ν²,  ‖Σ_x − Σ_x⁰‖_F ≤ ρ},
  𝒴 = {(ȳ, Σ_y) : (ȳ − ȳ⁰)ᵀΣ_y⁻¹(ȳ − ȳ⁰) ≤ ν²,  ‖Σ_y − Σ_y⁰‖_F ≤ ρ},   (5)

with x̄⁰, Σ_x⁰ the "nominal" mean and covariance estimates and with ν, ρ ≥ 0 fixed and, for simplicity, assumed equal for 𝒳 and 𝒴. Section 4 discusses how their values can be determined. The matrix norm is the Frobenius norm: ‖A‖_F² = Tr(AᵀA).

Our model for the uncertainty in the mean assumes the mean of class y belongs to an ellipsoid (a convex set) centered around ȳ⁰, with shape determined by the (unknown) Σ_y. This is motivated by the standard statistical approach to estimating a region of confidence based on Laplace approximations to a likelihood function. The covariance matrix belongs to a matrix norm ball (a convex set) centered around Σ_y⁰. This uncertainty model is perhaps less classical from a statistical viewpoint, but it will lead to a regularization term of a classical form.

In order to solve Eq. (4), we apply Eq. (1) and notice that

  b − aᵀȳ ≥ κ(α)√(aᵀΣ_y a)  ∀(ȳ, Σ_y) ∈ 𝒴  ⇔  b − max_{(ȳ,Σ_y)∈𝒴} aᵀȳ ≥ κ(α) max_{(ȳ,Σ_y)∈𝒴} √(aᵀΣ_y a),

where the right-hand side guarantees the constraint for the worst-case estimate of the mean and covariance matrix within the bounded set 𝒴. For given a and ȳ⁰:

  max_{ȳ : (ȳ−ȳ⁰)ᵀΣ_y⁻¹(ȳ−ȳ⁰) ≤ ν²} aᵀȳ = aᵀȳ⁰ + ν√(aᵀΣ_y a).   (6)

Indeed, the Lagrangian is L(ȳ, λ) = −aᵀȳ + λ((ȳ − ȳ⁰)ᵀΣ_y⁻¹(ȳ − ȳ⁰) − ν²) and is to be maximized with respect to λ ≥ 0 and minimized with respect to ȳ. At the optimum, we have ∂L/∂ȳ = 0 and ∂L/∂λ = 0, leading to ȳ = ȳ⁰ + (ν/√(aᵀΣ_y a)) Σ_y a and λ = √(aᵀΣ_y a)/(2ν), which eventually leads to Eq. (6). For given a and Σ_y⁰:

  max_{Σ_y : ‖Σ_y − Σ_y⁰‖_F ≤ ρ} aᵀΣ_y a = aᵀ(Σ_y⁰ + ρI_n)a,   (7)

where I_n is the n × n identity matrix. Indeed, without loss of generality, we can let Σ_y be of the form Σ_y = Σ_y⁰ + ρΔΣ_y. We then obtain

  max_{Σ_y : ‖Σ_y − Σ_y⁰‖_F ≤ ρ} aᵀΣ_y a = aᵀΣ_y⁰a + ρ max_{ΔΣ_y : ‖ΔΣ_y‖_F ≤ 1} aᵀΔΣ_y a = aᵀΣ_y⁰a + ρ aᵀa,   (8)

using the Cauchy-Schwarz inequality and compatibility of the Frobenius matrix norm and the Euclidean vector norm:

  aᵀΔΣ_y a ≤ ‖a‖₂ ‖ΔΣ_y a‖₂ ≤ ‖a‖₂ ‖ΔΣ_y‖_F ‖a‖₂ ≤ ‖a‖₂²,

because ‖ΔΣ_y‖_F ≤ 1. For ΔΣ_y = I_n, this upper bound is attained and we get Eq. (7). Combining this with Eq. (6) leads to the robust version of Eq. (1):

  inf_{y ~ (ȳ, Σ_y)} Pr{aᵀy ≤ b} ≥ α  ∀(ȳ, Σ_y) ∈ 𝒴  ⇔  b − aᵀȳ⁰ ≥ (κ(α) + ν)√(aᵀ(Σ_y⁰ + ρI_n)a).   (9)

Applying this result to Eq. (4) thus shows that the optimal robust minimax probability classifier for 𝒳, 𝒴 given by Eq. (5) can be obtained by solving problem Eq. (3), with Σ_x = Σ_x⁰ + ρI_n and Σ_y = Σ_y⁰ + ρI_n. If κ*⁻¹ is the optimal value of that problem, the corresponding worst-case misclassification probability is

  1 − α* = 1 / (1 + max(0, κ* − ν)²).

With only uncertainty in the mean (ρ = 0), the robust hyperplane is the same as the non-robust one; the only change is the increase in the worst-case misclassification probability. Uncertainty in the covariance matrix adds a term ρI_n to the covariance matrices, which can be interpreted as a regularization term. This affects the hyperplane and increases the worst-case misclassification probability as well. If there is too much uncertainty in the mean (i.e., κ* < ν), the robust version is not feasible: no hyperplane can be found that separates the two classes in the robust minimax probabilistic sense, and the worst-case misclassification probability is 1 − α* = 1.
This robust approach can be readily generalized to allow nonlinear decision boundaries via the use of Mercer kernels (Lanckriet et al., 2002).

3 Single-class MPM for robust novelty detection

We now turn to the quantile estimation problem. Recall that for α ∈ (0, 1], we wish to find a small region Q such that Pr{x ∈ Q} = α. Let us consider data x ~ (x̄, Σ_x) and let us focus (for now) on the linear case where Q is a half-space not containing the origin.

We seek a half-space Q(a, b) = {z | aᵀz ≥ b}, with a ∈ ℝⁿ\{0} and b ∈ ℝ, not containing 0, such that with probability at least α, the data lie in Q, for every distribution having mean x̄ and covariance matrix Σ_x. We assume again that the true x̄, Σ_x are unknown but bounded in a set 𝒳 as specified in Eq. (5):

  inf_{x ~ (x̄, Σ_x)} Pr{aᵀx ≥ b} ≥ α   ∀(x̄, Σ_x) ∈ 𝒳.

We want the region Q to be tight, so we maximize its Mahalanobis distance (with respect to Σ_x) to the origin in a robust way, i.e., for the worst-case estimate of Σ_x, the matrix that gives us the smallest Mahalanobis distance:

  max_{a≠0, b} min_{(x̄,Σ_x)∈𝒳} b/√(aᵀΣ_x a)  s.t.  inf_{x ~ (x̄, Σ_x)} Pr{aᵀx ≥ b} ≥ α  ∀(x̄, Σ_x) ∈ 𝒳.   (10)

Note that Q(a, b) does not contain 0 if and only if b > 0. Also, the optimization problem in Eq. (10) is positively homogeneous in (a, b). Thus, without loss of generality, we can set b = 1 in problem Eq. (10). Furthermore, we can use Eq. (7) and Eq. (9) and get (where the superscript 0 for the estimates has been omitted):

  min_a √(aᵀ(Σ_x + ρI_n)a)  s.t.  aᵀx̄ − 1 ≥ (κ(α) + ν)√(aᵀ(Σ_x + ρI_n)a),   (11)

where the condition a ≠ 0 can be omitted, since the constraint never holds for a = 0. Again, we obtain a (convex) second-order cone programming problem. The worst-case probability of occurrence outside region Q is given by 1 − α. Notice that the particular choice of α ∈ (0, 1] must be feasible, i.e.,

  ∃ a : aᵀx̄ − 1 ≥ (κ(α) + ν)√(aᵀ(Σ_x + ρI_n)a).
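Given the explicit solution a* = (Σ_x + ρI_n)⁻¹x̄ / (ζ² − (κ(α) + ν)ζ), b* = 1 derived next, with ζ² = x̄ᵀ(Σ_x + ρI_n)⁻¹x̄, both the feasibility check and the resulting novelty detector fit in a few lines (a sketch assuming numpy; the toy data and names are ours):

```python
import numpy as np

def kappa(alpha):
    # kappa(alpha) = sqrt(alpha / (1 - alpha)), as in Eq. (1)
    return np.sqrt(alpha / (1.0 - alpha))

def single_class_mpm(x_bar, sigma, alpha, nu=0.0, rho=0.0):
    """Single-class MPM half-space Q = {z : a^T z >= 1}; returns a*,
    or None when the chosen alpha violates the feasibility condition."""
    n = len(x_bar)
    S = sigma + rho * np.eye(n)           # regularized covariance Sigma_x + rho*I
    S_inv_x = np.linalg.solve(S, x_bar)
    zeta = np.sqrt(x_bar @ S_inv_x)
    denom = zeta ** 2 - (kappa(alpha) + nu) * zeta
    if denom <= 0:                        # infeasible: zeta <= kappa(alpha) + nu
        return None
    return S_inv_x / denom                # a*, with b* = 1

# toy data away from the origin; plug-in mean/covariance estimates
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(500, 2))
a = single_class_mpm(X.mean(axis=0), np.cov(X.T), alpha=0.9, nu=0.1, rho=0.1)
inside = X @ a >= 1.0                     # points z with a^T z < 1 are outliers
```

The worst-case guarantee says at most a fraction 1 − α of the mass falls outside Q for any distribution with the given moments, so on well-behaved data the empirical fraction inside Q is typically much larger than α.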
\n\nFor p -::/:- 0, ~x + pIn is certainly positive definite and the halfspace is unique. \nFurthermore, it can be determined explicitly. To see this, we write Eq. (11) as: \n\nmin 11(~x + pIn? /2 aI12 \n\na \n\ns.t. aTx 2:: 1 + (,..(a) + v) 11(~x + pIn)1/2 a I12 \n\n(12) \n\nDecomposing a as A(~x + pIn)-lx + z, where the variable z satisfies zT X = 0, \nwe easily obtain that at the optimum, z = O. \nIn other words, the optimal a is \nparallel to x, in the form a = A(~x + pIn) - lx, and the problem reduces to the \none-dimensional problem: \nmIn IAIII(~x+pIn) -1/2 xI12 : AxT (~x+pIn)-lx 2:: l+(,..(a)+v) 11(~x+pIn)-1/2xIl2IAI\u00b7 \nThe constraint implies that A 2:: 0, hence the problem reduces to \n\nmin A : A ((2 - (,..(a) + v)() 2:: l. \n>.::::0 \n\n(13) \n\nwith (2 = xT(~x + pIn) - lx > 0 (because Eq. (12) implies x -::/:- 0). Because A 2:: 0, \n(,..(a) + v)( 2:: 0, which is nothing other than the \nthis can only be satisfied if (2 -\nfeasibility condition for a: \n\nIf this is fulfilled, the optimization in Eq. (13) is feasible and boils down to: \n\n. \nmm A s.t. A 2:: (2 (() \n>.::::0 \n\n,.. a + v \n\n)( \n\n1 \n\n-\n\nIt's easy to see that the optimal A is given by A* = 1/((2 - (,..(a) + v)(), yielding: \n\na* = (~x + pIn)-lX, b* = 1, \n\n(2 _ (,..(a) + v)( \n\nwith (= /xT(~x + pIn) -l X. \n\nV \n\n(14) \n\nNotice that the uncertainty in the covariance matrix ~x leads to the typical, well(cid:173)\nknown regularization for inverting this matrix. If the choice of a is not feasible or \nif x = 0 (in this case, no a E (0,1] will be feasible), Eq. (10) has no solution. \n\n\fFuture points z for which a; z :::; b* can then be considered as outliers with respect \nto the region Q , with worst-case probability of occurrence outside Q given by 1- 0:. \n\nOne can obtain a nonlinear region Q in ]Rn for the single-class case, by mapping \nthe data into a feature space ]Rf: x f-t