{"title": "Outlier Detection with One-class Kernel Fisher Discriminants", "book": "Advances in Neural Information Processing Systems", "page_first": 1169, "page_last": 1176, "abstract": null, "full_text": "Outlier Detection with One-class Kernel Fisher Discriminants\n\nVolker Roth\nETH Zurich, Institute of Computational Science\nHirschengraben 84, CH-8092 Zurich\nvroth@inf.ethz.ch\n\nAbstract\n\nThe problem of detecting \"atypical objects\" or \"outliers\" is one of the classical topics in (robust) statistics. Recently, it has been proposed to address this problem by means of one-class SVM classifiers. The main conceptual shortcoming of most one-class approaches, however, is that in a strict sense they are unable to detect outliers, since the expected fraction of outliers has to be specified in advance. The method presented in this paper overcomes this problem by relating kernelized one-class classification to Gaussian density estimation in the induced feature space. Having established this relation, it is possible to identify \"atypical objects\" by quantifying their deviations from the Gaussian model. For RBF kernels it is shown that the Gaussian model is \"rich enough\" in the sense that it asymptotically provides an unbiased estimator for the true density. In order to overcome the inherent model selection problem, a cross-validated likelihood criterion for selecting all free model parameters is applied.\n\n1 Introduction\n\nA one-class classifier attempts to find a separating boundary between a data set and the rest of the feature space. 
A natural application of such a classifier is estimating a contour line of the underlying data density for a certain quantile value. Such contour lines may be used to separate \"typical\" objects from \"atypical\" ones. Objects that look \"sufficiently atypical\" are often considered to be outliers, for which one rejects the hypothesis that they come from the same distribution as the majority of the objects. Thus, a useful application scenario would be to find a boundary which separates the jointly distributed objects from the outliers. Finding such a boundary defines a classification problem in which, however, sufficiently many labeled samples are usually available from one class only. Typically, no labeled samples from the outlier class are available at all, and it is even unknown if there are any outliers present.\n\nIt is interesting to notice that the approach of directly estimating a boundary, as opposed to first estimating the whole density, follows one of the main ideas in learning theory, which states that one should avoid solving an intermediate problem that is too hard. While this line of reasoning seems to be appealing from a theoretical point of view, it leads to a severe problem in practical applications: when it comes to detecting outliers, the restriction to estimating only a boundary makes it impossible to derive a formal characterization of outliers without prior assumptions on the expected fraction of outliers or even on their distribution. In practice, however, any such prior assumptions can hardly be justified. The fundamental problem of the one-class approach lies in the fact that outlier detection is a (partially) unsupervised task which has been \"squeezed\" into a classification framework. 
The missing part of information has been shifted to prior assumptions which can probably be justified only if the solution of the original problem were known in advance.\n\nThis paper aims at overcoming this problem by linking kernel-based one-class classifiers to Gaussian density estimation in the induced feature space. Objects which have an \"unexpectedly\" high Mahalanobis distance to the sample mean are considered as \"atypical objects\" or outliers. A particular Mahalanobis distance is considered to be unexpected if it is very unlikely to observe an object that far away from the mean vector in a random sample of a certain size. We will formalize this concept in section 3 by way of fitting linear models in quantile-quantile plots. The main technical ingredient of our method is the one-class kernel Fisher discriminant classifier (OC-KFD), for which the relation to Gaussian density estimation is shown. From the classification side, the OC-KFD-based model inherits the simple complexity control mechanism by using regularization techniques. The explicit relation to Gaussian density estimation, on the other hand, makes it possible to formalize the notion of atypical objects by observing deviations from the Gaussian model. It is clear that these deviations will heavily depend on the chosen model parameters. In order to derive an objective characterization of atypical objects it is, thus, necessary to select a suitable model in advance. This model-selection problem is overcome by using a likelihood-based cross-validation framework for inferring the free parameters.\n\n2 Gaussian density estimation and one-class LDA\n\nLet $X$ denote the $n \\times d$ data matrix which contains the $n$ input vectors $x_i \\in \\mathbb{R}^d$ as rows. It has been proposed to estimate a one-class decision boundary by separating the dataset from the origin [12], which effectively coincides with replicating all $x_i$ with opposite sign and separating $X$ and $-X$. 
Typically, a $\\nu$-SVM classifier with RBF kernel function is used. The parameter $\\nu$ bounds the expected fraction of outliers and must be selected a priori. The method proposed here follows the same idea of separating the data from their negatively replicated counterparts. Instead of an SVM, however, a Kernel Fisher Discriminant (KFD) classifier is used [7, 10]. The latter has the advantage that it is closely related to Gaussian density estimation in the induced feature space. By making this relation explicit, outliers can be identified without specifying the expected fraction of outliers in advance. We start with a linear discriminant analysis (LDA) model, and then kernels will be introduced.\n\nLet $\\tilde{X} = (X, -X)^\\top$ denote the augmented $(2n \\times d)$ data matrix which also contains the negative samples $-x_i$. Without loss of generality we assume that the sample mean $\\mu_+ \\equiv \\sum_i x_i > 0$, so that the sample means of the positive data and the negative data differ: $\\mu_+ \\neq \\mu_-$. Let us now assume that our data are realizations of a normally distributed random variable in $d$ dimensions: $X \\sim N_d(\\mu, \\Sigma)$. Denoting by $X_c$ the centered data matrix, the estimator for $\\Sigma$ takes the form $\\hat{\\Sigma} \\equiv \\Sigma_W = (1/n) X_c^\\top X_c$.\n\nThe LDA solution $\\beta^*$ maximizes the between-class scatter $\\beta^{*\\top} \\Sigma_B \\beta^*$ with $\\Sigma_B = \\mu_+ \\mu_+^\\top + \\mu_- \\mu_-^\\top$ under the constraint on the within-class scatter $\\beta^{*\\top} \\Sigma_W \\beta^* = 1$. Note that in our special case with $\\tilde{X} = (X, -X)^\\top$ the usual pooled within-class matrix $\\Sigma_W$ simply reduces to the above defined $\\Sigma_W = (1/n) X_c^\\top X_c$. Denoting by $\\tilde{y} = (2, \\dots, 2, -2, \\dots, -2)^\\top$ a $2n$-indicator vector for class membership in class \"+\" or \"-\", it is well known (see e.g. [1]) that the LDA solution (up to a scaling factor) can be found by minimizing a least-squares functional: $\\hat{\\beta} = \\arg\\min_\\beta \\lVert \\tilde{y} - \\tilde{X} \\beta \\rVert^2$. In [3] a slightly more general form of the problem is described where the above functional is minimized under a constraint on $\\beta$, which in the simplest case amounts to adding a term $\\gamma \\beta^\\top \\beta$ to the functional. Such a ridge regression model assumes a penalized total covariance of the form $\\Sigma_T = (1/(2n)) \\cdot \\tilde{X}^\\top \\tilde{X} + \\gamma I = (1/n) \\cdot X^\\top X + \\gamma I$. Defining an $n$-vector of ones $y = (1, \\dots, 1)^\\top$, the solution vector $\\hat{\\beta}$ reads\n\n$$\\hat{\\beta} = (\\tilde{X}^\\top \\tilde{X} + \\gamma I)^{-1} \\tilde{X}^\\top \\tilde{y} = (X^\\top X + \\gamma I)^{-1} X^\\top y. \\qquad (1)$$\n\nAccording to [3], an appropriate scaling factor is defined in terms of the quantity $s^2 = (1/n) \\cdot y^\\top \\hat{y} = (1/n) \\cdot y^\\top X \\hat{\\beta}$, which leads us to the correctly scaled LDA vector $\\beta^* = s^{-1}(1 - s^2)^{-1/2} \\hat{\\beta}$ that fulfills the normalization condition $\\beta^{*\\top} \\Sigma_W \\beta^* = 1$.\n\nOne further derives from [3] that the mean vector of $X$, projected onto the 1-dimensional LDA subspace, has the coordinate value $m_+ = s(1 - s^2)^{-1/2}$, and that the Mahalanobis distance from a vector $x$ to the sample mean $\\mu_+$ is the sum of the squared Euclidean distance in the projected space and an orthogonal distance term:\n\n$$D(x, \\mu_+) = (\\beta^{*\\top} x - m_+)^2 + D_\\perp \\quad \\text{with} \\quad D_\\perp = -(1 - s^2)(\\beta^{*\\top} x)^2 + x^\\top \\Sigma_T^{-1} x. \\qquad (2)$$\n\nNote that it is the term $D_\\perp$ which makes the density estimation model essentially different from OC-classification: while the latter considers only distances in the direction of the projection vector $\\beta^*$, the 
true density model additionally takes into account the distances in the orthogonal subspace.\n\nSince the assumption $X \\sim N_d(\\mu, \\Sigma)$ is very restrictive, we propose to relax it by assuming that we have found a suitable transformation of our input data $\\phi: \\mathbb{R}^d \\to \\mathbb{R}^p$, $x \\mapsto \\phi(x)$, such that the transformed data are Gaussian in $p$ dimensions. If the transformation is carried out implicitly by introducing a Mercer kernel $k(x_i, x_j)$, we arrive at an equivalent problem in terms of the kernel matrix $K = \\Phi \\Phi^\\top$ and the expansion coefficients $\\alpha$:\n\n$$\\hat{\\alpha} = (K + \\gamma I)^{-1} y. \\qquad (3)$$\n\nFrom [11] it follows that the mapped vectors can be represented in $\\mathbb{R}^n$ as $\\phi(x) = K^{-1/2} k(x)$, where $k(x)$ denotes the kernel vector $k(x) = (k(x, x_1), \\dots, k(x, x_n))^\\top$. Finally we derive the following form of the Mahalanobis distances, which again consist of the Euclidean distance in the classification subspace plus an orthogonal term:\n\n$$D(x, \\mu_+) = (\\alpha^{*\\top} k(x) - m_+)^2 - (1 - s^2)(\\alpha^{*\\top} k(x))^2 + n \\Omega(x), \\qquad (4)$$\n\nwhere $\\Omega(x) = \\phi^\\top(x)(\\Phi^\\top \\Phi + \\gamma I)^{-1} \\phi(x) = k^\\top(x)(K + \\gamma I)^{-1} K^{-1} k(x)$, $m_+ = s(1 - s^2)^{-1/2}$, $s^2 = (1/n) \\cdot y^\\top \\hat{y} = (1/n) \\cdot y^\\top K \\hat{\\alpha}$, and $\\alpha^* = s^{-1}(1 - s^2)^{-1/2} \\hat{\\alpha}$.\n\nEquation (4) establishes the desired link between OC-KFD and Gaussian density estimation, since for our outlier detection mechanism only Mahalanobis distances are needed. While it seems to be rather complicated to estimate a density by the above procedure, the main benefit over directly estimating the mean and the covariance lies in the inherent complexity regulation properties of ridge regression. 
Such a complexity control mechanism is of particular importance in highly nonlinear kernel models. Moreover, for ridge-regression models it is possible to analytically calculate the effective degrees of freedom, a quantity that will be of particular interest when it comes to detecting outliers.\n\n3 Detecting outliers\n\nLet us assume that the model is completely specified, i.e. both the kernel function $k(\\cdot, \\cdot)$ and the regularization parameter $\\gamma$ are fixed. The central lemma that helps us to detect outliers can be found in most statistical textbooks:\n\nLemma 1. Let $X$ be a Gaussian random variable $X \\sim N_d(\\mu, \\Sigma)$. Then $\\Delta \\equiv (X - \\mu)^\\top \\Sigma^{-1} (X - \\mu)$ follows a chi-square ($\\chi^2$) distribution on $d$ degrees of freedom.\n\nFor the penalized regression models, it might be more appropriate to use the effective degrees of freedom df instead of $d$ in the above lemma. In the case of one-class LDA with ridge penalties we can easily estimate it as df $= \\operatorname{trace}(X(X^\\top X + \\gamma I)^{-1} X^\\top)$ [8], which for a kernel model translates into df $= \\operatorname{trace}(K(K + \\gamma I)^{-1})$. The intuitive interpretation of the quantity df is the following: denoting by $V$ the matrix of eigenvectors of $K$ and by $\\{\\lambda_i\\}_{i=1}^n$ the corresponding eigenvalues, the fitted values $\\hat{y}$ read\n\n$$\\hat{y} = V \\operatorname{diag}\\{\\delta_i = \\lambda_i/(\\lambda_i + \\gamma)\\} V^\\top y. \\qquad (5)$$\n\nIt follows that compared to the unpenalized case, where all eigenvectors $v_i$ are constantly weighted by 1, the contribution of the $i$-th eigenvector $v_i$ is down-weighted by a factor $\\delta_i/1 = \\delta_i$. If the ordered eigenvalues decrease rapidly, however, the values $\\delta_i$ are either close to zero or close to one, and df determines the number of terms that are \"essentially different\" from zero. The same is true for the orthogonal distance term in eq. (4): note that\n\n$$\\Omega(x) = k^\\top(x)(K + \\gamma I)^{-1} K^{-1} k(x) = k^\\top(x) V \\operatorname{diag}\\{\\delta_i' = ((\\lambda_i + \\gamma)\\lambda_i)^{-1}\\} V^\\top k(x). \\qquad (6)$$\n\nCompared to the unpenalized case (the contribution of $v_i$ is weighted by $\\lambda_i^{-2}$), the contribution of $v_i$ is down-weighted by the same factor $\\delta_i'/\\lambda_i^{-2} = \\delta_i$.\n\nFrom lemma 1 we conclude that if the data are well described by a Gaussian model in the kernel feature space, the observed Mahalanobis distances should look like a sample from a $\\chi^2$-distribution with df degrees of freedom. A graphical way to test this hypothesis is to plot the observed quantiles against the theoretical $\\chi^2$ quantiles, which in the ideal case gives a straight line. Such a quantile-quantile plot is constructed as follows: Let $\\Delta_{(i)}$ denote the observed Mahalanobis distances ordered from lowest to highest, and $p_i$ the cumulative proportion before each $\\Delta_{(i)}$ given by $p_i = (i - 1/2)/n$. Let further $z_i = F^{-1}(p_i)$ denote the theoretical quantile at position $p_i$, where $F$ is the cumulative $\\chi^2$-distribution function. The quantile-quantile plot is then obtained by plotting $\\Delta_{(i)}$ against $z_i$. Deviations from linearity can be formalized by fitting a linear model on the observed quantiles and calculating confidence intervals around the fit. Observations falling outside the confidence interval are then treated as outliers. A potential problem of this approach is that the outliers themselves heavily influence the quantile-quantile fit. In order to overcome this problem, the use of robust fitting procedures has been proposed in the literature, see e.g. [4]. In the experiments below we use an M-estimator with Huber loss function. For estimating confidence intervals around the fit we use the standard formula (see [2, 5])\n\n$$\\sigma(\\Delta_{(i)}) = b \\cdot (\\chi^2(z_i))^{-1} \\sqrt{p_i(1 - p_i)/n}, \\qquad (7)$$\n\nwhich can be intuitively understood as the product of the slope $b$ and the standard error of the quantiles. 
A $100(1 - \\alpha)\\%$ envelope around the fit is then defined as $\\Delta_{(i)} \\pm z_{\\alpha/2}\\, \\sigma(\\Delta_{(i)})$, where $z_{\\alpha/2}$ is the $1 - \\alpha/2$ quantile of the standard normal distribution.\n\nThe choice of $\\alpha$, and hence of the confidence level, is somewhat arbitrary, and from a conceptual viewpoint one might even argue that the problem of specifying one free parameter (i.e. the expected fraction of outliers) has simply been transferred into the problem of specifying another one. In practice, however, selecting $\\alpha$ is a much more intuitive procedure than guessing the fraction of outliers. Whereas the latter requires problem-specific prior knowledge which is hardly available in practice, the former depends only on the variance of a linear model fit. Thus, $\\alpha$ can be specified in a problem-independent way.\n\n4 Model selection\n\nIn our model the data are first mapped into some feature space, in which a Gaussian model is then fitted. Mahalanobis distances to the mean of this Gaussian are computed by evaluating (4). The feature space mapping is implicitly defined by the kernel function, which we assume to be parametrized by a kernel parameter $\\sigma$. For selecting all free parameters in (4), we are, thus, left with the problem of selecting $\\theta = (\\sigma, \\gamma)^\\top$.\n\nThe idea is now to select $\\theta$ by maximizing the cross-validated likelihood. From a theoretical viewpoint, the cross-validated (CV) likelihood framework is appealing, since in [13] the CV likelihood selector has been shown to asymptotically perform as well as the optimal benchmark selector which characterizes the best possible model (in terms of Kullback-Leibler divergence to the true distribution) contained in the parametric family.\n\nFor kernels that map into a space with dimension $p > n$, however, two problems arise: (i) the subspace spanned by the mapped samples varies with different sample sizes; (ii) not the whole feature space is accessible for vectors in the input space. 
As a consequence, it is difficult to find a \"proper\" normalization of the Gaussian density in the induced feature space. We propose to avoid this problem by considering the likelihood in the input space rather than in the feature space, i.e. we are looking for a properly normalized density model $p(x \\mid \\cdot)$ in $\\mathbb{R}^d$ such that $p(x \\mid \\cdot)$ has the same contour lines as the Gaussian model in the feature space: $p(x_i \\mid \\cdot) = p(x_j \\mid \\cdot) \\Leftrightarrow p(\\phi(x_i) \\mid \\cdot) = p(\\phi(x_j) \\mid \\cdot)$. Denoting by $X_n = \\{x_i\\}_{i=1}^n$ a sample from $p(x)$ from which the kernel matrix $K$ is built, a natural input space model is\n\n$$p_n(x \\mid X_n, \\theta) = Z^{-1} \\exp\\{-\\tfrac{1}{2} D(x; X_n, \\theta)\\}, \\quad \\text{with} \\quad Z = \\int_{\\mathbb{R}^d} \\exp\\{-\\tfrac{1}{2} D(x; X_n, \\theta)\\}\\, dx, \\qquad (8)$$\n\nwhere $D(x; X_n, \\theta)$ denotes the (parametrized) Mahalanobis distances (4) of a Gaussian model in the feature space. Note that this density model in the input space has the same form as our Gaussian model in the feature space, except for the different normalization constant $Z$. Computing this constant $Z$ requires us to solve a normalization integral over the whole $d$-dimensional input space. Since in general this integral is not analytically tractable for nonlinear kernel models, we propose to approximate $Z$ by a Monte Carlo sampling method. In our experiments, for instance, the VEGAS algorithm [6], which implements a mixed importance-stratified sampling approach, proved to be a reasonable method for up to 10 input dimensions.\n\nBy using the CV likelihood framework we are guaranteed to (asymptotically) perform as well as the best model in the parametrized family. Thus, the question arises whether the family of densities defined by a Gaussian model in a kernel-induced feature space is \"rich enough\" such that no systematic errors occur. For RBF kernels, the following lemma provides a positive answer to this question.\n\nLemma 2. 
Let $k(x_i, x_j) = \\exp(-\\lVert x_i - x_j \\rVert^2/\\sigma)$. As $\\sigma \\to 0$, $p_n(x \\mid X_n, \\theta)$ converges to a Parzen window with vanishing kernel width: $p_n(x \\mid X_n, \\theta) \\to \\frac{1}{n} \\sum_{i=1}^n \\delta(x - x_i)$.\n\nA formal proof is omitted due to space limitations. The basic ingredients of the proof are: (i) In the limit $\\sigma \\to 0$ the expansion coefficients approach $\\hat{\\alpha} \\to 1/(1 + \\gamma) \\mathbf{1}$. Thus, $\\hat{y} = K \\hat{\\alpha} \\to 1/(1 + \\gamma) \\mathbf{1}$ and $s^2 \\to 1/(1 + \\gamma)$. (ii) $D(x; \\sigma, \\gamma) \\to C(x) < \\infty$ if $x \\in \\{x_i\\}_{i=1}^n$, and $D(x; \\sigma, \\gamma) \\to \\infty$ otherwise. Finally $p_n(x \\mid X_n, \\sigma, \\gamma) \\to \\frac{1}{n} \\sum_{i=1}^n \\delta(x - x_i)$.\n\nNote that in the limit $\\sigma \\to 0$ a Parzen window becomes an unbiased estimator for any continuous density, which provides an asymptotic justification for our approach: the cross-validated likelihood framework guarantees convergence to a model that performs as well as the best model in our model class as $n \\to \\infty$. The latter, however, is \"rich enough\" in the sense that it contains models which in the limit $\\sigma \\to 0$ converge to an unbiased estimator for every continuous $p(x)$. Since contour lines of $p_n(x)$ are contour lines of a Gaussian model in the feature space, the Mahalanobis distances are expected to follow a $\\chi^2$ distribution, and atypical objects can be detected by observing the distribution of the empirical Mahalanobis distances as described in the last section.\n\nIt remains to show that describing the data as a Gaussian in a kernel-induced feature space is a statistically sound model. 
This is actually the case, since there exist decay rates for the kernel width $\\sigma$ such that $n$ grows at a higher rate than the effective degrees of freedom df:\n\nLemma 3. Let $k(x_i, x_j) = \\exp(-\\lVert x_i - x_j \\rVert^2/\\sigma)$ and $p_n(x \\mid X_n, \\sigma, \\gamma)$ defined by (8). If $\\sigma \\le 1$ decays like $O(n^{-1/2})$, and for fixed $\\gamma \\le 1$, the ratio df$/n \\to 0$ as $n \\to \\infty$.\n\nA formal proof is omitted due to space limitations. The basic ingredients of the proof are: (i) the eigenvalues $\\lambda_i'$ of $(1/n)K$ converge to $\\bar{\\lambda}_i$ as $n \\to \\infty$, (ii) the eigenvalue spectrum of a Gaussian RBF kernel decays at an exponential-quadratic rate: $\\bar{\\lambda}_i \\propto \\exp(-\\sigma i^2)$, (iii) for $n$ sufficiently large, it holds that $\\sum_{i=1}^n 1/[1 + (\\gamma/n) \\exp(n^{-1/2} i^2)] \\le n^{1/2} \\gamma^{-1} \\log(n/\\gamma)$ (proof by induction, using the fact that $\\ln(n + 1) - \\ln(n) \\ge 1/(n^2 + n)$, which follows from a Taylor expansion of the logarithm) $\\Rightarrow$ df$(n)/n \\to 0$.\n\n5 Experiments\n\nThe performance of the proposed method is demonstrated for an outlier detection task in the field of face recognition. The Olivetti face database (see http://www.uk.research.att.com/facedatabase.html) contains ten different images of each of 40 distinct subjects, taken under different lighting conditions and with different facial expressions and facial details (glasses / no glasses). None of the subjects, however, wears sunglasses. All the images are taken against a homogeneous background with the subjects in an upright, frontal position. In this experiment we additionally corrupted the dataset by including two images in which we have artificially changed normal glasses to \"sunglasses\", as can be seen in figure 1. 
The goal is to demonstrate that the proposed method is able to identify these two atypical images without any problem-dependent prior assumptions.\n\nFigure 1: Original and corrupted images with in-painted \"sunglasses\".\n\nEach of the 402 images is characterized by a 10-dimensional vector which contains the projections onto the leading 10 eigenfaces (eigenfaces are simply the eigenvectors of the images treated as pixel-wise vectorial objects). These vectors are fed into an RBF kernel of the form $k(x_i, x_j) = \\exp(-\\lVert x_i - x_j \\rVert^2/\\sigma)$. In a first step, the free model parameters $(\\sigma, \\gamma)$ are selected by maximizing the cross-validated likelihood. A simple 2-fold cross-validation scheme is used: the dataset is randomly split into a training set and a test set of equal size, the model is built from the training set (including the numerical solution of the normalization integral), and finally the likelihood is evaluated on the test set. This procedure is repeated for different values of $(\\sigma, \\gamma)$. In order to simplify the selection process we kept $\\gamma = 10^{-4}$ fixed and varied only $\\sigma$. Both the test likelihood and the corresponding model complexity measured in terms of the effective degrees of freedom (df) are plotted in figure 2. One can clearly identify both an overfitting and an underfitting regime, separated by a broad plateau of models with similarly high likelihood. The df-curve, however, shows a similar plateau, indicating that all these models have comparable complexity. This observation suggests that the results should be rather insensitive to variations of $\\sigma$ over values contained in this plateau. This suggestion is indeed confirmed by the results in figure 2, where we compared the quantile-quantile plot for the maximum likelihood parameter value with that of a slightly suboptimal model. 
Both quantile plots look very similar, and in both cases two objects clearly fall outside a 99% envelope around the linear fit. Outside the plateau (no figure due to space limitations) the number of objects considered as outliers drastically increases in the overfitting regime ($\\sigma$ too small), or decreases to zero in the underfitting regime ($\\sigma$ too large).\n\nIn figure 3 again the quantile plot for the most likely model is depicted. This time, however, both objects identified as outliers are related to the corresponding original images, which in fact are the artificially corrupted ones. In addition, the uncorrupted images are localized in the plot, indicating that they look rather typical.\n\nFigure 2: Middle panel: Selecting the kernel width $\\sigma$ by cross-validated likelihood (solid line). The dotted line shows the corresponding effective degrees of freedom (df). Left + right panels: quantile plot for optimal model (left) and slightly suboptimal model (right). [Plot omitted; axes: $\\Delta_{(i)}$ against $\\chi^2$ quantiles in the left and right panels, test log-likelihood and effective degrees of freedom against $\\log(\\sigma)$ in the middle panel.]\n\nFigure 3: Quantile plot with linear fit (solid) and envelopes (99% and 99.99%, dashed). [Plot omitted; axes: $\\Delta_{(i)}$ against $\\chi^2$ quantiles.]\n\nSome implementation details. Presumably the easiest way of implementing the model is to carry out an eigenvalue decomposition of $K$. Both the effective degrees of freedom df $= \\sum_i \\lambda_i/(\\lambda_i + \\gamma)$ and the Mahalanobis distances in eq. (4) can then be derived easily from this decomposition (see (5) and (6)). Efficient on-line variants can be implemented by using standard update formulas for matrix inversion by partitioning. For an implementation of the VEGAS algorithm see [9]. 
The R package \"car\" provides a convenient implementation\nof quantile-quantile plots and robust line fitting (see also http://www.R-project.org).\n\n6    Conclusion\n\nDetecting outliers by way of one-class classifiers aims at finding a boundary that separates\n\"typical\" objects in a data sample from the \"atypical\" ones. Standard approaches of this\nkind suffer from the problem that they require prior knowledge about the expected fraction\nof outliers. For the purpose of outlier detection, however, the availability of such prior\ninformation seems to be an unrealistic (or even contradictory) assumption. The method\nproposed in this paper overcomes this shortcoming by using a one-class KFD classifier\nwhich is directly related to Gaussian density estimation in the induced feature space. The\nmodel benefits from both the built-in classification method and the explicit parametric density\nmodel: from the former it inherits the simple complexity regulation mechanism based\non only two tuning parameters. Moreover, within the classification framework it is possible\nto quantify the model complexity in terms of the effective degrees of freedom df. The\nGaussian density model, on the other hand, makes it possible to derive a formal description\nof atypical objects by way of hypothesis testing: Mahalanobis distances are expected\nto follow a χ²-distribution in df dimensions, and deviations from this distribution can be\nquantified by confidence intervals around a fitted line in a quantile-quantile plot. Since\nthe density model is parametrized by both the kernel function and the regularization constant,\nit is necessary to select these free parameters before the outlier detection phase. This\nparameter selection is achieved by observing the cross-validated likelihood for different\nparameter values, and choosing those parameters which maximize this quantity. 
The theoretical\nmotivation for this selection process follows from [13] where it has been shown\nthat the cross-validation selector asymptotically performs as well as the so-called benchmark\nselector which selects the best model contained in the parametrized family of models.\nMoreover, for RBF kernels it is shown in Lemma 2 that the corresponding model family is\n\"rich enough\" in the sense that it contains an unbiased estimator for the true density (as\nlong as it is continuous) in the limit of vanishing kernel width. Lemma 3 shows that there\nexist decay rates for the kernel width such that the ratio of effective degrees of freedom to\nsample size approaches zero.\n\nThe experiment on detecting persons wearing sunglasses within a collection of rather heterogeneous\nface images effectively demonstrates that the proposed method is able to detect\natypical objects without prior assumptions on the expected number of outliers. In particular,\nit demonstrates that the whole processing pipeline consisting of model selection by\ncross-validated likelihood, fitting linear quantile-quantile models and detecting outliers by\nconsidering confidence intervals around the fit works very well in practical applications\nwith reasonably small input dimensions. For input dimensions ≫ 10 the numerical solution\nof the normalization integral becomes rather time-consuming when using the VEGAS\nalgorithm. Evaluating the usefulness of more sophisticated sampling models like Markov\nchain Monte Carlo methods for this particular task will be the subject of future work.\n\nAcknowledgments. The author would like to thank Tilman Lange, Mikio Braun and\nJoachim M. Buhmann for helpful discussions and suggestions.\n\n\nReferences\n\n [1] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley & Sons, 2001.\n\n [2] J. Fox. Applied Regression, Linear Models, and Related Methods. Sage, 1997.\n\n [3] T. Hastie, A. Buja, and R. Tibshirani. 
Penalized discriminant analysis. Annals of Statistics,\n     23:73–102, 1995.\n\n [4] P.J. Huber. Robust Statistics. Wiley, 1981.\n\n [5] M. Kendall and A. Stuart. The Advanced Theory of Statistics, volume 1. Macmillan, 1977.\n\n [6] G.P. Lepage. Vegas: An adaptive multidimensional integration program. Technical Report\n     CLNS-80/447, Cornell University, 1980.\n\n [7] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. Fisher discriminant analysis\n     with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural Networks for\n     Signal Processing IX, pages 41–48. IEEE, 1999.\n\n [8] J. Moody. The effective number of parameters: An analysis of generalisation and regularisation\n     in nonlinear learning systems. In J. Moody, S. Hanson, and R. Lippmann, editors, NIPS 4, 1992.\n\n [9] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes in C.\n     Cambridge University Press, 1992.\n\n[10] V. Roth and V. Steinhage. Nonlinear discriminant analysis using kernel functions. In S.A. Solla,\n     T.K. Leen, and K.-R. Müller, editors, NIPS 12, pages 568–574. MIT Press, 2000.\n\n[11] B. Schölkopf, S. Mika, C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. Smola. Input\n     space vs. feature space in kernel-based methods. IEEE Trans. Neural Networks, 10(5), 1999.\n\n[12] B. Schölkopf, R.C. Williamson, A. Smola, and J. Shawe-Taylor. SV estimation of a distribution's\n     support. In S. Solla, T. Leen, and K.-R. Müller, editors, NIPS 12, pages 582–588. 2000.\n\n[13] M.J. van der Laan, S. Dudoit, and S. Keles. Asymptotic optimality of likelihood-based cross-validation.\n     Statistical Applications in Genetics and Molecular Biology, 3(1), 2004.\n", "award": [], "sourceid": 2656, "authors": [{"given_name": "Volker", "family_name": "Roth", "institution": null}]}