{"title": "Robust Kernel Principal Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 1185, "page_last": 1192, "abstract": "Kernel Principal Component Analysis (KPCA) is a popular generalization of linear PCA that allows non-linear feature extraction. In KPCA, data in the input space is mapped to higher (usually) dimensional feature space where the data can be linearly modeled. The feature space is typically induced implicitly by a kernel function, and linear PCA in the feature space is performed via the kernel trick. However, due to the implicitness of the feature space, some extensions of PCA such as robust PCA cannot be directly generalized to KPCA. This paper presents a technique to overcome this problem, and extends it to a unified framework for treating noise, missing data, and outliers in KPCA. Our method is based on a novel cost function to perform inference in KPCA. Extensive experiments, in both synthetic and real data, show that our algorithm outperforms existing methods.", "full_text": "Robust Kernel Principal Component Analysis\n\nMinh Hoai Nguyen & Fernando De la Torre\n\nCarnegie Mellon University, Pittsburgh, PA 15213, USA.\n\nAbstract\n\nKernel Principal Component Analysis (KPCA) is a popular generalization of lin-\near PCA that allows non-linear feature extraction. In KPCA, data in the input\nspace is mapped to higher (usually) dimensional feature space where the data can\nbe linearly modeled. The feature space is typically induced implicitly by a kernel\nfunction, and linear PCA in the feature space is performed via the kernel trick.\nHowever, due to the implicitness of the feature space, some extensions of PCA\nsuch as robust PCA cannot be directly generalized to KPCA. This paper presents\na technique to overcome this problem, and extends it to a uni\ufb01ed framework for\ntreating noise, missing data, and outliers in KPCA. Our method is based on a novel\ncost function to perform inference in KPCA. 
Extensive experiments, on both synthetic and real data, show that our algorithm outperforms existing methods.\n\n1 Introduction\nPrincipal Component Analysis (PCA) [9] is one of the primary statistical techniques for feature extraction and data modeling. One drawback of PCA is its limited ability to model non-linear structures that exist in many computing applications. Kernel methods [18] enable us to extend PCA to model non-linearities while retaining its computational efficiency. In particular, Kernel PCA (KPCA) [19] has repeatedly outperformed PCA in many image modeling tasks [19, 14].\n\nUnfortunately, realistic visual data is often corrupted by undesirable artifacts due to occlusion (e.g. a hand in front of a face, Fig. 1.d), illumination (e.g. specular reflection, Fig. 1.e), noise (e.g. from the capturing device, Fig. 1.b), or the underlying data generation method (e.g. missing data due to transmission, Fig. 1.c). Therefore, robustness to noise, missing data, and outliers is a desired property for algorithms in computer vision.\n\nFigure 1: Several types of data corruption and results of our method. a) original image, b) corruption by additive Gaussian noise, c) missing data, d) hand occlusion, e) specular reflection. f) to i) are the results of our method for recovering uncorrupted data from b) to e) respectively.\n\nFigure 2: Using the KPCA principal subspace to find z, a clean version of a corrupted sample x.\n\nThroughout the years, several extensions of PCA have been proposed to address the problems of outliers and missing data; see [6] for a review. However, it still remains unclear how to generalize those extensions to KPCA, since directly migrating robust PCA techniques to KPCA is not possible due to the implicitness of the feature space. 
To overcome this problem, in this paper we propose Robust KPCA (RKPCA), a unified framework for denoising images, recovering missing data, and handling intra-sample outliers. Robust computation in RKPCA does not suffer from the implicitness of the feature space because of a novel cost function for reconstructing \u201cclean\u201d images from corrupted data. The proposed cost function is composed of two terms, requiring the reconstructed image to be close to the KPCA principal subspace as well as to the input sample. We show that robustness can be naturally achieved by using robust functions to measure the closeness between the reconstructed and the input data.\n\n2 Previous work\n2.1 KPCA and pre-image\nKPCA [19, 18, 20] is a non-linear extension of principal component analysis (PCA) using kernel methods. The kernel represents an implicit mapping of the data to a (usually) higher-dimensional space where linear PCA is performed.\n\nLet X denote the input space and H the feature space. The mapping function \u03d5 : X \u2192 H is implicitly induced by a kernel function k : X \u00d7 X \u2192 \u211c that defines the similarity between data in the input space. One can show that if k(\u00b7, \u00b7) is a kernel then the function \u03d5(\u00b7) and the feature space H exist; furthermore, k(x, y) = \u27e8\u03d5(x), \u03d5(y)\u27e9 [18].\n\nHowever, directly performing linear PCA in the feature space might not be feasible because the feature space typically has very high (possibly infinite) dimensionality. Thus KPCA is often done via the kernel trick. Let D = [d1 d2 ... dn] (see footnote 1 for notation) be a training data matrix, such that d_i \u2208 X \u2200i = 1, ..., n. Let k(\u00b7, \u00b7) denote a kernel function, and K denote the kernel matrix (element ij of K is k_ij = k(d_i, d_j)). KPCA is computed in closed form by finding the first m eigenvectors (a_i's) corresponding to the largest eigenvalues (\u03bb_i's) of the kernel matrix K (i.e. KA = A\u039b). The eigenvectors in the feature space, V, can be computed as V = \u0393A, where \u0393 = [\u03d5(d1) ... \u03d5(dn)]. To ensure orthonormality of {v_i}_{i=1}^m, KPCA imposes that \u03bb_i \u27e8a_i, a_i\u27e9 = 1. It can be shown that {v_i}_{i=1}^m form an orthonormal basis of size m that best preserves the variance of the data in the feature space [19].\n\nAssume x is a data point in the input space, and let P\u03d5(x) denote the projection of \u03d5(x) onto the principal subspace {v_i}_{i=1}^m. Because {v_i}_{i=1}^m is a set of orthonormal vectors, we have P\u03d5(x) = \u2211_{i=1}^m \u27e8\u03d5(x), v_i\u27e9 v_i. The reconstruction error (in feature space) is given by:\n\nE_proj(x) = ||\u03d5(x) \u2212 P\u03d5(x)||_2^2 = \u27e8\u03d5(x), \u03d5(x)\u27e9 \u2212 \u2211_i \u27e8\u03d5(x), v_i\u27e9^2 = k(x, x) \u2212 r(x)^T M r(x) ,  (1)\n\nwhere r(x) = \u0393^T \u03d5(x) and M = \u2211_i a_i a_i^T.\n\nThe pre-image of the projection is the z \u2208 X that satisfies \u03d5(z) = P\u03d5(x); z is also referred to as the KPCA reconstruction of x. However, the pre-image of P\u03d5(x) usually does not exist, so finding the KPCA reconstruction of x means finding z such that \u03d5(z) is as close to P\u03d5(x) as possible. It should be noted that the closeness between \u03d5(z) and P\u03d5(x) can be defined in many ways, and different cost functions lead to different optimization problems. Sch\u00f6lkopf et al. [17] and Mika et al. [13] propose to approximate the reconstruction of x by arg min_z ||\u03d5(z) \u2212 P\u03d5(x)||_2^2. Two other objective functions have been proposed by Kwok & Tsang [10] and Bakir et al. [2].\n\n2.2 KPCA-based algorithms for dealing with noise, outliers and missing data\nOver the years, several methods extending KPCA algorithms to deal with noise, outliers, or missing data have been proposed. Mika et al. [13], Kwok & Tsang [10], and Bakir et al. [2] show how denoising can be achieved by using the pre-image. 
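To make the kernel-trick computation and the reconstruction error of Eq. 1 concrete, here is a minimal sketch for the Gaussian kernel (our own illustrative code and names, not the authors' implementation):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma):
    # k(x, y) = exp(-gamma * ||x - y||^2), computed pairwise
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kpca_train(D, gamma, m):
    # D: n x p matrix whose rows are the training samples d_i.
    K = gaussian_kernel(D, D, gamma)
    lam, A = np.linalg.eigh(K)                  # eigenvalues in ascending order
    lam, A = lam[::-1][:m], A[:, ::-1][:, :m]   # keep the top-m eigenpairs
    A = A / np.sqrt(lam)                        # enforce lambda_i <a_i, a_i> = 1
    return A @ A.T                              # M = sum_i a_i a_i^T

def e_proj(x, D, M, gamma):
    # Eq. (1): E_proj(x) = k(x, x) - r(x)^T M r(x), with r(x) = Gamma^T phi(x)
    r = gaussian_kernel(D, x[None, :], gamma).ravel()
    return 1.0 - r @ M @ r                      # k(x, x) = 1 for the Gaussian kernel
```

With m = n the principal subspace spans all mapped training points, so E_proj vanishes on the training data; with m < n it lies in [0, 1] for the Gaussian kernel, since ||\u03d5(x)|| = 1.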
While these papers present promising denoising results for handwritten digits, there are at least two problems with these approaches. Firstly, because the input image x is noisy, the similarity between x and another data point d_i (i.e. the kernel value k(x, d_i)) might be adversely affected, biasing the KPCA reconstruction of x. Secondly, current KPCA reconstruction methods weigh all the features (i.e. pixels) equally; it is impossible to weigh the importance of some features over the others.\n\n1 Bold uppercase letters denote matrices (e.g. D), bold lowercase letters denote column vectors (e.g. d). d_j represents the jth column of the matrix D. d_ij denotes the scalar in the ith row and jth column of the matrix D and the ith element of the column vector d_j. Non-bold letters represent scalar variables. 1_k \u2208 R^{k\u00d71} is a column vector of ones. I_k \u2208 R^{k\u00d7k} is the identity matrix.\n\nFigure 3: Key difference between previous work (a) and ours (b). In (a), one seeks z such that \u03d5(z) is close to P\u03d5(x). In (b), we seek z such that \u03d5(z) is close to both \u03d5(x) and the principal subspace.\n\nOther existing methods also have limitations. Some [7, 22, 1] only consider robustness of the principal subspace; they do not address robust fitting. Lu et al. [12] present an iterative approach to handle outliers in training data. At each iteration, the KPCA model is built, and the data points that have the highest reconstruction errors are regarded as outliers and discarded from the training set. However, this approach does not handle intra-sample outliers (outliers that occur at a pixel level [6]). Several other approaches have also been considered: Berar et al. [3] propose to use KPCA with polynomial kernels to handle missing data. 
However, it is not clear how to extend this approach to other kernels. Furthermore, with polynomial kernels of high degree, the objective function is hard to optimize. Sanguinetti & Lawrence [16] propose an elegant framework to handle missing data. The framework is based on the probabilistic interpretation inherited from Probabilistic PCA [15, 21, 11]. However, Sanguinetti & Lawrence [16] do not address the problem of outliers.\n\nThis paper presents a novel cost function that unifies the treatment of noise, missing data and outliers in KPCA. Experiments show that our algorithm outperforms existing approaches [6, 10, 13, 16].\n\n3 Robust KPCA\n3.1 KPCA reconstruction revisited\nGiven an image x \u2208 X, Fig. 2 describes the task of finding the KPCA-reconstructed image of x (an uncorrupted version of x, to which we will refer as the KPCA reconstruction). Mathematically, the task is to find a point z \u2208 X such that \u03d5(z) is in the principal subspace (denoted PS) and \u03d5(z) is as close to \u03d5(x) as possible. In other words, finding the KPCA reconstruction of x is to optimize:\n\narg min_z ||\u03d5(z) \u2212 \u03d5(x)||^2 s.t. \u03d5(z) \u2208 PS .  (2)\n\nHowever, since there might not exist z \u2208 X such that \u03d5(z) \u2208 PS, the above optimization problem needs to be relaxed. There is a common relaxation approach used by existing methods for computing the KPCA reconstruction of x. This approach conceptually involves two steps: (i) finding P\u03d5(x), the closest point to \u03d5(x) among all the points in the principal subspace; (ii) finding z such that \u03d5(z) is as close to P\u03d5(x) as possible. This relaxation is depicted in Fig. 3a. If the L2 norm is used to measure the closeness between \u03d5(z) and P\u03d5(x), the resulting KPCA reconstruction is arg min_z ||\u03d5(z) \u2212 P\u03d5(x)||_2^2 .\n\nThis approach for KPCA reconstruction is not robust. 
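For the Gaussian kernel, this standard relaxation is typically solved with the fixed-point pre-image iteration of Mika et al. [13]; the following is a minimal sketch under our own naming (not the authors' code), with M = \u2211_i a_i a_i^T and r(x) as in Eq. 1:

```python
import numpy as np

def mika_preimage(x, D, M, gamma, n_iter=200, tol=1e-10):
    # Fixed-point pre-image iteration for the Gaussian kernel:
    #   z <- sum_i w_i d_i / sum_i w_i,  w_i = g_i * exp(-gamma * ||z - d_i||^2),
    # where g = M r(x) are the expansion coefficients of P phi(x) over the phi(d_i).
    r = np.exp(-gamma * ((D - x) ** 2).sum(1))  # r(x) = Gamma^T phi(x)
    g = M @ r
    z = x.copy()                                # initialize at the input point
    for _ in range(n_iter):
        w = g * np.exp(-gamma * ((D - z) ** 2).sum(1))
        z_new = (w[:, None] * D).sum(0) / w.sum()
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z
```

Note that the input x enters only through the fixed coefficients g; every pixel of x is weighted equally, which is exactly the limitation discussed next.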
For example, if x is corrupted with intra-sample outliers (e.g. occlusion), \u03d5(x) and P\u03d5(x) will also be adversely affected. As a consequence, finding z that minimizes ||\u03d5(z) \u2212 P\u03d5(x)||_2^2 does not always produce a \u201cclean\u201d version of x. Furthermore, it is unclear how to incorporate robustness into the above formulation.\n\nHere, we propose a novel relaxation of (2) that enables the incorporation of robustness. The KPCA reconstruction of x is taken as:\n\narg min_z ||\u03d5(x) \u2212 \u03d5(z)||_2^2 + C E_proj(z), where E_proj(z) = ||\u03d5(z) \u2212 P\u03d5(z)||_2^2 .  (3)\n\nIntuitively, the above cost function requires that the KPCA reconstruction of x is a point z such that \u03d5(z) is close to both \u03d5(x) and the principal subspace. C is a user-defined parameter that controls the relative importance of these two terms. This approach is depicted in Fig. 3b.\n\nAlgorithm 1 RKPCA for missing attribute values in training data\nInput: training data D, number of iterations m, number of partitions k.\nInitialize missing values with the means of the known values.\nfor iter = 1 to m do\n  Randomly divide D into k equal partitions D1, ..., Dk\n  for i = 1 to k do\n    Train RKPCA using data D \\ Di\n    Run RKPCA fitting on Di to re-estimate its missing attributes.\n  end for\n  Update the missing values of D\nend for\n\nIt is possible to generalize the above cost function further. The first term of Eq. 3 is not necessarily ||\u03d5(x) \u2212 \u03d5(z)||_2^2. In fact, for the sake of robustness, it is preferable that ||\u03d5(x) \u2212 \u03d5(z)||_2^2 is replaced by a robust function E_0 : X \u00d7 X \u2192 \u211c for measuring the similarity between x and z. Furthermore, there is no reason why E_0 should be restricted to the metric of the feature space. In short, the KPCA reconstruction of x can be taken as:\n\narg min_z E_0(x, z) + C E_proj(z) .  (4)\n\nBy choosing appropriate forms for E_0, one can use KPCA to handle noise, missing data, and intra-sample outliers. We show this in the following sections.\n\n3.2 Dealing with missing data in testing samples\nAssume the KPCA has been learned from complete and noise-free data. Given a new image x with missing values, a logical function E_0 that does not depend on the missing values could be: E_0(x, z) = \u2212exp(\u2212\u03b3_2 ||W(x \u2212 z)||_2^2), where W is a diagonal matrix; the elements of its diagonal are 0 or 1 depending on whether the corresponding attributes of x are missing or not.\n\n3.3 Dealing with intra-sample outliers in testing samples\nTo handle intra-sample outliers, we could use a robust function for E_0. For instance: E_0(x, z) = \u2212exp(\u2212\u03b3_2 \u2211_{i=1}^d \u03c1(x_i \u2212 z_i, \u03c3)), where \u03c1(\u00b7, \u00b7) is the Geman-McClure function, \u03c1(y, \u03c3) = y^2/(y^2 + \u03c3^2), and \u03c3 is a parameter of the function. This function is also used in [6] for Robust PCA.\n\n3.4 Dealing with missing data and intra-sample outliers in training data\nPrevious sections have shown how to deal with outliers and missing data in the testing set (assuming KPCA has been learned from a clean training set). If we have missing data in the training samples [6], a simple approach is to iteratively alternate between estimating the missing values and updating the KPCA principal subspace until convergence. Algorithm 1 outlines the main steps of this approach. An algorithm for handling intra-sample outliers in training data could be constructed similarly.\n\nAlternatively, a kernel matrix could be computed ignoring the missing values, that is, each k_ij = exp(\u2212\u03b3_2 ||W_i W_j (x_i \u2212 x_j)||_2^2), where \u03b3_2 = 1/trace(W_i W_j). However, the positive definiteness of the resulting kernel matrix cannot be guaranteed.\n\n3.5 Optimization\nIn general, the objective function in Eq. 4 is not concave, hence non-convex optimization techniques are required. In this section, we restrict our attention to the Gaussian kernel (k(x, y) = exp(\u2212\u03b3||x \u2212 y||^2)), the most widely used kernel. If E_0 takes the form of Sec. 3.2, we need to maximize\n\nE(z) = E_1(z) + C E_2(z), with E_1(z) = exp(\u2212\u03b3_2 ||W(x \u2212 z)||^2) and E_2(z) = r(z)^T M r(z) ,  (5)\n\nwhere r(\u00b7) and M are defined in Eq. 1. Note that optimizing this function is not harder than optimizing the objective function used by Mika et al. [13]. Here, we also derive a fixed-point optimization algorithm. A necessary condition for an optimum is \u2207_z E(z) = \u2207_z E_1(z) + C \u2207_z E_2(z) = 0. The expressions for the gradients are:\n\n\u2207_z E_1(z) = \u22122\u03b3_2 W_2 (z \u2212 x), with W_2 = exp(\u2212\u03b3_2 ||W(x \u2212 z)||^2) W^2, and \u2207_z E_2(z) = \u22124\u03b3 [(1_n^T Q 1_n) z \u2212 D Q 1_n] ,\n\nwhere Q is a matrix such that q_ij = m_ij exp(\u2212\u03b3||z \u2212 d_i||^2 \u2212 \u03b3||z \u2212 d_j||^2). A fixed-point update is:\n\nz = W_3^{-1} u, with W_3 = (1/(2C)) (\u03b3_2/\u03b3) W_2 + (1_n^T Q 1_n) I_n and u = (1/(2C)) (\u03b3_2/\u03b3) W_2 x + D Q 1_n .  (6)\n\nThe above equation is the update rule for z at every iteration of the algorithm. The algorithm stops when the difference between two successive z's is smaller than a threshold.\n\nNote that W_3 is a diagonal matrix with non-negative entries since Q is a positive semi-definite matrix. Therefore, W_3 is not invertible only if there are some zero elements on the diagonal. This only happens if some elements of the diagonal of W are 0 and 1_n^T Q 1_n = 0. 
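Putting the pieces together, the fixed-point optimization of Eq. 5 via the update of Eq. 6 can be sketched as follows (an illustrative implementation with hypothetical names, not the authors' code):

```python
import numpy as np

def rkpca_fit(x, W, D, M, gamma, gamma2, C, n_iter=100, tol=1e-8):
    # Fixed-point iteration of Eq. (6) for the missing-data E_0 of Sec. 3.2.
    # x: corrupted input (p,); W: 0/1 diagonal mask stored as a vector (p,);
    # D: n x p training data (rows are samples); M = sum_i a_i a_i^T from KPCA.
    z = x.copy()
    for _ in range(n_iter):
        kz = np.exp(-gamma * ((D - z) ** 2).sum(1))    # k(z, d_i)
        Q = M * np.outer(kz, kz)                       # q_ij = m_ij k(z,d_i) k(z,d_j)
        s = Q.sum()                                    # 1_n^T Q 1_n
        w2 = np.exp(-gamma2 * (W * (x - z)) @ (W * (x - z))) * W ** 2
        W3 = gamma2 / (2 * C * gamma) * w2 + s         # diagonal of W_3
        u = gamma2 / (2 * C * gamma) * w2 * x + D.T @ Q.sum(1)
        z_new = u / W3                                 # z = W_3^{-1} u
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z
```

At a masked (missing) coordinate, w2 is zero, so that entry of z is driven purely by the training-data term D Q 1_n, as the analysis below describes.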
It can be shown that 1_n^T Q 1_n = \u2211_{i=1}^m (v_i^T \u03d5(z))^2, so 1_n^T Q 1_n = 0 only when \u03d5(z) \u22a5 v_i \u2200i. However, this rarely occurs in practice; moreover, if it happens we can restart the algorithm from a different initial point.\n\nConsider the update rule given in Eq. 6: z = W_3^{-1} u. The diagonal matrix W_3^{-1} acts as a normalization factor of u. The vector u is a weighted combination of two terms, the training data D and x. Furthermore, each element of x is weighted differently by W_2, which is proportional to W. In the case of missing data (some entries of the diagonal of W, and therefore of W_2, are zero), the missing components of x do not affect the computation of u and z. Entries corresponding to the missing components of the resulting z are pixel-weighted combinations of the training data. The contribution of x also depends on the ratio \u03b3_2/\u03b3, on C, and on the distance from the current z to x. Similar to the observation of Mika et al. [13], the second term of the vector u pulls z towards a single Gaussian cluster. The attraction force generated by a training data point d_i reflects the correlation between \u03d5(z) and \u03d5(d_i), the correlation between \u03d5(z) and the eigenvectors v_j, and the contributions of \u03d5(d_i) to the eigenvectors. The forces from the training data, together with the attraction force from x, draw z towards a Gaussian cluster that is close to x.\n\nOne can derive a similar update rule for z if E_0 takes the form in Sec. 3.3: z = [(1/(2C)) (\u03b3_2/\u03b3) W_4 + (1_n^T Q 1_n) I_n]^{-1} ((1/(2C)) (\u03b3_2/\u03b3) W_4 x + D Q 1_n), with W_4 = exp(\u2212\u03b3_2 \u2211_{i=1}^d \u03c1(x_i \u2212 z_i, \u03c3)) W_5^2, where W_5 is a diagonal matrix whose ith diagonal entry is \u03c3/((z_i \u2212 x_i)^2 + \u03c3^2). 
The parameter \u03c3 is updated at every iteration as follows: \u03c3 = 1.4826 \u00d7 median({|z_i \u2212 x_i|}_{i=1}^d) [5].\n\n4 Experiments\n4.1 RKPCA for intra-sample outliers\nIn this section, we compare RKPCA with three approaches for handling intra-sample outliers: (i) Robust Linear PCA [6], (ii) Mika et al.'s KPCA reconstruction [13], and (iii) Kwok & Tsang's KPCA reconstruction [10]. The experiments are done on the CMU Multi-PIE database [8].\n\nThe Multi-PIE database consists of facial images of 337 subjects taken under different illuminations, expressions and poses, at four different sessions. We only make use of the directly-illuminated frontal face images under five different expressions (smile, disgust, squint, surprise and scream), see Fig. 4. Our dataset contains 1100 images; 700 are randomly selected for training, 100 are used for validation, and the rest is reserved for testing. Note that no subject in the testing set appears in the training set. Each face is manually labeled with 68 landmarks, as shown in Fig. 4a. A shape-normalized face is generated for every face by warping it towards the mean shape using an affine transformation. Fig. 4b shows an example of such a shape-normalized face. The mean shape is used as the face mask and the values inside the mask are vectorized.\n\nTo quantitatively compare different methods, we introduce synthetic occlusions of different sizes (20, 30, and 40 pixels) into the test images. For each occlusion size and test image pair, we generate a square occlusion window of that size, drawing the pixel values randomly from 0 to 255. A synthetic testing image is then created by pasting the occlusion window at a random position. Fig. 4c displays such an image with an occlusion size of 20. For every synthetic testing image and each of the four algorithms, we compute the mean (at pixel level) of the absolute differences between the reconstructed image and the original test image without occlusion. We record these statistics for the occluded region, the non-occluded region, and the whole face. The average statistics together with standard deviations are then calculated over the set of all testing images. These results are displayed in Fig. 5. We also experiment with several settings of the energy level for PCA and KPCA; the energy level essentially determines the number of components of the PCA/KPCA subspace. In the interest of space, we only display results for two settings, 80% and 95%. BaseLine is the method that does nothing; the reconstructed images are exactly the same as the input testing images. As can be seen from Fig. 5, our method consistently outperforms the others for all energy levels and occlusion sizes (using the whole-face statistics). Notably, the performance of our method with its best parameter settings is also better than the performances of the other methods with their best parameter settings.\n\nFigure 4: a) 68 landmarks, b) a shape-normalized face, c) synthetic occlusion.\n\nFigure 5: Results of several methods on the Multi-PIE database. This shows the means and standard deviations of the absolute differences between reconstructed images and the ground truths. The statistics are available for three types of face regions (whole face, occluded region, and non-occluded region), different occlusion sizes, and different energy settings. Our method consistently outperforms the other methods for different occlusion sizes and energy levels. [Table in Fig. 5: means \u00b1 standard deviations for Base Line, Mika et al., Kwok & Tsang, Robust PCA, and our method, for occlusion sizes 20/30/40, the three face regions, and energy levels 80% and 95%.]\n\nThe experimental results for Mika et al. [13], Kwok & Tsang [10], and Robust PCA [6] are generated using our own implementations. For Mika et al.'s and Kwok & Tsang's methods, we use Gaussian kernels with \u03b3 = 10^{\u22127}. For our method, we use E_0 defined in Sec. 3.3. The kernel is Gaussian with \u03b3 = 10^{\u22127}, \u03b3_2 = 10^{\u22126}, and C = 0.1. 
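The robust E_0 of Sec. 3.3 used in this comparison, and the \u03c3 update of Sec. 3.5, can be sketched as follows (our helper names, illustrative only):

```python
import numpy as np

def geman_mcclure(y, sigma):
    # rho(y, sigma) = y^2 / (y^2 + sigma^2); bounded by 1, so large residuals
    # (intra-sample outliers) have limited influence.
    return y ** 2 / (y ** 2 + sigma ** 2)

def robust_e0(x, z, gamma2, sigma):
    # E_0(x, z) = -exp(-gamma2 * sum_i rho(x_i - z_i, sigma))
    return -np.exp(-gamma2 * geman_mcclure(x - z, sigma).sum())

def update_sigma(x, z):
    # sigma = 1.4826 * median(|z_i - x_i|), the usual MAD scale estimate [5]
    return 1.4826 * np.median(np.abs(z - x))
```

Because rho saturates, a single grossly corrupted pixel cannot dominate E_0, unlike the squared distance used in the non-robust formulation.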
The parameters are tuned using validation data.\n\n4.2 RKPCA for incomplete training data\nTo compare the ability of our algorithm to handle missing attributes in training data with that of other methods, we perform experiments on the well-known Oil Flow dataset [4]. This dataset is also used by Sanguinetti & Lawrence [16]. It contains 3000 12-dimensional synthetically generated data points modeling the flow of a mixture of oil, water and gas in a transport pipeline.\n\nWe test our algorithm with different amounts of missing data (from 5% to 50%) and repeat each experiment 50 times. For each experiment, we randomly choose 100 data points and randomly remove attribute values at the given rate. We run Algorithm 1 to recover the values of the missing attributes, with m = 25, k = 10, \u03b3 = 0.0375 (same as [16]), \u03b3_2 = 0.0375, C = 10^7. The squared difference between the reconstructed data and the original ground-truth data is measured, and the mean and standard deviation over the 50 runs are calculated. Note that this experimental setting is exactly the same as that of [16].\n\nTable 1: Reconstruction errors for 5 different methods and 10 probabilities of missing values for the Oil Flow dataset. 
Our method outperforms the other methods for all levels of missing data.\n\np(del) | mean | 1-NN | PPCA | PKPCA | Ours\n0.05 | 13 \u00b1 4 | 5 \u00b1 3 | 3.7 \u00b1 .6 | 5 \u00b1 1 | 3.2 \u00b1 1.9\n0.10 | 28 \u00b1 4 | 14 \u00b1 5 | 9 \u00b1 2 | 12 \u00b1 3 | 8 \u00b1 4\n0.15 | 43 \u00b1 7 | 30 \u00b1 10 | 17 \u00b1 5 | 19 \u00b1 5 | 12 \u00b1 4\n0.20 | 53 \u00b1 8 | 60 \u00b1 20 | 25 \u00b1 9 | 24 \u00b1 6 | 19 \u00b1 6\n0.25 | 70 \u00b1 9 | 90 \u00b1 20 | 50 \u00b1 10 | 32 \u00b1 6 | 27 \u00b1 8\n0.30 | 81 \u00b1 9 | NA | 90 \u00b1 30 | 40 \u00b1 7 | 34 \u00b1 10\n0.35 | 97 \u00b1 9 | NA | 110 \u00b1 30 | 45 \u00b1 4 | 44 \u00b1 9\n0.40 | 109 \u00b1 8 | NA | 110 \u00b1 20 | 61 \u00b1 8 | 53 \u00b1 12\n0.45 | 124 \u00b1 7 | NA | 120 \u00b1 30 | 70 \u00b1 10 | 69 \u00b1 13\n0.50 | 139 \u00b1 7 | NA | 140 \u00b1 30 | 100 \u00b1 20 | 83 \u00b1 15\n\nExperimental results are summarized in Tab. 1. The results of our method are shown in the last column. The results of the other methods are copied verbatim from [16]. The mean method is a widely used heuristic in which the missing value of an attribute of a data point is filled in with the mean of the known values of the same attribute of the other data points. The 1-NN method is another widely used heuristic in which the missing values are replaced by the values of the nearest point, where the pairwise distance is calculated using only the attributes with known values. PPCA is the probabilistic PCA method [11], and PKPCA is the method proposed by [16]. As can be seen from Tab. 1, our method outperforms the other methods for all levels of missing data.\n\n4.3 RKPCA for denoising\nThis section describes denoising experiments on the Multi-PIE database with additive Gaussian noise. For a fair evaluation, we only compare our algorithm with Mika et al.'s, Kwok & Tsang's, and linear PCA. These are the methods that perform denoising based on subspaces and do not rely explicitly on the statistics of natural images. 
Quantitative evaluations show that the denoising ability of our algorithm compares favorably with those of the other methods.\n\nFigure 6: Example of denoised images. a) original image, b) corrupted by Gaussian noise, c) denoised using PCA, d) using Mika et al., e) using Kwok & Tsang's method, f) result of our method.\n\nThe set of images used in these experiments is exactly the same as in the occlusion experiments described in Sec. 4.1. For every testing image, we synthetically corrupt it with additive Gaussian noise with a standard deviation of 0.04. An example of a pair of clean and corrupted images is shown in Fig. 6a and 6b. For every synthetic testing image, we compute the mean (at pixel level) of the absolute difference between the denoised image and the ground truth. The results of different methods with different energy settings are summarized in Tab. 2. For these experiments, we use E_0 defined in Sec. 3.2 with W being the identity matrix. We use a Gaussian kernel with \u03b3 = \u03b3_2 = 10^{\u22127}, and C = 1. These parameters are tuned using validation data.\n\nTable 2: Results of image denoising on the Multi-PIE database. BaseLine is the method that does nothing. The best energy setting for all methods is 100%. Our method is better than the others.\n\nEnergy | Base Line | Mika | Kwok & Tsang | PCA | Ours\n80% | 8.14\u00b10.16 | 9.07\u00b11.86 | 11.79\u00b12.56 | 10.04\u00b11.99 | 7.01\u00b11.27\n95% | 8.14\u00b10.16 | 6.37\u00b11.30 | 11.55\u00b12.52 | 6.70\u00b11.20 | 5.70\u00b10.96\n100% | 8.14\u00b10.16 | 5.55\u00b10.97 | 11.52\u00b12.52 | 6.44\u00b10.39 | 5.43\u00b10.78\n\nTab. 2 and Fig. 6 show that the performance of our method compares favorably with the others. In fact, the quantitative results show that our method is marginally better than Mika et al.'s method and substantially better than the other two. In terms of visual appearance (Fig. 
6c-f), the reconstructed image produced by our method preserves much finer detail than the others.\n\n5 Conclusion\nIn this paper, we have proposed Robust Kernel PCA, a unified framework for handling noise, occlusion and missing data. Our method is based on a novel cost function for Kernel PCA reconstruction. The cost function requires the reconstructed data point to be close to the original data point as well as to the principal subspace. Notably, the distance function between the reconstructed data point and the original data point can take various forms. This distance function need not depend on the kernel function and can be evaluated easily. Therefore, the implicitness of the feature space is avoided and optimization is possible. Extensive experiments, on two well-known datasets, show that our algorithm outperforms existing methods.\n\nReferences\n\n[1] Alzate, C. & Suykens, J.A. (2005) \u2018Robust Kernel Principal Component Analysis using Huber\u2019s Loss Function.\u2019 24th Benelux Meeting on Systems and Control.\n\n[2] Bakir, G.H., Weston, J. & Sch\u00f6lkopf, B. (2004) \u2018Learning to Find Pre-Images.\u2019 in Thrun, S., Saul, L. & Sch\u00f6lkopf, B. (Eds) Advances in Neural Information Processing Systems.\n\n[3] Berar, M., Desvignes, M., Bailly, G., Payan, Y. & Romaniuk, B. (2005) \u2018Missing Data Estimation using Polynomial Kernels.\u2019 Proceedings of International Conference on Advances in Pattern Recognition.\n\n[4] Bishop, C.M., Svens\u00e9n, M. & Williams, C.K.I. (1998) \u2018GTM: The Generative Topographic Mapping.\u2019 Neural Computation, 10(1), 215\u2013234.\n\n[5] Black, M.J. & Anandan, P. (1996) \u2018The Robust Estimation of Multiple Motions: Parametric and Piecewise-smooth Flow Fields.\u2019 Computer Vision and Image Understanding, 63(1), 75\u2013104.\n\n[6] de la Torre, F. & Black, M.J. 
(2003) \u2018A Framework for Robust Subspace Learning.\u2019 International Journal of Computer Vision, 54(1\u20133), 117\u2013142.\n\n[7] Deng, X., Yuan, M. & Sudijanto, A. (2007) \u2018A Note on Robust Principal Component Analysis.\u2019 Contemporary Mathematics, 443, 21\u201333.\n\n[8] Gross, R., Matthews, I., Cohn, J., Kanade, T. & Baker, S. (2007) \u2018The CMU Multi-pose, Illumination, and Expression (Multi-PIE) Face Database.\u2019 Technical report TR-07-08, Carnegie Mellon University.\n\n[9] Jolliffe, I. (2002) Principal Component Analysis. 2nd edn. Springer-Verlag, New York.\n\n[10] Kwok, J.T.Y. & Tsang, I.W.H. (2004) \u2018The Pre-Image Problem in Kernel Methods.\u2019 IEEE Transactions on Neural Networks, 15(6), 1517\u20131525.\n\n[11] Lawrence, N.D. (2004) \u2018Gaussian Process Latent Variable Models for Visualization of High Dimensional Data.\u2019 in Thrun, S., Saul, L. & Sch\u00f6lkopf, B. (Eds) Advances in Neural Information Processing Systems.\n\n[12] Lu, C., Zhang, T., Zhang, R. & Zhang, C. (2003) \u2018Adaptive Robust Kernel PCA Algorithm.\u2019 Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing.\n\n[13] Mika, S., Sch\u00f6lkopf, B., Smola, A., M\u00fcller, K.R., Scholz, M. & R\u00e4tsch, G. (1999) \u2018Kernel PCA and De-Noising in Feature Spaces.\u2019 Advances in Neural Information Processing Systems.\n\n[14] Romdhani, S., Gong, S. & Psarrou, A. (1999) \u2018Multi-view Nonlinear Active Shape Model Using Kernel PCA.\u2019 British Machine Vision Conference, 483\u2013492.\n\n[15] Roweis, S. (1998) \u2018EM Algorithms for PCA and SPCA.\u2019 in Jordan, M., Kearns, M. & Solla, S. (Eds) Advances in Neural Information Processing Systems 10.\n\n[16] Sanguinetti, G. & Lawrence, N.D. (2006) \u2018Missing Data in Kernel PCA.\u2019 Proceedings of European Conference on Machine Learning.\n\n[17] Sch\u00f6lkopf, B., Mika, S., Smola, A., R\u00e4tsch, G. & M\u00fcller, K.R. 
(1998) \u2018Kernel PCA Pattern Reconstruction via Approximate Pre-Images.\u2019 International Conference on Artificial Neural Networks.\n\n[18] Sch\u00f6lkopf, B. & Smola, A. (2002) Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA.\n\n[19] Sch\u00f6lkopf, B., Smola, A. & M\u00fcller, K. (1998) \u2018Nonlinear Component Analysis as a Kernel Eigenvalue Problem.\u2019 Neural Computation, 10, 1299\u20131319.\n\n[20] Shawe-Taylor, J. & Cristianini, N. (2004) Kernel Methods for Pattern Analysis. Cambridge University Press.\n\n[21] Tipping, M. & Bishop, C.M. (1999) \u2018Probabilistic Principal Component Analysis.\u2019 Journal of the Royal Statistical Society B, 61, 611\u2013622.\n\n[22] Wang, L., Pang, Y.W., Shen, D.Y. & Yu, N.H. (2007) \u2018An Iterative Algorithm for Robust Kernel Principal Component Analysis.\u2019 Conference on Machine Learning and Cybernetics.\n", "award": [], "sourceid": 179, "authors": [{"given_name": "Minh", "family_name": "Nguyen", "institution": null}, {"given_name": "Fernando", "family_name": "Torre", "institution": null}]}