{"title": "Robust Transfer Principal Component Analysis with Rank Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 1151, "page_last": 1159, "abstract": "Principal component analysis (PCA), a well-established technique for data analysis and processing, provides a convenient form of dimensionality reduction that is effective for cleaning small Gaussian noises presented in the data. However, the applicability of standard principal component analysis in real scenarios is limited by its sensitivity to large errors. In this paper, we tackle the challenge problem of recovering data corrupted with errors of high magnitude by developing a novel robust transfer principal component analysis method. Our method is based on the assumption that useful information for the recovery of a corrupted data matrix can be gained from an uncorrupted related data matrix. Speci\ufb01cally, we formulate the data recovery problem as a joint robust principal component analysis problem on the two data matrices, with shared common principal components across matrices and individual principal components speci\ufb01c to each data matrix. The formulated optimization problem is a minimization problem over a convex objective function but with non-convex rank constraints. We develop an ef\ufb01cient proximal projected gradient descent algorithm to solve the proposed optimization problem with convergence guarantees. 
Our empirical results over image denoising tasks show the proposed method can effectively recover images with random large errors, and significantly outperform both standard PCA and robust PCA.", "full_text": "Robust Transfer Principal Component Analysis with Rank Constraints

Yuhong Guo
Department of Computer and Information Sciences
Temple University, Philadelphia, PA 19122, USA
yuhong@temple.edu

Abstract

Principal component analysis (PCA), a well-established technique for data analysis and processing, provides a convenient form of dimensionality reduction that is effective for cleaning small Gaussian noise present in the data. However, the applicability of standard principal component analysis in real scenarios is limited by its sensitivity to large errors. In this paper, we tackle the challenging problem of recovering data corrupted with errors of high magnitude by developing a novel robust transfer principal component analysis method. Our method is based on the assumption that useful information for the recovery of a corrupted data matrix can be gained from an uncorrupted related data matrix. Specifically, we formulate the data recovery problem as a joint robust principal component analysis problem on the two data matrices, with common principal components shared across matrices and individual principal components specific to each data matrix. The formulated optimization problem is a minimization problem over a convex objective function but with non-convex rank constraints. We develop an efficient proximal projected gradient descent algorithm to solve the proposed optimization problem with convergence guarantees.
Our empirical results over image denoising tasks show the proposed method can effectively recover images with random large errors, and significantly outperform both standard PCA and robust PCA with rank constraints.

1 Introduction

Dimensionality reduction, as an important form of unsupervised learning, has been widely explored for analyzing complex data such as images, video sequences, text documents, etc. It has been used to discover important latent information about observed data matrices for visualization, feature recovery, embedding and data cleaning. The fundamental assumption underlying dimensionality reduction is that the intrinsic structure of high dimensional observation data lies on a low dimensional linear subspace. Principal component analysis (PCA) [7] is a classic and one of the most commonly used dimensionality reduction methods. It seeks the best low-rank approximation of the given data matrix under a well understood least-squares reconstruction loss, and projects data onto an uncorrelated low dimensional subspace. Moreover, it admits an efficient procedure for computing optimal solutions via the singular value decomposition. These properties make PCA a well suited reduction method when the observed data is mildly corrupted with small Gaussian noise [12]. But standard PCA is very sensitive to high magnitude errors in the observed data. Even a small fraction of large errors can cause severe degradation in PCA's estimate of the low rank structure.

Real-life data, however, is often corrupted with large errors or even missing observations. To tackle dimensionality reduction with arbitrarily large errors and outliers, a number of approaches that robustify PCA have been developed in the literature, including ℓ1-norm regularized robust PCA [14], influence function techniques [5, 13], and alternating ℓ1-norm minimization [8].
Nevertheless, the capacity of these approaches to recover the low-rank structure of a corrupted data matrix still degrades as the fraction of large errors increases.

In this paper, we propose a novel robust transfer principal component analysis method to recover the low rank representation of heavily corrupted data by leveraging related uncorrupted auxiliary data. Seeking knowledge transfer from a related auxiliary data source for the target learning problem has been widely studied in supervised learning. It is also known that modeling related data sources together provides rich information for discovering their shared subspace representations [4]. We extend such a transfer learning scheme into the PCA framework to perform joint robust principal component analysis over a corrupted target data matrix and a related auxiliary source data matrix by enforcing the two robust PCA operations on the two data matrices to share a subset of common principal components, while maintaining their unique variations through individual principal components specific to each data matrix. This robust transfer PCA framework combines aspects of both robust PCA and transfer learning methodologies. We expect the critical low rank structure shared between the two data matrices can be effectively transferred from the uncorrupted auxiliary data to recover the low dimensional subspace representation of the heavily corrupted target data in a robust manner. We formulate this robust transfer PCA as a joint minimization problem over a convex combination of least squares losses with non-convex matrix rank constraints. Though a simple relaxation of the matrix rank constraints into convex nuclear norm constraints can lead to a convex optimization problem, it is very difficult to control the rank of the low-rank representation matrix we aim to recover through the nuclear norm.
We thus develop a proximal projected gradient descent optimization algorithm to solve the proposed optimization problem with rank constraints, which permits a convenient closed-form solution for each proximal step based on singular value decomposition and converges to a stationary point. Our experiments over image denoising tasks show the proposed method can effectively recover images corrupted with random large errors, and significantly outperform both standard PCA and robust PCA with rank constraints.

Notations: In this paper, we use In to denote an n × n identity matrix, use On,m to denote an n × m matrix with all 0 values, use ‖·‖F to denote the matrix Frobenius norm, and use ‖·‖∗ to denote the nuclear norm (trace norm).

2 Preliminaries

Assume we are given an observed data matrix X ∈ Rn×d consisting of n observations of d-dimensional feature vectors, which was generated by corrupting some entries of a latent low-rank matrix M ∈ Rn×d with an error matrix E ∈ Rn×d such that X = M + E. We aim to recover the low-rank matrix M by projecting the high dimensional observations X into a low dimensional manifold representation matrix Z ∈ Rn×k over the low dimensional subspace B ∈ Rk×d, such that M = ZB, BB⊤ = Ik for k < d.

2.1 PCA

Given the above setup, standard PCA assumes the error matrix E contains small i.i.d. Gaussian noises, and seeks an optimal low dimensional encoding matrix Z and basis matrix B to reconstruct X by X = ZB + E. Under a least squares reconstruction loss, PCA is equivalent to the following self-supervised regression problem

min_{Z,B} ‖X − ZB‖²_F   s.t. BB⊤ = Ik.   (1)

That is, standard PCA seeks the best rank-k estimate of the latent low-rank matrix M = ZB by solving

min_M ‖X − M‖²_F   s.t. rank(M) ≤ k.   (2)

Although the optimization problem in (1) or (2) is not convex and does not appear to be easy, it can be efficiently solved by performing a singular value decomposition (SVD) over X, and permits the following closed-form solution

B∗ = Vk⊤, Z∗ = XB∗⊤, M∗ = Z∗B∗,   (3)

where Vk is comprised of the top k right singular vectors of X. With this convenient solution, standard PCA has been widely used for modern data analysis and serves as an efficient and effective dimensionality reduction procedure when the error E is small and i.i.d. Gaussian [7].

2.2 Robust PCA

The validity of standard PCA however breaks down when the corrupted errors in the observed data matrix are large. Note that even a single grossly corrupted entry in the observation matrix X can render the recovered M∗ matrix shifted far away from the true low-rank matrix M. To recover the intrinsic low-rank matrix M from the observation matrix X corrupted with sparse large errors E, a polynomial-time robust PCA method has been developed in [14], which induces the following optimization problem

min_{M,E} rank(M) + γ‖E‖0   s.t. X = M + E.   (4)

By relaxing the non-convex rank function and the ℓ0-norm into their convex envelopes, the nuclear norm and the ℓ1-norm respectively, a convex relaxation of the robust PCA can be yielded

min_{M,E} ‖M‖∗ + λ‖E‖1   s.t. X = M + E.   (5)

With an appropriate choice of the λ parameter, one can exactly recover the M, E matrices that generated the observations X by solving this convex program.

To produce a scalable optimization for robust PCA, a more convenient relaxed formulation has been considered in [14]

min_{M,E} ‖M‖∗ + λ‖E‖1 + (α/2)‖M + E − X‖²_F   (6)

where the original equality constraint is replaced with a reconstruction loss penalty term.
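As a point of contrast with the robust formulations above, the closed-form PCA solution in (2)-(3) and its fragility to even a single large error can be sketched in a few lines of NumPy. This is my own illustrative code, not from the paper; the matrix sizes and the corruption value are arbitrary assumptions:

```python
import numpy as np

def pca_rank_k(X, k):
    # Best rank-k approximation of X in Frobenius norm (Eq. (2)-(3)):
    # a truncated singular value decomposition.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    B = Vt[:k]          # B*: top-k right singular vectors, shape (k, d)
    Z = X @ B.T         # Z*: low dimensional encoding, shape (n, k)
    return Z @ B        # M*: rank-k estimate of the latent matrix

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 8))  # true rank-2 matrix

# Clean data: PCA recovers the rank-2 matrix exactly at k = 2.
print(np.allclose(pca_rank_k(M, 2), M))          # True

# One grossly corrupted entry shifts the whole rank-2 estimate.
X = M.copy()
X[0, 0] += 1000.0
print(np.linalg.norm(pca_rank_k(X, 2) - M))      # large error, far from 0
```

The second call shows the failure mode that motivates robust PCA: a single entry corrupted by a large error drags the leading singular subspace toward it, so the rank-k estimate no longer matches the clean low-rank matrix.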
This formulation seeks the lowest rank M that can best reconstruct the observation matrix X subject to sparse errors E.

Though robust PCA can effectively recover the low-rank matrix when the large errors in the observed data are very sparse, its performance degrades when the observation data is heavily corrupted with dense large errors. In this work, we propose to tackle this problem by exploiting information from related uncorrupted auxiliary data.

3 Robust Transfer PCA

Exploring labeled information in a related auxiliary data set to assist the learning problem on a target data set has been widely studied in supervised learning scenarios within the context of transfer learning, domain adaptation and multi-task learning [10]. Moreover, it has also been shown that modeling related data sources together can provide useful information for discovering their shared subspace representations in an unsupervised manner [4]. The principle behind these knowledge transfer learning approaches is that related data sets can complement each other in identifying the intrinsic latent structure shared between them.

Following this transfer learning scheme, we present a robust transfer PCA method for recovering a low-rank matrix from a heavily corrupted observation matrix. Assume we are given a target data matrix Xt ∈ Rnt×d corrupted with errors of large magnitude, and a related source data matrix Xs ∈ Rns×d.
The robust transfer PCA aims to achieve the following robust joint matrix factorization

Xs = NsBc + ZsBs + Es,   (7)
Xt = NtBc + ZtBt + Et,   (8)

where Bc ∈ Rkc×d is the orthogonal basis matrix shared between the two data matrices, Bs ∈ Rks×d and Bt ∈ Rkt×d are the orthogonal basis matrices specific to each data matrix respectively, Ns ∈ Rns×kc, Nt ∈ Rnt×kc, Zs ∈ Rns×ks, Zt ∈ Rnt×kt are the corresponding low dimensional reconstruction coefficient matrices, and Es ∈ Rns×d and Et ∈ Rnt×d represent the additive errors in each data matrix. Let Zc = [Ns; Nt]. Given constant matrices As = [Ins, Ons,nt] and At = [Ont,ns, Int], we can re-express Ns and Nt in terms of the unified matrix Zc such that Ns = AsZc and Nt = AtZc. The learning problem of robust transfer PCA can then be formulated as the following joint minimization problem

min_{Zc,Zs,Zt,Bc,Bs,Bt,Es,Et} (αs/2)‖AsZcBc + ZsBs + Es − Xs‖²_F + (αt/2)‖AtZcBc + ZtBt + Et − Xt‖²_F + βs‖Es‖1 + βt‖Et‖1   (9)
s.t. BcBc⊤ = Ikc, BsBs⊤ = Iks, BtBt⊤ = Ikt

which minimizes the least squares reconstruction losses on both data matrices with ℓ1-norm regularizers over the additive error matrices. The intuition is that by sharing the common column basis vectors in Bc, one can best capture the common intrinsic low-rank structure of the data based on limited observations from both data sets, and by allowing data embedding onto the individual basis vectors Bs, Bt, the additional low-rank structure specific to each data set can be captured. Nevertheless, this is a difficult non-convex optimization problem as both the objective function and the orthogonality constraints are non-convex.
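The selector matrices As and At above are just identity blocks padded with zeros; a tiny NumPy sketch (my own illustration, with toy sizes chosen arbitrarily) makes the bookkeeping concrete:

```python
import numpy as np

# Selector matrices As = [I_ns, O_{ns,nt}] and At = [O_{nt,ns}, I_nt]
# pick the source/target rows out of the stacked matrix Zc = [Ns; Nt].
ns, nt, kc = 3, 2, 4
As = np.hstack([np.eye(ns), np.zeros((ns, nt))])   # shape (ns, ns + nt)
At = np.hstack([np.zeros((nt, ns)), np.eye(nt)])   # shape (nt, ns + nt)

rng = np.random.default_rng(0)
Ns = rng.standard_normal((ns, kc))
Nt = rng.standard_normal((nt, kc))
Zc = np.vstack([Ns, Nt])                           # Zc = [Ns; Nt]

print(np.allclose(As @ Zc, Ns))   # True: As Zc recovers Ns
print(np.allclose(At @ Zc, Nt))   # True: At Zc recovers Nt
```

This is why the shared coefficients can be optimized as the single variable Zc while each reconstruction loss still sees only its own block.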
To simplify this optimization problem, we introduce the replacements

Mc = ZcBc, Ms = ZsBs, Mt = ZtBt   (10)

and rewrite the optimization problem (9) equivalently into the formulation below

min_{Mc,Ms,Mt,Es,Et} (αs/2)‖AsMc + Ms + Es − Xs‖²_F + (αt/2)‖AtMc + Mt + Et − Xt‖²_F + βs‖Es‖1 + βt‖Et‖1   (11)
s.t. rank(Mc) ≤ kc, rank(Ms) ≤ ks, rank(Mt) ≤ kt

which has an ℓ1-norm regularized convex objective function, but is subject to non-convex inequality rank constraints. A standard convexification of the rank constraints is to replace the rank functions with their convex envelopes, nuclear norms [3, 14, 1, 6, 15]. For example, one can replace the rank constraints in (11) with relaxed nuclear norm regularizers in the objective function

min_{Mc,Ms,Mt,Es,Et} (αs/2)‖AsMc + Ms + Es − Xs‖²_F + (αt/2)‖AtMc + Mt + Et − Xt‖²_F + βs‖Es‖1 + βt‖Et‖1 + λc‖Mc‖∗ + λs‖Ms‖∗ + λt‖Mt‖∗   (12)

Many efficient and scalable algorithms have been proposed to solve such nuclear norm regularized convex optimization problems, including the proximal gradient algorithm [6, 14] and fixed point and Bregman iterative methods [9]. However, though the nuclear norm is a convex envelope of the rank function, it is not always a high-quality approximation of the rank function [11]. Moreover, it is very difficult to select appropriate trade-off parameters λs, λt for the nuclear norm regularizers in (12) to recover the low-rank matrix solutions of the original optimization in (11). In principal component analysis problems it is much more convenient to have explicit control on the rank of the low-rank solution matrices.
Therefore, instead of solving the nuclear norm based convex optimization problem (12), we develop a scalable and efficient proximal gradient algorithm to solve the rank constraint based minimization problem (11) directly, which is shown to converge to a stationary point. After solving the optimization problem (11), the low-rank approximation of the corrupted matrix Xt can be obtained as X̂t = AtMc + Mt.

4 Proximal Projected Gradient Descent Algorithm

Proximal gradient methods have been popularly used for unconstrained convex optimization problems with continuous but non-smooth regularizers [2]. In this work, we develop a proximal projected gradient algorithm to solve the non-convex optimization problem with matrix rank constraints in (11). Let Θ = [Mc; Ms; Mt; Es; Et] be the optimization variable set of (11). We denote the objective function of (11) as F(Θ) such that

F(Θ) = f(Θ) + g(Θ)   (13)
f(Θ) = (αs/2)‖AsMc + Ms + Es − Xs‖²_F + (αt/2)‖AtMc + Mt + Et − Xt‖²_F   (14)
g(Θ) = βs‖Es‖1 + βt‖Et‖1   (15)

Algorithm 1 Proximal Projected Gradient Descent
Input: data matrices Xs, Xt; parameters αs, αt, βs, βt, kc, ks, kt.
Set η = 3 max(αs, αt), k = 1.
Initialize Mc(1), Ms(1), Mt(1), Es(1), Et(1) to zero matrices.
While not converged do
• Set Θ(k) = [Mc(k); Ms(k); Mt(k); Es(k); Et(k)].
• Update Mc(k+1) = pMc(η, Θ(k)), Ms(k+1) = pMs(η, Θ(k)), Mt(k+1) = pMt(η, Θ(k)), Es(k+1) = pEs(η, Θ(k)), Et(k+1) = pEt(η, Θ(k)).
• Set k = k + 1.
End While

Here f(Θ) is a convex and continuously differentiable function while g(Θ) is a convex but non-smooth function.
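The per-block proximal steps used by this algorithm have simple closed forms: truncated SVD for the rank-constrained blocks Mc, Ms, Mt, and elementwise soft-thresholding for the ℓ1-penalized error blocks Es, Et. A minimal sketch of the two operators (my own illustrative code under assumed names, not the paper's implementation):

```python
import numpy as np

def project_rank(A, k):
    # Closed-form step for a rank constraint rank(M) <= k:
    # keep only the top-k singular triplets of A.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def soft_threshold(A, tau):
    # Closed-form proximal step for tau * ||E||_1:
    # shrink every entry toward zero by tau.
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

A = np.array([[3.0, -0.5],
              [0.2, -2.0]])
print(soft_threshold(A, 0.5))     # entries shrink by 0.5; small ones become 0
print(np.linalg.matrix_rank(project_rank(A, 1)))  # 1
```

Each iteration of Algorithm 1 applies these operators to a gradient step on f, which is what makes every update cheap and exact.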
For any η > 0, we consider the following quadratic approximation of F(Θ) at a given point Θ̃ = [M̃c; M̃s; M̃t; Ẽs; Ẽt]

Qη(Θ, Θ̃) = f(Θ̃) + ⟨Θ − Θ̃, ∇f(Θ̃)⟩ + (η/2)‖Θ − Θ̃‖²_F + g(Θ)   (16)

where ∇f(Θ̃) is the gradient of the function f(·) at the point Θ̃. Let C = {Θ : rank(Mc) ≤ kc, rank(Ms) ≤ ks, rank(Mt) ≤ kt}. The minimization over Qη(Θ, Θ̃) can be conducted as

p(η, Θ̃) = arg min_{Θ∈C} Qη(Θ, Θ̃) = arg min_{Θ∈C} { g(Θ) + (η/2)‖Θ − (Θ̃ − (1/η)∇f(Θ̃))‖²_F }   (17)

which admits the following closed-form solution through singular value decomposition and soft-thresholding:

pMc(η, Θ̃) = Ukc Σkc Vkc⊤,  for UΣV⊤ = SVD(M̃c − (1/η)∇Mc f(Θ̃)),
pMs(η, Θ̃) = Uks Σks Vks⊤,  for UΣV⊤ = SVD(M̃s − (1/η)∇Ms f(Θ̃)),
pMt(η, Θ̃) = Ukt Σkt Vkt⊤,  for UΣV⊤ = SVD(M̃t − (1/η)∇Mt f(Θ̃)),
pEs(η, Θ̃) = (|Ês| − βs/η)+ ◦ sign(Ês),  with Ês = Ẽs − (1/η)∇Es f(Θ̃),
pEt(η, Θ̃) = (|Êt| − βt/η)+ ◦ sign(Êt),  with Êt = Ẽt − (1/η)∇Et f(Θ̃),

where Uk and Vk denote the top k singular vectors from U and V respectively, and Σk denotes the diagonal matrix with the corresponding top k singular values, for k ∈ {kc, ks, kt} respectively; the operator "◦" denotes the matrix Hadamard product, and the operator (·)+ = max(·, 0); ∇Mc f(Θ̃), ∇Ms f(Θ̃), ∇Mt f(Θ̃), ∇Es f(Θ̃), and ∇Et f(Θ̃) denote the parts of the gradient matrix ∇f(Θ̃) corresponding to Mc, Ms, Mt, Es, Et respectively.

Our proximal projected gradient algorithm is an iterative procedure. After first initializing the parameter matrices to zeros, in each k-th iteration it updates the model parameters by minimizing the approximation function Qη(Θ, Θ(k)) at the given point Θ(k), using the closed-form update equations above. The overall procedure is given in Algorithm 1. Below we discuss the convergence property of this algorithm.

Lemma 1 For η = 3 max(αs, αt), we have F(Θ) ≤ Qη(Θ, Θ̃) for every feasible pair Θ, Θ̃.

Proof: First, it is easy to check that η = 3 max(αs, αt) is a Lipschitz constant of ∇f(Θ), such that

‖∇f(Θ) − ∇f(Θ̃)‖F ≤ η‖Θ − Θ̃‖F  for any feasible pair Θ, Θ̃   (18)

Thus f(·) is a continuously differentiable function with Lipschitz continuous gradient and Lipschitz constant η. Following [2, Lemma 2.1], we have

f(Θ) ≤ f(Θ̃) + ⟨Θ − Θ̃, ∇f(Θ̃)⟩ + (η/2)‖Θ − Θ̃‖²_F  for any feasible pair Θ, Θ̃   (19)

Based on (16) and (19), we can then derive

F(Θ) = f(Θ) + g(Θ) ≤ f(Θ̃) + ⟨Θ − Θ̃, ∇f(Θ̃)⟩ + (η/2)‖Θ − Θ̃‖²_F + g(Θ) = Qη(Θ, Θ̃)   (20)

□

Based on this lemma, we can see the update steps of Algorithm 1 satisfy

F(Θ(k+1)) ≤ Qη(Θ(k+1), Θ(k)) ≤ Qη(Θ(k), Θ(k)) = F(Θ(k))   (21)

Therefore the sequence of points Θ(1), Θ(2), . . ., Θ∗ produced by Algorithm 1 has nonincreasing function values F(Θ(1)) ≥ F(Θ(2)) ≥ . . . 
≥ F(Θ∗), and converges to a stationary point.

5 Experiments

We evaluate the proposed approach using image denoising tasks constructed on the Yale Face Database, which contains 165 grayscale images of 15 individuals. There are 11 images per subject, one per different facial expression or configuration.

Our goal is to investigate the performance of the proposed approach on recovering data corrupted with large and dense errors. Thus we constructed noisy images by adding large errors. Let X0_t denote a target image matrix from one subject, which has values between 0 and 255. We randomly select a fraction of its pixels and set them to the value 255 to introduce large errors, where the fraction of noisy pixels is controlled using a noise level parameter σ. The obtained noisy target image matrix is Xt. We then use an uncorrupted image matrix X0_s from the same or a different subject as the source matrix to help the image denoising of Xt by recovering its low-rank approximation matrix X̂t. In the experiments, we compared the performance of the following methods on image denoising with large errors:

• R-T-PCA: This is the proposed robust transfer PCA method.
For this method, we used parameters αs = αt = 1, βs = βt = 0.1, unless otherwise specified.

• R-S-PCA: This is a robust shared PCA method that applies a rank-constrained version of the robust PCA in [14] on the concatenated matrix [X0_s; Xt] to recover a low-rank approximation matrix X̂t with rank kc + kt.

• R-PCA: This is a robust PCA method that applies a rank-constrained version of the robust PCA in [14] on Xt to recover a low-rank approximation matrix X̂t with rank kc + kt.

• S-PCA: This is a shared PCA method that applies PCA on the concatenated matrix [X0_s; Xt] to recover a low-rank approximation matrix X̂t with rank kc + kt.

• PCA: This method applies PCA on the noisy target matrix Xt to recover a low-rank approximation matrix X̂t with rank kc + kt.

• R-2Step-PCA: This method exploits the auxiliary source matrix by first performing robust PCA over the concatenated matrix [X0_s; Xt] to produce a shared matrix Mc with rank kc, and then performing robust PCA over the residue matrix (Xt − AtMc) to produce a matrix Mt with rank kt. The final low-rank approximation of Xt is given by X̂t = AtMc + Mt.

All the methods are evaluated using the root mean square error (RMSE) between the true target image matrix X0_t and the low-rank approximation matrix X̂t recovered from the noisy image matrix. Unless specified otherwise, we used kc = 8, ks = 3, kt = 3 in all experiments.

5.1 Intra-Subject Experiments

We first conducted experiments by constructing 15 transfer tasks for the 15 subjects. Specifically, for each subject, we used the first image matrix as the target matrix and used each of the remaining 10 image matrices as the source matrix each time. For each source matrix, we repeated the experiments 5 times by randomly generating a noisy target matrix using the procedure described above.
Thus in total, for each experiment, we have results from 50 runs. The average denoising results in terms of root mean square error (RMSE) with noise level σ = 5% are reported in Table 1. The standard deviations for these results range between 0.001 and 0.015. We also present one visualization result for Task-1 in Figure 1. We can see that the proposed method R-T-PCA outperforms all other methods across all the 15 tasks.

Table 1: The average denoising results in terms of RMSE at noise level σ = 5%.

Tasks    R-T-PCA  R-S-PCA  R-PCA  S-PCA  PCA    R-2Step-PCA
Task-1   0.143    0.185    0.218  0.330  0.365  0.212
Task-2   0.134    0.167    0.201  0.320  0.353  0.202
Task-3   0.136    0.153    0.226  0.386  0.430  0.215
Task-4   0.140    0.162    0.201  0.369  0.406  0.215
Task-5   0.142    0.166    0.241  0.382  0.414  0.208
Task-6   0.156    0.195    0.196  0.290  0.310  0.202
Task-7   0.172    0.206    0.300  0.477  0.523  0.264
Task-8   0.203    0.222    0.223  0.348  0.386  0.243
Task-9   0.140    0.159    0.203  0.317  0.349  0.201
Task-10  0.198    0.209    0.259  0.394  0.439  0.255
Task-11  0.191    0.211    0.283  0.389  0.423  0.274
Task-12  0.151    0.189    0.194  0.337  0.366  0.213
Task-13  0.193    0.218    0.277  0.436  0.474  0.257
Task-14  0.176    0.201    0.240  0.366  0.392  0.224
Task-15  0.159    0.170    0.266  0.413  0.464  0.245

[Figure 1: Denoising results on Task-1. Panels show the source image, the target image with 5% noise, and the recoveries produced by R-T-PCA, R-S-PCA, R-PCA, S-PCA, PCA and R-2Step-PCA.]

The comparison between the two groups of methods, {R-T-PCA, R-S-PCA, S-PCA} and {R-PCA, PCA}, shows that a related source matrix is indeed useful for denoising the target matrix. The superior performance of R-T-PCA over R-2Step-PCA demonstrates the effectiveness of our joint optimization framework over its stepwise alternative. The superior performance of R-T-PCA over R-S-PCA and S-PCA demonstrates the efficacy of our transfer PCA framework in exploiting the auxiliary source matrix over methods that simply concatenate the auxiliary source matrix and the target matrix.

5.2 Cross-Subject Experiments

Next, we conducted transfer experiments using source and target matrices from different subjects. We randomly constructed 5 transfer tasks, Task-6-1, Task-8-2, Task-9-4, Task-12-8 and Task-14-11, where the first number in the task name denotes the source subject index and the second number denotes the target subject index. For example, to construct Task-6-1, we used the first image matrix from subject-6 as the source matrix and used the first image matrix from subject-1 as the target matrix.
For each task, we conducted experiments with two different noise levels, 5% and 10%. We repeated each experiment 10 times using randomly generated noisy target matrices. The average results in terms of RMSE are reported in Table 2, with standard deviations less than 0.015. We can see that as the noise level increases, the performance of all methods degrades. But at each noise level, the comparison results are similar to what we observed in the previous experiments: the proposed method outperforms all other methods. These results also suggest that even a remotely related source image can be useful. All these experiments demonstrate the efficacy of the proposed method in exploiting an uncorrupted auxiliary data matrix for denoising target images corrupted with large errors.

Table 2: The average denoising results in terms of RMSE.

Tasks       Noise   R-T-PCA  R-S-PCA  R-PCA  S-PCA  PCA    R-2Step-PCA
Task-6-1    σ=5%    0.147    0.177    0.224  0.337  0.370  0.218
            σ=10%   0.203    0.246    0.326  0.490  0.526  0.291
Task-8-2    σ=5%    0.132    0.159    0.234  0.313  0.354  0.200
            σ=10%   0.154    0.211    0.323  0.457  0.500  0.276
Task-9-4    σ=5%    0.148    0.170    0.229  0.373  0.410  0.212
            σ=10%   0.204    0.240    0.344  0.546  0.585  0.282
Task-12-8   σ=5%    0.207    0.231    0.245  0.373  0.397  0.249
            σ=10%   0.244    0.272    0.359  0.518  0.548  0.317
Task-14-11  σ=5%    0.172    0.215    0.274  0.403  0.424  0.268
            σ=10%   0.319    0.368    0.431  0.592  0.612  0.372

[Figure 2: Parameter analysis on Task-6-1 with σ = 5%. Left: average RMSE of R-T-PCA for different β values. Right: average RMSE of all methods under different settings of (kc, ks, kt).]

5.3 Parameter Analysis

The optimization problem (11) for the proposed R-T-PCA method has a number of parameters to be set: αs, αt, βs, βt, kc, ks and kt. To investigate the influence of these parameters on the performance of the proposed method, we conducted two experiments using the first cross-subject task, Task-6-1, with noise level σ = 5%. Given that the source and target matrices are similar in size, in these experiments we set αs = αt = 1, βs = βt and ks = kt. In the first experiment, we set (kc, ks, kt) = (8, 3, 3) and studied the performance of R-T-PCA with different βs = βt = β values, for β ∈ {0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1}. The average RMSE results over 10 runs are presented in the left sub-figure of Figure 2. We can see that R-T-PCA is quite robust to β within the range of values {0.05, 0.1, 0.25, 0.5, 1}. In the second experiment, we fixed βs = βt = 0.1 and compared R-T-PCA with the other methods across a few different settings of (kc, ks, kt), with (kc, ks, kt) ∈ {(6, 3, 3), (8, 3, 3), (8, 5, 5), (10, 3, 3), (10, 5, 5)}. The average comparison results in terms of RMSE are presented in the right sub-figure of Figure 2. We can see that though the performance of all methods varies across different settings, R-T-PCA is less sensitive to the parameter changes compared to the other methods and it produced the best result across all the settings.

6 Conclusion

In this paper, we developed a novel robust transfer principal component analysis method to recover the low-rank representation of corrupted data by leveraging related uncorrupted auxiliary data. This robust transfer PCA framework combines aspects of both robust PCA and transfer learning methodologies.
We formulated this method as a joint minimization problem over a convex combination of least squares losses with non-convex matrix rank constraints, and developed a proximal projected gradient descent algorithm to solve the proposed optimization problem, which permits a convenient closed-form solution for each proximal step based on singular value decomposition and converges to a stationary point. Our experiments over image denoising tasks demonstrated that the proposed method can effectively exploit an auxiliary uncorrupted image to recover images corrupted with random large errors and significantly outperform a number of comparison methods.

References

[1] F. Bach. Consistency of trace norm minimization. Journal of Machine Learning Research, 9:1019-1048, 2008.

[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

[3] M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.

[4] S. Gupta, D. Phung, B. Adams, and S. Venkatesh. Regularized nonnegative shared subspace learning. Data Mining and Knowledge Discovery, 26:57-97, 2013.

[5] P. Huber. Robust Statistics. Wiley, New York, New York, 1981.

[6] S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In Proc. of the International Conference on Machine Learning (ICML), 2009.

[7] I. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, New York, 1986.

[8] Q. Ke and T. Kanade. Robust L1 norm factorization in the presence of outliers and missing data by alternative convex programming. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.

[9] S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, 2009.

[10] S. Pan and Q. Yang. A survey on transfer learning.
IEEE Transactions on Knowledge and Data Engineering, 2010.

[11] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471-501, 2010.

[12] M. Tipping and C. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3):611-622, 1999.

[13] F. De La Torre and M. Black. A framework for robust subspace learning. International Journal of Computer Vision (IJCV), 54(1-3):117-142, 2003.

[14] J. Wright, Y. Peng, Y. Ma, A. Ganesh, and S. Rao. Robust principal component analysis: Exact recovery of corrupted low-rank matrices by convex optimization. In Advances in Neural Information Processing Systems (NIPS), 2009.

[15] X. Zhang, Y. Yu, M. White, R. Huang, and D. Schuurmans. Convex sparse coding, subspace learning, and semi-supervised extensions. In Proc. of the AAAI Conference on Artificial Intelligence (AAAI), 2011.
", "award": [], "sourceid": 603, "authors": [{"given_name": "Yuhong", "family_name": "Guo", "institution": "Temple University"}]}