{"title": "Online Robust PCA via Stochastic Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 404, "page_last": 412, "abstract": "Robust PCA methods are typically based on batch optimization and have to load all the samples into memory. This prevents them from efficiently processing big data. In this paper, we develop  an Online Robust Principal Component Analysis (OR-PCA) that processes one sample per time instance and hence its memory cost is independent of the data size,  significantly enhancing the computation and storage efficiency. The proposed method is based on stochastic optimization of an equivalent reformulation of the batch RPCA method. Indeed, we show that OR-PCA provides a sequence of subspace estimations converging to the optimum of its batch counterpart and hence is provably robust  to sparse corruption. Moreover, OR-PCA can naturally be applied for tracking dynamic subspace. Comprehensive simulations on subspace recovering and tracking demonstrate the robustness and efficiency advantages of the OR-PCA over online PCA and batch RPCA methods.", "full_text": "Online Robust PCA via Stochastic Optimization\n\nJiashi Feng\n\nECE Department\n\nNational University of Singapore\n\njiashi@nus.edu.sg\n\nHuan Xu\n\nME Department\n\nNational University of Singapore\n\nmpexuh@nus.edu.sg\n\nShuicheng Yan\nECE Department\n\nNational University of Singapore\n\neleyans@nus.edu.sg\n\nAbstract\n\nRobust PCA methods are typically based on batch optimization and have to load\nall the samples into memory during optimization. This prevents them from ef-\n\ufb01ciently processing big data. In this paper, we develop an Online Robust PCA\n(OR-PCA) that processes one sample per time instance and hence its memory cost\nis independent of the number of samples, signi\ufb01cantly enhancing the computation\nand storage ef\ufb01ciency. The proposed OR-PCA is based on stochastic optimization\nof an equivalent reformulation of the batch RPCA. 
Indeed, we show that OR-PCA\nprovides a sequence of subspace estimations converging to the optimum of its\nbatch counterpart and hence is provably robust to sparse corruption. Moreover,\nOR-PCA can naturally be applied for tracking dynamic subspaces. Comprehensive\nsimulations on subspace recovery and tracking demonstrate the robustness and\nefficiency advantages of the OR-PCA over online PCA and batch RPCA methods.\n\n1 Introduction\n\nPrincipal Component Analysis (PCA) [19] is arguably the most widely used method for\ndimensionality reduction in data analysis. However, standard PCA is brittle in the presence of outliers and\ncorruptions [11]. Thus many techniques have been developed towards robustifying it [12, 4, 24, 25, 7].\nOne prominent example is the Principal Component Pursuit (PCP) method proposed in [4] that\nrobustly finds the low-dimensional subspace through decomposing the sample matrix into a low-rank\ncomponent and an overall sparse component. It is proved that both components can be recovered\nexactly through minimizing a weighted combination of the nuclear norm of the first term and the \u21131\nnorm of the second one. Thus the subspace estimation is robust to sparse corruptions.\nHowever, PCP and other robust PCA methods are all implemented in a batch manner. They need\nto access every sample in each iteration of the optimization. Thus, robust PCA methods require\nstoring all samples, in sharp contrast to standard PCA where only the covariance matrix is\nneeded. This pitfall severely limits their scalability to big data, which are becoming ubiquitous\nnow. Moreover, for an incremental sample set, when a new sample is added, the optimization\nprocedure has to be re-run on all available samples. 
This is quite inefficient in dealing with\nincremental sample sets such as network detection, video analysis and abnormal events tracking.\nAnother pitfall of batch robust PCA methods is that they cannot handle the case where the underlying\nsubspaces are changing gradually. For example, in video background modeling, the background\nis assumed to be static across different frames for applying robust PCA [4]. Such an assumption is too\nrestrictive in practice. A more realistic situation is that the background changes gradually as\nthe camera moves, corresponding to a gradually changing subspace. Unfortunately, traditional\nbatch RPCA methods may fail in this case.\nIn order to efficiently and robustly estimate the subspace of a large-scale or dynamic sample set,\nwe propose an Online Robust PCA (OR-PCA) method. OR-PCA processes only one sample per\ntime instance and thus is able to efficiently handle big data and dynamic sample sets, saving\nmemory and dynamically estimating the subspace of evolving samples. We briefly explain\nour intuition here. The major difficulty of implementing the previous RPCA methods, such as PCP,\nin an online fashion is that the adopted nuclear norm tightly couples the samples and thus the samples\nhave to be processed simultaneously. To tackle this, OR-PCA pursues the low-rank component in\na different manner: using an equivalent form of the nuclear norm, OR-PCA explicitly decomposes\nthe sample matrix into the product of the subspace basis and coefficients plus a sparse noise\ncomponent. Through such decomposition, the samples are decoupled in the optimization and can be\nprocessed separately. 
In particular, the optimization consists of two iterative updating components.\nThe first one is to project the sample onto the current basis and isolate the sparse noise (explaining\nthe outlier contamination), and the second one is to update the basis given the new sample.\nOur main technical contribution is to show that the above-mentioned iterative optimization scheme\nconverges to the global optimal solution of the original PCP formulation, which establishes the validity\nof our online method. Our proof is inspired by recent results from [16], which proposed an online\ndictionary learning method and provided a convergence guarantee for it.\nHowever, [16] can only guarantee that the solution converges to a stationary point\nof the optimization problem.\nBesides the nice behavior on single subspace recovery, OR-PCA can also be applied for tracking\na time-variant subspace naturally, since it updates the subspace estimate each time a\nnew sample is revealed. We conduct comprehensive simulations to demonstrate the advantages of OR-PCA for\nboth subspace recovery and tracking in this work.\n\n2 Related Work\n\nThe robust PCA algorithms based on nuclear norm minimization to recover low-rank matrices are\nnow standard, since the seminal works [21, 6]. Recent works [4, 5] have taken the nuclear norm\nminimization approach to the decomposition of a low-rank matrix and an overall sparse matrix.\nDifferent from the setting of samples being corrupted by sparse noise, [25, 24] and [7] solve robust\nPCA in the case that a few samples are completely corrupted. However, all of these RPCA methods\nare implemented in a batch manner and cannot be directly adapted to the online setup.\nThere are only a few pieces of work on online robust PCA [13, 20, 10], which we discuss below.\nIn [13], an incremental and robust subspace learning method is proposed. 
The method proposes\nto integrate the M-estimation into the standard incremental PCA calculation. Specifically, each\nnewly coming data point is re-weighted by a pre-defined influence function [11] of its residual\nto the current estimated subspace. However, no performance guarantee is provided in that work.\nIn [20], a compressive sensing based recursive robust PCA algorithm is proposed. The proposed\nmethod essentially solves compressive sensing optimization over a small batch of data to update the\nprincipal components estimation instead of using a single sample, and it is not clear how to extend\nthe method to the latter case. Recently, He et al. propose an incremental gradient descent method\non the Grassmannian manifold for solving the robust PCA problem, named GRASTA [10]. In each\niteration, GRASTA uses the gradient of the updated augmented Lagrangian function after revealing\na new sample to perform the gradient descent. However, no theoretical guarantee of the algorithmic\nconvergence for GRASTA is provided in that work. Moreover, in the experiments in this work, we\nshow that our proposed method is more robust than GRASTA to the sparse corruption and achieves\na higher breakdown point.\nThe most closely related work to ours in technique is [16], which proposes an online learning method\nfor dictionary learning and sparse coding. Based on that work, [9] proposes an online nonnegative\nmatrix factorization method. Both works can be seen as solving online matrix factorization problems\nwith specific constraints (sparse or non-negative). Though OR-PCA can also be seen as a kind of\nmatrix factorization, it is essentially different from those two works. In OR-PCA, an additive sparse\nnoise matrix is considered along with the matrix factorization. Thus the optimization and analysis\nare different from the ones in those works. 
In addition, benefiting from explicitly considering the\nnoise, OR-PCA is robust to sparse contamination, which is absent in either the dictionary learning or\nnonnegative matrix factorization works. Most importantly, in sharp contrast to [16, 9], which show that\ntheir methods converge to a stationary point, our method essentially solves a re-formulation of a\nconvex optimization problem, and hence we can prove that the method converges to the global optimum.\nAfter this paper was accepted, we found that similar works, which apply the same main idea of combining\nthe online learning framework in [16] with the factorization formulation of the nuclear norm, had been\npublished in [17, 18, 23]. However, in this work, we use a different optimization method from theirs. More\nspecifically, our proposed algorithm does not need to determine a step size or solve a Lasso subproblem.\n\n3 Problem Formulation\n\n3.1 Notation\nWe use bold letters to denote vectors. In particular, x \u2208 R^p denotes an authentic sample without\ncorruption, e \u2208 R^p is for the noise, and z \u2208 R^p is for the corrupted observation z = x + e. Here p\ndenotes the ambient dimension of the observed samples. Let r denote the intrinsic dimension of the\nsubspace underlying {x_i}_{i=1}^n. Let n denote the number of observed samples, t denote the index of\nthe sample/time instance. We use capital letters to denote matrices, e.g., Z \u2208 R^{p\u00d7n} is the matrix of\nobserved samples. Each column z_i of Z corresponds to one sample. 
For an arbitrary real matrix E,\nlet ||E||_F denote its Frobenius norm, ||E||_1 = \u03a3_{i,j} |E_{ij}| denote the \u21131-norm of E seen as a long\nvector in R^{p\u00d7n}, and ||E||_* = \u03a3_i \u03c3_i(E) denote its nuclear norm, i.e., the sum of its singular values.\n\n3.2 Objective Function Formulation\n\nRobust PCA (RPCA) aims to accurately estimate the subspace underlying the observed samples,\neven though the samples are corrupted by gross but sparse noise. As one of the most popular RPCA\nmethods, the Principal Component Pursuit (PCP) method [4] proposes to solve RPCA by\ndecomposing the observed sample matrix Z into a low-rank component X accounting for the low-dimensional\nsubspace plus an overall sparse component E incorporating the sparse corruption. Under mild\nconditions, PCP guarantees that the two components X and E can be exactly recovered through solving:\n\nmin_{X,E} (1/2) ||Z \u2212 X \u2212 E||_F^2 + \u03bb1 ||X||_* + \u03bb2 ||E||_1.    (1)\n\nTo solve the problem in (1), iterative optimization methods such as Accelerated Proximal Gradient\n(APG) [15] or Augmented Lagrangian Multiplier (ALM) [14] methods are often used. However,\nthese optimization methods are implemented in a batch manner. In each iteration of the optimization,\nthey need to access all samples to perform SVD. Hence a huge storage cost is incurred when solving\nRPCA for big data (e.g., web data, large image set).\nIn this paper, we consider online implementation of PCP. The main difficulty is that the nuclear norm\ncouples all the samples tightly and thus the samples cannot be considered separately as in typical\nonline optimization problems. 
To overcome this difficulty, we use an equivalent form of the nuclear\nnorm for the matrix X whose rank is upper bounded by r, as follows [21],\n\n||X||_* = inf_{L \u2208 R^{p\u00d7r}, R \u2208 R^{n\u00d7r}} { (1/2) ||L||_F^2 + (1/2) ||R||_F^2 : X = LR^T }.\n\nNamely, the nuclear norm is re-formulated as an explicit low-rank factorization of X. Such nuclear\nnorm factorization is developed in [3] and well established in recent works [22, 21]. In this\ndecomposition, L \u2208 R^{p\u00d7r} can be seen as the basis of the low-dimensional subspace and R \u2208 R^{n\u00d7r} denotes\nthe coefficients of the samples w.r.t. the basis. Thus, the RPCA problem (1) can be re-formulated as\n\nmin_{X, L \u2208 R^{p\u00d7r}, R \u2208 R^{n\u00d7r}, E} (1/2) ||Z \u2212 X \u2212 E||_F^2 + (\u03bb1/2) (||L||_F^2 + ||R||_F^2) + \u03bb2 ||E||_1, s.t. X = LR^T.\n\nSubstituting X by LR^T and removing the constraint, the above problem is equivalent to:\n\nmin_{L \u2208 R^{p\u00d7r}, R \u2208 R^{n\u00d7r}, E} (1/2) ||Z \u2212 LR^T \u2212 E||_F^2 + (\u03bb1/2) (||L||_F^2 + ||R||_F^2) + \u03bb2 ||E||_1.    (2)\n\nThough the reformulated objective function is not jointly convex w.r.t. the variables L and R, we\nprove below that the local minima of (2) are global optimal solutions to the original problem in (1). The\ndetails are given in the next section.\nGiven a finite set of samples Z = [z1, . . . 
, zn] \u2208 R^{p\u00d7n}, solving problem (2) indeed minimizes the\nfollowing empirical cost function,\n\nf_n(L) := (1/n) \u03a3_{i=1}^{n} \u2113(z_i, L) + (\u03bb1/(2n)) ||L||_F^2,    (3)\n\nwhere the loss function for each sample is defined as\n\n\u2113(z_i, L) := min_{r,e} (1/2) ||z_i \u2212 Lr \u2212 e||_2^2 + (\u03bb1/2) ||r||_2^2 + \u03bb2 ||e||_1.    (4)\n\nThe loss function measures the representation error for the sample z on a fixed basis L, where the\ncoefficients on the basis r and the sparse noise e associated with each sample are optimized to\nminimize the loss. In the stochastic optimization, one is usually interested in the minimization of\nthe expected cost over all the samples [16],\n\nf(L) := E_z[\u2113(z, L)] = lim_{n\u2192\u221e} f_n(L),    (5)\n\nwhere the expectation is taken w.r.t. the distribution of the samples z. In this work, we first establish\na surrogate function for this expected cost and then optimize the surrogate function for obtaining the\nsubspace estimation in an online fashion.\n\n4 Stochastic Optimization Algorithm for OR-PCA\n\nWe now present our Online Robust PCA (OR-PCA) algorithm. The main idea is to develop a\nstochastic optimization algorithm to minimize the empirical cost function (3), which processes one\nsample per time instance in an online manner. The coefficients r, noise e and basis L are optimized\nin an alternating manner. In the t-th time instance, we obtain the estimation of the basis L_t through\nminimizing the cumulative loss w.r.t. the previously estimated coefficients {r_i}_{i=1}^t and sparse noise\n{e_i}_{i=1}^t. 
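The per-sample minimization over r and e in the loss (4) can be sketched in a few lines. This is a minimal illustration and not the solver used by the authors: it assumes alternating exact minimization, where the r-update is a ridge-regression solve and the e-update is elementwise soft-thresholding. The names soft_threshold, sample_objective and project_sample, and the parameters n_iter and tol, are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    # Elementwise shrinkage, the proximal operator of tau * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sample_objective(z, L, r, e, lam1, lam2):
    # Objective inside the minimization in (4) for a given pair (r, e).
    fit = 0.5 * np.sum((z - L @ r - e) ** 2)
    return fit + 0.5 * lam1 * np.sum(r ** 2) + lam2 * np.sum(np.abs(e))

def project_sample(z, L, lam1, lam2, n_iter=500, tol=1e-10):
    # Assumed solver: alternate exact minimization over r and e, basis L fixed.
    k = L.shape[1]
    # G maps a vector v to the ridge solution (L^T L + lam1 I)^{-1} L^T v.
    G = np.linalg.solve(L.T @ L + lam1 * np.eye(k), L.T)
    e = np.zeros_like(z, dtype=float)
    for _ in range(n_iter):
        r = G @ (z - e)                           # r-update (closed form)
        e_new = soft_threshold(z - L @ r, lam2)   # e-update (shrinkage)
        done = np.linalg.norm(e_new - e) <= tol * max(1.0, np.linalg.norm(e_new))
        e = e_new
        if done:
            break
    r = G @ (z - e)  # make r optimal for the returned e
    return r, e
```

Because the objective in (4) is jointly convex in (r, e) when \u03bb1 > 0, each alternating step cannot increase it. The number of iterations needed depends on how large the singular values of L are relative to \u03bb1, so n_iter and tol may need tuning in practice.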
The objective function for updating the basis L_t is defined as,\n\ng_t(L) := (1/t) \u03a3_{i=1}^{t} ( (1/2) ||z_i \u2212 L r_i \u2212 e_i||_2^2 + (\u03bb1/2) ||r_i||_2^2 + \u03bb2 ||e_i||_1 ) + (\u03bb1/(2t)) ||L||_F^2.    (6)\n\nThis is a surrogate function of the empirical cost function f_t(L) defined in (3), i.e., it provides an\nupper bound for f_t(L): g_t(L) \u2265 f_t(L).\nThe proposed algorithm is summarized in Algorithm 1. Here, the subproblem in (7) involves solving\na small-size convex optimization problem, which can be solved efficiently by an off-the-shelf solver\n(see the supplementary material). To update the basis matrix L, we adopt the block-coordinate\ndescent with warm restarts [2]. In particular, each column of the basis L is updated individually\nwhile fixing the other columns.\nThe following theorem is the main theoretical result of the paper, which states that the solution from\nAlgorithm 1 will converge to the optimal solution of the batch optimization. Thus, the proposed\nOR-PCA converges to the correct low-dimensional subspace even in the presence of sparse noise,\nas long as the batch version \u2013 PCP \u2013 works.\nTheorem 1. Assume the observations are always bounded. Given that the rank of the optimal solution\nto (5) is provided as r, and the solution L_t \u2208 R^{p\u00d7r} provided by Algorithm 1 is full rank, then L_t\nconverges to the optimal solution of (5) asymptotically.\n\nNote that the assumption that observations are bounded is quite natural for realistic data (such as\nimages, videos). We find in the experiments that the final solution L_t is always full rank. A standard\nstochastic gradient descent method may further enhance the computational efficiency, compared\nwith the method used here. We leave this investigation for future research.\n\nAlgorithm 1 Stochastic Optimization for OR-PCA\n\nInput: {z1, . . . 
, zT} (observed data which are revealed sequentially), \u03bb1, \u03bb2 \u2208 R (regularization\nparameters), L_0 \u2208 R^{p\u00d7r}, r_0 \u2208 R^r, e_0 \u2208 R^p (initial solutions), T (number of iterations).\nfor t = 1 to T do\n  1) Reveal the sample z_t.\n  2) Project the new sample:\n\n     {r_t, e_t} = arg min_{r,e} (1/2) ||z_t \u2212 L_{t\u22121} r \u2212 e||_2^2 + (\u03bb1/2) ||r||_2^2 + \u03bb2 ||e||_1.    (7)\n\n  3) A_t \u2190 A_{t\u22121} + r_t r_t^T, B_t \u2190 B_{t\u22121} + (z_t \u2212 e_t) r_t^T.\n  4) Compute L_t with L_{t\u22121} as warm restart using Algorithm 2:\n\n     L_t := arg min_L (1/2) Tr[L (A_t + \u03bb1 I) L^T] \u2212 Tr(L^T B_t).    (8)\n\nend for\nReturn X_T = L_T R_T^T (low-rank data matrix), E_T (sparse noise matrix).\n\nAlgorithm 2 The Basis Update\n\nInput: L = [l_1, . . . , l_r] \u2208 R^{p\u00d7r}, A = [a_1, . . . , a_r] \u2208 R^{r\u00d7r}, and B = [b_1, . . . , b_r] \u2208 R^{p\u00d7r}.\n\u00c3 \u2190 A + \u03bb1 I.\nfor j = 1 to r do\n\n  l_j \u2190 (1/\u00c3_{j,j}) (b_j \u2212 L \u00e3_j) + l_j.    (9)\n\nend for\nReturn L.\n\n5 Proof Sketch\n\nIn this section we sketch the proof of Theorem 1. The details are deferred to the supplementary\nmaterial due to the space limit.\nThe proof of Theorem 1 proceeds in the following four steps: (I) we first prove that the surrogate\nfunction g_t(L_t) converges almost surely; (II) we then prove that the solution difference behaves as\n||L_t \u2212 L_{t\u22121}||_F = O(1/t); (III) based on (II) we show that f(L_t) \u2212 g_t(L_t) \u2192 0 almost surely, and\nthe gradient of f vanishes at the solution L_t when t \u2192 \u221e; (IV) finally we prove that L_t actually\nconverges to the optimal solution of the problem (5).\nTheorem 2 (Convergence of the surrogate function g_t). Let g_t denote the surrogate function defined\nin (6). 
Then, g_t(L_t) converges almost surely when the solution L_t is given by Algorithm 1.\n\nWe prove Theorem 2, i.e., the convergence of the stochastic positive process g_t(L_t) > 0, by showing\nthat it is a quasi-martingale. We first show that the summation of the positive difference of g_t(L_t) is\nbounded utilizing the fact that g_t(L_t) upper bounds the empirical cost f_t(L_t) and the loss function\n\u2113(z_t, L_t) is Lipschitz. These imply that g_t(L_t) is a quasi-martingale. Applying the lemma from [8]\nabout the convergence of quasi-martingales, we conclude that g_t(L_t) converges.\nNext, we show the difference of the two successive solutions converges to 0 as t goes to infinity.\nTheorem 3 (Difference of the solution L_t). For the two successive solutions obtained from\nAlgorithm 1, we have\n\n||L_{t+1} \u2212 L_t||_F = O(1/t) a.s.\n\nTo prove the above result, we first show that the function g_t(L) is strictly convex. This holds since the\nregularization component \u03bb1 ||L||_F^2 naturally guarantees that the eigenvalues of the Hessian matrix\nare bounded away from zero. Notice that this is essentially different from [16], where one has to\nassume that the smallest eigenvalue of the Hessian matrix is lower bounded. Then we further show\nthat the variation of the function g_t(L), g_t(L_t) \u2212 g_{t+1}(L_t), is Lipschitz if using the updating rule shown\nin Algorithm 2. Combining these two properties establishes Theorem 3.\nIn the third step, we show that the expected cost function f(L_t) is a smooth one, and the difference\nf(L_t) \u2212 g_t(L_t) goes to zero when t \u2192 \u221e. In order to show the regularity of the function f(L_t),\nwe first provide the following optimality condition of the loss function \u2113(L_t).\nLemma 1 (Optimality conditions of Problem (4)). 
r* \u2208 R^r and e* \u2208 R^p are a solution of Problem (4)\nif and only if\n\nC_\u039b (z_\u039b \u2212 e*_\u039b) = \u03bb2 sign(e*_\u039b),\n|C_{\u039b^c} (z_{\u039b^c} \u2212 e*_{\u039b^c})| \u2264 \u03bb2, otherwise,\nr* = (L^T L + \u03bb1 I)^{\u22121} L^T (z \u2212 e*),\n\nwhere C = I \u2212 L(L^T L + \u03bb1 I)^{\u22121} L^T and C_\u039b denotes the columns of matrix C indexed by \u039b =\n{j | e*[j] \u2260 0} and \u039b^c denotes the complementary set of \u039b. Moreover, the optimal solution is unique.\n\nBased on the above lemma, we can prove that the solution r* and e* are Lipschitz w.r.t. the basis L.\nThen, we can obtain the following results about the regularity of the expected cost function f.\nLemma 2. Assume the observations z are always bounded. Define\n\n{r*, e*} = arg min_{r,e} (1/2) ||z \u2212 Lr \u2212 e||_2^2 + (\u03bb1/2) ||r||_2^2 + \u03bb2 ||e||_1.\n\nThen, 1) the function \u2113 defined in (4) is continuously differentiable and\n\n\u2207_L \u2113(z, L) = (L r* + e* \u2212 z) r*^T;\n\n2) \u2207f(L) = E_z[\u2207_L \u2113(z, L)]; and 3) \u2207f(L) is Lipschitz.\nEquipped with the above regularities of the expected cost function f, we can prove the convergence\nof f, as stated in the following theorem.\nTheorem 4 (Convergence of f). Let g_t denote the surrogate function defined in (6). Then, 1)\nf(L_t) \u2212 g_t(L_t) converges almost surely to 0; and 2) f(L_t) converges almost surely, when the\nsolution L_t is given by Algorithm 1.\n\nFollowing the techniques developed in [16], we can show the solution obtained from Algorithm 1,\nL_\u221e, satisfies the first order optimality condition for minimizing the expected cost f(L). Thus the\nOR-PCA algorithm provides a solution converging to a stationary point of the expected loss.\nTheorem 5. 
The first order optimality condition for minimizing the objective function in (5) is satisfied\nby L_t, the solution provided by Algorithm 1, when t tends to infinity.\n\nFinally, to complete the proof, we establish the following result stating that any full-rank L that\nsatisfies the first order condition is the global optimal solution.\nTheorem 6. When the solution L satisfies the first order condition for minimizing the objective\nfunction in (5), the obtained solution L is the optimal solution of the problem (5) if L is full rank.\n\nCombining Theorem 5 and Theorem 6 directly yields Theorem 1 \u2013 the solution from Algorithm 1\nconverges to the optimal solution of Problem (5) asymptotically.\n\n6 Empirical Evaluation\n\nWe report some numerical results in this section. Due to space constraints, more results, including\nthose of subspace tracking, are deferred to the supplementary material.\n\n6.1 Medium-scale Robust PCA\n\nFigure 1: (a) Batch RPCA and (b) OR-PCA: subspace recovery performance under different corruption fraction \u03c1s\n(vertical axis) and rank/n (horizontal axis). Brighter color means better performance; (c) \u03c1s = 0.1 and (d) \u03c1s = 0.3: the\nperformance comparison of the OR-PCA, GRASTA, and online PCA methods against the number of\nrevealed samples under two different corruption levels \u03c1s with PCP as reference.\n\nWe here evaluate the ability of the proposed OR-PCA to correctly recover the subspace of\ncorrupted observations, under various settings of the intrinsic subspace dimension and error density. In\nparticular, we adopt the batch robust PCA method, Principal Component Pursuit [4], as the batch\ncounterpart of the proposed OR-PCA method for reference. PCP estimates the subspace in a batch\nmanner through solving the problem in (1) and outputs the low-rank data matrix. 
For fair comparison,\nwe follow the data generation scheme of PCP as in [4]: we generate a set of n clean data points\nas a product X = U V^T, where the sizes of U and V are p \u00d7 r and n \u00d7 r respectively. The\nelements of both U and V are i.i.d. sampled from the N(0, 1/n) distribution. Here U is the basis of\nthe subspace and the intrinsic dimension of the subspace spanned by U is r. The observations are\ngenerated through Z = X + E, where E is a sparse matrix with a fraction of \u03c1s non-zero elements.\nThe elements in E are from a uniform distribution over the interval of [\u22121000, 1000]. Namely, the\nmatrix E contains gross but sparse errors.\nWe run the OR-PCA and the PCP algorithms 10 times under the following settings: the ambient\ndimension and number of samples are set as p = 400 and n = 1,000; the intrinsic rank r of\nthe subspace varies from 4 to 200; the value of error fraction, \u03c1s, varies from very sparse 0.01 to\nrelatively dense 0.5. The trade-off parameters of OR-PCA are fixed as \u03bb1 = \u03bb2 = 1/\u221ap. The\nperformance is evaluated by the similarity between the subspace obtained from the algorithms and\nthe groundtruth.\nIn particular, the similarity is measured by the Expressed Variance (E.V.) (see\ndefinition in [24]). A larger value of E.V. means better subspace recovery.\nWe plot the averaged E.V. values of PCP and OR-PCA under different settings in a matrix form, as\nshown in Figure 1(a) and Figure 1(b) respectively. The results demonstrate that under relatively low\nintrinsic dimension (small rank/n) and sparse corruption (small \u03c1s), OR-PCA is able to recover\nthe subspace nearly perfectly (E.V.= 1). We also observe that the performance of OR-PCA is\nclose to that of the PCP. This demonstrates that the proposed OR-PCA method achieves comparable\nperformance with the batch method and verifies our convergence guarantee on the OR-PCA. 
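The synthetic data generation scheme used in this subsection can be sketched as follows. This is an illustrative reading of the description and not the authors' script: it assumes that N(0, 1/n) denotes variance 1/n, and that the corrupted positions are drawn uniformly at random without replacement. The function name generate_data and the seed argument are hypothetical.

```python
import numpy as np

def generate_data(p, n, r, rho_s, seed=0):
    rng = np.random.default_rng(seed)
    # Basis U (p x r) and coefficients V (n x r), entries i.i.d. N(0, 1/n).
    # Assumption: 1/n is the variance, so the standard deviation is sqrt(1/n).
    U = rng.normal(0.0, np.sqrt(1.0 / n), size=(p, r))
    V = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, r))
    X = U @ V.T  # clean low-rank samples, one sample per column
    # Sparse gross errors: a fraction rho_s of the entries is corrupted.
    # Assumption: the support is chosen uniformly at random.
    k = int(round(rho_s * p * n))
    idx = rng.choice(p * n, size=k, replace=False)
    E = np.zeros(p * n)
    E[idx] = rng.uniform(-1000.0, 1000.0, size=k)
    E = E.reshape(p, n)
    return X + E, X, E, U

# Example call with the medium-scale setting from the text (r = 80, rho_s = 0.1).
Z, X, E, U = generate_data(p=400, n=1000, r=80, rho_s=0.1)
```

The columns of Z can then be revealed one at a time to an online method, while the full matrix Z is passed to the batch method.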
In the\nrelatively difficult setting (high intrinsic dimension and dense error, shown in the top-right of the\nmatrix), OR-PCA performs slightly worse than the PCP, possibly because the number of streaming\nsamples is not enough to achieve convergence.\nTo better demonstrate the robustness of OR-PCA to corruptions and illustrate how the performance\nof OR-PCA is improved when more samples are revealed, we plot the performance curve of\nOR-PCA against the number of samples in Figure 1(c), under the setting of p = 400, n = 1,000,\n\u03c1s = 0.1, r = 80, and the results are averaged from 10 repetitions. We also apply GRASTA [10] to\nsolve this RPCA problem in an online fashion as a baseline. The parameters of GRASTA are set as\nthe values provided in the implementation package provided by the authors. We observe that when\nmore samples are revealed, both OR-PCA and GRASTA steadily improve the subspace recovery.\nHowever, our proposed OR-PCA converges much faster than GRASTA, possibly because in each\niteration OR-PCA obtains the optimal closed-form solution to the basis updating subproblem while\nGRASTA only takes one gradient descent step. Observe from the figure that after 200 samples are\nrevealed, the performance of OR-PCA is already satisfactory (E.V.> 0.8). However, GRASTA\nneeds about 400 samples to achieve the same performance. To show the robustness of the\nproposed OR-PCA, we also plot the performance of the standard online (or incremental) PCA [1] for\ncomparison. This work focuses on developing online robust PCA. The non-robustness of (online)\nPCA is independent of the optimization method used. Thus, we only compare with the basic online\nPCA method [1], which is enough for comparing robustness. The comparison results are given in\nFigure 1(c). We observe that as expected, the online PCA cannot recover the subspace correctly\n(E.V.\u2248 0.1), since standard PCA is fragile to gross corruptions. 
We then increase the corruption\nlevel to \u03c1s = 0.3, and plot the performance curve of the above methods in Figure 1(d). From the\nplot, it can be observed that the performance of GRASTA decreases severely (E.V.\u2248 0.3) while\nOR-PCA still achieves E.V. \u2248 0.8. The performance of PCP is around 0.88. This result clearly\ndemonstrates the robustness advantage of OR-PCA over GRASTA. In fact, from other simulation\nresults under different settings of intrinsic rank and corruption level (see supplementary material),\nwe observe that GRASTA breaks down at 25% corruption (the value of E.V. is zero). However,\nOR-PCA achieves a performance of E.V.\u2248 0.5, even in the presence of 50% outlier corruption.\n\n6.2 Large-scale Robust PCA\n\nWe now investigate the computational efficiency of OR-PCA and the performance for large scale\ndata. The samples are generated following the same model as explained in the above subsection.\nThe results are provided in Table 1. All of the experiments are run on a PC with 2.83GHz\nQuad CPU and 8GB RAM. Note that batch RPCA cannot process these data because it runs out of memory.\n\nTable 1: The comparison of OR-PCA and GRASTA under different settings of sample size (n) and\nambient dimensions (p). Here \u03c1s = 0.3, r = 0.1p. For each method, the time row shows the computational time\n(in \u00d710^3 seconds) and the E.V. row shows the corresponding E.V. value.\nThe results are based on the average of 5 repetitions and the variance is shown in the parentheses.\n\np | 1\u00d710^3 | 1\u00d710^3 | 1\u00d710^3 | 1\u00d710^4 | 1\u00d710^4\nn | 1\u00d710^6 | 1\u00d710^8 | 1\u00d710^10 | 1\u00d710^6 | 1\u00d710^8\nOR-PCA time | 0.013(0.0004) | 1.312(0.082) | 139.233(7.747) | 0.633(0.047) | 15.910(2.646)\nOR-PCA E.V. | 0.99(0.01) | 0.99(0.00) | 0.99(0.00) | 0.82(0.09) | 0.82(0.01)\nGRASTA time | 0.023(0.0008) | 2.137(0.016) | 240.271(7.564) | 2.514(0.011) | 252.630(2.096)\nGRASTA E.V. | 0.54(0.08) | 0.55(0.02) | 0.57(0.03) | 0.45(0.02) | 0.46(0.03)\n\nFrom the above results, we observe that OR-PCA is much more efficient and performs better than\nGRASTA. In fact, the computational time of OR-PCA is linear in the sample size and nearly linear\nin the ambient dimension. When the ambient dimension is large (p = 1\u00d710^4), OR-PCA is more\nefficient than GRASTA by up to an order of magnitude. We then compare OR-PCA\nwith batch PCP. In each iteration, batch PCP needs to perform an SVD plus a thresholding operation,\nwhose complexity is O(np^2). In contrast, for OR-PCA, in each iteration, the computational cost is\nO(pr^2), which is independent of the sample size and linear in the ambient dimension. To see this,\nnote that in step 2) of Algorithm 1, the computation complexity is O(r^2 + pr + r^3). Here O(r^3) is\nfor computing L^T L. The complexity of step 3) is O(r^2 + pr). For step 4) (i.e., Algorithm 2), the\ncost is O(pr^2) (updating each column of L requires O(pr) and there are r columns in total). Thus\nthe total complexity is O(r^2 + pr + r^3 + pr^2). Since p \u226b r, the overall complexity is O(pr^2).\nThe memory cost is significantly reduced too. The memory required for OR-PCA is O(pr), which\nis independent of the sample size. This is much smaller than the memory cost of the batch PCP\nalgorithm (O(pn)), where n \u226b p for large scale datasets. This is quite important for processing big\ndata. 
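The cost and memory figures above can be read off a short sketch of steps 3) and 4) of Algorithm 1. This is a minimal illustration under stated assumptions, not the authors' implementation: accumulate, update_basis and basis_objective are hypothetical names, and the objective is written with the trace taken over L (A + \u03bb1 I) L^T so that the matrix dimensions agree.

```python
import numpy as np

def accumulate(A, B, z, r, e):
    # Step 3): rank-one updates of the statistics, O(r^2 + p r) work.
    # Only A (r x r) and B (p x r) are kept, not the past samples.
    return A + np.outer(r, r), B + np.outer(z - e, r)

def update_basis(L, A, B, lam1):
    # Algorithm 2: one pass of block-coordinate descent over the columns of L.
    # Each column costs O(p r), so one pass over r columns costs O(p r^2).
    A_tilde = A + lam1 * np.eye(A.shape[0])
    L = L.copy()
    for j in range(L.shape[1]):
        L[:, j] += (B[:, j] - L @ A_tilde[:, j]) / A_tilde[j, j]
    return L

def basis_objective(L, A, B, lam1):
    # Quadratic minimized in step 4), up to terms that do not depend on L.
    A_tilde = A + lam1 * np.eye(A.shape[0])
    return 0.5 * np.trace(L @ A_tilde @ L.T) - np.trace(L.T @ B)
```

Each column update is an exact minimization of the quadratic over that column, so a pass never increases basis_objective. Warm-starting from the previous basis is what lets a single pass per sample suffice in practice.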
The proposed OR-PCA algorithm can be easily parallelized to further enhance its efficiency.\n\n7 Conclusions\n\nIn this work, we develop an online robust PCA (OR-PCA) method. Different from previous batch-based methods, OR-PCA need not \u201cremember\u201d all the past samples and achieves much higher storage efficiency. The main idea of OR-PCA is to reformulate the objective function of PCP (a widely applied batch RPCA algorithm) by decomposing the nuclear norm into an explicit product of two low-rank matrices, which can be solved by a stochastic optimization algorithm. We provide the convergence analysis of the OR-PCA method and show that OR-PCA converges to the solution of batch RPCA asymptotically. Comprehensive simulations demonstrate the effectiveness of OR-PCA.\n\nAcknowledgments\n\nJ. Feng and S. Yan are supported by the Singapore National Research Foundation under its International Research Centre @Singapore Funding Initiative and administered by the IDM Programme Office. H. Xu is partially supported by the Ministry of Education of Singapore through AcRF Tier Two grant R-265-000-443-112 and NUS startup grant R-265-000-384-133.\n\nReferences\n\n[1] M. Artac, M. Jogan, and A. Leonardis. Incremental PCA for on-line visual learning and recognition. In Pattern Recognition, 2002. Proceedings. 16th International Conference on, volume 3, pages 781\u2013784. IEEE, 2002.\n\n[2] D.P. Bertsekas. Nonlinear programming. Athena Scientific, 1999.\n\n[3] Samuel Burer and Renato Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Math. Program., 2003.\n\n[4] E.J. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? ArXiv:0912.3599, 2009.\n\n[5] V. Chandrasekaran, S. Sanghavi, P.A. Parrilo, and A.S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572\u2013596, 2011.\n\n[6] M. Fazel. 
Matrix rank minimization with applications. PhD thesis, Stanford University, 2002.\n\n[7] J. Feng, H. Xu, and S. Yan. Robust PCA in high-dimension: A deterministic approach. In ICML, 2012.\n\n[8] D.L. Fisk. Quasi-martingales. Transactions of the American Mathematical Society, 1965.\n\n[9] N. Guan, D. Tao, Z. Luo, and B. Yuan. Online nonnegative matrix factorization with robust stochastic approximation. Neural Networks and Learning Systems, IEEE Transactions on, 23(7):1087\u20131099, 2012.\n\n[10] Jun He, Laura Balzano, and John Lui. Online robust subspace tracking from partial information. arXiv preprint arXiv:1109.3827, 2011.\n\n[11] P.J. Huber and E. Ronchetti. Robust statistics. John Wiley & Sons, New York, 1981.\n\n[12] M. Hubert, P.J. Rousseeuw, and K.V. Branden. ROBPCA: a new approach to robust principal component analysis. Technometrics, 2005.\n\n[13] Y. Li. On incremental and robust subspace learning. Pattern Recognition, 2004.\n\n[14] Z. Lin, M. Chen, and Y. Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055, 2010.\n\n[15] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, and Y. Ma. Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix. Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2009.\n\n[16] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. JMLR, 2010.\n\n[17] Morteza Mardani, Gonzalo Mateos, and G Giannakis. Dynamic anomalography: Tracking network anomalies via sparsity and low rank. 2012.\n\n[18] Morteza Mardani, Gonzalo Mateos, and Georgios B Giannakis. Rank minimization for subspace tracking from incomplete data. In ICASSP, 2013.\n\n[19] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 1901.\n\n[20] C. Qiu, N. Vaswani, and L. Hogben. 
Recursive robust PCA or recursive sparse recovery in large but structured noise. arXiv preprint arXiv:1211.3754, 2012.\n\n[21] B. Recht, M. Fazel, and P.A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471\u2013501, 2010.\n\n[22] Jason Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 2005.\n\n[23] Pablo Sprechmann, Alex M Bronstein, and Guillermo Sapiro. Learning efficient sparse and low rank models. arXiv preprint arXiv:1212.3631, 2012.\n\n[24] H. Xu, C. Caramanis, and S. Mannor. Principal component analysis with contaminated data: The high dimensional case. In COLT, 2010.\n\n[25] H. Xu, C. Caramanis, and S. Sanghavi. Robust PCA via outlier pursuit. Information Theory, IEEE Transactions on, 58(5):3047\u20133064, 2012.", "award": [], "sourceid": 268, "authors": [{"given_name": "Jiashi", "family_name": "Feng", "institution": "NUS"}, {"given_name": "Huan", "family_name": "Xu", "institution": "NUS"}, {"given_name": "Shuicheng", "family_name": "Yan", "institution": "National University of Singapore"}]}