{"title": "Manifold denoising by Nonlinear Robust Principal Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 13390, "page_last": 13400, "abstract": "This paper extends robust principal component analysis (RPCA) to nonlinear manifolds. Suppose that the observed data matrix is the sum of a sparse component and a component drawn from some low dimensional manifold. Is it possible to separate them by using similar ideas as RPCA? Is there any benefit in treating the manifold as a whole as opposed to treating each local region independently? We answer these two questions affirmatively by proposing and analyzing an optimization framework that separates the sparse component from the manifold under noisy data. Theoretical error bounds are provided when the tangent spaces of the manifold satisfy certain incoherence conditions. We also provide a near optimal choice of the tuning parameters for the proposed optimization formulation with the help of a new curvature estimation method. The efficacy of our method is demonstrated on both synthetic and real datasets.", "full_text": "Manifold Denoising by Nonlinear Robust Principal\n\nComponent Analysis\n\nHe Lyu, Ningyu Sha, Shuyang Qin, Ming Yan, Yuying Xie, Rongrong Wang\n\nDepartment of Computational Mathematics, Science and Engineering\n\nMichigan State University\n\n{lyuhe,shaningy,qinshuya,myan,xyy,wangron6}@msu.edu\n\nAbstract\n\nThis paper extends robust principal component analysis (RPCA) to nonlinear mani-\nfolds. Suppose that the observed data matrix is the sum of a sparse component and\na component drawn from some low dimensional manifold. Is it possible to separate\nthem by using similar ideas as RPCA? Is there any bene\ufb01t in treating the manifold\nas a whole as opposed to treating each local region independently? 
We answer these two questions affirmatively by proposing and analyzing an optimization framework that separates the sparse component from the manifold under noisy data. Theoretical error bounds are provided when the tangent spaces of the manifold satisfy certain incoherence conditions. We also provide a near optimal choice of the tuning parameters for the proposed optimization formulation with the help of a new curvature estimation method. The efficacy of our method is demonstrated on both synthetic and real datasets.

1 Introduction

Manifold learning and graph learning are nowadays widely used in computer vision, image processing, and biological data analysis on tasks such as classification, anomaly detection, data interpolation, and denoising. In most applications, graphs are learned from the high dimensional data and used to facilitate traditional data analysis methods such as PCA, Fourier analysis, and data clustering [7, 8, 9, 15, 12]. However, the quality of the learned graph may be greatly jeopardized by outliers, which cause instabilities in all the aforementioned graph-assisted applications.

In recent years, several methods have been proposed to handle outliers in nonlinear data [11, 21, 3]. Despite the success of those methods, they only aim at detecting the outliers instead of correcting them. In addition, very few of them are equipped with theoretical analysis of their statistical performance. In this paper, we propose a novel non-task-driven algorithm for the mixed noise model in (1) and provide theoretical guarantees to control its estimation error. Specifically, we consider the mixed noise model
$$\tilde X_i = X_i + S_i + E_i, \qquad i = 1, \dots, n, \qquad (1)$$
where $X_i \in \mathbb{R}^p$ is the noiseless data independently drawn from some manifold $\mathcal{M}$ with an intrinsic dimension $d \ll p$, $E_i$ is i.i.d. Gaussian noise with small magnitude, and $S_i$ is sparse noise with possibly large magnitude.
If $S_i$ has a large entry, then the corresponding $\tilde X_i$ is usually considered an outlier. The goal of this paper is to simultaneously recover $X_i$ and $S_i$ from $\tilde X_i$, $i = 1, \dots, n$.

There are several benefits to recovering the noise term $S_i$ along with the signal $X_i$. First, the support of $S_i$ indicates the locations of the anomalies, which is informative in many applications. For example, if $X_i$ is the gene expression data from the $i$th patient, the nonzero elements of $S_i$ indicate the differentially expressed genes that are candidates for personalized medicine. Similarly, if $S_i$ is the result of malfunctioning hardware, its nonzero elements indicate the locations of the malfunctioning parts. Secondly, recovering $S_i$ allows the "outliers" to be pulled back to the data manifold instead of simply being discarded. This prevents a waste of information and is especially beneficial when data is insufficient. Thirdly, in some applications the sparse $S_i$ is part of the clean data rather than a noise term; the algorithm then provides a natural decomposition of the data into a sparse and a non-sparse component that may carry different pieces of information.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Along a similar line of research, Robust Principal Component Analysis (RPCA) [2] has received considerable attention and has demonstrated success in separating data from sparse noise in many applications. However, its assumption that the data lie in a low dimensional subspace is somewhat strict. In this paper, we generalize the RPCA idea to the nonlinear manifold setting. The major new components of our algorithm are: 1) an incorporation of the manifold curvature information into the optimization framework, and 2) a unified way to apply RPCA to a collection of tangent spaces of the manifold.

2 Methodology

Let X̃ = [X̃_1, …
, X̃_n] ∈ ℝ^{p×n} be the noisy data matrix containing n samples. Each sample is a vector in ℝ^p independently drawn from (1). The overall data matrix X̃ has the representation
$$\tilde X = X + S + E,$$
where $X$ is the clean data matrix, $S$ is the matrix of the sparse noise, and $E$ is the matrix of the Gaussian noise. We further assume that the clean data $X$ lies on some manifold $\mathcal{M}$ embedded in $\mathbb{R}^p$ with a small intrinsic dimension $d \ll p$ and that the samples are sufficient ($n \ge p$). The small intrinsic dimension assumption ensures that the data is locally low dimensional, so that the corresponding local data matrices are of low rank. This property allows the data to be separated from the sparse noise.

The key idea behind our method is to handle the data locally. We use the k nearest neighbors (kNN) to construct local data matrices, where $k$ is larger than the intrinsic dimension $d$. For a data point $X_i \in \mathbb{R}^p$, we define the local patch centered at it to be the set consisting of its kNN and itself; the local data matrix $X^{(i)}$ associated with this patch is $X^{(i)} = [X_{i_1}, X_{i_2}, \dots, X_{i_k}, X_i]$, where $X_{i_j}$ is the $j$th nearest neighbor of $X_i$. Let $\mathcal{P}_i$ be the restriction operator to the $i$th patch, i.e., $\mathcal{P}_i(X) = X P_i$, where $P_i$ is the $n \times (k+1)$ matrix that selects the columns of $X$ in the $i$th patch.
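The patch construction just described can be sketched in a few lines. This is our own illustration, not the authors' code; the function and variable names are ours, and it assumes the data points are distinct so that each point is its own nearest neighbor at distance zero.

```python
import numpy as np

def build_patches(X, k):
    """Construct kNN local patches (X is p x n, columns are samples).

    Returns a list of index arrays; patch i holds the k nearest neighbors
    of column i followed by i itself, mirroring X^(i) = [X_{i1},...,X_{ik}, X_i].
    """
    n = X.shape[1]
    # pairwise squared Euclidean distances between columns
    sq = np.sum(X**2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2 * X.T @ X
    patches = []
    for i in range(n):
        order = np.argsort(D2[i])      # nearest first; order[0] == i for distinct points
        nbrs = order[1:k + 1]          # the k nearest neighbors, excluding i itself
        patches.append(np.concatenate([nbrs, [i]]))
    return patches
```

Each returned index array plays the role of $P_i$: selecting those columns of $X$ yields the local data matrix $X^{(i)}$.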
Then $X^{(i)} = \mathcal{P}_i(X)$. Similarly, we define $S^{(i)} = \mathcal{P}_i(S)$, $E^{(i)} = \mathcal{P}_i(E)$, and $\tilde X^{(i)} = \mathcal{P}_i(\tilde X)$.

Since each local data matrix $X^{(i)}$ is nearly of low rank and $S$ is sparse, we can decompose the noisy data matrix into low-rank parts and sparse parts by solving the following optimization problem:
$$\{\hat S, \{\hat S^{(i)}\}_{i=1}^n, \{\hat L^{(i)}\}_{i=1}^n\} = \arg\min_{S,\, S^{(i)},\, L^{(i)}} F(S, \{S^{(i)}\}_{i=1}^n, \{L^{(i)}\}_{i=1}^n)$$
$$\equiv \arg\min_{S,\, S^{(i)},\, L^{(i)}} \sum_{i=1}^n \Big( \lambda_i \|\tilde X^{(i)} - L^{(i)} - S^{(i)}\|_F^2 + \|\mathcal{C}(L^{(i)})\|_* + \beta \|S^{(i)}\|_1 \Big) \quad \text{subject to } S^{(i)} = \mathcal{P}_i(S), \qquad (2)$$
where we take $\beta = \max\{k+1, p\}^{-1/2}$ as in RPCA, $\tilde X^{(i)} = \mathcal{P}_i(\tilde X)$ is the local data matrix on the $i$th patch, and $\mathcal{C}$ is the centering operator that subtracts the column mean: $\mathcal{C}(Z) = Z(I - \frac{1}{k+1}\mathbf{1}\mathbf{1}^T)$, where $\mathbf{1}$ is the $(k+1)$-dimensional column vector of all ones. Here we decompose the data on each patch into a low-rank part $L^{(i)}$ and a sparse part $S^{(i)}$ by imposing the nuclear norm and the entry-wise $\ell_1$ norm on $L^{(i)}$ and $S^{(i)}$, respectively. There are two key components in this formulation: 1) The local patches overlap (for example, the first data point $X_1$ may belong to several patches). Thus the constraint $S^{(i)} = \mathcal{P}_i(S)$ is particularly important because it ensures that copies of the same point on different patches (and those of the sparse noise on different patches) remain the same. 2) We do not require the $L^{(i)}$ to be restrictions of a universal $L$ to the $i$th patch, because the $L^{(i)}$ correspond to the local affine tangent spaces, and there is no reason for a point on the manifold to have the same projection onto different tangent spaces.
This seemingly subtle difference has a large impact on the final result.

If the data only contains sparse noise, i.e., $E = 0$, then $\hat X \equiv \tilde X - \hat S$ is the final estimate of $X$. If $E \neq 0$, we apply Singular Value Hard Thresholding [6] to truncate $\mathcal{C}(\tilde X^{(i)} - \mathcal{P}_i(S))$ and remove the Gaussian noise (see §6), and use the resulting $\hat L^{(i)}_{\tau^*}$ to construct a final estimate $\hat X$ of $X$ via the least squares fit
$$\hat X = \arg\min_{Z \in \mathbb{R}^{p \times n}} \sum_{i=1}^n \lambda_i \|\mathcal{P}_i(Z) - \hat L^{(i)}_{\tau^*}\|_F^2. \qquad (3)$$

The following discussion revolves around (2) and (3), and the structure of the paper is as follows. In §3, we explain the geometric meaning of each term in (2). In §4, we establish theoretical recovery guarantees for (2), which justify our choice of $\beta$ and allow us to theoretically choose $\lambda$. The calculation of $\lambda$ uses the curvature of the manifold, so in §5 we provide a simple method to estimate the average manifold curvature that is robust to sparse noise. The optimization algorithms that solve (2) and (3) are presented in §6, and the numerical experiments are in §7.

3 Geometric explanation

We provide a geometric intuition for the formulation (2). Let us write the clean data matrix $X^{(i)}$ on the $i$th patch in its Taylor expansion along the manifold,
$$X^{(i)} = X_i \mathbf{1}^T + T^{(i)} + R^{(i)}, \qquad (4)$$
where the Taylor series is expanded at $X_i$ (the center point of the $i$th patch), $T^{(i)}$ stores the first order term and its columns lie in the tangent space of the manifold at $X_i$, and $R^{(i)}$ contains all the higher order terms. The sum of the first two terms $X_i \mathbf{1}^T + T^{(i)}$ is the linear approximation to $X^{(i)}$, which is unknown if the tangent space is not given. This linear approximation precisely corresponds to the $L^{(i)}$ in (2), i.e., $L^{(i)} = X_i \mathbf{1}^T + T^{(i)}$.
Since the tangent space has the same dimensionality $d$ as the manifold, with randomly chosen points we have, with probability one, $\mathrm{rank}(T^{(i)}) = d$. As a result, $\mathrm{rank}(L^{(i)}) = \mathrm{rank}(X_i \mathbf{1}^T + T^{(i)}) \le d + 1$. By the assumption that $d < \min\{p, k\}$, we know that $L^{(i)}$ is indeed low rank.

Combining (4) with $\tilde X^{(i)} = X^{(i)} + S^{(i)} + E^{(i)}$, we find that the misfit term $\tilde X^{(i)} - L^{(i)} - S^{(i)}$ in (2) equals $E^{(i)} + R^{(i)}$. This implies that the misfit contains the higher order residues (i.e., the linear approximation error) and the Gaussian noise.

4 Theoretical choice of tuning parameters

To establish the error bound, we need a coherence condition on the tangent spaces of the manifold.

Definition 4.1 Let $U \in \mathbb{R}^{m \times r}$ ($m \ge r$) be a matrix with $U^* U = I$. The coherence of $U$ is defined as
$$\mu(U) = \frac{m}{r} \max_{k \in \{1, \dots, m\}} \|U^* e_k\|_2^2,$$
where $e_k$ is the $k$th element of the canonical basis. For a subspace $T$, its coherence is defined as $\mu(T) = \mu(V)$, where $V$ is an orthonormal basis of $T$; the coherence is independent of the choice of basis.

The following theorem is proved for local patches constructed using $\eta$-neighborhoods. We use kNN in the experiments because kNN is more robust to insufficient samples. The full version of Theorem 4.2 can be found in the supplementary material.

Theorem 4.2 [succinct version] Let each $X_i \in \mathbb{R}^p$, $i = 1, \dots, n$, be independently drawn from a compact manifold $\mathcal{M} \subseteq \mathbb{R}^p$ with intrinsic dimension $d$, endowed with the uniform distribution. Let $X_{i_j}$, $j = 1, \dots, k_i$, be the $k_i$ points falling in an $\eta$-neighborhood of $X_i$ with radius $\eta$, where $\eta > 0$ is some fixed small constant. These points form the matrix $X^{(i)} = [X_{i_1}, \dots, X_{i_{k_i}}, X_i]$.
For any $q \in \mathcal{M}$, let $T_q$ be the tangent space of $\mathcal{M}$ at $q$ and define $\bar\mu = \sup_{q \in \mathcal{M}} \mu(T_q)$. Suppose the support of the noise matrix $S^{(i)}$ is uniformly distributed among all sets of cardinality $m_i$. Then as long as $d < \rho_r \min\{k, p\}\, \bar\mu^{-1} \log^{-2} \max\{\bar k, p\}$ and $m_i \le 0.4\, \rho_s\, p\, k$ (here $\rho_r$ and $\rho_s$ are positive constants, $\bar k = \max_i k_i$, and $k = \min_i k_i$), then with probability over $1 - c_1 n \max\{k, p\}^{-10} - e^{-c_2 k}$ for some constants $c_1$ and $c_2$, the minimizer $\hat S$ of (2) with weights
$$\lambda_i = \frac{\min\{k_i + 1, p\}^{1/2}}{\epsilon_i}, \qquad \beta_i = \max\{k_i + 1, p\}^{-1/2}, \qquad (5)$$
has the error bound
$$\sum_i \|\mathcal{P}_i(\hat S) - S^{(i)}\|_{2,1} \le C \sqrt{pn}\, \bar k\, \|\epsilon\|_2.$$
Here $\epsilon_i = \|\tilde X^{(i)} - X_i \mathbf{1}^T - T^{(i)} - S^{(i)}\|_F$ will be estimated in the next section, $\epsilon = [\epsilon_1, \dots, \epsilon_n]$, $\|\cdot\|_{2,1}$ stands for taking the $\ell_2$ norm along columns and the $\ell_1$ norm along rows, and $T^{(i)}$ is the projection of $X^{(i)} - X_i \mathbf{1}^T$ onto the tangent space $T_{X_i}$.

Remark. We can interpret $\epsilon$ as the total noise in the data. As explained in §3, $\|\tilde X^{(i)} - X_i \mathbf{1}^T - T^{(i)} - S^{(i)}\|_F = \|R^{(i)} + E^{(i)}\|_F$, thus $\epsilon = 0$ if the manifold is linear and the Gaussian noise is absent. The factor $\sqrt{n}$ in front of $\|\epsilon\|_2$ takes into account the use of different norms on the two sides (the right hand side involves the Frobenius norm of the noise matrix $\{R^{(i)} + E^{(i)}\}_{i=1}^n$, obtained by stacking the $R^{(i)} + E^{(i)}$ associated with each patch into one big matrix). The factor $\sqrt{p}$ is due to the small weight $\beta_i$ on $\|S^{(i)}\|_1$ compared to the weight 1 on $\|\tilde X^{(i)} - L^{(i)} - S^{(i)}\|_F^2$.
The factor $\bar k$ appears because, on average, each column of $\hat S - S$ is added about $\frac{1}{n}\sum_i k_i$ times on the left hand side.

5 Estimating the curvature

The definition of $\lambda_i$ in (5) involves an unknown quantity $\epsilon_i^2 = \|\tilde X^{(i)} - X_i \mathbf{1}^T - T^{(i)} - S^{(i)}\|_F^2 \equiv \|R^{(i)} + E^{(i)}\|_F^2$. We assume the standard deviation $\sigma$ of the i.i.d. Gaussian entries of $E^{(i)}$ is known, so $\|E^{(i)}\|_F^2$ can be approximated. Since $R^{(i)}$ is independent of $E^{(i)}$, the cross term $\langle R^{(i)}, E^{(i)} \rangle$ is small. Our main task is estimating $\|R^{(i)}\|_F^2$, the linear approximation error defined in §3. In local regions, second order terms dominate the linear approximation residue, hence estimating $\|R^{(i)}\|_F^2$ requires the curvature information.

5.1 A short review of related concepts in Riemannian geometry

The principal curvatures at a point on a high dimensional manifold are defined as the singular values of the second fundamental form [10]. As estimating all the singular values from noisy data may not be stable, we are only interested in estimating the mean curvature, that is, the root mean square of the principal curvatures.

For simplicity of illustration, we review the related concepts using a 2D surface $\mathcal{M}$ embedded in $\mathbb{R}^3$ (Figure 1). For any curve $\gamma(s)$ in $\mathcal{M}$ parametrized by arclength with unit tangent vector $t_\gamma(s)$, its curvature is the norm of the covariant derivative of $t_\gamma$: $\|dt_\gamma(s)/ds\| = \|\gamma''(s)\|$. In particular, we have the decomposition
$$\gamma''(s) = k_g(s)\hat v(s) + k_n(s)\hat n(s),$$
where $\hat n(s)$ is the unit normal direction of the manifold at $\gamma(s)$ and $\hat v$ is the direction perpendicular to $\hat n(s)$ and $t_\gamma(s)$, i.e., $\hat v = \hat n \times t_\gamma(s)$.
The coefficient $k_n(s)$ along the normal direction is called the normal curvature, and the coefficient $k_g(s)$ along the perpendicular direction $\hat v$ is called the geodesic curvature. The principal curvatures depend purely on $k_n$. In particular, in 2D, the principal curvatures are precisely the maximum and minimum of $k_n$ among all possible directions.

A natural way to compute the normal curvature is through geodesic curves. The geodesic curve between two points is the shortest curve connecting them, so geodesic curves are usually viewed as "straight lines" on the manifold. Geodesic curves have the favorable property that their curvature has zero contribution from $k_g$; that is, the second order derivative of a geodesic curve parameterized by arclength is exactly $k_n$.

Figure 1: Local manifold geometry

Algorithm 1: Estimate the mean curvature $\bar\Gamma(p)$ at some point $p$
Input: distance matrix $D$, adjacency matrix $A$, proper constants $r_1 < r_2$, number of pairs $m$
Output: the estimated mean curvature $\bar\Gamma(p)$
1 for $i = 1{:}m$ do
2   Randomly pick a point $q_i \in B(p, r_2) \setminus B(p, r_1)$;
3   Calculate the geodesic distance $d_g(p, q_i)$ using $A$;
4   Solve for the radius $R_i$ based on (7);
5 end
6 Compute the estimated curvature $\bar\Gamma(p) = \big(\frac{1}{m}\sum_{i=1}^m R_i^{-2}\big)^{1/2}$.

Algorithm 2: Estimate the overall curvature $\bar\Gamma(\Omega)$ for some region $\Omega$
Input: distance matrix $D$, adjacency matrix $A$, proper constants $r_1 < r_2$, number of pairs $m$
Output: the estimated overall curvature $\bar\Gamma(\Omega)$
1 for $i = 1{:}m$ do
2   Randomly pick a pair of points $p_i, q_i \in \Omega$ such that $r_1 \le d(p_i, q_i) \le r_2$;
3   Calculate the geodesic distance $d_g(p_i, q_i)$ using $A$;
4   Solve for the radius $R_i$ based on (7);
5 end
6 Compute the estimated curvature $\bar\Gamma(\Omega) = \big(\frac{1}{m}\sum_{i=1}^m R_i^{-2}\big)^{1/2}$.

5.2 The proposed method

All existing
curvature estimation methods we are aware of are in the field of computer vision, where the objects are 2D surfaces in 3D [5, 4, 19, 14]. Most of these methods are difficult to generalize to high (> 3) dimensions, with the exception of the integral invariant based approaches [17]. However, the integral invariant based approaches are not robust to sparse noise and are unsuited to our problem.

We propose a new method to estimate the mean curvature from the noisy data. Although the graphic illustration is made in 3D, the method is dimension independent. To compute the average normal curvature at a point $p \in \mathcal{M}$, we randomly pick $m$ points $q_i \in \mathcal{M}$ on the manifold lying within a proper distance to $p$, as specified in Algorithm 1. Let $\gamma_i$ be the geodesic curve between $p$ and $q_i$. For each $i$, we compute the pairwise Euclidean distance $\|p - q_i\|_2$ and the pairwise geodesic distance $d_g(p, q_i)$ using Dijkstra's algorithm. Through a circular approximation of the geodesic curve as drawn in Figure 1, we can compute the curvature of the geodesic curve as the inverse of the radius,
$$\|\gamma_i''(p)\| = 1/R_{\gamma_i'}, \qquad (6)$$
where $\gamma_i'$ is the tangent direction along which the curvature is calculated and $R_{\gamma_i'}$ is the radius of the circular approximation to the curve $\gamma_i$ at $p$, which can be solved along with the angle $\theta_{\gamma_i'}$ through the geometric relations
$$2 R_{\gamma_i'} \sin(\theta_{\gamma_i'}/2) = \|p - q_i\|_2, \qquad R_{\gamma_i'}\, \theta_{\gamma_i'} = d_g(p, q_i), \qquad (7)$$
as indicated in Figure 1. Finally, we define the average curvature $\bar\Gamma(p)$ at $p$ to be
$$\bar\Gamma(p) := \big(\mathbb{E}_{q_i} \|\gamma_i''(p)\|^2\big)^{1/2} \equiv \big(\mathbb{E}_{q_i} R_{\gamma_i}^{-2}\big)^{1/2}. \qquad (8)$$

To estimate the mean curvature from the data, we construct two matrices $D$ and $A$.
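Step 4 of Algorithms 1 and 2, solving the relations (7) for the radius, reduces to a one-dimensional root find: substituting $R = d_g/\theta$ into the chord equation gives $2 d_g \sin(\theta/2)/\theta = \|p - q_i\|_2$, whose left hand side is monotone in $\theta$ on $(0, 2\pi)$. A minimal sketch (our own hypothetical helper, not the authors' code; the bisection scheme and tolerance are our choices):

```python
import numpy as np

def circle_radius(d_euc, d_geo, tol=1e-10):
    """Radius of the circular arc with chord length d_euc and arc length d_geo.

    Solves 2 R sin(theta/2) = d_euc together with R * theta = d_geo, i.e. the
    relations (7), by bisection on theta in (0, 2*pi). Returns infinity for a
    numerically straight segment (zero curvature).
    """
    if d_geo <= d_euc:               # arc no longer than chord: treat as straight
        return np.inf
    # f is strictly decreasing in theta on (0, 2*pi)
    f = lambda th: 2.0 * d_geo * np.sin(th / 2.0) / th - d_euc
    lo, hi = 1e-12, 2.0 * np.pi - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:               # root lies to the right of mid
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    return d_geo / theta
```

For a quarter arc of the unit circle ($d_g = \pi/2$, chord $\sqrt{2}$), the recovered radius is 1, matching the circular approximation in Figure 1.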
$D \in \mathbb{R}^{n \times n}$ is the pairwise distance matrix, where $D_{ij}$ denotes the Euclidean distance between points $X_i$ and $X_j$. $A$ is a type of adjacency matrix, defined as follows and used to compute the pairwise geodesic distances from the data:
$$A_{ij} = \begin{cases} D_{ij} & \text{if } X_j \text{ is in the } k \text{ nearest neighbors of } X_i, \\ 0 & \text{elsewhere.} \end{cases} \qquad (9)$$
Algorithm 1 estimates the mean curvature at some point $p$, and Algorithm 2 estimates the overall curvature within some region $\Omega$ on the manifold.

The geodesic distance is computed using Dijkstra's algorithm, which is not accurate when $p$ and $q$ are too close to each other. The constant $r_1$ in Algorithms 1 and 2 is thus used to make sure that $p$ and $q$ are sufficiently far apart. The constant $r_2$ makes sure that $q$ is not too far from $p$, since after all we are computing the mean curvature around $p$.

5.3 Estimating $\lambda_i$ from the mean curvature

We provide a way to approximate $\lambda_i$ when the number of points $n$ is finite. In the asymptotic limit ($k \to \infty$, $k/n \to 0$), all the approximate signs "$\approx$" below become "$=$".

Fix a point $p \in \mathcal{M}$ and another point $q_i$ in the $\eta$-neighborhood of $p$. Let $\gamma_i$ be the geodesic curve between them. With the computed curvature $\bar\Gamma(p)$, we can estimate the linear approximation error of expanding $q_i$ at $p$: $q_i \approx p + P_{T_p}(q_i - p)$, where $P_{T_p}$ is the projection onto the tangent space at $p$. Let $\mathcal{E}$ be the error of this linear approximation, $\mathcal{E}(q_i, p) = q_i - p - P_{T_p}(q_i - p) = P_{T_p^\perp}(q_i - p)$, where $T_p^\perp$ is the orthogonal complement of the tangent space.
From Figure 1, the relation between $\|\mathcal{E}(p, q_i)\|_2$, $\|p - q_i\|_2$, and $\theta_{\gamma_i'}$ is
$$\|\mathcal{E}(p, q_i)\|_2 \approx \|p - q_i\|_2 \sin\frac{\theta_{\gamma_i'}}{2} = \frac{\|p - q_i\|_2^2}{2 R_{\gamma_i'}}. \qquad (10)$$
To obtain a closed-form formula for $\mathcal{E}$, we assume that for the fixed $p$ and a randomly chosen $q_i$ in an $\eta$-neighborhood of $p$, the projection $P_{T_p}(q_i - p)$ follows a uniform distribution in a ball with radius $\eta'$ (in fact $\eta' \approx \eta$, since when $\eta$ is small the projection of $q - p$ is almost $q - p$ itself, and therefore the radius of the projected ball almost equals the radius of the original neighborhood). Under this assumption, let $r_i = \|P_{T_p}(q_i - p)\|_2$ be the magnitude of the projection and $\phi_i = P_{T_p}(q_i - p)/\|P_{T_p}(q_i - p)\|_2$ be the direction; by [20], $r_i$ and $\phi_i$ are independent of each other. As the curvature $R_{\gamma_i}$ only depends on the direction, the numerator and the denominator of the right hand side of (10) are independent of each other. Therefore,
$$\mathbb{E}\|\mathcal{E}(p, q_i)\|_2^2 \approx \mathbb{E}\frac{\|p - q_i\|_2^4}{4 R_{\gamma_i'}^2} = \frac{\mathbb{E}\|p - q_i\|_2^4}{4}\, \mathbb{E} R_{\gamma_i'}^{-2} = \frac{\mathbb{E}\|p - q_i\|_2^4}{4} \cdot \bar\Gamma^2(p), \qquad (11)$$
where the first equality uses the independence and the last equality uses the definition of the mean curvature in the previous subsection.

Now we apply this estimation to the neighborhood of $X_i$. Let $p = X_i$, and let $q_j = X_{i_j}$ be the neighbors of $X_i$.
Using (11), the average linear approximation error on this patch is
$$\frac{1}{k}\|R^{(i)}\|_F^2 := \frac{1}{k}\sum_{j=1}^k \|\mathcal{E}(X_{i_j}, X_i)\|_2^2 \xrightarrow{\,k\to\infty\,} \frac{\mathbb{E}\|X_i - X_{i_j}\|_2^4}{4}\,\bar\Gamma^2(X_i), \qquad (12)$$
where the right hand side can also be estimated with
$$\frac{1}{k}\sum_{j=1}^k \frac{\|X_i - X_{i_j}\|_2^4}{4}\,\bar\Gamma^2(X_i) \xrightarrow{\,k\to\infty\,} \frac{\mathbb{E}\|X_i - X_{i_j}\|_2^4}{4}\,\bar\Gamma^2(X_i), \qquad (13)$$
so when $k$ is sufficiently large, $\frac{1}{k}\|R^{(i)}\|_F^2$ is also close to $\frac{1}{k}\sum_{j=1}^k \frac{\|X_i - X_{i_j}\|_2^4}{4}\,\bar\Gamma^2(X_i)$, which can be completely computed from the data. Combining this with the argument at the beginning of §5, we get
$$\epsilon_i = \|R^{(i)} + E^{(i)}\|_F \approx \sqrt{\|R^{(i)}\|_F^2 + \|E^{(i)}\|_F^2} \approx \Big((k+1)p\sigma^2 + \sum_{j=1}^k \frac{\|X_i - X_{i_j}\|_2^4}{4}\,\bar\Gamma^2(X_i)\Big)^{1/2} =: \hat\epsilon_i.$$
Thus we can set $\hat\lambda_i = \frac{\min\{k+1, p\}^{1/2}}{\hat\epsilon_i}$, due to (5). We show in the supplementary material that $\big|\frac{\hat\lambda_i - \lambda_i^*}{\lambda_i^*}\big| \xrightarrow{\,k\to\infty\,} 0$, where $\lambda_i^* = \frac{\min\{k+1, p\}^{1/2}}{\epsilon_i}$ as in (5).

6 Optimization algorithm

To solve the convex optimization problem (2) in a memory-economic way, we first write the $L^{(i)}$ as functions of $S$ and eliminate them from the problem.
We can do so by fixing $S$ and minimizing the objective function with respect to $L^{(i)}$:
$$\hat L^{(i)} = \arg\min_{L^{(i)}} \lambda_i \|\tilde X^{(i)} - L^{(i)} - S^{(i)}\|_F^2 + \|\mathcal{C}(L^{(i)})\|_*$$
$$= \arg\min_{L^{(i)}} \lambda_i \|\mathcal{C}(L^{(i)}) - \mathcal{C}(\tilde X^{(i)} - S^{(i)})\|_F^2 + \|\mathcal{C}(L^{(i)})\|_* + \lambda_i \|(I - \mathcal{C})(L^{(i)} - (\tilde X^{(i)} - S^{(i)}))\|_F^2. \qquad (14)$$
Notice that $L^{(i)}$ can be decomposed as $L^{(i)} = \mathcal{C}(L^{(i)}) + (I - \mathcal{C})(L^{(i)})$. Setting $A = \mathcal{C}(L^{(i)})$ and $B = (I - \mathcal{C})(L^{(i)})$, (14) is equivalent to
$$(\hat A, \hat B) = \arg\min_{A, B} \lambda_i \|A - \mathcal{C}(\tilde X^{(i)} - S^{(i)})\|_F^2 + \|A\|_* + \lambda_i \|B - (I - \mathcal{C})(\tilde X^{(i)} - S^{(i)})\|_F^2,$$
which decouples into
$$\hat A = \arg\min_A \lambda_i \|A - \mathcal{C}(\tilde X^{(i)} - S^{(i)})\|_F^2 + \|A\|_*, \qquad \hat B = \arg\min_B \lambda_i \|B - (I - \mathcal{C})(\tilde X^{(i)} - S^{(i)})\|_F^2.$$
The problems above have the closed form solutions
$$\hat A = \mathcal{T}_{1/2\lambda_i}\big(\mathcal{C}(\tilde X^{(i)} - \mathcal{P}_i(S))\big), \qquad \hat B = (I - \mathcal{C})(\tilde X^{(i)} - \mathcal{P}_i(S)),$$
where $\mathcal{T}_\mu$ is the soft-thresholding operator on the singular values,
$$\mathcal{T}_\mu(Z) = U \max\{\Sigma - \mu I, 0\} V^*, \quad \text{where } U \Sigma V^* \text{ is the SVD of } Z. \qquad (15)$$
Combining $\hat A$ and $\hat B$, we have derived the closed form solution for $\hat L^{(i)}$:
$$\hat L^{(i)}(S) = \mathcal{T}_{1/2\lambda_i}\big(\mathcal{C}(\tilde X^{(i)} - \mathcal{P}_i(S))\big) + (I - \mathcal{C})(\tilde X^{(i)} - \mathcal{P}_i(S)). \qquad (16)$$
Plugging (16) into $F$ in (2), the resulting optimization problem depends solely on $S$.
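The soft-thresholding operator on the singular values can be sketched directly from its definition (a sketch under the formula above; the function name is ours, not the authors'):

```python
import numpy as np

def svd_soft_threshold(Z, mu):
    """Soft-threshold the singular values of Z: U max(Sigma - mu, 0) V^*.

    This is the proximal operator of mu times the nuclear norm; the result
    is invariant to the sign ambiguity of the SVD factors.
    """
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - mu, 0.0)) @ Vt
```

Applied to a diagonal matrix, it simply shrinks each singular value toward zero by `mu`, zeroing out those below the threshold.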
Then we apply FISTA [1, 18] to find the optimal solution $\hat S$:
$$\hat S = \arg\min_S F(\hat L^{(i)}(S), S). \qquad (17)$$
Once $\hat S$ is found, if the data has no Gaussian noise, the final estimate of $X$ is $\hat X \equiv \tilde X - \hat S$; if there is Gaussian noise, we use the following denoised local patches $\hat L^{(i)}_{\tau^*}$:
$$\hat L^{(i)}_{\tau^*} = \mathcal{H}_{\tau^*}\big(\mathcal{C}(\tilde X^{(i)} - \mathcal{P}_i(\hat S))\big) + (I - \mathcal{C})(\tilde X^{(i)} - \mathcal{P}_i(\hat S)), \qquad (18)$$
where $\mathcal{H}_{\tau^*}$ is the Singular Value Hard Thresholding operator with the optimal threshold as defined in [6]. This optimal thresholding removes the Gaussian noise from $\hat L^{(i)}_{\tau^*}$. With the denoised $\hat L^{(i)}_{\tau^*}$, we solve (3) to obtain the denoised data
$$\hat X = \Big(\sum_{i=1}^n \lambda_i \hat L^{(i)}_{\tau^*} P_i^T\Big)\Big(\sum_{i=1}^n \lambda_i P_i P_i^T\Big)^{-1}. \qquad (19)$$
The proposed Nonlinear Robust Principal Component Analysis (NRPCA) algorithm is summarized in Algorithm 3.

Algorithm 3: Nonlinear Robust PCA
Input: noisy data matrix $\tilde X$, $k$ (number of neighbors in each local patch), $T$ (number of neighborhood update iterations)
Output: the denoised data $\hat X$, the estimated sparse noise $\hat S$
1 Estimate the curvature using (8);
2 Estimate $\lambda_i$, $i = 1, \dots, n$, as in §5, and set $\beta$ as in (2);
3 $\hat S \leftarrow 0$;
4 for iter $= 1{:}T$ do
5   Find the kNN of each point using $\tilde X - \hat S$ and construct the restriction operators $\{\mathcal{P}_i\}_{i=1}^n$;
6   Construct the local data matrices $\tilde X^{(i)} = \mathcal{P}_i(\tilde X)$ using $\mathcal{P}_i$ and the noisy data $\tilde X$;
7   $\hat S \leftarrow$ minimizer of (17), computed iteratively with FISTA;
8 end
9 Compute each $\hat L^{(i)}_{\tau^*}$ from (18) and assign $\hat X$ from (19).

There is one caveat in solving (2): strong sparse noise may result in a wrong neighborhood assignment when constructing the local patches.
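Because $P_i P_i^T$ is diagonal, the closed form (19) amounts to a weighted average of the patch estimates over each column. A sketch (our own illustration, with our names; it assumes every column belongs to at least one patch so no weight is zero):

```python
import numpy as np

def assemble_from_patches(L_list, patch_idx, lam, p, n):
    """Solve the weighted least squares (3) via its closed form (19).

    Accumulates lam[i] * L^(i) into the columns selected by patch i, then
    divides each column by the total weight it received. patch_idx[i] lists
    the columns of X forming patch i (the role of P_i), L_list[i] is the
    denoised local matrix, lam[i] the patch weight.
    """
    num = np.zeros((p, n))   # running sum of lam_i * L^(i) P_i^T
    den = np.zeros(n)        # running column weights, i.e. diag of sum lam_i P_i P_i^T
    for L, idx, l in zip(L_list, patch_idx, lam):
        num[:, idx] += l * L
        den[idx] += l
    return num / den[None, :]
```

If the local matrices are mutually consistent restrictions of one matrix, the weighted average reproduces that matrix exactly regardless of the weights.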
Therefore, once $\hat S$ is obtained and removed from the data, we update the neighborhood assignment and re-compute $\hat S$. This procedure is repeated $T$ times.

7 Numerical experiment

Simulated Swiss roll: We demonstrate the superior performance of NRPCA on a synthetic dataset following the mixed noise model (1). We sampled 2000 noiseless data points $X_i$ uniformly from a 3D Swiss roll and generated the Gaussian noise matrix with i.i.d. entries obeying $N(0, 0.25)$. The sparse noise matrix $S$ was generated by randomly replacing 100 entries of a zero $p \times n$ matrix with i.i.d. samples generated from $(-1)^y \cdot z$, where $y \sim \mathrm{Bernoulli}(0.5)$ and $z \sim N(5, 0.09)$. We applied NRPCA to the simulated data with patch size $k = 15$. Figure 2 reports the denoising results in the original space (3D), looking down from above. We compare two ways of using the outputs of NRPCA: 1) removing only the sparse noise from the data, $\tilde X - \hat S$; 2) removing both the sparse and the Gaussian noise from the data, $\hat X$. In addition, we plotted $\tilde X - \hat S$ with and without the neighborhood update. These results are all superior to an ad-hoc application of Robust PCA on the individual local patches.

Figure 2: NRPCA applied to the noisy 3D Swiss roll dataset. $\tilde X - \hat S$ is the result after subtracting the sparse noise estimated by setting $T = 1$ in NRPCA, i.e., no neighbor update; "$\tilde X - \hat S$ with one neighbor update" used the $\hat S$ obtained by setting $T = 2$ in NRPCA; clearly, the neighbor update helped to remove more sparse noise; $\hat X$ is the data obtained via fitting the denoised tangent spaces as in (3).
Compared to "$\tilde X - \hat S$ with one neighbor update", it further removed the Gaussian noise from the data. "Patch-wise Robust PCA" refers to the ad-hoc application of vanilla Robust PCA to each local patch independently, whose performance is worse than the proposed joint-recovery formulation.

The MNIST dataset: We observed some interesting dimension reduction results on MNIST with the help of NRPCA. It is well known that the handwritten digits 4 and 9 are so similar that the popular dimension reduction methods Isomap and Laplacian Eigenmaps fail to separate them into two clusters (first column of Figure 3). We conjecture that the similarity between the two clusters is caused by personalized writing styles of the beginning and finishing strokes. As this type of variation can be better modeled by sparse noise than by Gaussian or Poisson noise, we applied NRPCA to the raw MNIST images. The right column of Figure 3 shows that after NRPCA denoising (with k = 11), the separability of the two clusters in the first two coordinates of Isomap and Laplacian Eigenmaps increases. In addition, these new embeddings seem to suggest that some trajectory patterns exist in the data. We provide additional plots in the supplementary material to support this observation.

Biological data: We illustrate the potential usefulness of the NRPCA algorithm on an embryoid body (EB) differentiation dataset over a 27-day time course, which consists of gene expressions for 31,000 cells measured with single-cell RNA-sequencing technology (scRNAseq) [13, 16]. This EB data, comprising expression measurements for cells originating from embryoid bodies at different stages, is developmental in nature and should exhibit a progressive character such as a tree structure, because all cells arise from a single oocyte and then develop into different highly differentiated tissues.
This progressive character is often missing when we directly apply dimension reduction methods to the data, as shown in Figure 4, because biological data, including scRNAseq, are highly noisy and often contaminated with outliers from various sources, including environmental effects and measurement error. In this case, we aim to reveal the progressive nature of the single-cell data from transcript abundance as measured by scRNAseq.

Figure 3: Laplacian Eigenmaps and Isomap results for the original and the NRPCA-denoised digits 4 and 9 from the MNIST dataset.

We first normalized the scRNAseq data following the procedure described in [16] and randomly selected 1000 cells using a stratified sampling framework to maintain the ratios among the different developmental stages. We applied our NRPCA method to the normalized subset of the EB data and then applied Locally Linear Embedding (LLE) to the denoised results. The two-dimensional LLE results are shown in Figure 4. Our analysis demonstrates that although LLE is unable to reveal the progression structure from the noisy data, after NRPCA denoising LLE successfully extracts the trajectory structure in the data, which reflects the underlying smooth differentiation process of embryonic cells. Interestingly, using the denoised data ˜X − ˆS with the neighbor update, the LLE embedding shows a branching at around day 9 and increased variance at later time points, which was confirmed by manual analysis using 80 biomarkers in [16].

Figure 4: LLE results for the denoised scRNAseq dataset.

8 Conclusion

In this paper, we proposed the first outlier correction method for nonlinear data analysis that can correct outliers caused by the addition of large sparse noise. The method is a generalization of Robust PCA to the nonlinear setting.
We provided procedures to treat the nonlinearity by working with overlapping local patches of the data manifold and incorporating curvature information into the denoising algorithm. We established a theoretical error bound on the denoised data that holds under conditions depending only on the intrinsic properties of the manifold. We tested our method on both synthetic and real datasets known to have nonlinear structures and reported promising results.

Acknowledgements The authors would like to thank Shuai Yuan, Hongbo Lu, Changxiong Liu, Jonathan Fleck, Yichen Lou, and Lijun Cheng for useful discussions. This work was supported in part by the NIH grants U01DE029255 and 5RO3DE027399 and the NSF grants DMS-1902906, DMS-1621798, DMS-1715178, CCF-1909523, and NCS-1630982.

References

[1] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[2] Emmanuel J. Candes, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3):11:1–11:37, June 2011.

[3] Chun Du, Jixiang Sun, Shilin Zhou, and Jingjing Zhao. An outlier detection method for robust manifold learning. In Zhixiang Yin, Linqiang Pan, and Xianwen Fang, editors, Proceedings of The Eighth International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA), 2013, Advances in Intelligent Systems and Computing, pages 353–360. Springer Berlin Heidelberg, 2013.

[4] Sagi Eppel.
Using curvature to distinguish between surface reflections and vessel contents in computer vision based recognition of materials in transparent vessels. arXiv preprint arXiv:1602.00177, 2016.

[5] Patrick J. Flynn and Anil K. Jain. On reliable curvature estimation. In Computer Vision and Pattern Recognition (CVPR '89), pages 110–116, 1989.

[6] M. Gavish and D. L. Donoho. The optimal hard threshold for singular values is 4/√3. IEEE Transactions on Information Theory, 60(8):5040–5053, August 2014.

[7] David K. Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, March 2011.

[8] Jianbo Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, August 2000.

[9] Bo Jiang, Chris Ding, Bin Luo, and Jin Tang. Graph-Laplacian PCA: Closed-form solution and robustness. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 3492–3498, June 2013.

[10] Shoshichi Kobayashi and Katsumi Nomizu. Foundations of Differential Geometry, volume 2. 1996.

[11] Xiang-Ru Li, Xiao-Ming Li, Hai-Ling Li, and Mao-Yong Cao. Rejecting outliers based on correspondence manifold. Acta Automatica Sinica, 35(1):17–22, January 2009.

[12] Anna Little, Yuying Xie, and Qiang Sun. An analysis of classical multidimensional scaling. arXiv preprint arXiv:1812.11954, 2018.

[13] G. R. Martin and M. J. Evans. Differentiation of clonal lines of teratocarcinoma cells: formation of embryoid bodies in vitro. Proceedings of the National Academy of Sciences, 72(4):1441–1445, April 1975.

[14] Dereck S. Meek and Desmond J. Walton. On surface normal and Gaussian curvature approximations given data sampled from a smooth surface. Computer Aided Geometric Design, 17(6):521–543, 2000.

[15] Marina Meila and Jianbo Shi.
Learning segmentation by random walks. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 873–879. MIT Press, 2001.

[16] Kevin Moon, David van Dijk, Zheng Wang, Scott Gigante, Daniel B. Burkhardt, William S. Chen, Kristina Yim, Antonia van den Elzen, Matthew J. Hirn, Ronald R. Coifman, Natalia B. Ivanova, Guy Wolf, and Smita Krishnaswamy. Visualizing structure and transitions for biological data exploration. bioRxiv, page 120378, April 2019.

[17] Helmut Pottmann, Johannes Wallner, Yong-Liang Yang, Yu-Kun Lai, and Shi-Min Hu. Principal curvatures from the integral invariant viewpoint. Computer Aided Geometric Design, 24(8):428–442, 2007.

[18] Ningyu Sha, Ming Yan, and Youzuo Lin. Efficient seismic denoising techniques using robust principal component analysis. In SEG Technical Program Expanded Abstracts 2019, pages 2543–2547. Society of Exploration Geophysicists, 2019.

[19] Wai-Shun Tong and Chi-Keung Tang. Robust estimation of adaptive tensors of curvature by tensor voting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3):434–449, 2005.

[20] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

[21] Zhigang Tang, Jun Yang, and Bingru Yang. A new outlier detection algorithm based on manifold learning.
In 2010 Chinese Control and Decision Conference, pages 452–457, May 2010.