{"title": "Outlier Detection and Robust PCA Using a Convex Measure of Innovation", "book": "Advances in Neural Information Processing Systems", "page_first": 14223, "page_last": 14233, "abstract": "This paper presents a provable and strong algorithm, termed Innovation Search (iSearch), to robust Principal Component Analysis (PCA) and outlier detection. An outlier by definition is a data point which does not participate in forming a low dimensional structure with a large number of data points in the data. In other word, an outlier carries some innovation with respect to most of the other data points. iSearch ranks the data points based on their values of innovation. A convex optimization problem is proposed whose optimal value is used as our measure of innovation. We derive analytical performance guarantees for the proposed robust PCA method under different models for the distribution of the outliers including randomly distributed outliers, clustered outliers, and linearly dependent outliers. Moreover, it is shown that iSearch provably recovers the span of the inliers when the inliers lie in a union of subspaces. In the challenging scenarios in which the outliers are close to each other or they are close to the span of the inliers, iSearch is shown to outperform most of the existing methods.", "full_text": "Outlier Detection and Robust PCA Using a Convex\n\nMeasure of Innovation\n\nMostafa Rahmani and Ping Li\n\nCognitive Computing Lab\n\nBaidu Research\n\n10900 NE 8th St. Bellevue, WA 98004, USA\n{mostafarahmani,liping11}@baidu.com\n\nAbstract\n\nThis paper presents a provable and strong algorithm, termed Innovation Search\n(iSearch), to robust Principal Component Analysis (PCA) and outlier detection.\nAn outlier by de\ufb01nition is a data point which does not participate in forming a\nlow dimensional structure with a large number of data points in the data. In other\nwords, an outlier carries some innovation with respect to most of the other data\npoints. 
iSearch ranks the data points based on their values of innovation. A convex\noptimization problem is proposed whose optimal value is used as our measure of\ninnovation. We derive analytical performance guarantees for the proposed robust\nPCA method under different models for the distribution of the outliers, including\nrandomly distributed outliers, clustered outliers, and linearly dependent outliers.\nMoreover, it is shown that iSearch provably recovers the span of the inliers when\nthe inliers lie in a union of subspaces. In the challenging scenarios in which the\noutliers are close to each other or close to the span of the inliers, iSearch\nis shown to outperform most of the existing methods.\n\n1\n\nIntroduction\n\nOutlier detection is an important research problem in unsupervised machine learning. Outliers are\nassociated with important rare events such as malignant tissues [14], the failures of a system [10,\n12, 31], web attacks [16], and misclassified data points [9, 27]. In this paper, the proposed outlier\ndetection method is introduced as a robust Principal Component Analysis (PCA) algorithm, i.e.,\nit is assumed that the inliers lie in a low-dimensional subspace. In the literature of robust PCA, two main models for\ndata corruption are considered: the element-wise model and the column-wise model. These\ntwo models correspond to two different robust PCA problems. In the element-wise model,\nit is assumed that a small subset of the elements of the data matrix is corrupted and the support\nof the corrupted elements is random. This problem is known as the low rank plus sparse matrix\ndecomposition problem [1, 3, 4, 23, 24]. In the column-wise model, a subset of the columns of the\ndata is affected by the corruption [5, 7, 8, 11, 17, 20, 25, 26, 36\u201339]. Section 2 provides a review\nof the robust (to column-wise corruption) PCA methods. 
This paper focuses on the column-wise\nmodel, i.e., we assume that the given data follows Data Model 1.\nData Model 1. The data matrix D \u2208 RM1\u00d7M2 can be expressed as D = [B (A + N)] T , where\nA \u2208 RM1\u00d7ni, B \u2208 RM1\u00d7no, T is an arbitrary permutation matrix, and [B (A + N)] represents\nthe concatenation of B and (A + N). The columns of A lie in an r-dimensional subspace U. The\ncolumns of B do not lie entirely in U, i.e., the ni columns of A are the inliers and the no columns of\nB are the outliers. The matrix N represents additive noise. The orthonormal matrix U \u2208 RM1\u00d7r is\na basis for U. Evidently, M2 = ni + no.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn the robust PCA problem, the main task is to recover U. Clearly, if U is estimated accurately, the\noutliers can be located using a simple subspace projection [22].\nSummary of Contributions: The main contributions can be summarized as follows.\n\n\u2022 The proposed approach introduces a new idea to the robust PCA problem. iSearch uses a\nconvex optimization problem to measure the innovation of the data points. It is shown that\niSearch mostly outperforms the existing methods in handling close outliers and noisy data.\n\n\u2022 To the best of our knowledge, the proposed approach and the CoP method presented\nin [27] are the only robust PCA methods which are supported by analytical performance\nguarantees under different models for the distribution of the outliers, including randomly\ndistributed outliers, clustered outliers, and linearly dependent outliers.\n\n\u2022 In addition to considering different models for the distribution of the outliers, we provide\nanalytical performance guarantees under different models for the distribution of the inliers\ntoo. 
The presumed models include the union of subspaces and the uniformly at random\ndistribution on U \u2229 SM1\u22121, where SM1\u22121 denotes the unit \u21132-norm sphere in RM1.\n\nNotation: Given a matrix A, \u2016A\u2016 denotes its spectral norm. For a vector a, \u2016a\u2016p denotes its \u2113p-norm\nand a(i) its ith element. Given two matrices A1 and A2 with an equal number of rows, the matrix\nA3 = [A1 A2] is the matrix formed by concatenating their columns. For a matrix A, ai denotes\nits ith column. The subspace U\u22a5 is the orthogonal complement of U. The cardinality of set I is defined as |I|.\nAlso, for any positive integer n, the index set {1, ..., n} is denoted [n]. The coherence between vector\na and subspace H with orthonormal basis H is defined as \u2016aT H\u20162.\n\n2 Related Work\n\nIn this section, we briefly review some of the related works. We refer readers to [18, 27] for a more\ncomprehensive review of the topic. One of the early approaches to robust PCA was to replace the\nFrobenius norm in the cost function of PCA with the \u21131-norm because the \u21131-norm was shown to be robust to\nthe presence of outliers [2,15]. The method proposed in [6] leveraged the column-wise structure of\nthe corruption matrix and replaced the \u21131-norm minimization problem with an \u21131,2-norm minimization\nproblem. In [19] and [39], the optimization problem used in [6] was relaxed to a convex optimization\nproblem and it was proved that under some sufficient conditions the optimal point is a projection\nmatrix which spans U. In [34], a provable outlier rejection method was presented. However, [34]\nassumed that the outliers are randomly distributed on SM1\u22121 and the inliers are distributed randomly\non U \u2229 SM1\u22121. In [36], a convex optimization problem was proposed which decomposes the data into\na low rank component and a column sparse component. 
The approach presented in [36] is provable\nbut it requires no to be significantly smaller than ni. In [32], it was assumed that the outliers are\nrandomly distributed on SM1\u22121 and a small number of them are not linearly dependent. The method\npresented in [32] detects a data point as an outlier if it does not have a sparse representation with\nrespect to the other data points.\nConnection and Contrast to Coherence Pursuit: In [27], Coherence Pursuit (CoP) was proposed\nas a provable robust PCA method. CoP computes the Coherence Values for all the data points to rank\nthe data points. The Coherence Value corresponding to data column d is a measure of resemblance\nbetween d and the rest of the data columns. CoP uses the inner product between d and the rest of the\ndata points to measure the resemblance between d and the rest of the data. In sharp contrast, iSearch\nfinds an optimal direction corresponding to each data column. The optimal direction corresponding\nto data column d is used to measure the innovation of d with respect to the rest of the data columns.\nWe show through theoretical studies and numerical experiments that finding the optimal directions\nmakes iSearch significantly stronger than CoP in detecting outliers which carry weak innovation.\nConnection and Contrast to Innovation Pursuit: In [28, 29], Innovation Pursuit was proposed as\na new subspace clustering method. The optimization problem proposed in [28] finds a direction in the\nspan of the data such that it is orthogonal to the maximum number of data points. We present a new\ndiscovery about the applications of Innovation Pursuit. It is shown that the idea of innovation search\ncan be used to design a strong outlier detection algorithm. iSearch uses an optimization problem\nsimilar to the linear optimization problem used in [28] to measure the innovation of the data points.\n\nAlgorithm 1 Subspace Recovery Using iSearch\n1. Data Preprocessing. 
The input is the data matrix D \u2208 RM1\u00d7M2.\n1.1 Define Q \u2208 RM1\u00d7rd as the matrix of the first rd left singular vectors of D, where rd is the number of\nnon-zero singular values. Set D = QT D. If dimensionality reduction is not required, skip this step.\n1.2 Normalize the \u21132-norm of the columns of D, i.e., set di equal to di/\u2016di\u20162 for all 1 \u2264 i \u2264 M2.\n2. Direction Search. Define C\u2217 \u2208 Rrd\u00d7M2 such that c\u2217i \u2208 Rrd\u00d71 is the optimal point of\n\nmin_c \u2016cT D\u20161 subject to cT di = 1 ,\n\nor define C\u2217 \u2208 Rrd\u00d7M2 as the optimal point of\n\nmin_C \u2016(CT D)T\u20161 subject to diag(CT D) = 1 .   (1)\n\n3. Computing the Innovation Values. Define vector x \u2208 RM2\u00d71 such that x(i) = 1/\u2016DT c\u2217i\u20161.\n4. Building Basis. Construct matrix Y from the columns of D corresponding to the smallest\nelements of x such that they span an r-dimensional subspace.\nOutput: The column-space of Y is the identified subspace.\n\n3 Proposed Approach\n\nAlgorithm 1 presents the proposed method along with the definitions of the symbols used. iSearch\nconsists of 4 steps. In the next subsections, Step 2 and Step 4 are discussed. In this paper, we use an\nADMM solver to solve (1). The computation complexity of the solver is O(max(M1 M2^2, M1^2 M2)).\nIf PCA is used in the preprocessing step to reduce the dimensionality of the data to rd, the computation\ncomplexity of the solver is O(max(rd M2^2, rd^2 M2)).1\n\n3.1 An Illustrative Example for Innovation Value\n\nWe use a synthetic numerical example to explain the idea behind the proposed approach. Suppose\nD \u2208 R20\u00d7250, ni = 200, no = 50, and r = 3. Assume that D follows Assumption 1.\nAssumption 1. The columns of A are drawn uniformly at random from U \u2229 SM1\u22121. The columns of\nB are drawn uniformly at random from SM1\u22121. 
To simplify the exposition and notation, it is assumed\nwithout loss of generality that T in Data Model 1 is the identity matrix, i.e., D = [B A].\nSuppose d is a column of D and define c\u2217 as the optimal point of\n\nmin_c \u2016cT D\u20161 subject to cT d = 1 ,   (2)\n\nand define the Innovation Value corresponding to d as 1/\u2016DT c\u2217\u20161. The main idea of iSearch is that\nc\u2217 behaves in two completely different ways with respect to U (when d is an outlier and when d is\nan inlier). Suppose d is an outlier. The optimization problem (2) searches for a direction whose\nprojection on d is non-zero and whose projection on the rest of the data points is minimal. As d is\nan outlier, d has a non-zero projection on U\u22a5. In addition, as ni is large, (2) searches for a direction\nin the ambient space whose projection on U is as weak as possible. Thus, c\u2217 lies in U\u22a5 or close to it.\nThe left plot of Figure 1 shows DT c\u2217 when d is an outlier. In this case, c\u2217 is orthogonal to all the\ninliers. Accordingly, when d is an outlier, \u2016DT c\u2217\u20161 is approximately equal to \u2016BT c\u2217\u20161. On the\nother hand, when d is an inlier, the linear constraint strongly discourages c\u2217 from lying in U\u22a5 or\nclose to it. Inliers lie in a low-dimensional subspace and mostly they are close to each other. Since\nc\u2217 has a strong projection on d, it has strong projections on many of the inliers. Accordingly, the\nvalue of \u2016AT c\u2217\u20161 is much larger when d is an inlier, and therefore the Innovation Value corresponding\nto an inlier is smaller than the Innovation Value corresponding to an outlier. Figure 1 compares the vector DT c\u2217 when d is an outlier with the\nsame vector when d is an inlier. 
In addition, it shows the vector of Innovation Values (right plot).\nOne can observe that the Innovation Values make the outliers clearly distinguishable.\n\n1If the data is noisy, rd should be set equal to the number of dominant singular values. In this paper, we do\nnot theoretically analyze iSearch in the presence of noise. In the numerical experiments, we set rd equal to the\nindex of the largest singular value which is less than or equal to 0.01 % of the first singular value.\n\nFigure 1: The first 50 columns are outliers. The left panel shows vector DT c\u2217 when d is an outlier.\nThe middle panel depicts DT c\u2217 when d is an inlier. The right panel shows the Innovation Values\ncorresponding to all the data points (vector x was defined in Algorithm 1).\n\n3.2 Building the Basis Matrix\n\nThe data points corresponding to the least Innovation Values are used to construct the basis matrix Y.\nIf the data follows Assumption 1, the r data points corresponding to the r smallest Innovation Values\nspan U with overwhelming probability [35]. In practice, the algorithm should continue adding new\ncolumns to Y until the columns of Y span an r-dimensional subspace. This approach requires checking\nthe singular values of Y several times. We propose two techniques to avoid these extra steps.\nThe first approach is based on the side information that we mostly have about the data. In many\napplications, we can have an upper-bound on no because outliers are mostly associated with rare\nevents. If we know that the number of outliers is less than y percent of the data, matrix Y can be\nconstructed using the (100 \u2212 y) percent of the data columns corresponding to the smallest Innovation\nValues. The second approach is the adaptive column sampling method proposed in [27]. 
The adaptive\ncolumn sampling method avoids sampling redundant columns.\n\n4 Theoretical Studies\n\nIn this section, we analyze the performance of the proposed approach with three different models\nfor the distribution of the outliers: unstructured outliers, clustered outliers, and linearly dependent\noutliers. Moreover, we analyze iSearch with two different models for the distribution of the inliers.\nThese models include the union of subspaces and the uniformly at random distribution on U \u2229 SM1\u22121.\nDue to space limitations, in this paper we do not include theoretical guarantees with noisy data and\nwe refer the reader to [30] for the analysis of iSearch with noisy data. In Section 5, it is shown with\nreal and synthetic data that iSearch accurately detects the outliers even in low signal-to-noise\nratio cases and mostly outperforms the existing approaches when the data is noisy. The theoretical\nresults are followed by short discussions which highlight the important aspects of the theorems. The\nproofs of the presented theorems are available in an extended version of this work [30].\n\n4.1 Randomly Distributed Outliers\n\nIn this section, it is assumed that D follows Assumption 1. In order to guarantee the performance of\nthe proposed approach, it is enough to show that the Innovation Values corresponding to the outliers\nare greater than the Innovation Values corresponding to the inliers. In other words, it suffices to show\n\nmax({1/\u2016DT c\u2217i\u20161 : no + 1 \u2264 i \u2264 M2}) < min({1/\u2016DT c\u2217j\u20161 : 1 \u2264 j \u2264 no}) .   (3)\n\nBefore we state the theorem, let us provide the following definitions and remarks.\nDefinition 1. Define c\u2217j = arg min over c with dTj c = 1 of \u2016cT D\u20161. In addition, define \u03c7 = max({\u2016c\u2217i\u20162 : i \u2208 [no]}) and\nn'z = max({|I i0| : i \u2208 [no]}), where I i0 = {j \u2208 [no] : c\u2217iT bj = 0} and bj is the jth column of B. The value\n|I i0| is the number of outliers which are orthogonal to c\u2217i.\nRemark 1. In Assumption 1, the outliers are randomly distributed. Thus, if no is significantly larger\nthan M1, n'z is significantly smaller than no with overwhelming probability.\n\nTheorem 1. Suppose D follows Assumption 1 and define\n\nA = \u221a(1/(2\u03c0)) ni/\u221ar \u2212 \u221ani \u2212 \u221a(ni log(1/\u03b4)/(2r \u2212 2)) .\n\nIf A is larger than each of the two thresholds of the sufficient conditions (4), both of which scale with no/M1, \u221a(\u03c7 no log(1/\u03b4)/(M1 \u2212 1)), and n'z (the exact expressions of (4), and of the constants c\u03b4, c''\u03b4, and \u03b7\u03b4, are given in [30]), then (3) holds and U is recovered exactly with probability at least 1 \u2212 7\u03b4.\n\nTheorem 1 shows that as long as ni/r is sufficiently larger than no/M1, the proposed approach is\nguaranteed to detect the randomly distributed outliers exactly. It is important to note that in the\nsufficient conditions ni is scaled with 1/r but no is scaled with 1/M1. This shows that if r is sufficiently\nsmaller than M1, iSearch provably detects the unstructured outliers even if no is much larger than ni.\nThe numerical experiments presented in Section 5 confirm this feature of iSearch and they show that\nif the outliers are unstructured, iSearch can yield exact recovery even if no > 100 ni. It is important\nto note that when the outliers are structured, by the definition of outlier, no cannot be larger than ni.\n\n4.2 Structured Outliers\n\nIn this section, we analyze the proposed approach with structured outliers. In contrast to\nunstructured outliers, structured outliers can form a low-dimensional structure different from the\nstructure of the majority of the data points. Structured outliers are associated with important rare\nevents such as malignant tissues [14] or web attacks [16]. In this section, we assume that the\noutliers form a cluster outside of U. The following assumption specifies the presumed model for the\ndistribution of the structured outliers.\nAssumption 2. A column of B is formed as bi = (1/\u221a(1 + \u03b7^2)) (q + \u03b7vi). The unit \u21132-norm vector q does\nnot lie in U, the vectors {vi : i \u2208 [no]} are drawn uniformly at random from SM1\u22121, and \u03b7 is a positive number.\nAccording to Assumption 2, the outliers cluster around vector q where q \u2209 U. In Algorithm 1, if the\ndimensionality reduction step is performed, the direction search optimization problem is applied to\nQT D. Thus, (2) is equivalent to
min_c \u2016cT D\u20161 subject to cT d = 1 and c \u2208 Q ,   (5)\n\nwhere c \u2208 RM1\u00d71 and D \u2208 RM1\u00d7M2. The subspace Q is the column-space of D. In this section, we\nare interested in studying the performance of iSearch in identifying tightly clustered outliers because\nsome of the existing outlier detection algorithms fail if the outliers form a tight cluster. For instance,\nthe thresholding based method [13] and the sparse representation based algorithm [32] fail when the\noutliers are close to each other. Therefore, we assume that the span of Q is approximately equal to\nthe column-space of [U q]. The following theorem shows that even if the outliers are close to each\nother, iSearch successfully identifies the outliers provided that ni/\u221ar is sufficiently larger than no.\nTheorem 2. Suppose the distribution of the inliers/outliers follows Assumption 1/Assumption 2.\nAssume that Q is equal to the column-space of [U q]. Define q\u22a5 = (I \u2212 UUT)q / \u2016(I \u2212 UUT)q\u20162, define \u03b2 =\nmax({1/|dTi q\u22a5| : di \u2208 B}), define c\u2217i as the optimal point of (5) with d = di, and assume that\n\u03b7 < |qT q\u22a5|. In addition, define\n\nA = (\u221a(1 + \u03b7^2)/(2\u03b2)) (\u221a(2/\u03c0) ni/\u221ar \u2212 2\u221ani \u2212 \u221a(2ni log(1/\u03b4)/(r \u2212 1))) .\n\nIf\n\nA > no \u2016UT q\u20162 + \u03b7 \u221a(no r c\u03b4 log(no/\u03b4)/M1) ,\nA > no |qT q\u22a5| + no \u03b7 \u221a(c''\u03b4 log(no/\u03b4)/M1) ,   (6)\n\nthen (3) holds and U is recovered exactly with probability at least 1 \u2212 5\u03b4.\n\nIn contrast to (4), in (6) no is not scaled with 1/M1. Theorem 2 shows that, in contrast to the\nunstructured outliers, the number of structured outliers should be sufficiently smaller than the\nnumber of inliers for small values of \u03b7. 
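A quick numerical look at the clustered-outlier model of Assumption 2 (the sizes, η, and the seed below are illustrative choices):

```python
# Clustered outliers per Assumption 2: b_i = (q + eta*v_i)/sqrt(1 + eta^2).
# For small eta the outliers concentrate tightly around q, which is the
# regime Theorem 2 addresses. Sizes and the seed are illustrative.
import numpy as np

rng = np.random.default_rng(1)
M1, n_out, eta = 100, 50, 0.1

q = rng.standard_normal(M1)
q /= np.linalg.norm(q)                       # unit-norm cluster center
V = rng.standard_normal((M1, n_out))
V /= np.linalg.norm(V, axis=0)               # v_i uniform on the unit sphere
B = (q[:, None] + eta * V) / np.sqrt(1 + eta**2)

# The outliers are nearly unit-norm and highly coherent with q and each other.
norms = np.linalg.norm(B, axis=0)
coherence_with_q = np.abs(q @ B)
print(norms.min(), norms.max(), coherence_with_q.min())
```

With η = 0.1 every outlier has coherence close to 1 with q, which is exactly the tight-cluster regime in which thresholding-based and sparse-representation-based detectors break down.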
This is consistent with our intuition regarding the\ndetection of structured outliers. If the columns of B are highly structured and most of the data points\nare outliers, it violates the definition of outlier to label the columns of B as outliers.\nThe presence of parameter \u03b2 emphasizes that the closer the outliers are to U, the harder it is to\ndistinguish them. In Section 5, it is shown that iSearch significantly outperforms the existing\nmethods when the outliers are close to U. The main reason is that even if an outlier is close to\nU, its corresponding optimal direction obtained by (2) is highly incoherent with U. Therefore, its\ncorresponding optimal direction is incoherent with the inliers.\nWhen the outliers are very close to the span of the inliers, the norm of c\u2217 must be large to satisfy\nthe linear constraint of (2) because c\u2217 is orthogonal or nearly orthogonal to U. Accordingly, in\napplications in which the outliers are highly coherent with U, the \u21132-norm of c\u2217 should be normalized\nbefore computing the Innovation Values.\n\n4.3 Linearly Dependent Outliers\n\nIn some applications, the outliers are linearly dependent. For instance, in [9], it was shown that a\nrobust PCA algorithm can be used to reduce the clustering error of a subspace segmentation method.\nIn this application, a small subset of the outliers can be linearly dependent. This section focuses on\ndetecting linearly dependent outliers. The following assumption specifies the presumed model for\nmatrix B and Theorem 3 provides the guarantees.\nAssumption 3. Define subspace Uo with dimension ro such that Uo \u2284 U and U \u2284 Uo. The outliers\nare randomly distributed on SM1\u22121 \u2229 Uo. The orthonormal matrix Uo \u2208 RM1\u00d7ro is a basis for Uo.\n\nTheorem 3. Suppose the distribution of the inliers/outliers follows Assumption 1/Assumption 3. Define\n\nA = \u221a(2/\u03c0) ni/\u221ar \u2212 2\u221ani \u2212 \u221a(2ni log(1/\u03b4)/(r \u2212 1))\n\nand \u03be = min({\u2016bTj U\u22a5\u20162 : j \u2208 [no]})/\u2016UTo U\u22a5\u2016. If\n\nA > 2n'z \u2016UT Uo\u2016 + 2\u2016UT Uo\u2016 \u221a(no log(no/\u03b4)) ,   (7)\n\nand A also exceeds two further thresholds which scale with no/\u221aro and with \u2016UTo U\u22a5\u2016/\u03be (their exact expressions, together with the constant \u03b7'\u03b4, are given in [30]), then (3) holds and U is recovered exactly with probability at least 1 \u2212 5\u03b4.\n\nTheorem 3 indicates that ni/r should be sufficiently larger than no/ro. If ro is comparable to r, this\nis in fact a necessary condition because we cannot label the columns of B as outliers if no is also\ncomparable with ni. If ro is large, the sufficient condition is similar to the sufficient conditions of\nTheorem 1, in which the outliers are distributed randomly on SM1\u22121.\n\nIt is informative to compare the requirements of iSearch with the requirements of CoP. With iSearch,\nni/r should be sufficiently larger than (no/ro)\u2016UTo U\u22a5\u2016 to guarantee that the algorithm distinguishes the\noutliers successfully. 
With CoP, ni/r should be sufficiently larger than no/ro + \u2016UTo U\u2016 ni/r [9,27].\nThe reason that CoP requires a stronger condition is that iSearch finds a direction for each outlier\nwhich is highly incoherent with U.\n\n4.4 Outlier Detection When the Inliers are Clustered\n\nIn the analysis of robust PCA methods, it is mostly assumed that the inliers are randomly\ndistributed in U. In practice, the inliers form several clusters in the column-space of the data. In\nthis section, it is assumed that the inliers form m clusters. The following assumption specifies the\npresumed model and Theorem 4 provides the sufficient conditions.\nAssumption 4. The matrix of inliers can be written as A = [A1 ... Am]TA, where Ak \u2208 RM1\u00d7nik,\n\u03a3k nik = ni, and TA is an arbitrary permutation matrix. The columns of Ak are drawn uniformly\nat random from the intersection of subspace Uk and SM1\u22121, where Uk is a d-dimensional subspace.\nIn other words, the columns of A lie in a union of subspaces {Uk}, k \u2208 [m], and (U1 \u2295 ... \u2295 Um) = U,\nwhere \u2295 denotes the direct sum operator.\n\nTheorem 4. Suppose the distribution of the outliers/inliers follows Assumptions 1 to 4. Further define\n\nA = \u03c1 (\u221a(2/\u03c0) ng/\u221ad \u2212 2\u221ang \u2212 \u221a(2ng log(1/\u03b4)/(r \u2212 1))) ,\n\nwhere g = arg mink inf{\u2016\u03b4T Ak\u20161 : \u03b4 \u2208 Uk, \u2016\u03b4\u20162 = 1} and \u03c1 = inf{\u03a3k \u2016\u03b4T Uk\u20162 : \u03b4 \u2208 U, \u2016\u03b4\u20162 = 1}. If the sufficient conditions in (4) are satisfied, then (3) holds and U is\nrecovered exactly with probability at least 1 \u2212 7\u03b4.\n\nSince the dimensions of the subspaces {Uk} are equal and the distributions of the inliers inside\nthese subspaces are similar, roughly g = arg mink nik [19]. Thus, the sufficient conditions indicate\nthat the population of the smallest cluster scaled by 1/\u221ad should be sufficiently larger than no/M1.\nThe parameter \u03c1 is similar to the permeance statistic introduced in [19]. It\nshows how well the inliers are distributed in U. Evidently, if the inliers populate all the directions\ninside U, a subspace recovery algorithm is more likely to recover U correctly. However, having a\nlarge value of the permeance statistic is not a necessary condition. The reason that the permeance statistic\nappears in the sufficient conditions is that we establish the sufficient conditions to guarantee the\nperformance of iSearch in worst case scenarios. In fact, if the inliers are close to each other or\nthe subspaces {Uk} are close to each other, the performance of iSearch generally improves. The\nreason is that the closer the inliers are to each other, the smaller their Innovation Values are.\n\n5 Numerical Experiments\n\nA set of experiments with synthetic data and real data is presented to study the performance and\nthe properties of the iSearch algorithm. In the presented experiments, iSearch is compared with the\nexisting methods including FMS [17], GMS [39], CoP [27], OP [36], and R1-PCA [6].\n\n5.1 Phase Transition\n\nIn this experiment, the phase transition of iSearch is studied. Define \u00db as an orthonormal basis for\nthe recovered subspace. A trial is considered successful if\n\n\u2016(I \u2212 UUT) \u00db\u2016F / \u2016U\u2016F < 10^\u22122 .\n\nThe data follows Assumption 1 with r = 4 and M1 = 100. The left plot of Figure 2 shows the phase\ntransition of iSearch versus ni/r and no/M1. White indicates correct subspace recovery and black\ndesignates incorrect recovery. 
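The success criterion above can be checked in a few lines; a minimal sketch, with a hypothetical perturbed basis standing in for a failed recovery:

```python
# Recovery-error check used in the phase-transition experiment:
# a trial succeeds when ||(I - U U^T) U_hat||_F / ||U||_F < 1e-2.
import numpy as np

def recovery_error(U, U_hat):
    """Normalized residual of U_hat outside span(U); both are orthonormal bases."""
    M1 = U.shape[0]
    P_perp = np.eye(M1) - U @ U.T
    return np.linalg.norm(P_perp @ U_hat) / np.linalg.norm(U)

rng = np.random.default_rng(2)
M1, r = 100, 4
U, _ = np.linalg.qr(rng.standard_normal((M1, r)))

exact = recovery_error(U, U)                          # perfect recovery
# A noticeably perturbed basis (illustrative) fails the 1e-2 test.
U_bad, _ = np.linalg.qr(U + 0.1 * rng.standard_normal((M1, r)))
print(exact < 1e-2, recovery_error(U, U_bad) > 1e-2)
```

The metric is invariant to the choice of orthonormal basis for the recovered subspace, since it only depends on the projection of Û onto the orthogonal complement of U.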
Theorem 1 indicated that if ni/r is sufficiently large, iSearch yields\nexact recovery even if no is larger than ni. This experiment confirms the theoretical result. According\nto Figure 2, even when no = 3000, 40 inliers are enough to guarantee exact subspace recovery.\n\nFigure 2: Left panel: The phase transition of iSearch in the presence of unstructured outliers versus\nni/r and no/M1 (M1 = 100 and r = 4). Middle panel: The probability of accurate subspace\nrecovery versus the number of structured outliers (ni = 100, \u03b7 = 0.1, M1 = 100, and r = 10).\nRight panel: The probability of exact outlier detection versus SNR. The data contains 10 structured\noutliers and 300 unstructured outliers (ni = 100, no = 310, r = 5, and M1 = 100).\n\n5.2 Structured Outliers\n\nIn this experiment, we consider structured outliers. The distribution of the outliers follows Assumption 2 with \u03b7 = 0.1 and M1 = 100. In addition, the inliers are clustered and they lie in a union of\n5 2-dimensional linear subspaces. There are 20 data points in each subspace (i.e., ni = 100) and\nr = 10. A successful trial is defined as in Section 5.1. We are interested in investigating the\nperformance of iSearch in identifying structured outliers when they are close to U. Therefore, we\ngenerate vector q, the center of the cluster of the outliers, close to U. Vector q is constructed as\nq = [U p]h / \u2016[U p]h\u20162, where the unit \u21132-norm vector p \u2208 RM1\u00d71 is generated as a random direction on\nSM1\u22121 and the elements of h \u2208 R(r+1)\u00d71 are sampled independently from N (0, 1). The generated\nvector q is close to U with high probability because the column-space of [U p] is close to the\ncolumn-space of U. The middle plot of Figure 2 shows the probability of accurate subspace recovery\nversus the number of outliers. The number of evaluation runs was 50. 
One can observe that, in contrast to the unstructured outliers, the robust PCA methods can tolerate only a small number of structured outliers.

5.3 Noisy Data

In this section, we consider the simultaneous presence of noise, the structured outliers, and the unstructured outliers. In this experiment, M1 = 100, r = 5, and ni = 100. The data contains 300 unstructured and 10 structured outliers. The distribution of the structured outliers follows Assumption 2 with η = 0.1. The vector q, the center of the cluster of the structured outliers, is generated as a random direction on S^{M1−1}. The generated data in this experiment can be expressed as D = [B An]. The matrix An = A + ζN, where N represents the additive Gaussian noise and ζ controls the power of the additive noise. Define SNR = ‖A‖_F² / ‖ζN‖_F². Since the data is noisy, the algorithms cannot achieve exact subspace recovery. Therefore, we examine the probability that an algorithm distinguishes all the outliers correctly. Define vector f ∈ R^{M2×1} such that f(k) = ‖(I − Û Ûᵀ) d_k‖_2. A trial is considered successful if

max{f(k) : k > no} < min{f(k) : k ≤ no}.

The right plot of Figure 2 shows the probability of exact outlier detection versus SNR. It shows that iSearch robustly distinguishes the outliers in the strong presence of noise. The number of evaluation runs was 50.

5.4 Outlier Detection in Real Data

An application of the outlier detection methods is to identify the misclassified data points of a clustering method [9, 27]. In each identified cluster, the misclassified data points can be considered as outliers. In this experiment, we assume an imaginary clustering method whose clustering error is 25%. The robust PCA method is applied to each cluster to find the misclassified data points.
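The detection criterion of Section 5.3 can be sketched as follows; a minimal sketch, assuming the columns of `D` are ordered with the no outliers first and `U_hat` is an orthonormal basis of the recovered subspace (names are illustrative, not from the paper):

```python
import numpy as np

def exact_detection(D, U_hat, n_o):
    """Check whether the residual score f(k) = ||(I - U_hat U_hat^T) d_k||_2
    separates the first n_o columns (outliers) from the rest (inliers)."""
    M = D.shape[0]
    P = np.eye(M) - U_hat @ U_hat.T     # projector onto the complement of span(U_hat)
    f = np.linalg.norm(P @ D, axis=0)   # one residual score per column d_k
    # Success: every inlier score is strictly below every outlier score.
    return f[n_o:].max() < f[:n_o].min()
```

The check mirrors the max/min condition above: with noisy data exact recovery is impossible, so the test only asks whether the scores of all outliers exceed those of all inliers.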
The clustering is re-evaluated after identifying the misclassified data points. We use the Hopkins155 dataset [33], which contains data matrices with 2 or 3 clusters. In this experiment, 27 matrices with 3 clusters are used (i.e., the columns of each data matrix lie in 3 clusters). The outliers are linearly dependent and they are very close to the span of the inliers since the clusters in the Hopkins155 dataset are close to each other. In addition, the inliers form a tight cluster. Evidently, the robust PCA methods which assume that the outliers are randomly distributed fail in this task. This experiment with real data contains most of the challenges that a robust PCA method can encounter. For more details about this experiment, we refer the reader to [9, 27].

Table 1 shows the average clustering error after applying the robust PCA methods to the output of the clustering method. One can observe that iSearch significantly outperforms the other methods. The main reason is that iSearch is robust against outliers which are close to U. In addition, the coherency between the inliers enhances the performance of iSearch.

Table 1: Clustering error after using the robust PCA methods to detect the misclassified data points.

iSearch   CoP   FMS     R1-PCA   PCA
2%        7%    20.3%   16.8%    12.1%

5.5 Activity Detection in Real Noisy Data

In this experiment, we use the robust PCA methods to identify a rare event in a video file. We use the Waving Tree video file [21]. In this video, a tree is smoothly waving and in the middle of the video a person crosses the frame.
The frames which only contain the background (the tree and the environment) are inliers, and the few frames corresponding to the event, the presence of the person, are the outliers. Since the tree is waving, the inliers are noisy and we use r = 3 for all the methods. In addition, we identify column d as an outlier if ‖d − Û Ûᵀ d‖_2 / ‖d‖_2 ≥ 0.2, where Û is the recovered subspace. In this experiment, the outliers are very similar to each other since the consecutive frames are quite similar to each other. We use iSearch, CoP, FMS, and R1-PCA to detect the outlying frames. iSearch, CoP, and FMS identified all the outlying frames correctly. R1-PCA could not identify those frames in which the person does not move. The reason is that those frames are exactly alike. Figure 3 shows some of the outlying frames which are missed by R1-PCA.

Figure 3: Some of the frames of the Waving Tree video file. The highlighted frames are detected as outliers by R1-PCA.

6 Conclusion

A new discovery about the applications of Innovation Search was presented. It was shown that the directions of innovation can be utilized to measure the innovation of the data points and to identify the outliers as the most innovative data points. In the robust PCA setting, the proposed approach recovers the span of the inliers using the least innovative data points. It was shown that iSearch can provably recover the span of the inliers under different models for the distribution of the outliers, including randomly distributed outliers, linearly dependent outliers, and clustered outliers. In addition, analytical performance guarantees with clustered inliers were presented. The theoretical and numerical results showed that finding the optimal directions makes iSearch significantly robust to the outliers which carry weak innovation.
Moreover, the experiments with real and synthetic data demonstrated the robustness of the proposed method against the strong presence of noise.

References

[1] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3):11, 2011.

[2] Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

[3] Venkat Chandrasekaran, Sujay Sanghavi, Pablo A Parrilo, and Alan S Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.

[4] Adam Charles, Ali Ahmed, Aditya Joshi, Stephen Conover, Christopher Turnes, and Mark Davenport. Cleaning up toxic waste: removing nefarious contributions to recommendation systems. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6571–6575, Vancouver, Canada, 2013.

[5] Vartan Choulakian. L1-norm projection pursuit principal component analysis. Computational Statistics & Data Analysis, 50(6):1441–1451, 2006.

[6] Chris Ding, Ding Zhou, Xiaofeng He, and Hongyuan Zha. R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 281–288, Pittsburgh, PA, 2006.

[7] Jiashi Feng, Huan Xu, and Shuicheng Yan. Robust PCA in high-dimension: A deterministic approach. In Proceedings of the 29th International Conference on Machine Learning (ICML), Edinburgh, UK, 2012.

[8] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

[9] Andrew Gitlin, Biaoshuai Tao, Laura Balzano, and John Lipor. Improving k-subspaces via coherence pursuit.
IEEE Journal of Selected Topics in Signal Processing, 2018.

[10] Yoshiyuki Harada, Yoriyuki Yamagata, Osamu Mizuno, and Eun-Hye Choi. Log-based anomaly detection of CPS using a statistical method. arXiv:1701.03249, 2017.

[11] Moritz Hardt and Ankur Moitra. Algorithms and hardness for robust subspace recovery. In The 26th Annual Conference on Learning Theory (COLT), pages 354–375, Princeton, NJ, 2013.

[12] Milos Hauskrecht, Iyad Batal, Michal Valko, Shyam Visweswaran, Gregory F Cooper, and Gilles Clermont. Outlier detection for patient monitoring and alerting. Journal of Biomedical Informatics, 46(1):47–55, 2013.

[13] Reinhard Heckel and Helmut Bölcskei. Robust subspace clustering via thresholding. IEEE Transactions on Information Theory, 61(11):6320–6342, 2015.

[14] Seppo Karrila, Julian Hock Ean Lee, and Greg Tucker-Kellogg. A comparison of methods for data-driven cancer outlier discovery, and an application scheme to semisupervised predictive biomarker discovery. Cancer Informatics, 10:109, 2011.

[15] Qifa Ke and Takeo Kanade. Robust L1 norm factorization in the presence of outliers and missing data by alternative convex programming. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 739–746, San Diego, CA, 2005.

[16] Christopher Kruegel and Giovanni Vigna. Anomaly detection of web-based attacks. In Proceedings of the 10th ACM Conference on Computer and Communications Security (CCS), pages 251–261, Washington, DC, 2003.

[17] Gilad Lerman and Tyler Maunu. Fast, robust and non-convex subspace recovery. Information and Inference: A Journal of the IMA, 7(2):277–336, 2017.

[18] Gilad Lerman and Tyler Maunu. An overview of robust subspace recovery. Proceedings of the IEEE, 106(8):1380–1410, 2018.

[19] Gilad Lerman, Michael B McCoy, Joel A Tropp, and Teng Zhang. Robust computation of linear models by convex relaxation.
Foundations of Computational Mathematics, 15(2):363–410, 2015.

[20] Guoying Li and Zhonglian Chen. Projection-pursuit approach to robust dispersion matrices and principal components: primary theory and Monte Carlo. Journal of the American Statistical Association, 80(391):759–766, 1985.

[21] Liyuan Li, Weimin Huang, Irene Yu-Hua Gu, and Qi Tian. Statistical modeling of complex backgrounds for foreground object detection. IEEE Transactions on Image Processing, 13(11):1459–1472, 2004.

[22] Xingguo Li and Jarvis Haupt. Identifying outliers in large matrices via randomized adaptive compressive sampling. IEEE Transactions on Signal Processing, 63(7):1792–1807, 2015.

[23] Guangcan Liu and Ping Li. Recovery of coherent data via low-rank dictionary pursuit. In Advances in Neural Information Processing Systems (NIPS), pages 1206–1214, Montreal, Canada, 2014.

[24] Guangcan Liu and Ping Li. Low-rank matrix completion in the presence of high coherence. IEEE Transactions on Signal Processing, 64(21):5623–5633, 2016.

[25] Panos P Markopoulos, George N Karystinos, and Dimitris A Pados. Optimal algorithms for L1-subspace signal processing. IEEE Transactions on Signal Processing, 62(19):5046–5058, 2014.

[26] Michael McCoy and Joel A Tropp. Two proposals for robust PCA using semidefinite programming. Electronic Journal of Statistics, 5:1123–1160, 2011.

[27] Mostafa Rahmani and George K Atia. Coherence pursuit: Fast, simple, and robust principal component analysis. IEEE Transactions on Signal Processing, 65(23):6260–6275, 2017.

[28] Mostafa Rahmani and George K Atia. Innovation pursuit: A new approach to subspace clustering. IEEE Transactions on Signal Processing, 65(23):6276–6291, 2017.

[29] Mostafa Rahmani and George K Atia. Subspace clustering via optimal direction search.
IEEE Signal Processing Letters, 24(12):1793–1797, 2017.

[30] Mostafa Rahmani and Ping Li. Outlier detection and data clustering via innovation search. Technical report, arXiv:1912.12988, 2019.

[31] Benjamin Recht. A simpler approach to matrix completion. The Journal of Machine Learning Research, 12:3413–3430, 2011.

[32] Mahdi Soltanolkotabi and Emmanuel J Candes. A geometric analysis of subspace clustering with outliers. The Annals of Statistics, pages 2195–2238, 2012.

[33] Roberto Tron and René Vidal. A benchmark for the comparison of 3-D motion segmentation algorithms. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, Minneapolis, MN, 2007.

[34] Manolis C Tsakiris and René Vidal. Dual principal component pursuit. In 2015 IEEE International Conference on Computer Vision Workshop (ICCV Workshops), pages 10–18, Santiago, Chile, 2015.

[35] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027, 2010.

[36] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. In Advances in Neural Information Processing Systems (NIPS), pages 2496–2504, Vancouver, Canada, 2010.

[37] Chong You, Daniel P Robinson, and René Vidal. Provable self-representation based outlier detection in a union of subspaces. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4323–4332, 2017.

[38] Teng Zhang. Robust subspace recovery by Tyler's m-estimator. Information and Inference: A Journal of the IMA, 5(1):1–21, 2016.

[39] Teng Zhang and Gilad Lerman. A novel m-estimator for robust PCA.
The Journal of Machine Learning Research, 15(1):749–808, 2014.