{"title": "Subspace Clustering with Irrelevant Features via Robust Dantzig Selector", "book": "Advances in Neural Information Processing Systems", "page_first": 757, "page_last": 765, "abstract": "This paper considers the subspace clustering problem where the data contains irrelevant or corrupted features. We propose a method termed ``robust Dantzig selector'' which can successfully identify the clustering structure even with the presence of irrelevant features. The idea is simple yet powerful: we replace the inner product by its robust counterpart, which is insensitive to the irrelevant features given an upper bound of the number of irrelevant features. We establish theoretical guarantees for the algorithm to identify the correct subspace, and demonstrate the effectiveness of the algorithm via numerical simulations. To the best of our knowledge, this is the first method developed to tackle subspace clustering with irrelevant features.", "full_text": "Subspace Clustering with Irrelevant Features via\n\nRobust Dantzig Selector\n\nChao Qu\n\nDepartment of Mechanical Engineering\n\nNational University of Singapore\n\nHuan Xu\n\nDepartment of Mechanical Engineering\n\nNational University of Singapore\n\nA0117143@u.nus.edu\n\nmpexuh@nus.edu.sg\n\nAbstract\n\nThis paper considers the subspace clustering problem where the data contains\nirrelevant or corrupted features. We propose a method termed \u201crobust Dantzig se-\nlector\u201d which can successfully identify the clustering structure even with the pres-\nence of irrelevant features. The idea is simple yet powerful: we replace the inner\nproduct by its robust counterpart, which is insensitive to the irrelevant features\ngiven an upper bound of the number of irrelevant features. We establish theoreti-\ncal guarantees for the algorithm to identify the correct subspace, and demonstrate\nthe effectiveness of the algorithm via numerical simulations. 
To the best of our knowledge, this is the first method developed to tackle subspace clustering with irrelevant features.\n\n1 Introduction\n\nThe last decade has witnessed fast-growing attention in research on high-dimensional data: images, videos, DNA microarray data and data from many other applications all have the property that the dimensionality can be comparable to, or even much larger than, the number of samples. While this setup appears ill-posed at first sight, inference and recovery are possible by exploiting the fact that high-dimensional data often possess low-dimensional structures [3, 14, 19]. On the other hand, in this era of big data, huge amounts of data are collected everywhere, and such data are generally heterogeneous. Clean data and irrelevant or even corrupted information are often mixed together, which motivates us to consider the high-dimensional, big but dirty data problem. In particular, we study the subspace clustering problem in this setting.\nSubspace clustering is an important subject in analyzing high-dimensional data, inspired by many real applications [15]. Given data points lying in the union of multiple linear spaces, subspace clustering aims to identify all these linear spaces, and to cluster the sample points according to the linear spaces they belong to. Here, different subspaces may correspond to the motions of different objects in a video sequence [11, 17, 20], different rotations, translations and thicknesses of handwritten digits, or the latent communities of a social graph [15, 5].\nA variety of subspace clustering algorithms have been proposed in the last several years, including algebraic algorithms [16], iterative methods [9, 1], statistical methods [11, 10], and spectral clustering-based methods [6, 7]. Among them, sparse subspace clustering (SSC) not only achieves state-of-the-art empirical performance, but also possesses elegant theoretical guarantees.
In [12], the authors provide a geometric analysis that explains rigorously why SSC succeeds even when the subspaces are overlapping. [18] and [13] extend SSC to the noisy case, where data are contaminated by additive Gaussian noise. Different from these works, we focus on the case where some irrelevant features are involved.\nMathematically, SSC solves for each sample a sparse linear regression problem whose dictionary consists of all other samples. Many properties of sparse linear regression are well understood in the clean-data case. However, the performance of most standard algorithms (e.g., Lasso and OMP) deteriorates even when only a few entries are corrupted. As such, it is well expected that standard SSC breaks for subspace clustering with irrelevant or corrupted features (see Section 5 for numerical evidence). Sparse regression under corruption is a hard problem, and little work has addressed it [8, 21, 4].\nOur contribution: Inspired by [4], we use a simple yet powerful tool called the robust inner product and propose the robust Dantzig selector to solve the subspace clustering problem with irrelevant features. While our work builds upon the robust inner product developed for robust sparse regression, the analysis is quite different from the regression case, since both the data structures and the tasks are completely different: for example, the RIP condition, essential for sparse regression, is hardly ever satisfied in subspace clustering [18]. We provide sufficient conditions that ensure the robust Dantzig selector detects the true subspace clustering. We further demonstrate the effectiveness of the proposed method via numerical simulations.
To the best of our knowledge, this is the first attempt to perform subspace clustering with irrelevant features.\n\n2 Problem setup and method\n\n2.1 Notations and model\n\nThe clean data matrix is denoted by X_A ∈ R^{D×N}, where each column corresponds to a data point, normalized to a unit vector. The data points lie on a union of L subspaces S = ∪_{l=1}^{L} S_l. Each subspace S_l is of dimension d_l, which is smaller than D, and contains N_l data samples, with N_1 + N_2 + ... + N_L = N. We denote the observed dirty data matrix by X ∈ R^{(D+D_1)×N}. Out of the D + D_1 features, up to D_1 of them are irrelevant. Without loss of generality, let X = [X_O^T, X_A^T]^T, where X_O ∈ R^{D_1×N} denotes the irrelevant data. The subscripts A and O denote the sets of row indices corresponding to true and irrelevant features, respectively, and the superscript T denotes the transpose. Notice that we do not know O a priori, except that its cardinality is D_1. The model is illustrated in Figure 1. Let X_A^{(l)} ∈ R^{D×N_l} denote the selection of columns of X_A that belong to S_l. Similarly, denote the corresponding columns of X by X^{(l)}. Without loss of generality, let X = [X^{(1)}, X^{(2)}, ..., X^{(L)}] be ordered. Furthermore, we use the subscript "-i" to describe a matrix that excludes column i, e.g., (X_A)^{(l)}_{-i} = [(x_A)^{(l)}_1, ..., (x_A)^{(l)}_{i-1}, (x_A)^{(l)}_{i+1}, ..., (x_A)^{(l)}_{N_l}]. We use the superscript lc to describe a matrix that excludes the columns in subspace l, e.g., (X_A)^{lc} = [X_A^{(1)}, ..., X_A^{(l-1)}, X_A^{(l+1)}, ..., X_A^{(L)}]. For a matrix Σ, we use Σ_{s,η} to denote the submatrix with row indices in set s and column indices in set η. For any matrix Z, P(Z) denotes the symmetrized convex hull of its columns, i.e., P(Z) = conv(±z_1, ±z_2, ..., ±z_N). We define P^l_{-i} := P((X_A)^{(l)}_{-i}) for simplicity, i.e., the symmetrized convex hull of the clean data in subspace l except data point i. Finally, we use ||·||_2 to denote the l_2 norm of a vector and ||·||_∞ to denote the infinity norm of a vector or a matrix. Calligraphic letters such as X, X_l represent the sets containing all columns of the corresponding clean data matrices.\n\nFigure 1: Illustration of the model of irrelevant features in the subspace clustering problem. The left panel is the model addressed in this paper: among the total D + D_1 features, up to D_1 of them are irrelevant. The right panel illustrates a more general case, where the values of any D_1 elements of each column can be arbitrary (e.g., due to corruptions). It is a harder case and is left for future work.\n\nFigure 2: Illustration of the Subspace Detection Property. Here, each figure corresponds to a matrix where each column is c_i, and non-zero entries are in white. The left figure satisfies this property. The right one does not.\n\n2.2 Method\n\nIn this section we present our method as well as the intuition behind it. When all observed data are clean, to solve the subspace clustering problem, the celebrated SSC [6] proposes to solve the following convex program\n\nmin_{c_i} ||c_i||_1   s.t. x_i = X_{-i} c_i,   (1)\n\nfor each data point x_i. When data are corrupted by noise of small magnitude, such as Gaussian noise, a straightforward extension of SSC is the Lasso-type method called "Lasso-SSC" [18, 13]:\n\nmin_{c_i} ||c_i||_1 + (λ/2) ||x_i - X_{-i} c_i||_2^2.   (2)\n\nNote that while Formulation (2) has the same form as Lasso, it is used to solve the subspace clustering task.
In particular, the support recovery analysis of Lasso does not extend to this case, as X_{-i} typically does not satisfy the RIP condition [18].\nThis paper considers the case where X contains irrelevant or grossly corrupted features. As discussed above, Lasso is not robust to such corruption. An intuitive idea is to consider the following formulation, first proposed for sparse linear regression [21]:\n\nmin_{c_i, E} ||c_i||_1 + (λ/2) ||x_i - (X_{-i} - E) c_i||_2^2 + η ||E||_*,   (3)\n\nwhere ||·||_* is some norm corresponding to the sparsity type of E. One major challenge of this formulation is that it is not convex. As such, it is not clear how to efficiently find the optimal solution, nor how to analyze the property of the solution (typically done via convex analysis) in the subspace clustering task.\nOur method is based on the idea of the robust inner product. The robust inner product ⟨a, b⟩_k is defined as follows: for vectors a ∈ R^D, b ∈ R^D, we compute q_i = a_i b_i, i = 1, ..., D. Then the {|q_i|} are sorted and the smallest (D - k) are selected. Let Ω be the set of selected indices; then ⟨a, b⟩_k = Σ_{i∈Ω} q_i, i.e., the largest k terms are truncated. Our main idea is to replace all inner products involved by their robust counterparts ⟨a, b⟩_{D_1}, where D_1 is an upper bound on the number of irrelevant features. The intuition is that irrelevant features with large magnitude may derail the correct subspace clustering; this simple truncation process avoids that.
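The truncation above is simple to implement. The following sketch is our own illustration, not the authors' code; the function name is hypothetical. It computes ⟨a, b⟩_k by dropping the k entrywise products of largest magnitude:

```python
import numpy as np

def robust_inner_product(a, b, k):
    """<a, b>_k: form the entrywise products q_i = a_i * b_i,
    drop the k products of largest magnitude, and sum the rest."""
    q = np.asarray(a, dtype=float) * np.asarray(b, dtype=float)
    if k == 0:
        return float(q.sum())
    # indices of the (D - k) products of smallest magnitude
    keep = np.argsort(np.abs(q))[:len(q) - k]
    return float(q[keep].sum())
```

For example, a single irrelevant coordinate with a huge value is simply discarded: with a = (1, 1, 100) and b = (1, 1, 1), the ordinary inner product is 102, while ⟨a, b⟩_1 = 2.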
We remark that we do not need to know the exact number of irrelevant features, but only an upper bound on it.\nExtending (2) using the robust inner product leads to the following formulation:\n\nmin_{c_i} ||c_i||_1 + (λ/2) c_i^T Σ̂ c_i - λ γ̂^T c_i,   (4)\n\nwhere Σ̂ and γ̂ are the robust counterparts of X_{-i}^T X_{-i} and X_{-i}^T x_i. Unfortunately, Σ̂ may not be a positive semidefinite matrix, so (4) is not a convex program. Unlike the works [4][8], which study non-convexity in linear regression, the difficulty of non-convexity in the subspace clustering task appears hard to overcome.\nInstead we turn to the Dantzig selector, which is essentially a linear program (and hence no positive semidefiniteness is required):\n\nmin_{c_i} ||c_i||_1 + λ ||X_{-i}^T (X_{-i} c_i - x_i)||_∞.   (5)\n\nReplacing all inner products by their robust counterparts, we propose the following robust Dantzig selector, which can easily be recast as a linear program:\n\nRobust Dantzig Selector:   min_{c_i} ||c_i||_1 + λ ||Σ̂ c_i - γ̂||_∞.   (6)\n\nSubspace Detection Property: To measure whether the algorithm is successful, we adopt the criterion of the Subspace Detection Property following [18]. We say that the Subspace Detection Property holds if and only if, for all i, the optimal solution of the robust Dantzig selector satisfies (1) Non-triviality: c_i is not a zero vector; and (2) Self-Expressiveness Property: the nonzero entries of c_i correspond only to columns of X sampled from the same subspace as x_i.
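As a concrete illustration of the linear-program recast, here is our own sketch (not the authors' code; the function name and the use of `scipy.optimize.linprog` are our choices). It splits c = c⁺ − c⁻ with c⁺, c⁻ ≥ 0 and introduces a scalar t ≥ ||Σ̂c − γ̂||_∞, assuming Σ̂ and γ̂ have already been assembled entrywise from robust inner products:

```python
import numpy as np
from scipy.optimize import linprog

def robust_dantzig_lp(Sigma, gamma, lam):
    """Solve min_c ||c||_1 + lam * ||Sigma @ c - gamma||_inf as an LP.

    Variables x = [c_plus (n), c_minus (n), t (1)], all nonnegative.
    ||c||_1 = sum(c_plus) + sum(c_minus); t bounds the infinity norm."""
    n = Sigma.shape[1]
    cost = np.concatenate([np.ones(2 * n), [lam]])
    # |Sigma c - gamma| <= t, written as two blocks of linear inequalities
    A_ub = np.block([[ Sigma, -Sigma, -np.ones((n, 1))],
                     [-Sigma,  Sigma, -np.ones((n, 1))]])
    b_ub = np.concatenate([gamma, -gamma])
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * n + 1))
    x = res.x
    return x[:n] - x[n:2 * n]
```

On a toy instance with Σ̂ = I and γ̂ = (1, 0), a large λ forces Σ̂c close to γ̂, so the solver returns c ≈ (1, 0); the formulation never requires Σ̂ to be positive semidefinite.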
See Figure 2 for illustrations.\n\n3 Main Results\n\nTo avoid repetition and cluttered notation, we denote the following primal convex problem by P(Σ, γ):\n\nmin_c ||c||_1 + λ ||Σ c - γ||_∞.\n\nIts dual problem, denoted by D(Σ, γ), is\n\nmax_ξ ⟨ξ, γ⟩   subject to ||ξ||_1 = λ, ||Σ ξ||_∞ ≤ 1.   (7)\n\nBefore we present our results, we define some quantities.\nThe dual direction is an important geometric term introduced in the analysis of SSC [12]. Here we define analogously the dual direction of the robust Dantzig selector. Notice that the dual of the robust Dantzig problem is D(Σ̂, γ̂), where γ̂ and Σ̂ are the robust counterparts of X_{-i}^T x_i and X_{-i}^T X_{-i}, respectively (recall that X_{-i} and x_i are the dirty data). We decompose Σ̂ into two parts, Σ̂ = (X_A)_{-i}^T (X_A)_{-i} + Σ̃, where the first term corresponds to the clean data, and the second term is due to the irrelevant features and the truncation from the robust inner product. Thus, the second constraint of the dual problem becomes ||((X_A)_{-i}^T (X_A)_{-i} + Σ̃) ξ||_∞ ≤ 1. Let ξ be the optimal solution of this optimization problem; we define v(x_i, X_{-i}, λ) := (X_A)_{-i} ξ and the dual direction as v_l = v(x_i^l, X^{(l)}_{-i}, λ) / ||v(x_i^l, X^{(l)}_{-i}, λ)||_2.\nSimilarly to SSC [12], we define the subspace incoherence.
Let V^l = [v_1^l, v_2^l, ..., v_{N_l}^l]. The incoherence of a point set X_l with respect to the other clean data points is defined as μ(X_l) = max_{k: k≠l} ||(X_A^{(k)})^T V^l||_∞.\nRecall that we decompose Σ̂ and γ̂ as Σ̂ = (X_A)_{-i}^T (X_A)_{-i} + Σ̃ and γ̂ = (X_A)_{-i}^T (x_A)_i + γ̃. Intuitively, for the robust Dantzig selector to succeed, we want Σ̃ and γ̃ to be not too large. In particular, we assume ||(x_A)_i||_∞ ≤ ε_1 and ||(X_A)_{-i}||_∞ ≤ ε_2.\nTheorem 1 (Deterministic Model). Denote μ_l := μ(X_l), r_l := min_{i: x_i ∈ X_l} r(P^l_{-i}), r := min_{l=1,...,L} r_l, and suppose μ_l < r_l for all l. If\n\n2 D_1 ε_2^2 / (r^2 - 4 D_1 ε_1 ε_2 r - 2 D_1 ε_2^2) < min_l (r_l - μ_l) / (2 (μ_l + r_l)),   (8)\n\nthen the subspace detection property holds for all λ in the range\n\n1 / (r^2 - 4 D_1 ε_1 ε_2 r - 2 D_1 ε_2^2) < λ < min_l [ (r_l - μ_l) / (2 (μ_l + r_l)) ] · 1 / (2 D_1 ε_2^2).   (9)\n\nIn the ideal case where D_1 = 0, the condition on the upper bound of λ reduces to r_l > μ_l, similar to the condition for SSC in the noiseless case [12].\nBased on Condition (8), under a randomized generative model, we can derive how many irrelevant features can be tolerated.\nTheorem 2 (Random model). Suppose there are L subspaces and, for simplicity, all subspaces have the same dimension d and are chosen uniformly at random. For each subspace there are ρd + 1 points chosen independently and uniformly at random. Up to D_1 features of the data are irrelevant. Each data point (including true and irrelevant features) is independent of the other data points.
Then for some universal constants C_1, C_2, the subspace detection property holds with probability at least 1 - 4/N - N exp(-√ρ d) if\n\nd ≤ D c^2(ρ) log ρ / (12 log N),\n\nand\n\n1 / [ c^2(ρ) log ρ / (2d) - (2 c(ρ) √(log ρ / d) + 1) · C_1 D_1 (log D + C_2 log N) / D ] < λ < [(1 - κ) / (1 + κ)] · D / (C_1 D_1 (log D + C_2 log N)),\n\nwhere κ = √(12 d log N / (D c^2(ρ) log ρ)), and c(ρ) is a constant depending only on the density of data points on the subspace, satisfying (1) c(ρ) > 0 for all ρ > 1, and (2) there is a numerical value ρ_0 such that for all ρ > ρ_0 one can take c(ρ) = 1/√8.\nSimplifying the above conditions, we can determine the number of irrelevant features that can be tolerated. In particular, if d ≥ 2 c^2(ρ) log ρ and we choose λ as\n\nλ = 4d / (c^2(ρ) log ρ),\n\nthen the maximal number of irrelevant features D_1 that can be tolerated is\n\nD_1 = min{ [(1 - κ) / (1 + κ)] · D / (8 C_1 d (log D + C_2 log N)),  C_0 D c^2(ρ) log ρ / (C_1 d (log D + C_2 log N)) },\n\nwith probability at least 1 - 4/N - N exp(-√ρ d).\nIf d ≤ 2 c^2(ρ) log ρ and we choose the same λ, then the number of irrelevant features we can tolerate is\n\nD_1 = min{ [(1 - κ) / (1 + κ)] · D / (4√2 C_1 (log D + C_2 log N)),  C_0 D c^2(ρ) log ρ / (C_1 d (log D + C_2 log N)) },\n\nwith probability at least 1 - 4/N - N exp(-√ρ d).\nRemark 1. If D is much larger than D_1, the lower bound on λ is proportional to the subspace dimension d. When d increases, the upper bound on λ decreases, since (1 - κ)/(1 + κ) decreases.
Thus the valid range of λ shrinks as d increases.\nRemark 2. Ignoring the logarithmic terms, when d is large, the tolerable D_1 is proportional to min(C_1 (1-κ)/(1+κ) · D/d, C_2 · D/d). When d is small, D_1 is proportional to min(C_1 (1-κ)/(1+κ) · D/√d, C_2 · D/√d).\n\n4 Roadmap of the Proof\n\nIn this section, we lay out the roadmap of the proof. Specifically, we want to establish conditions, in terms of the number of irrelevant features and the structure of the data (i.e., the incoherence μ and the inradius r), under which the algorithm succeeds. We provide a lower bound on λ such that the optimal solution c_i is not trivial, and an upper bound on λ such that the Self-Expressiveness Property holds. Combining the two establishes the theorems.\n\n4.1 Self-Expressiveness Property\n\nThe Self-Expressiveness Property is related to the upper bound on λ. The proof technique is inspired by [18] and [12]. We first establish the following lemma, which provides a sufficient condition for the Self-Expressiveness Property to hold for problem (6).\nLemma 1. Consider a matrix Σ ∈ R^{N×N} and γ ∈ R^{N×1}. If there exists a pair (c̃, ξ) such that c̃ has support S ⊆ T and\n\nsgn(c̃_s) + Σ_{s,η} ξ_η = 0,\n||Σ_{s^c ∩ T, η} ξ_η||_∞ ≤ 1,\n||ξ||_1 = λ,\n||Σ_{T^c, η} ξ_η||_∞ < 1,   (10)\n\nwhere η is the set of indices i such that |(Σ c̃ - γ)_i| = ||Σ c̃ - γ||_∞, then every optimal solution c* of the problem P(Σ, γ) satisfies c*_{T^c} = 0.\n\nThe variable ξ in Lemma 1 is often termed the "dual certificate".
We next consider an oracle problem P(Σ̂_{l,l}, γ̂_l), and use its dual optimal variable, denoted by ξ̂, to construct such a dual certificate. This candidate automatically satisfies all the conditions in Lemma 1 except\n\n||Σ̂_{lc,η̂} ξ̂_η̂||_∞ < 1,   (11)\n\nwhere lc denotes the set of indices except those corresponding to subspace l. We can compare this condition with the corresponding one in the analysis of SSC, in which one needs ||(X^{(lc)})^T v||_∞ < 1, where v is the dual certificate. Recall that we can decompose Σ̂_{lc,η̂} = (X_A)^{(lc)T} (X_A)_η̂ + Σ̃_{lc,η̂}. Thus Condition (11) becomes\n\n||(X_A)^{(lc)T} ((X_A)_η̂ ξ̂_η̂) + Σ̃_{lc,η̂} ξ̂_η̂||_∞ < 1.   (12)\n\nTo show that this holds, we need to bound the two terms ||(X_A)_η̂ ξ̂_η̂||_2 and ||Σ̃_{lc,η̂} ξ̂_η̂||_∞.\n\nBounding ||Σ̃||_∞ and ||γ̃||_∞\n\nThe following lemma relates D_1 to ||Σ̃||_∞ and ||γ̃||_∞.\nLemma 2. Suppose Σ̂ and γ̂ are the robust counterparts of X_{-i}^T X_{-i} and X_{-i}^T x_i, respectively, and among the D + D_1 features, up to D_1 are irrelevant. We can decompose Σ̂ and γ̂ into the following form: Σ̂ = (X_A)_{-i}^T (X_A)_{-i} + Σ̃ and γ̂ = (X_A)_{-i}^T (x_A)_i + γ̃.
We define δ_1 := ||γ̃||_∞ and δ_2 := ||Σ̃||_∞. If ||(x_A)_i||_∞ ≤ ε_1 and ||(X_A)_{-i}||_∞ ≤ ε_2, then δ_2 ≤ 2 D_1 ε_2^2 and δ_1 ≤ 2 D_1 ε_1 ε_2.\nWe then bound ε_1 and ε_2 in the random model using the upper bound on the measure of a spherical cap [2]. Indeed, we have ε_1 ≤ C_1 (log D + C_2 log N)/√D and ε_2 ≤ C_1 (log D + C_2 log N)/√D with high probability.\n\nBounding ||X_η̂ ξ̂_η̂||_2\n\nBy exploiting the feasibility conditions in the dual of the oracle problem, we obtain the following bound:\n\n||X_η̂ ξ̂_η̂||_2 ≤ 1 + 2 D_1 λ ε_2^2.\n\nFurthermore, r(P^l_{-i}) can be lower bounded by (c(ρ)/√2) √(log ρ / d) in the random model with high probability, and ε_2 can be upper bounded by C_1 (log D + C_2 log N)/√D. Thus the RHS can be upper bounded. Plugging this upper bound into (12), we obtain the upper bound on λ.\n\n4.2 Non-triviality with sufficiently large λ\n\nTo ensure that the solution is not trivial (i.e., not all-zero), we need a lower bound on λ.\nIf λ satisfies the following condition, the optimal solution of problem (6) cannot be zero:\n\nλ > 1 / (r^2(P^l_{-i}) - 4 r(P^l_{-i}) D_1 ε_1 ε_2 - 2 D_1 ε_2^2).   (13)\n\nThe proof idea is to show that when λ is large enough, the trivial solution c = 0 cannot be optimal. In particular, if c = 0, the corresponding value of the primal problem is λ ||γ̂_l||_∞.
We then establish a lower bound on ||γ̂_l||_∞ and an upper bound on ||c||_1 + λ ||Σ̂_{l,l} c - γ̂_l||_∞ so that the following inequality always holds for some carefully chosen c:\n\n||c||_1 + λ ||Σ̂_{l,l} c - γ̂_l||_∞ < λ ||γ̂_l||_∞.   (14)\n\nWe then further bound the RHS of Equation (13) using the bounds on ε_1, ε_2 and r(P^l_{-i}). Notice that condition (14) requires λ > A and condition (11) requires λ < B, where A and B are terms depending on the number of irrelevant features. Thus we require A < B to obtain the maximal number of irrelevant features that can be tolerated.\n\n5 Numerical simulations\n\nIn this section, we use three numerical experiments to demonstrate the effectiveness of our method in handling irrelevant/corrupted features. In particular, we test the performance of our method and the effect of the number of irrelevant features and the subspace dimension d with respect to different λ. In all experiments, the ambient dimension is D = 200, the sample density is ρ = 5, and the subspaces are drawn uniformly at random. Each subspace has ρd + 1 points chosen independently and uniformly at random. We measure the success of the algorithms using the relative violation of the subspace detection property, defined as follows:\n\nRelViolation(C, M) = (Σ_{(i,j) ∉ M} |C|_{i,j}) / (Σ_{(i,j) ∈ M} |C|_{i,j}),\n\nwhere C = [c_1, c_2, ..., c_N] and M is the ground-truth mask containing all (i, j) such that x_i and x_j belong to the same subspace. If RelViolation(C, M) = 0, then the subspace detection property is satisfied. We also check whether we obtain a trivial solution, i.e., whether any column of C is all-zero.\nWe first compare the robust Dantzig selector (λ = 2) with SSC and Lasso-SSC (λ = 10). The results are shown in Figure 3.
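The success metric above is straightforward to compute from the coefficient matrix. A minimal sketch (our own illustration, not the authors' code; `mask` is the Boolean ground-truth matrix M):

```python
import numpy as np

def rel_violation(C, mask):
    """RelViolation(C, M): total |C_ij| outside the ground-truth
    mask M, relative to the total |C_ij| inside it."""
    A = np.abs(C)
    return A[~mask].sum() / A[mask].sum()

def subspace_detection_property(C, mask):
    """Holds iff every column of C is non-trivial (not all-zero)
    and no weight falls outside M (i.e., RelViolation = 0)."""
    nontrivial = bool(np.all(np.abs(C).sum(axis=0) > 0))
    return nontrivial and float(np.abs(C)[~mask].sum()) == 0.0
```

With this convention, a RelViolation of 0 together with non-trivial columns is exactly the subspace detection property checked in the experiments.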
The X-axis is the number of irrelevant features and the Y-axis is the RelViolation defined above. The ambient dimension is D = 200, L = 3, d = 5, and the relative sample density is ρ = 5. The values of the irrelevant features are independently sampled from a uniform distribution on [-2.5, 2.5] in (a) and on [-10, 10] in (b). We observe from Figure 3 that both SSC and Lasso-SSC are very sensitive to irrelevant information. (Notice that RelViolation = 0.1 is quite large and can be considered a clustering failure.) In contrast, the proposed robust Dantzig selector performs very well: even when D_1 = 20, it still detects the true subspaces perfectly. In the same setting, further experiments show that our method breaks down only when D_1 is about 40. We also report further experiments for Lasso-SSC with different λ in the supplementary material, showing that Lasso-SSC is not robust to irrelevant features.\nWe also examine the relation of λ to the performance of the algorithm. In Figure 4a, we test the subspace detection property with different λ and D_1. When λ is too small, the algorithm gives a trivial solution (the black region in the figure). As we increase the value of λ, the corresponding solutions satisfy the subspace detection property (represented as the white region in the figure). When λ is larger than a certain upper bound, RelViolation becomes non-zero, indicating errors in subspace clustering. In Figure 4b, we test the subspace detection property with different λ and d. Notice that we rescale λ by d, since by Theorem 2, λ should be proportional to d.
We observe that the valid region of λ shrinks with increasing d, which matches our theorem.\n\n(a) (b)\n\nFigure 3: RelViolation with different D_1. Simulated with D = 200, d = 5, L = 3, ρ = 5, λ = 2, and D_1 from 1 to 20.\n\n(a) Exact recovery with different numbers of irrelevant features. Simulated with D = 200, d = 5, L = 3, ρ = 5 with an increasing D_1 from 1 to 10. Black region: trivial solution. White region: non-trivial solution with RelViolation = 0. Gray region: RelViolation > 0.02. (b) Exact recovery with different subspace dimensions d. Simulated with D = 200, L = 3, ρ = 5, D_1 = 5 and an increasing d from 4 to 16. Black region: trivial solution. White region: non-trivial solution with RelViolation = 0. Gray region: RelViolation > 0.02.\n\nFigure 4: Subspace detection property with different λ, D_1, d.\n\n6 Conclusion and future work\n\nWe studied subspace clustering with irrelevant features and proposed the "robust Dantzig selector", based on the idea of the robust inner product, essentially a truncated version of the inner product that prevents any single entry from having too large an influence on the result. We established sufficient conditions for the algorithm to exactly detect the true subspaces under the deterministic model and the random model. Simulation results demonstrate that the proposed method is robust to irrelevant information, whereas the performance of the original SSC and Lasso-SSC deteriorates significantly.\nWe now outline some directions for future research. An immediate future work is to study theoretical guarantees of the proposed method under the semi-random model, where each subspace is chosen deterministically, while the samples are randomly distributed on their respective subspaces.
The challenge here is to bound the subspace incoherence: previous methods use the rotation invariance of the data, which is not available in our case, since the robust inner product is not invariant to rotations.\n\nAcknowledgments\n\nThis work is partially supported by the Ministry of Education of Singapore AcRF Tier Two grant R-265-000-443-112, and A*STAR SERC PSF grant R-265-000-540-305.\n\nReferences\n\n[1] Pankaj K. Agarwal and Nabil H. Mustafa. k-means projective clustering. In Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 155–165, 2004.\n\n[2] Keith Ball. An elementary introduction to modern convex geometry. Flavors of Geometry, 31:1–58, 1997.\n\n[3] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3):11, 2011.\n\n[4] Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust sparse regression under adversarial corruption. In Proceedings of the 30th International Conference on Machine Learning, pages 774–782, 2013.\n\n[5] Yudong Chen, Ali Jalali, Sujay Sanghavi, and Huan Xu. Clustering partially observed graphs via convex optimization. The Journal of Machine Learning Research, 15(1):2213–2238, 2014.\n\n[6] Ehsan Elhamifar and René Vidal. Sparse subspace clustering. In CVPR 2009, pages 2790–2797.\n\n[7] Guangcan Liu, Zhouchen Lin, and Yong Yu. Robust subspace segmentation by low-rank representation.
In Proceedings of the 27th International Conference on Machine Learning, pages 663–670, 2010.\n\n[8] Po-Ling Loh and Martin J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, pages 2726–2734, 2011.\n\n[9] Le Lu and René Vidal. Combined central and subspace clustering for computer vision applications. In Proceedings of the 23rd International Conference on Machine Learning, pages 593–600, 2006.\n\n[10] Yi Ma, Harm Derksen, Wei Hong, and John Wright. Segmentation of multivariate mixed data via lossy data coding and compression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(9):1546–1562, 2007.\n\n[11] Shankar R. Rao, Roberto Tron, René Vidal, and Yi Ma. Motion segmentation via robust subspace separation in the presence of outlying, incomplete, or corrupted trajectories. In CVPR 2008.\n\n[12] Mahdi Soltanolkotabi, Emmanuel J. Candès, et al. A geometric analysis of subspace clustering with outliers. The Annals of Statistics, 40(4):2195–2238, 2012.\n\n[13] Mahdi Soltanolkotabi, Ehsan Elhamifar, Emmanuel J. Candès, et al. Robust subspace clustering. The Annals of Statistics, 42(2):669–699, 2014.\n\n[14] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, pages 267–288, 1996.\n\n[15] René Vidal. A tutorial on subspace clustering. IEEE Signal Processing Magazine, 28(2):52–68, 2010.\n\n[16] René Vidal, Yi Ma, and Shankar Sastry. Generalized principal component analysis (GPCA). IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1945–1959, 2005.\n\n[17] René Vidal, Roberto Tron, and Richard Hartley. Multiframe motion segmentation with missing data using PowerFactorization and GPCA.
International Journal of Computer Vision, 79(1):85–105, 2008.\n\n[18] Yu-Xiang Wang and Huan Xu. Noisy sparse subspace clustering. In Proceedings of the 30th International Conference on Machine Learning, pages 89–97, 2013.\n\n[19] H. Xu, C. Caramanis, and S. Sanghavi. Robust PCA via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047–3064, 2012.\n\n[20] Jingyu Yan and Marc Pollefeys. A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In ECCV 2006, pages 94–106, 2006.\n\n[21] Hao Zhu, Geert Leus, and Georgios B. Giannakis. Sparsity-cognizant total least-squares for perturbed compressive sampling. IEEE Transactions on Signal Processing, 59(5):2002–2016, 2011.\n", "award": [], "sourceid": 505, "authors": [{"given_name": "Chao", "family_name": "Qu", "institution": "NUS"}, {"given_name": "Huan", "family_name": "Xu", "institution": "National University of Singapore"}]}