{"title": "Selective Sampling-based Scalable Sparse Subspace Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 12416, "page_last": 12425, "abstract": "Sparse subspace clustering (SSC) represents each data point as a sparse linear combination of other data points in the dataset. In the representation learning step SSC finds a lower dimensional representation of data points, while in the spectral clustering step data points are clustered according to the underlying subspaces. However, both steps suffer from high computational and memory complexity, preventing the application of SSC to large-scale datasets. To overcome this limitation, we introduce Selective Sampling-based Scalable Sparse Subspace Clustering (S5C) algorithm which selects subsamples based on the approximated subgradients and linearly scales with the number of data points in terms of time and memory requirements. Along with the computational advantages, we derive theoretical guarantees for the correctness of S5C. Our theoretical result presents novel contribution for SSC in the case of limited number of subsamples. Extensive experimental results demonstrate effectiveness of our approach.", "full_text": "Selective Sampling-based Scalable Sparse Subspace\n\nClustering\n\nShin Matsushima\nUniversity of Tokyo\n\nsmatsus@graco.c.u-tokyo.ac.jp\n\nAbstract\n\nMaria Brbi\u00b4c\n\nStanford University\n\nmbrbic@cs.stanford.edu\n\nSparse subspace clustering (SSC) represents each data point as a sparse linear\ncombination of other data points in the dataset. In the representation learning step\nSSC \ufb01nds a lower dimensional representation of data points, while in the spectral\nclustering step data points are clustered according to the underlying subspaces.\nHowever, both steps suffer from high computational and memory complexity, pre-\nventing the application of SSC to large-scale datasets. 
To overcome this limitation, we introduce the Selective Sampling-based Scalable Sparse Subspace Clustering (S5C) algorithm, which selects subsamples based on approximated subgradients and scales linearly with the number of data points in terms of time and memory requirements. Along with the computational advantages, we derive theoretical guarantees for the correctness of S5C. Our theoretical result presents a novel contribution for SSC in the case of a limited number of subsamples. Extensive experimental results demonstrate the effectiveness of our approach.\n\n1 Introduction\n\nSubspace clustering algorithms rely on the assumption that high-dimensional data points can be well represented as lying in a union of low-dimensional subspaces. Based on this assumption, the task of subspace clustering is to identify the subspaces and assign data points to the corresponding subspaces [1]. The clustering task is usually performed in two steps: (i) representation learning; and (ii) spectral clustering. In the representation learning step the goal is to find a representation of data points according to the underlying low-dimensional subspaces. The obtained representation is then used to construct the affinity matrix whose entries define similarity between data points. Ideally, the affinity matrix is block diagonal and non-zero values are assigned only to data points lying in the same subspace. Given an affinity matrix as input, spectral clustering [2] assigns subspace membership to data points. In particular, spectral clustering defines the clustering problem as a minimum cut problem on a graph and minimizes relaxed versions of the originally NP-hard normalized cut (NCut) [3, 4] or ratio cut (RCut) [5] objective functions.\n\nSubspace clustering algorithms often differ in the regularizations imposed on the representation matrix, such as sparsity [6–8], low-rankness [9, 10], or their combination [11–13].
In this paper we are interested in Sparse Subspace Clustering (SSC), proposed by Elhamifar and Vidal [7]. SSC imposes a sparsity constraint on the data representation matrix by solving an ℓ1-norm regularized objective. SSC enjoys strong theoretical guarantees and can succeed in the noiseless case even when subspaces intersect [14]. Moreover, SSC is provably effective with noisy data as long as the magnitude of the noise does not exceed a certain threshold [15]. Tsakiris and Vidal [16] recently established guarantees for SSC with missing data.\n\nDespite the strong theoretical guarantees [14, 15] and superior performance [7], a key challenge towards the wide applicability of SSC lies in the development of methods able to handle large-scale data. In particular, learning the representation matrix takes O(N^3) operations in the ADMM-based solver used in SSC, where N is the number of data points. Pourkamali-Anaraki and Becker [17] address this problem by proposing a more efficient implementation based on the matrix-inversion lemma; however, the method still requires O(N^2) operations. The same problem is present in the spectral clustering step, which performs an eigenvalue decomposition of the Laplacian matrix, resulting in polynomial time complexity. In addition to the high time complexity, the memory cost of SSC requires O(N^2) space. Overall, the high time and space complexity limit the application of SSC to small or moderately sized datasets. Since unlabeled data is often easily obtainable, this limitation is in contrast with many real-world clustering tasks.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nMotivated by the above challenges, we propose the Selective Sampling-based Scalable Sparse Subspace Clustering (S5C) algorithm, which scales linearly with the number of data points in terms of computational and memory requirements.
Instead of relying on a random subsample [18–20], the key idea of our approach is to select data points in terms of the most violating subgradient of the objective function. In the representation learning step, we solve a small number of LASSO problems [21] and select subsamples in an iterative manner. Once the representation matrix is obtained, we perform spectral clustering by approximating the eigenvectors of the graph Laplacian using the block version of the power method. Whereas in the general setting the power method suffers from quadratic complexity, S5C achieves linear time and space thanks to the guarantee that at most O(N) elements are different from zero in the subspace learning step.\n\nFrom the theoretical aspect, we provide approximation guarantees under which the subspace detection property of S5C is preserved. Our main result states that SSC can exactly recover subspaces even in the case of a limited number of subsamples, where the number of subsamples is independent of the data size. This notable result has a broader significance and can be applied to other sampling-based SSC algorithms, such as [19, 20]. Moreover, the theory implies that selective sampling is advantageous compared to random sampling. Extensive experiments on real-world datasets of varying size demonstrate the superior clustering performance of S5C compared to the state-of-the-art large-scale sparse subspace clustering algorithms. Considering that all existing methods avoid directly solving the ℓ1-regularized basis pursuit problem, to the best of our knowledge, this is the first method with the original SSC formulation that scales linearly with the number of data points.\n\n1.1 Related work\n\nAlgorithmic aspect. Much of the existing work has been devoted to scaling the representation learning step in SSC. Although more efficient than SSC, Orthogonal Matching Pursuit (OMP) [22, 23] and nearest-neighbor-based SSC methods [24, 25] do not scale well to large datasets.
Scalable Sparse Subspace Clustering (SSSC) [19, 20] randomly samples a small set of data points and performs SSC on it. Out-of-sample data points are then classified by minimizing the residual over the in-sample data. Although this method alleviates the large-N problem, the original SSC is still performed only on a small-scale dataset. Furthermore, relying only on a random subsample can result in weak performance when the subsample is not representative of the original dataset. All existing methods avoid directly solving the ℓ1-regularized basis pursuit problem. In contrast to them, S5C preserves the original construction of the affinity matrix of SSC. In the spectral clustering step, most existing methods apply computationally inefficient spectral clustering. Power Iteration Clustering (PIC) [26] has been proposed as a fast and scalable alternative to spectral clustering. However, if PIC is applied to SSC, the theoretical guarantees of SSC no longer hold. On the other hand, our spectral clustering step preserves the theoretical guarantees, while retaining the advantages of PIC: fast convergence, scalability and simple implementation. Furthermore, experiments show that it achieves significantly better performance than PIC.\n\nTheoretical aspect. Another limitation of the existing work lies in the inability to preserve the desirable theoretical properties of SSC. EnSC-ORGEN [27] and SSC-OMP [23] derive scalable active set methods and prove the subspace preserving property for arbitrary subspaces. However, their guarantee holds only for a finite number of subsamples which can be as large as the whole dataset, and therefore does not ensure that the algorithm is more efficient than SSC. The recently proposed exemplar-based subspace clustering [28] selects a subset of data points such that robustness to imbalanced data is achieved and constructs the affinity matrix by nearest neighbors.
Although it has linear time and memory complexity, it fails to prove the subspace preserving property except in the setting of independent subspaces, which is an overly restrictive assumption [29]. SSSC [19, 20] relies on a random subset selection and does not provide any theoretical justification. Whereas our focus in this work is on selecting samples based on subgradient approximation instead of random sampling, we show how our theoretical results can be readily extended to the random sampling case. Table 1 summarizes the relation of our theoretical analyses to the analyses of existing work.\n\nTable 1: Relation to the existing theoretical results\n\nMethod               Subsample  Noise  Data model     Measure for subspaces  Condition on data\nTheorem 2 in [23]    no         no     deterministic  incoherence            large inradius\nTheorem 2.8 in [14]  no         yes    semi-random    affinity               large number of data\nS5C Theorem 1        yes        no     deterministic  incoherence            large persistent inradius\nS5C Theorem 2        yes        no     semi-random    affinity               large number of data\n\n2 Sparse subspace clustering\n\nConsider a data matrix X ∈ R^{M×N} whose columns are N data points drawn from a union of L linear subspaces ∪_{ℓ∈[L]} S_ℓ of unknown dimensions {d_ℓ}_{ℓ∈[L]} in R^M. Sparse subspace clustering (SSC) solves the following optimization problem:\n\nminimize_{C ∈ R^{N×N}}  (1/2) ‖X − XC‖_F^2 + λ ‖C‖_1,  subject to diag(C) = 0,   (1)\n\nwhere C ∈ R^{N×N} is the representation matrix and λ is a hyperparameter for the sparsity regularization. SSC solves the resulting convex optimization problem using the ADMM solver [30, 7].
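Since the objective in (1) separates over the columns of C, each column can be obtained with an off-the-shelf ℓ1 solver. The following is a minimal illustrative sketch only, not the ADMM solver of [30, 7]: it uses scikit-learn's Lasso, and the helper name and toy data are ours.

```python
import numpy as np
from sklearn.linear_model import Lasso

def ssc_self_expression(X, lam=0.01):
    """Solve problem (1) column by column: regress each x_i on the remaining
    columns of X with an l1 penalty; diag(C) = 0 is enforced by excluding
    x_i from its own dictionary."""
    M, N = X.shape
    C = np.zeros((N, N))
    for i in range(N):
        idx = [j for j in range(N) if j != i]
        # sklearn's Lasso minimizes 1/(2*n_samples)*||y - Aw||^2 + alpha*||w||_1,
        # so alpha = lam / M matches the (1/2)||.||^2 + lam*||.||_1 objective.
        model = Lasso(alpha=lam / M, fit_intercept=False, max_iter=10000)
        model.fit(X[:, idx], X[:, i])
        C[idx, i] = model.coef_
    return C

# Toy example: two orthogonal 1-dimensional subspaces (lines) in R^3.
u, v = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
X = np.column_stack([u, -2 * u, 0.5 * u, v, 3 * v, -v])
C = ssc_self_expression(X, lam=0.01)
W = np.abs(C) + np.abs(C).T  # symmetric affinity matrix, as in SSC
```

On this toy data the learned C is subspace preserving: points on the line spanned by u are represented only by other points on that line, so W is block diagonal up to a permutation of the data.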
Once the representation matrix is obtained, the affinity matrix W ∈ R^{N×N} is constructed to achieve symmetry as W = |C| + |C|^⊤.\n\nGiven the affinity matrix W and the number of clusters L, SSC applies a spectral clustering algorithm [4, 2]. Specifically, it finds the L eigenvectors corresponding to the L smallest eigenvalues of the symmetric normalized graph Laplacian matrix defined as L_S = I_N − D^{−1/2} W D^{−1/2}, where D ∈ R^{N×N} is the diagonal degree matrix whose (i, i)-th element is the sum of the i-th column of W. Given the matrix whose columns are the L eigenvectors, cluster memberships of data points are obtained by applying the K-means algorithm to the normalized rows of this matrix.\n\n3 Selective sampling-based SSC\n\nIn this section, we first propose how to efficiently learn the representation matrix in SSC, and then propose a solution for scaling the spectral clustering step. The time and memory complexity of the S5C algorithm are analyzed in Appendix A.\n\n3.1 Representation learning\n\nIn the representation learning step we aim to solve the SSC problem in (1) using only a small number of selectively sampled data points instead of the entire data matrix X. Let C_ji denote the (j, i)-th element of C and x_i ∈ R^M denote the i-th column of X.
The problem in (1) can be decomposed into N problems, where the following problem needs to be solved for the i-th column of C:\n\nminimize_{(C_ji)_{j∈[N]} ∈ R^N}  (1/2) ‖x_i − Σ_{j∈[N]} C_ji x_j‖_2^2 + λ Σ_{j∈[N]} |C_ji|,  subject to C_ii = 0.   (2)\n\nNote that for each i ∈ [N] the decomposed problem in (2) has O(N) parameters, so the resulting time and space complexity is O(N^2), which is not acceptable for large-scale data.\n\nFollowing the basic subspace clustering assumption that data points are generated from low-dimensional subspaces, a key intuition of our approach is that we can effectively approximate the solution of (2) using only a small number of selectively sampled data points instead of the whole data matrix X. Specifically, we solve the following problem:\n\nminimize_{(C_ji)_{j∈[N]} ∈ R^N}  (1/2) ‖x_i − Σ_{j∈[N]} C_ji x_j‖_2^2 + λ Σ_{j∈[N]} |C_ji|,  subject to C_ji = 0, ∀j ∈ {i} ∪ ([N] \\ S),   (3)\n\nwhere S ⊂ [N] denotes the indices of the selected subsamples. This problem can be solved by standard solvers of the ℓ1 minimization problem, such as GLMNET [31] and coordinate descent methods [32], with time and space complexity independent of N. The key challenge of the approach is to obtain a set of subsamples S such that all subspaces are sufficiently covered and the obtained solution is close to the global solution of (1).\n\nTo solve this challenge, we propose an incremental algorithm for obtaining S based on the stochastic approximation of a subgradient. Let us assume that (C^S_ji)_ji = C^S is formed by the optimal solutions of (3) and we need to find the next data point i+ ∈ [N] \\ S so that C^{S∪{i+}} is close to the optimal solution of (1). Our strategy is to choose the next i+ in terms of the most violating subgradient. We explain below how we define the violation and compute it efficiently.\n\nFirst, let G_ji be the subdifferential of the objective function in (2) with respect to C^S_ji. Then, a necessary condition for the objective function of (2) not to decrease by newly adding i′ ∈ [N] \\ S to S can be written as G_i′i ∋ 0, for all i ∈ [N] \\ {i′}. Here, the subdifferential G_i′i is given by the following equation:\n\nG_i′i = ⟨x_i′, Σ_{j∈[N]} C^S_ji x_j − x_i⟩ + [−λ, λ]            if C^S_i′i = 0,\nG_i′i = ⟨x_i′, Σ_{j∈[N]} C^S_ji x_j − x_i⟩ + sign(C^S_i′i) λ   otherwise.\n\nThen, the necessary condition can be written as follows:\n\nG_i′i ∋ 0  ⇔  med{0, ⟨x_i′, Σ_{j∈[N]} C^S_ji x_j − x_i⟩ ± λ} = 0,   (4)\n\nwhere med denotes the median of the three values. To assure that adding the i′-th data point to S always improves the objective function value of (3), the left-hand side of (4) has to be non-zero for at least some i ∈ [N] \\ {i′}. Therefore, we measure the violation of the subgradient for each i′ ∈ [N] \\ S by\n\ng_i′i = med{0, ⟨x_i′, Σ_{j∈[N]} C^S_ji x_j − x_i⟩ ± λ}.   (5)\n\nHowever, computing (5) for all (i′, i) ∈ ([N] \\ S) × [N] requires O(N^2) time. To reduce the time complexity, we perform a stochastic approximation of the amount Σ_{i∈[N]\\{i′}} g_i′i^2. Specifically, we approximate the violation of the subgradient for each i′ ∈ [N] \\ S using a random subsample I ⊂ [N] as\n\nΣ_{i∈[N]\\{i′}} g_i′i^2 ≈ (N − 1)/|I \\ {i′}| · Σ_{i∈I\\{i′}} g_i′i^2,\n\nwhere |·| denotes the cardinality function. Finally, we select i+ as the maximizer of the right-hand side among i′ ∈ [N] \\ S, which can be computed in O(|I| N), where |I| ≪ N and can be considered a constant. In all experiments and analyses, we use only one random subsample, i.e., |I| = 1. Since this is not a time-critical step, using more subsamples benefits the algorithm. Pseudocode of the representation learning step is summarized in Algorithm 1.\n\n3.2 Spectral clustering\n\nGiven the sparse affinity matrix W, the S5C algorithm efficiently solves the spectral clustering step by performing eigenvalue decomposition using orthogonal iteration. The power method is a well-known approach for approximating the dominant eigenvector by iterative matrix-vector multiplication. Orthogonal iteration computes eigenvectors in a block by iteratively performing matrix-matrix multiplication and orthogonalization of the block using QR factorization. In the general setting, orthogonal iteration suffers from O(N^2) computational complexity. On the other hand, orthogonal iteration in our setting enjoys O(N) scalability. This is achieved by the guarantee that W contains a number of non-zero elements that is linear in the number of data points N.\n\nThe spectral clustering algorithm requires computation of the L eigenvectors associated with the L smallest eigenvalues of the symmetric normalized Laplacian matrix L_S.
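The orthogonal (block power) iteration used in this step can be sketched in a few lines. This is an illustrative implementation only (the function name and toy matrix are ours); in S5C it is applied to a shifted matrix whose dominant eigenvectors are exactly the desired ones, and each iteration costs O(Nk) when the matrix has O(N) non-zeros:

```python
import numpy as np
from scipy import sparse

def orthogonal_iteration(A, k, n_iter=100, tol=1e-8, seed=0):
    """Block power method: repeatedly multiply a block of k vectors by A and
    re-orthogonalize it with a QR factorization. For a sparse symmetric A
    with O(N) non-zeros, one iteration costs O(N k)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((A.shape[0], k)))
    for _ in range(n_iter):
        Q_new, _ = np.linalg.qr(A @ Q)
        # sign-insensitive convergence check on the whole block
        if np.linalg.norm(np.abs(Q_new) - np.abs(Q)) < tol:
            return Q_new
        Q = Q_new
    return Q

# Example: the two dominant eigenvectors of a small sparse diagonal matrix.
A = sparse.diags([3.0, 2.0, 1.0, 0.5])
Q = orthogonal_iteration(A, k=2)  # columns converge to e_1 and e_2 (up to sign)
```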
They can be found as the L largest eigenvectors of the positive semidefinite matrix 2I_N − L_S, where the factor 2 comes from the upper bound on the eigenvalues of L_S [33]. We then apply orthogonal iteration to the matrix 2I_N − L_S to find its L largest eigenvectors. We check the convergence condition of the orthogonal iteration by evaluating the scaled norm difference between the previous and current solutions. The spectral clustering step of S5C is summarized in Algorithm 2 in Appendix A.\n\nAlgorithm 1 Representation learning step in S5C\nRequire: Dataset (x_1, . . . , x_N) ∈ R^{M×N}, hyperparameter λ, number of iterations T, batch size B\n1: S ← ∅\n2: for t ∈ [T] do\n3:   Randomly sample I ⊂ [N] such that |I| = B\n4:   Obtain (C_ji)_{j∈[N]} by solving (3) for i ∈ I\n5:   g_i′i ← med{0, ⟨x_i′, Σ_{j∈S\\{i}} C_ji x_j − x_i⟩ ± λ} for (i′, i) ∈ ([N] \\ S) × I\n6:   i+ ← argmax_{i′∈[N]\\S} (N − 1)/|I \\ {i′}| · Σ_{i∈I\\{i′}} g_i′i^2\n7:   if Σ_{i∈I\\{i+}} g_{i+i}^2 ≠ 0 then\n8:     S ← S ∪ {i+}\n9: Obtain C by solving (3) for all i ∈ [N]\n10: W ← |C| + |C|^⊤\n\n4 Theoretical guarantees\n\nIn this section, we analyze the S5C algorithm from the theoretical aspect. We assume dim(S_ℓ) = d for all ℓ ∈ [L] solely for simplicity of notation. As established in the literature [14], we provide guarantees on the Subspace Detection Property (SDP), which is formally defined as follows.\n\nDefinition 1 (Subspace Detection Property). An algorithm is said to exhibit the subspace detection property if and only if it produces an affinity matrix C ∈ R^{N×N} such that the following conditions hold:\n\n1. For all i ∈ [N], the i-th column of C is not 0.\n2.
For all i ∈ [N], the i-th column of C has non-zero elements only in those rows that correspond to data points that belong to the same subspace as the i-th data point.\n\nSDP is known to be guaranteed if SSC is solved with all data points [27, 15], i.e., |S| = N in our notation. In this work, we show that SDP is guaranteed even when |S| = Õ(dL + L^2), i.e., independent of the number of data points N. We analyze the S5C algorithm under a deterministic data model and a random data model. We provide all proofs in Appendices B and C. Our theoretical results can be easily adapted to the case when data points are randomly sampled (Appendix D).\n\n4.1 Deterministic data model\n\nIn the deterministic data model [15], we assume there is no noise but subspaces can intersect in an arbitrary manner. To quantify the subspace structure, we introduce two measures: persistent inradius and coherence. The persistent inradius of data points is a measure originally introduced in our work as a useful extension of the inradius of data points [14]; it quantifies how uniformly data points are distributed in each subspace. Figure 1 illustrates the idea of the persistent inradius in a low-dimensional space. Coherence [23] is a measure which quantifies the closeness between two subspaces.\n\nDefinition 2 (Inradius). The inradius of a convex body P, denoted by r(P), is defined as the radius of the largest Euclidean ball inscribed in P.\n\nDefinition 3 (Persistent inradius). The persistent inradius with respect to P = {P_i}_{i∈[m]} ⊂ R^d, denoted by ř(P), is defined as the minimum inradius over symmetric convex bodies represented as conv({±P_i}_{i∈I}), where |I| ≥ d.\n\nDefinition 4 (Coherence). The coherence μ(X, Y) between two sets of points of unit norm, X and Y, is defined as\n\nμ(X, Y) = max_{x∈X, y∈Y} ⟨x, y⟩.\n\nTheorem 1.
Assume that data X ∈ R^{M×N} with normalized columns and subspaces {S_ℓ}_{ℓ∈[L]} are given. We define ℓ(i) so that the subspace corresponding to the i-th data point is S_{ℓ(i)}, and S_ℓ = {i ∈ [N] | ℓ(i) = ℓ}. Assume that |S_ℓ| = N/L and dim S_ℓ = d for all ℓ ∈ [L]. X[S_ℓ] denotes the subset of the data which corresponds to S_ℓ. We define\n\nř = min_ℓ ř(X[S_ℓ]),  μ = max_{ℓ≠ℓ′} μ(X[S_ℓ], X[S_ℓ′]).\n\nIf it holds that\n\n0 < μ ≤ λ < ř,  T ≥ 2 (1 + (L/d) log(2Lδ^{−1})) dL,\n\nthen S5C with T iterations and hyperparameter λ has the subspace detection property with probability at least 1 − δ.\n\nThis theorem implies that if μ < ř holds, there exists a hyperparameter λ that makes S5C able to exactly recover the subspaces. The randomness in the model is introduced by the random selection of the subsample I (line 3 of Algorithm 1). The theorem also provides approximation guarantees by implying that the number of iterations sufficient for S5C to obtain SDP with high probability is independent of N. Note that S5C adds a subsample only if the condition in line 7 of Algorithm 1 is satisfied, meaning that fewer than T subsamples can be sufficient for the algorithm to obtain SDP. Therefore, the number of iterations T is linearly connected to the runtime of the algorithm and can be interpreted as an upper bound on the number of subsamples |S|. In the case when subsamples are randomly chosen, we can easily extend the proof and show that T randomly chosen subsamples can also enjoy SDP with high probability. However, in this case the number of iterations T corresponds to the number of subsamples.
Therefore, the theory implies that S5C may need fewer subsamples to satisfy SDP compared to random subsamples. In the case when the number of dimensions varies among the subspaces, it is straightforward to generalize the theorem by setting d = max_{ℓ∈[L]} d_ℓ.\n\nFigure 1: Concept of persistent inradius. Left: inradius r(conv(P)) where P = {±P1, ±P2, ±P3}. Right: persistent inradius ř(P).\n\n4.2 Random data model\n\nWe introduce the semi-random model [14, 15] for our analysis of the random data model.\n\nDefinition 5 (Semi-random model). Data X is drawn from the semi-random model if and only if, for each ℓ, each element of X[S_ℓ] is drawn from the uniform distribution on the surface of the unit ball with respect to the subspace S_ℓ.\n\nTo measure the closeness between two subspaces under the random data model, we introduce affinity.\n\nDefinition 6 (Affinity). The affinity between two d-dimensional subspaces S and S′ in R^M, denoted by aff(S, S′), is defined as follows:\n\naff(S, S′) = max_{U∈O(S)} max_{V∈O(S′)} ‖U^⊤ V‖_F,\n\nwhere O(S) denotes the set of matrices which induce a projection onto S, i.e., O(S) = {V = (v_j)_j ∈ R^{d×M} | v_j ∈ S, ⟨v_i, v_j⟩ = δ_ij}.\n\nAn alternative definition of affinity in terms of principal angles can be found in [14, 15].\n\nTheorem 2. Assume that data X ∈ R^{M×N} is drawn from the semi-random model in which subspaces {S_ℓ}_{ℓ∈[L]} are given.
We define\n\nρ = N/(dL),  a = min_{ℓ≠ℓ′} aff(S_ℓ, S_ℓ′).\n\nIf it holds that\n\n4 < log ρ < 4d,  a ≤ λ < √(log ρ / (8d)),  T ≥ 2 (1 + (L/d) log(2Lδ^{−1})) dL,   (6)\n\nthen S5C with T iterations and hyperparameter λ has the subspace detection property with probability at least 1 − δ − L exp(−d√ρ).\n\nThis theorem implies that if the conditions (6) on ρ, λ and T hold, then S5C satisfies SDP with high probability. This theorem can also be easily adapted to the case of randomly selected data points.\n\n5 Experimental evaluation\n\nBaselines and evaluation metrics. We compare clustering performance and scalability to other SSC-based methods, including Sparse Subspace Clustering (SSC) [7], Scalable Sparse Subspace Clustering (SSSC) [19, 20], Sparse Subspace Clustering via Orthogonal Matching Pursuit (SSC-OMP) [22] and Elastic Net Subspace Clustering with ORacle Guided Elastic Net (EnSC-ORGEN) [27]. Besides sparse subspace clustering methods, we compare performance to the Nyström algorithm [34] and Approximate Kernel K-means (AKK) [35]. Our code is available at https://github.com/smatsus/S5C. Clustering performance is evaluated in terms of the clustering error (CE) defined as\n\nCE(r̂, r) = min_{π∈Π_L} (1 − (1/N) Σ_{i∈[N]} 1{π(r̂_i) = r_i}),\n\nwhere Π_L is the set of all permutations on [L].\n\nBenchmark datasets.
We verify the effectiveness of S5C on seven benchmark datasets: the face image dataset Yale B [36, 37], the motion segmentation dataset Hopkins 155 [38], the object recognition datasets COIL-100 [39] and CIFAR-10 [40], the handwritten digits dataset MNIST [41], the letter recognition dataset of different fonts Letter-rec [42], and the handwritten character recognition dataset Devanagari [43]. The summary of the datasets and the details of the experimental setup are provided in Appendix E.\n\nClustering performance. The clustering error of the S5C algorithm compared to the state-of-the-art methods on the seven real-world datasets is presented in Table 2. The results show that S5C is the only algorithm which consistently achieves good performance, with a 13% better median performance than the second best method, EnSC-ORGEN. On the COIL-100 dataset, which has 100 classes, S5C achieves a score close to the SSC baseline and significantly outperforms all other methods. In all experiments, we use only one random subsample, i.e., |I| = 1. In order to examine the sensitivity of S5C to the random sampling in line 3 of Algorithm 1, we rerun the algorithm with different random seeds and report means and standard deviations over 10 runs. The results demonstrate that S5C is not sensitive to this step; the standard deviation varies from 0.4% to 2.3% across all datasets.\n\nTable 2: Clustering error (%): The character '/' denotes that either the time limit of 24 hours or the memory limit of 16 GB was exceeded.
Standard deviations of S5C are given in parentheses.\n\nDataset      Nyström  AKK   SSC   SSC-OMP  EnSC-ORGEN  SSSC  S5C\nYale B       76.8     85.7  33.8  35.9     59.6        37.4  39.3 (1.8)\nHopkins 155  21.8     20.6  4.1   23.0     21.1        20.5  14.6 (0.4)\nCOIL-100     54.5     53.1  42.5  57.9     67.8        89.7  45.9 (0.5)\nLetter-rec   73.3     71.7  /     95.2     68.4        68.6  67.7 (1.3)\nCIFAR-10     76.6     75.6  /     /        82.4        82.4  75.1 (0.8)\nMNIST        45.7     44.6  /     /        48.7        28.7  40.4 (2.3)\nDevanagari   73.5     72.8  /     /        84.9        58.6  67.2 (1.3)\n\nComputational time. We compare computational time to the other large-scale methods using randomly sampled subsets of the COIL-100 and MNIST datasets. Figures 2 (a) and (b) show the mean computational time for each cardinality of the independent subsets. As expected from the theory, the computational time of S5C increases only linearly with respect to the number of data points. Most of the time of S5C is taken by solving LASSO, which is extremely easy to parallelize just by partitioning data points across machines. We do not focus on such implementation improvements, as our point here is not in reporting faster times, but in showing the linear scalability and consequently the ability to handle large-scale data.\n\nBenefits of selective sampling. The main motivation behind the selective sampling in the representation learning step is to capture the structure of the entire dataset better than simple random sampling. To evaluate this hypothesis, we design an experiment which compares the performance of subsamples selected based on the stochastic approximation of the subgradient to random subsamples. For this purpose, we consider a method in which the selective sampling in the representation learning step is replaced with random sampling. We call this method S5C-rand. Figure 2 (c) shows the objective function value with respect to the number of subsamples achieved by the S5C and S5C-rand methods on the Yale B dataset.
It can be seen that for any number of subsamples S5C achieves a lower value of the objective function. Furthermore, by using ∼75% of the subsamples S5C achieves the same objective function value as S5C-rand.\n\nFigure 2: (a) and (b) Relation between training time and number of data points on the COIL-100 and MNIST datasets. (c) Objective function value of selective and random sampling based S5C on the Yale B dataset.\n\nBenefits of orthogonal iteration. We further compare the performance and time efficiency of the spectral clustering step in S5C with a classic eigenvalue decomposition algorithm for the normalized cut (NCut), referred to as NCutE in [26], and Power Iteration Clustering (PIC) [26]. In the PIC method the authors use the power method to find the dominant eigenvector and then apply K-means clustering to the resulting one-dimensional vector. We call our method Orthogonal Iteration Clustering (OIC). We design the experiment so that each of the algorithms receives at its input the same affinity matrix W obtained by the S5C representation learning step. In this way we compare the clustering performance of only the spectral decomposition. Since the high computational complexity of NCut limits its application to large datasets, we compare performance on the Yale B and COIL-100 datasets.
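For reference, the clustering error (CE) used in these comparisons (Section 5) can be computed without enumerating all permutations in Π_L by solving a linear assignment problem. A small sketch (the function name is ours; it assumes integer labels and uses scipy's Hungarian solver in place of brute force):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(pred, truth):
    """CE(r_hat, r) = min over label permutations pi of the fraction of
    points with pi(pred_i) != truth_i. The optimal permutation is found
    with the Hungarian algorithm instead of brute-force enumeration."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    L = int(max(pred.max(), truth.max())) + 1
    # agree[a, b] = number of points assigned to cluster a with true class b
    agree = np.zeros((L, L), dtype=int)
    for a, b in zip(pred, truth):
        agree[a, b] += 1
    rows, cols = linear_sum_assignment(-agree)  # maximize total agreement
    return 1.0 - agree[rows, cols].sum() / len(pred)

print(clustering_error([0, 0, 1, 1], [1, 1, 0, 0]))  # 0.0: perfect up to relabeling
```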
To avoid the computational time being dominated by K-means clustering, we report the time obtained with a single execution of K-means, while in practice it is often executed several times with different initializations. The results are shown in Table 3. The experiments demonstrate that OIC performs comparably to NCut and does not degrade clustering performance. Although PIC has lower computational time than OIC, it fails to provide satisfying clustering accuracy.

Table 3: Clustering error (CE) and computational time of the spectral clustering step on the Yale B and COIL-100 datasets.

Dataset    Measure   NCut   PIC   OIC
Yale B     CE (%)    42.7   83.9  42.8
           Time (s)  38.7   12.0  16.2
COIL-100   CE (%)    47.0   77.4  45.4
           Time (s)  290.0  16.4  27.4

6 Conclusion

Building on the existing work on sparse subspace clustering (SSC), this paper introduced an efficient SSC algorithm, called S5C, that scales linearly with the number of data points in both the representation learning and spectral clustering steps. We derived theoretical conditions under which the subspace detection property of S5C is preserved. Besides computational efficiency, experimental results showed that S5C achieves performance improvements over existing large-scale sparse subspace clustering algorithms. Our algorithm is not restricted to SSC and can easily be extended to elastic net subspace clustering. We believe our approach will expand the applicability of sparse subspace clustering algorithms to large-scale datasets.

Acknowledgments
This work was supported by KAKENHI 19K20336.

References
[1] R. Vidal. Subspace clustering. IEEE Signal Processing Magazine, 28(2):52–68, 2011.
[2] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
[3] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[4] A. Y. Ng, M. I.
Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems (NeurIPS), pages 849–856, 2001.
[5] L. Hagen and A. B. Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 11(9):1074–1085, 1992.
[6] E. Elhamifar and R. Vidal. Sparse subspace clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2790–2797, 2009.
[7] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765–2781, 2013.
[8] R. Vidal and P. Favaro. Low rank subspace clustering (LRSC). Pattern Recognition Letters, 43:47–61, 2014.
[9] G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In International Conference on Machine Learning (ICML), pages 663–670, 2010.
[10] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):171–184, 2013.
[11] Y.-X. Wang, H. Xu, and C. Leng. Provable subspace clustering: When LRR meets SSC. In Advances in Neural Information Processing Systems (NeurIPS), pages 64–72, 2013.
[12] M. Brbić and I. Kopriva. Multi-view low-rank sparse subspace clustering. Pattern Recognition, 73:247–258, 2018.
[13] M. Brbić and I. Kopriva. ℓ0-motivated low-rank sparse subspace clustering. IEEE Transactions on Cybernetics, pages 1–15, 2018. ISSN 2168-2267. doi: 10.1109/TCYB.2018.2883566.
[14] M. Soltanolkotabi and E. J. Candes. A geometric analysis of subspace clustering with outliers. The Annals of Statistics, 40(4):2195–2238, 2012.
[15] Y.-X. Wang and H. Xu. Noisy sparse subspace clustering.
The Journal of Machine Learning Research, 17(1):320–360, 2016.
[16] M. C. Tsakiris and R. Vidal. Theoretical analysis of sparse subspace clustering with missing entries. In International Conference on Machine Learning (ICML), pages 4006–4015, 2018.
[17] F. Pourkamali-Anaraki and S. Becker. Efficient solvers for sparse subspace clustering. arXiv preprint arXiv:1804.06291, 2018.
[18] D. Cai and X. Chen. Large scale spectral clustering via landmark-based sparse representation. IEEE Transactions on Cybernetics, 45(8):1669–1680, 2015.
[19] X. Peng, L. Zhang, and Z. Yi. Scalable sparse subspace clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 430–437, 2013.
[20] X. Peng, H. Tang, L. Zhang, Z. Yi, and S. Xiao. A unified framework for representation-based subspace clustering of out-of-sample and large-scale data. IEEE Transactions on Neural Networks and Learning Systems, 27(12):2499–2512, 2018.
[21] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
[22] E. L. Dyer, A. C. Sankaranarayanan, and R. G. Baraniuk. Greedy feature selection for subspace clustering. Journal of Machine Learning Research, 14(1):2487–2517, 2013.
[23] C. You, D. Robinson, and R. Vidal. Scalable sparse subspace clustering by orthogonal matching pursuit. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3918–3927, 2016.
[24] D. Park, C. Caramanis, and S. Sanghavi. Greedy subspace clustering. In Advances in Neural Information Processing Systems (NeurIPS), pages 2753–2761, 2014.
[25] R. Heckel and H. Bölcskei. Robust subspace clustering via thresholding. IEEE Transactions on Information Theory, 61(11):6320–6342, 2015.
[26] F. Lin and W. W. Cohen. Power iteration clustering.
In International Conference on Machine Learning (ICML), pages 655–662, 2010.
[27] C. You, C. G. Li, D. P. Robinson, and R. Vidal. Oracle based active set algorithm for scalable elastic net subspace clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3928–3937, 2016.
[28] C. You, C. Li, D. P. Robinson, and R. Vidal. A scalable exemplar-based subspace clustering algorithm for class-imbalanced data. In European Conference on Computer Vision (ECCV), pages 68–85. Springer, 2018.
[29] K. Tang, R. Liu, Z. Su, and J. Zhang. Structure-constrained low-rank representation. IEEE Transactions on Neural Networks and Learning Systems, 25(12):2167–2179, 2014. ISSN 2162-237X.
[30] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[31] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
[32] T. T. Wu and K. Lange. Coordinate descent algorithms for Lasso penalized regression. The Annals of Applied Statistics, 2(1):224–244, 2008.
[33] F. R. K. Chung. Spectral graph theory. Number 92. American Mathematical Society, 1997.
[34] C. Fowlkes, S. Belongie, F. Chung, and J. Malik. Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):214–225, 2004.
[35] R. Chitta, R. Jin, T. C. Havens, and A. K. Jain. Approximate kernel k-means: Solution to large scale kernel clustering. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 895–903. ACM, 2011.
[36] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman.
From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):643–660, 2001.
[37] K.-C. Lee, J. Ho, and D. J. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):684–698, 2005.
[38] R. Tron and R. Vidal. A benchmark for the comparison of 3-d motion segmentation algorithms.
[39] S. A. Nene, S. K. Nayar, and H. Murase. Columbia Object Image Library (COIL-100). Technical report, Columbia University, 1996.
[40] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[41] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[42] P. W. Frey and D. J. Slate. Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6(2):161–182, 1991.
[43] S. Acharya, A. K. Pant, and P. K. Gyawali. Deep learning based large scale handwritten Devanagari character recognition. In International Conference on Software, Knowledge, Information Management and Applications (SKIMA), pages 1–6. IEEE, 2015.