{"title": "Differentially private subspace clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 1000, "page_last": 1008, "abstract": "Subspace clustering is an unsupervised learning problem that aims at grouping data points into multiple ``clusters'' so that data points in a single cluster lie approximately on a low-dimensional linear subspace. It is originally motivated by 3D motion segmentation in computer vision, but has recently been generically applied to a wide range of statistical machine learning problems, which often involves sensitive datasets about human subjects. This raises a dire concern for data privacy. In this work, we build on the framework of ``differential privacy'' and present two provably private subspace clustering algorithms. We demonstrate via both theory and experiments that one of the presented methods enjoys formal privacy and utility guarantees; the other one asymptotically preserves differential privacy while having good performance in practice. Along the course of the proof, we also obtain two new provable guarantees for the agnostic subspace clustering and the graph connectivity problem which might be of independent interests.", "full_text": "Differentially Private Subspace Clustering\n\nMachine Learning Department, Carnegie Mellon Universty, Pittsburgh, USA\n\nYining Wang, Yu-Xiang Wang and Aarti Singh\n{yiningwa,yuxiangw,aarti}@cs.cmu.edu\n\nAbstract\n\nSubspace clustering is an unsupervised learning problem that aims at grouping\ndata points into multiple \u201cclusters\u201d so that data points in a single cluster lie ap-\nproximately on a low-dimensional linear subspace. It is originally motivated by\n3D motion segmentation in computer vision, but has recently been generically\napplied to a wide range of statistical machine learning problems, which often in-\nvolves sensitive datasets about human subjects. 
This raises a dire concern for data privacy. In this work, we build on the framework of differential privacy and present two provably private subspace clustering algorithms. We demonstrate via both theory and experiments that one of the presented methods enjoys formal privacy and utility guarantees; the other asymptotically preserves differential privacy while having good performance in practice. Along the course of the proof, we also obtain two new provable guarantees for agnostic subspace clustering and the graph connectivity problem, which might be of independent interest.

1 Introduction

Subspace clustering was originally proposed to solve very specific computer vision problems having a union-of-subspace structure in the data, e.g., motion segmentation under an affine camera model [11] or face clustering under Lambertian illumination models [15]. As it gained increasing attention in the statistics and machine learning community, it has come to be used as an agnostic learning tool on social network [5], movie recommendation [33] and biological datasets [19]. The growing applicability of subspace clustering in these new domains inevitably raises the concern of data privacy, as many such applications involve dealing with sensitive information. For example, [19] applies subspace clustering to identify diseases from personalized medical data, and [33] in fact uses subspace clustering as an effective tool to conduct linkage attacks on individuals in movie rating datasets. Nevertheless, privacy issues in subspace clustering have been largely unexplored in the literature, with the only exception being a brief analysis and discussion in [29]. However, the algorithms and analysis presented in [29] have several notable deficiencies. For example, data points are assumed to be incoherent, and only the differential privacy of individual features of a user is protected, rather than the entire user profile in the database.
The latter means it is possible for an attacker to infer with high confidence whether a particular user is in the database, given sufficient side information.

It is perhaps understandable that little work has focused on private subspace clustering, which is by all means a challenging task. For example, a negative result in [29] shows that if utility is measured in terms of exact clustering, then no private subspace clustering algorithm exists when neighboring databases are allowed to differ on an entire user profile. In addition, state-of-the-art subspace clustering methods like Sparse Subspace Clustering (SSC, [11]) lack a complete analysis of their clustering output, thanks to the notorious "graph connectivity" problem [21]. Finally, clustering could have high global sensitivity even if only cluster centers are released, as depicted in Figure 1. As a result, general private data releasing schemes like output perturbation [7, 8, 2] do not apply.

In this work, we present a systematic and principled treatment of differentially private subspace clustering. To circumvent the negative result in [29], we use the perturbation of the recovered low-dimensional subspaces from the ground truth as the utility measure. Our contributions are two-fold. First, we analyze two efficient algorithms based on the sample-aggregate framework [22] and establish formal privacy and utility guarantees when data are generated from some stochastic model or satisfy certain deterministic separation conditions. New results on (non-private) subspace clustering are obtained along our analysis, including a fully agnostic subspace clustering guarantee on well-separated datasets using stability arguments, and an exact clustering guarantee for thresholding-based subspace clustering (TSC, [14]) in the noisy setting.
Second, we employ the exponential mechanism [18] and propose a novel Gibbs sampler for sampling from the resulting distribution over subspaces and cluster assignments, which involves a novel tweak in sampling from a matrix Bingham distribution. The method works well in practice and we show it is closely related to the well-known mixtures of probabilistic PCA model [27].

Related work. Subspace clustering can be thought of as a generalization of PCA and k-means clustering: the former aims at finding a single low-dimensional subspace, and the latter uses zero-dimensional subspaces as cluster centers. There has been extensive research on private PCA [2, 4, 10] and private k-means [2, 22, 26]. Perhaps the most similar work to ours is [22, 4]: [22] applies the sample-aggregate framework to k-means clustering, and [4] employs the exponential mechanism to recover private principal vectors. In this paper we give non-trivial generalizations of both works to the private subspace clustering setting.

2 Preliminaries

2.1 Notations

For a vector x ∈ R^d, its p-norm is defined as ||x||_p = (Σ_i |x_i|^p)^{1/p}. If p is not explicitly specified then the 2-norm is used. For a matrix A ∈ R^{n×m}, we use σ_1(A) ≥ ··· ≥ σ_n(A) ≥ 0 to denote its singular values (assuming without loss of generality that n ≤ m). We use ||·||_ξ to denote matrix norms, with ξ = 2 the matrix spectral norm and ξ = F the Frobenius norm. That is, ||A||_2 = σ_1(A) and ||A||_F = (Σ_{i=1}^n σ_i(A)²)^{1/2}. For a q-dimensional subspace S ⊆ R^d, we associate with it a basis U ∈ R^{d×q}, where the q columns of U are orthonormal and S = range(U). We use S^d_q to denote the set of all q-dimensional subspaces of R^d. Given x ∈ R^d and S ⊆ R^d, the distance d(x, S) is defined as d(x, S) = inf_{y∈S} ||x − y||_2.
If S is a subspace associated with a basis U, then we have d(x, S) = ||x − P_S(x)||_2 = ||x − UU^T x||_2, where P_S(·) denotes the projection operator onto the subspace S. For two subspaces S, S' of dimension q, the distance d(S, S') is defined as the Frobenius norm of the sin matrix of principal angles; i.e.,

    d(S, S') = ||sin Θ(S, S')||_F = ||UU^T − U'U'^T||_F,    (1)

where U, U' are orthonormal bases associated with S and S', respectively.

2.2 Subspace clustering

Given n data points x_1, ··· , x_n ∈ R^d, the task of subspace clustering is to cluster the data points into k clusters so that data points within a single cluster lie approximately on a low-dimensional subspace. Without loss of generality, we assume ||x_i||_2 ≤ 1 for all i = 1, ··· , n. We also use X = {x_1, ··· , x_n} to denote the dataset and X ∈ R^{d×n} to denote the data matrix obtained by stacking all data points in columnwise order. Subspace clustering seeks to find k q-dimensional subspaces Ĉ = {Ŝ_1, ··· , Ŝ_k} so as to minimize the Wasserstein distance (squared), defined as

    d_W²(Ĉ, C*) = min_{π:[k]→[k]} Σ_{i=1}^k d²(Ŝ_i, S*_{π(i)}),    (2)

where the minimum is taken over all permutations π on [k] and the S*_i are the optimal/ground-truth subspaces.
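For concreteness, the distances in Eqs. (1)-(2) are straightforward to compute from orthonormal bases. The following is a small illustrative sketch (numpy-based; the function names are ours, not from the paper):

```python
import itertools
import numpy as np

def point_to_subspace(x, U):
    """d(x, S) = ||x - U U^T x||_2, for an orthonormal basis U of S."""
    return np.linalg.norm(x - U @ (U.T @ x))

def subspace_dist(U, V):
    """d(S, S') = ||U U^T - V V^T||_F, as in Eq. (1)."""
    return np.linalg.norm(U @ U.T - V @ V.T, "fro")

def wasserstein_sq(C_hat, C_star):
    """d_W^2 from Eq. (2): best matching of estimated to true subspaces."""
    k = len(C_hat)
    return min(
        sum(subspace_dist(C_hat[i], C_star[p[i]]) ** 2 for i in range(k))
        for p in itertools.permutations(range(k))
    )
```

The brute-force minimum over permutations is fine for the small k used throughout the paper (k ≤ 5 in the experiments); for large k one would use a linear assignment solver instead.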
In a model-based approach, C* is fixed and the data points {x_i}_{i=1}^n are generated either deterministically or stochastically from one of the ground-truth subspaces in C* with noise corruption; in a completely agnostic setting, C* is defined as the minimizer of the k-means subspace clustering objective:

    C* := argmin_{C={S_1,···,S_k}⊆S^d_q} cost(C; X) = argmin_{C={S_1,···,S_k}⊆S^d_q} (1/n) Σ_{i=1}^n min_j d²(x_i, S_j).    (3)

To simplify notation, we use Δ²_k(X) = cost(C*; X) to denote the cost of the optimal solution.

Algorithm 1 The sample-aggregate framework [22]
1: Input: X = {x_i}_{i=1}^n ⊆ R^d, number of subsets m, privacy parameters ε, δ; f, d_M.
2: Initialize: s = √m, α = ε/(5√(2 ln(2/δ))), β = ε/(4(D + ln(2/δ))).
3: Subsampling: Select m random subsets of size n/m of X independently and uniformly at random without replacement. Repeat this step until no single data point appears in more than √m of the sets. Mark the subsampled subsets X_{S_1}, ··· , X_{S_m}.
4: Separate queries: Compute B = {s_i}_{i=1}^m ⊆ R^D, where s_i = f(X_{S_i}).
5: Aggregation: Compute g(B) = s_{i*}, where i* = argmin_{i∈[m]} r_i(t_0) with t_0 = (m+s)/2 + 1. Here r_i(t_0) denotes the distance d_M(·,·) between s_i and the t_0-th nearest neighbor to s_i in B.
6: Noise calibration: Compute S(B) = 2 max_k (ρ(t_0 + (k+1)s) · e^{−βk}), where ρ(t) is the mean of the top ⌊s/β⌋ values in {r_1(t), ··· , r_m(t)}.
7: Output: A(X) = g(B) + (S(B)/α) u, where u is a standard Gaussian random vector.

2.3 Differential privacy

Definition 2.1 (Differential privacy, [7, 8]).
A randomized algorithm A is (ε, δ)-differentially private if for all X, Y satisfying d(X, Y) = 1 and all sets S of possible outputs the following holds:

    Pr[A(X) ∈ S] ≤ e^ε Pr[A(Y) ∈ S] + δ.    (4)

In addition, if δ = 0 then the algorithm A is ε-differentially private.

In our setting, the distance d(·,·) between two datasets X and Y is defined as the number of differing columns in X and Y. Differential privacy ensures that the output distribution is obfuscated to the point that every user has plausible deniability about being in the dataset, and in addition any inference about an individual user will have nearly the same confidence before and after the private release.

3 Sample-aggregation based private subspace clustering

In this section we first summarize the sample-aggregate framework introduced in [22] and argue why it should be preferred to conventional output perturbation mechanisms [7, 8] for subspace clustering. We then analyze two efficient algorithms based on the sample-aggregate framework and prove formal privacy and utility guarantees. We also prove new results in our analysis regarding the stability of k-means subspace clustering (Lem. 3.3) and the graph connectivity (i.e., consistency) of noisy threshold-based subspace clustering (TSC, [14]) under a stochastic model (Lem. 3.5).

3.1 Smooth local sensitivity and the sample-aggregate framework

Most existing privacy frameworks [7, 8] are based on the idea of global sensitivity, which is defined as the maximum output perturbation ||f(X_1) − f(X_2)||_ξ, where the maximum is over all neighboring databases X_1, X_2 and ξ = 1 or 2. Unfortunately, the global sensitivity of clustering problems is usually high even if only cluster centers are released.
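To see why high global sensitivity is fatal for conventional output perturbation, recall that the Gaussian mechanism adds noise whose scale is proportional to the global sensitivity. A minimal sketch (the function name and this standard calibration are our illustration, not part of the paper):

```python
import numpy as np

def gaussian_output_perturbation(f_value, global_sens, eps, delta, rng=None):
    """Release f(X) + N(0, sigma^2 I), with sigma calibrated to the
    global L2 sensitivity of f (standard Gaussian-mechanism calibration)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = global_sens * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return f_value + rng.normal(0.0, sigma, size=np.shape(f_value))
```

If the global sensitivity is Θ(1), as for the cluster centers in Figure 1, the noise scale is Θ(1/ε) regardless of the dataset size n, which swamps the signal; this is what motivates the local-sensitivity-based sample-aggregate framework below.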
For example, Figure 1 shows that the global sensitivity of k-means subspace clustering could be as high as O(1), which ruins the algorithm's utility.

Figure 1: Illustration of the instability of k-means subspace clustering solutions (d = 2, k = 2, q = 1). Blue dots represent evenly spaced data points on the unit circle; blue crosses indicate an additional data point. Red lines are optimal solutions.

To circumvent the above-mentioned challenges, Nissim et al. [22] introduced the sample-aggregate framework based on the concept of a smooth version of local sensitivity. Unlike global sensitivity, local sensitivity measures the maximum perturbation ||f(X) − f(X')||_ξ over all databases X' neighboring the input database X. The proposed sample-aggregate framework (pseudocode in Alg. 1) enjoys local sensitivity and comes with the following guarantee:

Theorem 3.1 ([22], Theorem 4.2). Let f : D → R^D be an efficiently computable function, where D is the collection of all databases and D is the output dimension. Let d_M(·,·) be a semimetric on the outer space of f.¹ Set ε > 2D/√m and m = ω(log² n). The sample-aggregate algorithm A in Algorithm 1 is an efficient (ε, δ)-differentially private algorithm.
Furthermore, if f and m are chosen such that the ℓ_1 norm of the output of f is bounded by Λ and

    Pr_{X_S⊆X}[ d_M(f(X_S), c) ≤ r ] ≥ 3/4    (5)

for some c ∈ R^D and r > 0, then the standard deviation of the Gaussian noise added is upper bounded by O(r/ε) + (Λ/ε)·e^{−Ω(ε√m/D)}. In addition, when m satisfies m = ω(D² log²(r/Λ)/ε²), with high probability each coordinate of A(X) − c̄ is upper bounded by O(r/ε), where c̄, depending on A(X), satisfies d_M(c, c̄) = O(r).

Let f be any subspace clustering solver that outputs k estimated low-dimensional subspaces, and let d_M be the Wasserstein distance defined in Eq. (2). Theorem 3.1 provides a privacy guarantee for an efficient meta-algorithm with any f. In addition, a utility guarantee holds under further assumptions on the input dataset X. In the following sections we establish such utility guarantees. The main idea is to prove stability results as outlined in Eq. (5) for particular subspace clustering solvers and then apply Theorem 3.1.

3.2 The agnostic setting

We first consider the setting where the data points {x_i}_{i=1}^n are arbitrarily placed. Under such an agnostic setting the optimal solution C* is defined as the one that minimizes the k-means cost as in Eq. (3). The solver f is taken to be any (1 + ϵ)-approximation² of optimal k-means subspace clustering; that is, f always outputs subspaces Ĉ satisfying cost(Ĉ; X) ≤ (1 + ϵ)·cost(C*; X). Efficient core-set based approximation algorithms exist, for example, in [12]. The key task of this section is to identify assumptions under which the stability condition in Eq. (5) holds with respect to an approximate solver f. The example given in Figure 1 also suggests that an identifiability issue arises when the input data X itself cannot be well clustered.
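To make the workflow of Algorithm 1 concrete, here is a heavily simplified sketch of its structure (subsampling, nearest-neighbor aggregation, noise addition). The exact smooth-sensitivity calibration S(B) of step 6 is replaced by a crude stand-in based on the selected candidate's t_0-nearest-neighbor radius, so this illustrates the control flow only and does not carry the privacy or utility guarantees of Theorem 3.1:

```python
import numpy as np

def sample_aggregate(X, f, m, eps, delta, dist, rng=None):
    """Simplified sketch of Alg. 1: split X into m random subsets, run the
    clustering subroutine f on each, pick the most 'central' output, and
    add Gaussian noise. (Stand-in for S(B); not the real calibration.)"""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.permutation(len(X))
    subsets = np.array_split(idx, m)           # disjoint subsets of size ~n/m
    B = np.array([f(X[s]) for s in subsets])   # candidate outputs s_i
    s = int(np.sqrt(m))
    t0 = (m + s) // 2 + 1
    # r_i(t0): distance from s_i to its t0-th nearest neighbor in B
    D = np.array([[dist(a, b) for b in B] for a in B])
    r = np.sort(D, axis=1)[:, min(t0, m - 1)]
    i_star = int(np.argmin(r))                 # aggregation step, g(B)
    alpha = eps / (5.0 * np.sqrt(2.0 * np.log(2.0 / delta)))
    noise_scale = r[i_star] / alpha            # crude stand-in for S(B)/alpha
    return B[i_star] + rng.normal(0.0, noise_scale, size=B[i_star].shape)
```

The point of the construction is visible even in this sketch: when f is stable across subsamples (Eq. (5)), the radii r_i are small, so very little noise is needed.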
For example, no two straight lines could well approximate data uniformly distributed on a circle. To circumvent the above-mentioned difficulty, we impose the following well-separation condition on the input data X:

Definition 3.2 (Well-separation condition for k-means subspace clustering). A dataset X is (φ, η, ψ)-well separated if there exist constants φ, η and ψ, all between 0 and 1, such that

    Δ²_k(X) ≤ min{ φ²Δ²_{k−1}(X), Δ²_{k,−}(X) − ψ, Δ²_{k,+}(X) + η },    (6)

where Δ_{k−1}, Δ_{k,−} and Δ_{k,+} are defined as Δ²_{k−1}(X) = min_{S_{1:k−1}∈S^d_q} cost({S_i}; X); Δ²_{k,−}(X) = min_{S_1∈S^d_{q−1}, S_{2:k}∈S^d_q} cost({S_i}; X); and Δ²_{k,+}(X) = min_{S_1∈S^d_{q+1}, S_{2:k}∈S^d_q} cost({S_i}; X).

The first condition in Eq. (6), Δ²_k(X) ≤ φ²Δ²_{k−1}(X), constrains that the input dataset X cannot be well clustered using k − 1 instead of k clusters. It was introduced in [23] to analyze the stability of k-means solutions. For subspace clustering, we need another two conditions regarding the intrinsic dimension of each subspace. The condition Δ²_k(X) ≤ Δ²_{k,−}(X) − ψ asserts that replacing a q-dimensional subspace with a (q−1)-dimensional one is not sufficient, while Δ²_k(X) ≤ Δ²_{k,+}(X) + η means an additional subspace dimension does not help much with clustering X.

The following lemma is our main stability result for subspace clustering on well-separated datasets. It states that when a candidate clustering Ĉ is close to the optimal clustering C* in terms of clustering cost, they are also close in terms of the Wasserstein distance defined in Eq. (2).

Lemma 3.3 (Stability of agnostic k-means subspace clustering).
Assume X is (φ, η, ψ)-well separated with φ² < 1/1602 and ψ > η. Suppose a candidate clustering Ĉ = {Ŝ_1, ··· , Ŝ_k} ⊆ S^d_q satisfies cost(Ĉ; X) ≤ a · cost(C*; X) for some a < (1 − 802φ²)/(800φ²). Then the following holds:

    d_W(Ĉ, C*) ≤ 600√2 · φ²√k / ((1 − 150φ²)(ψ − η)).    (7)

The following theorem is then a simple corollary, with a complete proof in Appendix B.

¹ d_M(·,·) satisfies d_M(x, y) ≥ 0, d_M(x, x) = 0 and d_M(x, y) ≤ d_M(x, z) + d_M(y, z) for all x, y, z.
² Here ϵ is an approximation constant and is not related to the privacy parameter ε.

Algorithm 2 Threshold-based subspace clustering (TSC), a simplified version
1: Input: X = {x_i}_{i=1}^n ⊆ R^d, number of clusters k and number of neighbors s.
2: Thresholding: construct G ∈ {0,1}^{n×n} by connecting x_i to the s other data points in X with the largest absolute inner products |⟨x_i, x'⟩|. Complete G so that it is undirected.
3: Clustering: Let X^(1), ··· , X^(ℓ) be the connected components in G. Construct X̄^(ℓ) by sampling q points from X^(ℓ) uniformly at random without replacement.
4: Output: subspaces Ĉ = {Ŝ^(ℓ)}_{ℓ=1}^k; Ŝ^(ℓ) is the subspace spanned by q arbitrary points in X̄^(ℓ).

Theorem 3.4. Fix a (φ, η, ψ)-well separated dataset X with n data points, φ² < 1/1602 and ψ > η. Suppose X_S ⊆ X is a subset of X of size m, sampled uniformly at random without replacement. Let Ĉ = {Ŝ_1, ··· , Ŝ_k} be a (1 + ϵ)-approximation of optimal k-means subspace clustering computed on X_S.
If m = Ω( kqd·log(qd/(γ'Δ²_k(X))) / (γ'²Δ⁴_k(X)) ) with γ' < (1 − 802φ²)/(800φ²) − 2(1 + ϵ), then we have:

    Pr_{X_S}[ d_W(Ĉ, C*) ≤ 600√2 · φ²√k / ((1 − 150φ²)(ψ − η)) ] ≥ 3/4,    (8)

where C* = {S*_1, ··· , S*_k} is the optimal clustering on X; that is, cost(C*; X) = Δ²_k(X).

Consequently, applying Theorem 3.4 together with the sample-aggregate framework, we obtain a weak polynomial-time ε-differentially private algorithm for agnostic k-means subspace clustering, with the amount of additional per-coordinate Gaussian noise upper bounded by O(φ²√k/(ε(ψ − η))). Our bound is comparable to the one obtained in [22] for private k-means clustering, except for the (ψ − η) term, which characterizes the well-separatedness under the subspace clustering scenario.

3.3 The stochastic setting

We further consider the case when data points are stochastically generated from some underlying "true" subspace set C* = {S*_1, ··· , S*_k}. Such settings were extensively investigated in the previous development of subspace clustering algorithms [24, 25, 14]. Below we give a precise definition of the considered stochastic subspace clustering model:

The stochastic model. For every cluster ℓ associated with subspace S*_ℓ, a data point x_i^(ℓ) ∈ R^d belonging to cluster ℓ can be written as x_i^(ℓ) = y_i^(ℓ) + ε_i^(ℓ), where y_i^(ℓ) is sampled uniformly at random from {y ∈ S*_ℓ : ||y||_2 = 1} and ε_i ∼ N(0, σ²/d · I_d) for some noise parameter σ.

Under the stochastic setting we consider the solver f to be the Threshold-based Subspace Clustering (TSC, [14]) algorithm.
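TSC is built on a thresholded similarity graph (Alg. 2). A minimal sketch of that graph construction and its connected-component clustering (our own illustrative code, assuming unit-norm columns; the final subspace-fitting step of Alg. 2 is omitted):

```python
import numpy as np

def tsc_components(X, s):
    """Thresholding step of TSC: connect each point to the s points with the
    largest |<x_i, x_j>|, symmetrize, and return component labels.
    X is d x n with (approximately) unit-norm columns."""
    n = X.shape[1]
    C = np.abs(X.T @ X)                       # pairwise |inner products|
    np.fill_diagonal(C, -np.inf)              # no self-edges
    G = np.zeros((n, n), dtype=bool)
    for i in range(n):
        G[i, np.argsort(C[i])[-s:]] = True    # s nearest neighbors of x_i
    G = G | G.T                               # complete G so it is undirected
    # connected components via depth-first search
    labels, cur = -np.ones(n, dtype=int), 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        stack, labels[i] = [i], cur
        while stack:
            j = stack.pop()
            for t in np.flatnonzero(G[j] & (labels < 0)):
                labels[t] = cur
                stack.append(t)
        cur += 1
    return labels
```

This also makes the O(n²d) running time mentioned below visible: the dominant cost is forming the n × n inner-product matrix.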
A simplified version of TSC is presented in Alg. 2. An alternative idea would be to apply the results of the previous section, since the stochastic model implies a well-separated dataset when the noise level σ is small. However, the running time of TSC is O(n²d), which is much more efficient than core-set based methods. TSC is provably correct in that the similarity graph G has no false connections and is connected per cluster, as shown in the following lemma:

Lemma 3.5 (Connectivity of TSC). Fix γ > 1 and assume max_ℓ 0.04 n_ℓ ≤ s ≤ min_ℓ n_ℓ/6. Suppose that for every ℓ ∈ {1, ··· , k} the number of data points n_ℓ and the noise level σ satisfy two explicit separation inequalities, whose precise form we defer to Appendix C together with the complete proofs; roughly, they require n_ℓ to be sufficiently large and σ to be sufficiently small relative to the signal level min_{ℓ≠ℓ'} d(S*_ℓ, S*_{ℓ'}). Then with probability at least 1 − n²e^{−√d} − Σ_ℓ n_ℓ^{1−γ}/(γ log n_ℓ) − 12/n − Σ_ℓ e^{−n_ℓ/400} − n·Σ_ℓ n_ℓ e^{−c(n_ℓ−1)}, the connected components in G correspond exactly to the k subspaces.

The conditions in Lemma 3.5 characterize the interaction between the sample complexity n_ℓ, the noise level σ and the "signal" level min_{ℓ≠ℓ'}
d(S*_ℓ, S*_{ℓ'}). Theorem 3.6 is then a simple corollary of Lemma 3.5. Complete proofs are deferred to Appendix C.

Theorem 3.6 (Stability of TSC on stochastic data). Assume the conditions in Lemma 3.5 hold with respect to n' = n/m for ω(log² n) ≤ m ≤ o(n). Assume in addition that lim_{n→∞} n_ℓ = ∞ for all ℓ = 1, ··· , k and that the failure probability does not exceed 1/8. Then for every ϵ > 0 we have

    lim_{n→∞} Pr_{X_S}[ d_W(Ĉ, C*) > ϵ ] = 0.    (9)

Compared to Theorem 3.4 for the agnostic model, Theorem 3.6 shows that one can achieve consistent estimation of the underlying subspaces under a stochastic model. It is an interesting question to derive finite-sample bounds for the differentially private TSC algorithm.

3.4 Discussion

It is worth noting that the sample-aggregate framework is an (ε, δ)-differentially private mechanism for any computational subroutine f. However, the utility claim (i.e., the O(r/ε) bound on each coordinate of A(X) − c) requires the stability of the particular subroutine f, as outlined in Eq. (5). It is unfortunately hard to theoretically argue for the stability of state-of-the-art subspace clustering methods such as Sparse Subspace Clustering (SSC, [11]), due to the "graph connectivity" issue [21].³ Nevertheless, we observe satisfactory performance of SSC based algorithms in simulations (see Sec. 5). It remains an open question to derive a utility guarantee for (user-level) differentially private SSC.

4 Private subspace clustering via the exponential mechanism

In Section 3 we analyzed two algorithms with provable privacy and utility guarantees for subspace clustering based on the sample-aggregate framework. However, empirical evidence shows that sample-aggregate based private clustering suffers from poor utility in practice [26].
In this section, we propose a practical private subspace clustering algorithm based on the exponential mechanism [18]. In particular, given a dataset X with n data points, we propose to sample parameters θ = ({S_ℓ}_{ℓ=1}^k, {z_i}_{i=1}^n), where S_ℓ ∈ S^d_q and z_i ∈ {1, ··· , k}, from the following distribution:

    p(θ; X) ∝ exp( −(ε/2) · Σ_{i=1}^n d²(x_i, S_{z_i}) ),    (10)

where ε > 0 is the privacy parameter. The following proposition shows that exact sampling from the distribution in Eq. (10) results in a provably differentially private algorithm. Its proof is trivial and is deferred to Appendix D.1. Note that unlike sample-aggregate based methods, the exponential mechanism can privately release the clustering assignment z. This does not violate the lower bound in [29] because the released clustering assignment z is not guaranteed to be exactly correct.

Proposition 4.1. The randomized algorithm A : X ↦ θ that outputs one sample from the distribution defined in Eq. (10) is ε-differentially private.

4.1 A Gibbs sampling implementation

It is hard in general to sample parameters from a distribution as complicated as that in Eq. (10). We present a Gibbs sampler that iteratively samples the subspaces {S_ℓ} and the cluster assignments {z_i} from their conditional distributions.

Update of z_i: When {S_ℓ} and z_{−i} are fixed, the conditional distribution of z_i is

    p(z_i | {S_ℓ}_{ℓ=1}^k, z_{−i}; X) ∝ exp(−ε/2 · d²(x_i, S_{z_i})).    (11)

Since d(x_i, S_{z_i}) can be efficiently computed (given an orthonormal basis of S_{z_i}), the update of z_i can easily be done by sampling z_i from a categorical distribution.

Update of S_ℓ: Let X̃^(ℓ) = {x_i ∈ X : z_i = ℓ} denote the data points assigned to cluster ℓ, and let ñ_ℓ = |X̃^(ℓ)|.
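Before turning to the S_ℓ update, note that the z_i update in Eq. (11) is simply a categorical draw; an illustrative numpy sketch (function name and log-space stabilization are ours):

```python
import numpy as np

def sample_assignment(x, bases, eps, rng):
    """Gibbs update of z_i (Eq. (11)): draw a cluster index from a
    categorical distribution with weights exp(-eps/2 * d^2(x_i, S_l))."""
    d2 = np.array([np.linalg.norm(x - U @ (U.T @ x)) ** 2 for U in bases])
    logw = -0.5 * eps * d2
    w = np.exp(logw - logw.max())      # stabilized unnormalized weights
    return int(rng.choice(len(bases), p=w / w.sum()))
```

For large ε the weights concentrate on the nearest subspace, recovering the hard assignment step of the k-plane algorithm discussed in Sec. 4.2.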
Denote by X̃^(ℓ) ∈ R^{d×ñ_ℓ} the matrix whose columns are the data points in X̃^(ℓ). The distribution over S_ℓ conditioned on z can then be written as

    p(S_ℓ = range(U_ℓ) | z; X) ∝ exp( ε/2 · tr(U_ℓ^T A_ℓ U_ℓ) );  U_ℓ ∈ R^{d×q}, U_ℓ^T U_ℓ = I_{q×q},    (12)

where A_ℓ = X̃^(ℓ) X̃^(ℓ)T is the unnormalized sample covariance matrix. The distribution in Eq. (12) is a special case of the matrix Bingham distribution, which admits a Gibbs sampler [16]. We give implementation details in Appendix D.2, with modifications so that the resulting Gibbs sampler is empirically more efficient for a wide range of parameter settings.

³ Recently [28] established a full clustering guarantee for SSC, albeit under strong assumptions.

4.2 Discussion

The proposed Gibbs sampler resembles the k-plane algorithm for subspace clustering [3]. It is in fact a "probabilistic" version of k-plane, since sampling is performed at each iteration rather than deterministic updates. Furthermore, the proposed Gibbs sampler can be viewed as posterior sampling for the following generative model: first sample U_ℓ uniformly at random from S^d_q for each subspace S_ℓ; afterwards, cluster assignments {z_i}_{i=1}^n are sampled such that Pr[z_i = j] = 1/k, and x_i is set as x_i = U_ℓ y_i + P_{U_ℓ^⊥} w_i, where y_i is sampled uniformly at random from the q-dimensional unit ball and w_i ∼ N(0, I_d/ε). The connection between the above generative model and the Gibbs sampler is formally justified in Appendix D.3. The generative model is strikingly similar to the well-known mixtures of probabilistic PCA (MPPCA, [27]) model with the variance parameters σ_ℓ in MPPCA set to √(1/ε). The only difference is that the y_i are sampled uniformly at random from a unit ball⁴ and the noise w_i is constrained to U_ℓ^⊥, the complement space of U_ℓ. Note that this is closely related to the earlier observation that "posterior sampling is private" [20, 6, 31], but different in that we construct a model from a private procedure rather than the other way round.

As the privacy parameter ε → ∞ (i.e., no privacy guarantee), we arrive immediately at the exact k-plane algorithm, and the posterior distribution concentrates around the optimal k-means solution (C*, z*). This behavior is similar to what a small-variance asymptotic analysis of MPPCA models reveals [30]. On the other hand, the proposed Gibbs sampler is significantly different from previous Bayesian probabilistic PCA formulations [34, 30] in that the subspaces are sampled from a matrix Bingham distribution. Finally, we remark that the proposed Gibbs sampler is only asymptotically private, because Proposition 4.1 requires exact (or nearly exact [31]) sampling from Eq. (10).

5 Numerical results

We provide numerical results for both the sample-aggregate and Gibbs sampling algorithms on synthetic and real-world datasets. We also compare with a baseline method implemented on top of the k-plane algorithm [3] with a perturbed sample covariance matrix via the SuLQ framework [2] (details presented in Appendix E). Three solvers are considered for the sample-aggregate framework: threshold-based subspace clustering (TSC, [14]), which has a provable utility guarantee with sample-aggregation on stochastic models, along with sparse subspace clustering (SSC, [11]) and low-rank representation (LRR, [17]), the two state-of-the-art methods for subspace clustering. For Gibbs sampling, we use non-private SSC and LRR solutions as initialization for the Gibbs sampler.
All methods are implemented using Matlab.

For the synthetic datasets, we first generate k random q-dimensional linear subspaces. Each subspace is generated by first sampling a d × q random Gaussian matrix and then recording its column space. The n data points are then assigned to one of the k subspaces (clusters) uniformly at random. To generate a data point x_i assigned to subspace S_ℓ, we first sample y_i ∈ R^q with ||y_i||_2 = 1 uniformly at random from the q-dimensional unit sphere. Afterwards, x_i is set as x_i = U_ℓ y_i + w_i, where U_ℓ ∈ R^{d×q} is an orthonormal basis associated with S_ℓ and w_i ∼ N(0, σ²I_d) is a noise vector.

Figure 2 compares the utility (measured in terms of the k-means objective cost(Ĉ; X) and the Wasserstein distance d_W(Ĉ, C*)) of sample aggregation, Gibbs sampling and SuLQ subspace clustering. As shown in the plots, sample-aggregation algorithms have poor utility unless the privacy parameter ε is truly large (which means very little privacy protection). On the other hand, both Gibbs sampling and SuLQ subspace clustering give reasonably good performance. Figure 2 also shows that SuLQ scales poorly with the ambient dimension d. This is because SuLQ subspace clustering requires calibrating noise to a d × d sample covariance matrix, which induces a large error when d is large. Gibbs sampling appears robust across the various d settings.

We also experiment on real-world datasets. The right two plots in Figure 2 report utility on a subset of the extended Yale Face Dataset B [13] for face clustering. 5 random individuals are picked, forming a subset of the original dataset with n = 320 data points (images). The dataset is preprocessed by projecting each individual onto a 9D affine subspace via PCA. Such a preprocessing step was adopted in [32, 29] and was theoretically justified in [1].
Afterwards, the ambient dimension of the entire dataset is reduced to d = 50 by random Gaussian projection. The plots show that Gibbs sampling significantly outperforms the other algorithms.

⁴In MPPCA, latent variables yi are sampled from a normal distribution N(0, ρ²Iq).

Figure 2: Utility under fixed privacy budget ε. The top row shows the k-means cost and the bottom row shows the Wasserstein distance dW(Ĉ, C∗). From left to right: synthetic dataset with n = 5000, d = 5, k = 3, q = 3, σ = 0.01; synthetic dataset with n = 1000, d = 10, k = 3, q = 3, σ = 0.1; extended Yale Face Dataset B (a subset) with n = 320, d = 50, k = 5, q = 9, σ = 0.01. δ is set to 1/(n ln n) for (ε, δ)-private algorithms. "s.a." stands for smooth sensitivity and "exp." for the exponential mechanism; "SuLQ-10" and "SuLQ-50" stand for the SuLQ framework performing 10 and 50 iterations. Gibbs sampling is run for 10000 iterations and the mean of the last 100 samples is reported.

Figure 3: Test statistic, k-means cost and dW(Ĉ, C∗) of 8 trials of the Gibbs sampler under different privacy settings. Synthetic dataset setting: n = 1000, d = 10, k = 3, q = 3, σ = 0.1.

In Figure 3 we investigate the mixing behavior of the proposed Gibbs sampler. For multiple trials of Gibbs sampling we plot the k-means objective, the Wasserstein distance, and the test statistic 1/√(kq) · ( Σ_{ℓ=1}^{k} ‖ (1/T) · Σ_{t=1}^{T} U_ℓ^{(t)} ‖_F² )^{1/2}, where U_ℓ^{(t)} is a basis sample of S_ℓ at the t-th iteration. The test statistic has mean zero under the distribution in Eq. (10), and a similar statistic was used in [4] as a diagnostic of the mixing behavior of another Gibbs sampler.
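This test statistic is straightforward to compute from stored basis samples. The following Python sketch is ours (it assumes the samples are kept as a list of per-iteration lists of d × q arrays; all names are hypothetical). The statistic is driven to zero by the sign symmetry of the posterior: Uℓ and −Uℓ span the same subspace and are equally likely, so a well-mixed chain averages its basis samples toward zero, while a chain stuck at a single sample yields exactly 1.

```python
import numpy as np

def mixing_statistic(samples):
    """samples[t][l] is the d x q basis sample U_l^(t) at Gibbs iteration t.
    Returns 1/sqrt(kq) * (sum_l || (1/T) sum_t U_l^(t) ||_F^2)^(1/2)."""
    T = len(samples)
    k = len(samples[0])
    q = samples[0][0].shape[1]
    total = 0.0
    for l in range(k):
        mean_U = sum(samples[t][l] for t in range(T)) / T  # averaged basis
        total += np.linalg.norm(mean_U, "fro") ** 2
    return np.sqrt(total) / np.sqrt(k * q)

# Toy check with random sign-flipped bases (a stand-in for real Gibbs samples):
# the averaged bases shrink toward zero, so the statistic is far below 1.
rng = np.random.default_rng(0)
U = np.linalg.qr(rng.normal(size=(10, 3)))[0]
samples = [[rng.choice([-1, 1]) * U for _ in range(2)] for _ in range(1000)]
print(mixing_statistic(samples))  # small; a stuck chain would give exactly 1
```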
Figure 3 shows that under various privacy parameter settings, the proposed Gibbs sampler mixes quite well after 10000 iterations.

6 Conclusion
In this paper we consider subspace clustering subject to formal differential privacy constraints. We analyzed two sample-aggregate based algorithms with provable utility guarantees under agnostic and stochastic data models. We also proposed a Gibbs sampling subspace clustering algorithm based on the exponential mechanism that works well in practice. Interesting future directions include establishing utility bounds for state-of-the-art subspace clustering algorithms such as SSC or LRR.

Acknowledgement This research is supported in part by grant NSF CAREER IIS-1252412, NSF Award BCS-0941518, and a grant by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative administered by the IDM Programme Office.
References
[1] R. Basri and D. Jacobs. Lambertian reflectance and linear subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(2):218-233, 2003.
[2] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: the SuLQ framework. In PODS, 2005.
[3] P. S. Bradley and O. L. Mangasarian. k-plane clustering. Journal of Global Optimization, 16(1), 2000.
[4] K. Chaudhuri, A. Sarwate, and K. Sinha. Near-optimal algorithms for differentially private principal components. In NIPS, 2012.
[5] Y. Chen, A. Jalali, S. Sanghavi, and H. Xu. Clustering partially observed graphs via convex optimization. The Journal of Machine Learning Research, 15(1):2213-2238, 2014.
[6] C. Dimitrakakis, B. Nelson, A. Mitrokotsa, and B. I. Rubinstein. Robust and private Bayesian inference. In Algorithmic Learning Theory, pages 291-305. Springer, 2014.
[7] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, 2006.
[8] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.
[9] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.
[10] C. Dwork, K. Talwar, A. Thakurta, and L. Zhang. Analyze Gauss: Optimal bounds for privacy-preserving principal component analysis. In STOC, 2014.
[11] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765-2781, 2013.
[12] D. Feldman, M. Schmidt, and C. Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In SODA, 2013.
[13] A. Georghiades, P. Belhumeur, and D. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):643-660, 2001.
[14] R. Heckel and H. Bölcskei. Robust subspace clustering via thresholding. arXiv:1307.4891, 2013.
[15] J. Ho, M.-H. Yang, J. Lim, K.-C. Lee, and D. Kriegman. Clustering appearances of objects under varying illumination conditions. In CVPR, 2003.
[16] P. Hoff. Simulation of the matrix Bingham-von Mises-Fisher distribution, with applications to multivariate and relational data. Journal of Computational and Graphical Statistics, 18(2):438-456, 2009.
[17] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Ma, and Y. Yu. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):171-184, 2012.
[18] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, 2007.
[19] B. McWilliams and G. Montana. Subspace clustering of high-dimensional data: a predictive approach. Data Mining and Knowledge Discovery, 28(3):736-772, 2014.
[20] D. J. Mir. Differential privacy: an exploration of the privacy-utility landscape. PhD thesis, Rutgers University, 2013.
[21] B. Nasihatkon and R. Hartley. Graph connectivity in sparse subspace clustering. In CVPR, 2011.
[22] K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In STOC, 2007.
[23] R. Ostrovsky, Y. Rabani, L. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In FOCS, 2006.
[24] M. Soltanolkotabi, E. J. Candès, et al. A geometric analysis of subspace clustering with outliers. The Annals of Statistics, 40(4):2195-2238, 2012.
[25] M. Soltanolkotabi, E. Elhamifar, and E. Candès. Robust subspace clustering. The Annals of Statistics, 42(2):669-699, 2014.
[26] D. Su, J. Cao, N. Li, E. Bertino, and H. Jin. Differentially private k-means clustering. arXiv, 2015.
[27] M. Tipping and C. Bishop. Mixtures of probabilistic principal component analysers. Neural Computation, 11(2):443-482, 1999.
[28] Y. Wang, Y.-X. Wang, and A. Singh. Clustering consistent sparse subspace clustering. arXiv, 2015.
[29] Y. Wang, Y.-X. Wang, and A. Singh. A deterministic analysis of noisy sparse subspace clustering for dimensionality-reduced data. In ICML, 2015.
[30] Y. Wang and J. Zhu. DP-space: Bayesian nonparametric subspace clustering with small-variance asymptotic analysis. In ICML, 2015.
[31] Y.-X. Wang, S. Fienberg, and A. Smola. Privacy for free: Posterior sampling and stochastic gradient Monte Carlo. In ICML, 2015.
[32] Y.-X. Wang and H. Xu. Noisy sparse subspace clustering. In ICML, pages 89-97, 2013.
[33] A. Zhang, N. Fawaz, S. Ioannidis, and A. Montanari. Guess who rated this movie: Identifying users through subspace clustering. arXiv, 2012.
[34] Z. Zhang, K. L. Chan, J. Kwok, and D.-Y. Yeung. Bayesian inference on principal component analysis using reversible jump Markov chain Monte Carlo. In AAAI, 2004.