{"title": "Dimensionality Reduction with Subspace Structure Preservation", "book": "Advances in Neural Information Processing Systems", "page_first": 712, "page_last": 720, "abstract": "Modeling data as being sampled from a union of independent subspaces has been widely applied to a number of real world applications. However, dimensionality reduction approaches that theoretically preserve this independence assumption have not been well studied. Our key contribution is to show that $2K$ projection vectors are sufficient for the independence preservation of any $K$ class data sampled from a union of independent subspaces. It is this non-trivial observation that we use for designing our dimensionality reduction technique. In this paper, we propose a novel dimensionality reduction algorithm that theoretically preserves this structure for a given dataset. We support our theoretical analysis with empirical results on both synthetic and real world data achieving \\textit{state-of-the-art} results compared to popular dimensionality reduction techniques.", "full_text": "Dimensionality Reduction with Subspace Structure\n\nPreservation\n\nDevansh Arpit\n\nIfeoma Nwogu\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nSUNY Buffalo\n\nBuffalo, NY 14260\n\ndevansha@buffalo.edu\n\nSUNY Buffalo\n\nBuffalo, NY 14260\n\ninwogu@buffalo.edu\n\nVenu Govindaraju\n\nDepartment of Computer Science\n\nSUNY Buffalo\n\nBuffalo, NY 14260\n\ngovind@buffalo.edu\n\nAbstract\n\nModeling data as being sampled from a union of independent subspaces has been\nwidely applied to a number of real world applications. However, dimensional-\nity reduction approaches that theoretically preserve this independence assumption\nhave not been well studied. Our key contribution is to show that 2K projection\nvectors are suf\ufb01cient for the independence preservation of any K class data sam-\npled from a union of independent subspaces. 
It is this non-trivial observation that we use for designing our dimensionality reduction technique. In summary, we propose a novel dimensionality reduction algorithm that theoretically preserves this structure for a given dataset. We support our theoretical analysis with empirical results on both synthetic and real world data, achieving state-of-the-art results compared to popular dimensionality reduction techniques.

1 Introduction

A number of real world applications model data as being sampled from a union of independent subspaces. These applications include image representation and compression [6], systems theory [12], image segmentation [15], motion segmentation [13], face clustering [7, 5] and texture segmentation [8], to name a few. Dimensionality reduction is generally used prior to applying these methods because most of these algorithms optimize expensive loss functions like the nuclear norm, ℓ1 regularization, etc. Most of these applications simply apply off-the-shelf dimensionality reduction techniques or resize images (in the case of image data) as a pre-processing step.

The union of independent subspaces model can be thought of as a generalization of the traditional approach of representing a given set of data points using a single low dimensional subspace (e.g. Principal Component Analysis). For the application of algorithms that model the data at hand with this independence assumption, the subspace structure of the data needs to be preserved after dimensionality reduction.
Although a number of existing dimensionality reduction techniques [10, 3, 1, 4] try to preserve the spatial geometry of any given data, no prior work, to the best of our knowledge, has tried to explicitly preserve the independence between subspaces.

This work was partially funded by the National Science Foundation under grant number CNS-1314803.

In this paper, we propose a novel dimensionality reduction technique that preserves independence between multiple subspaces. In order to achieve this, we first show that for any two disjoint subspaces with arbitrary dimensionality, there exists a two dimensional subspace such that both subspaces collapse to form two lines. We then extend this non-trivial idea to the multi-class case and show that 2K projection vectors are sufficient for preserving the subspace structure of a K class dataset. Further, we design an efficient algorithm that finds projection vectors with the aforementioned properties while being able to handle corrupted data at the same time.

2 Preliminaries

Let S1, S2, ..., SK be K subspaces in R^n. We say that these K subspaces are independent if there does not exist any non-zero vector in Si which is a linear combination of vectors in the other K − 1 subspaces. Let the columns of the matrix Bi ∈ R^{n×d} denote the support of the ith subspace of d dimensions. Then any vector in this subspace can be represented as x = Bi w, w ∈ R^d. We now define the notion of margin between two subspaces.

Definition 1 (Subspace Margin) Subspaces Si and Sj are separated by margin γij if

γij = max_{u ∈ Si, v ∈ Sj} ⟨u, v⟩ / (‖u‖2 ‖v‖2)    (1)

Thus the margin between any two subspaces is defined as the maximum dot product between two unit vectors (u, v), one from either subspace.
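Given orthonormal bases of the two subspaces, this margin is simply the largest singular value of Bi^T Bj. A minimal NumPy sketch of Definition 1 (the ambient and subspace dimensions here are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def subspace_margin(B1, B2):
    """Margin of Definition 1: the largest cosine between unit vectors drawn
    from the two subspaces. For orthonormal bases B1, B2 this is the top
    singular value of B1^T B2."""
    return np.linalg.svd(B1.T @ B2, compute_uv=False)[0]

rng = np.random.default_rng(0)

# Orthonormal bases of a 2-dim and a 3-dim subspace of R^6 (random, hence
# disjoint with probability 1).
B1 = np.linalg.qr(rng.standard_normal((6, 2)))[0]
B2 = np.linalg.qr(rng.standard_normal((6, 3)))[0]

gamma = subspace_margin(B1, B2)   # strictly below 1 for disjoint subspaces
ortho = subspace_margin(np.eye(6)[:, :2], np.eye(6)[:, 2:5])  # orthogonal subspaces: margin 0
```

Disjoint subspaces give a margin strictly below 1, orthogonal subspaces give 0, and identical subspaces give exactly 1.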
Such a vector pair (u, v) is known as the principal vector pair between the two subspaces, while the angle between these vectors is called the principal angle.

With these definitions of independent subspaces and margin, assume that we are given a dataset which has been sampled from a union of independent linear subspaces. Specifically, each class in this dataset lies along one such independent subspace. Then our goal is to reduce the dimensionality of this dataset such that after projection, each class continues to lie along a linear subspace and each such subspace is independent of all others. Formally, let X = [X1, X2, ..., XK] be a K class dataset in R^n such that vectors from class i (x ∈ Xi) lie along subspace Si. Then our goal is to find a projection matrix P ∈ R^{n×m} such that the projected data vectors X̄i := {P^T x : x ∈ Xi} (i ∈ {1 ... K}) each belong to a linear subspace (S̄i in R^m), and each subspace S̄i (i ∈ {1 ... K}) is independent of all others.

3 Proposed Approach

In this section, we propose a novel subspace learning approach applicable to labeled datasets that theoretically guarantees independent subspace structure preservation. The number of projection vectors required by our approach is not only independent of the size of the dataset but is also fixed, depending only on the number of classes. Specifically, we show that for any K class labeled dataset with independent subspace structure, only 2K projection vectors are required for structure preservation.

The entire idea of being able to find a fixed number of projection vectors for the structure preservation of a K class dataset is motivated by theorem 2. This theorem states a useful property of any pair of disjoint subspaces.

Theorem 2 Let unit vectors v1 and v2 be the ith principal vector pair for any two disjoint subspaces S1 and S2 in R^n. Let the columns of the matrix P ∈ R^{n×2} be any two orthonormal vectors in the span of v1 and v2. Then for all vectors x ∈ Sj, P^T x = α tj (j ∈ {1, 2}), where α ∈ R depends on x and tj ∈ R^2 is a fixed vector independent of x. Further,

t1^T t2 / (‖t1‖2 ‖t2‖2) = v1^T v2

Proof: We use the notation (M)j to denote the jth column vector of matrix M for any arbitrary matrix M. We claim that tj = P^T vj (j ∈ {1, 2}). Also, without any loss of generality, assume that (P)1 = v1. Then in order to prove theorem 2, it suffices to show that ∀x ∈ S1, (P)2^T x = 0. By symmetry, ∀x ∈ S2, P^T x will also lie along a line in the subspace spanned by the columns of P.

Figure 1: A three dimensional example of the application of theorem 2: (a) independent subspaces in 3 dimensions; (b) subspaces after projection. See text in section 3 for details.

Let the columns of B1 ∈ R^{n×d1} and B2 ∈ R^{n×d2} be the support of S1 and S2 respectively, where d1 and d2 are the dimensionality of the two subspaces. Then we can represent v1 and v2 as v1 = B1 w1 and v2 = B2 w2 for some w1 ∈ R^{d1} and w2 ∈ R^{d2}. Let B1 w be any arbitrary vector in S1, where w ∈ R^{d1}. Then we need to show that T := (B1 w)^T (P)2 = 0 for all w. Notice that

T = (B1 w)^T (B2 w2 − (w1^T B1^T B2 w2) B1 w1)
  = w^T (B1^T B2 w2 − (w1^T B1^T B2 w2) w1)  ∀w    (2)

Let U S V^T be the SVD of B1^T B2. Then w1 and w2 are the ith columns of U and V respectively, and v1^T v2 is the ith diagonal element of S if v1 and v2 are the ith principal vector pair of S1 and S2.
Thus,

T = w^T (U S V^T w2 − Sii (U)i)
  = w^T (Sii (U)i − Sii (U)i) = 0  □    (3)

Geometrically, this theorem says that after projection onto the plane (P) defined by any one of the principal vector pairs between subspaces S1 and S2, both subspaces collapse in their entirety to just two lines, such that points from S1 lie along one line while points from S2 lie along the second line. Further, the angle that separates these lines is equal to the angle between the ith principal vector pair between S1 and S2 if the span of the ith principal vector pair is used as P.

We apply theorem 2 to a three dimensional example as shown in figure 1. In figure 1 (a), the first subspace (the y-z plane) is denoted by red while the second subspace is the black line in the x-y plane. Notice that for this setting, the x-y plane (denoted by blue) is in the span of the 1st (and only) principal vector pair between the two subspaces. After projecting both subspaces in their entirety onto the x-y plane, we get two lines (figure 1 (b)) as stated in the theorem.

Finally, we now show that for any K class dataset with independent subspace structure, 2K projection vectors are sufficient for structure preservation.

Theorem 3 Let X = {xi}^N_{i=1} be a K class dataset in R^n with independent subspace structure. Let P = [P1 ... PK] ∈ R^{n×2K} be a projection matrix for X such that the columns of the matrix Pk ∈ R^{n×2} consist of orthonormal vectors in the span of any principal vector pair between subspaces Sk and ∑_{j≠k} Sj. Then the independent subspace structure of the dataset X is preserved after projection onto the 2K vectors in P.

Before stating the proof of this theorem, we first state lemma 4, which we will use later in our proof. This lemma states that if two vectors are separated by a non-zero angle, then after augmenting these vectors with arbitrary vectors, the new vectors remain separated by some non-zero angle as well. This straightforward idea will help us extend the two subspace case in theorem 2 to multiple subspaces.

Lemma 4 Let x1, y1 be any two fixed vectors of the same dimensionality such that x1^T y1 / (‖x1‖2 ‖y1‖2) = γ, where γ ∈ [0, 1). Let x2, y2 be any two arbitrary vectors of the same finite dimensionality. Then there exists a constant γ̄ ∈ [0, 1) such that the vectors x′ = [x1; x2] and y′ = [y1; y2] are also separated: x′^T y′ / (‖x′‖2 ‖y′‖2) ≤ γ̄.

Algorithm 1 Computation of projection matrix P
INPUT: X, K, λ, itermax
for k = 1 to K do
    w2* ← random vector in R^{N̄k}
    while iter < itermax and γ not converged do
        w1* ← arg min_{w1} ‖Xk w1 − X̄k w2* / ‖X̄k w2*‖2‖^2 + λ‖w1‖^2
        w1* ← w1* / norm(w1*)
        w2* ← arg min_{w2} ‖Xk w1* / ‖Xk w1*‖2 − X̄k w2‖^2 + λ‖w2‖^2
        w2* ← w2* / norm(w2*)
        γ ← (Xk w1*)^T (X̄k w2*)
    end while
    Pk ← [Xk w1*, X̄k w2*]
end for
P* ← [P1 ... PK]
OUTPUT: P*

Proof of theorem 3: It suffices to show that data vectors from subspaces Sk and ∑_{j≠k} Sj (for any k ∈ {1 ... K}) are separated by margin less than 1 after projection using P. Let x and y be any vectors in Sk and ∑_{j≠k} Sj respectively, and let the columns of the matrix Pk be in the span of the ith (say) principal vector pair between these subspaces. Using theorem 2, the projected vectors Pk^T x and Pk^T y are separated by an angle equal to the angle between the ith principal vector pair between Sk and ∑_{j≠k} Sj. Let the cosine of this angle be γ. Then, using lemma 4, the vectors P^T x and P^T y, formed by adding dimensions to Pk^T x and Pk^T y, are also separated by some margin γ̄ < 1. As the same argument holds for vectors from all classes, the independent subspace structure of the dataset remains preserved after projection. □

For any two disjoint subspaces, theorem 2 tells us that there is a two dimensional plane in which both projected subspaces form two lines in their entirety. It can be argued that after adding arbitrary valued finite dimensions to the basis of this plane, the two projected subspaces will still remain disjoint (see proof of theorem 3). Theorem 3 simply applies this argument to each subspace and the sum of the remaining subspaces, one at a time. Thus for K subspaces, we get 2K projection vectors.

Finally, our approach projects data to 2K dimensions, which could be a concern if the original feature dimension itself is less than 2K. However, since we are only concerned with data that satisfies the independent subspace assumption, notice that the feature dimension must be at least K. This is because each class must lie along at least 1 dimension which is linearly independent of the others.
However, this is too strict an assumption, and it is straightforward to see that if we relax it to 2 dimensions for each class, the feature dimension is already at 2K.

3.1 Implementation

A naive approach to finding projection vectors (say for a binary class case) would be to compute the SVD of the matrix X1^T X2, where the columns of X1 and X2 contain vectors from class 1 and class 2 respectively. For large datasets this would not only be computationally expensive but also be incapable of handling noise. Thus, even though theorem 3 guarantees the structure preservation of the dataset X after projection using P as specified, this does not solve the problem of dimensionality reduction. The reason is that given a labeled dataset sampled from a union of independent subspaces, we do not have any information about the basis or even the dimensionality of the underlying subspaces. Under these circumstances, constructing the projection matrix P as specified in theorem 3 itself becomes a problem. To solve this problem, we propose an algorithm that tries to find the underlying principal vector pair between subspaces Sk and ∑_{j≠k} Sj (for k = 1 to K) given the labeled dataset X. The assumption behind this attempt is that samples from each subspace (class) are not heavily corrupted and that the underlying subspaces are independent.

Notice that we are not specifically interested in a particular principal vector pair between any two subspaces for the computation of the projection matrix. This is because we have assumed independent subspaces, and so each principal vector pair is separated by some margin γ < 1. Hence we need an algorithm that computes any arbitrary principal vector pair, given data from two independent subspaces. These vectors can then be used to form one of the K submatrices in P as specified in theorem 3.
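When the subspace bases are known, the construction of theorem 3 is direct. The following NumPy sketch (an idealized, noise-free setting; the ambient and subspace dimensions are illustrative assumptions) builds the 2K-column matrix P from known bases and verifies the collapse property of theorem 2 blockwise:

```python
import numpy as np

rng = np.random.default_rng(1)
n, dims = 30, [3, 4, 5]   # K = 3 random (hence independent w.p. 1) subspaces of R^30
bases = [np.linalg.qr(rng.standard_normal((n, d)))[0] for d in dims]
K = len(bases)

blocks = []
for k in range(K):
    # Orthonormal basis of the sum of the remaining subspaces.
    rest = np.linalg.qr(np.hstack([bases[j] for j in range(K) if j != k]))[0]
    # First principal vector pair between S_k and that sum, via the SVD of B_k^T B_rest.
    U, S, Vt = np.linalg.svd(bases[k].T @ rest)
    v1, v2 = bases[k] @ U[:, 0], rest @ Vt[0, :]
    # P_k: two orthonormal vectors spanning the pair (as in theorem 3).
    blocks.append(np.linalg.qr(np.column_stack([v1, v2]))[0])

P = np.hstack(blocks)     # the n x 2K projection matrix

# Theorem 2, checked blockwise: P_k^T maps the whole of S_k onto a single
# line, so P_k^T B_k has rank 1 (second singular value numerically zero).
collapse = [np.linalg.svd(blocks[k].T @ bases[k], compute_uv=False) for k in range(K)]
```

The second singular value in each entry of `collapse` vanishes up to round-off, which is exactly the "each subspace becomes a line" statement.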
For computing the submatrix Pk, we need to find a principal vector pair between subspaces Sk and ∑_{j≠k} Sj. In terms of the dataset X, we estimate this vector pair using the data in Xk and X̄k, where X̄k := X \ {Xk}. We repeat this process for each class to finally form the entire matrix P*. Our approach is stated in algorithm 1. For each class k, the idea is to start with a random vector in the span of X̄k and find the vector in Xk closest to it. We then fix this vector and search for the closest vector in X̄k. Repeating this process until the cosine between the two vectors converges leads to a principal vector pair. To estimate the closest vector from the opposite subspace, algorithm 1 uses a quadratic program that minimizes the reconstruction error of the fixed vector (of one subspace) using vectors from the opposite subspace. The regularization in this optimization is there to handle noise in the data.

3.2 Justification

Definition 1 of the margin γ between two subspaces S1 and S2 can be equivalently expressed as

1 − γ = min_{w1, w2} (1/2) ‖B1 w1 − B2 w2‖^2  s.t. ‖B1 w1‖2 = 1, ‖B2 w2‖2 = 1    (4)

where the columns of B1 ∈ R^{n×d1} and B2 ∈ R^{n×d2} are the bases of the subspaces S1 and S2 respectively, such that B1^T B1 and B2^T B2 are both identity matrices.

Proposition 5 Let B1 ∈ R^{n×d1} and B2 ∈ R^{n×d2} be the bases of two disjoint subspaces S1 and S2. Then for any principal vector pair (ui, vi) between the subspaces S1 and S2, the corresponding coefficient pair (w1 ∈ R^{d1}, w2 ∈ R^{d2}), s.t. ui = B1 w1 and vi = B2 w2, is a local minimum of the objective in equation (4).

Proof: The Lagrangian function for the above objective is

L(w1, w2, η) = (1/2) w1^T B1^T B1 w1 + (1/2) w2^T B2^T B2 w2 − w1^T B1^T B2 w2 + η1(‖B1 w1‖^2 − 1) + η2(‖B2 w2‖^2 − 1)    (5)

Setting the gradient w.r.t. w1 to zero, we get

∇w1 L = (1 + 2η1) w1 − B1^T B2 w2 = 0    (6)

Let U S V^T be the SVD of B1^T B2, and let w1 and w2 be the ith columns of U and V respectively. Then equation (6) becomes

∇w1 L = (1 + 2η1) w1 − U S V^T w2 = (1 + 2η1) w1 − Sii w1 = 0    (7)

Thus the gradient w.r.t. w1 is zero when η1 = (1/2)(Sii − 1). Similarly, it can be shown that the gradient w.r.t. w2 is zero when η2 = (1/2)(Sii − 1). Thus the gradient of the Lagrangian L is 0 w.r.t. both w1 and w2 for every corresponding principal vector pair, and so the coefficient pair (w1, w2) corresponding to any of the principal vector pairs between subspaces S1 and S2 is a local minimum of objective (4). □

Since the (w1, w2) corresponding to any principal vector pair between two disjoint subspaces form a local minimum of the objective in equation (4), one can alternately minimize equation (4) w.r.t. w1 and w2 and reach one of these local minima. Thus, by assuming independent subspace structure for all K classes in algorithm 1 and setting λ to zero, it is straightforward to see that the algorithm yields a projection matrix that satisfies the criteria specified by theorem 3.

Finally, real world data do not in general strictly satisfy the independent subspace assumption, and even a slight corruption in the data may easily lead to the violation of this independence. In order to tackle this problem, we add a regularization term (λ > 0) while solving for the principal vector pair in algorithm 1.
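A minimal NumPy sketch of this alternation for a single class (binary case) follows. The function name `principal_pair`, the data sizes, λ, and the iteration count are illustrative assumptions; the ridge-regularized least squares steps play the role of the two quadratic programs in algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(2)

def principal_pair(Xk, Xr, lam=1e-6, iters=100):
    """One class of Algorithm 1 (sketch): alternate regularized least squares
    between span(Xk) and span(Xr), normalizing after each step, until the
    cosine between the two vectors settles at a principal-pair cosine."""
    u2 = Xr @ rng.standard_normal(Xr.shape[1])
    u2 /= np.linalg.norm(u2)
    for _ in range(iters):
        # Closest (regularized) vector in span(Xk) to the fixed unit vector u2.
        w1 = np.linalg.solve(Xk.T @ Xk + lam * np.eye(Xk.shape[1]), Xk.T @ u2)
        u1 = Xk @ w1 / np.linalg.norm(Xk @ w1)
        # ...and back: closest vector in span(Xr) to u1.
        w2 = np.linalg.solve(Xr.T @ Xr + lam * np.eye(Xr.shape[1]), Xr.T @ u1)
        u2 = Xr @ w2 / np.linalg.norm(Xr @ w2)
    return u1, u2

# Noise-free two-class data drawn from independent subspaces of R^40.
B1 = np.linalg.qr(rng.standard_normal((40, 3)))[0]
B2 = np.linalg.qr(rng.standard_normal((40, 4)))[0]
X1 = B1 @ rng.standard_normal((3, 60))
X2 = B2 @ rng.standard_normal((4, 60))

u1, u2 = principal_pair(X1, X2)
gamma_est = abs(u1 @ u2)                                    # cosine found by the alternation
gamma_true = np.linalg.svd(B1.T @ B2, compute_uv=False)[0]  # margin from the true bases
```

With λ near zero this mirrors the alternate minimization of objective (4); generically the iteration converges to the first principal vector pair, so `gamma_est` matches the margin computed from the true bases.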
If we assume that the corruption is not heavy, then reconstructing a sample using vectors belonging to another subspace requires large coefficients on those vectors. The regularization therefore discourages reconstructing data of one class using slightly corrupted vectors from another class, since such reconstructions would need the large coefficients that the penalty suppresses.

Figure 2: Qualitative comparison between (a) data projected using the true projection matrix Pa and (b) data projected using the projection matrix Pb from the proposed approach, on high dimensional synthetic two class data. See section 4.1.1 for details.

Figure 3: Four different pairs of classes from the Extended Yale dataset B projected onto two dimensional subspaces using the proposed approach. See section 4.1.1 for details.

3.3 Complexity

Solving algorithm 1 requires solving an unconstrained quadratic program within a while-loop. Assume that we run this while-loop for T iterations and that we use conjugate gradient descent to solve the quadratic program in each iteration. It is known that for any matrix A ∈ R^{a×b} and vector b ∈ R^a, conjugate gradient applied to a problem of the form

arg min_w ‖Aw − b‖^2    (8)

takes time O(ab√κ), where κ is the condition number of A^T A. Thus it is straightforward to see that the time required to compute the projection matrix for a K class problem in our case is O(KTnN√κ), where n is the dimensionality of the feature space, N is the total number of samples, and κ is the condition number of the matrix (Xk^T Xk + λI). Here I is the identity matrix.

4 Empirical Analysis

In this section, we present empirical evidence to support the theoretical analysis of our subspace learning approach. For real world data, we use the following datasets:
1.
Extended Yale dataset B [2]: consists of ∼2414 frontal face images of 38 individuals (K = 38), with 64 images per person, taken under constrained but varying illumination conditions.
2. AR dataset [9]: consists of more than 4000 frontal face images of 126 individuals, with 26 images per person, taken under varying illumination, expression and facial disguise. For our experiments, similar to [14], we use images from 100 individuals (K = 100), with 50 males and 50 females. We further use only 14 images per class, corresponding to illumination and expression changes: 7 images from Session 1 and the remaining 7 from Session 2.
3. PIE dataset [11]: the pose, illumination, and expression (PIE) database we use is a subset of the CMU PIE dataset consisting of 11,554 images of 68 people (K = 68).
We crop all images to 32×32 and concatenate the pixel intensities to form our feature vectors. Further, we normalize all data vectors to have unit ℓ2 norm.

Figure 4: Multi-class separation after projection using the proposed approach for different datasets: (a) Yale dataset B; (b) AR dataset; (c) PIE dataset.
See section 4.1.2 for details.

4.1 Qualitative Analysis

4.1.1 Two Subspaces, Two Lines

We test both the claim of theorem 2 and the quality of the approximation achieved by algorithm 1 in this section. We perform these tests on both synthetic and real data.
1. Synthetic Data: We generate two random subspaces in R^1000 of dimensionality 20 and 30 (notice that these subspaces will be independent with probability 1). We randomly sample 100 data vectors from each subspace and normalize them to unit length. We then compute the 1st principal vector pair between the two subspaces from their basis vectors by performing the SVD of B1^T B2, where B1 and B2 are the bases of the two subspaces. We orthonormalize the vector pair to form the projection matrix Pa. Next, we use the labeled dataset of 200 points thus generated to form the projection matrix Pb by applying algorithm 1. The entire dataset of 200 points is then projected onto Pa and Pb separately and plotted in figure 2, with green and red denoting points from either subspace. The results not only substantiate our claim in theorem 2 but also suggest that the proposed algorithm for estimating the projection matrix is a good approximation.
2. Real Data: Here we use the Extended Yale dataset B for analysis. Since we are interested in the projection of two class data in this experimental setup, we randomly choose 4 different pairs of classes from the dataset and use the labeled data from each pair to generate the two dimensional projection matrix (for that pair) using algorithm 1. The resulting projected data for the 4 pairs can be seen in figure 3. As is evident from the figure, the projected two class data for each pair approximately lie along two different lines.

4.1.2 Multi-class Separability

We analyze the separation between the K classes of a given K-class dataset after dimensionality reduction.
First, we compute the projection matrix for the dataset using our approach and project the data. Second, we compute the top principal vector for each class separately from the projected data, which gives us K vectors. Let the columns of the matrix Z ∈ R^{2K×K} contain these vectors. Then, in order to visualize inter-class separability, we simply take the dot product of the matrix Z with itself, i.e. Z^T Z. Figure 4 shows this visualization for the three face datasets. The diagonal elements represent self dot products and are thus 1 (white). The off-diagonal elements represent inter-class dot products, and these values are consistently small (dark) for all three datasets, reflecting between-class separability.

4.2 Quantitative Analysis

In order to evaluate theorem 3, we perform a classification experiment on all three real world datasets mentioned above after projecting the data vectors using different dimensionality reduction techniques. We compare our quantitative results against PCA, Linear Discriminant Analysis (LDA), Regularized LDA and Random Projections (RP). (Footnote 1: We also used LPP (Locality Preserving Projections) [3], NPE (Neighborhood Preserving Embedding) [4], and Laplacian Eigenmaps [1] for dimensionality reduction on the Extended Yale B dataset. However, because the best performing of these techniques yielded an accuracy of only 73%, compared to close to 98% from our approach, we do not report results for these methods.) We make use of sparse coding [14] for classification.

For the Extended Yale dataset B, we use all 38 classes for evaluation with a 50%-50% train-test split (Table 1) and a 70%-30% train-test split (Table 2). Since our method is randomized, we perform 50 runs of computing the projection matrix using algorithm 1 and report the mean accuracy with standard deviation. Similarly for RP, we generate 50 different random matrices and then perform classification.
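The separability visualization of section 4.1.2 can be sketched as follows, here on synthetic stand-in data rather than the face datasets (the class count, line directions, and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
K = 5
# Stand-in for a projected K-class dataset in R^{2K}: class k lies near its
# own line (the structure theorem 3 guarantees), plus a little noise.
lines = np.linalg.qr(rng.standard_normal((2 * K, K)))[0]
data = [np.outer(lines[:, k], rng.standard_normal(40))
        + 0.01 * rng.standard_normal((2 * K, 40)) for k in range(K)]

# Top principal vector of each projected class, stacked as columns of Z.
Z = np.column_stack([np.linalg.svd(Xk)[0][:, 0] for Xk in data])

# Z^T Z: diagonal entries are self dot products (exactly 1 up to round-off);
# small off-diagonal magnitudes indicate between-class separability.
G = Z.T @ Z
```

Plotting `np.abs(G)` as an image reproduces the kind of visualization shown in figure 4: a bright diagonal against a dark background.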
Since all other methods are deterministic, there is no need for multiple runs.

Table 1: Classification Accuracy on Extended Yale dataset B with 50%-50% train-test split. See section 4.2 for details.
Method | Ours         | PCA   | LDA   | Reg-LDA | RP
dim    | 76           | 76    | 37    | 37      | 76
acc    | 98.06 ± 0.18 | 92.54 | 83.68 | 95.77   | 93.78 ± 0.48

Table 2: Classification Accuracy on Extended Yale dataset B with 70%-30% train-test split. See section 4.2 for details.
Method | Ours         | PCA   | LDA   | Reg-LDA | RP
dim    | 76           | 76    | 37    | 37      | 76
acc    | 99.45 ± 0.20 | 93.98 | 93.85 | 97.47   | 94.72 ± 0.66

Table 3: Classification Accuracy on AR dataset. See section 4.2 for details.
Method | Ours         | PCA   | LDA | Reg-LDA | RP
dim    | 200          | 200   | 99  | 99      | 200
acc    | 92.18 ± 0.08 | 85.00 | -   | 88.71   | 84.76 ± 1.36

Table 4: Classification Accuracy on a subset of CMU PIE dataset. See section 4.2 for details.
Method | Ours         | PCA   | LDA   | Reg-LDA | RP
dim    | 136          | 136   | 67    | 67      | 136
acc    | 93.65 ± 0.08 | 87.76 | 86.71 | 92.59   | 90.46 ± 0.93

Table 5: Classification Accuracy on a 10-class subset of CMU PIE dataset. See section 4.2 for details.
Method | Ours         | PCA   | LDA   | Reg-LDA | RP
dim    | 20           | 20    | 9     | 9       | 20
acc    | 99.07 ± 0.09 | 97.06 | 95.88 | 97.25   | 95.03 ± 0.41

For the AR dataset, we take the 7 images per class from Session 1 for training and the 7 from Session 2 for testing. The results are shown in table 3. The result using LDA is not reported because we found the summed within-class covariance to be degenerate, making LDA inapplicable. It can be clearly seen that our approach significantly outperforms the other dimensionality reduction methods.
Finally, for the PIE dataset, we perform experiments on two different subsets. First, we take all 68 classes and, for each class, randomly choose 25 images for training and 25 for testing.
The performance for this subset is shown in table 4. Second, we take only the first 10 classes of the dataset and, of the 170 images per class, randomly split the data into a 70%-30% train-test set. The performance for this subset is shown in table 5.
Evidently, our approach consistently yields the best performance on all three datasets compared to the other dimensionality reduction methods.

5 Conclusion

We presented a theoretical analysis of the preservation of independence between multiple subspaces. We showed that for K independent subspaces, 2K projection vectors are sufficient for independence preservation (theorem 3). This result is motivated by our observation that for any two disjoint subspaces of arbitrary dimensionality, there exists a two dimensional plane such that after projection, both subspaces collapse in their entirety to just two lines (theorem 2). Building on this analysis, we proposed an efficient iterative algorithm (algorithm 1) that exploits these properties to learn a projection matrix for dimensionality reduction that preserves independence between multiple subspaces. Our empirical results on three real world datasets yield state-of-the-art results compared to popular dimensionality reduction methods.

References

[1] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373-1396, June 2003.
[2] A.S. Georghiades, P.N. Belhumeur, and D.J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intelligence, 23(6):643-660, 2001.
[3] X. He and P. Niyogi. Locality preserving projections (LPP). In Advances in Neural Information Processing Systems, 2004.
[4] Xiaofei He, Deng Cai, Shuicheng Yan, and Hong-Jiang Zhang. Neighborhood preserving embedding.
In Tenth IEEE International Conference on Computer Vision (ICCV 2005), volume 2, pages 1208-1213, Oct 2005.
[5] Jeffrey Ho, Ming-Hsuan Yang, Jongwoo Lim, Kuang-Chih Lee, and David Kriegman. Clustering appearances of objects under varying illumination conditions. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages I-11-I-18. IEEE, 2003.
[6] Wei Hong, John Wright, Kun Huang, and Yi Ma. Multiscale hybrid linear models for lossy image representation. IEEE Transactions on Image Processing, 15(12):3655-3671, 2006.
[7] Guangcan Liu, Zhouchen Lin, and Yong Yu. Robust subspace segmentation by low-rank representation. In ICML, 2010.
[8] Yi Ma, Harm Derksen, Wei Hong, and John Wright. Segmentation of multivariate mixed data via lossy coding and compression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
[9] Aleix Martínez and Robert Benavente. The AR Face Database, 1998.
[10] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, December 2000.
[11] Terence Sim, Simon Baker, and Maan Bsat. The CMU pose, illumination, and expression (PIE) database. In Fifth IEEE International Conference on Automatic Face and Gesture Recognition, pages 46-51. IEEE, 2002.
[12] René Vidal, Stefano Soatto, Yi Ma, and Shankar Sastry. An algebraic geometric approach to the identification of a class of linear hybrid systems. In 42nd IEEE Conference on Decision and Control, volume 1, pages 167-172. IEEE, 2003.
[13] René Vidal, Roberto Tron, and Richard Hartley. Multiframe motion segmentation with missing data using PowerFactorization and GPCA. International Journal of Computer Vision, 79(1):85-105, 2008.
[14] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE TPAMI, 31(2):210-227, Feb. 2009.
[15] Allen Y. Yang, John Wright, Yi Ma, and S. Shankar Sastry. Unsupervised segmentation of natural images via lossy data compression. Computer Vision and Image Understanding, 110(2):212-225, 2008.
", "award": [], "sourceid": 499, "authors": [{"given_name": "Devansh", "family_name": "Arpit", "institution": "SUNY Buffalo"}, {"given_name": "Ifeoma", "family_name": "Nwogu", "institution": "SUNY Buffalo"}, {"given_name": "Venu", "family_name": "Govindaraju", "institution": "SUNY Buffalo"}]}