{"title": "Deep Subspace Clustering Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 24, "page_last": 33, "abstract": "We present a novel deep neural network architecture for unsupervised subspace clustering. This architecture is built upon deep auto-encoders, which non-linearly map the input data into a latent space. Our key idea is to introduce a novel self-expressive layer between the encoder and the decoder to mimic the \"self-expressiveness\" property that has proven effective in traditional subspace clustering. Being differentiable, our new self-expressive layer provides a simple but effective way to learn pairwise affinities between all data points through a standard back-propagation procedure. Being nonlinear, our neural-network based method is able to cluster data points having complex (often nonlinear) structures. We further propose pre-training and fine-tuning strategies that let us effectively learn the parameters of our subspace clustering networks. Our experiments show that the proposed method significantly outperforms the state-of-the-art unsupervised subspace clustering methods.", "full_text": "Deep Subspace Clustering Networks\n\nPan Ji\u2217\n\nUniversity of Adelaide\n\nTong Zhang\u2217\n\nAustralian National University\n\nHongdong Li\n\nAustralian National University\n\nMathieu Salzmann\n\nEPFL - CVLab\n\nIan Reid\n\nUniversity of Adelaide\n\nAbstract\n\nWe present a novel deep neural network architecture for unsupervised subspace\nclustering. This architecture is built upon deep auto-encoders, which non-linearly\nmap the input data into a latent space. 
Our key idea is to introduce a novel\nself-expressive layer between the encoder and the decoder to mimic the \u201cself-\nexpressiveness\u201d property that has proven effective in traditional subspace clustering.\nBeing differentiable, our new self-expressive layer provides a simple but effective\nway to learn pairwise af\ufb01nities between all data points through a standard back-\npropagation procedure. Being nonlinear, our neural-network based method is able\nto cluster data points having complex (often nonlinear) structures. We further\npropose pre-training and \ufb01ne-tuning strategies that let us effectively learn the\nparameters of our subspace clustering networks. Our experiments show that\nour method signi\ufb01cantly outperforms the state-of-the-art unsupervised subspace\nclustering techniques.\n\n1\n\nIntroduction\n\nIn this paper, we tackle the problem of subspace clustering [42] \u2013 a sub-\ufb01eld of unsupervised\nlearning \u2013 which aims to cluster data points drawn from a union of low-dimensional subspaces in an\nunsupervised manner. Subspace clustering has become an important problem as it has found various\napplications in computer vision, e.g., image segmentation [50, 27], motion segmentation [17, 9],\nand image clustering [14, 10]. For example, under Lambertian re\ufb02ectance, the face images of one\nsubject obtained with a \ufb01xed pose and varying lighting conditions lie in a low-dimensional subspace\nof dimension close to nine [2]. Therefore, one can employ subspace clustering to group images of\nmultiple subjects according to their respective subjects.\nMost recent works on subspace clustering [49, 6, 10, 23, 46, 26, 16, 52] focus on clustering linear\nsubspaces. However, in practice, the data do not necessarily conform to linear subspace models. For\ninstance, in the example of face image clustering, re\ufb02ectance is typically non-Lambertian and the\npose of the subject often varies. 
Under these conditions, the face images of one subject rather lie in\na non-linear subspace (or sub-manifold). A few works [5, 34, 35, 51, 47] have proposed to exploit\nthe kernel trick [40] to address the case of non-linear subspaces. However, the selection of different\nkernel types is largely empirical, and there is no clear reason to believe that the implicit feature space\ncorresponding to a predefined kernel is truly well-suited to subspace clustering.\nIn this paper, by contrast, we introduce a novel deep neural network architecture to learn (in an\nunsupervised manner) an explicit non-linear mapping of the data that is well-adapted to subspace\nclustering. To this end, we build our deep subspace clustering networks (DSC-Nets) upon deep\nauto-encoders, which non-linearly map the data points to a latent space through a series of encoder layers.\n\n\u2217Authors contributed equally to this work\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nOur key contribution then consists of introducing a novel self-expressive layer \u2013 a fully\nconnected layer without bias and non-linear activations \u2013 at the junction between the encoder and the\ndecoder. This layer encodes the \u201cself-expressiveness\u201d property [38, 9] of data drawn from a union\nof subspaces, that is, the fact that each data sample can be represented as a linear combination of\nother samples in the same subspace. To the best of our knowledge, our approach constitutes the\nfirst attempt to directly learn the affinities (through combination coefficients) between all data points\nwithin one neural network. 
Furthermore, we propose effective pre-training and \ufb01ne-tuning strategies\nto learn the parameters of our DSC-Nets in an unsupervised manner and with a limited amount of\ndata.\nWe extensively evaluate our method on face clustering, using the Extended Yale B [21] and ORL [39]\ndatasets, and on general object clustering, using COIL20 [31] and COIL100 [30]. Our experiments\nshow that our DSC-Nets signi\ufb01cantly outperform the state-of-the-art subspace clustering methods.\n\n2 Related Work\n\nSubspace Clustering. Over the years, many methods have been developed for linear subspace\nclustering. In general, these methods consist of two steps: the \ufb01rst and also most crucial one aims to\nestimate an af\ufb01nity for every pair of data points to form an af\ufb01nity matrix; the second step then applies\nnormalized cuts [41] or spectral clustering [32] using this af\ufb01nity matrix. The resulting methods\ncan then be roughly divided into three categories [42]: factorization methods [7, 17, 44, 29, 16],\nhigher-order model based methods [49, 6, 33, 37], and self-expressiveness based methods [9, 24, 26,\n46, 15, 12, 22, 52]. In essence, factorization methods build the af\ufb01nity matrix by factorizing the data\nmatrix, and methods based on higher-order models estimate the af\ufb01nities by exploiting the residuals\nof local subspace model \ufb01tting. Recently, self-expressiveness based methods, which seek to express\nthe data points as a linear combination of other points in the same subspace, have become the most\npopular ones. These methods build the af\ufb01nity matrix using the matrix of combination coef\ufb01cients.\nCompared to factorization techniques, self-expressiveness based methods are often more robust to\nnoise and outliers when relying on regularization terms to account for data corruptions. 
They also\nhave the advantage over higher-order model based methods of considering connections between all\ndata points rather than exploiting local models, which are often suboptimal. To handle situations\nwhere data points do not exactly reside in a union of linear subspaces, but rather in non-linear ones,\na few works [34, 35, 51, 47] have proposed to replace the inner product of the data matrix with a\npre-de\ufb01ned kernel matrix (e.g., polynomial kernel and Gaussian RBF kernel). There is, however, no\nclear reason why such kernels should correspond to feature spaces that are well-suited to subspace\nclustering. By contrast, here, we propose to explicitly learn one that is.\n\nAuto-Encoders. Auto-encoders (AEs) can non-linearly transform data into a latent space. When\nthis latent space has lower dimension than the original one [13], this can be viewed as a form of\nnon-linear PCA. An auto-encoder typically consists of an encoder and a decoder to de\ufb01ne the data\nreconstruction cost. With the success of deep learning [20], deep (or stacked) AEs have become\npopular for unsupervised learning. For instance, deep AEs have proven useful for dimensionality\nreduction [13] and image denoising [45]. Recently, deep AEs have also been used to initialize deep\nembedding networks for unsupervised clustering [48]. A convolutional version of deep AEs was also\napplied to extract hierarchical features and to initialize convolutional neural networks (CNNs) [28].\nThere has been little work in the literature combining deep learning with subspace clustering. To the\nbest of our knowledge, the only exception is [36], which \ufb01rst extracts SIFT [25] or HOG [8] features\nfrom the images and feeds them to a fully connected deep auto-encoder with a sparse subspace\nclustering (SSC) [10] prior. The \ufb01nal clustering is then obtained by applying k-means or SSC on the\nlearned auto-encoder features. 
In essence, [36] can be thought of as a subspace clustering method\nbased on k-means or SSC with deep auto-encoder features. Our method significantly differs from [36]\nin that our network is designed to directly learn the affinities, thanks to our new self-expressive layer.\n\n3 Deep Subspace Clustering Networks (DSC-Nets)\n\nOur deep subspace clustering networks leverage deep auto-encoders and the self-expressiveness\nproperty. Before introducing our networks, we first discuss this property in more detail.\n\nFigure 1: Deep Convolutional Auto-Encoder: The input xi is mapped to zi through an encoder, and\nthen reconstructed as \u02c6xi through a decoder. We use shaded circles to denote data vectors and shaded\nsquares to denote the channels after convolution or deconvolution. We do not enforce the weights of\nthe corresponding encoder and decoder layers to be coupled (or the same).\n\n3.1 Self-Expressiveness\nGiven data points {xi}i=1,\u00b7\u00b7\u00b7 ,N drawn from multiple linear subspaces {Si}i=1,\u00b7\u00b7\u00b7 ,K, one can express\na point in a subspace as a linear combination of other points in the same subspace. In the literature [38,\n9], this property is called self-expressiveness. If we stack all the points xi into columns of a data\nmatrix X, the self-expressiveness property can be simply represented as one single equation, i.e.,\nX = XC, where C is the self-representation coefficient matrix. It has been shown in [15] that,\nunder the assumption that the subspaces are independent, by minimizing certain norms of C, C is\nguaranteed to have a block-diagonal structure (up to certain permutations), i.e., cij \u2260 0 iff point\nxi and point xj lie in the same subspace. So we can leverage the matrix C to construct the affinity\nmatrix for spectral clustering. 
Mathematically, this idea is formalized as the optimization problem\n\nmin_C ||C||_p   s.t.   X = XC, (diag(C) = 0) ,   (1)\n\nwhere || \u00b7 ||_p represents an arbitrary matrix norm, and the optional diagonal constraint on C prevents\ntrivial solutions for sparsity inducing norms, such as the \u21131 norm. Various norms for C have\nbeen proposed in the literature, e.g., the \u21131 norm in Sparse Subspace Clustering (SSC) [9, 10], the\nnuclear norm in Low Rank Representation (LRR) [24, 23] and Low Rank Subspace Clustering\n(LRSC) [11, 43], and the Frobenius norm in Least-Squares Regression (LSR) [26] and Efficient\nDense Subspace Clustering (EDSC) [15]. To account for data corruptions, the equality constraint\nin (1) is often relaxed as a regularization term, leading to\n\nmin_C ||C||_p + (\u03bb/2) ||X \u2212 XC||_F^2   s.t.   (diag(C) = 0) .   (2)\n\nUnfortunately, the self-expressiveness property only holds for linear subspaces. While kernel based\nmethods [34, 35, 51, 47] aim to tackle the non-linear case, it is not clear that pre-defined kernels yield\nimplicit feature spaces that are well-suited for subspace clustering. In this work, we aim to learn an\nexplicit mapping that makes the subspaces more separable. To this end, and as discussed below, we\npropose to build our networks upon deep auto-encoders.\n\n3.2 Self-Expressive Layer in Deep Auto-Encoders\n\nOur goal is to train a deep auto-encoder, such as the one depicted by Figure 1, such that its latent\nrepresentation is well-suited to subspace clustering. To this end, we introduce a new layer that\nencodes the notion of self-expressiveness.\nSpecifically, let \u0398 denote the auto-encoder parameters, which can be decomposed into encoder\nparameters \u0398e and decoder parameters \u0398d. Furthermore, let Z_\u0398e denote the output of the encoder,\ni.e., the latent representation of the data matrix X. 
To encode self-expressiveness, we introduce a\nnew loss function defined as\n\nL(\u0398, C) = (1/2) ||X \u2212 \u02c6X_\u0398||_F^2 + \u03bb1 ||C||_p + (\u03bb2/2) ||Z_\u0398e \u2212 Z_\u0398e C||_F^2   s.t.   (diag(C) = 0) ,   (3)\n\nwhere \u02c6X_\u0398 represents the data reconstructed by the auto-encoder. To minimize (3), we propose to\nleverage the fact that, as discussed below, C can be thought of as the parameters of an additional\nnetwork layer, which lets us solve for \u0398 and C jointly using backpropagation.1\n\n1Note that one could also alternate minimization between \u0398 and C. However, since the loss is non-convex,\nthis would not provide better convergence guarantees and would require investigating the influence of the number\nof steps in the optimization w.r.t. \u0398 on the clustering results.\n\nFigure 2: Deep Subspace Clustering Networks: As an example, we show a deep subspace clustering\nnetwork with three convolutional encoder layers, one self-expressive layer, and three deconvolutional\ndecoder layers. During training, we first pre-train the deep auto-encoder without the self-expressive\nlayer; we then fine-tune our entire network using this pre-trained model for initialization.\nSpecifically, consider the self-expressiveness term in (3), ||Z_\u0398e \u2212 Z_\u0398e C||_F^2. Since each data point zi\n(in the latent space) is approximated by a weighted linear combination of other points {zj}j=1,\u00b7\u00b7\u00b7 ,N\n(optionally, j \u2260 i) with weights cij, this linear operation corresponds exactly to a set of linear\nneurons without non-linear activations. Therefore, if we take each zi as a node in the network, we\ncan then represent the self-expressiveness term with a fully-connected linear layer, which we call\nthe self-expressive layer. 
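For concreteness, the loss in (3) is cheap to evaluate outside of any deep-learning framework. The sketch below assumes the \u21132 (squared Frobenius) regularizer on C and stores data points as columns; the helper name is ours, not from the paper's code:

```python
import numpy as np

def dsc_loss(X, X_hat, Z, C, lam1=1.0, lam2=1.0):
    """Evaluate Eq. (3): reconstruction term + lam1 * regularizer on C
    + (lam2/2) * self-expressiveness term, with diag(C) = 0 enforced."""
    assert np.allclose(np.diag(C), 0.0), "diag(C) = 0 constraint violated"
    rec = 0.5 * np.linalg.norm(X - X_hat, 'fro') ** 2
    reg = lam1 * np.linalg.norm(C, 'fro') ** 2        # l2 variant of ||C||_p
    self_exp = 0.5 * lam2 * np.linalg.norm(Z - Z @ C, 'fro') ** 2
    return rec + reg + self_exp

# Tiny sanity check: perfect reconstruction and perfect self-expression
# leave only the regularization term.
X = np.eye(2)
Z = np.array([[1.0, 1.0]])
C = np.array([[0.0, 1.0], [1.0, 0.0]])  # each point expresses the other
loss = dsc_loss(X, X.copy(), Z, C)
```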
The weights of the self-expressive layer correspond to the matrix C in (3),\nwhich are further used to construct affinities between all data points. Therefore, our self-expressive\nlayer essentially lets us directly learn the affinity matrix via the network. Moreover, minimizing\n||C||_p simply translates to adding a regularizer to the weights of the self-expressive layer. In this\nwork, we consider two kinds of regularizations on C: (i) the \u21131 norm, resulting in a network denoted\nby DSC-Net-L1; (ii) the \u21132 norm, resulting in a network denoted by DSC-Net-L2.\nFor notational consistency, let us denote the parameters of the self-expressive layer (which are just the\nelements of C) as \u0398s. As can be seen from Figure 2, we then take the input to the decoder part of our\nnetwork to be the transformed latent representation Z_\u0398e \u0398s. This lets us re-write our loss function as\n\n\u02dcL(\u02dc\u0398) = (1/2) ||X \u2212 \u02c6X_\u02dc\u0398||_F^2 + \u03bb1 ||\u0398s||_p + (\u03bb2/2) ||Z_\u0398e \u2212 Z_\u0398e \u0398s||_F^2   s.t.   (diag(\u0398s) = 0) ,   (4)\n\nwhere the network parameters \u02dc\u0398 now consist of encoder parameters \u0398e, self-expressive layer\nparameters \u0398s, and decoder parameters \u0398d, and where the reconstructed data \u02c6X is now a function of\n{\u0398e, \u0398s, \u0398d} rather than just {\u0398e, \u0398d} in (3).\n\n3.3 Network Architecture\n\nOur network consists of three parts, i.e., stacked encoders, a self-expressive layer, and stacked\ndecoders. The overall network architecture is shown in Figure 2. In this paper, since we focus on\nimage clustering problems, we advocate the use of convolutional auto-encoders that have fewer\nparameters than the fully connected ones and are thus easier to train. Note, however, that fully-connected auto-encoders are also compatible with our self-expressive layer. 
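Since the layer is linear and bias-free, its weights can be learned by plain gradient descent on the self-expressive part of (4). The sketch below freezes the encoder (Z is given) and uses the \u21132 regularizer, so it is a simplified stand-in for the full joint backpropagation, not the paper's training code:

```python
import numpy as np

def fit_self_expressive_layer(Z, lam1=1e-3, lr=1e-2, epochs=3000):
    """Projected gradient descent on lam1 ||C||_F^2 + 0.5 ||Z - Z C||_F^2,
    re-zeroing diag(C) after every step to respect the constraint in (4)."""
    n = Z.shape[1]
    C = np.zeros((n, n))
    gram = Z.T @ Z
    for _ in range(epochs):
        grad = 2.0 * lam1 * C + gram @ C - gram  # gradient of the objective
        C -= lr * grad
        np.fill_diagonal(C, 0.0)                 # enforce diag(C) = 0
    return C

# Latent points on two 1-D subspaces (the coordinate axes of R^2).
Z = np.array([[1.0, 2.0, -1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0, 2.0, -1.0]])
C = fit_self_expressive_layer(Z)
cross = np.abs(C[:3, 3:]).sum() + np.abs(C[3:, :3]).sum()
```

As expected, the learned weights reconstruct each latent point from points of its own subspace only, so C is block-diagonal.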
In the convolutional\nlayers, we use kernels with stride 2 in both horizontal and vertical directions, and rectified linear unit\n(ReLU) [19] for the non-linear activations. Given N images to be clustered, we use all the images in\na single batch. Each input image is mapped by the convolutional encoder layers to a latent vector (or\nnode) zi, represented as a shaded circle in Figure 2. In the self-expressive layer, the nodes are fully\nconnected using linear weights without bias and non-linear activations. The latent vectors are then\nmapped back to the original image space via the deconvolutional decoder layers.\nFor the ith encoder layer with ni channels of kernel size ki \u00d7 ki, the number of weight parameters\nis k_i^2 n_{i\u22121} n_i, with n0 = 1. Since the encoders and decoders have symmetric structures, their total\nnumber of parameters is \u03a3_i 2 k_i^2 n_{i\u22121} n_i plus the number of bias parameters \u03a3_i 2 n_i \u2212 n_M + 1,\nwhere n_M denotes the number of channels of the last (Mth) encoder layer. For N\ninput images, the number of parameters for the self-expressive layer is N^2. For example, if we have\nthree encoder layers with 10, 20, and 30 channels, respectively, and all convolutional kernels are of size\n3 \u00d7 3, then the number of parameters for encoders and decoders is \u03a3_{i=1}^3 2(k_i^2 n_{i\u22121} + 1) n_i \u2212 n_3 + 1 =\n14671. If we have 1000 input images, then the number of parameters in the self-expressive layer is\n10^6. Therefore, the network parameters are typically dominated by those of the self-expressive layer.\n\nFigure 3: From the parameters of the self-expressive layer, we construct an affinity matrix, which we\nuse to perform spectral clustering to get the final clusters. Best viewed in color.\n\n3.4 Training Strategy\n\nSince the size of datasets for unsupervised subspace clustering is usually limited (e.g., in the order of\nthousands of images), our networks remain of a tractable size. 
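The parameter counts from Section 3.3 are easy to verify programmatically. The helper below is our own illustrative sketch (it assumes conv/deconv layers with one bias per output channel and a decoder that mirrors the encoder), reproducing the 14671 and 10^6 figures above:

```python
def dsc_parameter_counts(kernel_sizes, channels, num_images):
    """Count auto-encoder and self-expressive-layer parameters.

    `channels` lists the encoder channels n_1..n_M (the input has n_0 = 1);
    weights: 2 * sum_i k_i^2 n_{i-1} n_i (encoder + mirrored decoder);
    biases: one per output channel of each encoder and decoder layer.
    """
    n = [1] + list(channels)
    weights = 2 * sum(k * k * n[i] * n[i + 1]
                      for i, k in enumerate(kernel_sizes))
    biases = sum(n[1:]) + sum(n[:-1])  # encoder outputs + decoder outputs
    return weights + biases, num_images ** 2

# Three 3x3 encoder layers with 10, 20, 30 channels; 1000 input images.
ae_params, se_params = dsc_parameter_counts([3, 3, 3], [10, 20, 30], 1000)
```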
However, for the same reason, it also\nremains difficult to directly train a network with millions of parameters from scratch. To address this,\nwe design the pre-training and fine-tuning strategies described below. Note that this also allows us to\navoid the trivial all-zero solution while minimizing the loss (4).\nAs illustrated in Figure 2, we first pre-train the deep auto-encoder without the self-expressive layer on\nall the data we have. We then use the trained parameters to initialize the encoder and decoder layers\nof our network. After this, in the fine-tuning stage, we build a big batch using all the data to minimize\nthe loss \u02dcL(\u02dc\u0398) defined in (4) with a gradient descent method. Specifically, we use Adam [18], an\nadaptive momentum based gradient descent method, to minimize the loss, where we set the learning\nrate to 1.0 \u00d7 10\u22123 in all our experiments. Since we always use the same batch in each training epoch,\nour optimization strategy is rather a deterministic momentum based gradient method than a stochastic\ngradient method. Note also that, since we only have access to images for training and not to cluster\nlabels, our training strategy is unsupervised (or self-supervised).\nOnce the network is trained, we can use the parameters of the self-expressive layer to construct an\naffinity matrix for spectral clustering [32], as illustrated in Figure 3. Although such an affinity matrix\ncould in principle be computed as |C| + |C^T|, over the years, researchers in the field have developed\nmany heuristics to improve the resulting matrix. Since there is no globally-accepted solution for this\nstep in the literature, we make use of the heuristics employed by SSC [10] and EDSC [15]. 
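Stripped of those heuristics, the final step reduces to building W = |C| + |C^T| and running spectral clustering on W. The sketch below is a minimal stand-in (for two clusters it simply splits on the sign of the Fiedler vector of the normalized Laplacian), not the SSC/EDSC post-processing used in the paper:

```python
import numpy as np

def spectral_bipartition(C):
    """Two-way clustering from self-expressive weights C: form the affinity
    W = |C| + |C^T|, then split by the sign of the eigenvector belonging to
    the second-smallest eigenvalue of the normalized Laplacian."""
    W = np.abs(C) + np.abs(C.T)
    d = W.sum(axis=1)
    d_isqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    lap = np.eye(len(W)) - d_isqrt[:, None] * W * d_isqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]
    return (fiedler > 0).astype(int)

# Toy C: strong within-cluster weights, weak spurious cross-cluster ones.
C = np.zeros((6, 6))
C[:3, :3] = C[3:, 3:] = 1.0
C[:3, 3:] = C[3:, :3] = 0.05
np.fill_diagonal(C, 0.0)
labels = spectral_bipartition(C)
```

For K > 2 clusters one would instead embed the points with the K smallest eigenvectors and run k-means, as in standard spectral clustering [32].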
Due to\nthe lack of space, we refer the reader to the publicly available implementation of SSC and Section 5\nof [15], as well as to the TensorFlow implementation of our algorithm 2 for more detail.\n\n4 Experiments\n\nWe implemented our method in Python with Tensorflow-1.0 [1], and evaluated it on four standard\ndatasets, i.e., the Extended Yale B and ORL face image datasets, and the COIL20/100 object image\ndatasets. We compare our methods against the following baselines: Low Rank Representation\n(LRR) [23], Low Rank Subspace Clustering (LRSC) [43], Sparse Subspace Clustering (SSC) [10],\nKernel Sparse Subspace Clustering (KSSC) [35], SSC by Orthogonal Matching Pursuit (SSC-OMP) [53],\nEfficient Dense Subspace Clustering (EDSC) [15], SSC with the pre-trained convolutional\nauto-encoder features (AE+SSC), and EDSC with the pre-trained convolutional auto-encoder features\n(AE+EDSC). For all the baselines, we used the source codes released by the authors and tuned the\nparameters by grid search to achieve the best results on each dataset. 
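All quantitative comparisons use the clustering error rate defined in Eq. (5) below. Since the cluster indices produced by spectral clustering are arbitrary, predicted labels must first be matched to the ground-truth ones; a small sketch (our own helper, brute-forcing the best label permutation, which is fine for small K):

```python
from itertools import permutations

def clustering_error(pred, gt):
    """Eq. (5): percentage of wrongly clustered points, minimized over all
    permutations of the predicted cluster labels (labels are arbitrary)."""
    labels = sorted(set(pred) | set(gt))
    best_wrong = len(gt)
    for perm in permutations(labels):
        relabel = dict(zip(labels, perm))
        wrong = sum(relabel[p] != g for p, g in zip(pred, gt))
        best_wrong = min(best_wrong, wrong)
    return 100.0 * best_wrong / len(gt)
```

For larger K, the same matching is usually done with the Hungarian algorithm instead of brute force.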
Since the code for the deep\nsubspace clustering method of [36] is not publicly available, we are only able to provide a comparison\nagainst this approach on Extended Yale B and COIL20, for which the results are provided in [36].\nNote that this comparison already clearly shows the benefits of our approach.\n\n2https://github.com/panji1990/Deep-subspace-clustering-networks\n\nFigure 4: Sample images from (a) Extended Yale B, (b) ORL, and (c) COIL20 and COIL100.\n\nlayers | encoder-1 | encoder-2 | encoder-3 | self-expressive | decoder-1 | decoder-2 | decoder-3\nkernel size | 5 \u00d7 5 | 3 \u00d7 3 | 3 \u00d7 3 | \u2013 | 3 \u00d7 3 | 3 \u00d7 3 | 5 \u00d7 5\nchannels | 10 | 20 | 30 | \u2013 | 30 | 20 | 10\nparameters | 260 | 1820 | 5430 | 5914624 | 5420 | 1810 | 251\n\nTable 1: Network settings for Extended Yale B.\n\nFor all quantitative evaluations, we make use of the clustering error rate, defined as\n\nerr % = (# of wrongly clustered points / total # of points) \u00d7 100% .   (5)\n\n4.1 Extended Yale B Dataset\n\nThe Extended Yale B dataset [21] is a popular benchmark for subspace clustering. It consists of 38\nsubjects, each of which is represented with 64 face images acquired under different illumination\nconditions (see Figure 4(a) for sample images from this dataset). Following the experimental setup\nof [10], we down-sampled the original face images from 192 \u00d7 168 to 42 \u00d7 42 pixels, which\nmakes it computationally feasible for the baselines [10, 23]. In each experiment, we pick K \u2208\n{10, 15, 20, 25, 30, 35, 38} subjects (each subject with 64 face images) to test the robustness w.r.t.\nan increasing number of clusters. Taking all possible combinations of K subjects out of 38 would\nresult in too many experimental trials. To get a manageable size of experiments, we first number the\nsubjects from 1 to 38 and then take all possible K consecutive subjects. 
For example, in the case\nof 10 subjects, we take all the images from subjects 1\u201310, 2\u201311, \u00b7\u00b7\u00b7 , 29\u201338, giving rise to 29\nexperimental trials.\nWe experimented with different architectures for the convolutional layers of our network, e.g., different\nnetwork depths and number of channels. While increasing these values increases the representation\npower of the network, it also increases the number of network parameters, thus requiring larger\ntraining datasets. Since the size of Extended Yale B is quite limited, with only 2432 images, we\nfound having three-layer encoders and decoders with [10, 20, 30] channels to be a good trade-off for\nthis dataset. The detailed network settings are described in Table 1. In the fine-tuning phase, since\nthe number of epochs required for gradient descent increases as the number of subjects K increases,\nwe defined the number of epochs for DSC-Net-L1 as 160 + 20K and for DSC-Net-L2 as 50 + 25K.\nWe set the regularization parameters to \u03bb1 = 1.0, \u03bb2 = 1.0 \u00d7 10^(K/10 \u2212 3).\nThe clustering performance of different methods for different numbers of subjects is provided in\nTable 2. For the experiments with K subjects, we report the mean and median errors of 39 \u2212 K\nexperimental trials. From these results, we can see that the performance of most of the baselines\ndecreases dramatically as the number of subjects K increases. By contrast, the performance of our\ndeep subspace clustering methods, DSC-Net-L1 and DSC-Net-L2, remains relatively stable w.r.t.\nthe number of clusters. Specifically, our DSC-Net-L2 achieves 2.67% error rate for 38 subjects,\nwhich is only around 1/5 of the best performing baseline EDSC. We also observe that using the\npre-trained auto-encoder features does not necessarily improve the performance of SSC and EDSC,\nwhich confirms the benefits of our joint optimization of all parameters in one network. 
The results\nof [36] on this dataset for 38 subjects were reported to be 92.08 \u00b1 2.42% in terms of clustering\naccuracy, or equivalently 7.92 \u00b1 2.42% in terms of clustering error, which is worse than both our\nmethods \u2013 DSC-Net-L1 and DSC-Net-L2. We further notice that DSC-Net-L1 performs slightly\nworse than DSC-Net-L2 in the current experimental settings. We conjecture that this is due to the\ndifficulty in optimization introduced by the \u21131 norm as it is non-differentiable at zero.\n\nMethod | LRR | LRSC | SSC | AE+SSC | KSSC | SSC-OMP | EDSC | AE+EDSC | DSC-Net-L1 | DSC-Net-L2\n10 subjects Mean | 22.22 | 30.95 | 10.22 | 17.06 | 14.49 | 12.08 | 5.64 | 5.46 | 2.23 | 1.59\n10 subjects Median | 23.49 | 29.38 | 11.09 | 17.75 | 15.78 | 8.28 | 5.47 | 6.09 | 2.03 | 1.25\n15 subjects Mean | 23.22 | 31.47 | 13.13 | 18.65 | 16.22 | 14.05 | 7.63 | 6.70 | 2.17 | 1.69\n15 subjects Median | 23.49 | 31.64 | 13.40 | 17.76 | 17.34 | 14.69 | 6.41 | 5.52 | 2.03 | 1.72\n20 subjects Mean | 30.23 | 28.76 | 19.75 | 18.23 | 16.55 | 15.16 | 9.30 | 7.67 | 2.17 | 1.73\n20 subjects Median | 29.30 | 28.91 | 21.17 | 16.80 | 17.34 | 15.23 | 10.31 | 6.56 | 2.11 | 1.80\n25 subjects Mean | 27.92 | 27.81 | 26.22 | 18.72 | 18.56 | 18.89 | 10.67 | 10.27 | 2.53 | 1.75\n25 subjects Median | 28.13 | 26.81 | 26.66 | 17.88 | 18.03 | 18.53 | 10.84 | 10.22 | 2.19 | 1.81\n30 subjects Mean | 37.98 | 30.64 | 28.76 | 19.99 | 20.49 | 20.75 | 11.24 | 11.56 | 2.63 | 2.07\n30 subjects Median | 36.82 | 30.31 | 28.59 | 20.00 | 20.94 | 20.52 | 11.09 | 10.36 | 2.81 | 2.19\n35 subjects Mean | 41.85 | 31.35 | 28.55 | 22.13 | 26.07 | 20.29 | 13.10 | 13.28 | 3.09 | 2.65\n35 subjects Median | 41.81 | 31.74 | 29.04 | 21.74 | 25.92 | 20.18 | 13.10 | 13.21 | 3.10 | 2.64\n38 subjects Mean | 34.87 | 29.89 | 27.51 | 25.33 | 27.75 | 24.71 | 11.64 | 12.66 | 3.33 | 2.67\n38 subjects Median | 34.87 | 29.89 | 27.51 | 25.33 | 27.75 | 24.71 | 11.64 | 12.66 | 3.33 | 2.67\n\nTable 2: Clustering error (in %) on Extended Yale B. 
The lower the better.\n\nlayers | encoder-1 | encoder-2 | encoder-3 | self-expressive | decoder-1 | decoder-2 | decoder-3\nkernel size | 5 \u00d7 5 | 3 \u00d7 3 | 3 \u00d7 3 | \u2013 | 3 \u00d7 3 | 3 \u00d7 3 | 5 \u00d7 5\nchannels | 5 | 3 | 3 | \u2013 | 3 | 3 | 5\nparameters | 130 | 138 | 84 | 160000 | 84 | 140 | 126\n\nTable 3: Network settings for ORL.\n\n4.2 ORL Dataset\n\nThe ORL dataset [39] is composed of 400 human face images, with 40 subjects each having 10\nsamples. Following [4], we down-sampled the original face images from 112 \u00d7 92 to 32 \u00d7 32. For\neach subject, the images were taken under varying lighting conditions with different facial expressions\n(open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses) (see Figure 4(b)\nfor sample images). Compared to Extended Yale B, this dataset is more challenging for subspace\nclustering because (i) the face subspaces have more non-linearity due to varying facial expressions and\ndetails; (ii) the dataset size is much smaller (400 vs. 2432). To design a trainable deep auto-encoder\non 400 images, we reduced the number of network parameters by decreasing the number of channels\nin each encoder and decoder layer. The resulting network is specified in Table 3.\nSince we already verified the robustness of our method to the number of clusters in the previous\nexperiment, here, we only provide results for clustering all 40 subjects. In this setting, we set\n\u03bb1 = 1 and \u03bb2 = 0.2 and ran 700 epochs for DSC-Net-L2 and 1500 epochs for DSC-Net-L1 during\nfine-tuning. Note that, since the size of this dataset is small, we can even use the whole data as a\nsingle batch in pre-training. 
We found this to be numerically more stable and converge faster than\nstochastic gradient descent using randomly sampled mini-batches.\nFigure 5(a) shows the error rates of the different methods, where different colors denote different\nsubspace clustering algorithms and the length of the bars re\ufb02ects the error rate. Since there are\nmuch fewer samples per subject, all competing methods perform worse than on Extended Yale B.\nNote that both EDSC and SSC achieve moderate clustering improvement by using the features of\npre-trained convolutional auto-encoders, but their error rates are still around twice as high as those of\nour methods.\n\n4.3 COIL20 and COIL100 Datasets\n\nThe previous experiments both target face clustering. To show the generality of our algorithm, we\nalso evaluate it on the COIL object image datasets \u2013 COIL20 [31] and COIL100 [30]. COIL20\nconsists of 1440 gray-scale image samples, distributed over 20 objects such as duck and car model\n(see sample images in Figure 4(c)). Similarly, COIL100 consists of 7200 images distributed over\n100 objects. Each object was placed on a turntable against a black background, and 72 images were\ntaken at pose intervals of 5 degrees. Following [3], we down-sampled the images to 32 \u00d7 32. In\ncontrast with the previous human face datasets, in which faces are well aligned and have similar\nstructures, the object images from COIL20 and COIL100 are more diverse, and even samples from\n\n7\n\n\f(a) ORL\n\n(b) COIL20\n\n(c) COIL100\n\nFigure 5: Subspace clustering error (in %) on the ORL, COIL20 and COIL100 datasets. Different\ncolors indicate different methods. 
The height of the bars encodes the error, so the lower the better.\n\n(COIL20) layers | encoder-1 | self-expressive | decoder-1\nkernel size | 3 \u00d7 3 | \u2013 | 3 \u00d7 3\nchannels | 15 | \u2013 | 15\nparameters | 150 | 2073600 | 136\n\n(COIL100) layers | encoder-1 | self-expressive | decoder-1\nkernel size | 5 \u00d7 5 | \u2013 | 5 \u00d7 5\nchannels | 50 | \u2013 | 50\nparameters | 1300 | 51840000 | 1251\n\nTable 4: Network settings for COIL20 and COIL100.\n\nthe same object differ from each other due to the change of viewing angle. This makes these datasets\nchallenging for subspace clustering techniques. For these datasets, we used shallower networks with\none encoder layer, one self-expressive layer, and one decoder layer. For COIL20, we set the number\nof channels to 15 and the kernel size to 3 \u00d7 3. For COIL100, we increased the number of channels\nto 50 and the kernel size to 5 \u00d7 5. The settings for both networks are provided in Table 4. Note\nthat with these network architectures, the dimension of the latent space representation zi increases\nby a factor of 15/4 for COIL20 (as the spatial resolution of each channel shrinks to 1/4 of the input\nimage after convolutions with stride 2, and we have 15 channels) and 50/4 for COIL100. Thus our\nnetworks perform dimensionality lifting rather than dimensionality reduction. This, in some sense, is\nsimilar to the idea of Hilbert space mapping in kernel methods [40], but with the difference that, in\nour case, the mapping is explicit, via the neural network. In our experiments, we found that these\nshallow, dimension-lifting networks performed better than deep, bottle-neck ones on these datasets.\nWhile it is also possible to design deep, dimension-lifting networks, the number of channels has to\nincrease by a factor of 4 after each layer to compensate for the resolution loss. 
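The dimension-lifting factor above follows directly from the stride-2 geometry; a small sketch (our own helper, assuming padding such that each layer exactly halves the height and width):

```python
def lifting_factor(height, width, channels_per_layer):
    """Ratio of latent dimension to input dimension for a stride-2
    convolutional encoder: L layers shrink the spatial grid by 2^L per side,
    and the last layer's channel count multiplies the result."""
    L = len(channels_per_layer)
    latent = (height // 2 ** L) * (width // 2 ** L) * channels_per_layer[-1]
    return latent / (height * width)
```

For the 32 \u00d7 32 COIL images this gives 15/4 with one 15-channel layer and 50/4 with one 50-channel layer, matching the factors above.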
For example, if we want the latent space dimension to increase by a factor of 15/4, we need 15 · 4 channels in the second layer of a 2-layer encoder, 15 · 4² channels in the third layer of a 3-layer encoder, and so forth. In the presence of limited data, this growing number of parameters makes training less reliable. In our fine-tuning stage, we ran 30 epochs (COIL20) / 100 epochs (COIL100) for DSC-Net-L1 and 30 epochs (COIL20) / 120 epochs (COIL100) for DSC-Net-L2, and set the regularization parameters to λ1 = 1 and λ2 = 150 (COIL20) / 30 (COIL100).
Figure 5(b) and (c) depict the error rates of the different methods on clustering the 20 classes of COIL20 and the 100 classes of COIL100, respectively. Note that, in both cases, our DSC-Net-L2 achieves the lowest error rate. In particular, for COIL20, we obtain an error of 5.14%, which is roughly 1/3 of the error rate of the best-performing baseline AE+EDSC. The clustering error of [36] on COIL20 was reported to be 14.24 ± 4.70%, which is also much higher than ours.

5 Conclusion

We have introduced a deep auto-encoder framework for subspace clustering by developing a novel self-expressive layer to harness the “self-expressiveness” property of a union of subspaces. Our deep subspace clustering network allows us to directly learn the affinities between all data points with a single neural network. Furthermore, we have proposed pre-training and fine-tuning strategies to train our network, demonstrating the ability to handle challenging scenarios with small-size datasets, such as the ORL dataset.
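As a toy illustration of the self-expressiveness property that this layer encodes, consider the ℓ2-regularized self-expression problem, which admits a closed-form solution (our own NumPy sketch under simplifying assumptions, not the paper's learned TensorFlow layer):

```python
import numpy as np

# two independent (non-orthogonal) 1-D subspaces in R^3, five points each;
# columns of Z play the role of latent codes z_i
x1 = np.array([1.0, 0.0, 0.0]).reshape(3, 1)
x2 = np.array([1.0, 1.0, 0.0]).reshape(3, 1) / np.sqrt(2.0)
a = np.array([1.0, 2.0, -1.0, 0.5, 1.5])   # coefficients along x1
b = np.array([-1.0, 1.0, 2.0, -0.5, 1.0])  # coefficients along x2
Z = np.hstack([x1 * a, x2 * b])            # 3 x 10 data matrix

# l2-regularized self-expression  min_C ||Z - ZC||_F^2 + lam * ||C||_F^2
# has the closed-form solution    C = (Z^T Z + lam * I)^{-1} Z^T Z
lam = 1e-3
N = Z.shape[1]
G = Z.T @ Z
C = np.linalg.solve(G + lam * np.eye(N), G)

# each point is expressed almost exclusively by points of its own subspace:
# the cross-subspace block of |C| is markedly smaller than the within block
within = np.abs(C[:5, :5]).max()
cross = np.abs(C[:5, 5:]).max()
assert cross < 0.1 * within
```

Spectral clustering on the affinity |C| + |C|^T then recovers the two groups; in the network, C is instead learned jointly with the encoder by back-propagation.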
Our experiments have demonstrated that our deep subspace clustering methods provide a significant improvement over the state-of-the-art subspace clustering solutions in terms of clustering accuracy on several standard datasets.

Acknowledgements

This research was supported by the Australian Research Council (ARC) through the Centre of Excellence in Robotic Vision, CE140100016, and through Laureate Fellowship FL130100102 to IDR. TZ was supported by the ARC's Discovery Projects funding scheme (project DP150104645).

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.
[2] R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. TPAMI, 25(2):218–233, 2003.
[3] D. Cai, X. He, J. Han, and T. Huang. Graph regularized nonnegative matrix factorization for data representation. TPAMI, 33(8):1548–1560, 2011.
[4] D. Cai, X. He, Y. Hu, J. Han, and T. Huang. Learning a spatially smooth subspace for face recognition. In CVPR, pages 1–7. IEEE, 2007.
[5] G. Chen, S. Atev, and G. Lerman. Kernel spectral curvature clustering (KSCC). In ICCV Workshops, pages 765–772. IEEE, 2009.
[6] G. Chen and G. Lerman. Spectral curvature clustering (SCC). IJCV, 81(3):317–330, 2009.
[7] J. Costeira and T. Kanade. A multibody factorization method for independently moving objects. IJCV, 29(3):159–179, 1998.
[8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, pages 886–893. IEEE, 2005.
[9] E. Elhamifar and R. Vidal. Sparse subspace clustering. In CVPR, pages 2790–2797, 2009.
[10] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. TPAMI, 35(11):2765–2781, 2013.
[11] P. Favaro, R. Vidal, and A. Ravichandran. A closed form solution to robust subspace estimation and clustering. In CVPR, pages 1801–1807. IEEE, 2011.
[12] J. Feng, Z. Lin, H. Xu, and S. Yan. Robust subspace segmentation with block-diagonal prior. In CVPR, pages 3818–3825, 2014.
[13] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[14] J. Ho, M.-H. Yang, J. Lim, K.-C. Lee, and D. Kriegman. Clustering appearances of objects under varying illumination conditions. In CVPR, volume 1, pages 11–18. IEEE, 2003.
[15] P. Ji, M. Salzmann, and H. Li. Efficient dense subspace clustering. In WACV, pages 461–468. IEEE, 2014.
[16] P. Ji, M. Salzmann, and H. Li. Shape interaction matrix revisited and robustified: Efficient subspace clustering with corrupted and incomplete data. In ICCV, pages 4687–4695, 2015.
[17] K.-i. Kanatani. Motion segmentation by subspace separation and model selection. In ICCV, volume 2, pages 586–591. IEEE, 2001.
[18] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[20] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[21] K.-C. Lee, J. Ho, and D. J. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. TPAMI, 27(5):684–698, 2005.
[22] C.-G. Li and R. Vidal. Structured sparse subspace clustering: A unified optimization framework. In CVPR, pages 277–286, 2015.
[23] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures by low-rank representation. TPAMI, 35(1):171–184, 2013.
[24] G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In ICML, pages 663–670, 2010.
[25] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[26] C.-Y. Lu, H. Min, Z.-Q. Zhao, L. Zhu, D.-S. Huang, and S. Yan. Robust and efficient subspace segmentation via least squares regression. In ECCV, pages 347–360. Springer, 2012.
[27] Y. Ma, H. Derksen, W. Hong, and J. Wright. Segmentation of multivariate mixed data via lossy data coding and compression. TPAMI, 29(9), 2007.
[28] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. In ICANN, pages 52–59, 2011.
[29] Q. Mo and B. A. Draper. Semi-nonnegative matrix factorization for motion segmentation with missing data. In ECCV, pages 402–415. Springer, 2012.
[30] S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (COIL-100). Technical Report CUCS-006-96, 1996.
[31] S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (COIL-20). Technical Report CUCS-005-96, 1996.
[32] A. Y. Ng, M. I. Jordan, Y. Weiss, et al. On spectral clustering: Analysis and an algorithm. In NIPS, volume 14, pages 849–856, 2001.
[33] P. Ochs and T. Brox. Higher order motion models and spectral clustering. In CVPR, 2012.
[34] V. M. Patel, H. Van Nguyen, and R. Vidal. Latent space sparse subspace clustering. In ICCV, pages 225–232, 2013.
[35] V. M. Patel and R. Vidal. Kernel sparse subspace clustering. In ICIP, pages 2849–2853. IEEE, 2014.
[36] X. Peng, S. Xiao, J. Feng, W.-Y. Yau, and Z. Yi. Deep subspace clustering with sparsity prior. In IJCAI, 2016.
[37] P. Purkait, T.-J. Chin, H. Ackermann, and D. Suter. Clustering with hypergraphs: the case for large hyperedges. In ECCV, pages 672–687. Springer, 2014.
[38] S. R. Rao, R. Tron, R. Vidal, and Y. Ma. Motion segmentation via robust subspace separation in the presence of outlying, incomplete, or corrupted trajectories. In CVPR, pages 1–8. IEEE, 2008.
[39] F. S. Samaria and A. C. Harter. Parameterisation of a stochastic model for human face identification. In Proceedings of the Second IEEE Workshop on Applications of Computer Vision, pages 138–142. IEEE, 1994.
[40] J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University Press, 2004.
[41] J. Shi and J. Malik. Normalized cuts and image segmentation. TPAMI, 22(8):888–905, 2000.
[42] R. Vidal. Subspace clustering. IEEE Signal Processing Magazine, 28(2):52–68, 2011.
[43] R. Vidal and P. Favaro. Low rank subspace clustering (LRSC). Pattern Recognition Letters, 43:47–61, 2014.
[44] R. Vidal, R. Tron, and R. Hartley. Multiframe motion segmentation with missing data using powerfactorization and GPCA. IJCV, 79(1):85–105, 2008.
[45] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11:3371–3408, 2010.
[46] Y.-X. Wang, H. Xu, and C. Leng. Provable subspace clustering: When LRR meets SSC. In NIPS, pages 64–72, 2013.
[47] S. Xiao, M. Tan, D. Xu, and Z. Y. Dong. Robust kernel low-rank representation. IEEE Transactions on Neural Networks and Learning Systems, 27(11):2268–2281, 2016.
[48] J. Xie, R. Girshick, and A. Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, 2016.
[49] J. Yan and M. Pollefeys. A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In ECCV, pages 94–106. Springer, 2006.
[50] A. Y. Yang, J. Wright, Y. Ma, and S. S. Sastry. Unsupervised segmentation of natural images via lossy data compression. CVIU, 110(2):212–225, 2008.
[51] M. Yin, Y. Guo, J. Gao, Z. He, and S. Xie. Kernel sparse subspace clustering on symmetric positive definite manifolds. In CVPR, pages 5157–5164, 2016.
[52] C. You, C.-G. Li, D. P. Robinson, and R. Vidal. Oracle based active set algorithm for scalable elastic net subspace clustering. In CVPR, pages 3928–3937, 2016.
[53] C. You, D. Robinson, and R. Vidal. Scalable sparse subspace clustering by orthogonal matching pursuit. In CVPR, pages 3918–3927, 2016.