{"title": "Projective dictionary pair learning for pattern classification", "book": "Advances in Neural Information Processing Systems", "page_first": 793, "page_last": 801, "abstract": "Discriminative dictionary learning (DL) has been widely studied in various pattern classification problems. Most of the existing DL methods aim to learn a synthesis dictionary to represent the input signal while enforcing the representation coefficients and/or representation residual to be discriminative. However, the $\\ell_0$ or $\\ell_1$-norm sparsity constraint on the representation coefficients adopted in many DL methods makes the training and testing phases time consuming. We propose a new discriminative DL framework, namely projective dictionary pair learning (DPL), which learns a synthesis dictionary and an analysis dictionary jointly to achieve the goal of signal representation and discrimination. Compared with conventional DL methods, the proposed DPL method can not only greatly reduce the time complexity in the training and testing phases, but also lead to very competitive accuracies in a variety of visual classification tasks.", "full_text": "Projective dictionary pair learning for pattern\n\nclassi\ufb01cation\n\nShuhang Gu1, Lei Zhang1, Wangmeng Zuo2, Xiangchu Feng3\n\n1Dept. of Computing, The Hong Kong Polytechnic University, Hong Kong, China\n\n2School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China\n\n3Dept. of Applied Mathematics, Xidian University, Xi(cid:48)an, China\n\n{cssgu, cslzhang}@comp.polyu.edu.hk\n\ncswmzuo@gmail.com, xcfeng@mail.xidian.edu.cn\n\nAbstract\n\nDiscriminative dictionary learning (DL) has been widely studied in various pattern\nclassi\ufb01cation problems. Most of the existing DL methods aim to learn a synthesis\ndictionary to represent the input signal while enforcing the representation coef-\n\ufb01cients and/or representation residual to be discriminative. 
However, the $\ell_0$ or $\ell_1$-norm sparsity constraint on the representation coefficients adopted in most DL methods makes the training and testing phases time consuming. We propose a new discriminative DL framework, namely projective dictionary pair learning (DPL), which learns a synthesis dictionary and an analysis dictionary jointly to achieve the goal of signal representation and discrimination. Compared with conventional DL methods, the proposed DPL method can not only greatly reduce the time complexity in the training and testing phases, but also lead to very competitive accuracies in a variety of visual classification tasks.

1 Introduction

Sparse representation represents a signal as the linear combination of a small number of atoms chosen out of a dictionary, and it has achieved great success in various image processing and computer vision applications [1, 2]. The dictionary plays an important role in the signal representation process [3]. By using a predefined analytical dictionary (e.g., a wavelet or Gabor dictionary) to represent a signal, the representation coefficients can be produced by simple inner product operations. Such fast and explicit coding makes analytical dictionaries very attractive in image representation; however, they are less effective in modeling the complex local structures of natural images.

Sparse representation with a synthesis dictionary has been widely studied in recent years [2, 4, 5]. With a synthesis dictionary, the representation coefficients of a signal are usually obtained via an $\ell_p$-norm ($p \le 1$) sparse coding process, which is computationally more expensive than analytical dictionary based representation. However, synthesis based sparse representation can better model the complex image local structures and it has led to many state-of-the-art results in image restoration [6]. 
Another important advantage lies in that the synthesis based sparse representation model allows us to easily learn a desired dictionary from the training data. The seminal work of KSVD [1] shows that an over-complete dictionary can be learned from example natural images, and it can lead to much better image reconstruction results than analytically designed off-the-shelf dictionaries. Inspired by KSVD, many dictionary learning (DL) methods have been proposed and have achieved state-of-the-art performance in image restoration tasks.

The success of DL in image restoration problems has triggered its application to image classification tasks. Different from image restoration, the goal of classification is to assign the correct class label to the test sample; therefore, the discrimination capability of the learned dictionary is of major concern. To this end, supervised dictionary learning methods have been proposed to promote the discriminative power of the learned dictionary [4, 5, 7, 8, 9]. By encoding the query sample over the learned dictionary, both the coding coefficients and the coding residual can be used for classification, depending on the employed DL model. Discriminative DL has led to many state-of-the-art results in pattern recognition problems.

One popular strategy of discriminative DL is to learn a shared dictionary for all classes while enforcing the coding coefficients to be discriminative [4, 5, 7]. A classifier on the coding coefficients can be trained simultaneously to perform classification. Mairal et al. [7] proposed to learn a dictionary and a corresponding linear classifier in the coding vector space. In the label consistent KSVD (LC-KSVD) method, Jiang et al. [5] introduced a binary class label sparse code matrix to encourage samples from the same class to have similar sparse codes. In [4], Mairal et al. 
proposed a task-driven dictionary learning (TDDL) framework, which minimizes different risk functions of the coding coefficients for different tasks.

Another popular line of research in DL attempts to learn a structured dictionary to promote discrimination between classes [2, 8, 9, 10]. The atoms in the structured dictionary have class labels, and the class-specific representation residual can be computed for classification. Ramirez et al. [8] introduced an incoherence promotion term to encourage the sub-dictionaries of different classes to be independent. Yang et al. [9] proposed a Fisher discrimination dictionary learning (FDDL) method which applies the Fisher criterion to both the representation residual and the representation coefficients. Wang et al. [10] proposed a max-margin dictionary learning (MMDL) algorithm from the large margin perspective.

In most of the existing DL methods, the $\ell_0$-norm or $\ell_1$-norm is used to regularize the representation coefficients, since sparser coefficients are more likely to produce better classification results. Hence a sparse coding step is generally involved in the iterative DL process. Although numerous algorithms have been proposed to improve the efficiency of sparse coding [11, 12], the use of $\ell_0$-norm or $\ell_1$-norm sparsity regularization remains a big computational burden and makes both training and testing inefficient.

It is interesting to investigate whether we can learn discriminative dictionaries without the costly $\ell_0$-norm or $\ell_1$-norm sparsity regularization. In particular, it would be very attractive if the representation coefficients could be obtained by linear projection instead of nonlinear sparse coding. To this end, in this paper we propose a projective dictionary pair learning (DPL) framework to learn a synthesis dictionary and an analysis dictionary jointly for pattern classification. 
The analysis dictionary is trained to generate discriminative codes by efficient linear projection, while the synthesis dictionary is trained to achieve class-specific discriminative reconstruction. The idea of using functions to predict the representation coefficients is not new, and fast approximate sparse coding methods have been proposed to train nonlinear functions to generate sparse codes [13, 14]. However, there are clear differences between our DPL model and these methods. First, in DPL the synthesis dictionary and the analysis dictionary are trained jointly, which ensures that the representation coefficients can be approximated by a simple linear projection function. Second, DPL utilizes class label information and promotes the discriminative power of the representation codes.

One work related to this paper is analysis-based sparse representation prior learning [15, 16], which represents a signal from a viewpoint dual to the commonly used synthesis model. Analysis prior learning tries to learn a group of analysis operators which have sparse responses to the latent clean signal. Sprechmann et al. [17] proposed to train a group of analysis operators for classification; however, in the testing phase a costly sparsity-constrained optimization problem still has to be solved. Feng et al. [18] jointly trained a dimensionality reduction transform and a dictionary for face recognition. The discriminative dictionary is trained in the transformed space, and sparse coding is needed in both the training and testing phases.

The contribution of our work is two-fold. First, we introduce a new DL framework, which extends conventional discriminative synthesis dictionary learning to discriminative synthesis and analysis dictionary pair learning (DPL). Second, DPL utilizes an analytical coding mechanism and it largely improves the efficiency in both the training and testing phases. 
Our experiments on various visual classification datasets show that DPL achieves accuracy very competitive with state-of-the-art DL algorithms, while it is significantly faster in both training and testing.

2 Projective Dictionary Pair Learning

2.1 Discriminative dictionary learning

Denote by $X = [X_1, \ldots, X_k, \ldots, X_K]$ a set of $p$-dimensional training samples from $K$ classes, where $X_k \in \mathbb{R}^{p \times n}$ is the training sample set of class $k$, and $n$ is the number of samples of each class. Discriminative DL methods aim to learn an effective data representation model from $X$ for classification tasks by exploiting the class label information of training data. Most of the state-of-the-art discriminative DL methods [5, 7, 9] can be formulated under the following framework:

\min_{D, A} \|X - DA\|_F^2 + \lambda \|A\|_p + \Psi(D, A, Y),  (1)

where $\lambda \ge 0$ is a scalar constant, $Y$ represents the class label matrix of the samples in $X$, $D$ is the synthesis dictionary to be learned, and $A$ is the coding coefficient matrix of $X$ over $D$. In the training model (1), the data fidelity term $\|X - DA\|_F^2$ ensures the representation ability of $D$; $\|A\|_p$ is the $\ell_p$-norm regularizer on $A$; and $\Psi(D, A, Y)$ stands for some discrimination promotion function, which ensures the discrimination power of $D$ and $A$.

As introduced in Section 1, some DL methods [4, 5, 7] learn a shared dictionary for all classes together with a classifier on the coding coefficients, while other DL methods [8, 9, 10] learn a structured dictionary to promote discrimination between classes. 
However, they all employ an $\ell_0$ or $\ell_1$-norm sparsity regularizer on the coding coefficients, making the training stage and the consequent testing stage inefficient.

In this work, we extend the conventional DL model in (1), which learns a discriminative synthesis dictionary, to a novel DPL model, which learns a pair of synthesis and analysis dictionaries. No costly $\ell_0$ or $\ell_1$-norm sparsity regularizer is required in the proposed DPL model, and the coding coefficients can be explicitly obtained by linear projection. Fortunately, DPL does not sacrifice classification accuracy while achieving this significant improvement in efficiency, as demonstrated by our extensive experiments in Section 3.

2.2 The dictionary pair learning model

The conventional discriminative DL model in (1) aims to learn a synthesis dictionary $D$ to sparsely represent the signal $X$, and a costly $\ell_1$-norm sparse coding process is needed to resolve the code $A$. Suppose that we can find an analysis dictionary, denoted by $P \in \mathbb{R}^{mK \times p}$, such that the code $A$ can be analytically obtained as $A = PX$; then the representation of $X$ becomes very efficient. Based on this idea, we propose to learn such an analysis dictionary $P$ together with the synthesis dictionary $D$, leading to the following DPL model:

\{P^*, D^*\} = \arg\min_{P, D} \|X - DPX\|_F^2 + \Psi(D, P, X, Y),  (2)

where $\Psi(D, P, X, Y)$ is some discrimination function. $D$ and $P$ form a dictionary pair: the analysis dictionary $P$ is used to analytically code $X$, and the synthesis dictionary $D$ is used to reconstruct $X$. The discrimination power of the DPL model depends on a suitable design of $\Psi(D, P, X, Y)$.

We propose to learn a structured synthesis dictionary $D = [D_1, \ldots, D_k, \ldots, D_K]$ and a structured analysis dictionary $P = [P_1; \ldots; P_k; \ldots
; P_K]$, where $\{D_k \in \mathbb{R}^{p \times m}, P_k \in \mathbb{R}^{m \times p}\}$ forms the sub-dictionary pair corresponding to class $k$. Recent studies on sparse subspace clustering [19] have proved that a sample can be represented by its corresponding dictionary if the signals satisfy certain incoherence conditions. With the structured analysis dictionary $P$, we want the sub-dictionary $P_k$ to project the samples from class $i$, $i \neq k$, to a nearly null space, i.e.,

P_k X_i \approx 0, \quad \forall k \neq i.  (3)

Clearly, with (3) the coefficient matrix $PX$ will be nearly block diagonal. On the other hand, with the structured synthesis dictionary $D$, we want the sub-dictionary $D_k$ to well reconstruct the data matrix $X_k$ from its projective code matrix $P_k X_k$; that is, the dictionary pair should minimize the reconstruction error:

\min_{P, D} \sum_{k=1}^{K} \|X_k - D_k P_k X_k\|_F^2.  (4)

Based on the above analysis, we can readily have the following DPL model:

\{P^*, D^*\} = \arg\min_{P, D} \sum_{k=1}^{K} \|X_k - D_k P_k X_k\|_F^2 + \lambda \|P_k \bar{X}_k\|_F^2, \quad \text{s.t. } \|d_i\|_2^2 \le 1,  (5)

where $\bar{X}_k$ denotes the complementary data matrix of $X_k$ in the whole training set $X$, $\lambda > 0$ is a scalar constant, and $d_i$ denotes the $i$-th atom of the synthesis dictionary $D$.

Algorithm 1 Discriminative synthesis & analysis dictionary pair learning (DPL)
Input: Training samples for $K$ classes $X = [X_1, X_2, \ldots, X_K]$, parameters $\lambda$, $\tau$, $m$;
1: Initialize $D^{(0)}$ and $P^{(0)}$ as random matrices with unit Frobenius norm, $t = 0$;
2: while not converged do
3:   $t \leftarrow t + 1$;
4:   for $k = 1 : K$ do
5:     Update $A_k^{(t)}$ by (8);
6:     Update $P_k^{(t)}$ by (10);
7:     Update $D_k^{(t)}$ by (12);
8:   end for
9: end while
Output: Analysis dictionary $P$, synthesis dictionary $D$.

We constrain the energy of each atom $d_i$ in order to avoid the trivial solution $P_k = 0$ and to make the DPL model more stable.

The DPL model in (5) is not a sparse representation model, yet it enforces group sparsity on the code matrix $PX$ (i.e., $PX$ is nearly block diagonal). Actually, the role of sparse coding in classification is still an open problem, and some researchers have argued that sparse coding may not be crucial to classification tasks [20, 21]. Our findings in this work support this argument. The DPL model leads to classification performance very competitive with that of sparse coding based DL models, but it is much faster.

2.3 Optimization

The objective function in (5) is generally non-convex. We introduce a variable matrix $A$ and relax (5) to the following problem:

\{P^*, A^*, D^*\} = \arg\min_{P, A, D} \sum_{k=1}^{K} \|X_k - D_k A_k\|_F^2 + \tau \|P_k X_k - A_k\|_F^2 + \lambda \|P_k \bar{X}_k\|_F^2, \quad \text{s.t. } \|d_i\|_2^2 \le 1,  (6)

where $\tau$ is a scalar constant. All terms in the above objective function are characterized by the Frobenius norm, so (6) can be easily solved. We initialize the analysis dictionary $P$ and the synthesis dictionary $D$ as random matrices with unit Frobenius norm, and then alternately update $A$ and $\{D, P\}$. 
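Since every term in (6) is a Frobenius-norm penalty, the relaxed objective is cheap to evaluate, and its per-iteration change serves as the stopping criterion of the training loop. Below is a minimal numpy sketch of this energy; the function and variable names are ours, not from the paper:

```python
import numpy as np

def dpl_energy(X_list, D_list, P_list, A_list, tau, lam):
    """Relaxed DPL objective (6): sum over classes k of
    ||X_k - D_k A_k||_F^2 + tau*||P_k X_k - A_k||_F^2 + lam*||P_k Xbar_k||_F^2."""
    K = len(X_list)
    energy = 0.0
    for k in range(K):
        # Xbar_k: complementary data matrix of X_k in the whole training set X
        Xbar_k = np.hstack([X_list[j] for j in range(K) if j != k])
        energy += (np.linalg.norm(X_list[k] - D_list[k] @ A_list[k], 'fro') ** 2
                   + tau * np.linalg.norm(P_list[k] @ X_list[k] - A_list[k], 'fro') ** 2
                   + lam * np.linalg.norm(P_list[k] @ Xbar_k, 'fro') ** 2)
    return energy
```

During training, the alternating iteration is stopped once the difference between this energy in two adjacent iterations falls below a threshold (0.01 in the paper).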
The minimization can be alternated between the following two steps.

(1) Fix $D$ and $P$, and update $A$:

A^* = \arg\min_A \sum_{k=1}^{K} \|X_k - D_k A_k\|_F^2 + \tau \|P_k X_k - A_k\|_F^2.  (7)

This is a standard least squares problem and we have the closed-form solution:

A_k^* = (D_k^T D_k + \tau I)^{-1} (\tau P_k X_k + D_k^T X_k).  (8)

(2) Fix $A$, and update $D$ and $P$:

P^* = \arg\min_P \sum_{k=1}^{K} \tau \|P_k X_k - A_k\|_F^2 + \lambda \|P_k \bar{X}_k\|_F^2;
D^* = \arg\min_D \sum_{k=1}^{K} \|X_k - D_k A_k\|_F^2, \quad \text{s.t. } \|d_i\|_2^2 \le 1.  (9)

The closed-form solution of $P$ can be obtained as:

P_k^* = \tau A_k X_k^T (\tau X_k X_k^T + \lambda \bar{X}_k \bar{X}_k^T + \gamma I)^{-1},  (10)

where $\gamma = 10e{-}4$ is a small number. The $D$ problem can be optimized by introducing a variable $S$:

\min_{D, S} \sum_{k=1}^{K} \|X_k - D_k A_k\|_F^2, \quad \text{s.t. } D = S, \; \|s_i\|_2^2 \le 1.  (11)

The optimal solution of (11) can be obtained by the ADMM algorithm:

D^{(r+1)} = \arg\min_D \sum_{k=1}^{K} \|X_k - D_k A_k\|_F^2 + \rho \|D_k - S_k^{(r)} + T_k^{(r)}\|_F^2;
S^{(r+1)} = \arg\min_S \sum_{k=1}^{K} \rho \|D_k^{(r+1)} - S_k + T_k^{(r)}\|_F^2, \quad \text{s.t. } \|s_i\|_2^2 \le 1;
T^{(r+1)} = T^{(r)} + D^{(r+1)} - S^{(r+1)}, \; \text{update } \rho \text{ if appropriate}.  (12)

Figure 1: (a) The representation codes $\|P_k^* y\|_2^2$ and (b) the reconstruction error $\|y - D_k^* P_k^* y\|_2^2$ on the Extended YaleB dataset.

In each step of the optimization we have closed-form solutions for the variables $A$ and $P$, and the ADMM based optimization of $D$ converges rapidly. 
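The three updates (8), (10) and (12) can be sketched in numpy as follows. This is a minimal illustration under our own naming, not the authors' Matlab implementation; the ADMM penalty rho and iteration count are assumptions:

```python
import numpy as np

def update_A(X_k, D_k, P_k, tau):
    # Closed-form update of the coding matrix, Eq. (8):
    # A_k = (D_k^T D_k + tau*I)^{-1} (tau*P_k X_k + D_k^T X_k)
    m = D_k.shape[1]
    return np.linalg.solve(D_k.T @ D_k + tau * np.eye(m),
                           tau * (P_k @ X_k) + D_k.T @ X_k)

def update_P(X_k, Xbar_k, A_k, tau, lam, gamma=1e-4):
    # Closed-form update of the analysis sub-dictionary, Eq. (10):
    # P_k = tau*A_k X_k^T (tau*X_k X_k^T + lam*Xbar_k Xbar_k^T + gamma*I)^{-1}.
    # G is fixed across iterations, so in practice its inverse (or a
    # factorization) can be cached once per class (see Section 2.5).
    p = X_k.shape[0]
    G = tau * (X_k @ X_k.T) + lam * (Xbar_k @ Xbar_k.T) + gamma * np.eye(p)
    B = tau * (A_k @ X_k.T)
    return np.linalg.solve(G, B.T).T  # solves P_k G = B with symmetric G

def update_D(X_k, A_k, D_init, rho=1.0, n_admm=20):
    # ADMM for the constrained D_k sub-problem, Eqs. (11)-(12):
    # min_D ||X_k - D A_k||_F^2  s.t. each column of D has l2 norm <= 1.
    m = A_k.shape[0]
    D, S, T = D_init.copy(), D_init.copy(), np.zeros_like(D_init)
    for _ in range(n_admm):
        # D-step: least squares with a proximity term toward S - T
        M = A_k @ A_k.T + rho * np.eye(m)
        D = np.linalg.solve(M, (X_k @ A_k.T + rho * (S - T)).T).T
        # S-step: project each column of D + T onto the unit l2 ball
        V = D + T
        S = V / np.maximum(np.linalg.norm(V, axis=0), 1.0)
        # T-step: dual variable update
        T = T + D - S
    return S
```

Each function handles one class $k$; Algorithm 1 simply loops these three updates over all classes until the energy stabilizes.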
The training of the proposed DPL model is much faster than that of most previous discriminative DL methods. The proposed DPL algorithm is summarized in Algorithm 1. When the difference between the energies in two adjacent iterations is less than 0.01, the iteration stops. The analysis dictionary $P$ and the synthesis dictionary $D$ are then output for classification.

One can see that the first sub-objective function in (9) is a discriminative analysis dictionary learner, focusing on promoting the discriminative power of $P$; the second sub-objective function in (9) is a representative synthesis dictionary learner, aiming to minimize the reconstruction error of the input signal with the coding coefficients generated by the analysis dictionary $P$. When the minimization process converges, a balance between the discrimination and representation power of the model can be achieved.

2.4 Classification scheme

In the DPL model, the analysis sub-dictionary $P_k^*$ is trained to produce small coefficients for samples from classes other than $k$, and it can only generate significant coding coefficients for samples from class $k$. Meanwhile, the synthesis sub-dictionary $D_k^*$ is trained to reconstruct the samples of class $k$ from their projective coefficients $P_k^* X_k$; that is, the residual $\|X_k - D_k^* P_k^* X_k\|_F$ will be small. On the other hand, since $P_k^* X_i$, $i \neq k$, will be small and $D_k^*$ is not trained to reconstruct $X_i$, the residual $\|X_i - D_k^* P_k^* X_i\|_F$ will be much larger than $\|X_k - D_k^* P_k^* X_k\|_F$.

In the testing phase, if the query sample $y$ is from class $k$, its projective coding vector by $P_k^*$ (i.e., $P_k^* y$) will more likely be significant, while its projective coding vectors by $P_i^*$, $i \neq k$, tend to be small. 
Consequently, the reconstruction residual $\|y - D_k^* P_k^* y\|_2^2$ tends to be much smaller than the residuals $\|y - D_i^* P_i^* y\|_2^2$, $i \neq k$. Let us use the Extended YaleB face dataset [22] to illustrate this. (The detailed experimental setting can be found in Section 3.) Fig. 1(a) shows the $\ell_2$-norm of the coefficients $P_k^* y$, where the horizontal axis refers to the index of $y$ and the vertical axis refers to the index of $P_k^*$. One can clearly see that $\|P_k^* y\|_2^2$ has a nearly block diagonal structure, and the diagonal blocks are produced by the query samples which have the same class labels as $P_k^*$. Fig. 1(b) shows the reconstruction residual $\|y - D_k^* P_k^* y\|_2^2$. One can see that $\|y - D_k^* P_k^* y\|_2^2$ also has a block diagonal structure, and only the diagonal blocks have small residuals. Clearly, the class-specific reconstruction residual can be used to identify the class label of $y$, and we naturally have the following classifier associated with the DPL model:

\text{identity}(y) = \arg\min_i \|y - D_i P_i y\|_2.  (13)

2.5 Complexity and convergence

Complexity. In the training phase of DPL, $A_k$, $P_k$ and $D_k$ are updated alternately. In each iteration, the time complexities of updating $A_k$, $P_k$ and $D_k$ are $O(mpn + m^3 + m^2 n)$, $O(mnp + p^3 + mp^2)$ and $O(W(pmn + m^3 + m^2 p + p^2 m))$, respectively, where $W$ is the number of iterations of the ADMM algorithm for updating $D$. We experimentally found that in most cases $W$ is less than 20. In many applications, the number of training samples and the number of dictionary atoms per class are much smaller than the dimension $p$. 
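The decision rule (13) amounts to $K$ matrix-vector products and residual norms. A minimal numpy sketch (names are ours, assuming the sub-dictionaries have already been learned):

```python
import numpy as np

def dpl_classify(y, D_list, P_list):
    """Classifier (13): assign y to the class whose dictionary pair
    (D_i, P_i) yields the smallest reconstruction residual ||y - D_i P_i y||_2."""
    residuals = [np.linalg.norm(y - D @ (P @ y)) for D, P in zip(D_list, P_list)]
    return int(np.argmin(residuals))
```

Note that $P_i y$ is an $m$-vector and $D_i (P_i y)$ a $p$-vector, so each class costs only $O(mp)$ operations.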
Thus the major computational burden in the training phase of DPL is on updating $P_k$, which involves the inverse of the $p \times p$ matrix $(\tau X_k X_k^T + \lambda \bar{X}_k \bar{X}_k^T + \gamma I)$. Fortunately, this matrix does not change across iterations, and thus its inverse can be pre-computed. This greatly accelerates the training process.

In the testing phase, our classification scheme is very efficient. The computation of the class-specific reconstruction error $\|y - D_k^* P_k^* y\|_2$ only has a complexity of $O(mp)$. Thus, the total complexity of our model to classify a query sample is $O(Kmp)$.

Convergence. The objective function in (6) is a bi-convex problem for $\{(D, P), A\}$, i.e., by fixing $A$ the function is convex in $D$ and $P$, and by fixing $D$ and $P$ the function is convex in $A$. The convergence of such problems has already been intensively studied [23], and the proposed optimization algorithm is actually an alternate convex search (ACS) algorithm. Since we have the optimal solutions for updating $A$, $P$ and $D$, and our objective function has a general lower bound of 0, our algorithm is guaranteed to converge to a stationary point. A detailed convergence analysis can be found in our supplementary file.

It is empirically found that the proposed DPL algorithm converges rapidly. Fig. 2 shows the convergence curve of our algorithm on the AR face dataset [24]. One can see that the energy drops quickly and becomes very small after 10 iterations. 

Figure 2: The convergence curve of DPL on the AR database.

In most of our experiments, our algorithm converges in less than 20 iterations.

3 Experimental Results

We evaluate the proposed DPL method on various visual classification datasets, including two face databases (Extended YaleB [22] and AR [24]), one object categorization database (Caltech101 [25]), and one action recognition database (UCF 50 action [26]). These datasets are widely used in previous works [5, 9] to evaluate DL algorithms.

Besides the classification accuracy, we also report the training and testing time of the competing algorithms. All the competing algorithms are implemented in Matlab except for SVM, which is implemented in C. All experiments are run on a desktop PC with a 3.5GHz Intel CPU and 8 GB memory. The testing time is calculated as the average processing time to classify a single query sample.

3.1 Parameter setting

There are three parameters, $m$, $\lambda$ and $\tau$, in the proposed DPL model. To achieve the best performance, in the face recognition and object recognition experiments we set the number of dictionary atoms to its maximum (i.e., the number of training samples) for all competing DL algorithms, including the proposed DPL. In the action recognition experiment, since the number of samples per class is relatively large, we set the number of dictionary atoms of each class to 50 for all the DL algorithms. Parameter $\tau$ is an algorithm parameter, and the regularization parameter $\lambda$ controls the discriminative property of $P$. In all the experiments, we choose $\lambda$ and $\tau$ by 10-fold cross validation on each dataset. 
For all the competing methods, we tune their parameters for the best performance.

3.2 Competing methods

We compare the proposed DPL method with the following methods: the baseline nearest subspace classifier (NSC) and linear support vector machine (SVM), sparse representation based classification (SRC) [2] and collaborative representation based classification (CRC) [21], and the state-of-the-art DL algorithms DLSI [8], FDDL [9] and LC-KSVD [5]. The original DLSI represents the test sample by each class-specific sub-dictionary. The results in [9] have shown that by coding the test sample collaboratively over the whole dictionary, the classification performance can be greatly improved. Therefore, we follow the use of DLSI in [9] and denote this method as DLSI(C). Of the two variants of LC-KSVD proposed in [5], we adopt LC-KSVD2 since it consistently produces better classification accuracy.

3.3 Face recognition

We first evaluate our algorithm on two widely used face datasets: Extended YaleB [22] and AR [24]. The Extended YaleB database has large variations in illumination and expression, as illustrated in Fig. 3(a). The AR database involves many variations such as illumination, expression, and sunglass and scarf occlusion, as illustrated in Fig. 3(b).

Figure 3: Sample images in the (a) Extended YaleB and (b) AR databases.

We follow the experimental settings in [5] for fair comparison with state-of-the-art methods. A set of 2,414 face images of 38 persons is extracted from the Extended YaleB database. We randomly select half of the images per subject for training and use the other half for testing. For the AR database, a set of 2,600 images of 50 female and 50 male subjects is extracted. 20 images of each subject are used for training and the remaining 6 images are used for testing. We use the features provided by Jiang et al. [5] to represent the face images. 
The feature dimension is 504 for Extended YaleB and 540 for AR. The parameter $\tau$ is set to 0.05 on both datasets, and $\lambda$ is set to 3e-3 and 5e-3 on the Extended YaleB and AR datasets, respectively. In these two experiments, we also compare with the max-margin dictionary learning (MMDL) [10] algorithm, whose recognition accuracy is quoted from the original paper; its training/testing time is not available.

Table 1: Results on the Extended YaleB database.

Method    Accuracy (%)  Training time (s)  Testing time (s)
NSC       94.7          no need            1.41e-3
SVM       95.6          0.70               3.49e-5
CRC       97.0          no need            1.92e-3
SRC       96.5          no need            2.16e-2
DLSI(C)   97.0          567.47             4.30e-2
FDDL      96.7          6,574.6            1.43
LC-KSVD   96.7          412.58             4.22e-4
MMDL      97.3          -                  -
DPL       97.5          4.38               1.71e-4

Table 2: Results on the AR database.

Method    Accuracy (%)  Training time (s)  Testing time (s)
NSC       92.0          no need            3.29e-3
SVM       96.5          3.42               6.16e-5
CRC       98.0          no need            5.08e-3
SRC       97.5          no need            3.42e-2
DLSI(C)   97.5          2,470.5            0.16
FDDL      97.5          61,709             2.50
LC-KSVD   97.8          1,806.3            7.72e-4
MMDL      97.3          -                  -
DPL       98.3          11.30              3.93e-4

Extended YaleB database. The recognition accuracies and training/testing times of the different algorithms on the Extended YaleB database are summarized in Table 1. The proposed DPL algorithm achieves the best accuracy, slightly higher than MMDL, DLSI(C), LC-KSVD and FDDL. Moreover, DPL has an obvious efficiency advantage over the other DL algorithms.

AR database. The recognition accuracies and running times on the AR database are shown in Table 2. DPL achieves the best results among all the competing algorithms. Compared with the experiment on Extended YaleB, this experiment has more training samples and a higher feature dimension, so DPL's efficiency advantage is much more pronounced. 
In training, it is more than 159 times faster than DLSI and LC-KSVD, and 5,460 times faster than FDDL.

3.4 Object recognition

In this section we test DPL on object categorization using the Caltech101 database [25], which includes 9,144 images from 102 classes (101 common object classes and a background class). The number of samples in each category varies from 31 to 800. Following the experimental settings in [5, 27], 30 samples per category are used for training and the rest are used for testing. We use the standard bag-of-words (BOW) + spatial pyramid matching (SPM) framework [27] for feature extraction. Dense SIFT descriptors are extracted on three grids of sizes 1×1, 2×2, and 4×4 to calculate the SPM features. For a fair comparison with [5], we use the vector quantization based coding method to extract the mid-level features and the standard max pooling approach to build the high-dimensional pooled features. Finally, the original 21,504-dimensional data is reduced to 3,000 dimensions by PCA. The parameters $\tau$ and $\lambda$ used in our algorithm are 0.05 and 1e-4, respectively.

Table 3: Recognition accuracy (%) & running time (s) on the Caltech101 database.

Method    Accuracy  Training time  Testing time
NSC       70.1      no need        1.79e-2
SVM       64.6      14.6           1.81e-4
CRC       68.2      no need        1.38e-2
SRC       70.7      no need        1.09
DLSI(C)   73.1      97,200         1.46
FDDL      73.2      104,000        12.86
LC-KSVD   73.6      12,700         4.17e-3
DPL       73.9      134.6          1.29e-3

The experimental results are listed in Table 3. Again, DPL achieves the best performance. Though its classification accuracy is only slightly better than that of the other DL methods, its advantage in training/testing time is huge.

3.5 Action recognition

Action recognition is an important yet very challenging task, and it has been attracting great research interest in recent years. 
We test our algorithm on the UCF 50 action database [26], which includes 6,680 human action videos from YouTube in 50 categories. We use the action bank features [28] and five-fold data splitting to evaluate our algorithm. For all the compared methods, the feature dimension is reduced to 5,000 by PCA. The parameters $\tau$ and $\lambda$ used in our algorithm are both 0.01. The results of the different methods are reported in Table 4. Our DPL algorithm achieves much higher accuracy than its competitors. FDDL has the second highest accuracy; however, it is 1,666 times slower than DPL in training and 83,317 times slower than DPL in testing.

Table 4: Recognition accuracy (%) & running time (s) on the UCF50 action database.

Method    Accuracy  Training time  Testing time
NSC       51.8      no need        6.11e-2
SVM       57.9      59.8           5.02e-4
CRC       60.3      no need        6.76e-2
SRC       59.6      no need        8.92
DLSI(C)   60.0      397,000        10.11
FDDL      61.1      415,000        89.15
LC-KSVD   53.6      9,272.4        0.12
DPL       62.9      249.0          1.07e-3

4 Conclusion

We proposed a novel projective dictionary pair learning (DPL) model for pattern classification tasks. Different from conventional dictionary learning (DL) methods, which learn a single synthesis dictionary, DPL jointly learns a synthesis dictionary and an analysis dictionary. Such a pair of dictionaries works together to perform representation and discrimination simultaneously. Compared with previous DL methods, DPL employs projective coding, which largely reduces the computational burden in learning and testing. Performance evaluation was conducted on publicly accessible visual classification datasets. 
DPL exhibits classification accuracy highly competitive with state-of-the-art DL methods, while being significantly more efficient, e.g., hundreds to thousands of times faster than LC-KSVD and FDDL in training and testing.

References

[1] Aharon, M., Elad, M., Bruckstein, A.: K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing 54(11) (2006) 4311–4322
[2] Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S., Ma, Y.: Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2) (2009) 210–227
[3] Rubinstein, R., Bruckstein, A.M., Elad, M.: Dictionaries for sparse representation modeling. Proceedings of the IEEE 98(6) (2010) 1045–1057
[4] Mairal, J., Bach, F., Ponce, J.: Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(4) (2012) 791–804
[5] Jiang, Z., Lin, Z., Davis, L.: Label consistent K-SVD: Learning a discriminative dictionary for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(11) (2013) 2651–2664
[6] Elad, M., Aharon, M.: Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing 15(12) (2006) 3736–3745
[7] Mairal, J., Bach, F., Ponce, J., Sapiro, G., Zisserman, A.: Supervised dictionary learning. In: NIPS. (2008)
[8] Ramirez, I., Sprechmann, P., Sapiro, G.: Classification and clustering via dictionary learning with structured incoherence and shared features. In: CVPR. (2010)
[9] Yang, M., Zhang, L., Feng, X., Zhang, D.: Fisher discrimination dictionary learning for sparse representation. In: ICCV. (2011)
[10] Wang, Z., Yang, J., Nasrabadi, N., Huang, T.: A max-margin perspective on sparse representation-based classification. In: ICCV.
(2013)
[11] Lee, H., Battle, A., Raina, R., Ng, A.Y.: Efficient sparse coding algorithms. In: NIPS. (2007)
[12] Hale, E.T., Yin, W., Zhang, Y.: Fixed-point continuation for ℓ1-minimization: Methodology and convergence. SIAM Journal on Optimization 19(3) (2008) 1107–1130
[13] Gregor, K., LeCun, Y.: Learning fast approximations of sparse coding. In: ICML. (2010)
[14] Ranzato, M., Poultney, C., Chopra, S., Cun, Y.L.: Efficient learning of sparse representations with an energy-based model. In: NIPS. (2006)
[15] Chen, Y., Pock, T., Bischof, H.: Learning ℓ1-based analysis and synthesis sparsity priors using bi-level optimization. In: NIPS Workshop. (2012)
[16] Elad, M., Milanfar, P., Rubinstein, R.: Analysis versus synthesis in signal priors. Inverse Problems 23(3) (2007) 947
[17] Sprechmann, P., Litman, R., Yakar, T.B., Bronstein, A., Sapiro, G.: Efficient supervised sparse analysis and synthesis operators. In: NIPS. (2013)
[18] Feng, Z., Yang, M., Zhang, L., Liu, Y., Zhang, D.: Joint discriminative dimensionality reduction and dictionary learning for face recognition. Pattern Recognition 46(8) (2013) 2134–2143
[19] Soltanolkotabi, M., Elhamifar, E., Candes, E.: Robust subspace clustering. arXiv preprint arXiv:1301.2603 (2013)
[20] Coates, A., Ng, A.Y.: The importance of encoding versus training with sparse coding and vector quantization. In: ICML. (2011)
[21] Zhang, L., Yang, M., Feng, X.: Sparse representation or collaborative representation: Which helps face recognition? In: ICCV. (2011)
[22] Georghiades, A., Belhumeur, P., Kriegman, D.: From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(6) (2001) 643–660
[23] Gorski, J., Pfeuffer, F., Klamroth, K.: Biconvex sets and optimization with biconvex functions: A survey and extensions.
Mathematical Methods of Operations Research 66(3) (2007) 373–407
[24] Martinez, A., Benavente, R.: The AR face database. CVC Technical Report (1998)
[25] Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding 106(1) (2007) 59–70
[26] Reddy, K.K., Shah, M.: Recognizing 50 human action categories of web videos. Machine Vision and Applications 24(5) (2013) 971–981
[27] Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR. (2006)
[28] Sadanand, S., Corso, J.J.: Action bank: A high-level representation of activity in video. In: CVPR. (2012)