{"title": "Efficient Optimization for Linear Dynamical Systems with Applications to Clustering and Sparse Coding", "book": "Advances in Neural Information Processing Systems", "page_first": 3444, "page_last": 3454, "abstract": "Linear Dynamical Systems (LDSs) are fundamental tools for modeling spatio-temporal data in various disciplines. Though rich in modeling, analyzing LDSs is not free of difficulty, mainly because LDSs do not comply with Euclidean geometry and hence conventional learning techniques can not be applied directly. In this paper, we propose an efficient projected gradient descent method to minimize a general form of a loss function and demonstrate how clustering and sparse coding with LDSs can be solved by the proposed method efficiently. To this end, we first derive a novel canonical form for representing the parameters of an LDS, and then show how gradient-descent updates through the projection on the space of LDSs can be achieved dexterously. In contrast to previous studies, our solution avoids any approximation in LDS modeling or during the optimization process. Extensive experiments reveal the superior performance of the proposed method in terms of the convergence and classification accuracy over state-of-the-art techniques.", "full_text": "Ef\ufb01cient Optimization for Linear Dynamical Systems\nwith Applications to Clustering and Sparse Coding\n\nWenbing Huang1,3, Mehrtash Harandi2, Tong Zhang2\n\nLijie Fan3, Fuchun Sun3, Junzhou Huang1\n\n1 Tencent AI Lab. ;\n\n2 Data61, CSIRO and Australian National University, Australia;\n\n3 Department of Computer Science and Technology, Tsinghua University,\nTsinghua National Lab. for Information Science and Technology (TNList);\n\n1{helendhuang, joehhuang}@tencent.com\n\n2{mehrtash.harandi@data61.csiro.au, tong.zhang@anu.edu.cn}\n\n3{flj14@mails, fcsun@mail}.tsinghua.edu.cn\n\nAbstract\n\nLinear Dynamical Systems (LDSs) are fundamental tools for modeling spatio-\ntemporal data in various disciplines. 
Though rich in modeling, analyzing LDSs is not free of difficulty, mainly because LDSs do not comply with Euclidean geometry and hence conventional learning techniques cannot be applied directly. In this paper, we propose an efficient projected gradient descent method to minimize a general form of a loss function and demonstrate how clustering and sparse coding with LDSs can be solved by the proposed method efficiently. To this end, we first derive a novel canonical form for representing the parameters of an LDS, and then show how gradient-descent updates through the projection on the space of LDSs can be achieved dexterously. In contrast to previous studies, our solution avoids any approximation in LDS modeling or during the optimization process. Extensive experiments reveal the superior performance of the proposed method in terms of convergence and classification accuracy over state-of-the-art techniques.

1 Introduction

Learning from spatio-temporal data is an active research area in computer vision, signal processing and robotics. Examples include dynamic texture classification [1], video action recognition [2, 3, 4] and robotic tactile sensing [5]. One popular class of models for analyzing spatio-temporal data is the Linear Dynamical System (LDS) [1]. Specifically, LDSs apply parametric equations to model the spatio-temporal data. The optimal system parameters learned from the input are employed as the descriptor of each spatio-temporal sequence. The benefits of applying LDSs are two-fold: 1. LDSs are generative models and their parameters are learned in an unsupervised manner. This makes LDSs suitable choices not only for classification but also for interpolation/extrapolation/generation of spatio-temporal sequences [1, 6, 7]; 2. 
Unlike vectorial ARMA models [8], LDSs are less prone to the curse of dimensionality as a result of their lower-dimensional state space [9].

Clustering [10] and coding [5] LDSs are two fundamental problems that motivate this work. The clustering task is to group LDS models based on given similarity metrics. The problem of coding, especially sparse coding, is to identify a dictionary of LDSs along with their associated sparse codes that best reconstruct a collection of LDSs. Given a set of LDSs, the key problems of clustering and sparse coding are computing the mean and finding the LDS atoms, respectively, neither of which is an easy task by any measure. Due to an infinite number of equivalent transformations for

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

the system parameters [1], the space of LDSs is non-Euclidean. This in turn makes the direct use of traditional techniques (e.g., conventional sparse solvers) inapplicable. To get around the difficulties induced by the non-Euclidean geometry, previous studies (e.g., [11, 12, 13, 5]) resort to various approximations, either in modeling or during optimization. For instance, the authors in [11] approximated the clustering mean by finding the closest sample under a certain embedding. As we will see in our experiments, introducing approximations into the solutions inevitably limits algorithmic performance.

This paper develops a gradient-based method to solve the clustering and sparse coding tasks efficiently without any approximation involved. To this end, we reformulate the optimization problems for these two different tasks and then unify them into one common problem by making use of the kernel trick. However, there exist several challenges in addressing this common problem efficiently. The first challenge comes from the aforementioned invariance property of the LDS parameters. 
To attack this challenge, we introduce a novel canonical form of the system parameters that is insensitive to the equivalent changes. The second challenge comes from the fact that the optimization problem of interest requires solving Discrete Lyapunov Equations (DLEs). At first glance, such a dependency makes backpropagating the gradients through DLEs more complicated. Interestingly, we prove that the gradients can be exactly derived by solving another DLE in the end, which makes our optimization much simpler and more efficient. Finally, as suggested by [14], the LDS parameters, i.e., the transition and measurement matrices, are required to be stable and orthogonal, respectively. Under our canonical representation, the stability constraint is reduced to a bound constraint. We then make use of the Cayley transformation [15] to maintain orthogonality and perform bound-normalization to accomplish stability. Clustering and sparse coding can be combined with high-level pooling frameworks (e.g., bag-of-systems [11] and spatial-temporal-pyramid-matching [16]) for classifying dynamic textures. Our experiments on such data demonstrate that the proposed methods outperform state-of-the-art techniques in terms of convergence and classification accuracy.

2 Related Work

LDS modeling. In the literature, various non-Euclidean metrics have been proposed to measure the distances between LDSs, such as the Kullback-Leibler divergence [17], Chernoff distance [18], Binet-Cauchy kernel [19] and group distance [14]. This paper follows the works in [20, 21, 11, 12] to represent an LDS by making use of the extended observability subspace; comparing LDSs is then achieved by measuring the subspace angles [22].

Clustering LDSs. In its simplest form, clustering LDSs can be achieved by alternating between two sub-processes: 1) assigning LDSs to the closest clusters using a similarity measure; 2) computing the mean of the LDSs within the same cluster. 
However, as the space of LDSs is non-Euclidean, computing means on this space is not straightforward. In [12], the authors embedded LDSs into a finite Grassmann manifold by representing each LDS with its finite observability subspace and then clustered LDSs on that manifold. In contrast, our method applies the extended observability subspace to represent LDSs. In this way, not only is the full temporal evolution of the input sequence taken into account, but also, as will be shown shortly, the computational cost is reduced. The solution proposed by [11] also represents LDSs with extended observability subspaces, but it approximates the mean by finding a sample that is closest to the mean using the concept of Multidimensional Scaling (MDS). Instead, our method finds the system tuple of the exact mean for the given group of LDSs without relying on any approximation. Afsari et al. [14] cluster LDSs by first aligning the parameters of the LDSs in their equivalence space. However, the method of Afsari et al. is agnostic to the joint behavior of the transition and measurement matrices and treats them independently. Other related studies include probabilistic frameworks for clustering LDSs [23, 24].

Sparse Coding with LDSs. Combining sparse coding with LDS modeling can further promote classification performance [13]. However, similar to the clustering task, the non-Euclidean structure makes it hard to formulate the reconstruction objective and update the dictionary atoms on the space of LDSs. To address this issue, [13] embedded LDSs into the space of symmetric matrices by representing each LDS with its finite observability subspace. With this embedding, dictionary learning can be performed in Euclidean space. In [5], the authors employ the extended observability subspaces as the LDS descriptors; however, to update the dictionary, the authors enforce symmetric constraints on the transition matrices. 
Different from previous studies, our model works on the original LDS model and does not enforce any additional constraint on the transition matrices.

To sum up, in contrast to previous studies [12, 11, 14, 13, 5], this paper solves the clustering and sparse coding problems in a novel way regarding the following aspects. First, we unify the objective functions optimized for both clustering and sparse coding; second, we avoid any additional constraints (e.g., symmetric transitions in [5] and finite observability in [12, 13]) in the solution; finally, we propose a canonical formulation of the LDS tuple to facilitate the optimization.

3 LDS Modeling

LDSs describe time series through the following model [1]:

    y(t) = ȳ + Cx(t) + w(t)
    x(t + 1) = Ax(t) + Bv(t),                                    (1)

with R^{m×τ} ∋ Y = [y(1), ..., y(τ)] and R^{n×τ} ∋ X = [x(1), ..., x(τ)] representing the observed variables and the hidden states of the system, respectively. Furthermore, ȳ ∈ R^m is the mean of Y; A ∈ R^{n×n} is the transition matrix of the model; B ∈ R^{n×n_v} (n_v ≤ n) is the noise transformation matrix; C ∈ R^{m×n} is the measurement matrix; v(t) ∼ N(0, I_{n_v}) and w(t) ∼ N(0, Ω) denote the process and measurement noise components, respectively. We also assume that n ≪ m and that C has full rank. Overall, generating the observed variables is governed by the parameters Θ = {x(1), ȳ, A, B, C, Ω}.

System Identification. The system parameters A and C of Eq. (1) describe the dynamics and spatial patterns of the input sequence, respectively [11]. Therefore, the tuple (A, C) is a desired descriptor for spatio-temporal data. Finding the optimal tuple (A, C) is known as system identification. A popular and efficient method for system identification is proposed in [1]. 
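To make the identification step concrete, the following is a minimal numpy sketch in the spirit of the SVD-based method of [1]: C is taken from the top singular vectors of the centered observations, and A is fitted by least squares on the estimated hidden states. The function name `identify_lds` and the omission of noise-parameter estimation are our simplifications, not the authors' exact code.

```python
import numpy as np

def identify_lds(Y, n):
    """SVD-based identification sketch: returns (y_bar, A, C, X)."""
    # Center the data and take the top-n SVD: C spans the principal
    # spatial directions, X holds the estimated hidden states.
    y_bar = Y.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Y - y_bar, full_matrices=False)
    C = U[:, :n]                     # measurement matrix, C^T C = I_n
    X = np.diag(s[:n]) @ Vt[:n]      # hidden-state estimates
    # Transition matrix by least squares: x(t+1) ~ A x(t).
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
    return y_bar, A, C, X
```

For noiseless low-rank data the reconstruction ȳ + CX is exact, and C lands on the Stiefel manifold by construction, which is exactly the property the paper relies on below.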
This method requires the columns of C to be orthogonal, i.e., C is a point on the Stiefel manifold defined as ST(m, n) = {C ∈ R^{m×n} | C^T C = I_n}. The transition matrix A obtained by the method of [1] is not necessarily stable. An LDS is stable if its spectral radius, i.e., the maximum eigenvalue magnitude of its transition matrix, denoted by ρ(A), is less than one. To obtain a stable transition matrix, [5] propose a soft-normalization technique, which is our choice in this paper. Therefore, we are interested in LDS tuples satisfying the constraints

    C = {(A, C) : C^T C = I_n, ρ(A) < 1}.                        (2)

Equivalent Representation. Studying Eq. (1) shows that the output of the system remains unchanged under linear transformations of the state basis [1]. More specifically, an LDS has an equivalent class of representations, i.e.,

    (A, C) ∼ (P^T A P, C P)                                      (3)

for any P ∈ O(n)^1. For simplicity, the equivalence in Eq. (3) is called P-equivalence. Obviously, comparing LDSs through the Euclidean distance between the associated tuples is inaccurate as a result of P-equivalence. To circumvent this difficulty, a family of approaches apply the extended observability subspace to represent an LDS [20, 21, 11, 5]. Below, we briefly review this topic.

Extended Observability Subspace. The expected output sequence of Eq. (1) [12] is calculated as in Eq. (4), where O_∞(A, C) ∈ R^{∞×n} is called the extended observability matrix of the LDS associated with (A, C). Let S(A, C) denote the extended observability subspace spanned by the columns of O_∞(A, C). Obviously, the extended observability subspace is invariant to P-equivalence, i.e., S(A, C) = S(P^T A P, C P). In addition, the extended observability subspace is capable of containing the full temporal evolution of the input sequence, as observed from Eq. 
(4).

    [E[y(1)]; E[y(2)]; E[y(3)]; ···] = [C; CA; CA^2; ···] x(1) = O_∞(A, C) x(1).   (4)

4 Our Approach

In this section, we first unify the optimizations for clustering and sparse coding with LDSs by making use of kernel functions. Next, we present our method to address this optimization problem.

^1 In general, (A, C) ∼ (P^{-1} A P, C P) for P ∈ GL(n), with GL(n) denoting the non-singular n×n matrices. Since we are interested in orthogonal measurement matrices (i.e., C ∈ ST(m, n)), the equivalence class takes the form described in Eq. (3).

4.1 Problem Formulation

We recall that each LDS is represented by its extended observability subspace. Clustering or sparse coding in the space of extended observability subspaces is not straightforward because the underlying geometry is non-Euclidean. Our idea here is to implicitly map the subspaces to a Reproducing Kernel Hilbert Space (RKHS). For better readability, we abbreviate the subspace S(A_i, C_i) as S_i in the rest of this section if no ambiguity is caused. We denote the implicit mapping defined by a positive definite kernel k(S_1, S_2) = φ(S_1)^T φ(S_2) as φ : S → H. Various kernels [25, 19, 5] based on extended observability subspaces have been proposed to measure the similarity between LDSs. Though the proposed method is general in nature, in the rest of the paper we employ the projection kernel [5] due to its simplicity. The projection kernel is defined as

    k_p(S_1, S_2) = Tr(G_{11}^{-1} G_{12} G_{22}^{-1} G_{21}),   (5)

where Tr(·) computes the trace and the product matrices G_{ij} = O_∞^T(A_i, C_i) O_∞(A_j, C_j) = Σ_{t=0}^∞ (A_i^T)^t C_i^T C_j A_j^t, for i, j ∈ {1, 2}, are obtained by solving the following DLE:

    A_i^T G_{ij} A_j − G_{ij} = −C_i^T C_j.                      (6)

The solution of the DLE exists and is unique when both A_i and A_j are stable [22]. 
The DLE can be solved by a numerical algorithm with a computational complexity of O(n^3) [26], where n is the hidden dimension and is usually very small (see Eq. (1)).

Clustering. As discussed before, the key to clustering is computing the mean of a given set of LDSs. While several works [12, 11, 14] have been developed for computing the mean, none of their solutions are derived in kernel form. The mean defined by the implicit mapping is

    min_{A_m, C_m} (1/N) Σ_{i=1}^N ∥φ(S_m) − φ(S_i)∥^2   s.t. (A_m, C_m) ∈ C,   (7)

where S_m is the mean subspace and the S_i are the data subspaces. Removing the terms that are independent of S_m (e.g., φ(S_m)^T φ(S_m) = 1) leads to

    min_{A_m, C_m} −(2/N) Σ_{i=1}^N k(S_m, S_i)   s.t. (A_m, C_m) ∈ C.   (8)

Sparse Coding. The problem of sparse coding in the RKHS is written as [13]

    min_{{A'_j, C'_j}_{j=1}^J, {z_i}} (1/N) Σ_{i=1}^N ∥φ(S_i) − Σ_{j=1}^J z_{i,j} φ(S'_j)∥^2 + λ∥z_i∥_1,
    s.t. (A'_j, C'_j) ∈ C, j = 1, ···, J,   (9)

where {S_i}_{i=1}^N are the data subspaces; {S'_j}_{j=1}^J are the dictionary subspaces; z_{i,j} is the sparse code of datum S_i over atom S'_j; R^J ∋ z_i = [z_{i,1}; ···; z_{i,J}]; and λ is the sparsity factor. Eq. (9) shares the same form as those in [13, 5]; however, here we apply the extended observability subspaces and impose no additional constraint on the transition matrices.

To perform sparse coding, we alternate between two phases: 1) computing the sparse codes given the LDS dictionary, which is similar to the conventional sparse coding task [13]; 2) optimizing each dictionary atom with the codes fixed. 
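After expanding the RKHS norm, phase 1 (codes with the dictionary fixed) reduces to a lasso-type problem in the kernel domain: minimize z^T K' z − 2 z^T k_i + λ∥z∥_1, where K' is the J×J atom-atom kernel matrix and k_i the vector of kernels between the atoms and datum S_i. A minimal ISTA sketch of this step follows; it is our illustration of the conventional sparse-coding subproblem, not necessarily the solver used in [13].

```python
import numpy as np

def sparse_codes_ista(K_dict, k_data, lam=0.1, n_iter=2000):
    """Minimize z^T K' z - 2 z^T k + lam*||z||_1 by ISTA:
    a gradient step on the smooth part, then soft-thresholding."""
    J = K_dict.shape[0]
    z = np.zeros(J)
    # Lipschitz constant of the smooth gradient 2(K'z - k) is 2*lambda_max
    step = 1.0 / (2.0 * max(np.linalg.eigvalsh(K_dict).max(), 1e-12))
    for _ in range(n_iter):
        z = z - step * 2.0 * (K_dict @ z - k_data)
        z = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # prox of lam*||.||_1
    return z
```

With λ = 0 this converges to the unpenalized solution K'^{-1} k_i, which gives a simple correctness check.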
Specifically, updating the r-th atom with the other atoms fixed gives the kernel formulation of the objective as

    Γ_r = (1/N) Σ_{i=1}^N [ −z_{i,r} k(S'_r, S_i) + Σ_{j=1, j≠r}^J z_{i,r} z_{i,j} k(S'_r, S'_j) ].   (10)

Common Problem. Clearly, Eq. (8) and (10) have the common form

    min_{A,C} (1/N) Σ_{i=1}^N β_i k(S(A, C), S(A_i, C_i))   s.t. (A, C) ∈ C.   (11)

Here, (A, C) is the LDS tuple to be identified; {(A_i, C_i)}_{i=1}^N are given LDSs; and {β_i}_{i=1}^N are task-dependent coefficients (specified in Eq. (8) and Eq. (10)). To minimize (11), we resort to the Projected Gradient Descent (PGD) method. Note that the solution space in (11) is redundant due to the invariance induced by P-equivalence (Eq. (3)). We thus devise a canonical representation of the system tuple (see Theorem 1). The canonical form not only confines the search space but also simplifies the stability constraint to a bound constraint. We then compute the gradients with respect to the system tuple by backpropagating the gradients through DLEs (see Theorem 4). Finally, we project the gradients onto the feasible regions of the system tuples via the Cayley transformation (Eqs. (16)-(17)) and bound-normalization (Eq. (18)). We now present the details.

4.2 Canonical Representation

Theorem 1. For any given LDS, the system tuple (A, C) ∈ R^{n×n} × R^{m×n} and all its equivalent representations have the canonical form (ΛV, U), where U ∈ ST(m, n), V ∈ O(n), and Λ ∈ R^{n×n} is diagonal with the diagonal elements arranged in descending order, i.e., λ_1 ≥ λ_2 ≥ ··· ≥ λ_n.^2

Remark 2. 
The proof of Theorem 1 (presented in the supplementary material) relies on the SVD, which is not necessarily unique [27]; thus the canonical form of a system tuple is not unique either. Even so, the free dimensionality of the canonical space (i.e., mn) is less than that of the original tuples (i.e., mn + n(n−1)/2) within the feasible region of C. This is due to the invariance induced by P-equivalence (Eq. (3)) if one optimizes (11) in the original form of the system tuple.

Remark 3. It is easy to see that stability (i.e., ρ(A) < 1) translates into the constraint |λ_i| < 1 in the canonical representation, with λ_i being the i-th diagonal element of Λ. As such, problem (11) can be cast as

    min_{Λ,V,U} (1/N) Σ_{i=1}^N β_i k(S(ΛV, U), S(A_i, C_i)),
    s.t. V^T V = I_n; U^T U = I_n; |λ_i| < 1, i = 1, ···, n.   (12)

A feasible solution of (11) can be obtained by minimizing (12), and the stability constraint in (11) is reduced to a bound constraint in (12).

The canonical form derived from Theorem 1 is central to our methods. With the canonical form, we can simplify the stability constraint to a bound constraint, making the solution simpler and more efficient. We note that even for a single LDS, optimizing the original form of A under the stability constraint is tedious (e.g., [7]), and the tasks addressed in our paper are more demanding, as far more than one LDS must be optimized. Furthermore, the canonical form enables us to reduce the redundancy of the LDS tuple (see Remark 2). To be specific, with the canonical form, one needs to update only n singular values rather than the entire A matrix. 
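The paper's proof of Theorem 1 is in its supplementary material; one natural SVD-based construction consistent with the statement (an assumption on our part, mirroring the SVD argument Remark 2 alludes to) takes A = W diag(s) Z^T and applies the basis change P = W from Eq. (3):

```python
import numpy as np

def canonical_form(A, C):
    """Hedged sketch of a Theorem-1-style canonicalization.
    With A = W diag(s) Z^T, the basis change P = W gives
    P^T A P = diag(s) (Z^T W) = Lambda V  and  U = C W."""
    W, s, Zt = np.linalg.svd(A)
    Lam = np.diag(s)   # singular values on the diagonal, descending
    V = Zt @ W         # product of orthogonal matrices: V in O(n)
    U = C @ W          # stays on ST(m, n) when C^T C = I_n
    return Lam, V, U
```

The returned triple is P-equivalent to (A, C) by construction, and since C has orthonormal columns, the basis change can be recovered as P = C^T U, which makes the claim directly checkable.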
Moreover, optimization with the canonical representation avoids numerical instabilities related to the equivalence classes, thus facilitating the optimization.

4.3 Passing Gradients Through DLEs

According to the definition of the projection kernel, obtaining k(S(A, C), S(A_i, C_i)) in (11) (note that in the canonical form A = ΛV and C = U) requires computing the product matrices G_i = Σ_{t=0}^∞ (A^T)^t C^T C_i A_i^t. To compute the gradients of the objective in (11), denoted by Γ, w.r.t. the tuple Θ = (A, C), we make use of the chain rule in the vectorized form:

    ∂Γ/∂Θ: = Σ_i (∂Γ/∂G_i:) (∂G_i:/∂Θ:).   (13)

While computing ∂Γ/∂G_i: is straightforward, deriving ∂G_i:/∂Θ: is non-trivial, as the values of the product matrices G_i are obtained by an infinite summation. The following theorem proves that the gradients can be derived by solving an induced DLE.

Theorem 4. Let the extended observability matrices of two LDSs (A_1, C_1) and (A_2, C_2) be O_1 and O_2, respectively. Furthermore, let G_{12} = O_1^T O_2 = Σ_{t=0}^∞ (A_1^T)^t C_1^T C_2 A_2^t be the product matrix between O_1 and O_2. 
Given the gradient of the objective function with respect to the product matrix, ∂Γ/∂G_{12} =: H, the gradients with respect to the system parameters are

    ∂Γ/∂A_1 = G_{12} A_2 R_{12}^T,   ∂Γ/∂A_2 = G_{12}^T A_1 R_{12},
    ∂Γ/∂C_1 = C_2 R_{12}^T,   ∂Γ/∂C_2 = C_1 R_{12},   (14)

where R_{12} is obtained by solving the following DLE:

    A_1 R_{12} A_2^T − R_{12} + H = 0.   (15)

^2 All the proofs of the theorems in this paper are provided in the supplementary material.

4.4 Constraint-Aware Updates

We cannot preserve the orthogonality of V, U and the stability of Λ if we use conventional gradient-descent methods to update the parameters Λ, V, U of (12). Optimization on the space of orthogonal matrices is a well-studied problem [15]. Here, we employ the Cayley transformation [15] to maintain the orthogonality of V and U. In particular, we update V by

    V = V − τ L_V (I_{2n} + (τ/2) R_V^T L_V)^{-1} R_V^T V,   (16)

where L_V = [∇V, V] and R_V = [V, −∇V], ∇V is the gradient of the objective w.r.t. V, and τ is the learning rate. Similarly, to update U, we use

    U = U − τ L_U (I_{2n} + (τ/2) R_U^T L_U)^{-1} R_U^T U,   (17)

where L_U = [∇U, U] and R_U = [U, −∇U]. As shown in [15], the Cayley transform follows a descent curve; thus, updating V by Eq. (16) and U by Eq. (17) decreases the objective for sufficiently small τ.

To accomplish stability, we apply the following bound normalization on Λ:

    λ_k = ε / max(ε, |λ_k − τ∇λ_k|) · (λ_k − τ∇λ_k),   (18)

where λ_k is the k-th diagonal element of Λ; ∇λ_k denotes the gradient w.r.t. λ_k; and ε < 1 is a threshold (we set ε = 0.99 in all of our experiments in this paper). 
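Putting Theorem 4 and the updates of Eqs. (16)-(18) together, a minimal numpy sketch follows. The vectorized DLE solver is our small-n shortcut (not the O(n^3) routine of [26]), and the helper names are ours; the gradient rule can be validated against finite differences.

```python
import numpy as np

def solve_dle(A, B, Q):
    """Solve A X B^T - X + Q = 0 (so X = sum_t A^t Q (B^T)^t) by
    vectorization; fine for the small hidden dimension n."""
    n = A.shape[0]
    x = np.linalg.solve(np.eye(n * n) - np.kron(B, A), Q.reshape(-1, order="F"))
    return x.reshape(n, n, order="F")

def theorem4_gradients(A1, C1, A2, C2, H):
    """Theorem 4: with H = dGamma/dG12, one extra DLE (Eq. (15))
    yields all four gradients of Eq. (14)."""
    G12 = solve_dle(A1.T, A2.T, C1.T @ C2)   # A1^T G A2 - G + C1^T C2 = 0
    R12 = solve_dle(A1, A2, H)               # A1 R A2^T - R + H = 0
    return G12 @ A2 @ R12.T, G12.T @ A1 @ R12, C2 @ R12.T, C1 @ R12

def cayley_update(V, G, tau):
    """Eqs. (16)/(17): Cayley-type curvilinear step; G is the gradient
    w.r.t. V. Orthogonality V^T V = I is preserved exactly."""
    n = V.shape[1]
    L = np.hstack([G, V])        # L_V = [grad, V]
    R = np.hstack([V, -G])       # R_V = [V, -grad]
    mid = np.linalg.solve(np.eye(2 * n) + 0.5 * tau * (R.T @ L), R.T @ V)
    return V - tau * (L @ mid)

def bound_normalize(lam, grad, tau, eps=0.99):
    """Eq. (18): gradient step on the diagonal of Lambda, then rescaling
    into [-eps, eps] to keep the system stable."""
    step = lam - tau * grad
    return eps / np.maximum(eps, np.abs(step)) * step
```

In a full PGD loop these three pieces would be chained exactly as in Algorithm 1: gradients via Theorem 4, then the Cayley steps for V and U, then the bound normalization for Λ.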
From the above, we immediately have the following result.

Theorem 5. The update direction in Eq. (18) is a descent direction.

The authors in [5] constrain the eigenvalues of the transition matrix to lie in (−1, 1) using a sigmoid function. However, the sigmoid function saturates easily, and its gradient vanishes when λ_k is close to the bound. In contrast, Eq. (18) does not suffer from this issue.

For the reader's convenience, all the aforementioned details for optimizing (11) are summarized in Algorithm 1. The full details of how to use Algorithm 1 to solve clustering and sparse coding are provided in the supplementary material.

Algorithm 1 The PGD method to optimize problem (11)

Input: the given tuples {(A_i, C_i)}; the initialization of (A, C); and the learning rate τ;
According to Theorem 1, compute the canonical forms of {(A_i, C_i)}_{i=1}^N and (A, C) as {(Λ_i, V_i, U_i)}_{i=1}^N and (Λ, V, U), respectively;
for t = 1 to maxIter do
    Compute the gradients according to Theorem 4: ∇Λ, ∇V, ∇U;
    Update V: V = V − τ L_V (I_{2n} + (τ/2) R_V^T L_V)^{-1} R_V^T V, with L_V and R_V defined in Eq. (16);
    Update U: U = U − τ L_U (I_{2n} + (τ/2) R_U^T L_U)^{-1} R_U^T U, with L_U and R_U defined in Eq. (17);
    Update Λ: λ_k = ε / max(ε, |λ_k − τ∇λ_k|) · (λ_k − τ∇λ_k);
end for
Output: the system tuple (Λ, V, U).

4.5 Extensions to Other Kernels

The proposed solution is general in nature and can be used with other kernel functions such as the Martin kernel [25] and the Binet-Cauchy kernel [19]. The Martin kernel is defined as

    k_m((A_1, C_1), (A_2, C_2)) = det(G_{11}^{-1} G_{12} G_{22}^{-1} G_{21}),   (19)

    k_b((A_1, C_1), (A_2, C_2)) = det(C_1 M C_2^T),   (20)

with G_{ij} as in Eq. (5). 
The determinant version of the Binet-Cauchy kernel is defined as in Eq. (20), where M satisfies e^{−λ_b} A_1 M A_2^T − M = −x_1(1) x_2(1)^T, λ_b is the exponential discounting rate, and x_1(1), x_2(1) are the initial hidden states of the two compared LDSs. Both the Martin kernel and the Binet-Cauchy kernel are computed via DLEs. Thus, Theorem 4 can be employed to compute the gradients w.r.t. the system tuple for them as well.

5 Experiments

In this section, we first compare the performance of our proposed method (see Algorithm 1), called PGD, with previous state-of-the-art methods for the tasks of clustering and sparse coding using the DynTex++ [28] dataset. We then evaluate the classification accuracies of various state-of-the-art methods and PGD on two video datasets, namely the YUPENN [29] and DynTex [30] datasets. These datasets have been widely used in evaluating LDS-based algorithms in the literature, and their details are presented in the supplementary material. In all experiments, the hidden order of the LDS (n in Eq. (1)) is fixed to 10. To learn an LDS dictionary, we use a sparsity factor of 0.1 (λ in Eq. (9)). The LDS tuples for all input sequences are learned by the method in [1], and the transition matrices are stabilized by the soft-normalization technique in [5].

5.1 Models Comparison

This experiment uses the DynTex++ dataset. We extract the histogram of LBP from Three Orthogonal Planes (LBP-TOP) [31] by splitting each video into sub-videos of length 8, with a 6-frame overlap. The LBP-TOP features are fed to LDSs to identify the system parameters. For clustering, we compare our PGD with the MDS method with the Martin kernel [11] and the Align algorithm [14]. For sparse coding, two related methods are compared: Grass [13] and LDSST [5]. We follow [13] and use 3-step observability matrices for the Grass method (hence Grass-3 below). 
In LDSST, the transition matrices are enforced to be symmetric. All algorithms are randomly initialized, and the results averaged over 10 runs are reported.

Figure 1: The clustering performance of the MDS, Align and PGD algorithms with varying numbers of clusters on DynTex++.

5.1.1 Clustering

To evaluate the clustering performance, we apply the purity metric [32], which is given by p = (1/N) Σ_k max_i c_{i,k}, where c_{i,k} counts the number of samples from the i-th class in the k-th cluster and N is the number of data points. A higher purity means a better performance.

For the Align algorithm, we varied the learning rate when optimizing the aligning matrices and chose the value that delivered the best performance. For our PGD algorithm, we selected the learning rate as 0.1 for Λ and V and 1 for U. Fig. 1 reports the clustering performance of the compared methods. Our method consistently outperforms both the MDS and Align methods over various numbers of clusters. We also report the running time for one epoch of each algorithm in Fig. 1. Here, one epoch means one update of the clustering centers through all data samples. Fig. 1 shows that PGD runs faster than both the MDS and Align algorithms, probably because the MDS method recomputes the kernel matrix for the embedding at each epoch and the Align algorithm calculates the aligning distance in an iterative way.

5.1.2 Sparse Coding

In this experiment, we used half of the samples from DynTex++ for training the dictionary and the other half for testing. As the objective of (11) is a sum of per-sample terms, we can employ a stochastic version of Algorithm 1 to optimize (11) for large-scale datasets. This can be achieved by sampling a mini-batch to update the system tuple at each iteration. Therefore, in addition to the full-batch version, we also carried out stochastic PGD with a mini-batch of size 128, which is denoted as PGD-128. 
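For reference, the purity metric of Section 5.1.1 can be computed with a few lines of numpy (a straightforward sketch; the function name is ours):

```python
import numpy as np

def purity(labels, clusters):
    """p = (1/N) * sum_k max_i c_{i,k}: each cluster votes for its
    dominant ground-truth class; a perfect clustering gives 1.0."""
    labels = np.asarray(labels)
    clusters = np.asarray(clusters)
    total = 0
    for k in np.unique(clusters):
        members = labels[clusters == k]
        total += np.bincount(members).max()   # size of the dominant class
    return total / labels.size
```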
The learning rates of both full PGD and PGD-128 were selected as 0.1 for Λ and V and 1 for U, and their values were decreased by half every 10 epochs. Different from PGD, the Grass and LDSST methods require the whole dataset at hand for learning the dictionary at each epoch, and thus they cannot support updates via mini-batches.

Figure 2: Testing reconstruction errors of Grass-3, LDSST, PGD-full and PGD-128 with different dictionary sizes on DynTex++ ((a) J = 4, (b) J = 8, (c) J = 16). The PGD-128 method converges much faster than the other counterparts. Although Grass-3 converges to a slightly smaller error than PGD-128 when J = 4 (see (a)), it performs worse than PGD-128 as the value of J increases (see (b) and (c)).

It is unfair to directly compare the reconstruction errors (Eq. (9)) of different methods, since their values are calculated by different metrics. Therefore, we make use of the normalized reconstruction error defined as NR = (R_t − R_init) / R_init, where R_init and R_t correspond to the reconstruction errors at the initial step and the t-th epoch, respectively. Fig. 2 shows the normalized reconstruction errors on the testing set for the PGD variants, Grass-3 and the LDSST method during the learning process for various dictionary sizes. PGD-128 converges to lower errors than PGD-full in all experiments, indicating that the stochastic sampling strategy helps escape poor local minima. PGD-128 consistently outperforms both Grass-3 and LDSST in terms of learning speed and final error. The computational complexities of updating one dictionary atom for the Grass and LDSST methods are O((J + N) L^2 n^2 m^2) and O((J + N) n^2 m^2), respectively. Here, J is the dictionary size, N is the number of data, and n and m are the LDS parameters defined in Eq. (1). 
In contrast, PGD requires calculating the projected gradients of the canonical tuples, which scales as only O((J + N) n^2 m). As shown in Fig. 2, PGD is more than 50 times faster than the Grass-3 and LDSST methods per epoch.

5.2 Video Classification

Classifying YUPENN or DynTex videos is challenging, as the videos are recorded under various viewpoints and scales. To deliver robust features, we implement two kinds of high-level pooling frameworks: Bag-of-Systems (BoS) [11] and Spatial-Temporal-Pyramid-Matching (STPM) [16]^3. In particular, 1) BoS is performed with the clustering methods, i.e., MDS, Align and PGD. The BoS framework models the local spatio-temporal blocks with LDSs and then clusters the LDS descriptors to obtain the codewords; 2) the STPM framework works in conjunction with the sparse coding approaches (i.e., Grass-3, LDSST and the PGD methods). Unlike BoS, which represents a video by unordered local descriptors, STPM partitions a video into segments at different scales (2-level scales are considered here) and concatenates all local descriptors for each segment to form a vectorized representation. The codewords are provided by learning a dictionary. For the BoS methods, we apply a nonlinear SVM as the classifier, employing the radial basis kernel with the χ2 distance [33]; for the STPM methods, we utilize a linear SVM for classification.

Table 1: Mean classification accuracies (percentage) on the YUPENN and DynTex datasets.

    Datasets | References | +BoS: MDS / Align / PGD | +STPM: Grass-3 / LDSST / PGD
    YUPENN   | 85 [10]    | 83.3 / 82.1 / 84.1      | 90.7 / 91.6 / 93.6
    DynTex   | -          | 59.5 / 62.7 / 65.4      | 75.1 / 75.1 / 76.5

YUPENN. Non-overlapping spatio-temporal blocks of size 8 × 8 × 25 were sampled from the videos. The number of codewords for all BoS and STPM methods was set to 128. We sampled 50 blocks from each video to learn the codewords for the MDS, Align, Grass-3 and LDSST methods. 
For PGD, we updated the codewords by mini-batches. To maintain the diversity within each mini-batch, a hierarchical approach was used: at each iteration, we first randomly sampled 20 videos from the dataset and then sampled 4 blocks from each of these videos, leading to a mini-batch of size N′ = 80. The learning rates were set to 0.5 for Λ and V and 5 for U, and their values were halved every 10 epochs. The test protocol is leave-one-video-out, as suggested in [29], leading to a total of 420 trials. Table 1 shows that the STPM methods achieve better accuracies than the BoS approaches; within the same pooling framework, our PGD always outperforms the other compared models. The 85% result on YUPENN reported in Table 1 is that of the probabilistic clustering method [10]; note that [10] uses a considerably larger dictionary.

DynTex. For the DynTex dataset, spatio-temporal blocks of size 16 × 16 × 50 were sampled in a non-overlapping way. The number of codewords for all methods was set to 64. We applied the same sampling strategy as on YUPENN to learn the codewords for all compared methods. As shown in Table 1, the proposed method is superior to the studied models with both BoS and STPM coding strategies.

³ In the experiments, we consider the projection kernel as defined in Eq. (5). We have also conducted additional experiments with another kernel, namely the Martin kernel (Eq. (19)); the results are provided in the supplementary material.

6 Conclusion

We propose an efficient Projected-Gradient-Descent (PGD) method to optimize problem (11).
Our algorithm can be used to perform clustering and sparse coding with LDSs. In contrast to previous studies, our solution avoids any approximation in LDS modeling or during the optimization process. Extensive experiments on clustering and sparse coding verify the effectiveness of the proposed method in terms of convergence and learning speed. We also explore the combination of PGD with two high-level pooling frameworks, namely Bag-of-Systems (BoS) and Spatial-Temporal-Pyramid-Matching (STPM), for video classification. The experimental results demonstrate that our PGD method consistently outperforms state-of-the-art methods.

Acknowledgments

This research was supported in part by the National Science Foundation of China (NSFC) (Grant Nos. 91420302, 91520201, 61210013 and 61327809), the NSFC and the German Research Foundation (DFG) in project Crossmodal Learning (Grant No. NSFC 61621136008 / DFG TRR-169), and the National High-Tech Research and Development Plan under Grant 2015AA042306. Besides, Tong Zhang was supported by the Australian Research Council's Discovery Projects funding scheme (project DP150104645).

References

[1] Gianfranco Doretto, Alessandro Chiuso, Ying Nian Wu, and Stefano Soatto. Dynamic textures. International Journal of Computer Vision (IJCV), 51(2):91–109, 2003.

[2] Tae-Kyun Kim and Roberto Cipolla. Canonical correlation analysis of video volume tensors for action categorization and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 31(8):1415–1428, 2009.

[3] Chuang Gan, Naiyan Wang, Yi Yang, Dit-Yan Yeung, and Alex G Hauptmann. DevNet: A deep event network for multimedia event detection and evidence recounting. In CVPR, pages 2568–2577, 2015.

[4] Chuang Gan, Ting Yao, Kuiyuan Yang, Yi Yang, and Tao Mei. You lead, we exceed: Labor-free video concept learning by jointly exploiting web videos and images.
In CVPR, pages 923–932, 2016.

[5] Wenbing Huang, Fuchun Sun, Lele Cao, Deli Zhao, Huaping Liu, and Mehrtash Harandi. Sparse coding and dictionary learning with linear dynamical systems. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016.

[6] Sajid M Siddiqi, Byron Boots, and Geoffrey J Gordon. A constraint generation approach to learning stable linear dynamical systems. In Advances in Neural Information Processing Systems (NIPS), 2007.

[7] Wenbing Huang, Lele Cao, Fuchun Sun, Deli Zhao, Huaping Liu, and Shanshan Yu. Learning stable linear dynamical systems with the weighted least square method. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2016.

[8] Søren Johansen. Likelihood-based inference in cointegrated vector autoregressive models. Oxford University Press on Demand, 1995.

[9] Bijan Afsari and René Vidal. Distances on spaces of high-dimensional linear stochastic processes: A survey. In Geometric Theory of Information, pages 219–242. Springer, 2014.

[10] Adeel Mumtaz, Emanuele Coviello, Gert RG Lanckriet, and Antoni B Chan. A scalable and accurate descriptor for dynamic textures using bag of system trees. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(4):697–712, 2015.

[11] Avinash Ravichandran, Rizwan Chaudhry, and Rene Vidal. Categorizing dynamic textures using a bag of dynamical systems. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(2):342–353, 2013.

[12] Pavan Turaga, Ashok Veeraraghavan, Anuj Srivastava, and Rama Chellappa. Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(11):2273–2286, 2011.

[13] Mehrtash Harandi, Richard Hartley, Chunhua Shen, Brian Lovell, and Conrad Sanderson.
Extrinsic methods for coding and dictionary learning on Grassmann manifolds. International Journal of Computer Vision (IJCV), 114(2):113–136, 2015.

[14] Bijan Afsari, Rizwan Chaudhry, Avinash Ravichandran, and René Vidal. Group action induced distances for averaging and clustering linear dynamical systems with applications to the analysis of dynamic scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2208–2215. IEEE, 2012.

[15] Zaiwen Wen and Wotao Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1-2):397–434, 2013.

[16] Jianchao Yang, Kai Yu, Yihong Gong, and Thomas Huang. Linear spatial pyramid matching using sparse coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1794–1801. IEEE, 2009.

[17] Antoni B Chan and Nuno Vasconcelos. Probabilistic kernels for the classification of auto-regressive visual processes. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 846–851. IEEE, 2005.

[18] Franco Woolfe and Andrew Fitzgibbon. Shift-invariant dynamic texture recognition. In European Conference on Computer Vision (ECCV), pages 549–562. Springer, 2006.

[19] SVN Vishwanathan, Alexander J Smola, and René Vidal. Binet-Cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes. International Journal of Computer Vision (IJCV), 73(1):95–119, 2007.

[20] Payam Saisan, Gianfranco Doretto, Ying Nian Wu, and Stefano Soatto. Dynamic texture recognition. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages II–58. IEEE, 2001.

[21] Antoni B Chan and Nuno Vasconcelos. Classifying video with kernel dynamic textures. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–6.
IEEE, 2007.

[22] Katrien De Cock and Bart De Moor. Subspace angles between ARMA models. Systems & Control Letters, 46(4):265–270, 2002.

[23] Antoni B. Chan, Emanuele Coviello, and Gert RG Lanckriet. Clustering dynamic textures with the hierarchical EM algorithm. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2022–2029. IEEE, 2010.

[24] Antoni B. Chan, Emanuele Coviello, and Gert RG Lanckriet. Clustering dynamic textures with the hierarchical EM algorithm for modeling video. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(7):1606–1621, 2013.

[25] Richard J Martin. A metric for ARMA processes. IEEE Transactions on Signal Processing, 48(4):1164–1170, 2000.

[26] A. Barraud. A numerical algorithm to solve A^T X A − X = Q. IEEE Transactions on Automatic Control, 22(5):883–885, 1977.

[27] Dan Kalman. A singularly valuable decomposition: The SVD of a matrix. The College Mathematics Journal, 27(1):2–23, 1996.

[28] Bernard Ghanem and Narendra Ahuja. Maximum margin distance learning for dynamic texture recognition. In European Conference on Computer Vision (ECCV), pages 223–236. Springer, 2010.

[29] Konstantinos G Derpanis, Matthieu Lecce, Kostas Daniilidis, and Richard P Wildes. Dynamic scene understanding: The role of orientation features in space and time in scene classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1306–1313. IEEE, 2012.

[30] Renaud Péteri, Sándor Fazekas, and Mark J. Huiskes. DynTex: a comprehensive database of dynamic textures. Pattern Recognition Letters, doi:10.1016/j.patrec.2010.05.009, 2010. http://projects.cwi.nl/dyntex/.

[31] Guoying Zhao and Matti Pietikainen. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 29(6):915–928, 2007.

[32] Anna Huang.
Similarity measures for text document clustering. In Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC 2008), Christchurch, New Zealand, pages 49–56, 2008.

[33] Richard O Duda, Peter E Hart, and David G Stork. Pattern Classification. John Wiley & Sons, 2012.