{"title": "A posteriori error bounds for joint matrix decomposition problems", "book": "Advances in Neural Information Processing Systems", "page_first": 4943, "page_last": 4950, "abstract": "Joint matrix triangularization is often used for estimating the joint eigenstructure of a set M of matrices, with applications in signal processing and machine learning. We consider the problem of approximate joint matrix triangularization when the matrices in M are jointly diagonalizable and real, but we only observe a set M' of noise perturbed versions of the matrices in M. Our main result is a first-order upper bound on the distance between any approximate joint triangularizer of the matrices in M' and any exact joint triangularizer of the matrices in M. The bound depends only on the observable matrices in M' and the noise level. In particular, it does not depend on optimization specific properties of the triangularizer, such as its proximity to critical points, that are typical of existing bounds in the literature. To our knowledge, this is the first a posteriori bound for joint matrix decomposition. We demonstrate the bound on synthetic data for which the ground truth is known.", "full_text": "A Posteriori Error Bounds for Joint Matrix\n\nDecomposition Problems\n\nNicol\u00f2 Colombo\n\nnicolo.colombo@ucl.ac.uk\n\nDepartment of Statistical Science\n\nUniversity College London\n\nNikos Vlassis\nAdobe Research\n\nSan Jose, CA\n\nvlassis@adobe.com\n\nAbstract\n\nJoint matrix triangularization is often used for estimating the joint eigenstructure\nof a set M of matrices, with applications in signal processing and machine learning.\nWe consider the problem of approximate joint matrix triangularization when the\nmatrices in M are jointly diagonalizable and real, but we only observe a set M\u2019\nof noise perturbed versions of the matrices in M. 
Our main result is a first-order upper bound on the distance between any approximate joint triangularizer of the matrices in M\u2019 and any exact joint triangularizer of the matrices in M. The bound depends only on the observable matrices in M\u2019 and the noise level. In particular, it does not depend on optimization-specific properties of the triangularizer, such as its proximity to critical points, that are typical of existing bounds in the literature. To our knowledge, this is the first a posteriori bound for joint matrix decomposition. We demonstrate the bound on synthetic data for which the ground truth is known.\n\n1 Introduction\n\nJoint matrix decomposition problems appear frequently in signal processing and machine learning, with notable applications in independent component analysis [7], canonical correlation analysis [20], and latent variable model estimation [5, 4]. Most of these applications reduce to some instance of a tensor decomposition problem, and the growing interest in joint matrix decomposition is largely motivated by such reductions. In particular, in the past decade several \u2018matricization\u2019 methods have been proposed for factorizing tensors by computing the joint decomposition of sets of matrices extracted from slices of the tensor (see, e.g., [10, 22, 17, 8]).\nIn this work we address a standard joint matrix decomposition problem, in which we assume a set of jointly diagonalizable ground-truth (unobserved) matrices\n\nM\u25e6 = {Mn = V diag([\u039b_{n1}, . . . , \u039b_{nd}]) V^{-1}, V \u2208 R^{d\u00d7d}, \u039b \u2208 R^{N\u00d7d}}_{n=1}^{N} ,   (1)\n\nwhich have been corrupted by noise and we observe their noisy versions:\n\nM\u03c3 = { \u02c6Mn = Mn + \u03c3 Rn, Mn \u2208 M\u25e6, Rn \u2208 R^{d\u00d7d}, ||Rn|| \u2264 1}_{n=1}^{N} .   (2)\n\nThe matrices \u02c6Mn \u2208 M\u03c3 are the only observed quantities.
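To make the setup concrete, here is a minimal NumPy sketch (illustrative only, not the authors' code; the function name make_problem and all parameter names are ours) that samples a ground-truth set M\u25e6 as in (1) together with its noisy observation M\u03c3 as in (2):

```python
import numpy as np

def make_problem(d=5, N=4, sigma=1e-3, seed=0):
    """Sample M_n = V diag(Lambda_n) V^{-1} as in (1) and noisy
    observations hat{M}_n = M_n + sigma * R_n with ||R_n||_F = 1 as in (2)."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((d, d))            # shared eigenvectors (invertible a.s.)
    Lam = rng.standard_normal((N, d))          # row n holds the joint eigenvalues of M_n
    Vinv = np.linalg.inv(V)
    M = [V @ np.diag(Lam[n]) @ Vinv for n in range(N)]
    R = [rng.standard_normal((d, d)) for _ in range(N)]
    R = [Rn / np.linalg.norm(Rn) for Rn in R]  # enforce unit Frobenius norm
    M_hat = [Mn + sigma * Rn for Mn, Rn in zip(M, R)]
    return M, M_hat, V, Lam
```

Since the Mn share the eigenvector matrix V, they commute; checking that ||Ma Mb \u2212 Mb Ma|| vanishes up to round-off is a quick sanity test for data generated this way.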
The scalar \u03c3 > 0 is the noise level, and the matrices Rn are arbitrary noise matrices with Frobenius norm ||Rn|| \u2264 1. The key problem is to estimate from the observed matrices in M\u03c3 the joint eigenstructure V, \u039b of the ground-truth matrices in M\u25e6. One way to address this estimation problem is by trying to approximately jointly diagonalize the observed matrices in M\u03c3, for instance by directly searching for an invertible matrix that approximates V in (1). This approach is known as nonorthogonal joint diagonalization [23, 15, 18], and is often motivated by applications that reduce to nonsymmetric CP tensor decomposition (see, e.g., [20]).\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fAn alternative approach to the above estimation problem (in the general case of nonorthogonal V) is via joint triangularization, also known as joint or simultaneous Schur decomposition [1, 13, 11, 12, 22, 8]. Under mild conditions [14], the ground-truth matrices in M\u25e6 can be jointly triangularized, that is, there exists an orthogonal matrix U\u25e6 that simultaneously renders all matrices U\u25e6^T Mn U\u25e6 upper triangular:\n\nlow(U\u25e6^T Mn U\u25e6) = 0, for all n = 1, . . . , N,   (3)\n\nwhere low(A) is the strictly lower triangular part of A, i.e., [low(A)]ij = Aij if i > j and 0 otherwise. On the other hand, when \u03c3 > 0 the observed matrices in M\u03c3 can only be approximately jointly triangularized, for instance by solving the following optimization problem\n\nmin_{U\u2208O(d)} L(U), where L(U) = (1/N) \u03a3_{n=1}^{N} ||low(U^T \u02c6Mn U)||^2 ,   (4)\n\nwhere ||\u00b7|| denotes Frobenius norm and optimization is over the manifold O(d) of orthogonal matrices. The optimization problem can be addressed by Jacobi-like methods [13], or Newton-like methods that optimize directly on the O(d) manifold [8].
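For concreteness, the objective L(U) in (4) can be evaluated with a few lines of NumPy (an illustrative sketch, not the paper's code; the function names are ours):

```python
import numpy as np

def strict_lower(A):
    """low(A): the strictly lower triangular part of A."""
    return np.tril(A, k=-1)

def joint_tri_loss(U, M_hat):
    """L(U) = (1/N) sum_n ||low(U^T hat{M}_n U)||_F^2, the objective in (4)."""
    return sum(np.linalg.norm(strict_lower(U.T @ Mn @ U)) ** 2
               for Mn in M_hat) / len(M_hat)
```

In the noiseless case an exact joint triangularizer is available in closed form: if V = QR is a QR decomposition of the joint eigenvector matrix, then Q^T Mn Q = R \u039bn R^{-1} is upper triangular for every n, so L(Q) = 0.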
For any feasible U in (4), the joint eigenvalues \u039b in (1) can be estimated from the diagonals of U^T \u02c6Mn U. This approach has been used in nonsymmetric CP tensor decomposition [22, 8] and other applications [9, 13].\nWe also note two related problems. In the special case that the ground-truth matrices Mn in M\u25e6 are symmetric, the matrix V in (1) is orthogonal, and the estimation problem is known as orthogonal joint diagonalization [7]. Our results apply to this special case too. Another problem is joint diagonalization by congruence [6, 3, 17], in which the matrix V^{-1} in (1) is replaced by V^T. In that case the matrix \u039b in (1) does not contain the joint eigenvalues, and our results do not apply directly.\n\nContributions We are addressing the joint matrix triangularization problem defined via (4), under the model assumptions (1), (2), and (3). The optimization problem (4) is nonconvex, and hence it is expected to be hard to solve to global optimality in general. Therefore, error bounds are needed that can assess the quality of a solution produced by some algorithm that tries to solve (4). Our main result (Theorem 1) is an error bound that allows one to assess directly, a posteriori, the quality of any feasible triangularizer U in (4), in terms of its proximity to the (unknown) exact triangularizer of the ground-truth matrices in M\u25e6, regardless of the algorithm used for optimization. The bound depends only on observable quantities and the noise parameter \u03c3 in (2). The parameter \u03c3 can often be bounded by a function of the sample size, as in problems involving empirical moment matching [4].\nOur approach draws on the perturbation analysis of the Schur decomposition of a single matrix [16]. To our knowledge, our bound in Theorem 1 is the first a posteriori error bound for joint matrix decomposition problems.
Existing bounds in the literature have a dependence on the ground-truth (and hence unobserved) matrices [11, 17], the proximity of a feasible U to critical points of the objective function [6], or the amount of collinearity between the columns of the matrix \u039b in (1) [3]. Our error bound is free of such dependencies. Outside the context of joint matrix decomposition, a posteriori error bounds have found practical uses in nonconvex optimization [19] and the design of algorithms [21].\n\nNotation All matrices, vectors, and numbers are real. When the context is clear we use 1 to denote the identity matrix. We use ||\u00b7|| for matrix Frobenius norm and vector l2 norm. O(d) is the manifold of orthogonal matrices U such that U^T U = 1. The matrix commutator [A, B] is defined by [A, B] = AB \u2212 BA. We use \u2297 for Kronecker product. For a matrix A, we denote by \u03bbi(A) its ith eigenvalue, \u03bbmin(A) its smallest eigenvalue, \u03ba(A) its condition number, vec(A) its columnwise vectorization, and low(A) and up(A) its strictly lower triangular and strictly upper triangular parts, respectively. Low is a binary diagonal matrix defined by vec(low(A)) = Low vec(A). Skew is a skew-symmetric projector defined by Skew vec(A) = vec(A \u2212 A^T). Plow is a d(d \u2212 1)/2 \u00d7 d^2 binary matrix with orthogonal rows, which projects to the subspace of vectorized strictly lower triangular matrices, such that Plow Plow^T = 1 and Plow^T Plow = Low. For example, for d = 3, one has Low = diag([0, 1, 1, 0, 0, 1, 0, 0, 0]) and\n\nPlow = [ 0 1 0 0 0 0 0 0 0 ; 0 0 1 0 0 0 0 0 0 ; 0 0 0 0 0 1 0 0 0 ] .\n\n2 Perturbation of joint triangularizers\n\nThe objective function (4) is continuous in the parameter \u03c3.
This implies that, for \u03c3 small enough, the approximate joint triangularizers of the observed matrices \u02c6Mn \u2208 M\u03c3 can be expected to be perturbations of the exact triangularizers of the ground-truth matrices Mn \u2208 M\u25e6. To formalize this, we express each feasible triangularizer U in (4) as a function of some exact triangularizer U\u25e6 of the ground-truth matrices, as follows:\n\nU = U\u25e6 e^{\u03b1X} , where X = \u2212X^T, ||X|| = 1, \u03b1 > 0,   (5)\n\nwhere e denotes matrix exponential and X is a skew-symmetric matrix. Such an expansion holds for any pair U, U\u25e6 of orthogonal matrices with det(U) = det(U\u25e6) (see, e.g., [2]). The scalar \u03b1 in (5) can be interpreted as the \u2018distance\u2019 between the matrices U and U\u25e6. Our main result is an upper bound on this distance:\nTheorem 1. Let M\u25e6 and M\u03c3 be the sets of matrices defined in (1) and (2), respectively. Let U be a feasible solution of the optimization problem (4), with corresponding value L(U). Then there exists an orthogonal matrix U\u25e6 that is an exact joint triangularizer of M\u25e6, such that U can be expressed as a perturbation of U\u25e6 according to (5), with \u03b1 obeying\n\n\u03b1 \u2264 (\u221aL(U) + \u03c3) / \u221a\u03bbmin(\u02c6\u03c4) + O(\u03b1^2) ,   (6)\n\nwhere\n\n\u02c6\u03c4 = (1/2N) \u03a3_{n=1}^{N} \u02c6Tn^T \u02c6Tn ,   \u02c6Tn = Plow (1 \u2297 (U^T \u02c6Mn U) \u2212 (U^T \u02c6Mn^T U) \u2297 1) Skew Plow^T .   (7)\n\nProof. Let U\u25e6 be the exact joint triangularizer of M\u25e6 that is the nearest to U and det U = det U\u25e6. Then U = U\u25e6 e^{\u03b1X} for some unit-norm skew-symmetric matrix X and scalar \u03b1 > 0. Using the expansion e^{\u03b1X} = I + \u03b1X + O(\u03b1^2) and the fact that X is skew-symmetric, we can write, for any n = 1, . . .
, N,\n\nU^T \u02c6Mn U = U\u25e6^T \u02c6Mn U\u25e6 + \u03b1 [U\u25e6^T \u02c6Mn U\u25e6, X] + O(\u03b1^2),   (8)\n\nwhere [\u00b7,\u00b7] denotes matrix commutator. Applying the low(\u00b7) operator and using the facts that \u03b1 U\u25e6^T \u02c6Mn U\u25e6 = \u03b1 U^T \u02c6Mn U + O(\u03b1^2) and low(U\u25e6^T Mn U\u25e6) = 0, for any n = 1, . . . , N, we can write\n\n\u03b1 low([U^T \u02c6Mn U, X]) = low(U^T \u02c6Mn U) \u2212 \u03c3 low(U\u25e6^T Rn U\u25e6) + O(\u03b1^2).   (9)\n\nStacking (9) over n, then taking Frobenius norm, and applying the triangle inequality together with the fact ||low(U\u25e6^T Rn U\u25e6)|| \u2264 ||U\u25e6^T Rn U\u25e6|| = ||Rn|| \u2264 1 for all n = 1, . . . , N, we get\n\n\u03b1 (\u03a3_{n=1}^{N} ||low([U^T \u02c6Mn U, X])||^2)^{1/2} \u2264 \u221a(N L(U)) + \u03c3 \u221aN + O(\u03b1^2),   (10)\n\nwhere we used the definition of L(U) from (4). The rest of the proof involves computing a lower bound of the left-hand side of (10) that holds for all X. Since ||low(A)|| = ||Plow vec(A)||, we can rewrite the argument of each norm in the left-hand side of (10) as\n\nlow([U^T \u02c6Mn U, X]) = Plow vec([U^T \u02c6Mn U, X])   (11)\n= Plow (1 \u2297 (U^T \u02c6Mn U) \u2212 (U^T \u02c6Mn^T U) \u2297 1) vec(X),   (12)\n\nand, due to the skew-symmetry of X, we can write\n\nvec(X) = vec(low(X) + up(X)) = vec(low(X) \u2212 low(X)^T)   (13)\n= Skew Low vec(X) = Skew Plow^T Plow vec(X) .   (14)\n\nHence, for all n = 1, . . . , N, we can write ||low([U^T \u02c6Mn U, X])||^2 = ||\u02c6Tn x||^2 = x^T \u02c6Tn^T \u02c6Tn x, where x = Plow vec(X) and \u02c6Tn is defined in (7).
The inequality in (6) then follows by using the inequality x^T A x \u2265 ||x||^2 \u03bbmin(A), which holds for any symmetric matrix A, and noting that ||x||^2 = 1/2 (since x contains the lower triangular part of X and ||X||^2 = 1).\n\nFor general M\u03c3, an analytical expression of \u03bbmin(\u02c6\u03c4) in (6) is not available. However, it is straightforward to compute \u03bbmin(\u02c6\u03c4) numerically since all quantities in (7) are observable. Moreover, it is possible to show (see Theorem 2) that in the limit \u03c3 \u2192 0 and under certain conditions on the ground-truth matrices in M\u25e6, the operator \u03c4 = lim_{\u03c3\u21920} \u02c6\u03c4 is nonsingular, i.e., \u03bbmin(\u03c4) > 0. Since both \u02c6\u03c4 and L are continuous in \u03c3 for \u03c3 \u2192 0, the boundedness of the right-hand side of (6) is guaranteed, for \u03c3 small enough, by eigenvalue perturbation theorems.\nTheorem 2. The operator \u02c6\u03c4 defined in (7) obeys\n\nlim_{\u03c3\u21920} \u221a\u03bbmin(\u02c6\u03c4) \u2265 \u221a\u0393 / \u03ba(V)^2 ,   (15)\n\nwhere \u0393 = min_{i>j} (1/2N) \u03a3_{n=1}^{N} (\u03bbi(Mn) \u2212 \u03bbj(Mn))^2, and the matrix V is defined in (1).\nThe proof is given in the Appendix. The quantity \u0393 can be interpreted as a \u2018joint eigengap\u2019 of M\u25e6 (see also [17] for a similar definition in the context of joint diagonalization by congruence). Theorem 2 implies that lim_{\u03c3\u21920} \u03bbmin(\u02c6\u03c4) > 0 if \u0393 > 0, and the latter is guaranteed under the following condition:\nCondition 1. For every i \u2260 j, i, j = 1, . . . , d, there exists at least one n \u2208 {1, . . . , N} such that\n\n\u03bbi(Mn) \u2260 \u03bbj(Mn), where Mn \u2208 M\u25e6 .   (16)\n\n3 Experiments\n\nTo assess the tightness of the inequality in (6), we created a set of synthetic problems in which the ground truth is known, and we evaluated the bounds obtained from (6) against the true values. Each problem involved the approximate triangularization of a set of randomly generated nearly jointly diagonalizable matrices of the form \u02c6Mn = V \u039bn V^{-1} + \u03c3 Rn, with \u039bn diagonal and ||Rn|| = 1, for n = 1, . . . , N. For each set M\u03c3 = {\u02c6Mn}_{n=1}^{N}, two approximate joint triangularizers were computed by optimizing (4) using two different iterative algorithms, the Gauss-Newton algorithm [8] and the Jacobi algorithm [13] (our implementation), initialized with the same random orthogonal matrix. The obtained solutions U (which may not be the global optima) were then used to compute the empirical bound \u03b1 from (6), as well as the actual distance parameter \u03b1true = ||log(U^T U\u25e6)||, with U\u25e6 being the global optimum of the unperturbed problem (\u03c3 = 0) that is closest to U and has the same determinant. Locating the closest U\u25e6 to the given U required checking all 2^d d! possible exact triangularizers of M\u25e6, thus we restricted our empirical evaluation to the case d = 5. We considered two settings, N = 5 and N = 100, and several different noise levels obtained by varying the perturbation parameter \u03c3.\nThe first two graphs in Figure 1 show the value of the noise level \u03c3 against the values of \u03b1true = \u03b1true(U) and the corresponding empirical bounds \u03b1 = \u03b1(U) from (6), where U are the solutions found by the Gauss-Newton algorithm. (Very similar results were obtained using the Jacobi algorithm.) All values are obtained by averaging over 10 equivalent experiments, and the error bars show the corresponding standard deviations.
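The empirical bound itself is computable from observable quantities alone; the following NumPy sketch (our own illustration, not the authors' code, assuming the paper's columnwise vec convention; all function names are ours) assembles the operators of (7) and evaluates the right-hand side of (6):

```python
import numpy as np

def P_low(d):
    """Binary d(d-1)/2 x d^2 selector: P_low @ vec(A) stacks the strictly
    lower entries of A (columnwise vec, matching the paper's d = 3 example)."""
    rows = []
    for j in range(d):
        for i in range(j + 1, d):
            e = np.zeros(d * d)
            e[i + d * j] = 1.0
            rows.append(e)
    return np.array(rows)

def skew_mat(d):
    """Skew operator: skew_mat(d) @ vec(A) = vec(A - A^T)."""
    K = np.zeros((d * d, d * d))  # commutation matrix, K @ vec(A) = vec(A^T)
    for i in range(d):
        for j in range(d):
            K[i + d * j, j + d * i] = 1.0
    return np.eye(d * d) - K

def alpha_bound(U, M_hat, sigma):
    """First-order a posteriori bound (6) on the distance alpha."""
    d, N = U.shape[0], len(M_hat)
    Pl, S, I = P_low(d), skew_mat(d), np.eye(d)
    L = 0.0
    tau = np.zeros((d * (d - 1) // 2, d * (d - 1) // 2))
    for Mh in M_hat:
        A = U.T @ Mh @ U
        L += np.linalg.norm(np.tril(A, -1)) ** 2 / N            # objective (4)
        Tn = Pl @ (np.kron(I, A) - np.kron(A.T, I)) @ S @ Pl.T  # hat{T}_n in (7)
        tau += Tn.T @ Tn / (2 * N)                               # hat{tau} in (7)
    lam_min = np.linalg.eigvalsh(tau)[0]
    return (np.sqrt(L) + sigma) / np.sqrt(lam_min)
```

Here np.kron(I, A) - np.kron(A.T, I) is the vectorized commutator map, vec([A, X]) = (1 \u2297 A \u2212 A^T \u2297 1) vec(X), so Tn acting on x = Plow vec(X) reproduces the strictly lower part of [U^T \u02c6Mn U, X] used in the proof of Theorem 1.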
For the same set of solutions U, the third graph in Figure 1 shows the ratios \u03b1/\u03b1true.\nThe experiments show that, at least for small N, the bound (6) produces a reasonable estimate of the true perturbation parameter \u03b1true. However, our bound does not fully capture the concentration that is expected (and observed in practice) for large sets of nearly jointly decomposable matrices (note, for instance, the average value of \u03b1true in Figure 1, for N = 5 vs N = 100). This is most likely due to the introduced approximation ||low(U\u25e6^T Rn U\u25e6)|| \u2264 1 and the use of the triangle inequality in (10) (see proof of Theorem 1), which are needed to separate the observable terms U^T \u02c6Mn U from the unobservable terms U\u25e6^T Rn U\u25e6 in the right-hand side of (9). Extra assumptions on the distribution of the random matrices Rn can possibly allow obtaining tighter bounds in a probabilistic setting.\n\nFigure 1: The empirical bound \u03b1 from (6) vs the true distance \u03b1true, on synthetic experiments.\n\n4 Conclusions\n\nWe addressed a joint matrix triangularization problem that involves finding an orthogonal matrix that approximately triangularizes a set of noise-perturbed jointly diagonalizable matrices. The setting can have many applications in statistics and signal processing, in particular in problems that reduce to a nonsymmetric CP tensor decomposition [4, 8, 20]. The joint matrix triangularization problem can be cast as a nonconvex optimization problem over the manifold of orthogonal matrices, and it can be solved numerically but with no success guarantees. We have derived a posteriori upper bounds on the distance between any approximate triangularizer (obtained by any algorithm) and the (unknown) solution of the underlying unperturbed problem. The bounds depend only on empirical quantities and hence they can be used to assess the quality of any feasible solution, even when the ground truth is not known. We established that, under certain conditions, the bounds are well defined when the noise is small. Synthetic experiments suggest that the obtained bounds are tight enough to be useful in practice.\nIn future work, we want to apply our analysis to related problems, such as nonnegative tensor decomposition and simultaneous generalized Schur decomposition [11], and to empirically validate the obtained bounds in machine learning applications [4].\n\nA Proof of Theorem 2\n\nThe proof consists of two steps. The first step consists of showing that in the limit \u03c3 \u2192 0 the operator \u02c6\u03c4 defined in (7) tends to a simpler operator, \u03c4, which depends on ground-truth quantities only. The second step is to derive a lower bound on the smallest eigenvalue of the operator \u03c4.\nLet \u03c4 be defined by\n\n\u03c4 = (1/2N) \u03a3_{n=1}^{N} Tn^T Tn ,   Tn = Plow (1 \u2297 (U\u25e6^T Mn U\u25e6) \u2212 (U\u25e6^T Mn^T U\u25e6) \u2297 1) Plow^T ,   (17)\n\nwhere Mn \u2208 M\u25e6, and U\u25e6 is the exact joint triangularizer of M\u25e6 that is closest to, and has the same determinant as, U, the approximate joint triangularizer that is used to define \u02c6\u03c4. Proving that \u02c6\u03c4 \u2192 \u03c4 as \u03c3 \u2192 0 is equivalent to showing that\n\n\u02c6Tn |_{\u03c3=0} = Plow (1 \u2297 (U^T \u02c6Mn U) \u2212 (U^T \u02c6Mn^T U) \u2297 1) Skew Plow^T |_{\u03c3=0} = Tn .   (18)\n\nSince for all n = 1, . . . , N, one has \u02c6Mn \u2192 Mn when \u03c3 \u2192 0, we need to prove that U \u2192 U\u25e6 and that we can remove the Skew operator on the right.\nWe first show that U = U\u25e6 e^{\u03b1X} \u2192 U\u25e6, that is, \u03b1 \u2192 0 as \u03c3 \u2192 0.
Assume that the descent algorithm used to obtain U is initialized with Uinit obtained from the Schur decomposition of \u02c6M\u2217 \u2208 M\u03c3. Let U\u25e6 be the exact triangularizer of M\u25e6 closest to Uinit and Uopt be the local optimum of the joint triangularization objective closest to Uinit. Then, as \u03c3 \u2192 0 one has Uopt \u2192 U\u25e6, by continuity of the objective in \u03c3, and also Uinit \u2192 U\u25e6 due to the perturbation properties of the Schur decomposition. This implies U \u2192 U\u25e6, and hence \u03b1 \u2192 0.\nThen, it is easy to prove that Plow (1 \u2297 (U\u25e6^T Mn U\u25e6) \u2212 (U\u25e6^T Mn^T U\u25e6) \u2297 1) Skew Plow^T = Plow (1 \u2297 (U\u25e6^T Mn U\u25e6) \u2212 (U\u25e6^T Mn^T U\u25e6) \u2297 1) Plow^T by considering the action of the two operators on x = Plow vec(X), with X = \u2212X^T. One has\n\nPlow (1 \u2297 (U\u25e6^T Mn U\u25e6) \u2212 (U\u25e6^T Mn^T U\u25e6) \u2297 1) Skew Plow^T x = Plow vec(low[U\u25e6^T Mn U\u25e6, X])\n= Plow vec(low[U\u25e6^T Mn U\u25e6, low(X)]) = Plow (1 \u2297 (U\u25e6^T Mn U\u25e6) \u2212 (U\u25e6^T Mn^T U\u25e6) \u2297 1) Plow^T x ,   (19)\n\nwhere in the second line we used the fact that U\u25e6^T Mn U\u25e6 is upper triangular. This shows that \u02c6\u03c4 \u2192 \u03c4 as \u03c3 \u2192 0.\n\nThe second part of the proof consists of bounding the smallest eigenvalue of \u03c4.
We will make use of\nthe following identity that holds when A and C are upper triangular:\n\nlow(ABC) = low(A low(B) C) ,\n\n(20)\n\nfrom which we get the following identity when A and C are upper triangular:\n\nLow vec(ABC) = Low (C(cid:62) \u2297 A) Low vec(B) .\nlowTnx = Lowvec([U(cid:62)\n\n\u25e6 MnU\u25e6) and it can be shown that1\n\n\u25e6 MnU\u25e6, low(X)]) = Lowvec(U(cid:62)\n\nIn particular, one has P (cid:62)\nlow(X)U(cid:62)\nLow vec( \u02dcV \u039bn \u02dcV \u22121low(X)) = Low ( \u02dcV \u2212T \u2297 \u02dcV ) Low (I \u2297 \u039bn) Low ( \u02dcV (cid:62) \u2297 \u02dcV \u22121) Low vec(X) (29)\nand\nLow vec(low(X) \u02dcV \u039bn \u02dcV \u22121) = Low ( \u02dcV \u2212T \u2297 \u02dcV ) Low (\u039bn \u2297 I) Low ( \u02dcV (cid:62) \u2297 \u02dcV \u22121) Low vec(X) (30)\n\u25e6 MnU\u25e6 = \u02dcV \u039bn \u02dcV \u22121. Now, since\nwhere \u02dcV and \u02dcV \u22121 are upper triangular matrices de\ufb01ned by U(cid:62)\n\u02dcV = U(cid:62)\n\u25e6 V , where V is de\ufb01ned via the spectral decomposition Mn = V \u039bnV \u22121, we can rewrite the\noperator Tn as\n\n(21)\n\u25e6 MnU\u25e6low(X) \u2212\n\nTn = Plow(U(cid:62)\n\n\u25e6 V \u2212T \u2297 U(cid:62)\n\n\u25e6 V )Low(1 \u2297 \u039bn \u2212 \u039bn \u2297 1)Low(V T U\u25e6 \u2297 V \u22121U\u25e6)P (cid:62)\n\nlow , (31)\n\nand use the following inequality for the smallest eigenvalue of \u03c4 = 1\n2N\n\n(cid:80)N\nn=1 T (cid:62)\n\nn Tn:\n\nwhere\n\n\u03bbmin(\u03c4 ) \u2265\n\n1\n2N\n\n\u03bbmin(A) \u03bbmin(B) \u03bbmin(C),\n\nA = Plow(U(cid:62)\n\n(cid:32) N(cid:88)\n\n\u25e6 V \u2297 U(cid:62)\n\n\u25e6 V \u2212T )P (cid:62)\nlow,\n(1 \u2297 \u039bn \u2212 \u039bn \u2297 1)2\n\n(cid:33)\n\nP (cid:62)\nlow,\n\n(32)\n\n(33)\n\n(34)\n\nNow, it is easy to show that \u03bbmin(A) = \u03bbmin(C) \u2265 1\nA and C are obtained by deleting certain rows and columns of U(cid:62)\nrespectively. 
The matrix B is a diagonal matrix with entries given by\n\nB = Plow\nC = Plow(V \u22121U\u25e6 \u2297 V T U\u25e6)P (cid:62)\nlow.\n\nn=1\n\nN(cid:88)\n\n([\u039bn]ii \u2212 [\u039bn]jj)2,\n\n(35)\n\u03ba(V )2 since the d(d\u22121)/2\u00d7d(d\u22121)/2 matrices\n\u25e6 V \u2297 U(cid:62)\n\u25e6 V \u2212T and V \u22121U\u25e6\u2297 V T U\u25e6\ni\u22121(cid:88)\n(cid:80)N\nn=1(\u03bbi(Mn) \u2212 \u03bbj(Mn))2 and\n(cid:80)N\n(37)\nn=1(\u03bbi(Mn)\u2212 \u03bbj(Mn))2 is a \u2018joint eigengap\u2019 of the ground-truth matrices\n(cid:3)\n\n\u03bbmin(\u02c6\u03c4 ) \u2265 \u0393\n\nk = j \u2212 i +\n\n(d \u2212 a),\n\n\u03ba(V )4 ,\n\nlim\n\u03c3\u21920\n\n(36)\n\nn=1\n\na=1\n\n[B]kk =\n\nwith 0 < i < j and j = 1, . . . d. This implies \u03bbmin(B) = minij\nMn \u2208 M\u25e6.\n\n1\n2N\n\n1For any matrix Y one has\n\n\u22121Y ) =\n\u22121Y \u02dcV \u02dcV\n\nLow vec( \u02dcV \u039bn \u02dcV\nLow vec( \u02dcV \u039bn \u02dcV\nLow ( \u02dcV\nLow ( \u02dcV\nLow ( \u02dcV\n\n\u22121) =\n\u2212T \u2297 \u02dcV ) Low vec(\u039bn \u02dcV\n\u22121Y \u02dcV ) =\n\u2212T \u2297 \u02dcV ) Low (I \u2297 \u039bn) Low vec( \u02dcV\n\u2212T \u2297 \u02dcV ) Low (I \u2297 \u039bn) Low ( \u02dcV\n\n\u22121Y \u02dcV ) =\n\n(cid:62) \u2297 \u02dcV\n\n\u22121) Low vec(Y )\n\nand similarly\n\nLow vec(Y \u02dcV \u039bn \u02dcV\nLow ( \u02dcV\n\n\u22121) =\n\n\u2212T \u2297 \u02dcV ) Low (\u039bn \u2297 I) Low ( \u02dcV\n\n(cid:62) \u2297 \u02dcV\n\n\u22121) Low vec(Y )\n\n6\n\n(22)\n(23)\n(24)\n(25)\n(26)\n\n(27)\n(28)\n\n\fReferences\n[1] K. Abed-Meraim and Y. Hua. A least-squares approach to joint Schur decomposition. In\nAcoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International\nConference on, volume 4, pages 2541\u20132544. IEEE, 1998.\n\n[2] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds.\n\nPrinceton University Press, 2009.\n\n[3] B. Afsari. Sensitivity analysis for the problem of matrix joint diagonalization. 
SIAM Journal on\n\nMatrix Analysis and Applications, 30(3):1148\u20131171, 2008.\n\n[4] A. Anandkumar, R. Ge, D. Hsu, S. Kakade, and M. Telgarsky. Tensor decompositions for\nlearning latent variable models. Journal of Machine Learning Research, 15:2773\u20132832, 2014.\n\n[5] B. Balle, A. Quattoni, and X. Carreras. A spectral learning algorithm for \ufb01nite state transducers.\nIn Machine Learning and Knowledge Discovery in Databases, pages 156\u2013171. Springer, 2011.\n\n[6] J.-F. Cardoso. Perturbation of joint diagonalizers. Telecom Paris, Signal Department, Technical\n\nReport 94D023, 1994.\n\n[7] J.-F. Cardoso and A. Souloumiac. Jacobi angles for simultaneous diagonalization. SIAM journal\n\non matrix analysis and applications, 17(1):161\u2013164, 1996.\n\n[8] N. Colombo and N. Vlassis. Tensor decomposition via joint matrix Schur decomposition. In\n\nProc. 33rd International Conference on Machine Learning, 2016.\n\n[9] R. M. Corless, P. M. Gianni, and B. M. Trager. A reordered Schur factorization method for zero-\ndimensional polynomial systems with multiple roots. In Proceedings of the 1997 international\nsymposium on Symbolic and algebraic computation, pages 133\u2013140. ACM, 1997.\n\n[10] L. De Lathauwer. A link between the canonical decomposition in multilinear algebra and\nsimultaneous matrix diagonalization. SIAM Journal on Matrix Analysis and Applications,\n28(3):642\u2013666, 2006.\n\n[11] L. De Lathauwer, B. De Moor, and J. Vandewalle. Computation of the canonical decomposition\nby means of a simultaneous generalized Schur decomposition. SIAM Journal on Matrix Analysis\nand Applications, 26(2):295\u2013327, 2004.\n\n[12] T. Fu, S. Jin, and X. Gao. Balanced simultaneous Schur decomposition for joint eigenvalue esti-\nmation. In Communications, Circuits and Systems Proceedings, 2006 International Conference\non, volume 1, pages 356\u2013360. IEEE, 2006.\n\n[13] M. Haardt and J. A. Nossek. 
Simultaneous Schur decomposition of several nonsymmetric\nmatrices to achieve automatic pairing in multidimensional harmonic retrieval problems. Signal\nProcessing, IEEE Transactions on, 46(1):161\u2013169, 1998.\n\n[14] R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge University Press, 2nd edition, 2012.\n\n[15] R. Iferroudjene, K. A. Meraim, and A. Belouchrani. A new Jacobi-like method for joint\ndiagonalization of arbitrary non-defective matrices. Applied Mathematics and Computation,\n211(2):363\u2013373, 2009.\n\n[16] M. Konstantinov, P. H. Petkov, and N. Christov. Nonlocal perturbation analysis of the Schur\nsystem of a matrix. SIAM Journal on Matrix Analysis and Applications, 15(2):383\u2013392, 1994.\n\n[17] V. Kuleshov, A. Chaganty, and P. Liang. Tensor factorization via matrix factorization. In 18th\n\nInternational Conference on Arti\ufb01cial Intelligence and Statistics (AISTATS), 2015.\n\n[18] X. Luciani and L. Albera. Joint eigenvalue decomposition using polar matrix factorization. In\nInternational Conference on Latent Variable Analysis and Signal Separation, pages 555\u2013562.\nSpringer, 2010.\n\n[19] J.-S. Pang. A posteriori error bounds for the linearly-constrained variational inequality problem.\n\nMathematics of Operations Research, 12(3):474\u2013484, 1987.\n\n7\n\n\f[20] A. Podosinnikova, F. Bach, and S. Lacoste-Julien. Beyond CCA: Moment matching for multi-\n\nview models. In Proc. 33rd International Conference on Machine Learning, 2016.\n\n[21] S. Prudhomme, J. T. Oden, T. Westermann, J. Bass, and M. E. Botkin. Practical methods for a\nposteriori error estimation in engineering applications. International Journal for Numerical\nMethods in Engineering, 56(8):1193\u20131224, 2003.\n\n[22] S. H. Sardouie, L. Albera, M. B. Shamsollahi, and I. Merlet. Canonical polyadic decomposition\nof complex-valued multi-way arrays based on simultaneous Schur decomposition. 
In Acoustics,\nSpeech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 4178\u2013\n4182. IEEE, 2013.\n\n[23] A. Souloumiac. Nonorthogonal joint diagonalization by combining Givens and hyperbolic\n\nrotations. IEEE Transactions on Signal Processing, 57(6):2222\u20132231, 2009.\n\n8\n\n\f", "award": [], "sourceid": 2504, "authors": [{"given_name": "Nicolo", "family_name": "Colombo", "institution": "Univ of Luxembourg"}, {"given_name": "Nikos", "family_name": "Vlassis", "institution": "Adobe Research"}]}