Paper ID: 2504

Title: A posteriori error bounds for joint matrix decomposition problems

In this paper, the authors considered the important problem of a posteriori error bounds for joint matrix decomposition. A perturbation analysis of the noisy joint matrix triangularization problem was carried out. Some possible applications were also considered.

1. I think the authors should explain some notation in more detail, e.g., the matrix exponential.
2. Theorem 1 is hard to understand at first glance. A numerical example might be helpful for illustration.
3. Section 3 is not well presented in the current version. The authors propose an application to tensor decomposition, but fail to introduce the background setting, which makes it hard to understand.
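To expand on point 1: as I understand the paper's parameterization, triangularizers are written as matrix exponentials $e^X = \sum_k X^k / k!$. A minimal illustration (my own sketch, not code from the paper) showing that the exponential of a skew-symmetric matrix is orthogonal might already help readers:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
X = A - A.T                      # skew-symmetric generator: X.T == -X

U = expm(X)                      # matrix exponential, sum_k X^k / k!
# For skew-symmetric X, e^X is orthogonal: (e^X)^T = e^{-X} = (e^X)^{-1}
orthogonality_error = np.linalg.norm(U.T @ U - np.eye(4))
```

A sentence along these lines in the paper would make the notation self-contained.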

2-Confident (read it all; understood it all reasonably well)

The paper presents some error bounds for joint matrix decomposition problems that depend on the observed data and the current solution. Sometimes the observed data are perturbed instances of the actual data. The main goal of the joint decompositions is to recover as much underlying information as possible.

1. The paper is well written and easy to understand, but perhaps too theoretical for the NIPS audience.
3. I suggest that the authors consider more triangularizer schemes; here they only present the exponential one.
4. Future work is lacking.
Overall it is a well written and constructed paper. It is not overly complicated, and the flow is satisfactory. The authors claim that this is the first a posteriori bound for the distance between the optimized and the true triangularizer; a brief search turned up some recent papers on a posteriori bounds for tensors, and the authors might want to cite some of those.

2-Confident (read it all; understood it all reasonably well)

This paper gives a perturbation analysis for the noisy joint matrix triangularization problem, which can be considered a generalization of the joint matrix diagonalization problem, which has many applications and has been well studied. The work is based on an optimization problem (3) and shows that its solution is stable. The paper also shows an application to nonnegative tensor decomposition and a possible extension to the joint generalized Schur decomposition.
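As a concrete illustration of the noise-free problem (my own sketch, not code from the paper): matrices sharing an eigenvector matrix $V$ are jointly triangularized by the orthogonal factor of the QR decomposition of $V$, which is what the noisy optimization tries to approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
V = rng.standard_normal((d, d))                      # shared eigenvector matrix
M1 = V @ np.diag([1.0, 2.0, 3.0, 4.0]) @ np.linalg.inv(V)
M2 = V @ np.diag([5.0, -1.0, 0.5, 2.5]) @ np.linalg.inv(V)

# If V = QR, then Q.T @ M @ Q = R @ Lambda @ inv(R), a product of upper
# triangular matrices -- so the single orthogonal factor Q triangularizes both.
Q, R = np.linalg.qr(V)
T1 = Q.T @ M1 @ Q
T2 = Q.T @ M2 @ Q

off1 = np.abs(np.tril(T1, k=-1)).max()               # strictly lower part ~ 0
off2 = np.abs(np.tril(T2, k=-1)).max()
```

With noise added to M1 and M2, no exact joint triangularizer exists, and the paper bounds how far a feasible approximate one can be from this Q.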

The main result of the paper is a perturbation analysis, which seems technically involved; the analysis is not standard. But my concern is whether and how important the result is: some connections with tensor decomposition are established, but the details are still missing. Some other comments:
1. Should min and max be switched in the last two equations in (11)? Also, I would suggest using \hat{M} instead of \hat{m}, since this is a matrix.
2. What is the citation [Authors, 2016] in lines 41, 121 and 144?
3. Some details are left out that could have been submitted as a supplementary file, which makes it difficult to evaluate the paper; see line 80, "All proofs of lemmas will appear in an extended version of the paper," and lines 170-171, "full details will appear elsewhere."

1-Less confident (might not have understood significant parts)

Joint triangularization is a problem that arises in a variety of different applications, including ICA, latent variable estimation, and tensor decomposition. As it is a non-convex problem, obtaining error bounds on the quality of an observed solution is desirable. The paper derives such an "a posteriori" bound: given a feasible solution, the problem input (approximately diagonalizable matrices to be jointly triangularized), and the inherent noise magnitude sigma, the derived bound can be used to bound the distance of the solution from a "ground truth" triangularizer. The key here is that the bound is almost fully "a posteriori": apart from sigma, all other quantities are directly measurable from the problem input. Moreover, the bound applies to an arbitrary feasible solution, irrespective of the algorithm used to derive it. According to the authors, all of this is in contrast to earlier bounds, which relied on assumptions on the unobservable "ground truth" data and hence cannot be used to directly assess the quality of a solution. The paper also provides some insight on how these results can be generalized to broader decomposition tasks.

I enjoyed reading the paper; the problem and result are discussed with great clarity. The exposition of the problem was very informative. I also appreciate the usefulness of deriving an a posteriori bound. That said, I list below certain concerns that, if addressed, would strengthen the paper in my opinion.
1) The usefulness of an "a posteriori" bound is not explicitly stated, though one can infer it from the context. It would be good to state explicitly and concisely somewhere in the contributions section that (a) because the problem is non-convex, it is hard to solve; (b) error bounds are therefore needed to assess the quality of a solution produced by any heuristic/algorithm that tries to solve it; and (c) error bounds that directly assess the quality of the solution "a posteriori" are very useful to that end, as they do not depend on the algorithm used or on unobserved quantities. These three statements are somewhat dispersed throughout the paper and could stand out better.
2) Though the discussion of the (possible) applicability to more general problems is interesting, one drawback of the paper is that it spends a lot of time on the big picture rather than on demonstrating the efficacy/importance of the bound at hand. From a practical standpoint, for example, demonstrating how useful the bound is on specific methods, and whether it can indeed discriminate between good and bad solutions (which could be verified over synthetic data where the ground truth is known), would illustrate that this is indeed something useful. From a theoretical standpoint, showing how it could be used to provide guarantees for solutions obtained by explicit algorithms would also show that the bound is "useful" in some sense. Either of the two is more relevant to the focus of the paper than the discussion of (possible, but not actually derived) generalizations.
3) Nits: kappa is not defined in Theorem 1; later (Sec. 2.2) it is indirectly referred to as the condition number. Please clarify what it is. Regarding the eigenvector matrix \hat{V}: can you please elaborate on why \hat{m} is diagonalizable, given the presence of the noise terms W_n? Regarding \hat{\gamma}: can you please elaborate on why this is bounded away from zero, given the presence of the noise terms W_n? If the assumptions of Lemma 1 do not hold, does this imply the absence of a triangularizer, or that the number of triangularizers is infinite? If it is the latter, should this not mean that the problem is easier to solve? In that sense, is this assumption technical, or truly necessary because, e.g., it relates to \hat{\gamma}?

2-Confident (read it all; understood it all reasonably well)

In this paper, the authors carried out a perturbation analysis of the noisy joint matrix triangularization problem, and they derived a bound on the distance between any feasible approximate triangularizer and its noise-free counterpart. The work is interesting. However, the paper is not well written, and many points are unclear.
• Equation (5): what is "GL(d)"?
• Equation (11): the upper case and lower case of variables are confusing. It is better to use upper case for matrices and lower case for vectors or scalars. I am confused about the last two terms of Equation (11).
• Equations (15)-(16): is A/B a matrix or a vector? It is misleading.
• Page 5, line 178, "A = e^{-aX}A_0^T": should there be a transpose here?
• Page 5, equations (17) to (20): I did not get the point.
• The format of the references should be consistent.

Improve the presentation and clarify the paper.

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

The authors introduce a new perturbation analysis of joint triangularization of a set of matrices. Unlike previous bounds (e.g. the analysis by Cardoso), these involve quantities that can be estimated from data, except for the noise level. The bounds are also independent of the choice of algorithm. When matrices are symmetric, this problem reduces to orthogonal joint diagonalization, which has many machine learning applications.

I think this is a good first step in an important research direction. The methods that the authors analyze have a wide range of applications: independent component analysis, tensor decomposition, learning latent variable models, and many others. Joint (dia/trian)gularization algorithms work well in practice, but may in principle get stuck in local optima. The authors' bounds can help assess the extent to which this is a problem. It was not obvious to me that such bounds are possible. The result is a good addition to the literature, but I think the authors could have spent more time/effort exploring the topic to make it a really great paper. Also, I found some presentation choices unusual.
General comments / suggestions for improvement:
- Some experimental validation would be very helpful. How useful are these bounds in practice? Are they always loose, or can we find cases where they provide meaningful guarantees? I think it's okay if they're loose, since these are the first a posteriori bounds, but for follow-up work it would be very helpful to know.
- More discussion of applications would be very useful for motivation, especially emphasizing applications within the method of moments.
- I would have liked more discussion of when the known-noise assumption is valid. I get that in MoM-style applications we can get a bound, but what about other cases?
- Are the bounds useful in a noiseless setting? The global convergence of Jacobi-style methods is still an open question, so a posteriori bounds should be useful in that case too.
- How do the bounds extend to low-rank matrices? It seems the theorem does not exactly apply because of the assumption of distinct eigenvalues.
Presentation comments:
- What is the purpose of Lemma 1? It doesn't seem to be used in the proof. Is it just to say that we must assume the eigenvalues are distinct? I would just state that in one English sentence.
- I didn't get the purpose of most of Section 3. It seems like a long literature review. In 3.1, (15)-(16) are finally instantiated with one operator, so why not start with that instantiation in the first place? Still, this section lacks any concrete results in my opinion and needs more work. Why do the authors say that its full results will appear elsewhere?
- I like the way the a posteriori version nicely extends the previous a priori bound.

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

The paper addresses the problem of approximate non-orthogonal joint diagonalization (NOJD) of several matrices, where each matrix is of the form $V_o \Lambda_n V_o^{-1}$ plus some noise. Such a problem appears, e.g., when performing estimation in some latent variable models via the method of moments. The authors approach this NOJD problem via the related simultaneous Schur decomposition (SSD), which is formulated as a non-convex optimization problem (OP). The main result of this paper is an a posteriori bound, i.e., one which depends almost solely on a posteriori quantities. The only a priori quantity appearing in the bound is the noise parameter $\sigma$, which the authors assume to be known (see lines 57-59). For the obtained bound, an approximate joint triangularizer (the output of an algorithm for the OP) is required to be neither a global solution nor a critical point, and the bound depends on the respective objective value. This bound is the first a posteriori bound for the SSD problem and, more generally, for joint matrix decomposition problems.
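To make the OP concrete, here is a small sketch (my own; the paper's exact objective may differ) of the standard SSD residual, the sum of squared strictly lower-triangular entries after conjugation, which is the observable a posteriori quantity driving such a bound:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
V = rng.standard_normal((d, d))
# Three matrices sharing the eigenvector matrix V (the sigma = 0 setting)
Ms = [V @ np.diag(rng.standard_normal(d)) @ np.linalg.inv(V) for _ in range(3)]

def ssd_objective(U, Ms):
    """Sum of squared strictly lower-triangular entries of U.T @ M @ U."""
    return sum(float(np.sum(np.tril(U.T @ M @ U, k=-1) ** 2)) for M in Ms)

Q, _ = np.linalg.qr(V)                                # exact joint triangularizer
Qrand, _ = np.linalg.qr(rng.standard_normal((d, d)))  # arbitrary feasible point

good = ssd_objective(Q, Ms)     # ~ 0: the residual certifies the solution
bad = ssd_objective(Qrand, Ms)  # positive for a generic orthogonal matrix
```

The point of the paper's bound, as I read it, is that this computable residual (plus $\sigma$) controls the distance of any feasible $U$ from the noise-free triangularizer, with no requirement that $U$ be a global optimum or even a critical point.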

Update of the review: I read the rebuttal and the other reviews and argue for acceptance of the paper. I would ask the authors to pay attention to the application of Weyl's theorem, which holds true only for Hermitian matrices. The main result of the paper -- an a posteriori bound for the SSD problem -- and the techniques used to obtain it are novel both in the context of the SSD problem and in the context of joint matrix decomposition problems. Since joint matrix decomposition methods are known to work well in practice for computing tensor factorizations, this result can be a step toward understanding this empirical observation. Therefore, the contributions of this paper are important and can be of interest to a certain community. The paper, however, is not written in a reader-friendly way. I especially disagree with the way the authors motivate the problem, since it seems to confuse several different joint matrix decomposition problems (see the detailed explanation in the next paragraph), which (a) can easily be interpreted in a wrong way and (b) can look like hiding some important assumptions. Therefore, I would insist on rewriting the introductory section in a clearer way. Moreover, some parts of the proof of Theorem 1 (the main result; see the detailed discussion below) are subtle and have to be clarified. Given that the authors successfully fix the mentioned problems, I would support acceptance of the paper. The citations are not in accordance with the NIPS template ((Authors, year) format vs. [number]).

--- *Introduction.* One of my points of criticism is how the problem is motivated, e.g., in lines 24-26. Basically, there are two different types of simultaneous diagonalization problems: problem (1) in lines 19-20 and joint diagonalization by congruence (see comment #1). The latter is one of several techniques to obtain a *symmetric* tensor CP decomposition, which, e.g., naturally arises in estimation for some latent variable models (as in the cited papers by Anandkumar et al., Balle et al., and Kuleshov et al., where the considered matrices are tensor contractions/projections or slices). The former naturally arises, e.g., in (De Lathauwer et al., 2004; Section 5) as a technique to obtain a *non-symmetric* tensor CP decomposition. One can transform the matrices in joint diagonalization by congruence to problem (1) (see comment #2). This, however, requires some additional assumptions (invertibility of $V_o$?) and poses some additional questions, e.g., whether this transformation simplifies the original problem or instead makes it more complicated. These differences should be explicitly addressed in the paper given that the authors want to motivate the problem with the applications discussed by Anandkumar et al., Balle et al., and Kuleshov et al. As an alternative or in addition, the authors could motivate the problem with applications that boil down to a non-symmetric CP decomposition (as, e.g., discussed in De Lathauwer et al., 2004; [4] would be a more recent machine learning reference). In any case, the differences, assumptions, and connections between the problems have to be explicitly addressed. Comments:
#1: There are two types of simultaneous diagonalization problems: the problem in (1) and joint diagonalization by congruence, where the matrices have the form $V_o \Lambda_n V_o^{\top}$, i.e. the last matrix goes with the *transpose* instead of the *inverse*. The two problems are equivalent only if $V_o$ is orthogonal, and then it is the problem of orthogonal joint diagonalization considered, e.g., by Cardoso and Souloumiac (1996). The setting where $V_o$ is orthogonal is, however, not the main setting of the paper being reviewed.
#2: In joint diagonalization by congruence the matrices have the form $V_o \Lambda_n V_o^{\top}$.
Assuming $V_o$ and, say, the j-th matrix are invertible, multiply the i-th matrix by the inverse of the j-th matrix to obtain a matrix $V_o \Lambda_i \Lambda_j^{-1} V_o^{-1}$ in the form suitable for problem (1) (see, e.g., De Lathauwer et al., 2004, Section 5). This being said, we recall that the perturbation analysis is only obtained for the SSD problem, but not for the related tensor factorization problems. Do the results of the analysis easily extend to this case?

--- *The proof of Theorem 1.* I have several questions regarding the proof:
- The definition (line 212) of the quantity $T_{\beta}$ is not clear and possibly wrong (see comment #4). This definition does not influence the overall result significantly, but it has to be clarified.
- Is it obvious that min and max can be exchanged in (29)? (see comment #5)
- Lemma 2 is provided without proof, and I cannot check whether the result is correct, especially given the uncertainty in the definition of $T_{\beta}$.
- The expansion in Eq. (32) does not obviously follow from Weyl's theorem applied to the singular values of the matrices in line 224. The expression is in terms of the condition numbers ($\kappa$?) of $V$ and $\hat{V}$ and not the singular values of $M$ and $\hat{M}$.
Comments:
#4: In the definition of the quantity $T_{\beta}$ in line 212, it is not clear what Low means. Assuming the first Low refers to the lower triangular part of a matrix, the LHS of (25) is not equal to the LHS of (24). E.g., $low( M_{\beta} \otimes 1 ) vec(X)$ is actually the zero matrix, and $low( 1 \otimes M_{\beta}^{\top} ) vec(X)$ is equal to neither $vec( low( low(X) M_{\beta} ) )$ nor $vec( low( M_{\beta} low(X) ) )$. Obvious modifications of this result do not seem to help. Clearly, there is a matrix $T_{\beta}$ that can be expressed via $M_{\beta}$, but the question is how simple or difficult this expression is.
#5: The problem is that the sets in the min and max are not convex and compact. Is it obvious that these constraints can be relaxed to some convex and compact sets such that the solution is tight? I can see that the min problem can be reformulated as minimization over the simplex, but the objective function is not linear in $\beta$ anymore (an upper envelope of piecewise linear functions) and, hence, it does not seem straightforward to relax the unit sphere to the unit ball for $\beta$. Is there a different explanation? If not, does it help that, by weak duality, the first equality in (29) can always be replaced by a less-or-equal sign?

--- Other less important questions and comments: I cannot check the correctness of Lemma 1 since the proof was not provided. However, the results of the lemma are not used in other proofs, so I see it as a complementary result. Lemma 1 provides a kind of identifiability result for the SSD problem. Is there some intuition behind these results (e.g., an analogy to the permutation and scaling unidentifiability in the joint diagonalization by congruence case)? After introducing problem (1), the authors may consider mentioning that this problem has already been addressed in the literature, e.g., in [1,2,3] (see below) or by De Lathauwer et al. (2004). However, papers [1,2,3] do not provide theoretical analysis, and the first-order perturbation analysis from Section 6 of De Lathauwer et al. (2004) is weaker than the results of this paper. One could also add an explanation of these differences. In lines 38-39, the reference to Cardoso and Souloumiac (1996) is not appropriate since their optimization is also over the manifold of orthogonal matrices. The related optimization problems which do not involve the manifold of orthogonal matrices are, e.g., the ones by Afsari (2006), Kuleshov et al. (2015), or [9]. Moreover, the statement in lines 38-39 is a bit vague since the problems have different objective functions. In lines 40-41, one could add references, e.g., to De Lathauwer et al. (2004) and [6].
The bounds known in the literature (Cardoso, 1994; Afsari, 2008; Kuleshov et al., 2015) are obtained for joint diagonalization by congruence and, therefore, are formally different from the SSD problem considered in this paper. On the contrary, an a priori bound for the SSD problem was recently obtained in [6], and the authors may consider citing this. List of typos/comments:
1) One should define the set GL(d) in equation (5).
2) It could be useful to specify which matrix norm is used, especially in the proof of Theorem 1.
3) In the transition between equations (22) and (23), it is worth mentioning that the expansion $e^{\alpha X} = 1 + \alpha X + O(\alpha^2)$ is used.
4) In equations (24), (25) and further, the term $O( (\alpha + \sigma)^2 )$ is lost (either explain why it is omitted or put it back in).
5) In equation (28), the rightmost $\sigma$ has to be squared.
6) "R" is missing in the sum sign in line 223.
7) It would be useful to define $\kappa(\cdot)$ and $\sigma_{max/min}(\cdot)$ appearing in the proof -- the condition number and the singular values?
8) In line 230, "min" and "max" are probably mixed up.
9) Is there some intuition behind the identifiability results in Lemma 1 (like permutation and scaling)?
10) Comment on line 132: there is only one way to obtain a slice of a tensor, but the matrices which are to be diagonalized can be constructed in different algebraic forms.
11) In lines 158-159, the whitening procedure does not necessarily lead to poor performance (and this is what the cited paper says). Moreover, it can be (and often is) used for dimensionality reduction.
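Regarding item 3) above: the claimed $O(\alpha^2)$ remainder of the linearization is easy to confirm numerically (a sketch of mine with an arbitrary skew-symmetric generator, not the paper's setup): shrinking $\alpha$ by 10 should shrink the linearization error by roughly 100.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
X = A - A.T                          # skew-symmetric generator
I = np.eye(4)

# Error of the linearization e^{alpha X} ~ I + alpha X for decreasing alpha
errs = [np.linalg.norm(expm(a * X) - (I + a * X)) for a in (1e-1, 1e-2, 1e-3)]
ratios = [errs[i] / errs[i + 1] for i in (0, 1)]   # each ~ 100 for O(alpha^2)
```

Stating this expansion explicitly in the transition (22)-(23) would make the proof easier to follow.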
--- References:
[1] "Joint eigenvalue decomposition using polar matrix factorization" by Luciani and Albera, 2010;
[2] "A new Jacobi-like method for joint diagonalization of arbitrary non-defective matrices" by Iferroudjene, Abed-Meraim, and Belouchrani, 2009;
[3] "Simultaneous diagonalization with similarity transformation for non-defective matrices" by Fu and Gao, 2006;
[4] "Beyond CCA: moment matching for multi-view models" by Podosinnikova, Bach, and Lacoste-Julien, 2016;
[5] "Nonorthogonal joint diagonalization by combining Givens and hyperbolic rotations" by Souloumiac, 2009;
[6] "Tensor decomposition via joint matrix Schur decomposition" by Colombo and Vlassis, 2016.

3-Expert (read the paper in detail, know the area, quite certain of my opinion)