Paper ID: 686
Title: Convex Tensor Decomposition via Structured Schatten Norm Regularization

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.
This paper considers convex norm-regularized optimization approaches for factorization of approximately low-rank tensors. It focuses on a recently proposed "latent" norm approach ([25]), and its comparison to a previously used "overlap" norm approach. Specifically, the paper derives theoretical bounds on the "latent" norm mean-squared-error behavior under several noise assumptions, which compare favorably to the existing analysis on the overlap norm, supporting its better empirically-observed performance in a previous paper. The theory is also confirmed via some numerical simulations. Finally, the paper considers a generalization of both these norms, and proves a duality theorem for it.

The paper is generally easy to follow, and appears to be technically sound (although I haven't checked the proofs in the supplementary material carefully). The results may also be useful for researchers focusing on optimization methods for tensor decompositions. My main issues are that
(i) The results are narrow and of a technical nature, only peripherally related to machine learning, and probably of limited potential impact at NIPS. While the authors make the (justified) case that tensor factorizations can crop up in machine learning, the paper does not attempt to examine the proposed approaches in the context of machine learning problems, either theoretically or experimentally.
(ii) The results seem a bit incremental, and I didn't find them to be particularly surprising (admittedly this is a subjective notion). They seem to provide some additional theoretical backing to a specific recently proposed approach [25], which was already argued empirically to work well. The techniques used also don't seem particularly novel.

A more minor defect is that the upper bounds do not come with any lower bounds. This is potentially dangerous, since the upper bounds are used to compare between different algorithms and argue for the superiority of one over the other.

Additional Comments

- The abstract begins with "We propose a new class of structured Schatten norms for tensors...". However, I think this is slightly misleading. The focus of the paper is devoted to analyzing two existing approaches (overlapped and latent). Unless I missed something, the generalization only comes up in its definition in section 2.1 and in the derivation of its dual norm (Lemma 1). In particular, no algorithms are proposed based on these generalized Schatten norms, the behavior of such algorithms is not analyzed, nor are they tested experimentally.

- Figure 1 is presented in the introduction, but at this stage is not interpretable without the background appearing in section 2 onwards.

- Line 38: "Tensor is" -> "Tensor modelling is"(?)
- Line 40: Missing comma after "Conventionally"
- Line 395: "also in such situations" -> "in such situations as well"
Q2: Please summarize your review in 1-2 sentences
The paper's main contributions are theoretical error bounds for a recently proposed low-rank tensor decomposition approach. The paper seems technically sound, but the results are somewhat incremental and may suffer from limited impact at NIPS.

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.
****Additional comments after the rebuttal:
The authors need to do a much better job on motivating the problem for machine learning, for the paper to be accessible to a wider audience. The authors need to provide the latent variable model (with a graphical model representation) where this tensor decomposition can learn the parameters: Consider a model where latent variable h_1 connects to observed variable x_1, h_2 to x_2 and so on. Assume a linear model from hidden to observed variables, and arbitrary dependence among the h_i's. In this case, the observed moment tensor has a tucker decomposition and its multilinear rank is dim(h_1), dim(h_2).. The authors need to add this discussion and preferably provide a picture depicting this model. The authors also need to point out that it is possible that only some hidden variables are of small dimensions and not the others, which was a main advantage for the new result in the paper.

The paper considers convex-optimization-based tensor decomposition using structured Schatten norms for tensors. The authors employ the overlapped Schatten norm and show how it leads to better reconstruction guarantees. The paper is well written and makes technically important contributions. Fixing a few technical and notational issues will make the paper even more readable.

*Clarification about Theorem 2 and its proof:

The statement of Theorem 2 is technically not correct: the components are only recovered up to a permutation. Although a minor issue, the authors need to mention this.

In Theorem 2, it is stated that (12) immediately leads to (13). This needs justification.

Proof of Lemma 5 is missing. Lemma 5 forms a key part of Theorem 2.

It is better to clarify the proof of Lemma 4 explicitly rather than referring to another paper, although it is straightforward.

*Minor issues:

Need to add more related work on HOSVD and other methods for tensor decomposition:
The authors can mention how the higher-order SVD and its variants yield approximation bounds for tensor decomposition. See Kolda's review on tensor decompositions (section 4).

The notion of rank that the authors use, which they refer to as mode-rank, is also referred to as multilinear rank. The authors can mention this and note in the introduction that there are different notions of tensor rank.

The authors can briefly discuss the computational complexity of the methods employed.

The authors need to clarify their discussion about incoherence after (10). The authors use the term to mean that the "energy" needs to be spread across different mixture components and modes, i.e., no component or mode can have too high a singular value. The authors mention that this is the same as the incoherence conditions of [1,3,10]. I do not see this. In [1,3,10], incoherence notions are used for sparse+low-rank recovery. However, here the noise is not sparse. The condition of [1,3,10] involves the singular vectors being incoherent with the basis vectors. This is different from the condition here. I suggest removing this comparison.

*Typos/notational issues:

Superscript notation for components \hat{\cal{W}}^{(k)} is confusing. Suggest using some other symbol for components.

Bad notation: \bar{r}_k and \underline{r}_k are both used for k-mode rank. This should be changed. Could introduce mult_rank(tensor) instead to denote the multilinear rank.

In the last line of page 12 in the supplementary material, a parenthesis is missing.

Abstract: "result for structures Schatten" should be changed to "structured".
Q2: Please summarize your review in 1-2 sentences
The paper is well written and makes a novel contribution in terms of finding a good Tucker tensor decomposition using an overlapped Schatten norm, with guarantees for recovery. There are, however, some technical clarifications needed, which will hopefully be addressed by the authors.

Submitted by Assigned_Reviewer_9

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.
This paper provides theoretical bounds on the performance of a convex tensor decomposition algorithm regularized by the latent Schatten norm. It improves upon the result known for regularization using the overlapped Schatten norm.

Quality : The paper obtains theoretical results for tensor decomposition using the latent Schatten norm. The dual characterization of the overlapped and latent Schatten norms in particular is interesting. The experiments, however, are rather briefly considered and are not elaborated upon. For instance, how were the Tucker ranks (r_1, ..., r_K) of the examples chosen? How does group regularization such as l_{1,\infty} work when the tensor is simultaneously low rank in all modes? How does the method perform as compared to non-convex approaches?

Clarity : The paper is well organized and easy to follow.

Originality : The main original result in the paper is the duality between the overlapped and latent Schatten norms. This result seems to be of independent interest.

Significance : Tensor decomposition is an important problem encountered in many applications. Since this paper improves upon existing results for this problem, I believe it is significant work.
Q2: Please summarize your review in 1-2 sentences
This paper improves upon existing results for the tensor decomposition problem. The theoretical results are interesting and the paper is well written. However, experiments are considered rather briefly towards the end of the paper.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We would like to thank all reviewers for their helpful comments.

To Assigned_Reviewer_1:
It is still an open question whether there is a (class of) computationally tractable convex relaxation for tensor rank, and this work, though it may seem incremental, is a step toward this goal. A convex relaxation for tensor decomposition is relevant not only in machine learning and data mining, but also for quantum physics and computational complexity theory. Thus we believe that it is suitable for an interdisciplinary venue such as NIPS.

A lower bound for tensor decomposition was recently discussed in Mu et al. (2013) "Square Deal: Lower Bounds and Improved Relaxations for Tensor Recovery". Although their result does not directly apply to our case, it is likely that a similar argument could show that our bound is tight, which is also consistent with our numerical results.

"We propose a new class of structured Schatten norms for tensors..." refers to the fact that we propose overlapped/latent Schatten S_p/q-norms, which are more general than the p=1, q=1 case studied earlier. The dual norm (p=\infty and q=\infty) plays a crucial role in our analysis. Moreover, other choices (e.g., p=1 and q=\infty) may also be used in the future.
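Based on this description and on the p=q=1 special cases, the generalized norms could read as follows (a sketch; the paper's exact notation and normalization may differ, and the overline/underline symbols here are illustrative):

```latex
% Overlapped Schatten S_p/q-norm: penalize every mode-k unfolding W_{(k)}:
\[
  \|\mathcal{W}\|_{\underline{S}_{p}/q}
  = \Bigl( \sum_{k=1}^{K} \|W_{(k)}\|_{S_p}^{q} \Bigr)^{1/q}.
\]
% Latent Schatten S_p/q-norm: optimize over decompositions
% W = W^{(1)} + ... + W^{(K)}, penalizing each component along its own mode:
\[
  \|\mathcal{W}\|_{\overline{S}_{p}/q}
  = \inf_{\mathcal{W}^{(1)}+\cdots+\mathcal{W}^{(K)}=\mathcal{W}}
    \Bigl( \sum_{k=1}^{K} \|W^{(k)}_{(k)}\|_{S_p}^{q} \Bigr)^{1/q}.
\]
% The duality result pairs the two families: the dual of the overlapped
% S_p/q-norm is the latent S_{p^*}/q^*-norm with
% 1/p + 1/p^* = 1/q + 1/q^* = 1, which is why the
% (p, q) = (\infty, \infty) case drives the analysis of the p = q = 1
% regularizers.
```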

Other suggestions are reflected in the latest version of our manuscript.

To Assigned_Reviewer_8:
The statement of Theorem 2 is correct because each component is regularized to be low rank along its own mode. Since being low-rank (along a specific mode) is not a permutation-invariant property, there is no ambiguity of permutation. Nevertheless, we need the incoherence assumption (10), since if a component happens to be simultaneously low-rank along different modes, it makes the decomposition ambiguous.

The incoherence assumption is not specific to the sparse+low-rank decomposition. In both our model and the sparse+low-rank decomposition, the signal is decomposed into different parts that obey different types of sparsity or low-rank-ness. In fact, the assumption (10) controls the degree of redundancy; a simultaneously sparse and low-rank component is bad for sparse+low-rank decomposition, and a component that is simultaneously low-rank along different modes is bad for our model.

We have added the derivation of (13) and proofs for Lemmas 4 and 5 in the appendix.

We will add more references to HOSVD and related methods.

We agree that multilinear rank is a more common terminology. This will be fixed in the final version.

Regarding the computational complexity, all operations except the SVD require O(n^K) time (this is unavoidable because all entries are observed). The cost of the truncated SVD is O(rn^K) using Mark Tygert's randomized SVD algorithm. The number of iterations is usually on the order of hundreds and does not depend on the size of the problem.
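To illustrate why the truncated SVD step is cheap, here is a minimal sketch of a Gaussian-sketch randomized SVD in the Halko/Martinsson/Tygert style (the function name and the oversampling parameter are our own, not from the paper):

```python
import numpy as np

def randomized_svd(A, r, oversample=10, seed=0):
    """Rank-r truncated SVD of an m x n matrix via Gaussian sketching.

    Cost is O(r m n) (sketch + QR + small SVD), versus O(m n min(m, n))
    for a full SVD -- the source of the O(r n^K) per-iteration cost
    quoted in the rebuttal when A is a mode-k unfolding.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Sketch the column space with a Gaussian test matrix.
    Omega = rng.standard_normal((n, r + oversample))
    Q, _ = np.linalg.qr(A @ Omega)      # orthonormal range basis, O(r m n)
    # Exact SVD of the small projected matrix B = Q^T A.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub[:, :r], s[:r], Vt[:r, :]
```

For a matrix that is exactly rank r, the sketch captures the range almost surely and the factorization reproduces the matrix up to floating-point error; for approximately low-rank inputs the oversampling parameter controls the accuracy.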

\bar{r}_k refers to the mode-k rank of the k-th component W^{(k)} and \underline{r}_k refers to the mode-k rank of the sum \sum_{k}W^{(k)}; thus they are different.

Other notational suggestions will be reflected in the final version.

To Assigned_Reviewer_9:
In the experiments, the mode-k rank was chosen uniformly at random from {1, ..., n_k} for k=1,...,K. We generated 20 random instances for each tensor size (50x50x20 or 80x80x40). We added Gaussian noise to these instances and averaged the estimation error (over 10 random realizations of the noise) for each instance, which corresponds to a single data point in Figure 2. This will be explained more clearly in the final version.
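The instance generation described above can be sketched as follows (a minimal illustration under the stated protocol; the function names, the random Tucker construction, and the noise level sigma are hypothetical, not the paper's actual experiment code):

```python
import numpy as np

def random_low_rank_tensor(shape, ranks, rng):
    """Draw a tensor with multilinear rank (r_1, ..., r_K): a Gaussian
    Tucker core multiplied along each mode by an n_k x r_k Gaussian factor."""
    T = rng.standard_normal(ranks)  # Tucker core
    for k, (n_k, r_k) in enumerate(zip(shape, ranks)):
        U = rng.standard_normal((n_k, r_k))
        # Mode-k product: contract axis k of T with the columns of U.
        T = np.moveaxis(np.tensordot(U, T, axes=(1, k)), 0, k)
    return T

def noisy_instance(shape, rng, sigma=0.1):
    """One instance: mode-k ranks uniform in {1, ..., n_k}, plus Gaussian noise."""
    ranks = tuple(int(rng.integers(1, n_k + 1)) for n_k in shape)
    W = random_low_rank_tensor(shape, ranks, rng)
    return W, W + sigma * rng.standard_normal(shape), ranks
```

Averaging the estimation error of each solver over several draws of the noise term, for each of the 20 instances per size, yields one data point per instance as in Figure 2.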

Group regularization would be an interesting future direction.