Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Relevant references to *linear* numerical optimization papers/textbooks that employ gradient descent to compute orthonormal matrices for the matrix SVD (which this method parallels and generalizes) are not given, or at least not clearly indicated. Its main claimed advantage over other scalable tensor decompositions is that its algorithm can be extended to handle common constraints, such as non-negativity, but the paper does not specify which other work this advantage is measured against. This is a complete piece of work, but it is unclear why this paper is submitted to NIPS rather than to a numerical optimization conference/journal.
Summary: The paper proposes two algorithms for scalable Tucker decomposition based on coordinate gradient descent and sequential processing of data chunks. A convergence analysis is also provided. The experimental results show that the proposed method can decompose bigger tensors than its competitors without compromising efficiency.

The strengths are:
1. The proposed scalable algorithms for Tucker decomposition could be useful in large-scale real problems.
2. Complete theoretical results on convergence are provided.
3. The paper is clearly written.

The main drawback of the paper is that the experimental results are too simple and not quite convincing. More experimental settings should be tested and a deeper analysis of the results should be provided. One question: in the experiments, will the method still work if the dimension M exceeds a certain number, such as 10,000?
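For context on the memory issue that sequential chunk processing addresses, here is a minimal NumPy sketch (my own illustration, not the paper's algorithm; all sizes and variable names are toy choices): projecting a tensor onto given mode-2 and mode-3 factors can be done slice by slice along mode 1, avoiding the large Kronecker product that a naive computation on the unfolded tensor materializes.

```python
import numpy as np

rng = np.random.default_rng(0)
I1, I2, I3 = 50, 40, 30        # toy tensor dimensions
R2, R3 = 5, 4                  # target ranks on modes 2 and 3

X = rng.standard_normal((I1, I2, I3))   # stand-in for a large dense tensor
U2 = rng.standard_normal((I2, R2))      # mode-2 factor (assumed given)
U3 = rng.standard_normal((I3, R3))      # mode-3 factor (assumed given)

# Naive route: materialize kron(U3, U2) of size (I2*I3) x (R2*R3),
# which blows up when I2 and I3 are large.
K = np.kron(U3, U2)
X1 = X.reshape(I1, I2 * I3, order='F')  # mode-1 unfolding (Fortran order)
Y_naive = X1 @ K

# Chunked route: process one mode-1 slice at a time; peak memory per
# step is a single (I2, I3) slice plus the small (R2, R3) projection.
Y_chunked = np.empty((I1, R2 * R3))
for i in range(I1):
    proj = U2.T @ X[i] @ U3             # never forms the Kronecker product
    Y_chunked[i] = proj.reshape(-1, order='F')

assert np.allclose(Y_naive, Y_chunked)
```

The loop could equally read slices from disk one at a time, which is the kind of sequential sub-tensor processing the reviews describe.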
I have read the authors' rebuttal; I still believe that the experiments are not convincing enough and that a paper claiming to beat the state of the art in scalable tensor decompositions should contain a more thorough and clearer experimental setup (larger-scale experiments, fewer ambiguities about setup decisions).

---

The authors propose a scalable Tucker tensor decomposition and a streaming extension of this method. They demonstrate how the memory blow-up occurring during the Tucker decomposition can be avoided by formulating the derivations so that sub-tensor partitions of the original input tensor can be processed sequentially. The authors provide convergence guarantees of their algorithm to a point where the gradient is equal to zero. Although the approach is sensible and backed by theoretical analysis, I find the experiments not convincing enough for a work in which improved scalability is regarded as the main advancement. I first list my comments regarding the experiments and provide additional comments below:

- The authors choose to “arbitrarily set the size of the Movielens and Enron tensors to M × M × 200 and M × M × 610”. It is unclear why they have chosen to sample the original tensors (e.g., take the 610 most frequent words). Is there an issue with the algorithm or the proposed method if all 3 modes of a 3-order tensor grow simultaneously to relatively large sizes?
- Both of the real datasets used seem to be sparse; if that is the case, have the authors tried to exploit sparsity in the input data and the intermediate computations? If all computations target dense data while the input is sparse, the experimental results may not be representative of the full potential of the approaches under comparison. There are several approaches targeting sparse input datasets which the authors have not compared against, e.g.:
  - Oh, Sejoon, et al. "Scalable Tucker factorization for sparse tensors - algorithms and discoveries." 2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE, 2018.
  - Smith, Shaden, and George Karypis. "Accelerating the Tucker decomposition with compressed sparse tensors." European Conference on Parallel Processing. Springer, Cham, 2017.
  I was left wondering how such approaches would perform on the input data considered.
- Since scalability and memory blow-up issues are examined (as the central topic of the paper), I would expect the experiments to be conducted on a node with much larger memory capacity (the authors report that they used a single node with 32 GB of memory). Overall, the experimental setup needs a much more thorough description.

Motivation / writing issues:
- Poor motivation of the problem in the Introduction; I would suggest improving the description of why scalable static Tucker decomposition is an important problem (ideally, connecting the discussion with the experiments conducted and their corresponding application domains).
- The authors mention that “From a streaming tensor decomposition point of view, our Singleshot extension is competitive with its competitor.” as part of the main contributions, which is confusing. If the streaming extension of the proposed method is merely competitive with, and not superior to, the baselines, then it should not be stated as a contribution.
- Section 3.1 needs to be more specific about the challenges raised by the Tucker memory blow-up problem; in the manuscript’s current version, it is not clear to the reader where the memory blow-up occurs and why it is a challenge.
- How is the user expected to choose the fixed mode, which is supposed to be predefined (as reported in Section 3.1)? Some guidance should be provided for such choices.
Related work:
- The authors need to be more specific with respect to the limitations of prior work; the statement “One major limitation related to these algorithms is their lack of genericness (i.e. they cannot be extended to incorporate some constraints such as non-negativity)” is vague and needs to be made precise with respect to each of the prior works referred to.
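To make the genericness point concrete: the standard way a gradient-based scheme absorbs a constraint like non-negativity is a projected-gradient step, where each factor update is followed by projection onto the feasible set. Below is a minimal toy sketch of that mechanism (my own illustration, not the paper's update rule; the least-squares subproblem and all names are hypothetical stand-ins for a per-factor update).

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy subproblem: min_{U >= 0} (1/2) * ||A - U @ V||_F^2,
# the shape of problem a non-negative factor update reduces to.
m, r, n = 30, 4, 20
U_true = rng.random((m, r))          # non-negative ground truth
V = rng.random((r, n))               # fixed factor
A = U_true @ V                       # exactly factorizable data

U = rng.random((m, r))               # non-negative initialization
L = np.linalg.norm(V @ V.T, 2)      # Lipschitz constant of the gradient
step = 1.0 / L
for _ in range(1000):
    grad = (U @ V - A) @ V.T         # gradient of (1/2)||A - UV||_F^2 in U
    U = np.maximum(U - step * grad, 0.0)   # projection onto U >= 0

rel_err = np.linalg.norm(A - U @ V) / np.linalg.norm(A)
assert np.all(U >= 0) and rel_err < 1e-2
```

An unconstrained gradient scheme needs only the `np.maximum` line changed to support this constraint, which is presumably the flexibility the authors claim; the review's request is that this be contrasted explicitly with what each prior algorithm can and cannot do.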