Paper ID: 407
Title: SPALS: Fast Alternating Least Squares via Implicit Leverage Scores Sampling
This work explores the connection between sampling-based regression and low-rank tensor factorization, and presents a new method that relies on a simple bound on the leverage scores of the Khatri-Rao product (KRP) matrix. The algorithm and theory are all based on well-known results in sampling-based matrix approximation, and the analysis is quite straightforward, so the main novelty is in putting the pieces together. That said, the work is technically solid, and the end result does appear to be an improvement over the current state of the art in practice (albeit a somewhat incremental one).
The paper has several typos, grammatical mistakes, and awkwardly constructed sentences, but the ideas themselves are presented well. Theorems 3.2 and 3.3: these are nice observations, though also fairly simple results. Line 174: "The estimation is tight" seems a bit misleading; I would instead say the estimation is tight up to a factor of R. The table between lines 295 and 296 is not labeled (though from context it is clear that it is Table 1). Also, in panel (a) of this table, it is not immediately clear that "sketch" corresponds to the method in [37].
2-Confident (read it all; understood it all reasonably well)
This submission is concerned with approximate low-rank tensor approximation. Given a, say, 3-dimensional tensor T and a target rank R, the goal is to find families of vectors a_r, b_r, and c_r for r = 1, ..., R such that T is well approximated by sum_{r=1}^R a_r * b_r * c_r, where * denotes the outer product. Henceforth A is the matrix with the a_r as columns, etc. Error in the approximation is measured using the Frobenius norm. The submission focuses on one algorithm to solve this problem, namely alternating least squares (ALS), and discusses how to speed it up using leverage score sampling. Specifically, ALS in this context fixes 2 out of 3 of A, B, C, then tries to minimize the Frobenius norm error by optimizing the third. It then loops over these 3 families as the variable to optimize. The starting point is that once 2 of the 3 matrices are fixed, the optimization problem is equivalent to a traditional least squares problem in which the design matrix X has n^2 rows, and standard results in the randomized linear algebra literature thus imply that it suffices to sample a relatively small number of rows of X nonuniformly according to its statistical leverage scores. Unfortunately, computing the statistical leverage scores of X is expensive, since it has n^2 rows. The main observation of this paper is that for ALS, X has special structure (it is the "Khatri-Rao product" of the two matrices that were fixed), and a lemma is proven in the submission showing that one can upper bound the leverage scores of X by simple functions of the leverage scores of the two matrices that were fixed. This overall leads to a faster algorithm for computing upper bounds on the leverage scores of X, which then translates into a faster ALS variant at the end of the day.
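For concreteness, here is a minimal NumPy sketch of the observation as I understand it: the leverage score of row (j, k) of a Khatri-Rao product is bounded by the product of the corresponding leverage scores of the two factors, and because that bound factorizes, rows can be sampled without ever forming the matrix with n^2 rows. The dimensions, variable names, and toy data below are mine, not the paper's.

```python
import numpy as np

def leverage_scores(M):
    """Row leverage scores of M via a thin QR factorization: tau_i = ||e_i^T Q||^2."""
    Q, _ = np.linalg.qr(M)
    return np.sum(Q ** 2, axis=1)

# Toy setup (hypothetical sizes): B (J x R) and C (K x R) are the two fixed
# factors; their Khatri-Rao product has J*K rows, which we never materialize.
rng = np.random.default_rng(0)
J, K, R = 400, 400, 10
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

tau_B, tau_C = leverage_scores(B), leverage_scores(C)

# Upper bound on the leverage score of KRP row (j, k): tau_B[j] * tau_C[k].
# Each set of scores sums to the column rank (here R), so the bounds sum to at
# most R^2, and sampling proportionally to the bound factorizes into two
# independent 1-D samples over the rows of B and of C.
m = 2000
js = rng.choice(J, size=m, p=tau_B / tau_B.sum())
ks = rng.choice(K, size=m, p=tau_C / tau_C.sum())
probs = (tau_B[js] / tau_B.sum()) * (tau_C[ks] / tau_C.sum())
```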
As the "LS" of ALS suggests, each iteration of ALS for tensor decomposition amounts to solving a least squares regression problem. The main contribution of this submission is then to observe that good upper bounds on the leverage scores of the underlying matrix can be quickly approximated due to special structure of the matrix, namely Theorem 3.2 of the submission. This is the only, albeit important, novel observation of this paper. Once Theorem 3.2 is obtained, filling in the other details is standard. From this one observation, they are able to compare quite favorably with [37] (see Figure (a) on page 8). The plus side is clear: the ability to compete well with the state-of-the-art based on quite a simple observation. The two main downsides I see are: (1) they seem to be more comparable with [37] in empirical performance as opposed to hands down beating it, and (2) in figure (b) I didn't see any explanation of why again [37] wasn't compared with. I was also confused as to why SPALS(alpha) figure numbers weren't monotonic with alpha in figure (a) on page 8. Minor comments: * Some typos: "nearly optimality" --> "near optimality"; "while approximates the" --> "while approximating the"; "for large tensor" --> "for large tensors"; "each rank-1 components are" --> "each rank-1 component is"; "It is a challenging tasks" --> "It is a challenging task"; --> "one of the most powerful tool" --> "one of the most powerful tools"; "we provide the efficient" --> "we provide an efficient"; "In the remaining of this section" --> "In the remainder of this section"; "toolset for estimating the statistical leverage score" --> "toolset for estimating the statistical leverage scores"; "the statistical leverage scores of the i-th row" --> "the statistical leverage score of the i-th row"; "score of certain row" --> "score of a certain row"; "first inequality is because of" --> "first inequality is because"; "equals to $R$" --> "equals $R$"; "an rank-1" --> "a rank-1"; "In the remaining of this section" --> "In the remainder of this section"; "separate the calculation to two parts" --> "separate the calculation into two parts"; "evaluating former expression" --> "evaluating the former expression"; "second term is spare" --> "second term is sparse"; "by the leverage score of the design" --> "by the leverage scores of the design"; "not directly utilize" --> "not directly utilizing"; "requires an one-time" --> "requires a one-time"; "routines fairly recent" --> "routines was fairly recent"; "moves the non-zeros around" --> "moves the non-zeroes around"; "considered as the data" --> "considered as data"; "piecemeal invocation" --> "piecemeal invocations"; "each rank-1 components" --> "each rank-1 component" * On page 2, line 80 "(i,j) also represents the index i+Ij between 1 and IJ"; I found this confusing. What does it mean? * On page 4, lines 128 and 129: "Its optimality in solving ... of linear regression." It cannot really be explained by this; it's not so trivial. The upper bound requires using some matrix concentration, like matrix Bernstein or matrix Chernoff, or the non-commutative Khintchine inequality. * On page 4, line 133: "O(r log n)"; why did n enter the picture? I don't know any results of this form where n enters the picture. * I found Figure (a) on page 8 to be confusing. First, how is "error" measured? Also, why is the SPALS(alpha) error not monotonic as a function of alpha? What are the units on the time and error? * On page 8, does n = 1000 mean the tensor is 1000 x 1000 x 1000? 
(All 3 dimensions are n, right?)
3-Expert (read the paper in detail, know the area, quite certain of my opinion)
The paper proposes a way to speed up tensor CP decomposition via ALS by approximately solving each least squares problem via leverage score sampling. The authors exploit the structure of the design matrix in each least squares problem (which is a Khatri-Rao product of the factors for the other modes of the tensor) to efficiently sample rows according to (a bound on) their leverage scores and solve each least squares problem without having to form the KR product. The paper provides theoretical results on the number of samples necessary to solve the least squares problem with an additive approximation guarantee. Since errors do not accumulate over the ALS iterations, the approximation requirements are mild, and in the experimental results the technique is shown to provide speedups of up to 30x with little or no loss of accuracy.
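To complement this summary, here is a rough sketch, entirely my own, of what one sampled least squares update might look like once row indices (js, ks) of the Khatri-Rao product and their sampling probabilities (probs) have been drawn, e.g. via the leverage-score bound sketched after the first review. The mode-1 unfolding convention and the helper name are assumptions, not taken from the paper.

```python
import numpy as np

def sampled_als_update(T, B, C, js, ks, probs):
    """Sketched ALS update of factor A for a 3-way tensor T (I x J x K) with
    B (J x R) and C (K x R) held fixed. (js, ks) index sampled rows of the
    Khatri-Rao product of C and B; probs are their sampling probabilities."""
    m = len(js)
    w = 1.0 / np.sqrt(m * probs)              # importance-sampling rescaling
    X = (B[js, :] * C[ks, :]) * w[:, None]    # sampled, rescaled design rows (m x R)
    Y = T[:, js, ks].T * w[:, None]           # matching mode-1 fibers        (m x I)
    A_new, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return A_new.T                            # updated factor A, of size I x R
```

The only point of the sketch is that each sampled row of the design matrix is an elementwise product of a single row of B with a single row of C, so neither the full KRP nor the full matricized tensor ever has to be formed.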
The ideas and contributions of this paper are very nice, and I hope to see more such speedups in other applications since ALS is such a general workhorse. Sampling proportionally to the leverage scores is not terribly new, but I think this paper innovates well on the application side to meet the NIPS bar. The presentation could be improved substantially by making sure symbols are always defined before being used (such as R, n, etc.) and by fixing the typos listed below. The assertion in line 160 is not immediately obvious and needs a proof or a citation.
Typos:
Line 5: "the the Khatri-Rao" -> "the Khatri-Rao"
Line 13: "significantly speedups" -> "significant speedups"
Line 22: "from the tensor data" -> "from tensor data"
Line 25: "has became" -> "has become"
Line 26: "tensor analytic" -> "tensor analytics"
Line 52: "SPASL" -> "SPALS"
Line 62: "approximates" -> "approximating"
Line 149: "\tau_{i,j}^{A\otimes B}" -> "\tau_{i,j}^{A \bigodot B}"
Line 163: "matrix" -> "matrices"
Line 202: "spare" -> "sparse"
Line 202: "the the" -> "the"
2-Confident (read it all; understood it all reasonably well)
This paper proposes a new low-rank tensor CP decomposition method, called sparse alternating least squares (SPALS). SPALS accelerates the classical ALS algorithm by sampling. Furthermore, the authors provide some useful properties of the leverage scores of the matrix Khatri-Rao product. The experimental results validate that the algorithm is effective and efficient.
I have some comments, as follows.
1. The theoretical analysis of this paper focuses on approximating the solution of problem (1), which is the main step of ALS. Is there any analysis that connects the sampling method to the global convergence behavior of ALS?
2. I have some questions about the experiments.
a) How are the initial values of the parameters chosen? Are the results sensitive to them?
b) What is the stopping criterion in practice?
c) Why do the results in Table 1b not include the sketching-based method?
Thanks!
2-Confident (read it all; understood it all reasonably well)
Recently, randomized sampling and sketching methods have become popular for solving large-scale overdetermined least squares regression problems. The paper proposes a leverage-score-based sampling method for the alternating least squares minimization problems that appear as intermediate steps during the computation of the low-rank tensor CANDECOMP/PARAFAC (CP) decomposition. Tensor CP decomposition represents the tensor as a sum of rank-1 tensors, and these rank-1 components can be collected into low-rank factor matrices. The low-rank factor matrices can be computed using the alternating least squares (ALS) approach, which involves solving LS problems with the tensor matricization and the Khatri-Rao product (KRP) of the remaining factor matrices. The paper proposes leverage-score-based sampling of the matricization and the KRP prior to solving the least squares regression problem. The key contribution of the paper (in my view) is in computing the leverage scores of the rows of the KRPs. The paper proposes to replace the leverage scores of the KRP (which are expensive to compute) with an upper bound, which is shown to be the product of the scores of the constituent matrices. Experimental results illustrating the speedup of their algorithm are also provided.
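For the record, here is my reading of the bound being summarized, in notation of my own choosing (the explicit leverage-score formula is the standard one and may differ cosmetically from the paper's): for factor matrices A in R^{I x R} and B in R^{J x R} with rows a_i^T and b_j^T,

```latex
\tau^{A}_{i} \;=\; a_i^{\top}\,(A^{\top}A)^{\dagger}\,a_i ,
\qquad
\tau^{A \odot B}_{(i,j)} \;\le\; \tau^{A}_{i}\,\tau^{B}_{j} .
```

The point is that the scores of the IJ x R Khatri-Rao product, which would be expensive to compute directly, are replaced by products of quantities obtainable from the two small factor matrices alone.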
The paper proposes a fast method, based on leverage score sampling, for approximately solving the alternating least squares problems that appear during the computation of the tensor CP decomposition. The speedup in computational time demonstrated in the experiments is impressive. However, I have certain concerns. Following are my questions, concerns, and suggestions regarding the paper.
Regarding the theory:
1. The paper does not really justify why leverage-score-based sampling is proposed instead of other randomized sampling/sketching approaches, which do not require the computation of leverage scores. From a theoretical point of view, according to Theorem 4.1 of the paper, the number of samples required for the proposed method is m = \Theta(R^2 \log n / \epsilon^2), which seems to be worse than the number of samples required by other randomized sampling methods that do not require leverage score computation, such as the one proposed in reference [15] (Drineas et al., 2011) based on the SRHT (subsampled randomized Hadamard transform), or the ones discussed in the (Woodruff, 2014) monograph, whose sampling complexities seem to be m = \Theta(R \log(n) / \epsilon^2). The advantages of the proposed method over other sampling methods must be discussed.
2. In Theorem 4.1, the number of samples is set to m = \Theta(R^2 \log n / \epsilon^2). How was this computed? There seems to be no proof for this, neither in the main paper nor in the appendix.
3. What are the matrices C and D in Lemma E.2 and Corollary E.3? How does sampling come into the picture? I suppose C = AS^T and D = SB, where S is the sampling matrix. This must be mentioned.
4. I am unable to see how the quantity \rho(x_B) in Lemma E.1 is equivalent to \|\mathcal{T} - [A,B,C]\|^2 in Theorem 4.1. Please provide this connection in the proof.
5. The runtime of each iteration of the algorithm is claimed to be \tilde{O}(nR^3). Please provide details regarding this (they are entirely missing).
Style and presentation:
6. The notation used is highly inconsistent. Both 'r' and 'R' are used for the size of the factor matrices: in lines 124, 181, 210, and 268, 'r' is used, and 'R' is used elsewhere. Similarly, for the number of columns of the design matrix X (which should be R, the same as for A and B), 'p' is used in line 121 and 'r' elsewhere.
7. Please mention what \tilde{O} time complexity means. This might not be common knowledge.
8. Incomplete or missing statements: i) line 46, "of be" - something is missing; ii) line 164, "i-th of", and line 165 - something is missing; iii) line 444, Lemma E.1, U is not defined; iv) Lemmas E.2 and E.3, C and D are not defined.
9. Minor typos: i) line 5, "the the"; ii) line 52, "SPASL"; iii) line 108, "a challenging tasks"; iv) line 185, "an rank"; v) line 202, "second term is spare", and "the the".
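On points 1 and 2 above, one standard accounting that would produce the R^2 factor (this is my guess at the intended argument, not a substitute for a proof in the paper): the leverage scores of an n x R matrix sum to its rank, so the product upper bounds sum to at most R^2,

```latex
\sum_{i,j} \tau^{A}_{i}\,\tau^{B}_{j}
  \;=\; \Big(\sum_{i}\tau^{A}_{i}\Big)\Big(\sum_{j}\tau^{B}_{j}\Big)
  \;=\; \operatorname{rank}(A)\cdot\operatorname{rank}(B)
  \;\le\; R^{2} ,
```

and sampling with overestimates of the leverage scores typically requires a number of rows proportional to the sum of the overestimates times a logarithmic factor, which is consistent with the m = \Theta(R^2 \log n / \epsilon^2) quoted above.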
2-Confident (read it all; understood it all reasonably well)
This work makes use of the spectral structure of the matrix Khatri-Rao product to design efficient access to its statistical leverage scores, so that the rows of the KRP can be sampled in a nearly optimal manner. The experimental results demonstrate the performance of the proposed method.
This work extends leverage score computation from matrix to tensor data. My concerns are as follows.
1. It is not clear how the number of sampled rows (\alpha r^2 \log^2 n) in line 294 is obtained.
2. There is no caption for Table 1 on p. 8.
3. Is there any reasonable explanation for why the performance of SPALS becomes worse as the sampling size increases in Table (a) at nsr = 0.1?
4. In the introduction, the authors mention that the proposed method could easily be applied to other tensor-related applications such as SGD or HOSVD, but there are no experiments supporting this point.
5. Grammar error: in Sec. 7, "Its worth noting that ...".
3-Expert (read the paper in detail, know the area, quite certain of my opinion)