Paper ID: 5348

Title: Average Case Column Subset Selection for Entrywise $\ell_1$-Norm Loss

This paper studies low-rank matrix approximation with respect to the entrywise L1 norm. The main result is a theoretical guarantee for the setting where an arbitrary rank-k matrix is additionally perturbed by a random matrix. The result shows how to obtain a rank-$O(k\log n + \mathrm{poly}(k/\epsilon))$ matrix with error less than the entrywise L1 norm of the random matrix. The random matrix is subject to a certain moment condition, which is also shown to be necessary. The paper further provides a heuristic method inspired by the theoretical machinery.

Strengths:
*) A strong upper bound that improves our understanding of column subset selection for L1 loss, along with hardness results showing that the assumptions in the upper bound are needed.
*) The proposed method is conceptually simple, and the paper indeed evaluates a heuristic based on it.

Weaknesses:
*) The empirical results give only incremental improvement over prior methods on the two real datasets, especially for larger values of k (the rank).
*) There is a discrepancy between the algorithm analyzed and the one evaluated.

More detailed questions:
*) Lines 64-65: what is the order of the polynomial? Wouldn't a high polynomial dependence on \epsilon be undesirable as well?
*) How well does Algorithm 2 apply in the setting of equation (2) (i.e., when computing the best rank-k approximation)? Could the same idea be applied to obtain an improvement in that setting?
*) Algorithm 1, line 6: should S be Q?

Other remarks: the formatting of the equations could be improved.
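To make the setting discussed above concrete, here is a toy sketch of my own (not the paper's Algorithm 1): a rank-k signal perturbed by i.i.d. noise, approximated by a random subset of its own columns. For simplicity the coefficients are fit by least squares as a cheap proxy for the paper's $\ell_1$ regression, and the resulting entrywise L1 loss is compared to $\|\Delta\|_1$.

```python
import numpy as np

# Toy illustration of column subset selection under entrywise l1 loss.
# My own sketch, not the paper's algorithm; least squares stands in for
# the l1 regression step.
rng = np.random.default_rng(0)
n, k = 50, 3
A_star = rng.standard_normal((n, k)) @ rng.standard_normal((k, n))  # rank-k signal
Delta = rng.standard_normal((n, n))                                  # i.i.d. noise
A = A_star + Delta

S = rng.choice(n, size=2 * k, replace=False)   # a random column subset
A_S = A[:, S]
X, *_ = np.linalg.lstsq(A_S, A, rcond=None)    # fit A_S X ~ A (l2 proxy)
err = np.abs(A_S @ X - A).sum()                # entrywise l1 loss ||A_S X - A||_1
print(f"l1 error: {err:.1f}  vs  ||Delta||_1: {np.abs(Delta).sum():.1f}")
```

The paper's guarantee concerns a carefully chosen subset of $O(k\log n + \mathrm{poly}(k/\epsilon))$ columns, not a uniformly random one as above; the sketch only shows the quantities being compared.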

Originality: The results obtained in this work are, to the best of my knowledge, new. The methods followed are also relatively new (especially those using the notions of "tuples" and "cores" to handle the distributional assumptions on the noise).

Quality: Both the statements of the results and the proofs I checked appear technically sound and correct. I am, however, skeptical about the experiments section: the authors discuss and analyze Algorithm 1 for seven pages, and then suddenly implement Algorithm 2, an elementary median heuristic. It is interesting that it beats more sophisticated methods, but I am unfortunately not convinced that Algorithm 2 is linked to the averaging argument in the analysis of Algorithm 1, or that it is sufficiently connected with the rest of the paper. I think the authors should address this issue.

Clarity: For the most part the paper is well written and the ideas are clearly expressed. Some parts, however, are not up to the same standard and read as slightly rushed, e.g., parts of subsections 1.2 and 2.2. In 1.2, the authors fail to explain at the end of page 3 the cardinality of the random sets I_t and how many of them there are (an important step); in 2.2, the argument using Cramer's rule, which is hidden behind the relatively complex definitions, should be explained clearly, as it would significantly help the reader.

Significance: I think the methods followed are significant, but their significance beyond the setting considered is not clear to me. For example, is the averaging argument tied to the \ell_1 loss, or can it generalize to \ell_p norm loss?

This paper introduces an algorithm to obtain a low-rank reconstruction of an $n \times n$ matrix under the L_1 loss, with provable guarantees under mild assumptions on the distribution of the noise.

My main question for the authors relates to their parallel submission #3303, which was flagged as a potential dual submission (allowing reviewers to ask questions as if the other work were already published). My reading of both papers does *not* suggest that they are too closely related to both be accepted. However, I would appreciate the authors providing a comparison of the results of the two contributions.

- Assuming the necessary conditions for both papers are met, Thm. 1.2 of #3303 for the L1 loss states that in O(n^2 + n poly(d log n)) time we can find a subset S of size O(k log n) such that ||A_S X - A||_1 <= O(k log k) ||Delta||_1 with probability 0.99. (My apologies for the inline math.) Am I understanding correctly that the contribution of this paper allows trading off computational time for a better approximation via the epsilon parameter?
- More generally, how do the bounds derived in this paper compare to those that would be obtained when using the L1 loss in #3303 (with all necessary assumptions)? Are there specific scenarios where the bounds from either paper should be preferred?
- You mention in Section 3 that taking the median performs better in practice than taking the average. It might be worth showing this empirically. I would also be curious to know how the compared algorithms performed on average, and the order of magnitude of the standard deviation for all reconstruction methods, as the authors only provide a comparison between the best results of each reconstruction algorithm.

Minor comments:
- I believe the left-hand side of Eq. (3) of the appendix should read \int_0^n rather than \int_0^\infty.

******** Post rebuttal comments ********

I am satisfied with the authors' response regarding the use of the median vs. the mean.
However, if the paper is accepted, I ask the authors to include their experimental results for the average, as it is the method suggested by their theoretical analysis.
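The kind of median-vs-average comparison requested above could look like the following toy sketch (my own illustration, not the paper's experiment): estimate a signal vector from t noisy copies corrupted by heavy-tailed noise (Student-t with 2 degrees of freedom, which has a first moment but infinite variance), and compare the L1 recovery error of the coordinatewise median against that of the coordinatewise mean.

```python
import numpy as np

# Toy median-vs-mean comparison under heavy-tailed noise.
# My own sketch, not the paper's Algorithm 2 or its datasets.
rng = np.random.default_rng(1)
t, n = 99, 1000
signal = rng.standard_normal(n)
noisy = signal + rng.standard_t(df=2, size=(t, n))   # t noisy observations

mean_est = noisy.mean(axis=0)
median_est = np.median(noisy, axis=0)
mean_err = np.abs(mean_est - signal).sum()      # l1 error of the average
median_err = np.abs(median_est - signal).sum()  # l1 error of the median
print(f"mean l1 error: {mean_err:.1f}, median l1 error: {median_err:.1f}")
```

With infinite-variance noise the sample mean is dragged around by tail events while the median stays stable, which matches the reviewers' intuition for why the median heuristic can win in practice.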

This paper studies low-rank approximation of the matrix A + Delta, where Delta consists of i.i.d. noise variables. The paper considers the L_1 cost function and makes only a mild assumption on the moments of the noise. The new algorithm proposed by the authors is a column-selection-type algorithm. Roughly speaking, the authors first argue that only a small subset of columns have large entries (coming from the tail of the noise); an existing algorithm can be used to approximate these columns. For the remaining small entries, the authors argue via the probabilistic method that randomly chosen columns suffice.

I think this is a fairly strong theoretical result. While the general framework may not be entirely new (i.e., identify heavy hitters and use a different strategy to deal with them), there are quite a few interesting technical innovations, e.g., building specialized concentration bounds optimized for their own analysis. I was unable to follow the technical details after Sec 2.2 (see below for more questions).

My major concern is that the success probability is only a constant (0.99). It is not clear to me whether the success probability is with respect to the data or with respect to the random tosses in the algorithm (i.e., whether the success probability can be boosted with more running time). If the success probability is with respect to the data, I would consider this a much weaker result. A "standard requirement" for the failure probability is $1/n^c$ for a sufficiently large constant $c$. Technically, it may not even circumvent the lower bound from [24] as advertised by the authors --- information-theoretic lower bounds are usually of the form "any algorithm fails with constant probability", which is consistent with this kind of result. The most worrying building block is Lemma 2.3 (where a constant failure probability with respect to the data shows up). There, only a Markov inequality is used, so it seems unlikely that the failure probability can be pushed down significantly. This is also consistent with the properties of "heavy-tailed" noise --- proving tail bounds for such noise is usually difficult.

Detailed questions:
1. I am not able to understand the definition of R_{A^*}(S) (line 220). Specifically, Q is not explained in the definition.
2. Averaging reduces noise (around Lemma 2.2). This looks like an important technique, but I am unable to see the high-level intuition. The paper spends considerable effort explaining a simple fact resembling the central limit theorem / law of large numbers, but is then hand-wavy in explaining how this simple fact is used.
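The CLT-like fact behind the "averaging reduces noise" step can be sketched numerically (my own toy example, not the paper's Lemma 2.2): averaging t i.i.d. zero-mean, unit-variance noise columns shrinks the typical entry magnitude by a factor of roughly sqrt(t).

```python
import numpy as np

# Averaging t i.i.d. zero-mean noise columns shrinks the typical
# entry magnitude by ~sqrt(t). Toy sketch, not the paper's proof.
rng = np.random.default_rng(2)
n, t = 100_000, 25
Delta = rng.standard_normal((n, t))           # t noise columns, variance 1

single = np.abs(Delta[:, 0]).mean()           # typical |noise| of one column
averaged = np.abs(Delta.mean(axis=1)).mean()  # typical |noise| after averaging
print(round(single / averaged))               # prints 5, i.e. sqrt(25)
```

This is presumably why averaging several randomly selected columns yields a column with much smaller residual noise, though (as noted in question 2 above) the paper could explain more explicitly how the fact is deployed in the analysis.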