Review for NeurIPS paper: Coresets for Regressions with Panel Data

NeurIPS 2020

Coresets for Regressions with Panel Data

Review 1

Summary and Contributions: The paper deals with coresets (data summary obtained by subsampling and approximating the objective function for all queries) for least squares regression on panel data. In the usual "cross-sectional" setting the data consists of N individuals with d features each. Panel data extends this by introducing a time-series for each individual, consisting of the d features measured at T time steps. Moreover a correlation structure is introduced between the time steps to model dependencies over the time axes. The objective is to minimize the sum over all individuals of the least squares regressions subject to correlations between the time steps. The variables are thus the regression parameters as well as the defining parameters of the covariance structure. The problem is defined in this very general setting but the correlation structure studied is restricted to autocorrelations, resp. the autoregression model of order q, where each time step may depend on the q preceding time steps, and their q weights are the parameters subject to the optimization. The contributions are the first coreset constructions for this generalized regression setting obtained in the well-known in the sensitivity framework. The mere formulation of coresets is a bit more complicated than usual, since the terms in the sum of squares (sum f_i) do not correspond to the point x_i one by one. Instead each f_i may depend on up to q points x_j. The coreset is thus defined by subsampling and reweighting the functions f_i, and all points x_j that occur in the subsampled functions need to be stored. The paper spends a lot of space on making this formal. I think it could be explained much more concisely. The main complexity parameters to bound in teh sensitivity framework are the VC dimension of the functions to sample from and their "sensitivities" which bound their worst case importance for approximating the objective and their sum dominates the number of samples needed. The VC or pseudo dimension is bounded via a generic theorem from the literature by counting the number of parameters and the number of arithmetic operations to calculate the functions f_i. The sensitivities are bounded by a reduction to the sensitivities in the ordinary least squares (OLS) setting, which have well known results. The reduction is fairly straightforward in my opinion: The main challenge is to get rid of the correlations. For a lower bound the contribution of the covariance matrix is lower bounded by the smallest eigenvalue (lambda) times the contribution without covariance (OLS), note that lambda is assumed to be a constant in this paper by definition! The upper bound uses that the autocorrelation structure involves the square of a linear combination of the (non-squared) OLS terms. Now by Cauchy-Schwarz this is bounded by a linear combination of the squared OLS terms. This directly implies that the sensitivity of a function f_i is at most lambda times a linear combination of the corresponding OLS sensitivities, and the total sensitivity is simply the known small (independent of the input size) bound, times the maximal number of terms q. For other correlation structures this might be more challenging but the paper never considers others than autocorrelations. For the experimental data the standard generative model is used involving Gaussian errors from the linear model. It is known that in such a case the leverage scores, i.e. the OLS sensitivities are quite uniform (some paper of michael mahoney, I don't remember which). This might explain why the experimental results are not very convincing, see below. There is a further generalization of regression models to mixtures of linear regressions in the supplement. I didn't dig too much into that but it seems quite similar to existing work on Gaussian mixture models or k-means or alike. I think that the extension to panel regression is valuable and interesting as well as the k-clustered regression setting. On the other hand the results seem to be quite immediate from the literature, and the experiments are not very convincing (see weaknesses). Therefore I think that the current state is below the borderline. However I think that doing experiments on data that possesses less uniform errors would make a better selling point. Also discussing and bounding the sensitivities for more general or complicated covariances could improve the theoretical contributions. I also think that it's a pity to find those definitions for the sensitivity framework filling most of the paper while most of your own contributions appear only in the supplement, especially the k-clustered regression results. *** After Rebuttal: see additional feedback section below!!! ***

Strengths: - An important and natural extension of (ordinary) least squares regression - coreset sizes depend polynomially on 1/eps, d and q but have no dependence on N or T. - a further result on a mixture or k-clustering of least squares regressions (in analogy to coresets for mixtures of Gaussians or k-means)

Weaknesses: - spend too much on explaining the very general definitions of the sensitivity framework and fitting the problem into it. This could be done more concisely and defer the details to the appendix. - fairly straightforward reduction to the ordinary least squares regression case. - experimental results are not convincing, since worst case relative errors are within a factor of 2-3 to uniform subsampling even for small sample sizes, while average errors are often even worse (!!) than uniform subsampling (especially on the real-world data). It seems that the variance is reduced over uniform sampling but not the average behavior.

Correctness: yes all claims seem correct

Clarity: mostly yes, but it should move the sensitivity definitions into the appendix and move the own derivations and results into the main paper. Especially moving the results on k-clustering regression in the main body would improve the body of own contributions in the main paper.

Relation to Prior Work: yes, the generalized setting of panel regression is introduced and the challenges in constructing coresets are described and compared to well known constructions for the ordinary least squares case. further related work is given on coresets for ordinary and generalized regression.

Reproducibility: Yes

Additional Feedback: - in the beginning it is not clear that the parameters of the covariance is due to optimization (if it was fixed the reduction to OLS would be even simpler) - 76: \Omega(N) implies M is unavoidable? -> it should be Omega(M) right? - 79-80 the sentence is weird, missing words - last paragraph on page 2: here already the experimental results seem good compared to the full data but only slightly better than uniform sampling. -131: correlations between individuals -> should be between time steps, right? - 196: is -> in; compute -> computes; decides -> defines - Algo 1: "require ensure" -> "input output" - 252: missing "i" in index for x and y - 269: be more specific on the relevant hardware - have you experimented with larger values of q >1 ? - 287: average over what? - Tables 1 -> Table 1 - 299: add units (seconds) - 445-446: Equation (3) actually refers to the one above Eq (3) - it is weird that you "propose" theorems, then they are referenced from the literature, so others have proposed this before and then there are proofs that simply adapt the known theorems to the setting by direct application. - p. 15 what is T( x , y ) ? - (5) is simply by Cauchy-Schwarz and orthonormality of A, no need to look up the "well conditioned basis" definition from other literature Dear authors, thanks for your additional efforts and responses to our reviews. Your comment on GLSE_k made me read this part again in all details. Overall I decided to support your paper and raise my score significantly. However, I have some conditions: 1) incorporate carefully all points from your rebuttal and fix all issues that we raised, as you have promised. Regarding the new experiments add this reference: https://dl.acm.org/doi/10.5555/2789272.2831141 . It covers t-distributions with different parameters, including the special cases of Gaussian and Cauchy errors (note that the leverage scores are exactly your OLSE sensitivities!). 2) Don't waste so much space on tweaking the definitions of the framework. This can be done in a much more concise way and details can be moved to the appendix. Reference recent surveys on coresets, e.g. https://onlinelibrary.wiley.com/doi/full/10.1002/widm.1335 and https://link.springer.com/article/10.1007/s13218-017-0519-3. 3) Put GLSE_k back into the main paper! Emphasize the novelty of the "2-stage" coreset construction for the nested sums, and also the assumption of M-boundedness. Comment on the necessity of the latter and point out why M-boundedness for small M is natural to assume (e.g. what happens with M if the Matrices A(i) are drawn from normal distributions where the covariance has constant eigenvalues l_max...l_min?) Are there similarities to known results on coresets for mixtures of Gaussians regarding such nice conditioning assumptions? 4) (if time permits) can you add any experiments on GLSE_k? can you compare to mini-batchSGD? are there settings where coresets or SGD have advantages/deficiencies compared to one another? 5) fix some more minor issues in the appendix: - 713: the 2-stage coreset uses eps/2 on each stage. How do the errors propagate? I am missing calculations, e.g. (1+eps/2)^2 < (1+eps) does not hold... - 756: missing superscripts for the betas - 757: missing sum in denominator - the lower bound is akin to [39] which should be referenced in this context

Review 2

Summary and Contributions: The panel data problem is a generalization of linear regression and aims to minimize sum_i ||X(A_i x-b)||^2 over a matrix X and a vector b, where the sum is over large set of matrices A_1...A_n. Coresets for special cases are suggested: - For OLSE, coreset of size min{d/eps^2,d^2} - For GLSE, coreset of size d^2/eps^2 where d is a constant. - Generalization and Lower bounds for GLSE_k Justifications: - This is a very fundamental problem with many applications and generalizations in other fields. - There are many types of coresets in this paper, not just one - The theory is not trivial and it seems to contain many new ideas.

Strengths: The paper solves special versions of the problem min sum_i ||A_ix-b_i||^2 under different constraints. There are many results as stated above, including lower bounds, and the algorithms are interesting. The techniques are novel although not too surprising. I mainly liked the fact that there are few types of fundamental new coresets in the same paper. The paper has a potential to have high impact.

Weaknesses: - No open code as in related papers. - More experimental results are needed.

Correctness: I did not read all the proofs in the appendix, but the claims make sense.

Clarity: Yes, I liked to read it.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The paper introduces coresets for the Generalized Least Square Estimators (GLSE) problem, and provides several algorithms for coresets for GLSE, supported by experiments. Summary: GLSE is a variant of multiobjective regression problem where observations are correlated (over time). Informally, coreset for a data set D with respect to a class of functions F is a small subset of D such that any function from F evaluates approximately equally on D and on S. OLSE is a variant of GLSE where the correlation matrix is identity. OLSE can be reduced to a single objective regression problem and (weak) coresets for OLSE problems have been studied in [27]. This paper makes direct generalizations of results from [27] and [9] to provide approximate and exact coresets for OLSE. Further the paper employs the well-known Feldman-Langberg (FL) framework to obtain a coreset for GLSE. It seems that FL is directly applicable and the main result in this paper is to bound the two parameters of FL framework: (1) total sensitivity and (2) pseudo-dimension of GLSE.

Strengths: Coresets for GLSE is a nice and novel result that extend results in [7] and [27] and, as the authors explain, can lead to future work in multi-objective regression such as logistic regression.

Weaknesses: The basic approaches in the paper seem to be fairly standard, to the best of my understanding. Both the algorithms and the bounds proofs seem quite similar to known techniques used in previous results. It might be the case that I missed some technical novelty. If there are novel ideas in bounding the pseudo-dimension or total sensitivity, the authors should carefully explain them in the introduction. The current explanation such as “Hence, to apply the FL framework, we need to upper bound the pseudo dimension and construct a sensitivity function.” and “The main idea is to apply the prior results [2, 43] “ strengthen my impression that the technical contribution is somewhat incremental.

Correctness: Due to the lack of time, I did not check the correctness of all technical claims but the overall approach makes sense to me.

Clarity: I think that the clarity can be improved by emphasizing the technical novelty in bounding the pseudo-dimension and total sensitivity. Also, in appendix B it is difficult to distinguish prior work from novel claims. Overall the paper is not easy to read.

Relation to Prior Work: see my comments above

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: In contrast to usual time series data, where we observe features for one individual (unit) across time, panel data (cross-sectional time series data) describes of a time series for several individuals (units) in parallel (i.e. same set of features recorded for each invidiual separately over time). In the regression problem on panel data we control variables that change over time but not across units: We seek for $min_{\beta, \Omega}\ (y_i-X_i\beta)^T\Omega^{-1}(y_i-X_i\beta)$ where $X_i$ is the time series for the i-th unit, and where the parameter $\Omega \in \R^{T \times T}$ is constraint to have eigenvalue at most 1. Two versions considered in the paper are: - In OLS (ordinary least squares), $\Omega = I_d$. - In GLS (generalised least sqares), $\Omega$ is defined by an autocorrelation function dependin on some parameter $\rho$. - Additionally, the paper mentions the GLS_k problem where entities are divided into k groups and (only) each group shares parameters -- but this is only described in the supplementary material. The idea of coresets/sketches is that, for a fixed problem (e.g. panelregression, clustering), to construct a smaller problem instance that behaves similar to the origial given problem instance. E.g. for clustering, coresets are usually subset of the given data points. Goal of the paper is to create a coreset that is independent in the number of entities (N) as well as time steps (T). For this, the paper shows how to apply a framework for coresets that was introduced by Langberg and Feldman, and previous results on near optimal coresets for least-squares regression by Boutsidis et al.. Hence the main step consists in deriving the so called sensitivity function. This yields an algorithm that constructs an eps-coreset of a size hat is independent on N and T, but depends on the success probability, eps, and an upper bound of the parameter $\rho$ (of the autocorrelation function), and runs in time linear in $N,T,q$ and quadratic in $d$. Additionally, the paper conducts some experiments with synthetic as well as real-world data.

Strengths: The paper presents a formal derivation (supplementary) of the sensitivity function for GLSE, which is not trivial. It shows how the abstract coreset framework of langberg and feldman can be applied to a new problem. It presents the first coreset construction for this problem with these guarantees (size independent of N and T). The paper tries to evaluate the performance of the methods on a real world data set.

Weaknesses: The way the GLSE problem is fitted into the framework by Langberg and Feldman is pretty straightforward. Yet, the paper e.g. introduces the generalisation of an query space, which does not add any real value in the main paper (~1 page). The necessary notation could as well have been introduced in Definitions 2.2 and 2.3 right away (which is what is basically done in ll.169 then - in addition to the original definition plus the coreset/query space definitions). The paper states that "When dealing with massive datasets, coresets have emerged as a valuable tool froma computational, storage and privacy perspective, as one needs to work with and share much smaller datasets." (ll. 3). But the only thing that can be done with these coresets is to solve the specific GLS problem (i.e. specific $\rho$, $q$,.. and all wrt. X) that the coreset was computed for. If one wants to e.g. test what happens with a different $q$, the coreset would need to be computed again. The experimental resusults are rather weak: With synthetic data, one can achieve lower error via sampling approx 4k points than with the proposed method using approx 800 points - and both approaches would take about the same time (6s vs 5s). For the real-world data, the uniform coreset is on avg. actually better (except for 819 sample size), which comparable standard deviations. The experimental section highlights the maximum empirical error in the comparison (boxplots would be more appropriate). Especially for the real-world data where on avg. the uniform method is outperforming the proposed method. The errors depend on how variables are scaled. A predefined significance level would help to interpret the results. (In the context of speeding up computations, one should keep in mind that the computational budget (in practice) might allow to compute uniform samples and to solve the problem on the uniform sample repeatedly (to achieve same results, at a small cost even without a clever algorithm)). (Also the experimental evaluation just aims to minimize the train error. This is the goal of the proposed method, but in practice, one would want to know how well the solution obtained via the coreset generates to unseen data. And then the difference between uniform and the proposed method might be smaller.)

Correctness: There are no full proofs in the main paper. Had a look at some of the proofs in the appendix, which look ok to me.

Clarity: The paper is clearly organized and clearly describes ideas and which ideas/algorithms where motivated by which papers, which is nice. It could be clearer on the limited applications regarding coresets, and on the interpretation of the experimental results which seem rather week (see weaknesses) and are presented in an irritating way.

Relation to Prior Work: Relation to prior work is clearly stated. The results are compared, and an overview e.g. on other coreset literature is included.

Reproducibility: Yes

Additional Feedback: Is $GLS_k$ used anywhere (useful in any real world context), or has it just been introduced e because we can? It is weird that GLS_k and existence of results are mentioned in the main part, but then no result is in the main paper. Maybe it would be better to mention this extension later and then also briefly describe what (not) changes in the main results (compared to the just presented results then). The maximum empirical error column is really irritating (especially since that column is marked in bold to underline the superiority of CGLSE). Instead of the table, boxplots incl. plots of outliers might be a better way to represent the results. It is not so easy to follow the notation sometimes: e.g. when there's a bigger gap between variable introduction and mentioning the variable name. E.g. "The $\rho$ variables make the objective function of GLSE in contrast to the cross-sectional data setting where objective functions only contain β and are convex." (ll. 79) - a $rho$ was mentioned last time in ll. 38, plus part of the sentence must be missing here? Based on author feedback and discussion: Thanks for your detailed feedback. I'll increase my score based on the list of conditions mentioned in the updated review #1