Part of Advances in Neural Information Processing Systems 33 (NeurIPS 2020)
Lingxiao Huang, K Sudhir, Nisheeth Vishnoi
A panel dataset contains features or observations for multiple individuals over multiple time periods, and regression problems with panel data are common in statistics and applied ML. When dealing with massive datasets, coresets have emerged as a valuable tool from a computational, storage, and privacy perspective, as one needs to work with and share much smaller datasets. However, results on coresets for regression problems have so far been available only for cross-sectional data ($N$ individuals each observed for a single time unit) or longitudinal data (a single individual observed for $T>1$ time units); there are no results for panel data ($N>1$, $T>1$). This paper introduces the problem of coresets to panel data settings: we first define coresets for several variants of regression problems with panel data and then present efficient algorithms to construct coresets whose size is independent of $N$ and $T$ and depends only polynomially on $1/\varepsilon$ (where $\varepsilon$ is the error parameter) and the number of regression parameters. Our approach is based on the Feldman-Langberg framework, in which a key step is to upper bound the “total sensitivity”, which is roughly the sum of the maximum influences of all individual-time pairs taken over all possible choices of regression parameters. Empirically, we assess our approach on a synthetic and a real-world dataset; the coresets constructed using our approach are much smaller than the full dataset and indeed reduce the running time of computing the regression objective.
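
To make the sensitivity-sampling idea concrete, the following is a minimal sketch for the simpler cross-sectional case (ordinary least squares), not the panel-data algorithms of this paper. It assumes the standard fact that the sensitivity of a row in least-squares regression is upper bounded by the leverage score of the corresponding row of the augmented matrix $[X \; y]$, so the total sensitivity is at most the number of regression parameters plus one. The function names (`leverage_scores`, `sensitivity_coreset`, `ols_cost`), the synthetic data, and the coreset size `m` are all illustrative assumptions.

```python
import numpy as np

def leverage_scores(A):
    # Squared row norms of Q from a thin QR factorization are the leverage
    # scores of A (assumes A has full column rank).
    Q, _ = np.linalg.qr(A)
    return np.sum(Q ** 2, axis=1)

def sensitivity_coreset(X, y, m, rng):
    # Leverage scores of [X y] serve as upper bounds on the sensitivities
    # (an assumption valid for plain least squares, not for panel models).
    s = leverage_scores(np.column_stack([X, y]))
    p = s / s.sum()                      # sampling distribution
    idx = rng.choice(len(y), size=m, replace=True, p=p)
    w = 1.0 / (m * p[idx])               # weights make the estimate unbiased
    return idx, w

def ols_cost(X, y, beta, w=None):
    # Sum of (optionally weighted) squared residuals.
    r = X @ beta - y
    return np.sum(r ** 2) if w is None else np.sum(w * r ** 2)

# Hypothetical synthetic data for illustration only.
rng = np.random.default_rng(0)
n, d = 20000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

idx, w = sensitivity_coreset(X, y, m=2000, rng=rng)

# Compare the objective on the full data and on the weighted sample
# for an arbitrary parameter vector.
beta = rng.normal(size=d)
full = ols_cost(X, y, beta)
approx = ols_cost(X[idx], y[idx], beta, w)
rel_err = abs(full - approx) / full
total_sensitivity_bound = leverage_scores(np.column_stack([X, y])).sum()
```

In this sketch the sum of the sensitivity upper bounds equals $d+1$ regardless of $n$, which is the sense in which a bounded total sensitivity allows a sample size that does not grow with the dataset.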