Part of Advances in Neural Information Processing Systems 36 (NeurIPS 2023) Main Conference Track
Swati Padmanabhan, David Woodruff, Richard Zhang
Recent works in dimensionality reduction for regression tasks have introduced the notion of sensitivity, an estimate of the importance of a specific datapoint in a dataset, offering provable guarantees on the quality of the approximation after removing low-sensitivity datapoints via subsampling. However, fast algorithms for approximating sensitivities, which we show is equivalent to approximate regression, are known only for the ℓ2 setting, in which they are popularly termed leverage scores. In this work, we provide the first efficient algorithms for approximating ℓp sensitivities and other summary statistics of a given matrix. In particular, for a given n×d matrix, we compute an α-approximation to its ℓ1 sensitivities at the cost of n/α sensitivity computations. For estimating the total ℓp sensitivity (i.e., the sum of the ℓp sensitivities), we provide an algorithm based on importance sampling of ℓp Lewis weights, which computes a constant-factor approximation at the cost of roughly √d sensitivity computations, with no polynomial dependence on n. Furthermore, we estimate the maximum ℓ1 sensitivity up to a √d factor in O(d) sensitivity computations. We also generalize these results to ℓp norms. Lastly, we experimentally show that for a wide class of structured matrices in real-world datasets, the total sensitivity can be quickly approximated and is significantly smaller than the theoretical prediction, demonstrating that real-world datasets have, on average, low intrinsic effective dimensionality.
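To make the objects in the abstract concrete, the following is a minimal sketch (not the paper's fast algorithms) of the underlying definitions: the ℓ2 sensitivities of a matrix are its leverage scores, computable exactly from a thin QR factorization, and a single ℓ1 sensitivity σ_i(A) = sup_x |a_i^T x| / ||Ax||_1 can be computed by a linear program, which is the kind of "sensitivity computation" the abstract counts. The function names and the use of scipy's linprog are illustrative choices, not part of the paper.

```python
import numpy as np
from scipy.optimize import linprog


def l2_sensitivities(A):
    """Exact l2 sensitivities (leverage scores): sigma_i = a_i^T (A^T A)^+ a_i."""
    Q, _ = np.linalg.qr(A)           # thin QR; squared row norms of Q are the leverage scores
    return np.sum(Q * Q, axis=1)


def l1_sensitivity(A, i):
    """One l1 sensitivity sigma_i = sup_x |a_i^T x| / ||Ax||_1, via an LP.

    By scale invariance this equals max a_i^T x subject to ||Ax||_1 <= 1,
    written with auxiliary variables t >= |Ax| (elementwise).
    """
    n, d = A.shape
    c = np.concatenate([-A[i], np.zeros(n)])              # maximize a_i^T x
    A_ub = np.block([
        [A, -np.eye(n)],                                  #  Ax - t <= 0
        [-A, -np.eye(n)],                                 # -Ax - t <= 0
        [np.zeros((1, d)), np.ones((1, n))],              #  sum(t) <= 1
    ])
    b_ub = np.concatenate([np.zeros(2 * n), [1.0]])
    bounds = [(None, None)] * d + [(0, None)] * n         # x free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return -res.fun


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 5))
    print("total l2 sensitivity (= rank for full-rank A):", l2_sensitivities(A).sum())
    print("l1 sensitivity of row 0:", l1_sensitivity(A, 0))
```

Computing all n ℓ1 sensitivities this way costs n such LPs; the abstract's results concern how many of these per-row computations suffice to approximate the individual sensitivities, their sum, or their maximum.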