Aditya Bhaskara, Pruthuvi Maheshakya Wijewardena
In the stochastic k-PCA problem, we are given i.i.d. samples from an unknown distribution over vectors, and the goal is to compute the top k eigenvalues and eigenvectors of the moment matrix. In the simplest distributed variant, we have 'm' machines each of which receives 'n' samples. Each machine performs some computation and sends an O(k) size summary of the local dataset to a central server. The server performs an aggregation and computes the desired eigenvalues and vectors. The goal is to achieve the same effect as the server computing using m*n samples by itself. The main choices in this framework are the choice of the summary, and the method of aggregation. We consider a slight variant of the well-studied "distributed averaging" approach, and prove that this leads to significantly better bounds on the dependence between 'n' and the eigenvalue gaps. Our method can also be applied directly to a setting where the "right" value of the parameter k (i.e., one for which there is a non-trivial eigenvalue gap) is not known exactly. This is a common issue in practice which prior methods were unable to address.