Unlabeled data: Now it helps, now it doesn't

Part of Advances in Neural Information Processing Systems 21 (NIPS 2008)

Bibtex Metadata Paper


Aarti Singh, Robert Nowak, Jerry Zhu


Empirical evidence shows that in favorable situations semi-supervised learning (SSL) algorithms can capitalize on the abundancy of unlabeled training data to improve the performance of a learning task, in the sense that fewer labeled training data are needed to achieve a target error bound. However, in other situations unlabeled data do not seem to help. Recent attempts at theoretically characterizing the situations in which unlabeled data can help have met with little success, and sometimes appear to conflict with each other and intuition. In this paper, we attempt to bridge the gap between practice and theory of semi-supervised learning. We develop a rigorous framework for analyzing the situations in which unlabeled data can help and quantify the improvement possible using finite sample error bounds. We show that there are large classes of problems for which SSL can significantly outperform supervised learning, in finite sample regimes and sometimes also in terms of error convergence rates.