Active learning for misspecified generalized linear models

Part of Advances in Neural Information Processing Systems 19 (NIPS 2006)


Francis Bach


Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels.
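
As a rough illustration of why the sampling distribution matters for estimating generalization performance, the sketch below uses generic importance weighting: losses observed at points drawn from a proposal distribution q are reweighted by p/q to estimate the expected loss under the test distribution p. This is not the estimator derived in the paper. The function name `importance_weighted_loss`, the five-point pool, and the loss values are hypothetical, and the samples are assumed independent for simplicity, which is a weaker setting than the non-independent sampling the abstract considers.

```python
import numpy as np


def importance_weighted_loss(losses, p_density, q_density):
    """Estimate E_p[loss] from points sampled under q, using weights p/q.

    Hypothetical helper for illustration; assumes q > 0 wherever p > 0.
    """
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(p_density, dtype=float) / np.asarray(q_density, dtype=float)
    return float(np.mean(weights * losses))


# Toy pool of 5 candidate inputs (all values are made up for illustration).
p = np.full(5, 0.2)                          # test distribution: uniform
q = np.array([0.4, 0.3, 0.1, 0.1, 0.1])      # non-uniform sampling distribution
loss_per_point = np.array([1.0, 0.5, 2.0, 0.0, 3.0])

# Expected loss under the test distribution, computed exactly on the pool.
true_risk = float(np.sum(p * loss_per_point))

# Draw training indices from q (independently, as a simplifying assumption).
rng = np.random.default_rng(0)
idx = rng.choice(5, size=200_000, p=q)

# The reweighted average targets the risk under p; the plain average targets
# the risk under q instead, so it is biased for the quantity of interest.
weighted_estimate = importance_weighted_loss(loss_per_point[idx], p[idx], q[idx])
naive_estimate = float(np.mean(loss_per_point[idx]))
```

With a non-uniform q, the naive average converges to the expected loss under q rather than under p, which is the kind of bias that weighting by the sampling distribution is meant to remove.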