Interpolating between types and tokens by estimating power-law generators

Part of Advances in Neural Information Processing Systems 18 (NIPS 2005)

Bibtex Metadata Paper


Sharon Goldwater, Mark Johnson, Thomas Griffiths


Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that generically produce power-laws, augmenting stan- dard generative models with an adaptor that produces the appropriate pattern of token frequencies. We show that taking a particular stochastic process – the Pitman-Yor process – as an adaptor justifies the appearance of type frequencies in formal analyses of natural language, and improves the performance of a model for unsupervised learning of morphology.