Part of Advances in Neural Information Processing Systems 18 (NIPS 2005)
Sharon Goldwater, Mark Johnson, Thomas Griffiths
Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that generically produce power-laws, augmenting stan- dard generative models with an adaptor that produces the appropriate pattern of token frequencies. We show that taking a particular stochastic process – the Pitman-Yor process – as an adaptor justiﬁes the appearance of type frequencies in formal analyses of natural language, and improves the performance of a model for unsupervised learning of morphology.