Part of Advances in Neural Information Processing Systems 24 (NIPS 2011)
When used to learn high-dimensional parametric probabilistic models, classical maximum likelihood (ML) learning often suffers from computational intractability, which has motivated the active development of non-ML learning methods. Yet, because of their divergent motivations and forms, the objective functions of many non-ML learning methods appear unrelated, and a unified framework for understanding them is lacking. In this work, based on an information-geometric view of parametric learning, we introduce a general non-ML learning principle termed minimum KL contraction, in which we seek the optimal parameters that minimize the contraction of the KL divergence between the data and model distributions after they are transformed with a KL contraction operator. We then show that the objective functions of several important or recently developed non-ML learning methods, including contrastive divergence, noise-contrastive estimation, partial likelihood, non-local contrastive objectives, score matching, pseudo-likelihood, maximum conditional likelihood, maximum mutual information, maximum marginal likelihood, and conditional and marginal composite likelihood, can be unified under the minimum KL contraction framework with different choices of the KL contraction operator.
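As an illustration of the idea of KL contraction, the following minimal sketch uses marginalization over one variable as the operator on small discrete joint distributions. It assumes the contraction is measured as the plain difference between the KL divergence before and after the transformation; the precise definition used in the paper may differ (for example, by a scaling constant). The function names and the example tables `P` and `Q` are hypothetical and chosen only for this sketch. By the chain rule for KL divergence, the difference computed here equals the expected KL divergence between the conditional distributions, which is the kind of quantity that underlies maximum conditional likelihood.

```python
import math


def kl(p, q):
    """KL divergence between two discrete distributions given as flat lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def flatten(joint):
    """Flatten a joint table indexed as joint[x][y] into a single list."""
    return [v for row in joint for v in row]


def marginalize(joint):
    """Illustrative contraction operator (an assumption of this sketch):
    sum the joint table over y to obtain the marginal over x."""
    return [sum(row) for row in joint]


def kl_contraction(p_joint, q_joint):
    """KL divergence between the joints minus KL divergence between
    their marginals over x."""
    before = kl(flatten(p_joint), flatten(q_joint))
    after = kl(marginalize(p_joint), marginalize(q_joint))
    return before - after


def expected_conditional_kl(p_joint, q_joint):
    """Expected KL divergence between conditionals of y given x,
    averaged under the marginal of p over x."""
    px = marginalize(p_joint)
    qx = marginalize(q_joint)
    total = 0.0
    for i, (p_row, q_row) in enumerate(zip(p_joint, q_joint)):
        p_cond = [v / px[i] for v in p_row]
        q_cond = [v / qx[i] for v in q_row]
        total += px[i] * kl(p_cond, q_cond)
    return total


# Hypothetical 2x2 joint distributions over (x, y), for illustration only.
P = [[0.30, 0.10], [0.20, 0.40]]
Q = [[0.25, 0.25], [0.25, 0.25]]

contraction = kl_contraction(P, Q)
conditional = expected_conditional_kl(P, Q)
print(contraction, conditional)
```

In this sketch the contraction is non-negative, as the data processing inequality guarantees for marginalization, and it vanishes when the two joint distributions coincide.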