Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

Advances in Neural Information Processing Systems, pp. 64-72

Siwei Lyu
Computer Science Department
University at Albany, State University of New York
lsw@cs.albany.edu

Abstract

When used to learn high dimensional parametric probabilistic models, classical maximum likelihood (ML) learning often suffers from computational intractability, which motivates the active development of non-ML
learning methods. Yet, because of their divergent motivations and forms, the objective functions of many non-ML learning methods are seemingly unrelated, and a unified framework for understanding them is lacking. In this work, based on an information geometric view of parametric learning, we introduce a general non-ML learning principle termed minimum KL contraction, where we seek optimal parameters that minimize the contraction of the KL divergence between two distributions after they are transformed with a KL contraction operator. We then show that the objective functions of several important or recently developed non-ML learning methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can be unified under the minimum KL contraction framework with different choices of the KL contraction operators.

1 Introduction

Fitting parametric probabilistic models to data is a basic task in statistics and machine learning. Given a set of training data $\{x^{(1)},\cdots,x^{(n)}\}$, parameter learning aims to find a member of a parametric distribution family, $q_\theta$, that best represents the training data. In practice, many useful high dimensional parametric probabilistic models, such as Markov random fields [18] or products of experts [12], are defined as $q_\theta(x) = \tilde{q}_\theta(x)/Z(\theta)$, where $\tilde{q}_\theta$ is the unnormalized model and $Z(\theta) = \int_{\mathbb{R}^d} \tilde{q}_\theta(x)\,dx$ is the partition function. Maximum (log) likelihood (ML) estimation is the most commonly used method for parameter learning, where the optimal parameter is obtained by solving $\operatorname{argmax}_\theta \frac{1}{n}\sum_{k=1}^{n}\log q_\theta(x^{(k)})$.
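For intuition, when $q_\theta$ is a simple one-dimensional Gaussian, the ML objective above has a closed-form maximizer (the sample mean and variance). The following is an illustrative sketch, not from the paper; the data and parameter values are arbitrary toy choices:

```python
import numpy as np

# Toy illustration: for a 1-D Gaussian model q_theta(x) = N(x; mu, var),
# the ML objective argmax_theta (1/n) * sum_k log q_theta(x_k) is solved
# in closed form by the sample mean and the (biased) sample variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=50_000)

def avg_log_lik(mu, var, x):
    """Average log-likelihood of data x under N(mu, var)."""
    return np.mean(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))

mu_ml, var_ml = x.mean(), x.var()
# The ML solution scores at least as high as nearby parameter settings.
assert avg_log_lik(mu_ml, var_ml, x) >= avg_log_lik(mu_ml + 0.1, var_ml, x)
assert avg_log_lik(mu_ml, var_ml, x) >= avg_log_lik(mu_ml, var_ml * 1.1, x)
```

For this simple model no partition-function issue arises; the intractability discussed next appears only when $Z(\theta)$ has no closed form.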
The obtained ML estimator has many desirable properties, such as consistency and asymptotic normality [21]. However, because of the high dimensional integration/summation it entails, the partition function in $q_\theta$ often makes ML learning computationally intractable. For this reason, non-ML parameter learning methods that use "tricks" to obviate direct computation of the partition function have experienced rapid development, particularly in recent years. While many computationally efficient non-ML learning methods have achieved impressive practical performance, with a few exceptions, their different learning objectives and numerical implementations seem to suggest that they are largely unrelated.

In this work, based on the information geometric view of parametric learning, we elaborate on a general non-ML learning principle termed minimum KL contraction (MKC), where we seek optimal parameters that minimize the contraction of the KL divergence between two distributions after they are transformed with a KL contraction operator. A KL contraction operator is a mapping between probability distributions under which the KL divergence of two distributions tends to decrease unless they are equal. We then show that the objective functions of a wide range of non-ML learning methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can all be unified under the MKC framework with different choices of the KL contraction operators and MKC objective functions.

2 Related Works

Similarities in the parameter updates among non-ML learning methods have been noticed in several recent works.
For instance, in [15], it is shown that the parameter update in score matching [14] is equivalent to the parameter update in a version of contrastive divergence [12] that performs Langevin approximation instead of Gibbs sampling, and that both are approximations to the parameter update of pseudo-likelihood [3]. This connection is further generalized in [1], which shows that the parameter update in another variant of contrastive divergence is equivalent to a stochastic parameter update in conditional composite likelihood [24]. However, such similarities in numerical implementations are only tangential to the more fundamental relationship among the objective functions of different non-ML learning methods. On the other hand, energy based learning [22] presents a general framework that subsumes most non-ML learning objectives, but its broad generality also obscures their specific statistical interpretations.

At the objective function level, relations between some non-ML methods are known. For instance, it is known that pseudo-likelihood is a special case of conditional composite likelihood [30]. In [10], several non-ML learning methods are unified under the framework of minimizing Bregman divergence.

3 KL Contraction Operator

We base our discussion hereafter on continuous variables and probability density functions. Most results can be readily extended to the discrete case by replacing integrations and probability density functions with summations and probability mass functions. We denote by $\Omega_d$ the set of all probability density functions over $\mathbb{R}^d$. For two probability distributions $p, q \in \Omega_d$, their Kullback-Leibler (KL) divergence (also known as relative entropy or I-divergence) [6] is defined as $\mathrm{KL}(p\|q) = \int_{\mathbb{R}^d} p(x)\log\frac{p(x)}{q(x)}\,dx$. The KL divergence is non-negative and equals zero if and only if $p = q$ almost everywhere (a.e.).
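For concreteness, the KL divergence and its basic properties are easy to compute directly in the discrete case; a minimal illustrative sketch (not from the paper):

```python
import numpy as np

def kl(p, q):
    """KL(p||q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
assert kl(p, q) > 0              # non-negative, zero only when p = q
assert abs(kl(p, p)) < 1e-12
assert abs(kl(p, q) - kl(q, p)) > 1e-6  # asymmetric: not a true metric
```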
We define a distribution operator, $\Phi$, as a mapping from a density function $p \in \Omega_d$ to another density function $\tilde{p} \in \Omega_{d'}$, and adopt the shorthand notation $\tilde{p} = \Phi\{p\}$. A distribution $p$ is a fixed point of a distribution operator $\Phi$ if $p = \Phi\{p\}$.

A KL contraction operator is a distribution operator, $\Phi: \Omega_d \mapsto \Omega_{d'}$, such that for any $p, q \in \Omega_d$, there exists a constant $\beta \ge 1$ for which the following condition holds:
$$\mathrm{KL}(p\|q) - \beta\,\mathrm{KL}(\Phi\{p\}\|\Phi\{q\}) \ge 0. \qquad (1)$$
Subsequently, $\beta$ is known as the contraction factor, and the LHS of Eq. (1) is the KL contraction of $p$ and $q$ under $\Phi$. Obviously, if $p = q$ (a.e.), their KL contraction, as well as their KL divergence, is zero. In addition, a KL contraction operator is strict if the equality in Eq. (1) holds only when $p = q$ (a.e.). Intuitively, if the KL divergence is regarded as a "distance" metric on probability distributions¹, then it never increases after both distributions are transformed with a KL contraction operator; a graphical illustration is shown in Fig. 1. Furthermore, under a strict KL contraction operator, the KL divergence is always reduced unless the two distributions are equal (a.e.). KL contraction operators are analogous to contraction operators in ordinary metric spaces, with $\beta$ playing a role similar to that of the Lipschitz constant [19].

Figure 1: Illustration of a KL contraction operator on two density functions $p$ and $q$: $\tilde{p} = \Phi\{p\}$ and $\tilde{q} = \Phi\{q\}$, with $\mathrm{KL}(\tilde{p}\|\tilde{q}) \le \mathrm{KL}(p\|q)$.

¹Indeed, it is known that the KL divergence behaves like the squared Euclidean distance [6].

Eq. (1) gives the general and abstract definition of KL contraction operators.
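On finite state spaces, Eq. (1) can be spot-checked numerically: a conditional distribution $T(y|x)$ is simply a column-stochastic matrix, and mapping both distributions through it never increases their KL divergence (the data processing inequality). An illustrative sketch, not from the paper, with arbitrary toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# A conditional distribution T(y|x) on finite spaces is a column-stochastic
# matrix; the induced operator maps a probability vector p to T @ p.
d, d2 = 5, 4
T = rng.random((d2, d))
T /= T.sum(axis=0, keepdims=True)   # each column sums to 1

for _ in range(100):
    p = rng.dirichlet(np.ones(d))
    q = rng.dirichlet(np.ones(d))
    # Eq. (1) with beta = 1: the KL contraction is nonnegative.
    assert kl(p, q) - kl(T @ p, T @ q) >= -1e-10
```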
In the following, we give several examples of KL contraction operators that are constructed from common operations on probability distributions.

3.1 Conditional Distribution

We can form a family of KL contraction operators using conditional distributions. Consider $x \in \mathbb{R}^d$ with distribution $p(x) \in \Omega_d$ and $y \in \mathbb{R}^{d'}$; from a conditional distribution, $T(y|x)$, we can define a distribution operator as
$$\Phi^c_T\{p\}(y) = \int_{\mathbb{R}^d} T(y|x)p(x)\,dx = \tilde{p}(y). \qquad (2)$$

The following result shows that $\Phi^c_T$ is a strict KL contraction operator with $\beta = 1$.

Lemma 1 (Cover & Thomas [6]²) For two distributions $p, q \in \Omega_d$, with the distribution operator defined in Eq. (2), we have
$$\mathrm{KL}(p\|q) - \mathrm{KL}(\Phi^c_T\{p\}\|\Phi^c_T\{q\}) = \int_{\mathbb{R}^{d'}} \tilde{p}(y)\,\mathrm{KL}(T_p(x|y)\|T_q(x|y))\,dy \ge 0,$$
where $T_p(x|y) = \frac{T(y|x)p(x)}{\tilde{p}(y)}$ and $T_q(x|y) = \frac{T(y|x)q(x)}{\tilde{q}(y)}$ are the conditional distributions induced from $p$ and $q$ with $T$. Furthermore, the equality holds if and only if $p = q$ (a.e.).

3.2 Marginalization and Marginal Grafting

Two related types of KL contraction operators can be constructed based on marginal distributions. Consider $x$ with distribution $p(x) \in \Omega_d$, and a nonempty index subset $A \subset \{1,\cdots,d\}$. Let $\backslash A$ denote $\{1,\cdots,d\} - A$; the marginal distribution, $p_A(x_A)$, of the sub-vector $x_A$ formed by components whose indices are in $A$ is obtained by integrating $p(x)$ over the sub-vector $x_{\backslash A}$. This marginalization operation thus defines a distribution operator between $p \in \Omega_d$ and $p_A \in \Omega_{|A|}$, as
$$\Phi^m_A\{p\}(x_A) = \int_{\mathbb{R}^{d-|A|}} p(x)\,dx_{\backslash A} = p_A(x_A). \qquad (3)$$

Another KL contraction operator, termed marginal grafting, can also be defined based on $p_A$.
For a distribution $q(x) \in \Omega_d$, the marginal grafting operator is defined as
$$\Phi^g_{p,A}\{q\}(x) = \frac{q(x)\,p_A(x_A)}{q_A(x_A)} = q_{\backslash A|A}(x_{\backslash A}|x_A)\,p_A(x_A). \qquad (4)$$
$\Phi^g_{p,A}\{q\}$ can be understood as replacing $q_A$ in $q(x)$ with $p_A$. The last term in Eq. (4) is nonnegative and integrates to one over $\mathbb{R}^d$, and is thus a proper probability distribution in $\Omega_d$. Furthermore, $p$ is a fixed point of the operator $\Phi^g_{p,A}$, as $\Phi^g_{p,A}\{p\} = p$.

The following result shows that both $\Phi^m_A$ and $\Phi^g_{p,A}$ are KL contraction operators, and that they are in a sense complementary to each other.

Lemma 2 (Huber [13]) For two distributions $p, q \in \Omega_d$, with the distribution operators defined in Eq. (3) and Eq. (4), we have
$$\mathrm{KL}(p\|q) - \mathrm{KL}(\Phi^m_A\{p\}\|\Phi^m_A\{q\}) = \int_{\mathbb{R}^{|A|}} p_A(x_A)\,\mathrm{KL}\!\left(p_{\backslash A|A}(\cdot|x_A)\,\big\|\,q_{\backslash A|A}(\cdot|x_A)\right) dx_A \ge 0,$$
where $p_{\backslash A|A}(\cdot|x_A)$ and $q_{\backslash A|A}(\cdot|x_A)$ are the conditional distributions induced from $p(x)$ and $q(x)$, and
$$\mathrm{KL}(\Phi^m_A\{p\}\|\Phi^m_A\{q\}) = \mathrm{KL}(p_A(x_A)\|q_A(x_A)).$$
Furthermore,
$$\mathrm{KL}(p\|q) - \mathrm{KL}\!\left(\Phi^g_{p,A}\{p\}\,\big\|\,\Phi^g_{p,A}\{q\}\right) = \mathrm{KL}(\Phi^m_A\{p\}\|\Phi^m_A\{q\}).$$

Lemma 2 also indicates that neither $\Phi^m_A$ nor $\Phi^g_{p,A}$ is strict: the KL contraction of $p(x)$ and $q(x)$ under the former is zero if $p_{\backslash A|A}(x_{\backslash A}|x_A) = q_{\backslash A|A}(x_{\backslash A}|x_A)$ (a.e.), even though they may differ on the marginal distribution over $x_A$; and for the latter, having $p_A(x_A) = q_A(x_A)$ is sufficient to make their KL contraction zero.

²We cite the original references for this and subsequent results, which are recast using the terminology introduced in this work. Due to the limit of space, we defer formal proofs of these results to the supplementary materials.

3.3 Binary Mixture

For two different distributions $p(x)$ and $g(x) \in \Omega_d$, we introduce a binary variable $c \in \{0,1\}$ with $\Pr(c = i) = \pi_i$, where $\pi_0, \pi_1 \in [0,1]$ and $\pi_0 + \pi_1 = 1$. We can then form a joint distribution $\hat{p}(x, c=0) = \pi_0 g(x)$ and $\hat{p}(x, c=1) = \pi_1 p(x)$ over $\mathbb{R}^d \times \{0,1\}$. Marginalizing out $c$ from $\hat{p}(x,c)$, we obtain a binary mixture $\tilde{p}(x)$, which induces a distribution operator:
$$\Phi^b_g\{p\}(x) = \pi_0 g(x) + \pi_1 p(x) = \tilde{p}(x). \qquad (5)$$

The following result shows that $\Phi^b_g$ is a strict KL contraction operator with $\beta = 1/\pi_1$.

Lemma 3 For two distributions $p, q \in \Omega_d$, with the distribution operator defined in Eq. (5), we have
$$\mathrm{KL}(p\|q) - \frac{1}{\pi_1}\,\mathrm{KL}\!\left(\Phi^b_g\{p\}\,\big\|\,\Phi^b_g\{q\}\right) = \frac{1}{\pi_1}\int_{\mathbb{R}^d} \tilde{p}(x)\,\mathrm{KL}\!\left(p_{c|x}(c|x)\,\big\|\,q_{c|x}(c|x)\right) dx \ge 0,$$
where $p_{c|x}(c|x)$ and $q_{c|x}(c|x)$ are the induced posterior conditional distributions from $\hat{p}(x,c)$ and $\hat{q}(x,c)$, respectively.
The equality holds if and only if $p = q$ (a.e.).

3.4 Lumping

Let $S = (S_1, S_2, \cdots, S_m)$ be a partition of $\mathbb{R}^d$ such that $S_i \cap S_j = \emptyset$ for $i \ne j$, and $\bigcup_{i=1}^m S_i = \mathbb{R}^d$. The lumping [8] of a distribution $p(x) \in \Omega_d$ over $S$ yields a distribution over $i \in \{1,2,\cdots,m\}$, and subsequently induces a distribution operator $\Phi^l_S$, as
$$\Phi^l_S\{p\}(i) = \int_{x \in S_i} p(x)\,dx = P^S_i, \quad \text{for } i = 1,\cdots,m. \qquad (6)$$

The following result shows that $\Phi^l_S$ is a KL contraction operator with $\beta = 1$.

Lemma 4 (Csiszár & Shields [8]) For two distributions $p, q \in \Omega_d$, with the distribution operator defined in Eq. (6), we have
$$\mathrm{KL}(p\|q) - \mathrm{KL}\!\left(\Phi^l_S\{p\}\,\big\|\,\Phi^l_S\{q\}\right) = \sum_{i=1}^m P^S_i\,\mathrm{KL}(\tilde{p}_i\|\tilde{q}_i) \ge 0,$$
where $\tilde{p}_i(x) = \frac{p(x)\,\mathbf{1}[x \in S_i]}{\int_{x' \in S_i} p(x')\,dx'}$ and $\tilde{q}_i(x) = \frac{q(x)\,\mathbf{1}[x \in S_i]}{\int_{x' \in S_i} q(x')\,dx'}$ are the distributions induced from $p$ and $q$ by restricting to $S_i$, respectively, with $\mathbf{1}[\cdot]$ being the indicator function.

Note that $\Phi^l_S$ is in general not strict, as two distributions that agree on all the restricted distributions $\tilde{p}_i$ and $\tilde{q}_i$ will have a zero KL contraction.

4 Minimizing KL Contraction for Parametric Learning

In this work, we take the information geometric view of parameter learning: assuming the training data are samples from a distribution $p \in \Omega_d$, we seek an optimal distribution on the statistical manifold corresponding to the parametric distribution family $q_\theta$ that best approximates $p$ [20]. In this context, maximum (log) likelihood learning is equivalent to finding the parameter $\theta$ that minimizes the KL divergence of $p$ and $q_\theta$ [8], as $\operatorname{argmin}_\theta \mathrm{KL}(p\|q_\theta) = \operatorname{argmax}_\theta \int_{\mathbb{R}^d} p(x)\log q_\theta(x)\,dx$.
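Before proceeding, the contraction bounds of Lemmas 3 and 4 are easy to verify numerically on a finite state space. The sketch below is illustrative and not from the paper; the distributions, mixture weights, and partition are arbitrary toy choices:

```python
import numpy as np

rng = np.random.default_rng(2)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

d = 6
g = rng.dirichlet(np.ones(d))          # fixed mixing distribution g(x)
pi0, pi1 = 0.3, 0.7
S = [np.array([0, 1]), np.array([2, 3, 4]), np.array([5])]  # a partition

for _ in range(100):
    p = rng.dirichlet(np.ones(d))
    q = rng.dirichlet(np.ones(d))
    # Lemma 3: the binary mixture pi0*g + pi1*p contracts KL with
    # contraction factor beta = 1/pi1.
    lhs = kl(p, q) - (1.0 / pi1) * kl(pi0 * g + pi1 * p, pi0 * g + pi1 * q)
    assert lhs >= -1e-10
    # Lemma 4: lumping onto the partition S contracts KL with beta = 1.
    pl = np.array([p[s].sum() for s in S])
    ql = np.array([q[s].sum() for s in S])
    assert kl(p, q) - kl(pl, ql) >= -1e-10
```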
The data based ML objective is obtained when we approximate the expectation with the sample average, as $\int_{\mathbb{R}^d} p(x)\log q_\theta(x)\,dx \approx \frac{1}{n}\sum_{k=1}^n \log q_\theta(x^{(k)})$.

The KL contraction operators suggest an alternative approach to parametric learning. In particular, the KL contraction of $p$ and $q_\theta$ under a KL contraction operator is always nonnegative and reaches zero when $p$ and $q_\theta$ are equal almost everywhere. Therefore, we can minimize their KL contraction under a KL contraction operator to encourage the matching of $q_\theta$ to $p$. We term this general approach to parameter learning minimum KL contraction (MKC). Mathematically, minimum KL contraction may be realized with three different but related types of objective functions.

- Type I: With a KL contraction operator $\Phi$, we can find the optimal $\theta$ that directly minimizes the KL contraction between $p$ and $q_\theta$, as
$$\operatorname{argmin}_\theta\; \mathrm{KL}(p\|q_\theta) - \beta\,\mathrm{KL}(\Phi\{p\}\|\Phi\{q_\theta\}). \qquad (7)$$
In practice, it may be desirable to use a $\Phi$ with $\beta = 1$ that is also a linear operator for $L_2$ bounded functions over $\mathbb{R}^d$ [19]. To better see this, consider $q_\theta(x) = \frac{\tilde{q}_\theta(x)}{Z(\theta)}$ as the model defined with the unnormalized model and its partition function.
Furthermore, assuming that we can obtain samples $\{x^{(1)},\cdots,x^{(n)}\}$ and $\{y^{(1)},\cdots,y^{(n')}\}$ from $p$ and $\Phi\{p\}$, respectively, the optimization of Eq. (7) can be approximated as
$$\operatorname{argmin}_\theta\; \mathrm{KL}(p\|q_\theta) - \mathrm{KL}(\Phi\{p\}\|\Phi\{q_\theta\}) \approx \operatorname{argmax}_\theta\; \frac{1}{n}\sum_{k=1}^n \log\tilde{q}_\theta(x^{(k)}) - \frac{1}{n'}\sum_{k=1}^{n'} \log\Phi\{\tilde{q}_\theta\}(y^{(k)}),$$
where, due to the linearity of $\Phi$, the two terms of $Z(\theta)$ in $q_\theta$ and $\Phi\{q_\theta\}$ cancel each other out. Therefore, the optimization does not require computation of the partition function, a highly desirable property for fitting parameters in high dimensional probabilistic models with intractable partition functions. Type I MKC objective functions with KL contraction operators induced from a conditional distribution, marginalization, marginal grafting, linear transform, and lumping all fall into this category. However, using nonlinear KL contraction operators, such as the one induced from binary mixtures, may also avoid computing the partition function (e.g., Section 4.4). Furthermore, the KL contraction operator in Eq. (7) can have parameters, which can include the model parameter $\theta$ (e.g., Section 4.2); however, the optimization then becomes more complicated, as $\Phi\{p\}$ cannot be ignored when optimizing $\theta$. Last, note that when using a Type I MKC objective function with a non-strict KL contraction operator, we cannot guarantee $p = q_\theta$ even if their corresponding KL contraction is zero.

- Type II: Consider a strict KL contraction operator with $\beta = 1$, denoted $\Phi_t$, parameterized by an auxiliary parameter $t$ that is different from $\theta$, such that for any distribution $p \in \Omega_d$ we have $\Phi_0\{p\} = p$ and $\Phi_t\{p\}$ is continuously differentiable with regards to $t$.
Then, the KL divergence of $\Phi_t\{p\}$ and $\Phi_t\{q_\theta\}$ can be regarded as a function of $t$ and $\theta$: $L(t,\theta) = \mathrm{KL}(\Phi_t\{p\}\|\Phi_t\{q_\theta\})$. Thus, the KL contraction in Eq. (7) can be approximated with a Taylor expansion near $t = 0$, as
$$\mathrm{KL}(p\|q_\theta) - \mathrm{KL}(\Phi_{\delta t}\{p\}\|\Phi_{\delta t}\{q_\theta\}) = \mathrm{KL}(\Phi_0\{p\}\|\Phi_0\{q_\theta\}) - \mathrm{KL}(\Phi_{\delta t}\{p\}\|\Phi_{\delta t}\{q_\theta\}) = L(0,\theta) - L(\delta t,\theta) \approx -\delta t\,\frac{\partial L(t,\theta)}{\partial t}\bigg|_{t=0} = -\delta t\,\frac{\partial}{\partial t}\mathrm{KL}(\Phi_t\{p\}\|\Phi_t\{q_\theta\})\bigg|_{t=0}.$$
If the derivative of the KL contraction with regards to $t$ is easier to work with than the KL contraction itself (e.g., Section 4.5), we can fix $\delta t$ and equivalently maximize the derivative, which gives the Type II MKC objective function:
$$\operatorname{argmax}_\theta\; \frac{\partial}{\partial t}\mathrm{KL}(\Phi_t\{p\}\|\Phi_t\{q_\theta\})\bigg|_{t=0}. \qquad (8)$$

- Type III: In the case where we have access to a set of different KL contraction operators, $\{\Phi_1,\cdots,\Phi_m\}$, we can implement the minimum KL contraction principle by finding the optimal $\theta$ that minimizes their average KL contraction, as
$$\operatorname{argmin}_\theta\; \frac{1}{m}\sum_{i=1}^m \left(\mathrm{KL}(p\|q_\theta) - \beta_i\,\mathrm{KL}(\Phi_i\{p\}\|\Phi_i\{q_\theta\})\right). \qquad (9)$$
As each KL contraction in the sum is nonnegative, Eq. (9) is zero if and only if each KL contraction is zero. If the consistency of $p$ and $q_\theta$ with regards to $\Phi_i$ corresponds to certain constraints on $q_\theta$, the objective function, Eq. (9), represents the consistency of all such constraints.
Under some special cases, minimizing Eq. (9) to zero over a sufficient number of certain types of KL contraction operators may indeed ensure equality of $p$ and $q_\theta$ (e.g., Section 4.6).

4.1 Fitting a Gaussian Model with a KL Contraction Operator from a Gaussian Distribution

We first describe an instance of MKC learning in a very simple setting, where we approximate a distribution $p(x)$, for $x \in \mathbb{R}$ with known mean $\mu_0$ and variance $\sigma_0^2$, with a Gaussian model $q_\theta$ whose mean and variance are the parameters to be estimated, $\theta = (\mu, \sigma^2)$. Using the strict KL contraction operator $\Phi^c_T$ constructed with a Gaussian conditional distribution
$$T(y|x) = \frac{1}{\sqrt{2\pi\sigma_1^2}}\exp\left(-\frac{(y-x)^2}{2\sigma_1^2}\right),$$
with known variance $\sigma_1^2$, we form the Type I MKC objective function. In this simple case, Eq. (7) reduces to a closed form objective function, as
$$\operatorname{argmin}_{\mu,\sigma^2}\; \left[\frac{\sigma_0^2}{2\sigma^2} - \frac{\sigma_0^2+\sigma_1^2}{2(\sigma^2+\sigma_1^2)} + \frac{1}{2}\log\frac{\sigma^2}{\sigma^2+\sigma_1^2} + \frac{\sigma_1^2(\mu-\mu_0)^2}{2\sigma^2(\sigma^2+\sigma_1^2)}\right],$$
whose optimal solution, $\mu = \mu_0$ and $\sigma^2 = \sigma_0^2$, is obtained by direct differentiation. The detailed derivation of this result is omitted due to the limit of space.
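The closed-form objective above can be checked numerically. The sketch below is illustrative and not from the paper: the values of $\mu_0$, $\sigma_0^2$, and $\sigma_1^2$ are arbitrary, and a simple grid search recovers the stated optimum $(\mu, \sigma^2) = (\mu_0, \sigma_0^2)$:

```python
import numpy as np

# Numerical check of the closed-form Type I MKC objective for the
# Gaussian example: the minimizer should be (mu, sigma^2) = (mu0, sigma0^2),
# independent of the operator parameter sigma1^2.
mu0, s0 = 1.3, 2.0        # known mean and variance of p
s1 = 0.7                  # variance of the Gaussian conditional T(y|x)

def mkc_objective(mu, s):
    return (s0 / (2 * s) - (s0 + s1) / (2 * (s + s1))
            + 0.5 * np.log(s / (s + s1))
            + s1 * (mu - mu0) ** 2 / (2 * s * (s + s1)))

# Grid search over (mu, sigma^2); both grids contain the true values.
mus = np.linspace(0.0, 3.0, 301)
ss = np.linspace(0.5, 5.0, 451)
M, S = np.meshgrid(mus, ss)
vals = mkc_objective(M, S)
i, j = np.unravel_index(np.argmin(vals), vals.shape)
assert abs(M[i, j] - mu0) < 0.02 and abs(S[i, j] - s0) < 0.02
```

Changing `s1` leaves the located minimizer unchanged, which matches the remark that the optimum does not depend on the operator parameter.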
Note that the optimal parameters do not rely on the parameter of the KL contraction operator (in this case, $\sigma_1^2$), and are the same as those obtained by minimizing the KL divergence between $p$ and $q_\theta$, or equivalently, by maximizing the log likelihood when samples from $p(x)$ are used to approximate the expectation.

4.2 Relation with Contrastive Divergence [12]

Next, we consider the general strict KL contraction operator $\Phi^c_{T_\theta}$ constructed from a conditional distribution, $T_\theta(y|x)$, for $x, y \in \mathbb{R}^d$, of which the parametric model $q_\theta$ is a fixed point, as $q_\theta(y) = \int_{\mathbb{R}^d} T_\theta(y|x)q_\theta(x)\,dx = \Phi^c_{T_\theta}\{q_\theta\}(y)$. In other words, $q_\theta$ is the equilibrium distribution of the Markov chain whose transition distribution is given by $T_\theta(y|x)$. The Type I objective function of minimum KL contraction, Eq. (7), for $p, q_\theta \in \Omega_d$ under $\Phi^c_{T_\theta}$ is
$$\operatorname{argmin}_\theta\; \mathrm{KL}(p\|q_\theta) - \mathrm{KL}\!\left(\Phi^c_{T_\theta}\{p\}\,\big\|\,\Phi^c_{T_\theta}\{q_\theta\}\right) = \operatorname{argmin}_\theta\; \mathrm{KL}(p\|q_\theta) - \mathrm{KL}(p_\theta\|q_\theta),$$
where $p_\theta$ is shorthand for $\Phi^c_{T_\theta}\{p\}$. Note that this is the objective function of contrastive divergence learning [12]. However, the dependency of $p_\theta$ on $\theta$ makes this objective function difficult to optimize. By ignoring this dependency, the practical parameter update in contrastive divergence only approximately follows the gradient of this objective function [5].

4.3 Relation with Partial Likelihood [7] and Non-local Contrastive Objectives [31]

Next, we consider the Type I MKC objective function, Eq. (7), combined with the KL contraction operator constructed from lumping.
Using Lemma 4, we have
$$\operatorname{argmin}_\theta\; \mathrm{KL}(p\|q_\theta) - \mathrm{KL}\!\left(\Phi^l_S\{p\}\,\big\|\,\Phi^l_S\{q_\theta\}\right) = \operatorname{argmin}_\theta\; \sum_{i=1}^m P^S_i\,\mathrm{KL}(\tilde{p}_i\|\tilde{q}_{\theta,i})$$
$$= \operatorname{argmax}_\theta\; \sum_{i=1}^m P^S_i \int_{x \in S_i} \tilde{p}_i(x)\log\tilde{q}_{\theta,i}(x)\,dx \approx \operatorname{argmax}_\theta\; \frac{1}{n}\sum_{k=1}^n \sum_{i=1}^m \mathbf{1}[x^{(k)} \in S_i]\,\log\tilde{q}_{\theta,i}(x^{(k)}),$$
where $\{x^{(1)},\cdots,x^{(n)}\}$ are samples from $p(x)$. Minimizing the KL contraction in this case is equivalent to maximizing the weighted sum of log likelihoods of the probability distributions formed by restricting the overall model to subsets of the state space. The last step resembles the partial likelihood objective function [7], which was recently rediscovered in the context of discriminative learning as non-local contrastive objectives [31]. In [31], the partitions are required to overlap with each other, while the above result shows that non-overlapping partitions of $\mathbb{R}^d$ can also be used for non-ML parameter learning.

4.4 Relation with Noise Contrastive Estimation [11]

Next, we consider the Type I MKC objective function, Eq. (7), combined with the strict KL contraction operator constructed from the binary mixture operation (Lemma 3).
In particular, we simplify Eq. (7) using the definition of $\Phi^b_g$, as
$$\operatorname{argmin}_\theta\; \mathrm{KL}(p\|q_\theta) - \frac{1}{\pi_1}\,\mathrm{KL}\!\left(\Phi^b_g\{p\}\,\big\|\,\Phi^b_g\{q_\theta\}\right)$$
$$= \operatorname{argmin}_\theta\; \frac{1}{\pi_1}\int_{\mathbb{R}^d} (\pi_0 g(x) + \pi_1 p(x))\log(\pi_0 g(x) + \pi_1 q_\theta(x))\,dx - \int_{\mathbb{R}^d} p(x)\log q_\theta(x)\,dx$$
$$= \operatorname{argmax}_\theta\; \int_{\mathbb{R}^d} p(x)\log\frac{\pi_1 q_\theta(x)}{\pi_0 g(x) + \pi_1 q_\theta(x)}\,dx + \frac{\pi_0}{\pi_1}\int_{\mathbb{R}^d} g(x)\log\frac{\pi_0 g(x)}{\pi_0 g(x) + \pi_1 q_\theta(x)}\,dx.$$
When the expectations in the above objective function are approximated with averages over samples from $p(x)$ and $g(x)$, $\{x^{(1)},\cdots,x^{(n_+)}\}$ and $\{y^{(1)},\cdots,y^{(n_-)}\}$, the Type I MKC objective function in this case reduces to
$$\operatorname{argmax}_\theta\; \frac{1}{n_+}\sum_{k=1}^{n_+} \log\frac{\pi_1 q_\theta(x^{(k)})}{\pi_0 g(x^{(k)}) + \pi_1 q_\theta(x^{(k)})} + \frac{\pi_0}{\pi_1}\,\frac{1}{n_-}\sum_{k=1}^{n_-} \log\frac{\pi_0 g(y^{(k)})}{\pi_0 g(y^{(k)}) + \pi_1 q_\theta(y^{(k)})}.$$
If we set $\pi_0 = \pi_1 = 1/2$, and treat $\{x^{(1)},\cdots,x^{(n_+)}\}$ and $\{y^{(1)},\cdots,y^{(n_-)}\}$ as data of interest and noise, respectively, the above objective function can also be interpreted as minimizing the Bayesian classification error of data and noise, which is the objective function of noise-contrastive estimation [11].

4.5 Relation with Score Matching [14]

Next, we consider the strict KL contraction operator, $\Phi^c_{T_t}$, constructed from an isotropic Gaussian conditional distribution with a time decaying variance (i.e., a Gaussian diffusion process):
$$T_t(y|x) = \frac{1}{(2\pi t)^{d/2}}\exp\left(-\frac{\|y-x\|^2}{2t}\right),$$
where $t \in [0,\infty)$ is the continuous temporal index.
Note that we have $\Phi^c_{T_0}\{p\} = p$ for any $p \in \Omega_d$. If both $p(x)$ and $q_\theta(x)$ are differentiable with regards to $x$, it is known that the temporal derivative of the KL contraction of $p$ and $q_\theta$ under $\Phi^c_{T_t}$ has a closed form, which is formally stated in the following result.

Lemma 5 (Lyu [25]) For any two distributions $p, q \in \Omega_d$ differentiable with regards to $x$, we have
$$\frac{d}{dt}\mathrm{KL}\!\left(\Phi^c_{T_t}\{p\}\,\big\|\,\Phi^c_{T_t}\{q_\theta\}\right) = -\frac{1}{2}\int_{\mathbb{R}^d} \Phi^c_{T_t}\{p\}(x)\left\|\frac{\nabla_x\Phi^c_{T_t}\{p\}(x)}{\Phi^c_{T_t}\{p\}(x)} - \frac{\nabla_x\Phi^c_{T_t}\{q_\theta\}(x)}{\Phi^c_{T_t}\{q_\theta\}(x)}\right\|^2 dx, \qquad (10)$$
where $\nabla_x$ is the gradient operator with regards to $x$.

Setting $t = 0$ in Eq. (10), we obtain a closed form for the Type II MKC objective function, Eq. (8), which can be further simplified [14], as
$$\operatorname{argmax}_\theta\; \frac{d}{dt}\mathrm{KL}(\Phi_t\{p\}\|\Phi_t\{q_\theta\})\bigg|_{t=0} = \operatorname{argmin}_\theta\; \frac{1}{2}\int_{\mathbb{R}^d} p(x)\left\|\frac{\nabla_x p(x)}{p(x)} - \frac{\nabla_x q_\theta(x)}{q_\theta(x)}\right\|^2 dx$$
$$= \operatorname{argmin}_\theta\; \int_{\mathbb{R}^d} p(x)\left(\|\nabla_x\log q_\theta(x)\|^2 + 2\Delta_x\log q_\theta(x)\right) dx \approx \operatorname{argmin}_\theta\; \frac{1}{n}\sum_{k=1}^n \left(\left\|\nabla_x\log q_\theta(x^{(k)})\right\|^2 + 2\Delta_x\log q_\theta(x^{(k)})\right),$$
where $\{x^{(1)},\cdots,x^{(n)}\}$ are samples from $p(x)$, and $\Delta_x$ is the Laplacian operator with regards to $x$. The last step is the objective function of score matching learning [14].

4.6 Relation with Conditional Composite Likelihood [24] and Pseudo-Likelihood [3]

Next, we consider the Type I MKC
objective function, Eq. (7), combined with the KL contraction operator, $\Phi^m_A$, constructed from marginalization. According to Lemma 2, we have
$$\operatorname{argmin}_\theta\; \mathrm{KL}(p\|q_\theta) - \mathrm{KL}(\Phi^m_A\{p\}\|\Phi^m_A\{q_\theta\}) = \operatorname{argmax}_\theta\; \int_{\mathbb{R}^d} p(x)\log q_{\backslash A|A}(x_{\backslash A}|x_A)\,dx \approx \operatorname{argmax}_\theta\; \frac{1}{n}\sum_{k=1}^n \log q_{\backslash A|A}(x^{(k)}_{\backslash A}|x^{(k)}_A),$$
where in the last step the expectation over $p(x)$ is replaced with an average over samples $\{x^{(1)},\cdots,x^{(n)}\}$ from $p(x)$. This corresponds to the objective function of maximum conditional likelihood [17] or maximum mutual information [2], which are non-ML learning objectives for discriminative learning of high dimensional probabilistic data models.

However, Lemma 2 also shows that $\mathrm{KL}(p\|q_\theta) - \mathrm{KL}(\Phi^m_A\{p\}\|\Phi^m_A\{q_\theta\}) = 0$ is not sufficient to guarantee $p = q_\theta$. Alternatively, we can use the Type III MKC objective function, Eq. (9), to combine KL contraction operators formed from marginalizations over $m$ different index subsets $A_1,\cdots,A_m$:
$$\operatorname{argmin}_\theta\; \frac{1}{m}\sum_{i=1}^m \left(\mathrm{KL}(p\|q_\theta) - \mathrm{KL}(\Phi^m_{A_i}\{p\}\|\Phi^m_{A_i}\{q_\theta\})\right) \approx \operatorname{argmax}_\theta\; \frac{1}{n}\sum_{k=1}^n \frac{1}{m}\sum_{i=1}^m \log q_{\backslash A_i|A_i}(x^{(k)}_{\backslash A_i}|x^{(k)}_{A_i}).$$
This is the objective function of conditional composite likelihood [24, 30, 23, 1] (also rediscovered under the name piecewise learning in [26]).

A special case of conditional composite likelihood arises when $A_i = \backslash\{i\}$; the resulting marginalization operator, $\Phi^m_{\backslash\{i\}}$, is known as the $i$th singleton marginalization operator. With the $d$ different singleton marginalization operators, we can rewrite the objective function as
$$\mathrm{KL}(p\|q) - \frac{1}{d}\sum_{i=1}^d \mathrm{KL}\!\left(\Phi^m_{\backslash\{i\}}\{p\}\,\big\|\,\Phi^m_{\backslash\{i\}}\{q\}\right) = \frac{1}{d}\sum_{i=1}^d \int_{\mathbb{R}^{d-1}} p_{\backslash i}(x_{\backslash i})\,\mathrm{KL}\!\left(p_{i|\backslash i}(\cdot|x_{\backslash i})\,\big\|\,q_{i|\backslash i}(\cdot|x_{\backslash i})\right) dx_{\backslash i}.$$
Furthermore, when approximating the expectations with averages over samples from p(x), we have

  argmin_θ KL(p ‖ q) − (1/d) Σ_{i=1}^d KL(Φm_{\{i}}{p} ‖ Φm_{\{i}}{q}) ≈ argmax_θ (1/n) Σ_{k=1}^n (1/d) Σ_{i=1}^d log q_{i|\i}(x^{(k)}_i | x^{(k)}_{\i}),

which is the objective function in maximum pseudo-likelihood learning [3, 29].

4.7 Relation with Marginal Composite Likelihood

We now consider combining the Type III MKC objective function, Eq.(9), with the KL contraction operator constructed from the marginal grafting operation. Specifically, with m different KL contraction operators constructed from marginal grafting on index subsets A_1, ···, A_m, using Lemma 2, we can expand the corresponding Type III minimum KL contraction objective function as:

  argmin_θ KL(p ‖ q) − (1/m) Σ_{i=1}^m KL(Φg_{p,A_i}{p} ‖ Φg_{p,A_i}{q}) = argmin_θ (1/m) Σ_{i=1}^m KL(p_{A_i}(x_{A_i}) ‖ q_{A_i}(x_{A_i}))
    = argmax_θ (1/m) Σ_{i=1}^m ∫ p_{A_i}(x_{A_i}) log q_{A_i}(x_{A_i}) dx_{A_i} ≈ argmax_θ (1/n) Σ_{k=1}^n (1/m) Σ_{i=1}^m log q_{A_i}(x^{(k)}_{A_i}).

The last step, which maximizes the log likelihood of a set of marginal distributions on the training data, corresponds to the objective function of marginal composite likelihood [30].
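The pseudo-likelihood objective above avoids the partition function entirely, because each singleton conditional q_{i|\i} normalizes over a single variable. A minimal numpy sketch for a small Ising model makes this explicit; the model choice, coupling values, and test configuration are illustrative assumptions, not from the paper.

```python
import numpy as np

# Pseudo-likelihood sketch for a small Ising model q(x) ∝ exp(x^T W x / 2),
# x in {-1,+1}^d, with W symmetric and zero-diagonal. Each singleton
# conditional q(x_i | x_\i) = sigmoid(2 x_i sum_j W_ij x_j) is normalized
# over one binary variable, so Z(W) is never computed.
# Coupling values and the test configuration are illustrative choices.

W = np.array([[ 0.0, 0.8, -0.3],
              [ 0.8, 0.0,  0.5],
              [-0.3, 0.5,  0.0]])

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(z / 2.0))        # numerically stable form

def pseudo_log_likelihood(W, X):
    """Average of log q(x_i | rest) over samples (rows of X) and coordinates i."""
    H = X @ W                                    # H[k, i] = sum_j W_ij x_j
    return np.mean(np.log(sigmoid(2.0 * X * H)))

# Sanity check: the sigmoid conditional matches the one computed by brute
# force from the (normalized) joint on this 3-variable model.
x = np.array([1.0, -1.0, 1.0])
unnorm = lambda v: np.exp(0.5 * v @ W @ v)
x_plus, x_minus = x.copy(), x.copy()
x_plus[0], x_minus[0] = 1.0, -1.0
exact = unnorm(x_plus) / (unnorm(x_plus) + unnorm(x_minus))
assert np.isclose(exact, sigmoid(2.0 * x_plus[0] * (W[0] @ x_plus)))
```

Maximizing `pseudo_log_likelihood` over W is then a tractable surrogate for maximum likelihood in models where Z(W) is intractable.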
With m = 1, the resulting objective is used in the maximum marginal likelihood or Type-II likelihood learning [9].

5 Discussions

In this work, based on an information geometric view of parameter learning, we have described minimum KL contraction as a unifying principle for non-ML parameter learning, showing that the objective functions of several existing non-ML parameter learning methods can all be understood as instantiations of this principle with different KL contraction operators.

There are several directions in which we would like to extend the current work. First, the proposed minimum KL contraction framework may be further generalized using the more general f-divergence [8], of which the KL divergence is a special case. With the more general framework, we hope to reveal further relations among other types of non-ML learning objectives [16, 25, 28, 27]. Second, in the current work, we have focused on the idealization of parametric learning as matching probability distributions. In practice, learning is most often performed on a finite data set with an unknown underlying distribution. In such cases, asymptotic properties of the estimator as the data volume increases, such as consistency, become essential. While many of the non-ML learning methods covered in this work have been shown to be consistent individually, the unification based on minimum KL contraction may provide a general condition for such asymptotic properties. Last, understanding different existing non-ML learning objectives through minimizing KL contraction also provides a principled approach to devising new non-ML learning methods, by seeking new KL contraction operators, or new combinations of existing KL contraction operators.

Acknowledgement The author would like to thank Jascha Sohl-Dickstein, Michael DeWeese and Michael Gutmann for helpful discussions on an early version of this work. This work is supported
This work is supported\nby the National Science Foundation under the CAREER Award Grant No. 0953373.\n\nReferences\n\n[1] Arthur U. Asuncion, Qiang Liu, Alexander T. Ihler, and Padhraic Smyth. Learning with blocks:\n\nComposite likelihood and contrastive divergence. In AISTATS, 2010. 2, 7\n\n[2] L. Bahl, P. Brown, P. de Souza, and R. Mercer. Maximum mutual information estimation of\n\nhidden markov model parameters for speech recognition. In ICASSP, 1986. 1, 2, 7\n\n[3] J. Besag. Statistical analysis of non-lattice data. The Statistician, 24:179\u201395, 1975. 1, 2, 7, 8\n[4] D. Brook. On the distinction between the conditional probability and the joint probability\napproaches in the speci\ufb01cation of nearest-neighbor systems. Biometrika, 3/4(51):481\u2013483,\n1964. 7\n\n8\n\n\f[5] M. \u00b4A. Carreira-Perpi\u02dcn\u00b4an and G. E. Hinton. On contrastive divergence learning. In AISTATS,\n\n2005. 6\n\n[6] T. Cover and J. Thomas. Elements of Information Theory. Wiley-Interscience, 2nd edition,\n\n2006. 2, 3\n\n[7] D. R. Cox. Partial likelihood. Biometrika, 62(2):pp. 269\u2013276, 1975. 1, 2, 6\n[8] I. Csisz\u00b4ar and P. C. Shields. Information theory and statistics: A tutorial. Foundations and\n\nTrends in Communications and Information Theory, 1(4):417\u2013528, 2004. 4, 8\n\n[9] I.J. Good. The Estimation of Probabilities: An Essay on Modern Bayesian Methods. MIT\n\nPress, 1965. 1, 2, 8\n\n[10] M. Gutmann and J. Hirayama. Bregman divergence as general framework to estimate un-\nnormalized statistical models. In Conference on Uncertainty in Arti\ufb01cial Intelligence (UAI),\nBarcelona, Spain, 2011. 2\n\n[11] M. Gutmann and A. Hyv\u00a8arinen. Noise-contrastive estimation: A new estimation principle for\n\nunnormalized statistical models. In AISTATS, 2010. 1, 2, 6\n\n[12] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural\n\nComputation, 14:1771\u20131800, 2002. 1, 2, 6\n\n[13] P. J. Huber. Projection pursuit. 
The Annals of Statistics, 13(2):435–475, 1985.
[14] A. Hyvärinen. Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6:695–709, 2005.
[15] A. Hyvärinen. Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks, 18(5):1529–1531, 2007.
[16] A. Hyvärinen. Some extensions of score matching. Computational Statistics & Data Analysis, 51:2499–2512, 2007.
[17] T. Jebara and A. Pentland. Maximum conditional likelihood via bound maximization and the CEM algorithm. In NIPS, 1998.
[18] R. Kindermann and J. L. Snell. Markov Random Fields and Their Applications. American Mathematical Society, 1980.
[19] E. Kreyszig. Introductory Functional Analysis with Applications. Wiley, 1989.
[20] S. L. Lauritzen. Statistical manifolds. In Differential Geometry in Statistical Inference, pages 163–216, 1987.
[21] L. Le Cam. Maximum likelihood — an introduction. ISI Review, 58(2):153–171, 1990.
[22] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang. Tutorial on energy-based learning. In Predicting Structured Data. MIT Press, 2006.
[23] P. Liang and M. I. Jordan. An asymptotic analysis of generative, discriminative, and pseudo-likelihood estimators. In International Conference on Machine Learning, 2008.
[24] B. G. Lindsay. Composite likelihood methods. Contemporary Mathematics, 80(1):22–39, 1988.
[25] S. Lyu. Interpretation and generalization of score matching. In UAI, 2009.
[26] A. McCallum, C. Pal, G. Druck, and X. Wang. Multi-conditional learning: Generative/discriminative training for clustering and classification. In Association for the Advancement of Artificial Intelligence (AAAI), 2006.
[27] M. Pihlaja, M. Gutmann, and A.
Hyvärinen. A family of computationally efficient and simple estimators for unnormalized statistical models. In UAI, 2010.
[28] J. Sohl-Dickstein, P. Battaglino, and M. DeWeese. Minimum probability flow learning. In ICML, 2011.
[29] D. Strauss and M. Ikeda. Pseudolikelihood estimation for social networks. Journal of the American Statistical Association, 85:204–212, 1990.
[30] C. Varin and P. Vidoni. A note on composite likelihood inference and model selection. Biometrika, 92(3):519–528, 2005.
[31] D. Vickrey, C. Lin, and D. Koller. Non-local contrastive objectives. In ICML, 2010.