{"title": "Fisher Efficient Inference of Intractable Models", "book": "Advances in Neural Information Processing Systems", "page_first": 8793, "page_last": 8803, "abstract": "Maximum Likelihood Estimators (MLE) has many good properties. For example, the asymptotic variance of MLE solution attains equality of the asymptotic Cram{\\'e}r-Rao lower bound (efficiency bound), which is the minimum possible variance for an unbiased estimator. However, obtaining such MLE solution requires calculating the likelihood function which may not be tractable due to the normalization term of the density model. In this paper, we derive a Discriminative Likelihood Estimator (DLE) from the Kullback-Leibler divergence minimization criterion implemented via density ratio estimation and a Stein operator. We study the problem of model inference using DLE. We prove its consistency and show that the asymptotic variance of its solution can attain the equality of the efficiency bound under mild regularity conditions. We also propose a dual formulation of DLE which can be easily optimized. Numerical studies validate our asymptotic theorems and we give an example where DLE successfully estimates an intractable model constructed using a pre-trained deep neural network.", "full_text": "Fisher Ef\ufb01cient Inference of Intractable Models\n\nSong Liu\n\nUniversity of Bristol\n\nThe Alan Turing Institute, UK\nsong.liu@bristol.ac.uk\n\nTakafumi Kanamori\n\nTokyo Institute of Technology,\n\nRIKEN, Japan\n\nkanamori@c.titech.ac.jp\n\nWittawat Jitkrittum\nMax Planck Institute\n\nfor Intelligent Systems, Germany\nwittawat@tuebingen.mpg.de\n\nYu Chen\n\nUniversity of Bristol, UK\nyc14600@bristol.ac.uk\n\nAbstract\n\nMaximum Likelihood Estimators (MLE) has many good properties. For example,\nthe asymptotic variance of MLE solution attains equality of the asymptotic Cram\u00e9r-\nRao lower bound (ef\ufb01ciency bound), which is the minimum possible variance for\nan unbiased estimator. 
However, obtaining such an MLE solution requires calculating the likelihood function, which may not be tractable due to the normalization term of the density model. In this paper, we derive a Discriminative Likelihood Estimator (DLE) from the Kullback-Leibler divergence minimization criterion implemented via density ratio estimation and a Stein operator. We study the problem of model inference using DLE. We prove its consistency and show that the asymptotic variance of its solution can attain the efficiency bound under mild regularity conditions. We also propose a dual formulation of DLE which can be easily optimized. Numerical studies validate our asymptotic theorems and we give an example where DLE successfully estimates an intractable model constructed using a pre-trained deep neural network.

1 Introduction

Maximum Likelihood Estimation (MLE) has been a classic choice for density parameter estimation. It can be derived from the Kullback-Leibler (KL) divergence minimization criterion, and the resulting algorithm simply maximizes the likelihood function (log-density function) over a set of observations. The solution of MLE has many attractive asymptotic properties: the asymptotic variance of the MLE solution reaches an asymptotic lower bound over all unbiased estimators [5, 24].
However, learning via MLE requires evaluating the normalization term of the density function; it may be challenging to apply MLE to learn a complex model that has a computationally intractable normalization term. A partial solution to this problem is approximating the normalization term or the gradient of the likelihood function numerically. Many methods along this line of research have been actively studied: importance-sampling MLE [25], contrastive divergence [12] and, more recently, amortized MLE [33].
While the computation of the normalization term is mitigated, these sampling-based approximate methods come at the expense of extra computational burden and estimation errors.
The issue of intractable normalization terms has led to the development of approaches other than KL divergence minimization. For example, Score Matching (SM) [13] minimizes the Fisher divergence [26] between the data distribution and a model distribution which is specified by the gradient (with respect to the input variable) of its log density function. Its computation does not require the evaluation of the normalization term, thus SM does not suffer from the intractability issue. Extensions of SM have been applied to infinite dimensional exponential family models [28], non-negative models [14, 35] and high dimensional graphical model fitting [17].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Other than the Fisher divergence, a kernel-based divergence measure known as Kernel Stein Discrepancy (KSD) [4, 19] has been proposed as a test statistic for goodness-of-fit testing, to measure the difference between a data distribution and a model distribution without the hassle of evaluating the normalization term. It reformulates the kernel Maximum Mean Discrepancy (MMD) [9] with a Stein operator [29, 8, 23], which is also defined using the gradient of the log density function. For the same reason as in SM, the KSD can be estimated when applied to a density model with an intractable normalizer. The last few years have seen many applications of KSD, such as variational inference [18], sampling [23, 3], and score function estimation [16, 27], among others. KSD minimization is a natural candidate criterion for fitting intractable models [2]. However, the divergence measure defined by the KSD is directly characterized by the kernel used.
Unlike in the case of goodness-of-fit testing, where the kernel may be chosen by maximizing the test power [15], to date there is no clear objective for choosing the right kernel in the case of model fitting.
By contrast, the KL divergence has been a classic discrepancy measure for model fitting. The question that we address is: can we construct a generic model inference method by minimizing the KL divergence without knowledge of the normalization term? In this paper, we present a novel unnormalized model inference method, Discriminative Likelihood Estimation (DLE), following the KL divergence minimization criterion. The algorithm uses a technique called Density Ratio Estimation [31], which is conventionally used to estimate the ratio between two density functions from two sets of samples. We adapt this method to estimate the ratio between a data distribution and an unnormalized density model with the help of a Stein operator. We then use the estimated ratio to construct a surrogate to the KL divergence, which is later minimized to fit the parameters of an unnormalized density function. The resulting algorithm is a min-max problem, which we show can be conveniently converted into a min-min problem using Lagrangian duality. No extra sampling steps are required. We further prove the consistency and asymptotic properties of DLE under mild conditions. One of our major contributions is a proof that the proposed estimator can attain the asymptotic Cramér-Rao bound. Numerical experiments validate our theories and show that DLE indeed performs well under realistic settings.

2 Background

2.1 Problem: Intractable Model Inference via KL Divergence Minimization

Consider the problem of estimating the parameter θ of a probability density model p(x; θ) from a set of i.i.d. samples Xq := {x_q^(i)}_{i=1}^{nq} i.i.d.∼ Q, where Q is a probability distribution whose density function is q(x). One idea is minimizing the sample-approximated KL divergence from p_θ to q:

min_θ KL[q|p_θ] = min_θ E_q[log(q(x)/p(x; θ))] = C − max_θ E_q[log p(x; θ)]
              ≈ C − max_θ (1/nq) Σ_{i=1}^{nq} log p(x_q^(i); θ),

where C is a constant that does not depend on θ. The last line uses Xq to approximate the expectation over q(x). This technique is known as Maximum Likelihood Estimation (MLE).
Despite many advantages, MLE is unfit for intractable model inference. Consider for instance a density model p(x; θ) := p̄(x; θ)/Z(θ), where p̄(x; θ) is a positive multilayer neural network parametrized by θ and Z(θ) = ∫ p̄(x; θ)dx is the normalization term which guarantees that p(x; θ) integrates to 1 over its domain. In this example, Z(θ) does not have a computationally tractable form; therefore, MLE cannot be used without approximating the likelihood function or its gradient using numerical methods such as Markov chain Monte Carlo (MCMC).
However, there is an alternative approach to minimizing the KL divergence: KL[q|p_θ] is an expectation of the log-ratio log(q(x)/p(x; θ)) with respect to the data distribution q(x). If we have access to q(x)/p(x; θ), we can approximate this KL by taking the average of the density ratio function over the samples Xq, and the density model parameter θ can be subsequently estimated by minimizing this approximation to the KL divergence.

2.2 Two Sample Density Ratio Estimation

Traditionally, Density Ratio Estimation (DRE) [30, 31] refers to estimating the ratio of two unknown densities from their samples. Given two sets of i.i.d.
samples drawn separately from distributions Q and P, Xq := {x_q^(i)}_{i=1}^{nq} ∼ Q and Xp := {x_p^(i)}_{i=1}^{np} ∼ P, x_q, x_p ∈ R^d, where Q and P have density functions q(x) and p(x) respectively, we hope to estimate the ratio q(x)/p(x).
We can model the density ratio using a function r(x; δ) parameterized by δ. To obtain the parameter δ, we minimize the KL divergence KL[q|q_δ], where q(x; δ) := r(x; δ)p(x):

min_δ KL[q|q_δ]  s.t.  ∫ r(x; δ)p(x)dx = 1.    (1)

KL[q|q_δ] comprises three terms, of which only one depends on the parameter δ:

KL[q|q_δ] = E_q[log q(x)] − E_q[log r(x; δ)] − E_q[log p(x)] ≈ −(1/nq) Σ_{i=1}^{nq} log r(x_q^(i); δ) + C,    (2)

where the last step uses Xq to approximate the expectation over q(x) and C is a constant irrelevant to δ. We can also approximate the equality constraint in (1) using Xp:

∫ r(x; δ)p(x)dx ≈ (1/np) Σ_{j=1}^{np} r(x_p^(j); δ).    (3)

Combining (2) and (3), we get a sample version of (1):

δ̂ := argmin_δ −(1/nq) Σ_{i=1}^{nq} log r(x_q^(i); δ) + C  s.t.  (1/np) Σ_{j=1}^{np} r(x_p^(j); δ) = 1.    (4)

The above optimization is called the Kullback-Leibler Importance Estimation Procedure (KLIEP) [30]. Unfortunately, it cannot be directly used to estimate our ratio q(x)/p(x; θ), since we only have samples from q(x) but not from p(x; θ). Consequently, the equality constraint ∫ r(x; δ)p(x; θ)dx = 1 can no longer be approximated using samples.
A natural remedy to this problem is to draw samples from p(x; θ) using sampling techniques such as MCMC which, in general, can be costly when p(x; θ) is complex. Correlation among samples drawn from an MCMC scheme further complicates estimation of the ratio.
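As a concrete reference point, the two-sample KLIEP objective (4) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the Gaussian basis functions, the SLSQP solver, and all variable names are assumptions made here for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Two-sample KLIEP: estimate r(x) = q(x)/p(x) from samples of q and p.
rng = np.random.default_rng(1)
xq = rng.normal(0.5, 1.0, 500)   # samples from q = N(0.5, 1)
xp = rng.normal(0.0, 1.0, 500)   # samples from p = N(0, 1)

centers = np.linspace(-2.0, 3.0, 10)   # centers of the Gaussian basis

def phi(x):
    # Basis functions of a linear-in-parameter ratio model r(x; d) = phi(x) @ d.
    return np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2)

Pq, Pp = phi(xq), phi(xp)

# Objective (2): maximize the sample-averaged log-ratio over samples of q ...
def neg_obj(d):
    return -np.mean(np.log(Pq @ d + 1e-12))

# ... subject to the sample-approximated normalization constraint (3).
constraint = {"type": "eq", "fun": lambda d: np.mean(Pp @ d) - 1.0}

d0 = np.ones(len(centers))
d0 /= np.mean(Pp @ d0)           # feasible starting point
res = minimize(neg_obj, d0, method="SLSQP",
               bounds=[(0.0, None)] * len(centers), constraints=[constraint])
d_hat = res.x                    # -res.fun approximates KL[q|p]
```

The constraint line is exactly the step that breaks down when only an unnormalized density p̄(x; θ) is available instead of samples from p, which motivates the Stein construction of the next section.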
More importantly, regardless of the feasibility of sampling from p(x; θ), the availability of an explicit (possibly unnormalized) density p(x; θ) is much more valuable than just samples, especially in a high dimensional space where samples rarely capture the fine-grained structural information present in the density model p(x; θ).
In this work, we propose a new procedure, Stein Density Ratio Estimation, which can directly use the (unnormalized) density p, as it is, without sampling from it. Moreover, the new procedure (described in Section 3.1) yields a density ratio model r_θ(x; δ) for the ratio function q(x)/p(x; θ) that automatically satisfies the aforementioned equality constraint for all θ.

3 Stein Density Ratio Estimation

Let us consider a linear-in-parameter density ratio model r(x; δ) := δᵀf(x), where f(x) is a "feature function" that transforms a data point x into a more powerful representation. To better model q(x)/p(x; θ), we define a family of feature functions called Stein features.

3.1 Stein Features

Suppose we have a feature function f(x): R^d → R^b and a density model p(x; θ): R^d → R. A Stein feature T_θf(x) ∈ R^b with respect to p(x; θ) is T_θf(x) := [T_θf_1(x), ..., T_θf_i(x), ..., T_θf_b(x)]ᵀ, where T_θ is a Stein operator [29, 8, 4, 23] and T_θf_i(x) ∈ R is defined as

T_θf_i(x) := ⟨∇_x log p(x; θ), ∇_x f_i(x)⟩ + trace(∇²_x f_i(x)),

where f_i is the i-th output of the function f, ∇_x f_i(x) is the gradient of f_i(x) and ∇²_x f_i(x) is the Hessian of f_i(x). Note that computing T_θf(x) does not require evaluating the normalization term Z(θ), as

∇_x log p(x; θ) = ∇_x log p̄(x; θ) − ∇_x log Z(θ), where ∇_x log Z(θ) = 0.

Example 1.
Let p(x; θ) be in the exponential family with sufficient statistic ψ(x). Then

T_θf_i(x) = θᵀ J_xψ(x) ∇_x f_i(x) + trace[∇²_x f_i(x)],

where J_xψ(x) ∈ R^{dim(θ)×d} is the Jacobian of ψ(x) and dim(θ) is the dimension of θ.
One more example can be found in Appendix, Section A.1. A slightly different Stein operator was introduced in [4, 23], where T'_θf(x) ∈ R for f(x) ∈ R^d is defined as T'_θf(x) := Σ_{i=1}^d [∂_{x_i} log p(x; θ)] f_i(x) + ∂_{x_i} f_i(x), where ∂_{x_i} f(x) is the partial derivative of f(x) with respect to x_i. We can see the relationship between T and T': T_θf_i(x) = T'_θ∇_x f_i(x). Next we show an important property of Stein features.
Proposition 1 (Stein's Identity). Suppose p(x; θ) > 0,

∀i,j  lim_{|x_j|→∞} p(x_1, ···, x_j, ···, x_d; θ) ∂_{x_j} f_i(x_1, ···, x_j, ···, x_d) = 0,

p(x; θ) is continuously differentiable and f_i is twice continuously differentiable for all θ and i. Then E_{p_θ}[T_θf(x)] = 0 for all θ.
We give a proof in Appendix, Section B.1. Similar identities were given in previous literature, such as Lemma 2.2 in [19] or Lemma 5.1 in [4]. Utilizing this property, we can construct a density ratio model which bypasses the "intractable equality constraint" issue when estimating q(x)/p(x; θ), as shown in the next section.

3.2 Stein Density Ratio Modeling and Estimation (SDRE)

Define a linear-in-parameter density ratio model r_θ(x; δ) := δᵀT_θf(x) + 1 using a Stein feature function.
We can see that E_{p_θ}[r_θ(x; δ)] = E_{p_θ}[δᵀT_θf(x) + 1] = δᵀE_{p_θ}[T_θf(x)] + 1 = 1, where the last equality is ensured by Proposition 1 for all δ and θ if the specified regularity conditions are met. This equality means the constraint in (1) is automatically satisfied with this density ratio model. Now we can solve (4) without its equality constraint:

δ̂ := argmin_{δ∈R^b} −(1/nq) Σ_{i=1}^{nq} log r_θ(x_q^(i); δ) + C = argmax_{δ∈R^b} (1/nq) Σ_{i=1}^{nq} log[δᵀT_θf(x_q^(i)) + 1].    (5)

It can be seen that (5) is an unconstrained concave maximization problem. Note that for all x_q ∈ Xq, r_θ(x_q; δ̂) must be strictly positive thanks to the log-barrier (see e.g., Section 17.2 in [22]) in our objective function. However, it is not possible to guarantee that r_θ(x; δ̂) is positive for all x ∈ R^d. This is not a problem in this paper, as the density ratio function is only used for approximating the KL divergence, and we will not evaluate r_θ(x; δ̂) at a data point x that is outside of Xq. Note that the unnormalized density model p̄(x; θ), by definition, should be non-negative everywhere for all θ.
We refer to the objective (5) as Stein Density Ratio Estimation (SDRE). One may notice that (1/nq) Σ_{i=1}^{nq} log r_θ(x_q^(i); δ̂) evaluated at δ̂ is exactly the sample average of the estimated log-ratio over Xq, which allows us to approximate the KL divergence from p(x; θ) to q(x).

4 Intractable Model Inference via Discriminative Likelihood Estimation

Let ℓ(δ̂, θ) := (1/nq) Σ_{i=1}^{nq} log r_θ(x_q^(i); δ̂).
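The constraint-free property of this ratio model is easy to check numerically. The sketch below is illustrative only: it assumes p(x; θ) = N(θ, I) and the feature f_i(x) = x_i², which are choices made for this example and not taken from the paper. Stein's identity forces the sample mean of T_θf to vanish under p_θ, so r_θ(x; δ) = δᵀT_θf(x) + 1 averages to one for any δ.

```python
import numpy as np

# Stein feature for p(x; theta) = N(theta, I) with f_i(x) = x_i**2:
# grad_x log p(x; theta) = -(x - theta), grad f_i = 2*x_i*e_i,
# trace(Hessian of f_i) = 2, hence (T_theta f)(x)_i = -2*(x_i - theta_i)*x_i + 2.
def stein_feature(x, theta):
    return -2.0 * (x - theta) * x + 2.0

rng = np.random.default_rng(0)
theta = np.array([0.7, -1.2])
x = rng.normal(theta, 1.0, size=(200_000, 2))   # samples from p(x; theta)

# Stein's identity (Proposition 1): E_p[T_theta f(x)] = 0 ...
mean_feat = stein_feature(x, theta).mean(axis=0)   # close to [0, 0]

# ... so r_theta(x; delta) = delta^T T_theta f(x) + 1 integrates to one
# against p(x; theta) for every delta: no equality constraint is needed.
delta = np.array([0.3, -0.5])
mean_ratio = (stein_feature(x, theta) @ delta + 1.0).mean()   # close to 1
```

Only samples from p are used here for verification; the point of SDRE is that the identity holds analytically, so no samples from p are needed at estimation time.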
We will use ℓ(δ̂, θ) as a replacement of KL[q(x)|p(x; θ)]. The rationale of minimizing the KL divergence from p(x; θ) to q(x) then leads to:

min_θ ℓ(δ̂, θ)  or equivalently  min_θ max_δ ℓ(δ, θ).    (6)

The equivalence is due to the fact that ℓ evaluated at the optimal ratio parameter δ̂ is also the maximum of the DRE objective function when optimized with respect to δ. The outer problem minimizes ℓ with respect to the density parameter θ. We call this estimator Discriminative Likelihood Estimation (DLE), as the parameter of the density model p(x; θ) is learned via minimizing a discriminator¹, namely the likelihood ratio function ℓ(δ̂, θ) measuring the differences between q(x) and p(x; θ).

4.1 Consistency with Correct Model

For brevity, we state all theorems assuming all regularity conditions in Proposition 1 are met.
Notations: H is ∇²_{(δ,θ)}ℓ(δ, θ), the full Hessian of ℓ(δ, θ). H_{δ,θ} is ∇_δ∇_θℓ(δ, θ), the submatrix of the Hessian whose rows and columns are indexed by δ and θ respectively. s ∈ R^{dim(θ)} is ∇_θ log p(x; θ) evaluated at θ*, the score function of p(x; θ). λ(·) is the eigenvalue operator; λ_min(·) and λ_max(·) are the minimum and maximum eigenvalues, and ‖·‖ is the operator norm.
We study the consistency of the following estimator under a correct model setting:

(δ̂, θ̂) := argmin_{θ∈Θ} max_{δ∈Δ_{nq}} ℓ(δ, θ),    (7)

where Θ and Δ_{nq} are compact parameter spaces for θ and δ respectively. The compactness condition is among a set of conditions commonly used in classic consistency proofs (see e.g., Wald's Consistency Proof, Section 5.2.1 in [32]).
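To make the min-max estimator (7) concrete, here is a minimal numerical sketch on a tractable Gaussian location model (σ² known, f(x) = x, as in Example 2 later). The nested grid search and scalar solver are illustrative assumptions, not the optimization used in the paper. In this special case the inner maximum is zero exactly when the model mean matches the sample mean, so DLE recovers the MLE.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
sigma2 = 1.0
x = rng.normal(1.5, np.sqrt(sigma2), size=2000)   # data from q = N(1.5, 1)

def inner_max(theta):
    """max over delta of (1/nq) sum_i log r_theta(x_i; delta): the inner problem of (6).

    For p(x; theta) = N(theta, sigma2) and f(x) = x, the Stein feature is
    T_theta f(x) = d/dx log p(x; theta) = -(x - theta) / sigma2.
    """
    t = -(x - theta) / sigma2
    neg_ell = lambda d: -np.mean(np.log1p(d * t))   # concave in d, negated
    res = minimize_scalar(neg_ell, bounds=(-0.2, 0.2), method="bounded")
    return -res.fun

# Outer minimization over theta by a simple grid search.
grid = np.linspace(1.0, 2.0, 101)
theta_hat = grid[np.argmin([inner_max(th) for th in grid])]
# theta_hat matches the sample mean (the MLE here) up to the grid resolution.
```

The inner value is nonnegative (δ = 0 always achieves ℓ = 0), and its first-order condition at δ = 0 vanishes only when the sample average of the Stein feature is zero, i.e. when θ equals the sample mean: a small-scale illustration of the discriminator "pinpointing" the remaining mismatch.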
It is possible to derive weaker conditions given specific choices of f or p(x; θ). However, in the current manuscript we only focus on generic settings and conditions that give rise to estimation consistency and useful asymptotic theories. We assume they are properly chosen so that (θ̂, δ̂) is the saddle point of (7).
First, we assume our density model p(x; θ) is correctly specified:
Assumption 1. There exists a unique pair of parameters (θ*, δ*), θ* ∈ Θ, δ* ∈ Δ_{nq}, such that p(x; θ*) ≡ q(x) and r_{θ*}(x; δ*) = 1.
Given how r_θ(x; δ) is constructed in Section 3.2, the above assumption implies that δ* must be 0.
Assumption 2. There exist constants Λ_min > 0, Λ'_min > 0 and Λ_max > 0 so that ∀θ ∈ Θ, δ ∈ Δ_{nq}:

λ_min{−H_{δ,δ}} ≥ Λ'_min,  Λ_max ≥ ‖H_{θ,δ}H_{δ,δ}^{−1}‖,  λ_min{−H_{θ,δ}H_{δ,δ}^{−1}H_{δ,θ}} ≥ Λ_min > 2‖H_{θ,θ}‖.

The lower-boundedness of λ_min{−H_{δ,δ}} implies the strict concavity of ℓ(δ, θ) with respect to δ (ℓ(δ, θ) is already concave by construction, see (5)): for all θ ∈ Θ, there exists a unique δ̂(θ) that maximizes the likelihood ratio, which means the likelihood ratio function always has sufficient discriminative power to precisely pinpoint the differences between our data and the current model θ.
It also ensures that δ can "teach" the model parameter θ well, by assuming that the "interaction" between δ and θ in our estimator, H_{θ,δ}, is well-behaved.
Now we analyze Assumption 2 in a special case:
Proposition 2. Let Δ_{nq} := {δ | 1/C_ratio ≤ r_θ(x; δ) ≤ C_ratio, ‖δ‖₂ ≤ T/σ(nq), ∀θ ∈ Θ, ∀x ∈ Xq}, where T > 0 and C_ratio > 1 are constants and σ(·) is a monotone-increasing function. Suppose p(x; θ) is in the exponential family with sufficient statistic ψ(x) and the Stein feature is chosen as T_θψ(x). Suppose there exist constants C₂, C₃, C₄, C₅, Λ''_max, Λ''_min > 0 such that, with high probability and for all θ ∈ Θ, the sample moments of the Jacobians J_xψ(x_q^(i)) and of the Stein features T_θψ(x_q^(i)) are bounded as C₂ ≥ (1/nq) Σ_{i=1}^{nq} ‖J_xψ(x_q^(i))‖⁴, λ_min((1/nq) Σ_{i=1}^{nq} J_xψ(x_q^(i))J_xψ(x_q^(i))ᵀ) ≥ C₃, ‖J_xψ(x_q^(i))J_xψ(x_q^(i))ᵀ‖ ≤ C₄, ‖J_xψ(x_q^(i))‖ ≤ C₅, and Λ''_max ≥ λ((1/nq) Σ_{i=1}^{nq} T_θψ(x_q^(i))T_θψ(x_q^(i))ᵀ) ≥ Λ''_min (see Appendix, Section B.3 for the precise statement). Then there exists a constant N > 1 such that when nq ≥ N, Assumption 2 holds with high probability.

¹The word "discriminator" is borrowed from GAN [7]. Indeed, DLE and GAN bear many resemblances.

The proof can be found in Appendix, Section B.3.
Note that in practice the domain constraint of Δ_{nq} in this proposition can be easily enforced via convex constraints or penalty terms. Analyses of a few other examples can be found in Appendix, Section A.2.
Proposition 2 gives us some hints on how the feature function f of the Stein feature can be chosen. In the case of the exponential family, the choice f = ψ guarantees Assumption 2 to hold with high probability as nq increases.
Assumption 3 (Concentration of Stein features). The difference between the sample average of the Stein feature T_{θ*}f(x) and its expectation over q converges to zero in ℓ₂ norm in probability:

‖(1/nq) Σ_{i=1}^{nq} T_{θ*}f(x_q^(i)) − E_q[T_{θ*}f(x)]‖₂ →^P 0.

Note that if Assumption 1 holds at the same time, Proposition 1 indicates E_q[T_{θ*}f(x)] ≡ 0. This assumption holds due to the (strong) law of large numbers, given that E_q[T_{θ*}f(x)] exists.
Theorem 1 (Consistency). Suppose Assumptions 1, 2 and 3 hold. Then (δ̂, θ̂) →^P (0, θ*).
See Section B.4 in the Appendix for the proof. This theorem states that as nq increases, all saddle points of (7) converge to the vicinity of the true parameters. All the following theorems rely on the result of Theorem 1.

4.2 Asymptotic Variance of θ̂ and Fisher Efficiency of DLE

In this section we state one of our main contributions: DLE can attain the efficiency bound, i.e., the asymptotic Cramér-Rao bound, when f(x) is appropriately chosen. First, we show that our estimator θ̂ has a simple asymptotic distribution which allows us to perform model inference. To state the theorem, we need an extra assumption on the Hessian H:
Assumption 4 (Uniform Convergence on H).
sup_{δ∈Δ_{nq}, θ∈Θ} |H_{i,j} − E_q[H_{i,j}]| →^P 0, ∀i,j.
This assumption states that the second order derivatives (which are averages over the samples in Xq) converge uniformly to their population means as nq → ∞. It helps us control the residual in the second order Taylor expansion in our proof. This assumption may be weakened given specific choices of f and p(x; θ), but we focus on establishing the asymptotic results in generic settings, so this condition is only listed as an assumption.
Theorem 2 (Asymptotic Normality of θ̂). Suppose Assumptions 1, 2, 3 and 4 hold. Then

√nq (θ* − θ̂) ⇝ N[0, V],  V = (−E_q[H*_{θ,δ}] E_q[H*_{δ,δ}]^{−1} E_q[H*_{δ,θ}])^{−1},    (8)

where H* is H evaluated at (δ*, θ*).
See Section B.5 in the Appendix for the proof. In practice, we do not know E_q[H*], so we may use Ĥ, the Hessian of ℓ(δ, θ) evaluated at (δ̂, θ̂), as an approximation to E_q[H*].
Although MLE is also asymptotically normal, important quantities such as the Fisher Information Matrix may not be efficiently computed for intractable models. In comparison, Theorem 2 enables us to compute parameter confidence intervals for DLE even on an intractable p_θ.
Now we consider the asymptotic efficiency of DLE with respect to specific choices of Stein features. Let V_f be the asymptotic variance (8) using a Stein feature with a specific choice of f.
Lemma 3. Suppose that Assumptions 1, 2, 3 and 4 hold and E_q[T_{θ*}f(x)T_{θ*}f(x)ᵀ] is invertible. Moreover, suppose that the integration and the derivative in ∂_{θ_i} ∫ p(x; θ)T_θf(x)dx are exchangeable for all i. Then
V_f = (E_q[s T_{θ*}f(x)ᵀ] E_q[T_{θ*}f(x)T_{θ*}f(x)ᵀ]^{−1} E_q[T_{θ*}f(x) sᵀ])^{−1}.

The proof is given in Section B.6 in the Appendix. Lemma 3 expresses the asymptotic variance using the score function and the Stein feature, and is used to prove that the variance monotonically decreases as the vector space spanned by the Stein feature vectors becomes larger.
Corollary 4 (Monotonicity of Asymptotic Variance). Define the inner product of functions f and g as E_q[fg]. Let T_{θ*}f(x) = [t_1, ..., t_b] and T_{θ*}f̄(x) = [t̄_1, ..., t̄_b̄] be two Stein feature vectors. Assume that span{t_1, ..., t_b} ⊂ span{t̄_1, ..., t̄_b̄}, where span{···} denotes the linear space spanned by the specified elements. Then the inequality V_f̄ ≼ V_f holds in the sense of positive definiteness.
Proof. Let us define P_f s as the orthogonal projection of s onto span{t_1, ..., t_b}. A simple calculation yields P_f s = E_q[s T_{θ*}f(x)ᵀ] E_q[T_{θ*}f(x)T_{θ*}f(x)ᵀ]^{−1} T_{θ*}f(x), and thus Lemma 3 leads to V_f^{−1} = E_q[P_f s (P_f s)ᵀ]. From the property of the orthogonal projection (see e.g., Theorem 2.23 in [34]), we have E_q[P_f̄ s (P_f̄ s)ᵀ] ≽ E_q[P_f s (P_f s)ᵀ]. Therefore, we obtain V_f̄ ≼ V_f.
For Q_f s = s − P_f s, we have E_q[ssᵀ] = E_q[P_f s (P_f s)ᵀ] + E_q[Q_f s (Q_f s)ᵀ] = V_f^{−1} + E_q[Q_f s (Q_f s)ᵀ]. Thus, we see that the asymptotic variance converges to the inverse of the Fisher information, E_q[ssᵀ]^{−1}, as P_f s gets close to s. In particular, when the linear space span{t_1, ..., t_b} includes s, Q_f s vanishes and consequently the DLE with f(x) is asymptotically efficient.
Example 2.
Let p(x; θ) be the model of the d-dimensional multivariate Gaussian distribution N(θ, Iden·σ²), where Iden is the identity matrix. Here the variance σ² is assumed to be known. The score function is s_j(x; θ) = −(x_j − θ_j)/σ², and the Stein feature vector defined from f(x) = x is (T_θx)_j = −(x_j − θ_j)/σ² for j = 1, ..., d. Clearly, the score function is included in span{t_1, ..., t_d}. Hence, the DLE with f achieves the efficiency bound of the parameter estimation.

One more example can be found in Appendix, Section A.3. In fact, Corollary 4 suggests that as long as we can represent the score function s using the Stein feature T_θf up to a linear transformation, DLE can achieve the efficiency bound. However, since f is coupled with ∇_x log p(x; θ) in T_θf, it is not always easy to reverse engineer an f from s. Nonetheless, our numerical experiments show that using simple functions such as polynomials for f yields good performance.

4.3 Model Selection of DLE

As our objective (6) tries to minimize the discrepancy between our model p(x; θ) and the data distribution, it is tempting to compare models using the objective function evaluated at (δ̂, θ̂), i.e., ℓ(δ̂, θ̂). However, the more sophisticated p(x; θ) becomes, the more likely it is to pick up spurious patterns of our dataset. Similarly, the more powerful the Stein features are, the more likely the discriminator is to be overly critical of the density model on this dataset. Thus a better model selection criterion would be comparing E_q[ℓ(δ̂, θ̂)], which eliminates the potential of overfitting a specific dataset. Unfortunately, this expectation cannot be computed without knowledge of q(x). We propose to approximate this quantity using a penalized likelihood:
Theorem 5. Suppose Assumptions 1, 2, 3 and 4 hold.
Eq\ndim(\u03b8) \u2264 b, then nqEq\n\n= min\u03b8 max\u03b4 nq(cid:96)(\u03b4, \u03b8) \u2212 b + dim(\u03b8) + op(1).\n\n(cid:3) are full-rank and\n\n(cid:3) and Eq\n\n(cid:2)H\u2217\n\n(cid:2)H\u2217\n\n\u03b4,\u03b4\n\n(cid:104)\n\n(cid:105)\n\n(cid:104)\n\n(cid:96)(\u02c6\u03b4, \u02c6\u03b8)\n\n(cid:105)\n\n(cid:96)(\u02c6\u03b4, \u02c6\u03b8)\n\n\u03b4,\u03b8\n\nSee Section B.7 in Appendix for the proof. This theorem is closely related to a classic result called\nAkaike Information Criterion (AIC) [1]. Both AIC and Theorem 5 similarly penalize the degree of\nfreedom of the density model dim(\u03b8), while our theorem also penalizes the number of ratio parameter\ndim(\u03b4) = b due to the fact that our ratio function is also \ufb01tted using samples.\nOne can also show (cid:96)(\u02c6\u03b4, \u02c6\u03b8) follows a \u03c72 distribution. See Section B.8 in Appendix for details.\nTheorem 5 provides an information-criterion based model selection method. Suppose M is a set of dif-\nferent Stein features and M(cid:48) is a set of candidate density models. We can jointly select density model\nand Stein feature: ( \u02c6m, \u02c6m(cid:48)) := arg minm(cid:48)\u2208M(cid:48) maxm\u2208M Eq[(cid:96)(\u02c6\u03b8(m(cid:48)), \u02c6\u03b4(m))], where (\u02c6\u03b8(m(cid:48)), \u02c6\u03b4(m))\nare estimated parameters under the model choice (m(cid:48), m). Replacing Eq[(cid:96)(\u02c6\u03b8(m(cid:48)), \u02c6\u03b4(m))] with the\npenalized likelihood derived in Theorem 5, we can get a practical model selection method.\n\n5 Lagrangian Dual of SDRE and DLE by Minimization\n\nSome techniques can be used to directly optimize the min-max problem in (6), such as performing\ngradient descend/ascend on \u03b8 and \u03b4 alternately. However, looking for the saddle points of a min-max\noptimization is hard. In this section, we derive a partial Lagrangian dual for (6) so we can convert this\n\n7\n\n\f(a) qqplot of marginals\n\n\u221a\n\n(b)\n\nnq \u02c6\u03b8 vs. 
min-max problem into a min-min problem whose local optima can be efficiently found by existing optimization techniques.

Figure 1: Theoretical predictions vs. empirical results. (a) qqplot of marginals; (b) √n_q θ̂ vs. C.I.; (c) Var(θ̂), Gamma dist.; (d) Var(θ̂), Gaussian mix.

Proposition 3. The SDRE problem in (5) has the Lagrangian dual
$$\hat{\mu} = \operatorname*{argmin}_{\mu}\ \sum_{i=1}^{n_q} \big[-\log(-\mu_i) - 1\big] - \sum_{i=1}^{n_q} \mu_i \quad \text{s.t.} \quad \sum_{i=1}^{n_q} \mu_i\, T_\theta f(x_q^{(i)}) = 0. \qquad (9)$$
Moreover, the duality gap between (9) and (5) is 0, and r_θ(x_q^{(i)}; δ̂) = −1/μ̂_i.

See Section B.9 in the Appendix for its proof. Instead of solving the min-max problem (6), we solve the following constrained minimization problem:
$$\min_{\theta}\ \min_{\mu}\ \sum_{i=1}^{n_q} \big[-\log(-\mu_i) - 1\big] - \sum_{i=1}^{n_q} \mu_i, \quad \text{s.t.} \quad \sum_{i=1}^{n_q} \mu_i\, T_\theta f(x_q^{(i)}) = 0, \qquad (10)$$
where we replace the inner max problem in (6) with its Lagrangian dual (9). All experiments in this paper are performed using the Lagrangian dual objective (10). See https://github.com/lamfeeling/Stein-Density-Ratio-Estimation for code demos on SDRE and model inference.

6 Related Works

Score Matching (SM) [13, 14] is an inference method for unnormalized statistical models. It estimates model parameters by minimizing the Fisher divergence [20, 26] between the true log density and the model log density. To estimate the parameters, this method only requires ∇_x log p(x; θ) and ∇²_x log p(x; θ), thereby avoiding the normalization term.
Kernel Stein Discrepancy (KSD) [2] is a kernel mean discrepancy measure between a data distribution and a model density using the Stein identity defined on a Stein operator T′_{p_θ}. This measure has been used for model evaluation [4, 19].
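As a concrete illustration of this estimation principle (a sketch, not the paper's implementation), the following minimizes an empirical squared KSD over θ for a one-dimensional Gaussian location model N(θ, 1), assuming an RBF kernel k(x, y) = exp(−(x − y)²/(2h)) and the V-statistic form of the discrepancy:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ksd2(theta, x, h=1.0):
    """Empirical (V-statistic) squared KSD for the model N(theta, 1)
    with an RBF kernel k(x, y) = exp(-(x - y)^2 / (2 h))."""
    s = theta - x                    # score of N(theta, 1): d/dx log p(x) = theta - x
    d = x[:, None] - x[None, :]      # pairwise differences x_i - x_j
    k = np.exp(-d**2 / (2 * h))
    # Stein kernel: u(x, y) = s(x)s(y)k + s(x) dk/dy + s(y) dk/dx + d2k/dxdy
    u = (s[:, None] * s[None, :] * k
         + s[:, None] * (d / h) * k
         + s[None, :] * (-d / h) * k
         + (1.0 / h - d**2 / h**2) * k)
    return u.mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)   # data drawn from N(0, 1)
res = minimize_scalar(ksd2, bounds=(-3.0, 3.0), args=(x,), method="bounded")
print(res.x)                          # estimate close to the true mean 0
```

The minimizer is a kernel-weighted analogue of the sample mean; as the experiments in Section 7 suggest, such KSD estimates need not attain the Cramér-Rao bound.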
In Section 7, we minimize such a discrepancy with respect to θ for unnormalized model parameter estimation. A more generic version of this estimator is discussed in [2].
Noise Contrastive Estimation (NCE) [10] estimates the parameters of an unnormalized statistical model by performing a non-linear logistic regression to discriminate between the observed dataset and artificially generated noise. The normalization term can be treated like a regular parameter in the statistical model and estimated through such a logistic regression. NCE requires us to select a noise distribution; in our experiments, we use a multivariate Gaussian distribution with the same mean and variance as Xq.

7 Experiments

7.1 Validation of Asymptotic Results

To examine the asymptotic distribution of √n_q[θ̂ − θ*], we design an intractable exponential family model p̄(x; θ) := exp[η(θ)⊤ψ(x)], where
$$\psi(x) := \Big[\sum_{i=1}^{d} x_i^2,\; x_1 x_2,\; \sum_{i=3}^{d} x_1 x_i,\; \tanh(x)\Big]^{\top}, \quad \eta(\theta) := [-.5, .6, .2, 0, 0, 0, \theta]^{\top}, \quad x \in \mathbb{R}^5,\; \theta \in \mathbb{R}^2,$$
and tanh(x) is applied in an element-wise fashion. The feature function of the Stein feature is chosen as f(x) := tanh(x) ∈ R⁵. Due to the tanh function, p̄(x; θ) does not have a closed-form normalization term. We draw n_q = 500 samples from p(x; 0) as Xq. Since we set θ* = 0, Xq actually comes from a tractable distribution; however, the intractability of p̄(x; θ) does not allow us to perform MLE straight away.

Figure 2: MNIST images with the highest (upper red box) and the lowest unnormalized density (lower green box) estimated on each digit by DLE and NCE.

We run DLE 6000 times with a new batch of Xq each time and obtain an empirical distribution of √n_q θ̂. We show qqplots of its marginal distributions vs.
N(0, V₁,₁) and N(0, V₂,₂), the asymptotic distributions predicted by Theorem 2, whose variance V is approximated using Xq and θ̂. Figure 1a shows that all quantiles of the empirical marginals and the predicted marginals are very well aligned. We also scatter-plot √n_q θ̂ together with the predicted 95% and 99.9% confidence intervals in Figure 1b. It can be seen that the empirical joint distribution of √n_q θ̂ has the same elongated shape as predicted by Theorem 2 and agrees with the predicted confidence intervals nicely.
One of our major contributions is proving that DLE attains the Cramér-Rao bound. We now compare the variances of the estimated parameter θ̂ across DLE, SM and KSD, using a Gamma model p(x; θ) = Γ(5, θ), θ* = 1, and a Gaussian mixture model p(x; θ) = .5N(θ, 1) + .5N(1, 1), θ* = −1. Var_{n_q}[θ̂] is shown in Figures 1c and 1d. For DLE, we set f(x) := [x, x²] and for KSD, we use a polynomial kernel of degree 2. Note that we deliberately choose p(x; θ) to be tractable so we can compute the MLE and the Cramér-Rao bound easily. It can be seen that all estimators have decreasing variances and that MLE, being one of the minimum-variance estimators, has the lowest variance. However, DLE has the second lowest variance in both cases and quickly converges to the Cramér-Rao bound after n_q = 150. In comparison, both KSD and SM maintain higher levels of variance.

7.2 Unnormalized Model Using Pre-trained Deep Neural Network (DNN)

In this experiment, we create an exponential family model p̄(x; θ_i) := exp[θ_i⊤ψ(x)], x ∈ R⁷⁸⁴, where ψ(x) ∈ R²⁰ is a pre-trained 3-layer DNN. ψ(x) is trained using a logistic regression so that the classification error is minimized on the full MNIST dataset over all digits. Clearly, p̄(x; θ_i) does not have a tractable normalization term.
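Although the normalizer of p̄(x; θ_i) is intractable, it is a constant in x, so the unnormalized log-density θ_i⊤ψ(x) alone suffices to score and rank samples. A minimal sketch, where a random matrix W stands in for the pre-trained DNN features and theta for a fitted parameter vector (both hypothetical stand-ins, not the paper's trained model):

```python
import numpy as np

def unnormalized_log_density(X, theta, psi):
    """log p_bar(x; theta) = theta^T psi(x); the (intractable) log-normalizer
    is shared by all x, so it cancels in any comparison between samples."""
    return psi(X) @ theta

rng = np.random.default_rng(0)
W = rng.normal(size=(784, 20))          # stand-in for the pre-trained feature network
psi = lambda X: np.tanh(X @ W)          # hypothetical 20-dim feature map psi(x)
theta = rng.normal(size=20)             # stand-in for the estimated parameters of one digit

X = rng.normal(size=(100, 784))         # stand-in for n_q = 100 images of that digit
scores = unnormalized_log_density(X, theta, psi)
order = np.argsort(scores)              # ascending: outlier candidates first
lowest, highest = order[:2], order[-2:] # bottom-two and top-two ranked samples
```

Replacing the stand-ins with real MNIST images, the trained ψ, and a DLE estimate θ̂_i reproduces the ranking used in Figure 2.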
The dataset Xq_i contains n_q = 100 randomly selected images of a single digit i, and we use DLE and NCE to estimate θ̂_i for each digit i. For DLE, we set f(x) = ψ(x). Though we can only obtain an unnormalized density for each digit, it can be used to rank images and find potential inliers and outliers.
In Figure 2 we show images that are ranked either among the top two or bottom two places when sorted by log p̄(x; θ̂_i), for each digit i. It can be seen that, when θ̂ is estimated by DLE, the highest ranked images are indeed typical-looking images, while the lowest ranking images tend to be outliers in that digit group. In comparison, when θ̂ is estimated by NCE, some of the highest ranked images are distorted while some of the lowest ranked images look very regular. This experiment shows the usefulness of DLE as a model inference method when working with a complex model (a DNN) on a high-dimensional dataset (d = 784) using a relatively small number of samples (n_q = 100).

8 Conclusion and Discussion

In this paper, we introduce a model inference method for unnormalized statistical models. First, Stein density ratio estimation is used to fit a ratio and to approximate the KL divergence. Model inference is then done by minimizing this approximated KL divergence. Despite promising theoretical and experimental results, future work is needed to establish a systematic way of choosing Stein features in different scenarios, as the performance of DLE depends heavily on such choices.

Acknowledgements

This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1. TK was partially supported by JSPS KAKENHI Grant Numbers 15H01678, 15H03636, 16K00044, and 19H04071. The authors would like to thank Dr. Carl Henrik Ek and three anonymous reviewers for their insightful comments.

References

[1] H. Akaike. A new look at the statistical model identification.
IEEE Transactions on Automatic Control, 19(6):716–723, 1974.

[2] A. Barp, F-X. Briol, A. Duncan, M. Girolami, and L. Mackey. Minimum Stein discrepancy estimators. arXiv:1906.08283, 2019.

[3] W. Y. Chen, L. Mackey, J. Gorham, F. X. Briol, and C. Oates. Stein points. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 844–853, 2018.

[4] K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pages 2606–2615, 2016.

[5] H. Cramér. Mathematical Methods of Statistics. Princeton University Press, 1946.

[6] J. Ding and A. Zhou. Eigenvalues of rank-one updated matrices with some applications. Applied Mathematics Letters, 20(12):1223–1226, 2007.

[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680, 2014.

[8] J. Gorham and L. Mackey. Measuring sample quality with Stein's method. In Advances in Neural Information Processing Systems 28, pages 226–234, 2015.

[9] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

[10] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.

[11] J. Hayakawa and A. Takemura. Estimation of exponential-polynomial distribution by holonomic gradient descent. Communications in Statistics - Theory and Methods, 45(23):6860–6882, 2016.

[12] G. E. Hinton. Training products of experts by minimizing contrastive divergence.
Neural Computation, 14(8):1771–1800, 2002.

[13] A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709, 2005.

[14] A. Hyvärinen. Some extensions of score matching. Computational Statistics & Data Analysis, 51(5):2499–2512, 2007.

[15] W. Jitkrittum, W. Xu, Z. Szabó, K. Fukumizu, and A. Gretton. A linear-time kernel goodness-of-fit test. In Advances in Neural Information Processing Systems 30, pages 261–270, 2017.

[16] Y. Li and R. E. Turner. Gradient estimators for implicit models. In International Conference on Learning Representations, 2018.

[17] L. Lin, M. Drton, and A. Shojaie. Estimation of high-dimensional graphical models using regularized score matching. Electronic Journal of Statistics, 10(1):806, 2016.

[18] Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems 29, pages 2378–2386, 2016.

[19] Q. Liu, J. D. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the 33rd International Conference on Machine Learning, pages 276–284, 2016.

[20] S. Lyu. Interpretation and generalization of score matching. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 359–366. AUAI Press, 2009.

[21] J. K. Merikoski and R. Kumar. Inequalities for spreads of matrix sums and products. Applied Mathematics E-Notes [electronic only], 4:150–159, 2004.

[22] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, NY, USA, second edition, 2006.

[23] C. J. Oates, M. Girolami, and N. Chopin.
Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695–718, 2017.

[24] C. R. Rao. Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37:81–91, 1945.

[25] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, 2005.

[26] P. Sánchez-Moreno, A. Zarzo, and J. S. Dehesa. Jensen divergence based on Fisher's information. Journal of Physics A: Mathematical and Theoretical, 45(12), 2012.

[27] J. Shi, S. Sun, and J. Zhu. A spectral approach to gradient estimation for implicit distributions. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4644–4653, 2018.

[28] B. Sriperumbudur, K. Fukumizu, A. Gretton, A. Hyvärinen, and R. Kumar. Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research, 18(1):1830–1888, 2017.

[29] C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory, pages 583–602. University of California Press, 1972.

[30] M. Sugiyama, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems 20, pages 1433–1440, 2008.

[31] M. Sugiyama, T. Suzuki, and T. Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.

[32] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998.

[33] D. Wang and Q. Liu.
Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.

[34] H. Yanai, K. Takeuchi, and Y. Takane. Projection Matrices, Generalized Inverse Matrices, and Singular Value Decomposition. Springer-Verlag New York, 2011.

[35] S. Yu, M. Drton, and A. Shojaie. Generalized score matching for non-negative data. arXiv preprint arXiv:1812.10551, 2018.