{"title": "Minimum Stein Discrepancy Estimators", "book": "Advances in Neural Information Processing Systems", "page_first": 12964, "page_last": 12976, "abstract": "When maximum likelihood estimation is infeasible, one often turns to score matching, contrastive divergence, or minimum probability flow to obtain tractable parameter estimates. We provide a unifying perspective of these techniques as minimum Stein discrepancy estimators, and use this lens to design new diffusion kernel Stein discrepancy (DKSD) and diffusion score matching (DSM) estimators with complementary strengths. We establish the consistency, asymptotic normality, and robustness of DKSD and DSM estimators, then derive stochastic Riemannian gradient descent algorithms for their efficient optimisation. The main strength of our methodology is its flexibility, which allows us to design estimators with desirable properties for specific models at hand by carefully selecting a Stein discrepancy. We illustrate this advantage for several challenging problems for score matching, such as non-smooth, heavy-tailed or light-tailed densities.", "full_text": "Minimum Stein Discrepancy Estimators

Alessandro Barp, Department of Mathematics, Imperial College London, a.barp16@imperial.ac.uk
François-Xavier Briol, Department of Statistical Science, University College London, f.briol@ucl.ac.uk
Andrew B. Duncan, Department of Mathematics, Imperial College London, a.duncan@imperial.ac.uk
Mark Girolami, Department of Engineering, University of Cambridge, mag92@eng.cam.ac.uk
Lester Mackey, Microsoft Research, Cambridge, MA, USA, lmackey@microsoft.com

Abstract

When maximum likelihood estimation is infeasible, one often turns to score matching, contrastive divergence, or minimum probability flow to obtain tractable parameter estimates.
We provide a unifying perspective of these techniques as minimum Stein discrepancy estimators, and use this lens to design new diffusion kernel Stein discrepancy (DKSD) and diffusion score matching (DSM) estimators with complementary strengths. We establish the consistency, asymptotic normality, and robustness of DKSD and DSM estimators, then derive stochastic Riemannian gradient descent algorithms for their efficient optimisation. The main strength of our methodology is its flexibility, which allows us to design estimators with desirable properties for specific models at hand by carefully selecting a Stein discrepancy. We illustrate this advantage for several challenging problems for score matching, such as non-smooth, heavy-tailed or light-tailed densities.

1 Introduction

Maximum likelihood estimation [9] is a de facto standard for estimating the unknown parameters in a statistical model $\{P_\theta : \theta \in \Theta\}$. However, the computation and optimization of a likelihood typically requires access to the normalizing constants of the model distributions. This poses difficulties for complex statistical models for which direct computation of the normalisation constant would entail prohibitive multidimensional integration of an unnormalised density. Examples of such models arise naturally in modelling images [27, 39], natural language [54], Markov random fields [61] and nonparametric density estimation [63, 69].
To bypass this issue, various approaches have been proposed to address parametric inference for unnormalised models, including Monte Carlo maximum likelihood [22], contrastive divergence [28], minimum probability flow learning [62], noise-contrastive estimation [10, 26, 27] and score matching (SM) [34, 35].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The SM estimator is a minimum score estimator [16] based on the Hyvärinen scoring rule, which avoids normalizing constants by depending on $P_\theta$ only through the gradient of its log-density $\nabla_x \log p_\theta$. SM estimators have proven to be a widely applicable method for estimation in models with unnormalised smooth positive densities, with generalisations to bounded domains [35] and compact Riemannian manifolds [51]. Despite the flexibility of this approach, SM has three important and distinct limitations. Firstly, as the Hyvärinen score depends on the Laplacian of the log-density, SM estimation will be expensive in high dimension and will break down for non-smooth models or for models in which the second derivative grows very rapidly. Secondly, as we shall demonstrate, SM estimators can behave poorly for models with heavy-tailed distributions. Thirdly, the SM estimator is not robust to outliers in many applications of interest. Each of these situations arises naturally for energy models, particularly product-of-experts models and ICA models [33].

In a separate strand of research, new approaches have been developed to measure the discrepancy between an unnormalised distribution and a sample. In [23, 25, 50, 24], it was shown that Stein's method can be used to construct discrepancies that control weak convergence of an empirical measure to a target. In this paper we consider minimum Stein discrepancy (SD) estimators and show that SM, minimum probability flow and contrastive divergence estimators are all special cases.
Within this class we focus on SDs constructed from reproducing kernel Hilbert spaces (RKHS), establishing the consistency, asymptotic normality and robustness of these estimators. We demonstrate that these SDs are appropriate for estimation of non-smooth distributions and heavy- or light-tailed distributions. The remainder of the paper is organised as follows. In Section 2 we introduce the class of minimum SD estimators, then investigate asymptotic properties of SD estimators based on kernels in Section 3, demonstrating consistency and asymptotic normality under general conditions, as well as conditions for robustness. Section 4 presents three toy problems where SM breaks down, but our new estimators are able to recover the truth. All proofs are in the supplementary materials.

2 Minimum Stein Discrepancy Estimators

Let $\mathcal{P}_{\mathcal{X}}$ be the set of Borel probability measures on $\mathcal{X}$. Given independent and identically distributed (IID) realisations from $Q \in \mathcal{P}_{\mathcal{X}}$ on an open subset $\mathcal{X} \subset \mathbb{R}^d$, the objective is to find a sequence of measures $P_n$ that approximate $Q$ in an appropriate sense. More precisely, we will consider a family $\mathcal{P}_\Theta = \{P_\theta : \theta \in \Theta\} \subset \mathcal{P}_{\mathcal{X}}$ together with a function $D : \mathcal{P}_{\mathcal{X}} \times \mathcal{P}_{\mathcal{X}} \to \mathbb{R}_+$ which quantifies the discrepancy between any two measures in $\mathcal{P}_{\mathcal{X}}$, and wish to estimate an optimal parameter $\theta^*$ satisfying $\theta^* \in \arg\min_{\theta \in \Theta} D(Q \| P_\theta)$. In practice, it is often difficult to compute the discrepancy $D$ explicitly, and it is useful to consider a random approximation $\hat{D}(\{X_i\}_{i=1}^n \| P_\theta)$ based on an IID sample $X_1, \dots, X_n \sim Q$, such that $\hat{D}(\{X_i\}_{i=1}^n \| P_\theta) \to D(Q \| P_\theta)$ almost surely as $n \to \infty$.
We then consider the sequence of estimators

$\hat{\theta}^{D}_n \in \arg\min_{\theta \in \Theta} \hat{D}(\{X_i\}_{i=1}^n \| P_\theta).$

The choice of discrepancy will impact the consistency, efficiency and robustness of the estimators. Examples of such estimators include minimum distance estimators [4, 58], where the discrepancy is a metric on probability measures, including minimum maximum mean discrepancy (MMD) estimation [18, 42, 8] and minimum Wasserstein estimation [19, 21, 6].

More generally, minimum scoring rule estimators [16] arise from proper scoring rules, for example the Hyvärinen, Bregman and Tsallis scoring rules. These discrepancies are often statistical divergences, i.e., $D(Q\|P) = 0 \Leftrightarrow P = Q$ for all $P, Q$ in a subset of $\mathcal{P}_{\mathcal{X}}$. Suppose that $P_\theta$ and $Q$ are absolutely continuous with respect to a common measure $\lambda$ on $\mathcal{X}$, with respective positive densities $p_\theta$ and $q$. Then a well-known statistical divergence is the Kullback-Leibler (KL) divergence $\mathrm{KL}(Q\|P_\theta) \equiv \int_{\mathcal{X}} \log(dQ/dP_\theta)\, dQ = \int_{\mathcal{X}} \log q \, dQ - \int_{\mathcal{X}} \log p_\theta \, dQ$. Minimising $\mathrm{KL}(Q\|P_\theta)$ is equivalent to maximising $\int_{\mathcal{X}} \log p_\theta \, dQ$, which can be estimated using the likelihood $\widehat{\mathrm{KL}}(\{X_i\}_{i=1}^n \| P_\theta) \equiv \frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i)$. Informally, we see that minimising the KL divergence is equivalent to performing maximum likelihood estimation.

For our purposes we are interested in discrepancies that can be evaluated when $P_\theta$ is only known up to normalisation, precluding the use of the KL divergence. We instead consider a related class of discrepancies based on integral probability pseudometrics (IPM) [55] and Stein's method [3, 11, 65]. Let $\Gamma(\mathcal{Y}) \equiv \Gamma(\mathcal{X},\mathcal{Y}) \equiv \{f : \mathcal{X} \to \mathcal{Y}\}$. A map $\mathcal{S}_P : \mathcal{G} \subset \Gamma(\mathbb{R}^d) \to \Gamma(\mathbb{R})$ is a Stein operator over a Stein class $\mathcal{G}$ if $\int_{\mathcal{X}} \mathcal{S}_P[f]\, dP = 0$ for all $f \in \mathcal{G}$ and any $P$. We can then define an associated Stein discrepancy (SD) [23] using an IPM with entry-dependent function space $\mathcal{F} \equiv \mathcal{S}_{P_\theta}[\mathcal{G}]$:

$\mathrm{SD}_{\mathcal{S}_{P_\theta}[\mathcal{G}]}(Q\|P_\theta) \equiv \sup_{f \in \mathcal{S}_{P_\theta}[\mathcal{G}]} \left| \int_{\mathcal{X}} f\, dP_\theta - \int_{\mathcal{X}} f\, dQ \right| = \sup_{g \in \mathcal{G}} \left| \int_{\mathcal{X}} \mathcal{S}_{P_\theta}[g]\, dQ \right|.$   (1)

The Stein discrepancy depends on $Q$ only through expectations, and does not require the existence of a density, therefore permitting $Q$ to be an empirical measure. If $P$ has a $C^1$ density $p$ on $\mathcal{X}$, one can consider the Langevin-Stein discrepancy arising from the Stein operator $\mathcal{T}_p[g] \equiv \langle \nabla \log p, g \rangle + \nabla \cdot g$ [23, 25]. In this case, the Stein discrepancy will not depend on the normalising constant of $p$.

In this paper, for an arbitrary $m \in \Gamma(\mathbb{R}^{d \times d})$, which we call the diffusion matrix, we shall consider the more general diffusion Stein operators [25]: $\mathcal{S}^m_p[g] \equiv (1/p)\, \nabla \cdot (p\, m\, g)$ and $\mathcal{S}^m_p[A] \equiv (1/p)\, \nabla \cdot (p\, m\, A)$, where $g \in \Gamma(\mathbb{R}^d)$, $A \in \Gamma(\mathbb{R}^{d \times d})$, and the associated minimum Stein discrepancy estimators which minimise (1). As we will only have access to a sample $\{X_i\}_{i=1}^n \sim Q$, we will focus on the estimators minimising an approximation $\widehat{\mathrm{SD}}_{\mathcal{S}_{P_\theta}[\mathcal{G}]}(\{X_i\}_{i=1}^n \| P_\theta)$ based on a U-statistic of the $Q$-integral:

$\hat{\theta}^{\mathrm{Stein}}_n \equiv \arg\min_{\theta \in \Theta} \widehat{\mathrm{SD}}_{\mathcal{S}_{P_\theta}[\mathcal{G}]}(\{X_i\}_{i=1}^n \| P_\theta).$

Related and complementary approaches to inference using SDs include the nonparametric estimator of [41], the density ratio approach of [47] and the variational inference algorithms of [49, 60].
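The defining zero-mean property of a Stein operator is easy to probe numerically. The following is our own illustrative sketch (not from the paper): for the Langevin-Stein operator $\mathcal{T}_p[g] = \langle \nabla \log p, g\rangle + \nabla \cdot g$ with target $P = N(0,1)$ and test function $g = \tanh$ (both arbitrary choices), a Monte Carlo estimate of $\int \mathcal{T}_p[g]\, dP$ should be near zero, while the same average under a different sampling distribution need not be.

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin_stein(x, score, g, g_prime):
    """T_p[g](x) = <score(x), g(x)> + div g(x), written here in one dimension."""
    return score(x) * g(x) + g_prime(x)

# Target P = N(0, 1): score(x) = d/dx log p(x) = -x.
score = lambda x: -x
g = np.tanh                                  # a bounded C^1 test function
g_prime = lambda x: 1.0 / np.cosh(x) ** 2

x_p = rng.normal(0.0, 1.0, size=200_000)     # sample from P
x_q = rng.normal(1.0, 1.0, size=200_000)     # sample from a different Q

mean_p = langevin_stein(x_p, score, g, g_prime).mean()
mean_q = langevin_stein(x_q, score, g, g_prime).mean()
print(mean_p)  # close to 0: the Stein identity E_P[T_p[g]] = 0
print(mean_q)  # bounded away from 0 when the sample is not from P
```

Note that the score $-x$ involves no normalising constant, which is the whole point of the construction.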
We now highlight several instances of SDs which will be studied in detail in this paper.

2.1 Example 1: Diffusion Kernel Stein Discrepancy Estimators

A convenient choice of Stein class is the unit ball of a reproducing kernel Hilbert space (RKHS) [5] of a scalar kernel function $k$. For the Langevin-Stein operator $\mathcal{T}_p$, the resulting kernel Stein discrepancy (KSD) first appeared in [57] and has since been considered extensively in the context of hypothesis testing, measuring sample quality and approximation of probability measures [12-14, 17, 24, 44, 46, 43]. In this paper, we consider a more general class of discrepancies based on the diffusion Stein operator and matrix-valued kernels.

Consider an RKHS $\mathcal{H}^d$ of functions $f \in \Gamma(\mathbb{R}^d)$ with (matrix-valued) kernel $K \in \Gamma(\mathcal{X} \times \mathcal{X}, \mathbb{R}^{d \times d})$, $K_x \equiv K(x, \cdot)$ (see Appendix A.3 and A.4 for further details). The Stein operator $\mathcal{S}^m_p[f]$ induces an operator $\mathcal{S}^{m,2}_p \mathcal{S}^{m,1}_p : \Gamma(\mathcal{X} \times \mathcal{X}, \mathbb{R}^{d \times d}) \to \Gamma(\mathbb{R})$ which acts first on the first variable and then on the second one. We briefly mention two simple examples of matrix kernels constructed from scalar kernels. If we want the components of $f$ to be orthogonal, we can use the diagonal kernel (i) $K = \mathrm{diag}(\lambda_1 k_1, \dots, \lambda_d k_d)$, where $\lambda_i > 0$ and $k_i$ is a $C^2$ kernel on $\mathcal{X}$, for $i = 1, \dots, d$; else we can "correlate" the components by setting (ii) $K = Bk$, where $k$ is a (scalar) kernel on $\mathcal{X}$ and $B$ is a (constant) symmetric positive definite matrix.

We propose to study diffusion kernel Stein discrepancies indexed by $K$ and $m$ (see Appendix B):

Theorem 1 (Diffusion Kernel Stein Discrepancy). For any kernel $K$, we find that $\mathcal{S}^m_p[f](x) = \langle \mathcal{S}^{m,1}_p K_x, f \rangle_{\mathcal{H}^d}$ for any $f \in \mathcal{H}^d$.
Moreover, if $x \mapsto \|\mathcal{S}^{m,1}_p K_x\|_{\mathcal{H}^d} \in L^1(Q)$, we have

$\mathrm{DKSD}_{K,m}(Q\|P)^2 \equiv \sup_{h \in \mathcal{H}^d,\, \|h\| \le 1} \left| \int_{\mathcal{X}} \mathcal{S}^m_p[h]\, dQ \right|^2 = \int_{\mathcal{X}} \int_{\mathcal{X}} k^0(x,y)\, dQ(x)\, dQ(y),$   (2)

where

$k^0(x,y) \equiv \mathcal{S}^{m,2}_p \mathcal{S}^{m,1}_p K(x,y) = \frac{1}{p(x)p(y)} \nabla_y \cdot \nabla_x \cdot \left( p(x)\, m(x)\, K(x,y)\, m(y)^\top p(y) \right).$

Given a sample $\{X_i\}_{i=1}^n \sim Q$, the squared discrepancy can be approximated by the U-statistic $\frac{1}{n(n-1)} \sum_{i \ne j} k^0(X_i, X_j)$, leading to the estimators $\hat{\theta}^{\mathrm{DKSD}}_n \equiv \arg\min_{\theta \in \Theta} \widehat{\mathrm{DKSD}}_{K,m}(\{X_i\}_{i=1}^n \| P_\theta)^2$.

We say a matrix kernel $K$ is in the Stein class of $Q$ if $\int_{\mathcal{X}} \mathcal{S}^{m,1}_q K\, dQ = 0$, and that $K$ is integrally positive definite (IPD) if $\int_{\mathcal{X}} \int_{\mathcal{X}} d\mu(x)^\top K(x,y)\, d\mu(y) > 0$ for any finite non-zero signed vector Borel measure $\mu$. From $\mathcal{S}^m_p[f](x) = \langle \mathcal{S}^{m,1}_p K_x, f \rangle_{\mathcal{H}^d}$ we have that $f \in \mathcal{H}^d$ is in the Stein class (i.e., $\int_{\mathcal{X}} \mathcal{S}^m_q[f]\, dQ = 0$) when $K$ is also in the class. Setting $s_p \equiv m^\top \nabla \log p \in \Gamma(\mathbb{R}^d)$:

Proposition 1 (DKSD as a Statistical Divergence). Suppose $K$ is IPD and in the Stein class of $Q$, and $m(x)$ is invertible. If $s_p - s_q \in L^1(Q)$, then $\mathrm{DKSD}_{K,m}(Q\|P)^2 = 0$ iff $Q = P$.

See Appendix B.5 for the proof. Note that this proposition generalises Proposition 3.3 from [46] to a significantly larger class of SDs. For the matrix kernels introduced above, the proposition below shows that $K$ is IPD when its associated scalar kernels are; a well-studied problem [64].

Proposition 2 (IPD Matrix Kernels). (i) Let $K = \mathrm{diag}(k_1, \dots, k_d)$. Then $K$ is IPD iff each kernel $k_i$ is IPD. (ii) Let $K = Bk$ with $B$ symmetric positive definite. Then $K$ is IPD iff $k$ is IPD.

2.2 Example 2: Diffusion Score Matching Estimators

A well-known family of estimators are the score matching (SM) estimators (based on the Fisher or Hyvärinen divergence) [34, 35]. As will be shown below, these can be seen as special cases of minimum SD estimators.
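In the scalar special case $K = k$, $m = I$ of Theorem 1 with the Langevin operator, $k^0$ reduces in one dimension to the familiar KSD kernel $k^0(x,y) = s(x)s(y)k + s(x)\partial_y k + s(y)\partial_x k + \partial_x\partial_y k$ with $s = (\log p)'$, and the U-statistic above is direct to compute. A minimal numpy sketch (illustrative choices on our part: RBF kernel, $N(\theta, 1)$ model):

```python
import numpy as np

def ksd_u_stat(x, score, ell=1.0):
    """U-statistic estimate of KSD^2 with RBF kernel k(x,y) = exp(-(x-y)^2 / (2 ell^2))."""
    d = x[:, None] - x[None, :]                  # pairwise differences x_i - x_j
    k = np.exp(-d**2 / (2 * ell**2))
    dk_dx = -d / ell**2 * k                      # d/dx k(x, y)
    dk_dy = d / ell**2 * k                       # d/dy k(x, y)
    d2k = (1 / ell**2 - d**2 / ell**4) * k       # d^2/(dx dy) k(x, y)
    s = score(x)
    k0 = (s[:, None] * s[None, :] * k
          + s[:, None] * dk_dy + s[None, :] * dk_dx + d2k)
    np.fill_diagonal(k0, 0.0)                    # U-statistic: drop i = j terms
    n = len(x)
    return k0.sum() / (n * (n - 1))

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=500)               # data from Q = N(0, 1)
score = lambda theta: (lambda z: -(z - theta))   # model N(theta, 1): score = -(x - theta)

good = ksd_u_stat(x, score(0.0))   # well-specified mean: near zero
bad = ksd_u_stat(x, score(2.0))    # misspecified mean: clearly positive
```

Minimising such a quantity over $\theta$ is exactly the minimum KSD estimator; DKSD generalises the kernel and introduces the diffusion matrix $m$.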
The SM discrepancy is computable for sufficiently smooth densities:

$\mathrm{SM}(Q\|P) \equiv \int_{\mathcal{X}} \|\nabla \log p - \nabla \log q\|_2^2 \, dQ = \int_{\mathcal{X}} \left( \|\nabla \log q\|_2^2 + \|\nabla \log p\|_2^2 + 2 \Delta \log p \right) dQ,$

where $\Delta$ denotes the Laplacian and we have used the divergence theorem. If $P = P_\theta$, the first integral above does not depend on $\theta$, and the second does not depend on the density of $Q$, so we consider the approximation $\widehat{\mathrm{SM}}(\{X_i\}_{i=1}^n \| P_\theta) \equiv \frac{1}{n} \sum_{i=1}^n \left( \Delta \log p_\theta(X_i) + \frac{1}{2} \|\nabla \log p_\theta(X_i)\|_2^2 \right)$, based on an unbiased estimate of the $\theta$-dependent part of the SM divergence, and its estimators $\hat{\theta}^{\mathrm{SM}}_n \equiv \arg\min_{\theta \in \Theta} \widehat{\mathrm{SM}}(\{X_i\}_{i=1}^n \| P_\theta)$, for independent random vectors $X_i \sim Q$.

The SM discrepancy can also be generalised to include higher-order derivatives of the log-likelihood [48] and does not require a normalised model. We will now introduce a further generalisation that we call diffusion score matching (DSM), which is an SD constructed from the diffusion Stein operator (see Appendix B.6):

Theorem 2 (Diffusion Score Matching). Let $\mathcal{X} = \mathbb{R}^d$ and consider a diffusion Stein operator $\mathcal{S}^m_p$ for some function $m \in \Gamma(\mathbb{R}^{d \times d})$ and the Stein class $\mathcal{G} \equiv \{g = (g_1, \dots, g_d) \in C^1(\mathcal{X}, \mathbb{R}^d) \cap L^2(\mathcal{X}; Q) : \|g\|_{L^2(\mathcal{X};Q)} \le 1\}$. If $p, q > 0$ are differentiable and $s_p - s_q \in L^2(Q)$, then we define the diffusion score matching divergence as the Stein discrepancy

$\mathrm{DSM}_m(Q\|P) \equiv \sup_{f \in \mathcal{S}_p[\mathcal{G}]} \left| \int_{\mathcal{X}} f\, dQ - \int_{\mathcal{X}} f\, dP \right|^2 = \int_{\mathcal{X}} \left\| m^\top (\nabla \log q - \nabla \log p) \right\|_2^2 dQ.$

This satisfies $\mathrm{DSM}_m(Q\|P) = 0$ iff $Q = P$ when $m(x)$ is invertible. Moreover, if $p$ is twice-differentiable, and $q m m^\top \nabla \log p,\ \nabla \cdot (q m m^\top \nabla \log p) \in L^1(\mathbb{R}^d)$, then Stokes' theorem gives

$\mathrm{DSM}_m(Q\|P) = \int_{\mathcal{X}} \left( \|m^\top \nabla_x \log p\|_2^2 + \|m^\top \nabla \log q\|_2^2 + 2 \nabla \cdot \left( m m^\top \nabla \log p \right) \right) dQ.$

Notably, $\mathrm{DSM}_m$ recovers SM when $m(x) m(x)^\top = I$, and the (generalised) non-negative score matching estimator of [48] with the choice $m(x) \equiv \mathrm{diag}(h_1(x_1)^{1/2}, \dots, h_d(x_d)^{1/2})$. Like standard SM, DSM is only defined for distributions with sufficiently smooth densities. The $\theta$-dependent part of $\mathrm{DSM}_m(Q\|P_\theta)$ does not depend on the density of $Q$ and can be estimated using an empirical mean, leading to the estimators $\hat{\theta}^{\mathrm{DSM}}_n \equiv \arg\min_{\theta \in \Theta} \widehat{\mathrm{DSM}}_m(\{X_i\}_{i=1}^n \| P_\theta)$ for

$\widehat{\mathrm{DSM}}_m(\{X_i\}_{i=1}^n \| P_\theta) \equiv \frac{1}{n} \sum_{i=1}^n \left( \|m^\top \nabla_x \log p_\theta\|_2^2 + 2 \nabla \cdot \left( m m^\top \nabla \log p_\theta \right) \right)(X_i),$

where $\{X_i\}_{i=1}^n$ is a sample from $Q$. Note that this is only possible if $m$ is independent of $\theta$, in contrast to DKSD, where $m$ can depend on $\mathcal{X} \times \Theta$, thus leading to a more flexible class of estimators. An interesting remark is that the $\mathrm{DSM}_m$ discrepancy may in fact be obtained as a limit of DKSD over a sequence of target-dependent kernels: see Appendix B.6 for the complete result, which corrects and significantly generalises previously established connections between the SM divergence and KSD (such as in Sec. 5 of [46]).

We conclude by commenting on the computational complexity.
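As a quick sanity check of the SM objective above (our illustration, not an experiment from the paper): for a Gaussian location model $p_\theta(x) \propto \exp(-\tfrac{1}{2}(x - \theta)^2)$ one has $\Delta \log p_\theta = -1$ and $\|\nabla \log p_\theta\|^2 = (x-\theta)^2$, so the objective is a convex quadratic in $\theta$ minimised at the sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(1.5, 1.0, size=2_000)   # data from Q = N(1.5, 1)

def sm_objective(theta, x):
    # For log p_theta(x) = -0.5 * (x - theta)^2 + const:
    #   Laplacian of log p_theta = -1,  grad_x log p_theta = -(x - theta).
    return np.mean(-1.0 + 0.5 * (x - theta) ** 2)

thetas = np.linspace(-2.0, 4.0, 601)
theta_hat = thetas[np.argmin([sm_objective(t, x) for t in thetas])]
print(theta_hat)   # grid minimiser, close to x.mean()
```

The grid search is purely illustrative; in this model the minimiser is available in closed form.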
Evaluating the DKSD loss function requires $O(n^2 d^2)$ computation, due to the U-statistic and a matrix-matrix product. However, if $K = \mathrm{diag}(\lambda_1 k_1, \dots, \lambda_d k_d)$ or $K = Bk$, and if $m$ is a diagonal matrix, then we can bypass expensive matrix products and the cost is $O(n^2 d)$, making it comparable to that of KSD. Although we do not consider these in this paper, recent approximations to KSD could also be adapted to DKSD to reduce the computational cost to $O(nd)$ [32, 36]. The DSM loss function has computational cost $O(nd^2)$, which is comparable to the SM loss. From a computational viewpoint, DSM will hence be preferable to DKSD for large $n$, whilst DKSD will be preferable to DSM for large $d$.

2.3 Further Examples: Contrastive Divergence and Minimum Probability Flow

Before analysing DKSD and DSM estimators further, we show that the class of minimum SD estimators also includes other well-known estimators for unnormalised models. Let $X^n_\theta$, $n \in \mathbb{N}$, be a Markov process with unique invariant probability measure $P_\theta$, for example a Metropolis-Hastings chain. Let $P^n_\theta$ be the associated transition semigroup, i.e. $(P^n_\theta f)(x) = \mathbb{E}[f(X^n_\theta) \,|\, X^0_\theta = x]$. Choosing the Stein operator $\mathcal{S}_p = I - P^n_\theta$ and Stein class $\mathcal{G} = \{\log p_\theta + c : c \in \mathbb{R}\}$ leads to the following SD:

$\mathrm{CD}(Q\|P_\theta) = \int_{\mathcal{X}} (\log p_\theta - P^n_\theta \log p_\theta)\, dQ = \mathrm{KL}(Q\|P_\theta) - \mathrm{KL}(Q^n_\theta\|P_\theta),$

where $Q^n_\theta$ is the law of $X^n_\theta$ with $X^0_\theta \sim Q$, and assuming that $Q \ll P_\theta$ and $Q^n_\theta \ll P_\theta$. This is the loss function associated with contrastive divergence (CD) [28, 45]. Suppose now that $\mathcal{X}$ is a finite set. Given $\theta \in \Theta$, let $P_\theta$ be the transition matrix for a Markov process with unique invariant distribution $P_\theta$. Suppose we observe data $\{x_i\}_{i=1}^n$ and let $q$ be the corresponding empirical distribution.
Choose the Stein operator $\mathcal{S}_p = I - P_\theta$ and the Stein set $\mathcal{G} = \{f \in \Gamma(\mathbb{R}) : \|f\|_\infty \le 1\}$. Note that $g \in \arg\sup_{g \in \mathcal{G}} |Q(\mathcal{S}_p[g])|$ will satisfy $g(i) = \mathrm{sgn}((q^\top (I - P_\theta))_i)$, and the resulting Stein discrepancy is the minimum probability flow loss objective function [62]:

$\mathrm{MPFL}(Q\|P) = \sum_y \left| ((I - P_\theta)^\top q)_y \right| = \sum_{y \notin \{x_i\}_{i=1}^n} \Big| \frac{1}{n} \sum_{x \in \{x_i\}_{i=1}^n} (I - P_\theta)_{xy} \Big|.$

2.4 Implementing Minimum SD Estimators: Stochastic Riemannian Gradient Descent

In order to implement the minimum SD estimators, we propose to use a stochastic gradient descent (SGD) algorithm associated to the information geometry induced by the SD on the parameter space. More precisely, consider a parametric family $\mathcal{P}_\Theta$ of probability measures on $\mathcal{X}$ with $\Theta \subset \mathbb{R}^m$. Given a discrepancy $D : \mathcal{P}_\Theta \times \mathcal{P}_\Theta \to \mathbb{R}$ satisfying $D(P_\alpha \| P_\theta) = 0$ iff $P_\alpha = P_\theta$ (called a statistical divergence), its associated information matrix field on $\Theta$ is defined as the map $\theta \mapsto g(\theta)$, where $g(\theta)$ is the symmetric bilinear form $g(\theta)_{ij} = -\frac{1}{2} (\partial^2 / \partial \alpha_i \partial \theta_j)\, D(P_\alpha \| P_\theta)|_{\alpha = \theta}$ [2]. When $g$ is positive definite, we can use it to perform (Riemannian) gradient descent on the parameter space $\Theta$. We provide below the information matrices of DKSD and DSM (hence extending results of [37]):

Proposition 3 (Information Tensor DKSD). Assume the conditions of Proposition 1 hold.
The information tensor associated to DKSD is positive semi-definite and has components

$g_{\mathrm{DKSD}}(\theta)_{ij} = \int_{\mathcal{X}} \int_{\mathcal{X}} (\nabla_x \partial_{\theta_j} \log p_\theta(x))^\top m_\theta(x)\, K(x,y)\, m_\theta^\top(y)\, \nabla_y \partial_{\theta_i} \log p_\theta(y)\, dP_\theta(x)\, dP_\theta(y).$

Proposition 4 (Information Tensor DSM). Assume the conditions of Theorem 2 hold. The information tensor defined by DSM is positive semi-definite and has components

$g_{\mathrm{DSM}}(\theta)_{ij} = \int_{\mathcal{X}} \langle m^\top \nabla \partial_{\theta_i} \log p_\theta,\ m^\top \nabla \partial_{\theta_j} \log p_\theta \rangle\, dP_\theta.$

See Appendix C for the proofs. Given an (information) Riemannian metric, recall that the gradient flow of a curve $\theta$ on the Riemannian manifold $\Theta$ is the solution to $\dot{\theta}(t) = -\nabla_{\theta(t)} \mathrm{SD}(Q\|P_\theta)$, where $\nabla_\theta$ denotes the Riemannian gradient at $\theta$. It is the curve that follows the direction of steepest decrease (measured with respect to the Riemannian metric) of the function $\mathrm{SD}(Q\|P_\theta)$ (see Appendix A.5). The well-studied natural gradient descent [1, 2] corresponds to the case in which the Riemannian manifold is $\Theta = \mathbb{R}^m$ equipped with the Fisher metric and SD is replaced by KL. When $\Theta$ is a linear manifold with coordinates $(\theta_i)$ we have $\nabla_\theta \mathrm{SD}(Q\|P_\theta) = g(\theta)^{-1} d_\theta \mathrm{SD}(Q\|P_\theta)$, where $d_\theta f$ denotes the tuple $(\partial_{\theta_i} f)$. We will approximate this at step $t$ of the descent using the biased estimator $\hat{g}_{\theta_t}(\{X^t_i\}_{i=1}^n)^{-1} d_{\theta_t} \widehat{\mathrm{SD}}(\{X^t_i\}_{i=1}^n \| P_\theta)$, where $\hat{g}_{\theta_t}(\{X^t_i\}_{i=1}^n)$ is an unbiased estimator of the information matrix $g(\theta_t)$ and $\{X^t_i \sim Q\}_i$ is a sample at step $t$.
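The descent just described preconditions a stochastic gradient by the inverse information matrix. A generic sketch of one such run (a toy quadratic loss and a fixed metric stand in for $\widehat{\mathrm{SD}}$ and $\hat{g}$; both are illustrative assumptions, not the paper's objectives):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy stand-ins: loss L(theta) = 0.5 * theta^T A theta with noisy gradients,
# and metric g = A, so the preconditioned step is Newton-like.
A = np.array([[10.0, 0.0], [0.0, 0.1]])     # badly conditioned curvature
g_metric = A                                # illustrative information matrix

theta = np.array([1.0, 1.0])
for t in range(200):
    grad = A @ theta + 0.01 * rng.normal(size=2)   # stochastic gradient
    step = np.linalg.solve(g_metric, grad)         # g(theta)^{-1} d_theta loss
    theta = theta - 0.1 * step
print(np.linalg.norm(theta))   # small: the metric removes the bad conditioning
```

With the identity metric in place of `g_metric`, the same step size would make the first coordinate diverge, which is the usual motivation for Riemannian preconditioning.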
In general, we have no guarantee that $\hat{g}_{\theta_t}$ is invertible, and so we may need a further approximation step to obtain an invertible matrix. Given a sequence $(\gamma_t)$ of step sizes, we will approximate the gradient flow with

$\hat{\theta}_{t+1} = \hat{\theta}_t - \gamma_t\, \hat{g}_{\theta_t}(\{X^t_i\}_{i=1}^n)^{-1} d_{\theta_t} \widehat{\mathrm{SD}}(\{X^t_i\}_{i=1}^n \| P_\theta).$

Minimum SD estimators hold additional appeal for exponential family models, since their densities have the form $p_\theta(x) \propto \exp(\langle \theta, T(x) \rangle_{\mathbb{R}^m}) \exp(b(x))$ for natural parameters $\theta \in \mathbb{R}^m$, sufficient statistics $T \in \Gamma(\mathbb{R}^m)$, and base measure $\exp(b(x))$. For these models, the U-statistic approximations of DKSD and DSM are convex quadratics with closed-form solutions whenever $K$ and $m$ are independent of $\theta$. Moreover, since the absolute value of an affine function is convex, and the supremum of convex functions is convex, any SD with a diffusion Stein operator is convex in $\theta$, provided $m$ and the Stein class $\mathcal{G}$ are independent of $\theta$.

3 Theoretical Properties for Minimum Stein Discrepancy Estimators

We now show that the DKSD and DSM estimators have many desirable properties such as consistency, asymptotic normality and bias-robustness. These results do not only provide us with reassuring theoretical guarantees on the performance of our algorithms, but can also be a practical tool for choosing a Stein operator and Stein class given an inference problem of interest.

We begin by establishing strong consistency for DKSD, i.e. almost sure convergence $\hat{\theta}^{\mathrm{DKSD}}_n \xrightarrow{a.s.} \theta^{\mathrm{DKSD}}_* \equiv \arg\min_{\theta \in \Theta} \mathrm{DKSD}_{K,m}(Q\|P_\theta)^2$. This will be followed by a proof of asymptotic normality.
We will assume we are in the well-specified setting, so that $Q = P_{\theta^{\mathrm{DKSD}}_*} \in \mathcal{P}_\Theta$. In the misspecified setting, we will need to also assume the existence of a unique minimiser.

Theorem 3 (Strong Consistency of DKSD). Let $\mathcal{X} = \mathbb{R}^d$, $\Theta \subset \mathbb{R}^m$. Suppose that $K$ is bounded with bounded derivatives up to order 2, that $k^0(x, y)$ is continuously differentiable on an $\mathbb{R}^m$-open neighbourhood of $\Theta$, and that for any compact subset $C \subset \Theta$ there exist functions $f_1, f_2, g_1, g_2$ such that for $Q$-a.e. $x \in \mathcal{X}$,

1. $\|m^\top(x) \nabla \log p_\theta(x)\| \le f_1(x)$, where $f_1 \in L^1(Q)$ is continuous,
2. $\|\nabla_\theta (m(x)^\top \nabla \log p_\theta(x))\| \le g_1(x)$, where $g_1 \in L^1(Q)$ is continuous,
3. $\|m(x)\| + \|\nabla_x m(x)\| \le f_2(x)$, where $f_2 \in L^1(Q)$ is continuous,
4. $\|\nabla_\theta m(x)\| + \|\nabla_\theta \nabla_x m(x)\| \le g_2(x)$, where $g_2 \in L^1(Q)$ is continuous.

Assume further that $\theta \mapsto P_\theta$ is injective. Then we have a unique minimiser $\theta^{\mathrm{DKSD}}_*$, and if either $\Theta$ is compact, or $\theta^{\mathrm{DKSD}}_* \in \mathrm{int}(\Theta)$ and $\Theta$ and $\theta \mapsto \widehat{\mathrm{DKSD}}_{K,m}(\{X_i\}_{i=1}^n \| P_\theta)^2$ are convex, then $\hat{\theta}^{\mathrm{DKSD}}_n$ is strongly consistent.

Theorem 4 (Central Limit Theorem for DKSD). Let $\mathcal{X}$ and $\Theta$ be open subsets of $\mathbb{R}^d$ and $\mathbb{R}^m$ respectively. Let $K$ be a bounded kernel with bounded derivatives up to order 2, suppose that $\hat{\theta}^{\mathrm{DKSD}}_n \xrightarrow{p} \theta^{\mathrm{DKSD}}_*$, and that there exists a compact neighbourhood $N \subset \Theta$ of $\theta^{\mathrm{DKSD}}_*$ such that $\theta \mapsto \widehat{\mathrm{DKSD}}_{K,m}(\{X_i\}_{i=1}^n \| P_\theta)^2$ is twice continuously differentiable for $\theta \in N$ and, for $Q$-a.e. $x \in \mathcal{X}$,

1. $\|m^\top(x) \nabla \log p_\theta(x)\| + \|\nabla_\theta (m(x)^\top \nabla \log p_\theta(x))\| \le f_1(x)$,
2. $\|m(x)\| + \|\nabla_x m(x)\| + \|\nabla_\theta m(x)\| + \|\nabla_\theta \nabla_x m(x)\| \le f_2(x)$,
3. $\|\nabla_\theta \nabla_\theta (m(x)^\top \nabla \log p_\theta(x))\| + \|\nabla_\theta \nabla_\theta \nabla_\theta (m(x)^\top \nabla \log p_\theta(x))\| \le g_1(x)$,
4. $\|\nabla_\theta \nabla_\theta m(x)\| + \|\nabla_\theta \nabla_\theta \nabla_x m(x)\| + \|\nabla_\theta \nabla_\theta \nabla_\theta m(x)\| + \|\nabla_\theta \nabla_\theta \nabla_\theta \nabla_x m(x)\| \le g_2(x)$,

where $f_1, f_2 \in L^2(Q)$ and $g_1, g_2 \in L^1(Q)$ are continuous. Suppose also that the information tensor $g$ is invertible at $\theta^{\mathrm{DKSD}}_*$. Then

$\sqrt{n} \big( \hat{\theta}^{\mathrm{DKSD}}_n - \theta^{\mathrm{DKSD}}_* \big) \xrightarrow{d} \mathcal{N}\big(0,\ g^{-1}_{\mathrm{DKSD}}(\theta^{\mathrm{DKSD}}_*)\, \Sigma_{\mathrm{DKSD}}\, g^{-1}_{\mathrm{DKSD}}(\theta^{\mathrm{DKSD}}_*)\big),$

where $\Sigma_{\mathrm{DKSD}} = \int_{\mathcal{X}} \big( \int_{\mathcal{X}} \nabla_\theta k^0_{\theta^{\mathrm{DKSD}}_*}(x,y)\, dQ(y) \big) \otimes \big( \int_{\mathcal{X}} \nabla_\theta k^0_{\theta^{\mathrm{DKSD}}_*}(x,z)\, dQ(z) \big)\, dQ(x)$.

See Appendix D for the proofs. For both results, the assumptions on the kernel are satisfied by most kernels common in the literature, such as Gaussian, inverse multiquadric (IMQ) and any Matérn kernels with smoothness greater than 2. Similarly, the assumptions on the model are very weak, given that the diffusion tensor $m$ can be adapted to guarantee consistency and asymptotic normality.

We now prove analogous results for DSM. This time we show weak consistency, i.e. convergence in probability: $\hat{\theta}^{\mathrm{DSM}}_n \xrightarrow{p} \theta^{\mathrm{DSM}}_* \equiv \arg\min_{\theta \in \Theta} \mathrm{DSM}_m(Q\|P_\theta) = \arg\min_{\theta \in \Theta} \int_{\mathcal{X}} F_\theta(x)\, dQ(x)$. This will be a sufficient form of convergence for asymptotic normality.

Theorem 5 (Weak Consistency of DSM). Let $\mathcal{X}$ be an open subset of $\mathbb{R}^d$, and $\Theta \subset \mathbb{R}^m$.
Suppose $\log p_\theta(\cdot) \in C^2(\mathcal{X})$ and $m \in C^1(\mathcal{X})$, and $\|\nabla_x \log p_\theta(x)\| \le f_1(x)$ for $Q$-a.e. $x$. Suppose also that $\|\nabla_x \nabla_x \log p_\theta(x)\| \le f_2(x)$ on any compact set $C \subset \Theta$ for $Q$-a.e. $x$, where $\|m^\top\| f_1 \in L^2(Q)$, $\|\nabla \cdot (m m^\top)\| f_1 \in L^1(Q)$ and $\|m m^\top\|_\infty f_2 \in L^1(Q)$. If either $\Theta$ is compact, or $\Theta$ and $\theta \mapsto F_\theta$ are convex and $\theta^{\mathrm{DSM}}_* \in \mathrm{int}(\Theta)$, then $\hat{\theta}^{\mathrm{DSM}}_n$ is weakly consistent for $\theta^{\mathrm{DSM}}_*$: $\hat{\theta}^{\mathrm{DSM}}_n \xrightarrow{p} \theta^{\mathrm{DSM}}_*$.

Theorem 6 (Central Limit Theorem for DSM). Let $\mathcal{X}$, $\Theta$ be open subsets of $\mathbb{R}^d$ and $\mathbb{R}^m$ respectively. Suppose $\hat{\theta}^{\mathrm{DSM}}_n \xrightarrow{p} \theta^{\mathrm{DSM}}_*$, that $\theta \mapsto \log p_\theta(x)$ is twice continuously differentiable on a closed ball $\bar{B}(\epsilon, \theta^{\mathrm{DSM}}_*) \subset \Theta$, and that for $Q$-a.e.
$x \in \mathcal{X}$,

(i) $\|m(x) m^\top(x)\| + \|\nabla_x \cdot (m(x) m^\top(x))\| \le f_1(x)$, and $\|\nabla_x \log p_\theta(x)\| + \|\nabla_\theta \nabla_x \log p_\theta(x)\| + \|\nabla_\theta \nabla_x \nabla_x \log p_\theta(x)\| \le f_2(x)$, with $f_1 f_2,\ f_1 f_2^2 \in L^2(Q)$;

(ii) for $\theta \in \bar{B}(\epsilon, \theta_*)$, $\|\nabla_\theta \nabla_x \log p_\theta\|^2 + \|\nabla_x \log p_\theta\| \|\nabla_\theta \nabla_\theta \nabla_x \log p_\theta\| + \|\nabla_\theta \nabla_\theta \nabla_x \log p_\theta\| + \|\nabla_\theta \nabla_\theta \nabla_x \nabla_x \log p_\theta\| \le g_1(x)$, with $f_1 g_1 \in L^1(Q)$.

Then, if the information tensor is invertible at $\theta^{\mathrm{DSM}}_*$, we have

$\sqrt{n} \big( \hat{\theta}^{\mathrm{DSM}}_n - \theta^{\mathrm{DSM}}_* \big) \xrightarrow{d} \mathcal{N}\big(0,\ g^{-1}_{\mathrm{DSM}}(\theta^{\mathrm{DSM}}_*)\, \Sigma_{\mathrm{DSM}}\, g^{-1}_{\mathrm{DSM}}(\theta^{\mathrm{DSM}}_*)\big),$

where $\Sigma_{\mathrm{DSM}} = \int_{\mathcal{X}} \nabla_\theta F_{\theta^{\mathrm{DSM}}_*}(x) \otimes \nabla_\theta F_{\theta^{\mathrm{DSM}}_*}(x)\, dQ(x)$.

All of the proofs can be found in Appendix D.2.
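These limit theorems can be probed empirically. For instance (our illustration, not an experiment from the paper), for the Gaussian location model $p_\theta(x) \propto \exp(-\tfrac12 (x-\theta)^2)$ the SM estimator is the sample mean, so the $\sqrt{n}$-scaled error over repeated samples from $N(\theta_0, 1)$ should look approximately $\mathcal{N}(0, 1)$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps, theta0 = 400, 2_000, 0.7

# For p_theta = N(theta, 1), the SM objective (1/n) sum_i [-1 + (x_i - theta)^2 / 2]
# is minimised at the sample mean, so theta_hat = mean(x).
errors = np.empty(reps)
for r in range(reps):
    x = rng.normal(theta0, 1.0, size=n)
    theta_hat = x.mean()
    errors[r] = np.sqrt(n) * (theta_hat - theta0)

print(errors.mean())  # approximately 0
print(errors.var())   # approximately 1, the asymptotic variance in this toy case
```

The sandwich covariance $g^{-1} \Sigma g^{-1}$ collapses to a scalar here; richer models would require estimating both factors.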
An important special case covered by our theory is that of natural exponential families, which admit densities of the form log p_θ(x) ∝ ⟨θ, T(x)⟩_{ℝ^m} + b(x). If K is IPD with bounded derivatives up to order 2, ∇T has linearly independent rows, m is invertible, and ‖∇T m‖, ‖∇_x b‖‖m‖, ‖∇_x m‖ + ‖m‖ ∈ L²(Q), then the sequences of minimum DKSD and DSM estimators are strongly consistent and asymptotically normal (see Appendix D.3).

Before concluding this section, we turn to a concept of importance to practical inference: robustness when subjected to corrupted data [31]. We quantify the robustness of DKSD and DSM estimators in terms of their influence function, which can be interpreted as measuring the impact on the estimator of an infinitesimal perturbation of a distribution P by a Dirac located at a point z ∈ X. If θ_Q denotes the unique minimum SD estimator for Q, then the influence function is given by IF(z, Q) ≡ ∂_t θ_{Q_t}|_{t=0} if it exists, where Q_t = (1 − t)Q + tδ_z for t ∈ [0, 1]. An estimator is said to be bias robust if IF(z, Q) is bounded in z.

Proposition 7 (Robustness of DKSD estimators). Suppose that the map θ → P_θ over Θ is injective; then IF(z, P_θ) = g_DKSD(θ)⁻¹ ∫_X ∇_θ k₀(z, y) dP_θ(y). Moreover, suppose that y ↦ F(x, y) is Q-integrable for any x, where F(x, y) is any of ‖K(x, y)s_p(y)‖, ‖K(x, y)∇_θ s_p(y)‖, ‖∇_x K(x, y)s_p(y)‖, ‖∇_x K(x, y)∇_θ s_p(y)‖, ‖∇_y ∇_x (K(x, y)m(y))‖, ‖∇_y ∇_x (K(x, y)∇_θ m(y))‖. Then if x ↦ (‖s_p(x)‖ + ‖∇_θ s_p(x)‖) ∫ F(x, y)Q(dy), evaluated at θ_DKSD, is bounded, the DKSD estimators are bias robust: sup_{z∈X} ‖IF(z, Q)‖ < ∞.

The analogous results for DSM estimators can be found in Appendix E. Consider a Gaussian location model, i.e. p_θ ∝ exp(−‖x − θ‖₂²), for θ ∈ ℝ^d. The Gaussian kernel satisfies the assumptions of Proposition 7, so that sup_z ‖IF(z, Q)‖ < ∞ even when m = I. Indeed, ‖IF(z, P_θ)‖ ≤ C(θ) e^{−‖z−θ‖²/4} ‖z − θ‖, where z ↦ e^{−‖z−θ‖²/4} ‖z − θ‖ is uniformly bounded over θ. In contrast, the SM estimator has an influence function of the form IF(z, Q) = z − ∫_X x dQ(x), which is unbounded with respect to z, and is thus not robust. This clearly demonstrates the importance of carefully selecting a Stein class for use in minimum SD estimators. An alternative way of inducing robustness is to introduce a spatially decaying diffusion matrix in DSM. To this end, consider the minimum DSM estimator with scalar diffusion coefficient m. Then θ*_DSM = (∫_X m²(x) dQ(x))⁻¹ (∫_X m²(x) x dQ(x) + ∫_X ∇m²(x) dQ(x)). A straightforward calculation yields that the associated influence function will be bounded if both m(x) and ‖∇m(x)‖ decay as ‖x‖ → ∞. This clearly demonstrates another significant advantage provided by the flexibility of our family of diffusion SDs, where the Stein operator also plays an important role.

4 Numerical Experiments

In this section, we explore several examples which demonstrate worrying breakpoints for SM, and highlight how these can be straightforwardly handled using KSD, DKSD and DSM.

4.1 Rough densities: the symmetric Bessel distributions

A major drawback of SM is the smoothness requirement on the target density.
However, this can be remedied by choosing alternative Stein classes, as will be demonstrated in the case of the symmetric Bessel distributions.

Figure 1: Minimum SD Estimators for the Symmetric Bessel Distribution. We consider the case where θ*₁ = 0, θ*₂ = 1 and n = 500, for a range of smoothness parameter values s in d = 1.

Figure 2: Minimum SD Estimators for Non-standardised Student-t Distributions. We consider a student-t problem with ν = 5, θ*₁ = 25, θ*₂ = 10 and n = 300.

Let K_{s−d/2} denote the modified Bessel function of the second kind with parameter s − d/2. This distribution generalises the Laplace distribution [40] and has log-density: log p_θ(x) ∝ (‖x − θ₁‖₂/θ₂)^{s−d/2} K_{s−d/2}(‖x − θ₁‖₂/θ₂), where θ₁ ∈ ℝ^d is a location parameter and θ₂ > 0 a scale parameter. The parameter s ≥ d/2 encodes smoothness.

We compared SM with KSD based on a Gaussian kernel and a range of lengthscale values in Fig. 1. These results are based on n = 500 IID realisations in d = 1. The case s = 1 corresponds to a Laplace distribution, and we notice that both SM and KSD are able to obtain a reasonable estimate of the location. For rougher values, for example s = 0.6, we notice that KSD outperforms SM for certain choices of lengthscale, whereas for s = 2, SM and KSD are both able to recover the parameter. Analogous results for scale can be found in Appendix F.1, and Appendix F.2 illustrates the trade-off between efficiency and robustness on this problem.

4.2 Heavy-tailed distributions: the non-standardised student-t distributions

A second drawback of standard SM is that it is inefficient for heavy-tailed distributions.
To demonstrate this, we focus on non-standardised student-t distributions: p_θ(x) ∝ (1/θ₂)(1 + (1/ν)‖x − θ₁‖₂²/θ₂²)^{−(ν+1)/2}, where θ₁ ∈ ℝ is a location parameter and θ₂ > 0 a scale parameter. The parameter ν determines the degrees of freedom: ν = 1 gives a Cauchy distribution, whereas ν = ∞ gives the Gaussian distribution. For small values of ν, the student-t distribution is heavy-tailed.

We illustrate SM and KSD for ν = 5 in Fig. 2, where we take an IMQ kernel k(x, y; c, β) = (c² + ‖x − y‖₂²)^β with c = 1 and β = −0.5. This choice of ν guarantees that the first two moments exist, but the distribution is still heavy-tailed. In the left plot, both SM and KSD struggle to recover θ*₁ when n = 300, and the loss functions are far from convex. However, DKSD with m_θ(x) = 1 + ‖x − θ₁‖₂²/θ₂² can estimate θ₁ very accurately. In the middle left plot, we instead estimate θ₂ with SM, KSD and their corresponding non-negative versions (NNSM & NNKSD, m(x) = x), which are particularly well suited for scale parameters. NNSM and NNKSD provide improvements on SM and KSD, but DKSD with m_θ(x) = ((x − θ₁)/θ₂)(1 + (1/ν)‖x − θ₁‖₂²/θ₂²) provides significant further gains.

On the right-hand side, we also consider the advantage of the Riemannian SGD algorithm over SGD by illustrating them on the KSD loss function with n = 1000. Both algorithms use constant stepsizes and minibatches of size 50. As demonstrated, Riemannian SGD converges within a few dozen iterations, whereas SGD has not converged after 1000 iterations.
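For concreteness, minimum-KSD estimation with an IMQ kernel can be sketched in a few lines. The example below is our own minimal 1-d illustration, not the paper's code: it assumes a Gaussian location model with score s_θ(x) = θ − x rather than the student-t model of this section, and k₀ is the usual Stein kernel for identity diffusion.

```python
import numpy as np

# 1-d KSD estimation sketch (illustrative only; Gaussian location model
# with score s_theta(x) = theta - x, IMQ kernel with c = 1, beta = -0.5).
c, beta = 1.0, -0.5

def stein_kernel(theta, x):
    # k0(x,y) = s(x)s(y)k + s(x) dk/dy + s(y) dk/dx + d2k/dxdy.
    X, Y = np.meshgrid(x, x, indexing="ij")
    u = X - Y
    base = c**2 + u**2
    k = base**beta
    kx = 2.0 * beta * u * base ** (beta - 1.0)   # dk/dx
    ky = -kx                                     # dk/dy (k depends on x - y only)
    kxy = -2.0 * beta * base ** (beta - 1.0) \
          - 4.0 * beta * (beta - 1.0) * u**2 * base ** (beta - 2.0)
    sX, sY = theta - X, theta - Y
    return sX * sY * k + sX * ky + sY * kx + kxy

def ksd2(theta, x):
    # V-statistic estimate of the squared kernel Stein discrepancy.
    return stein_kernel(theta, x).mean()

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 200)
grid = np.linspace(-1.0, 1.0, 401)
theta_hat = grid[int(np.argmin([ksd2(t, x) for t in grid]))]
print("KSD estimate of the location:", theta_hat)
```

The loss is quadratic in θ for this model, so a coarse grid search suffices; for the non-convex losses shown in Fig. 2, gradient-based (Riemannian) optimisation is needed instead.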
Additional experiments on the robustness of these estimators are also available in Appendix F.2.

4.3 Robust estimators for light-tailed distributions: the generalised Gamma distributions

Our final example demonstrates a third failure mode for SM: its lack of robustness for light-tailed distributions. We consider generalised gamma location models with likelihoods p_θ(x) ∝ exp(−(x − θ₁)^{θ₂}), where θ₁ is a location parameter and θ₂ determines how fast the tails decay: the larger θ₂, the lighter the tails, and vice-versa. We set n = 300 and corrupt 80 points by setting them to the value x = 8. A robust estimator should obtain a good approximation of θ* even under this corruption.

Figure 3: Minimum SD Estimators for Generalised Gamma Distributions under Corruption. We consider the case where θ*₁ = 0 and θ*₂ = 2 (left and middle) or θ*₂ = 5 (right). Here n = 300.

The left plot in Fig. 3 considers a Gaussian model (i.e. θ*₂ = 2); we see that SM is not robust for this very simple model, whereas DSM with m(x) = 1/(1 + ‖x‖^α), α = 2, is robust. The middle plot shows that DKSD with this same m is also robust, and confirms the analytical results of the previous section. Finally, the right plot considers the case θ*₂ = 5, and we see that α can be chosen as a function of θ₂ to guarantee robustness. In general, taking α ≥ θ*₂ − 1 will guarantee a bounded influence function.
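The effect of a decaying diffusion can be reproduced numerically. The sketch below is ours, not the paper's code: it takes the Gaussian case θ*₂ = 2 with score s_θ(x) = θ − x, corrupts 80 of n = 300 points at x = 8 as above, and uses the closed-form minimiser of the empirical DSM objective for a scalar m (derived under this score convention). The sample mean (the SM estimate) is dragged towards the corruption, while DSM with m(x) = 1/(1 + |x|^α), α = 2, stays near θ*₁ = 0.

```python
import numpy as np

# Robustness sketch (illustrative only). Gaussian location model with score
# s_theta(x) = theta - x and scalar diffusion m(x) = 1/(1 + |x|**alpha).
# Minimising the empirical DSM objective
#   mean( m^2 (theta - x)^2 + 2 * d/dx[ m^2(x) (theta - x) ] )
# over theta gives the closed form below; the sign of the (m^2)' term
# follows from this score convention.
alpha = 2

def dsm_location(x):
    m2 = (1.0 + np.abs(x) ** alpha) ** -2
    m2p = -2.0 * alpha * np.sign(x) * np.abs(x) ** (alpha - 1) \
          * (1.0 + np.abs(x) ** alpha) ** -3   # d(m^2)/dx
    return (np.sum(m2 * x) - np.sum(m2p)) / np.sum(m2)

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 300)   # true location theta* = 0
x[:80] = 8.0                    # corrupt 80 of the n = 300 points, as in Fig. 3

print("SM  (sample mean):", x.mean())         # dragged towards the corruption
print("DSM (decaying m): ", dsm_location(x))  # stays near theta* = 0
```

Because m²(8) ≈ 2 × 10⁻⁴, the corrupted points receive negligible weight in the DSM estimate, which is exactly the bounded-influence-function mechanism described above.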
Such a choice allows us to obtain robust estimators even for models with very light tails.

4.4 Efficient estimators for a simple unnormalised model

Finally, we consider a simple intractable model from [47]: p_θ(x) ∝ exp(η(θ)⊤ψ(x)), where ψ(x) = (Σ_{i=1}^d x_i², Σ_{i=3}^d x₁x_i, tanh(x))⊤, tanh is applied elementwise to x, and η(θ) = (−0.5, 0.2, 0.6, 0, 0, 0, θ, 0). This model is intractable since we cannot easily compute its normalisation constant, due to the difficulty of integrating the unnormalised part of the model. Our results based on n = 200 samples show that DKSD with m(x) = diag(1/(1 + x)) is able to recover θ* = −1, whereas both SM and KSD provide less accurate estimates of the parameter. This illustrates yet again that a judicious choice of diffusion matrix can significantly improve the efficiency of our estimators.

Figure 4: Estimators for a Simple Intractable Model.

5 Conclusion

This paper introduced a general approach for constructing minimum distance estimators based on Stein's method, and demonstrated that many popular inference schemes can be recovered as special cases. This class of algorithms gives us additional flexibility through the choice of an operator and function space (the Stein operator and Stein class), which can be used to tailor the inference scheme to trade off efficiency and robustness. However, this paper only scratches the surface of what is possible with minimum SD estimators. Looking ahead, it will be interesting to identify diffusion matrices which increase efficiency for important classes of problems in machine learning. One example on which we foresee progress is the product of student-t experts models [38, 66, 68], whose heavy tails render estimation challenging for SM.
Advantages could also be found for other energy models, such as large graphical models where the kernel could be adapted to the graph [67].

Acknowledgments

AB was supported by a Roth scholarship from the Department of Mathematics at Imperial College London. FXB was supported by the EPSRC grants [EP/L016710/1, EP/R018413/1]. AD and MG were supported by the Lloyds Register Foundation Programme on Data-Centric Engineering, the UKRI Strategic Priorities Fund under the EPSRC Grant [EP/T001569/1] and the Alan Turing Institute under the EPSRC grant [EP/N510129/1]. MG was supported by the EPSRC grants [EP/J016934/3, EP/K034154/1, EP/P020720/1, EP/R018413/1].

References

[1] S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

[2] S.-I. Amari. Information Geometry and Its Applications, volume 194. Springer, 2016.

[3] A. Barbour and L. H. Y. Chen. An Introduction to Stein's Method. Lecture Notes Series, Institute for Mathematical Sciences, National University of Singapore, 2005.

[4] A. Basu, H. Shioya, and C. Park. Statistical Inference: The Minimum Distance Approach. CRC Press, 2011.

[5] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science+Business Media, New York, 2004.

[6] E. Bernton, P. E. Jacob, M. Gerber, and C. P. Robert. Approximate Bayesian computation with the Wasserstein distance. Journal of the Royal Statistical Society Series B: Statistical Methodology, 81(2):235–269, 2019.

[7] S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.

[8] F.-X. Briol, A. Barp, A. B. Duncan, and M. Girolami. Statistical inference for generative models with maximum mean discrepancy. arXiv:1906.05944, 2019.

[9] G. Casella and R. Berger. Statistical Inference. 2001.

[10] C. Ceylan and M. U. Gutmann. Conditional noise-contrastive estimation of unnormalised models. arXiv:1806.03664, 2018.

[11] L. H. Y. Chen, L. Goldstein, and Q.-M. Shao. Normal Approximation by Stein's Method. Springer, 2011.

[12] W. Y. Chen, L. Mackey, J. Gorham, F.-X. Briol, and C. J. Oates. Stein points. In Proceedings of the International Conference on Machine Learning, PMLR 80:843–852, 2018.

[13] W. Y. Chen, A. Barp, F.-X. Briol, J. Gorham, M. Girolami, L. Mackey, and C. J. Oates. Stein point Markov chain Monte Carlo. In International Conference on Machine Learning, PMLR 97, pages 1011–1021, 2019.

[14] K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In International Conference on Machine Learning, pages 2606–2615, 2016.

[15] A. P. Dawid and M. Musio. Theory and applications of proper scoring rules. Metron, 72(2):169–183, 2014.

[16] A. P. Dawid, M. Musio, and L. Ventura. Minimum scoring rule inference. Scandinavian Journal of Statistics, 43(1):123–138, 2016.

[17] G. Detommaso, T. Cui, Y. Marzouk, A. Spantini, and R. Scheichl. A Stein variational Newton method. In Advances in Neural Information Processing Systems 31, pages 9169–9179, 2018.

[18] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Uncertainty in Artificial Intelligence, 2015.

[19] C. Frogner, C. Zhang, H. Mobahi, M. Araya-Polo, and T. Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061, 2015.

[20] D. Gabay. Minimizing a differentiable function over a differential manifold. Journal of Optimization Theory and Applications, 37(2):177–219, 1982.

[21] A. Genevay, G. Peyré, and M. Cuturi. Learning generative models with Sinkhorn divergences. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, PMLR 84, pages 1608–1617, 2018.

[22] C. J. Geyer. On the convergence of Monte Carlo maximum likelihood calculations. Journal of the Royal Statistical Society: Series B (Methodological), 56(1):261–274, 1994.

[23] J. Gorham and L. Mackey. Measuring sample quality with Stein's method. In Advances in Neural Information Processing Systems, pages 226–234, 2015.

[24] J. Gorham and L. Mackey. Measuring sample quality with kernels. In Proceedings of the International Conference on Machine Learning, pages 1292–1301, 2017.

[25] J. Gorham, A. Duncan, L. Mackey, and S. Vollmer. Measuring sample quality with diffusions. arXiv:1506.03039, 2016. To appear in Annals of Applied Probability.

[26] M. U. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.

[27] M. U. Gutmann and A. Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13:307–361, 2012.

[28] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

[29] W. Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, pages 293–325, 1948.

[30] W. Hoeffding. The strong law of large numbers for U-statistics. Technical report, North Carolina State University Department of Statistics, 1961.

[31] P. J. Huber and E. M. Ronchetti. Robust Statistics. Wiley, 2009.

[32] J. Huggins and L. Mackey. Random feature Stein discrepancies. In Advances in Neural Information Processing Systems, pages 1899–1909, 2018.

[33] A. Hyvärinen. Sparse code shrinkage: Denoising of nongaussian data by maximum likelihood estimation. Neural Computation, 11(7):1739–1768, 1999.

[34] A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–708, 2006.

[35] A. Hyvärinen. Some extensions of score matching. Computational Statistics and Data Analysis, 51(5):2499–2512, 2007.

[36] W. Jitkrittum, W. Xu, Z. Szabo, K. Fukumizu, and A. Gretton. A linear-time kernel goodness-of-fit test. In Advances in Neural Information Processing Systems, pages 261–270, 2017.

[37] R. Karakida, M. Okada, and S.-I. Amari. Adaptive natural gradient learning algorithms for unnormalized statistical models. Artificial Neural Networks and Machine Learning - ICANN, 2016.

[38] D. P. Kingma and Y. LeCun. Regularized estimation of image statistics by score matching. In Advances in Neural Information Processing Systems, pages 1126–1134, 2010.

[39] U. Köster and A. Hyvärinen. A two-layer model of natural stimuli estimated with score matching. Neural Computation, 22(9):2308–2333, 2010.

[40] S. Kotz, T. J. Kozubowski, and K. Podgorski. The Laplace Distribution and Generalizations. Springer, 2001.

[41] Y. Li and R. E. Turner. Gradient estimators for implicit models. In International Conference on Learning Representations, 2018.

[42] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In Proceedings of the International Conference on Machine Learning, volume 37, pages 1718–1727, 2015.

[43] C. Liu and J. Zhu. Riemannian Stein variational gradient descent for Bayesian inference. arXiv:1711.11216, 2017.

[44] Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, 2016.

[45] Q. Liu and D. Wang. Learning deep energy models: Contrastive divergence vs. amortized MLE. arXiv:1707.00797, 2017.

[46] Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the International Conference on Machine Learning, pages 276–284, 2016.

[47] S. Liu, T. Kanamori, W. Jitkrittum, and Y. Chen. Fisher efficient inference of intractable models. arXiv:1805.07454, 2018.

[48] S. Lyu. Interpretation and generalization of score matching. In Conference on Uncertainty in Artificial Intelligence, pages 359–366, 2009.

[49] C. Ma and D. Barber. Black-box Stein divergence minimization for learning latent variable models. Advances in Approximate Bayesian Inference, NIPS 2017 Workshop, 2017.

[50] L. Mackey and J. Gorham. Multivariate Stein factors for a class of strongly log-concave distributions. Electronic Communications in Probability, 21, 2016.

[51] K. V. Mardia, J. T. Kent, and A. K. Laha. Score matching estimators for directional distributions. arXiv:1604.08470, 2016.

[52] C. A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Computation, 17(1):177–204, 2005.

[53] M. Micheli and J. A. Glaunes. Matrix-valued kernels for shape deformation analysis. arXiv:1308.5739, 2013.

[54] A. Mnih and Y. W. Teh. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the International Conference on Machine Learning, pages 419–426, 2012.

[55] A. Muller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.

[56] W. K. Newey and D. McFadden. Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2111–2245, 1994.

[57] C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society B: Statistical Methodology, 79(3):695–718, 2017.

[58] L. Pardo. Statistical Inference Based on Divergence Measures, volume 170. Chapman and Hall/CRC, 2005.

[59] S. Pigola and A. G. Setti. Global divergence theorems in nonlinear PDEs and geometry. Ensaios Matemáticos, 26:1–77, 2014.

[60] R. Ranganath, J. Altosaar, D. Tran, and D. M. Blei. Operator variational inference. In Advances in Neural Information Processing Systems, pages 496–504, 2016.

[61] S. Roth and M. J. Black. Fields of experts. International Journal of Computer Vision, 82(2):205, 2009.

[62] J. Sohl-Dickstein, P. Battaglino, and M. R. DeWeese. Minimum probability flow learning. In Proceedings of the 28th International Conference on Machine Learning, pages 905–912, 2011.

[63] B. Sriperumbudur, K. Fukumizu, A. Gretton, A. Hyvärinen, and R. Kumar. Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research, 18(1):1830–1888, 2017.

[64] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.

[65] C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, pages 583–602. University of California Press, 1972.

[66] K. Swersky, M. A. Ranzato, D. Buchman, B. M. Marlin, and N. de Freitas. On autoencoders and score matching for energy based models. In International Conference on Machine Learning, pages 1201–1208, 2011.

[67] S. V. N. Vishwanathan, N. Schraudolph, R. Kondor, and K. Borgwardt. Graph kernels. Journal of Machine Learning Research, pages 1201–1242, 2010.

[68] M. Welling, G. Hinton, and S. Osindero. Learning sparse topographic representations with products of student-t distributions. In Advances in Neural Information Processing Systems, pages 1383–1390, 2003.

[69] L. Wenliang, D. Sutherland, H. Strathmann, and A. Gretton. Learning deep kernels for exponential family densities. arXiv:1811.08357, 2018.

[70] I.-K. Yeo and R. A. Johnson. A uniform strong law of large numbers for U-statistics with application to transforming to near symmetry. Statistics & Probability Letters, 51(1):63–69, 2001.