{"title": "Kernel Stein Tests for Multiple Model Comparison", "book": "Advances in Neural Information Processing Systems", "page_first": 2243, "page_last": 2253, "abstract": "We address the problem of non-parametric multiple model comparison: given $l$\ncandidate models, decide whether each candidate is as good as the best one(s) or worse than it. We propose two statistical tests,\neach controlling a different notion of decision errors. The first test,\nbuilding on the post selection inference framework, provably controls the\nnumber of best models that are wrongly declared worse (false positive\nrate). The second test is based on multiple correction, and controls the\nproportion of the models declared worse but are in fact as good as the best\n(false discovery rate). \nWe prove that under appropriate conditions the first test can yield a higher true\npositive rate than the second. Experimental results on toy and real (CelebA,\nChicago Crime data) problems show that the two tests have high true positive\nrates with well-controlled error rates. By contrast, the naive approach of\nchoosing the model with the lowest score without correction\nleads to more false positives.", "full_text": "Kernel Stein Tests for Multiple Model Comparison\n\nJen Ning Lim\n\nMax Planck Institute for Intelligent Systems\n\njlim@tuebingen.mpg.de\n\nMakoto Yamada\n\nKyoto University, RIKEN AIP\n\nmakoto.yamada@riken.jp\n\nBernhard Sch\u00f6lkopf\n\nWittawat Jitkrittum\n\nMax Planck Institute for Intelligent Systems\n\nMax Planck Institute for Intelligent Systems\n\nbs@tuebingen.mpg.de\n\nwittawat@tuebingen.mpg.de\n\nAbstract\n\nWe address the problem of non-parametric multiple model comparison: given l\ncandidate models, decide whether each candidate is as good as the best one(s) or\nworse than it. We propose two statistical tests, each controlling a different notion of\ndecision errors. 
The first test, building on the post selection inference framework,\nprovably controls the number of best models that are wrongly declared worse (false\npositive rate). The second test is based on multiple correction, and controls the\nproportion of the models declared worse but are in fact as good as the best (false\ndiscovery rate). We prove that under appropriate conditions the first test can yield\na higher true positive rate than the second. Experimental results on toy and real\n(CelebA, Chicago Crime data) problems show that the two tests have high true\npositive rates with well-controlled error rates. By contrast, the naive approach of\nchoosing the model with the lowest score without correction leads to more false\npositives.\n\n1 Introduction\n\nGiven a sample (a set of i.i.d. observations), and a set of l candidate models M, we address the\nproblem of non-parametric comparison of the relative fit of these candidate models. The comparison\nis non-parametric in the sense that the class of allowed candidate models is broad (mild assumptions\non the models). All the given candidate models may be wrong; that is, the true data generating\ndistribution may not be present in the candidate list. A widely used approach is to pre-select a\ndivergence measure which computes a distance between a model and the sample (e.g., Fr\u00e9chet\nInception Distance (FID, [16]), Kernel Inception Distance [3] or others), and choose the model which\ngives the lowest estimate of the divergence. An issue with this approach is that multiple equally good\nmodels may give roughly the same estimate of the divergence, leading to a wrong conclusion about the best\nmodel due to noise from the sample (see Table 1 in [17] for an example of a misleading conclusion\nresulting from direct comparison of two FID estimates).\nIt was this issue that motivated the development of a non-parametric hypothesis test of relative fit\n(RelMMD) between two candidate models [4]. 
The test uses as its test statistic the difference of two\nestimates of Maximum Mean Discrepancy (MMD, [14]), each measuring the distance between the\ngenerated sample from each model and the observed sample. It is known that if the kernel function\nused is characteristic [27, 11], the population MMD de\ufb01nes a metric on a large class of distributions.\nAs a result, the magnitude of the relative test statistic provides a measure of relative \ufb01t, allowing one\nto decide a (signi\ufb01cantly) better model when the statistic is suf\ufb01ciently large. The key to avoiding\nthe previously mentioned issue of false detection is to appropriately choose the threshold based on\nthe null distribution, i.e., the distribution of the statistic when the two models are equally good. An\nextension of RelMMD to a linear-time relative test was considered by Jitkrittum et al. [17].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fA limitation of the relative tests of RelMMD and others [4, 17] is that they are limited to the\ncomparison of only l = 2 candidate models. Indeed, taking the difference is inherently a function of\ntwo quantities, and it is unclear how the previous relative tests can be applied when there are l > 2\ncandidate models. We note that relative \ufb01t testing is different from goodness-of-\ufb01t testing, which\naims to decide whether a given model is the true distribution of a set of observations. The latter task\nmay be achieved with the Kernel Stein Discrepancy (KSD) test [6, 23, 13] where, in the continuous\ncase, the model is speci\ufb01ed as a probability density function and needs only be known up to the\nnormalizer. A discrete analogue of the KSD test is studied in [32]. 
When the model is represented\nby its sample, goodness-of-\ufb01t testing reduces to two-sample testing, and may be carried out with\nthe MMD test [14], its incomplete U-statistic variants [33, 31], the ME and SCF tests [7, 18], and\nrelated kernel-based tests [8, 10], among others. To reiterate, we stress that in general multiple model\ncomparison differs from multiple goodness-of-\ufb01t tests. While the latter may be addressed with l\nindividual goodness-of-\ufb01t tests (one for each candidate), the former requires comparing l correlated\nestimates of the distances between each model and the observed sample. The use of the observed\nsample in the l estimates is what creates the correlation which must be accounted for.\nIn the present work, we generalize the relative comparison tests of RelMMD and others [4, 17] to the\ncase of l > 2 models. The key idea is to select the \u201cbest\u201d model (reference model) that is the closest\nmatch to the observed sample, and consider l hypotheses. Each hypothesis tests the relative \ufb01t of each\ncandidate model with the reference model, where the reference is chosen to be the model giving the\nlowest estimate of the pre-chosen divergence measure (MMD or KSD). The total output thus consists\nof l binary values where 1 (assign positive) indicates that the corresponding model is signi\ufb01cantly\nworse (higher divergence to the sample) than the reference, and 0 indicates no evidence for such\nclaim (indecisive). We assume that the output is always 0 when the reference model is compared to\nitself. The need for a reference model greatly complicates the formulation of the null hypothesis (i.e.,\nthe null hypothesis is random due to the noisy selection of the reference), an issue that is not present\nin the multiple goodness-of-\ufb01t testing.\nWe propose two non-parametric multiple model comparison tests (Section 3.3) following the previ-\nously described scheme. Each test controls a different notion of decision errors. 
The first test RelPSI\nbuilds on the post selection inference framework and provably (Lemma 4.2) controls the number of\nbest models that are wrongly declared worse (FPR, false positive rate). The second test RelMulti is\nbased on multiple correction, and controls the proportion of the models declared worse that are in\nfact as good as the best (FDR, false discovery rate). In both tests, the underlying divergence measure\ncan be chosen to be either the Maximum Mean Discrepancy (MMD) allowing each model to be\nrepresented by its sample, or the Kernel Stein Discrepancy (KSD) allowing the comparison of any\nmodels taking the form of unnormalized, differentiable density functions.\nAs a theoretical contribution, the asymptotic null distribution of RelMulti-KSD (RelMulti when the\ndivergence measure is KSD) is provided (Theorem C.1), giving rise, as a special case, to a relative KSD test\nfor l = 2 models. To our knowledge, this is the first time that a KSD-based relative\ntest for two models has been studied. Further, we show (in Theorem 4.1) that the RelPSI test can yield a\nhigher true positive rate (TPR) than the RelMulti test, under appropriate conditions. Experiments\n(Section 5) on toy and real (CelebA, Chicago Crime data) problems show that the two proposed tests\nhave high true positive rates with well-controlled respective error rates \u2013 FPR for RelPSI and FDR for\nRelMulti. By contrast, the naive approach of choosing the model with the lowest divergence without\ncorrection leads to more false positives.\n\n2 Background\n\nHypothesis testing of relative fit between l = 2 candidate models, P1 and P2, to the data generating\ndistribution R (unknown) can be performed by comparing the relative magnitudes of a pre-chosen\ndiscrepancy measure which computes the distance from each of the two models to the observed\nsample drawn from R. 
Our proposed methods RelPSI and RelMulti (described in Section 3)\ngeneralize this formulation based upon selective testing [20] and multiple correction [1], respectively.\nUnderlying these new tests is a base discrepancy measure $D$ for measuring the distance from each\ncandidate model to the observed sample. In this section, we review Maximum Mean Discrepancy\n(MMD, [14]) and Kernel Stein Discrepancy (KSD, [6, 23]), which will be used as base discrepancy\nmeasures in our proposed tests in Section 3.\n\nReproducing kernel Hilbert space Given a positive definite kernel $k : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$, it is known\nthat there exists a feature map $\\phi : \\mathcal{X} \\to \\mathcal{H}$ and a reproducing kernel Hilbert space (RKHS) $\\mathcal{H}_k = \\mathcal{H}$\nassociated with the kernel $k$ [2]. The kernel $k$ is symmetric and is a reproducing kernel on $\\mathcal{H}$ in the\nsense that $k(x, y) = \\langle \\phi(x), \\phi(y) \\rangle_{\\mathcal{H}}$ for all $x, y \\in \\mathcal{X}$, where $\\langle \\cdot, \\cdot \\rangle_{\\mathcal{H}} = \\langle \\cdot, \\cdot \\rangle$ denotes the inner product.\nIt follows from this reproducing property that for any $f \\in \\mathcal{H}$, $\\langle f, \\phi(x) \\rangle = f(x)$ for all $x \\in \\mathcal{X}$. We\ninterchangeably write $k(x, \\cdot)$ and $\\phi(x)$.\nMaximum Mean Discrepancy Given a distribution $P$ and a positive definite kernel $k$, the mean embedding\nof $P$, denoted by $\\mu_P$, is defined as $\\mu_P = \\mathbb{E}_{x \\sim P}[k(x, \\cdot)]$ [26] (it exists if $\\mathbb{E}_{x \\sim P}[\\sqrt{k(x, x)}] < \\infty$).\nGiven two distributions $P$ and $R$, the Maximum Mean Discrepancy (MMD, [14]) is a pseudometric\ndefined as $\\mathrm{MMD}(P, R) := \\|\\mu_P - \\mu_R\\|_{\\mathcal{H}}$, where $\\|f\\|^2_{\\mathcal{H}} = \\langle f, f \\rangle_{\\mathcal{H}}$ for any $f \\in \\mathcal{H}$. If the\nkernel $k$ is characteristic [27, 11], then MMD defines a metric. An important implication is that\n$\\mathrm{MMD}^2(P, R) = 0 \\iff P = R$. Examples of characteristic kernels include the Gaussian and\ninverse multiquadric (IMQ) kernels [28, 13]. It was shown in [14] that $\\mathrm{MMD}^2$ can be written as\n$\\mathrm{MMD}^2(P, R) = \\mathbb{E}_{z, z' \\sim P \\times R}[h(z, z')]$ where $h(z, z') = k(x, x') + k(y, y') - k(x, y') - k(x', y)$\nand $z := (x, y)$, $z' := (x', y')$ are independent copies. This form admits an unbiased estimator\n$\\widehat{\\mathrm{MMD}}^2_u = \\frac{1}{n(n-1)} \\sum_{i \\neq j} h(z_i, z_j)$, where $z_i := (x_i, y_i)$, $\\{x_i\\}_{i=1}^n \\overset{\\mathrm{i.i.d.}}{\\sim} P$ and $\\{y_i\\}_{i=1}^n \\overset{\\mathrm{i.i.d.}}{\\sim} R$, which is\na second-order U-statistic [14]. Gretton et al. [14, Section 6] proposed a linear-time estimator\n$\\widehat{\\mathrm{MMD}}^2_l = \\frac{2}{n} \\sum_{i=1}^{n/2} h(z_{2i}, z_{2i-1})$, which can be shown to be asymptotically normally distributed both\nwhen $P = R$ and $P \\neq R$ [14, Corollary 16]. Notice that the MMD can be estimated solely on the\nbasis of two independent samples from the two distributions.\nKernel Stein Discrepancy The Kernel Stein Discrepancy (KSD, [23, 6]) is a discrepancy measure\nbetween an unnormalized, differentiable density function $p$ and a sample, originally proposed\nfor goodness-of-fit testing. Let $P, R$ be two distributions that have continuously differentiable\ndensity functions $p, r$ respectively. Let $s_p(x) := \\nabla_x \\log p(x)$ (a column vector) be\nthe score function of $p$ defined on its support. Let $k$ be a positive definite kernel with continuous\nsecond-order derivatives. Following [23, 19], define $\\xi_p(x, \\cdot) := s_p(x) k(x, \\cdot) + \\nabla_x k(x, \\cdot)$,\nwhich is an element in $\\mathcal{H}^d$ that has an inner product defined as $\\langle f, g \\rangle_{\\mathcal{H}^d} = \\sum_{i=1}^d \\langle f_i, g_i \\rangle_{\\mathcal{H}}$.\nThe Kernel Stein Discrepancy is defined as $\\mathrm{KSD}^2(P, R) := \\|\\mathbb{E}_{x \\sim R} \\xi_p(x, \\cdot)\\|^2_{\\mathcal{H}^d}$. Under appropriate\nboundary conditions on $p$ and conditions on the kernel $k$ [6, 23], it is known that\n$\\mathrm{KSD}^2(P, R) = 0 \\iff P = R$. Similarly to the case of MMD, the squared KSD can\nbe written as $\\mathrm{KSD}^2(P, R) = \\mathbb{E}_{x, x' \\sim R}[u_p(x, x')]$ where\n$u_p(x, x') = \\langle \\xi_p(x, \\cdot), \\xi_p(x', \\cdot) \\rangle_{\\mathcal{H}^d} = s_p(x)^\\top s_p(x') k(x, x') + s_p(x)^\\top \\nabla_{x'} k(x, x') + \\nabla_x k(x, x')^\\top s_p(x') + \\mathrm{tr}[\\nabla_{x, x'} k(x, x')]$.\nThe KSD has an unbiased estimator $\\widehat{\\mathrm{KSD}}^2_u(P, R) = \\frac{1}{n(n-1)} \\sum_{i \\neq j} u_p(x_i, x_j)$ where $\\{x_i\\}_{i=1}^n \\overset{\\mathrm{i.i.d.}}{\\sim} R$, which\nis also a second-order U-statistic. Like the MMD, a linear-time estimator of $\\mathrm{KSD}^2$ is given by\n$\\widehat{\\mathrm{KSD}}^2_l = \\frac{2}{\\lfloor n \\rfloor} \\sum_{i=1}^{\\lfloor n \\rfloor / 2} u_p(x_{2i}, x_{2i-1})$. It is known that $\\sqrt{n} \\widehat{\\mathrm{KSD}}^2_l$ is asymptotically normally\ndistributed [23]. In contrast to the MMD estimator, the KSD estimator requires only samples from\n$R$, and $P$ is represented by its score function $\\nabla_x \\log p(x)$, which is independent of the normalizing\nconstant. As shown in previous work, an explicit probability density function is far more representative\nof the distribution than its sample counterpart [19, 17]. KSD is suitable when the candidate\nmodels are given explicitly (i.e., known density functions), whereas MMD is more suitable when the\ncandidate models are implicit and better represented by their samples.\n3 Proposal: non-parametric multiple model comparison\nIn this section, we propose two new tests: RelMulti (Section 3.2) and RelPSI (Section 3.3), each\ncontrolling a different notion of decision errors.\nProblem (Multiple Model Comparison). Suppose we have $l$ models denoted as $\\mathcal{M} = \\{P_i\\}_{i=1}^l$,\nfor each of which we can either draw a sample (a collection of $n$ i.i.d. realizations) or have access to its\nunnormalized log density $\\log p(x)$. The goal is to decide whether each candidate $P_i$ is worse than\nthe best one(s) in the candidate list (assign positive), or indecisive (assign zero). The best model is\ndefined to be $P_J$ such that $J \\in \\arg\\min_{j \\in \\{1, \\ldots, l\\}} D(P_j, R)$ where $D$ is a base discrepancy measure\n(see Section 2), and $R$ is the data generating distribution (unknown).\n\nThroughout this work, we assume that all candidate models $P_1, \\ldots, P_l$ and the unknown data\ngenerating distribution $R$ have a common support $\\mathcal{X} \\subseteq \\mathbb{R}^d$, and are all distinct. The task can be\nseen as a multiple binary decision making task, where a model $P \\in \\mathcal{M}$ is considered negative\nif it is as good as the best one, i.e., $D(P, R) = D(P_J, R)$ where $J \\in \\arg\\min_j D(P_j, R)$. The\nindex set of all models which are as good as the best one is denoted by $\\mathcal{I}_- := \\{i \\mid D(P_i, R) = \\min_{j=1,\\ldots,l} D(P_j, R)\\}$. When $|\\mathcal{I}_-| > 1$, $J$ is an arbitrary index in $\\mathcal{I}_-$. Likewise, a model is\nconsidered positive if it is worse than the best model. Formally, the index set of all positive\nmodels is denoted by $\\mathcal{I}_+ := \\{i \\mid D(P_i, R) > D(P_J, R)\\}$. It follows that $\\mathcal{I}_- \\cap \\mathcal{I}_+ = \\emptyset$ and\n$\\mathcal{I}_- \\cup \\mathcal{I}_+ = \\mathcal{I} := \\{1, \\ldots, l\\}$. The problem can be equivalently stated as the task of deciding\nwhether the index for each model belongs to $\\mathcal{I}_+$ (assign positive). The total output thus consists\nof $l$ binary values where 1 (assign positive) indicates that the corresponding model is significantly\nworse (higher divergence to the sample) than the best, and 0 indicates no evidence for such a claim\n(indecisive). In practice, there are two difficulties: firstly, $R$ can only be observed through a sample\n$X_n := \\{x_i\\}_{i=1}^n \\overset{\\mathrm{i.i.d.}}{\\sim} R$ so that $D(P_i, R)$ has to be estimated by $\\hat{D}(P_i, R)$ computed on the sample;\nsecondly, the index $J$ of the reference model (the best model) is unknown. 
In our work, we consider\nthe complete and linear-time U-statistic estimators of MMD or KSD as the discrepancy $\\hat{D}$ (see\nSection 2).\n\nWe note that the main assumption on the discrepancy $\\hat{D}$ is that $\\sqrt{n}(\\hat{D}(P_i, R) - \\hat{D}(P_j, R)) \\xrightarrow{d} \\mathcal{N}(\\mu, \\sigma^2)$\nfor any $P_i, P_j \\in \\mathcal{M}$ and $i \\neq j$. If this holds, our proposal can be easily amended to\naccommodate a new discrepancy measure $D$ beyond MMD or KSD. Examples include (but are not\nlimited to) the Unnormalized Mean Embedding [7, 17], Finite Set Stein Discrepancy [19, 17], or\nother estimators such as the block [33] and incomplete estimators [31].\n\n3.1 Selecting a reference candidate model\nIn both proposed tests, the algorithms start by first choosing a model $P_{\\hat{J}} \\in \\mathcal{M}$ as the reference\nmodel where $\\hat{J} \\in \\arg\\min_{j \\in \\mathcal{I}} \\hat{D}(P_j, R)$ is a random variable. The algorithms then proceed to test\nthe relative fit of each model $P_i$ for $i \\neq \\hat{J}$ and determine if it is statistically worse than the selected\nreference $P_{\\hat{J}}$. The null and the alternative hypotheses for the $i$th candidate model can be written as\n\n$H^{\\hat{J}}_{0,i} : D(P_i, R) - D(P_{\\hat{J}}, R) \\le 0 \\mid P_{\\hat{J}}$ is selected as the reference,\n$H^{\\hat{J}}_{1,i} : D(P_i, R) - D(P_{\\hat{J}}, R) > 0 \\mid P_{\\hat{J}}$ is selected as the reference.\n\nThese hypotheses are conditional on the selection event (i.e., selecting $\\hat{J}$ as the reference index). For\neach of the $l$ null hypotheses, the test uses as its statistic $\\eta^\\top z := \\sqrt{n}[\\hat{D}(P_i, R) - \\hat{D}(P_{\\hat{J}}, R)]$ where\n$\\eta = [0, \\cdots, \\underbrace{-1}_{\\hat{J}}, \\cdots, \\underbrace{1}_{i}, \\cdots]^\\top$ and $z = \\sqrt{n}[\\hat{D}(P_1, R), \\cdots, \\hat{D}(P_l, R)]^\\top$. The distribution of\nthe test statistic $\\eta^\\top z$ depends on the choice of estimator for the discrepancy measure $\\hat{D}$, which can\nbe $\\widehat{\\mathrm{MMD}}^2_u$ or $\\widehat{\\mathrm{KSD}}^2_u$. Define $\\mu := [D(P_1, R), \\ldots, D(P_l, R)]^\\top$; then the hypotheses above can be\nequivalently expressed as $H^{\\hat{J}}_{0,i} : \\eta^\\top \\mu \\le 0 \\mid Az \\le 0$ vs. $H^{\\hat{J}}_{1,i} : \\eta^\\top \\mu > 0 \\mid Az \\le 0$, where\nwe note that $\\eta$ depends on $i$, $A \\in \\{-1, 0, 1\\}^{(l-1) \\times l}$, $A_{s,:} = [0, \\ldots, \\underbrace{1}_{\\hat{J}}, \\cdots, \\underbrace{-1}_{s}, \\cdots, 0]$ for all\n$s \\in \\{1, \\ldots, l\\} \\setminus \\{\\hat{J}\\}$, and $A_{s,:}$ denotes the $s$th row of $A$. This equivalence was exploited in the multiple\ngoodness-of-fit testing by Yamada et al. [31]. The condition $Az \\le 0$ represents the fact that $P_{\\hat{J}}$ is\nselected as the reference model, and expresses $\\hat{D}(P_{\\hat{J}}, R) \\le \\hat{D}(P_s, R)$ for all $s \\in \\{1, \\ldots, l\\} \\setminus \\{\\hat{J}\\}$.\n3.2 RelMulti: for controlling false discovery rate (FDR)\n\nUnlike traditional hypothesis testing, the null hypotheses here are conditional on the selection event,\nmaking the null distribution non-trivial to derive [21, 22]. Specifically, the sample used to form\nthe selection event (i.e., establishing the reference model) is the same sample used for testing the\nhypothesis, creating a dependency. Our first approach of RelMulti is to divide the sample into two\nindependent sets, where the first is used to choose $P_{\\hat{J}}$ and the latter for performing the test(s). This\napproach simplifies the null distribution since the sample used to form the selection event and the\ntest sample are now independent. That is, $H^{\\hat{J}}_{0,i} : \\eta^\\top \\mu \\le 0 \\mid Az \\le 0$ simplifies to $H^{\\hat{J}}_{0,i} : \\eta^\\top \\mu \\le 0$\ndue to independence. In this case, the distribution of the test statistic (for $\\widehat{\\mathrm{MMD}}^2_u$ and $\\widehat{\\mathrm{KSD}}^2_u$) after\nselection is the same as its unconditional null distribution. Under our assumption that all distributions\nare distinct, the test statistic is asymptotically normally distributed [14, 23, 6].\nFor the complete U-statistic estimator of Maximum Mean Discrepancy ($\\widehat{\\mathrm{MMD}}^2_u$), Bounliphone et al.\n[4] showed that, under mild assumptions, $z$ is jointly asymptotically normal, where the covariance\nmatrix is known in closed form. However, for $\\widehat{\\mathrm{KSD}}^2_u$, only the marginal variance is known [6, 23]\nand not its cross covariances, which are required for characterizing the null distributions of our test\n(see Algorithm 2 in the appendix for the full algorithm of RelMulti). We present the asymptotic\nmultivariate characterization of $\\widehat{\\mathrm{KSD}}^2_u$ in Theorem C.1.\nGiven a desired significance level $\\alpha \\in (0, 1)$, the rejection threshold is chosen to be the $(1 - \\alpha)$-quantile\nof the distribution $\\mathcal{N}(0, \\hat{\\sigma}^2)$ where $\\hat{\\sigma}^2$ is the plug-in estimator of the asymptotic variance $\\sigma^2$\nof our test statistic (see [4, Section 3] for MMD and Section D for KSD). With this choice, the false\nrejection rate for each of the $l - 1$ hypotheses is upper bounded by $\\alpha$ (asymptotically). However,\nto control the false discovery rate for the $l - 1$ tests it is necessary to further correct with multiple\ntesting adjustments. We use the Benjamini\u2013Yekutieli procedure [1] to adjust $\\alpha$. We note that when\ntesting $H^{\\hat{J}}_{0,\\hat{J}}$, the result is always 0 (fail to reject) by default. When $l > 2$, following the result of [1],\nthe asymptotic false discovery rate (FDR) of RelMulti is provably no larger than $\\alpha$. The FDR in our\ncase is the fraction of the models declared worse that are in fact as good as the (true) reference model.\nFor $l = 2$, no correction is required as only one test is performed.\n\n3.3 RelPSI: for controlling false positive rate (FPR)\n\nA caveat of the data splitting used in RelMulti is the loss of true positive rate since a portion of the sample\nfor testing is used for forming the selection. When the selection is wrong, i.e., $\\hat{J} \\in \\mathcal{I}_+$, the test\nwill yield a lower true positive rate. It is possible to alleviate this issue by using the full sample for\nselection and testing, which is the approach taken by our second proposed test RelPSI. This approach\nrequires us to know the null distribution of the conditional null hypotheses (see Section 3.1), which\ncan be derived based on Theorem 3.1.\nTheorem 3.1 (Polyhedral Lemma [20]). Suppose that $z \\sim \\mathcal{N}(\\mu, \\Sigma)$ and the selection event is affine,\ni.e., $Az \\le b$ for some matrix $A$ and $b$, then for any $\\eta$, we have\n\n$\\eta^\\top z \\mid Az \\le b \\sim \\mathcal{TN}(\\eta^\\top \\mu, \\eta^\\top \\Sigma \\eta, V^-(z), V^+(z))$,\n\nwhere $\\mathcal{TN}(\\mu, \\sigma^2, a, b)$ is a truncated normal distribution with mean $\\mu$ and variance $\\sigma^2$ truncated\nat $[a, b]$. Let $\\alpha = \\frac{A \\Sigma \\eta}{\\eta^\\top \\Sigma \\eta}$. The truncation points are given by $V^-(z) = \\max_{j : \\alpha_j < 0} \\frac{b_j - (Az)_j}{\\alpha_j} + \\eta^\\top z$\nand $V^+(z) = \\min_{j : \\alpha_j > 0} \\frac{b_j - (Az)_j}{\\alpha_j} + \\eta^\\top z$.\nThis lemma assumes two parameters are known: $\\mu$ and $\\Sigma$. Fortunately, we do not need to estimate\n$\\mu$ and can set $\\eta^\\top \\mu = 0$. To see this, note that the threshold is given by the $(1 - \\alpha)$-quantile of a truncated\nnormal, which is $t_\\alpha := \\eta^\\top \\mu + \\sigma \\Phi^{-1}\\left( (1 - \\alpha) \\Phi\\left( \\frac{V^+ - \\eta^\\top \\mu}{\\sigma} \\right) + \\alpha \\Phi\\left( \\frac{V^- - \\eta^\\top \\mu}{\\sigma} \\right) \\right)$ where $\\sigma^2 = \\eta^\\top \\Sigma \\eta$.\nIf our test statistic $\\eta^\\top z$ exceeds the threshold, we reject the null hypothesis $H^{\\hat{J}}_{0,i}$. This choice of the\nrejection threshold will control the selective type-I error $\\mathbb{P}(\\eta^\\top z > t_\\alpha \\mid H^{\\hat{J}}_{0,i}$ is true, $P_{\\hat{J}}$ is selected$)$\nto be no larger than $\\alpha$. However, since $\\mu$ is not known, the threshold is adjusted by setting $\\eta^\\top \\mu = 0$,\nwhich can be seen as a more conservative threshold. A similar adjustment procedure is used in\nBounliphone et al. [4] and Jitkrittum et al. [17] for Gaussian distributed test statistics. Since\n$\\Sigma$ is also unknown, we replace $\\Sigma$ with a consistent plug-in estimator $\\hat{\\Sigma}$ given by Bounliphone et\nal. [4, Theorem 2] for $\\widehat{\\mathrm{MMD}}^2_u$ and Theorem C.1 for $\\widehat{\\mathrm{KSD}}^2_u$. Specifically, we have as the threshold\n$\\hat{t}_\\alpha := \\hat{\\sigma} \\Phi^{-1}\\left( (1 - \\alpha) \\Phi\\left( \\frac{V^+}{\\hat{\\sigma}} \\right) + \\alpha \\Phi\\left( \\frac{V^-}{\\hat{\\sigma}} \\right) \\right)$ where $\\hat{\\sigma}^2 = \\eta^\\top \\hat{\\Sigma} \\eta$ (see Algorithm 1 in the appendix for\nthe full algorithm of RelPSI).\nOur choice of $\\eta$ depends on the realization of $\\hat{J}$, but $\\eta$ can be fixed such that the test we perform is\nindependent of our observation of $\\hat{J}$ (see Experiment 1). For a fixed $\\eta$, the concept of power, i.e.,\n$\\mathbb{P}(\\eta^\\top z > \\hat{t}_\\alpha)$ when $\\eta^\\top \\mu > 0$, is meaningful; and we show in Theorem 3.2 that our test is consistent\nusing MMD. However, when $\\eta$ is random (i.e., dependent on $\\hat{J}$) the notion of test power is less\nappropriate, and we use true positive rate and false positive rate to measure the performance (see\nSection 4).\n\nTheorem 3.2 (Consistency of RelPSI-MMD). Let $P_1, P_2$ be two models and $R$ a data distribution\n(all distinct). Let $\\hat{\\Sigma}$ be a consistent estimate of the covariance matrix defined in Theorem\nC.2, and let $\\eta$ be defined such that $\\eta^\\top z = \\sqrt{n}[\\widehat{\\mathrm{MMD}}^2_u(P_2, R) - \\widehat{\\mathrm{MMD}}^2_u(P_1, R)]$. Suppose that the\nthreshold $\\hat{t}_\\alpha$ is the $(1 - \\alpha)$-quantile of $\\mathcal{TN}(0, \\eta^\\top \\hat{\\Sigma} \\eta, V^-, V^+)$ where $V^+$ and $V^-$ are defined in\nTheorem 3.1. Under $H_0 : \\eta^\\top \\mu \\le 0 \\mid P_{\\hat{J}}$ is selected, the asymptotic type-I error is bounded above by\n$\\alpha$. Under $H_1 : \\eta^\\top \\mu > 0 \\mid P_{\\hat{J}}$ is selected, we have $\\mathbb{P}(\\eta^\\top z > \\hat{t}_\\alpha) \\to 1$ as $n \\to \\infty$.\nA proof of Theorem 3.2 can be found in Section G in the appendix. A similar result holds for\nRelPSI-KSD (see Appendix G.1), whose proof closely follows the proof of Theorem 3.2 and is\nomitted.\n\n4 Performance analysis\nPost selection inference (PSI) incurs its loss of power from conditioning on the selection event\n[9, Section 2.5]. Therefore, in the fixed hypothesis (not conditional) setting of $l = 2$ models, it is\nunsurprising that the empirical power of RelMMD and RelKSD is higher than that of their PSI counterparts (see\nExperiment 1). However, when $l = 2$ and conditional hypotheses are considered, it is unclear which\napproach is desirable. 
Since both PSI (as in RelPSI) and data-splitting (as in RelMulti) approaches\nfor model comparison have tractable null distributions, we study the performance of our proposals\nfor the case when the hypothesis is dependent on the data.\nWe measure the performance of RelPSI and RelMulti by true positive rate (TPR) and false positive\nrate (FPR) in the setting of $l = 2$ candidate models. These are popular metrics when reporting the\nperformance of selective inference approaches [29, 31, 9]. TPR is the expected proportion of models\nworse than the best that are correctly reported as such. FPR is the expected proportion of models as\ngood as the best that are wrongly reported as worse. It is desirable for TPR to be high and FPR to be\nlow. We defer the formal definitions to Section A (appendix); when we estimate TPR and FPR, we\ndenote the estimates as $\\widehat{\\mathrm{TPR}}$ and $\\widehat{\\mathrm{FPR}}$ respectively. In the following theorem, we show that the TPR of RelPSI\nis higher than the TPR of RelMulti.\nTheorem 4.1 (TPR of RelPSI and RelMulti). Let $P_1, P_2$ be two candidate models, and $R$ be a data\ngenerating distribution. Assume that $P_1, P_2$ and $R$ are distinct. Given $\\alpha \\in [0, \\frac{1}{2}]$ and split proportion\n$\\rho \\in (0, 1)$ for RelMulti so that $(1 - \\rho)n$ samples are used for selecting $P_{\\hat{J}}$ and $\\rho n$ samples for testing,\nfor all $n \\gg N = \\left( \\frac{\\sigma \\Phi^{-1}(1 - \\frac{\\alpha}{2})}{\\mu (1 - \\sqrt{\\rho})} \\right)^2$, we have $\\mathrm{TPR}_{\\mathrm{RelPSI}} \\gtrapprox \\mathrm{TPR}_{\\mathrm{RelMulti}}$.\n\nThe proof is provided in Section F.6. This result holds for both MMD and KSD. Additionally,\nin the following result we show that both approaches bound FPR by $\\alpha$. Thus, RelPSI controls FPR\nregardless of the choice of discrepancy measure and number of candidate models.\nLemma 4.2 (FPR Control). Define the selective type-I error for the $i$th model to be $s(i, \\hat{J}) := \\mathbb{P}(\\text{reject } H^{\\hat{J}}_{0,i} \\mid H^{\\hat{J}}_{0,i} \\text{ is true}, P_{\\hat{J}} \\text{ is selected})$.\nIf $s(i, \\hat{J}) \\le \\alpha$ for all $i, \\hat{J} \\in \\{1, \\ldots, l\\}$, then $\\mathrm{FPR} \\le \\alpha$.\nThe proof can be found in Section A. For both RelPSI and RelMulti, the test threshold is chosen to\ncontrol the selective type-I error. Therefore, both control FPR to be no larger than $\\alpha$. In RelPSI, we\nexplicitly control this quantity by characterizing the distribution of the statistic under the conditional null.\nRemark. The selection of the best model is a noisy process, and we can pick a model that is worse\nthan the actual best, i.e., $\\hat{J} \\notin \\arg\\min_j D(P_j, R)$. An incorrect selection results in a higher proportion\nof true conditional null hypotheses, so the true positive rate of the test will be lowered. However, the\nfalse rejection rate is still controlled at level $\\alpha$.\n\n5 Experiments\n\nFigure 1: Rejection rates (estimated from 300 trials) for the six tests with $\\alpha = 0.05$.\n\u201cMMD-U\u201d refers to the complete U-statistic for MMD, $\\widehat{\\mathrm{MMD}}^2_u$; \u201cMMD-Lin\u201d\nrefers to the linear-time estimator $\\widehat{\\mathrm{MMD}}^2_l$, and similarly for KSD Complete and KSD Linear (defined\nin Section 2). Panels: (a) Mean shift, d = 10; (b) Blobs, d = 2; (c) RBM, d = 20; (d) Blobs problem.\n\nIn this section, we demonstrate our proposed method on both toy problems and real-world datasets.\nOur first experiment is a baseline comparison of our proposed method RelPSI to RelMMD [4] and\nRelKSD (see Appendix D). In this experiment, we consider a fixed hypothesis of model comparison\nfor two candidate models (RelMulti is not applicable here). This is the original setting that RelMMD\nwas proposed for. In the second experiment, we consider a set of mixture models for smiling and\nnon-smiling images of CelebA [24] where each model has its own unique generating proportions\nfrom the real data set or images from trained GANs. 
For our final experiment, we examine density\nestimation models trained on the Chicago crime dataset considered by Jitkrittum et al. [19]. In this\nexperiment, each model has a score function which allows us to apply both RelPSI and RelMulti with\nKSD. In the last two experiments on real data, there is no ground truth for which candidate model\nis the best, so estimating TPR, FDR and FPR is infeasible. Instead, the experiments are designed\nto have a strong indication of the ground truth with support from another metric. More synthetic\nexperiments are shown in Appendix H to verify our theoretical results.\nThe kernel parameters used in the discrepancy between each model Pi and the data distribution R are\nthe same to ensure that the comparison between the discrepancies is meaningful. If the median heuristic\nis used, the bandwidth parameter is the empirical median of all the pairwise L2 distances between the\ngiven samples. For MMD, samples from the data distribution R and all the model samples M are\nused to calculate the median heuristic. For KSD, only the samples from R are used. Code\nfor reproducing the results can be found online.1 We note that to account for sample variability, our\nexperiments are averaged over at least 100 trials with new samples (from a different seed) redrawn\nfor each trial.\n1. A comparison of RelMMD, RelKSD, RelPSI-KSD and RelPSI-MMD (l = 2): The aim of\nthis experiment is to investigate the behaviour of the proposed tests with RelMMD and RelKSD\nas baseline comparisons and to empirically demonstrate that RelPSI-MMD and RelPSI-KSD possess\ndesirable properties such as level-\u03b1 and comparable test power. Since RelMMD and RelKSD have\nno concept of selection, in order for the results to be comparable we fix the null hypothesis to be\nH0 : D(P1, R) \u2264 D(P2, R), which is possible for RelPSI by fixing \u03b7\u22a4 = [\u22121, 1]. In this experiment,\nwe consider the following problems:\n\n1. 
Mean shift: The two candidate models are isotropic Gaussians on R^10 with varying mean:\nP1 = N ([0.5, 0,\u00b7\u00b7\u00b7 , 0], I) and P2 = N ([\u22120.5, 0,\u00b7\u00b7\u00b7 , 0], I). Our reference distribution is\nR = N (0, I). In this case, H0 is true.\n2. Blobs: This problem was studied by various authors [7, 15, 17]. Each distribution is a\nmixture of Gaussians with a similar structure on a global scale but different locally by\nrotation. Samples from these distributions are shown in Figure 1d. In this case, H1 is true.\n3. Restricted Boltzmann Machine (RBM): This problem was studied by [23, 19, 17]. Each\ndistribution is given by a Gaussian Restricted Boltzmann Machine (RBM) with a density\n$p(y) = \\sum_x p'(y, x)$ and $p'(y, x) = \\frac{1}{Z} \\exp(y^\\top B x + b^\\top y + c^\\top x - \\frac{1}{2}\\|y\\|^2)$, where x\nare the latent variables and the model parameters are B, b, c. The models share the same\nparameters b and c (which are drawn from a standard normal) with the reference distribution\nbut the matrix B (sampled uniformly from {\u22121, 1}) is perturbed with Bp2 = B + 0.3\u03b4\nand Bp1 = B + \u03b5\u03b4 where \u03b5 varies between 0 and 1. It measures the sensitivity of the test [19]\nsince perturbing only one entry can create a difference that is hard to detect. Furthermore,\nwe fix n = 1000, dx = 5, dy = 20.\n\n1 https://github.com/jenninglim/model-comparison-test\n\n[Figure 1 plots. Legend: RelPSI MMD-U, RelPSI MMD-Lin, RelPSI KSD-Lin, RelPSI KSD-U, RelMMD, RelKSD. Axes: rejection rate against number of samples (a, b) and against perturbation \u03b5 (c); panel (d) shows samples from P1, P2 and R.]\n\nTable 1: A comparison of our proposed method with FID. Each underlying distribution is a\nmixture of smiling (S) and non-smiling (N) faces, which can be either generated (G) or real\n(R). \u201cRej.\u201d denotes the rate of rejection of the model, indicating that it is significantly worse than the\nbest model. \u201cSel.\u201d is the rate at which the model is selected (the one with the minimum discrepancy\nscore). Average FID scores are also reported. These results are averaged over 100 trials.\n\nModel | Mix S | Mix N | RelPSI-MMD Rej. | RelPSI-MMD Sel. | RelMulti-MMD Rej. | RelMulti-MMD Sel. | FID Aver. | FID Sel.\n1 | 0.50 (G) | 0.50 (G) | 0.99 | 0.0 | 1.0 | 0.0 | 27.86 \u00b1 0.49 | 0\n2 | 0.60 (R) | 0.40 (R) | 0.39 | 0.02 | 0.18 | 0.08 | 16.01 \u00b1 0.19 | 0.39\n3 | 0.40 (R) | 0.60 (R) | 0.28 | 0.03 | 0.19 | 0.10 | 16.29 \u00b1 0.20 | 0.03\n4 | 0.51 (R) | 0.49 (R) | 0.02 | 0.52 | 0.03 | 0.37 | 16.03 \u00b1 0.18 | 0.27\n5 | 0.52 (R) | 0.48 (R) | 0.06 | 0.43 | 0.0 | 0.45 | 16.01 \u00b1 0.17 | 0.31\nTruth | 0.5 (R) | 0.5 (R) | - | - | - | - | - | -\n\nOur proposal and baselines are all non-parametric kernel-based tests. For a fair comparison, all the\ntests use the same Gaussian kernel with its bandwidth chosen by the median heuristic. Figure 1\nshows the rejection rates for all tests. As expected, the tests based on KSD have higher power than\nthose based on MMD due to having access to the density function. Additionally, linear time estimators perform\nworse than their complete counterparts.\nIn Figure 1a, where H0 is true, the false rejection rate (type-I error) is controlled around level \u03b1\nfor all tests. In Figure 1b, the poor performance of MMD-based tests in the blobs experiment is caused\nby an unsuitable choice of bandwidth. The median heuristic cannot capture the small-scale differences\n[15, 17]. KSD-based tests utilize the same heuristic but have access to the density function,\nso a mismatch in the distribution shape can be detected. Interestingly, in all our experiments, the RelPSI\nvariants perform comparably to their cousins, RelMMD and RelKSD, but as expected the power\nis lowered due to the loss of information from our conditioning [9]. 
These two problems show the\nbehaviour of the tests when the number of samples n increases.\nFigure 1c shows the behaviour of the tests when the difference between the candidate models\nincreases (one model gets closer to the reference distribution). When \u03b5 < 0.3, the null case is true\nand the tests exhibit a low rejection rate. However, when \u03b5 > 0.3 the alternative is true. Tests\nutilizing KSD can detect this change quickly, which is indicated by the sharp increase in the rejection\nrate when \u03b5 = 0.3. However, MMD-based tests are unable to detect the differences at that point. As\nthe amount of perturbation increases, this changes and MMD tests begin to identify with significance\nthat the alternative is true. Here we see that RelPSI-MMD has a visibly lower rejection rate, indicating\nthe cost in power of conditioning, whilst RelPSI-KSD and RelKSD have similar power.\n2. Image Comparison (l = 5): In this experiment, we apply our proposed tests RelPSI-MMD and\nRelMulti-MMD to compare five image-generating candidate models. We consider the\nCelebA dataset [24], in which each sample is an image of a celebrity labelled with 40 annotated\nfeatures. As our reference distribution and candidate models, we use mixtures of smiling and\nnon-smiling faces of varying proportions (shown in Table 1) where the model can generate images\nfrom a GAN or from the real dataset. For generated images, we use the GANs of [17, Appendix B].\nIn each trial, n = 2000 samples are used. We partition the dataset such that the reference distribution\ndraws distinct independent samples, and each model samples independently of the remainder of the\npool. All algorithms receive the same model samples. The kernel used is the Inverse Multiquadric\n(IMQ) on 2048 features extracted by the Inception-v3 network at the pool layer [30]. Additionally,\nwe use a 50:50 split for RelMulti-MMD. 
Our baseline is the procedure of choosing the lowest Fr\u00e9chet\nInception Distance (FID) [16]. We note the authors did not propose a statistical test with FID. Table 1\nsummarizes the results from the experiment.\nIn Table 1, we report the model-wise rejection rate (a high rejection rate indicates a poor candidate, relatively\nspeaking) and the model selection rate (the rate at which the model has the smallest\ndiscrepancy from the given samples). The table illustrates several interesting points. First, even\nthough Model 1 shares the same proportions as the true reference model, the quality of the generated\nimages is a poor match to the reference images and thus it is frequently rejected. A considerably higher\nFID score (than the rest) also supports this claim. Secondly, in this experiment, MMD is a good\nestimator of the best model for both RelPSI and RelMulti (with splitting exhibiting higher variance)\nbut the minimum FID score selects the incorrect model 73% of the time. The additional testing\nindicates that Model 4 or Model 5 could be the best as they were rarely deemed worse than the best,\nwhich is unsurprising given that their mixing proportions are closest to the true distribution. The\nlow rejection for Model 4 is expected given that they differ by only 40 samples. Models 2 and 3\nhave respectable model-wise rejection rates that indicate their position as worse than the best. Overall,\nboth RelPSI and RelMulti perform well and show that the additional testing phase yields more\ninformation than the approach of picking the minimum of a score function (especially for FID).\n\nFigure 2: The density plots of the trained models on the Chicago Crime dataset. Panels: (a) Truth, (b) MoG (1),\n(c) MoG (2), (d) MoG (5), (e) MADE, (f) MAF.\n\n3. Density Comparison (l = 5): In our final experiment, we demonstrate RelPSI-KSD and RelMulti-\nKSD on the Chicago data-set considered in Jitkrittum et al. 
[19] which consists of 11957 data points.\nWe split the data-set into disjoint sets such that 7000 samples are used for training and the remainder\nfor testing. For our candidate models, we trained a Mixture of Gaussians (MoG) with expectation\nmaximization with C components where C \u2208 {1, 2, 5}, a Masked Auto-encoder for Density Estimation\n(MADE) [12] and a Masked Auto-regressive Flow (MAF) [25]. MADE is equivalent to a MAF with 1 autoregressive layer\nand a standard normal as the base distribution, and the MAF model has 5\nautoregressive layers with a base distribution of a MoG (5). Each autoregressive layer is a feed-\nforward network with 512 hidden units. Both invertible models are trained with maximum likelihood\nwith a small amount of \u21132 penalty on the weights. In each trial, we sample n = 2000 points\nindependently from the test set. The resulting densities are shown in Figure 2 and the reference distribution\nin Figure 2a. We compare our result with the negative log-likelihood (NLL). Here we use the IMQ\nkernel.\nThe results are shown in Table 2. If performance is measured by higher model-wise rejection\nrates, RelPSI-KSD performs better than RelMulti-KSD in this experiment. RelPSI-KSD suggests\nthat MoG (1), MoG (2) and MADE are worse than the best but is unsure about MoG (5) and MAF.\nIn contrast, the only significant rejection by RelMulti-KSD is MoG (1). These findings with RelPSI-KSD\ncan be further endorsed by inspecting the density (see Figure 2). It is clear that MoG (1), MoG\n(2) and MADE are too simple. But between MoG (5) and MAF (5), it is unclear which is a better\nfit. Negative Log Likelihood (NLL) consistently suggests that MAF is the best, which corroborates\nour findings that MAF is one of the top models. 
The preference of NLL for MAF is due to log\nlikelihood not penalizing the complexity of the model (MAF is the most complex with the highest\nnumber of parameters).\n\nModel | RelPSI-KSD Rej. | RelPSI-KSD Sel. | RelMul-KSD Rej. | RelMul-KSD Sel. | NLL Aver. | NLL Sel.\nMoG (1) | 0.42 | 0 | 0.22 | 0 | 2.64 | 0\nMoG (2) | 0.28 | 0.01 | 0.07 | 0.08 | 2.55 | 0\nMoG (5) | 0.02 | 0.62 | 0 | 0.38 | 2.38 | 0\nMADE | 0.26 | 0.01 | 0.04 | 0.03 | 2.53 | 0\nMAF (5) | 0 | 0.36 | 0 | 0.51 | 2.25 | 1\n\nTable 2: Relative testing on unconditional density estimation models. The model-wise rejection\nrates, selection rates and average negative log likelihood (NLL) scores are reported. These results are\naveraged over 100 trials.\n\nAcknowledgments\n\nM.Y. was supported by the JST PRESTO program JPMJPR165A and partly supported by MEXT\nKAKENHI 16H06299 and the RIKEN engineering network funding.\n\nReferences\n[1] Yoav Benjamini and Daniel Yekutieli. The control of the false discovery rate in multiple testing\nunder dependency. The annals of statistics, 29(4):1165\u20131188, 2001.\n\n[2] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and\nStatistics. Kluwer, 2004.\n\n[3] M. Bi\u0144kowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying MMD GANs. In\nICLR, 2018.\n\n[4] Wacha Bounliphone, Eugene Belilovsky, Matthew B. Blaschko, Ioannis Antonoglou, and Arthur\nGretton. A test of relative similarity for model selection in generative models. In International\nConference on Learning Representations, 2016.\n\n[5] John Burkardt. The truncated normal distribution. Department of Scientific Computing Website,\nFlorida State University, 2014.\n\n[6] Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of fit.\nIn International Conference on Machine Learning. PMLR, 2016.\n\n[7] Kacper P Chwialkowski, Aaditya Ramdas, Dino Sejdinovic, and Arthur Gretton. Fast two-sample\ntesting with analytic representations of probability measures. 
In Advances in Neural\nInformation Processing Systems, pages 1981\u20131989, 2015.\n\n[8] Moulines Eric, Francis R Bach, and Za\u00efd Harchaoui. Testing for homogeneity with kernel fisher\ndiscriminant analysis. In Advances in Neural Information Processing Systems, pages 609\u2013616,\n2008.\n\n[9] William Fithian, Dennis Sun, and Jonathan Taylor. Optimal inference after model selection.\narXiv preprint arXiv:1410.2597, 2014.\n\n[10] Magalie Fromont, Matthieu Lerasle, Patricia Reynaud-Bouret, et al. Kernels based tests with\nnon-asymptotic bootstrap approaches for two-sample problems. In Conference on Learning\nTheory, pages 23\u20131, 2012.\n\n[11] Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Sch\u00f6lkopf. Kernel measures of\nconditional dependence. In Advances in neural information processing systems, pages 489\u2013496,\n2008.\n\n[12] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder\nfor distribution estimation. In International Conference on Machine Learning, pages\n881\u2013889, 2015.\n\n[13] Jackson Gorham and Lester Mackey. Measuring sample quality with kernels. In Proceedings of\nthe 34th International Conference on Machine Learning-Volume 70, pages 1292\u20131301. JMLR.org, 2017.\n\n[14] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Sch\u00f6lkopf, and Alexander\nSmola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723\u2013773,\n2012.\n\n[15] Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano\nPontil, Kenji Fukumizu, and Bharath K Sriperumbudur. Optimal kernel choice for large-scale\ntwo-sample tests. In Advances in neural information processing systems, pages 1205\u20131213,\n2012.\n\n[16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.\nGANs trained by a two time-scale update rule converge to a local nash equilibrium. 
In Advances\nin Neural Information Processing Systems, pages 6626\u20136637, 2017.\n\n[17] Wittawat Jitkrittum, Heishiro Kanagawa, Patsorn Sangkloy, James Hays, Bernhard Sch\u00f6lkopf,\nand Arthur Gretton. Informative features for model comparison. In Advances in Neural\nInformation Processing Systems, 2018.\n\n[18] Wittawat Jitkrittum, Zolt\u00e1n Szab\u00f3, Kacper P Chwialkowski, and Arthur Gretton. Interpretable\ndistribution features with maximum testing power. In NIPS, pages 181\u2013189. 2016.\n\n[19] Wittawat Jitkrittum, Wenkai Xu, Zolt\u00e1n Szab\u00f3, Kenji Fukumizu, and Arthur Gretton. A linear-time\nkernel goodness-of-fit test. In Advances in Neural Information Processing Systems, pages\n262\u2013271, 2017.\n\n[20] Jason D Lee, Dennis L Sun, Yuekai Sun, and Jonathan E Taylor. Exact post-selection inference,\nwith application to the Lasso. The Annals of Statistics, 44(3):907\u2013927, 2016.\n\n[21] Hannes Leeb and Benedikt M P\u00f6tscher. Model selection and inference: Facts and fiction.\nEconometric Theory, 21(1):21\u201359, 2005.\n\n[22] Hannes Leeb, Benedikt M P\u00f6tscher, et al. Can one estimate the conditional distribution of\npost-model-selection estimators? The Annals of Statistics, 34(5):2554\u20132591, 2006.\n\n[23] Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit\ntests. In International Conference on Machine Learning, pages 276\u2013284, 2016.\n\n[24] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Large-scale celebfaces attributes\n(celeba) dataset. Retrieved August, 15:2018, 2018.\n\n[25] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density\nestimation. In Advances in Neural Information Processing Systems, pages 2338\u20132347, 2017.\n\n[26] Alex Smola, Arthur Gretton, Le Song, and Bernhard Sch\u00f6lkopf. A Hilbert space embedding\nfor distributions. 
In International Conference on Algorithmic Learning Theory, pages 13\u201331.\nSpringer, 2007.\n\n[27] Bharath K Sriperumbudur, Kenji Fukumizu, and Gert RG Lanckriet. Universality, characteristic\nkernels and rkhs embedding of measures. Journal of Machine Learning Research, 12(Jul):2389\u2013\n2410, 2011.\n\n[28] Ingo Steinwart. On the influence of the kernel on the consistency of support vector machines.\nJournal of machine learning research, 2(Nov):67\u201393, 2001.\n\n[29] Shinya Suzumura, Kazuya Nakagawa, Yuta Umezu, Koji Tsuda, and Ichiro Takeuchi. Selective\ninference for sparse high-order interaction models. In Proceedings of the 34th International\nConference on Machine Learning-Volume 70, pages 3338\u20133347. JMLR.org, 2017.\n\n[30] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking\nthe inception architecture for computer vision. In Proceedings of the IEEE conference\non computer vision and pattern recognition, pages 2818\u20132826, 2016.\n\n[31] Makoto Yamada, Denny Wu, Yao-Hung Hubert Tsai, Hirofumi Ohta, Ruslan Salakhutdinov,\nIchiro Takeuchi, and Kenji Fukumizu. Post selection inference with incomplete maximum mean\ndiscrepancy estimator. In International Conference on Learning Representations, 2019.\n\n[32] Jiasen Yang, Qiang Liu, Vinayak Rao, and Jennifer Neville. Goodness-of-fit testing for discrete\ndistributions via Stein discrepancy. In International Conference on Machine Learning, pages\n5557\u20135566, 2018.\n\n[33] Wojciech Zaremba, Arthur Gretton, and Matthew Blaschko. B-test: A non-parametric, low\nvariance kernel two-sample test. 
In Advances in neural information processing systems, pages\n755\u2013763, 2013.", "award": [], "sourceid": 1324, "authors": [{"given_name": "Jen Ning", "family_name": "Lim", "institution": "Max Planck Institute for Intelligent Systems"}, {"given_name": "Makoto", "family_name": "Yamada", "institution": "Kyoto University / RIKEN AIP"}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": "MPI for Intelligent Systems"}, {"given_name": "Wittawat", "family_name": "Jitkrittum", "institution": "Max Planck Institute for Intelligent Systems"}]}