{"title": "Robust Hypothesis Test for Nonlinear Effect with Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 795, "page_last": 803, "abstract": "This work constructs a hypothesis test for detecting whether an data-generating function $h: \\real^p \\rightarrow \\real$ belongs to a specific reproducing kernel Hilbert space $\\mathcal{H}_0$, where the structure of $\\mathcal{H}_0$ is only partially known. Utilizing the theory of reproducing kernels, we reduce this hypothesis to a simple one-sided score test for a scalar parameter, develop a testing procedure that is robust against the mis-specification of kernel functions, and also propose an ensemble-based estimator for the null model to guarantee test performance in small samples. To demonstrate the utility of the proposed method, we apply our test to the problem of detecting nonlinear interaction between groups of continuous features. We evaluate the finite-sample performance of our test under different data-generating functions and estimation strategies for the null model. Our results revealed interesting connection between notions in machine learning (model underfit/overfit) and those in statistical inference (i.e. Type I error/power of hypothesis test), and also highlighted unexpected consequences of common model estimating strategies (e.g. estimating kernel hyperparameters using maximum likelihood estimation) on model inference.", "full_text": "Robust Hypothesis Test for Nonlinear Effect\n\nwith Gaussian Processes\n\nJeremiah Zhe Liu, Brent Coull\n\nDepartment of Biostatistics\n\nHarvard University\n\nCambridge, MA 02138\n\n{zhl112@mail, bcoull@hsph}.harvard.edu\n\nAbstract\n\nThis work constructs a hypothesis test for detecting whether an data-generating\nfunction h : Rp \u2192 R belongs to a speci\ufb01c reproducing kernel Hilbert space\nH0, where the structure of H0 is only partially known. 
Utilizing the theory of reproducing kernels, we reduce this hypothesis to a simple one-sided score test for a scalar parameter, develop a testing procedure that is robust against the mis-specification of kernel functions, and also propose an ensemble-based estimator for the null model to guarantee test performance in small samples. To demonstrate the utility of the proposed method, we apply our test to the problem of detecting nonlinear interaction between groups of continuous features. We evaluate the finite-sample performance of our test under different data-generating functions and estimation strategies for the null model. Our results reveal interesting connections between notions in machine learning (model underfit/overfit) and those in statistical inference (i.e. Type I error/power of hypothesis test), and also highlight unexpected consequences of common model estimating strategies (e.g. estimating kernel hyperparameters using maximum likelihood estimation) on model inference.

1 Introduction

We study the problem of constructing a hypothesis test for an unknown data-generating function h : Rp → R, when h is estimated with a black-box algorithm (specifically, Gaussian process regression) from n observations {yi, xi} (i = 1, . . . , n). Specifically, we are interested in testing the hypothesis:

H0 : h ∈ H0    vs.    Ha : h ∈ Ha

where H0, Ha are the function spaces for h under the null and the alternative hypothesis. We assume only partial knowledge about H0. For example, we may assume H0 = {h | h(xi) = h(xi,1)} is the space of functions that depend only on x1, while claiming no knowledge about other properties (linearity, smoothness, etc.) of h. 
We pay special attention to the setting where the sample size n is small.

This type of test carries concrete significance in scientific studies. In areas such as genetics, drug trials and environmental health, a hypothesis test for a feature effect plays a critical role in answering scientific questions of interest. For example, assuming for simplicity x2×1 = [x1, x2]T, an investigator might inquire about the effect of drug dosage x1 on a patient's biometric measurement y (corresponding to the null hypothesis H0 = {h(x) = h(x2)}), or whether the adverse health effect of air pollutant x1 is modified by a patient's nutrient intake x2 (corresponding to the null hypothesis H0 = {h(x) = h1(x1) + h2(x2)}). In these studies, h usually represents some complex, nonlinear biological process whose exact mathematical properties are not known. Sample sizes in these studies are often small (n ≈ 100-200), due to the high monetary and time cost in subject recruitment and the lab analysis of biological samples.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

There exist two challenges in designing such a test. The first challenge arises from the low interpretability of the black-box model. It is difficult to formulate a hypothesis about feature effect in these models, since black-box models represent ˆh implicitly using a collection of basis functions constructed from the entire feature vector x, rather than a set of model parameters that can be interpreted in the context of some effect of interest. For example, consider testing for the interaction effect between x1 and x2. With the linear model h(xi) = xi1β1 + xi2β2 + xi1xi2β3, we can simply represent the interaction effect using a single parameter β3, and test for H0 : β3 = 0. On the other hand, a Gaussian process (GP) [16] models h(xi) = Σj=1..n k(xi, xj)αj using basis functions defined by the kernel function k. Since k is an implicit function that takes the entire feature vector as input, it is not clear how to represent the interaction effect in GP models. We address this challenge by assuming that h belongs to a reproducing kernel Hilbert space (RKHS) governed by a garrote kernel function kδ, such that H = H0 when δ = 0, and H = Ha otherwise. In this way, δ encodes exactly the feature effect of interest, and the null hypothesis h ∈ H0 can be equivalently stated as H0 : δ = 0. To test this hypothesis, we re-formulate the GP estimates as the variance components of a linear mixed model (LMM) [13], and derive a variance component score test which requires only model estimates under the null hypothesis.

Clearly, the performance of the hypothesis test depends on the quality of the model estimate under the null hypothesis, which gives rise to the second challenge: estimating the null model when having only partial knowledge about H0. In the case of a Gaussian process, this translates to having only partial knowledge about the kernel function k0. The performance of a Gaussian process is sensitive to the choice of the kernel function k(z, z′). In principle, the RKHS H generated by a proper kernel function k(z, z′) should be rich enough that it contains the data-generating function h, yet restrictive enough that ˆh does not overfit in small samples. Choosing a kernel function that is too restrictive or too flexible will lead to either model underfit or overfit, rendering the subsequent hypothesis tests invalid. We address this challenge by proposing an ensemble-based estimator for h that we term Cross-validated Kernel Ensemble (CVEK). 
Using a library of base kernels, CVEK learns a proper H from data by directly minimizing the ensemble model's cross-validation error, thereby guaranteeing robust test performance for a wide range of data-generating functions.

The rest of the paper is structured as follows. After a brief review of Gaussian processes and their connection with linear mixed models in Section 2, we introduce the test procedure for the general hypothesis h ∈ H0 in Section 3, and its companion estimation procedure CVEK in Section 4. To demonstrate the utility of the proposed test, in Section 5 we adapt our test to the problem of detecting nonlinear interaction between groups of continuous features, and in Section 6 we conduct simulation studies to evaluate the finite-sample performance of the interaction test, under different kernel estimation strategies and under a range of data-generating functions with different mathematical properties. Our simulation study reveals interesting connections between notions in machine learning and those in statistical inference, by elucidating the consequence of model estimation (underfit / overfit) on the Type I error and power of the subsequent hypothesis test. It also cautions against the use of some common estimation strategies (most notably, selecting kernel hyperparameters using maximum likelihood estimation) when conducting hypothesis tests in small samples, by highlighting inflated Type I errors from hypothesis tests based on the resulting estimates. We note that the methods and conclusions from this work are extendable beyond Gaussian process models, due to GP's connection to other black-box models such as random forests [5] and deep neural networks [19].

2 Background on Gaussian Process

Assume we observe data from n independent subjects. For the ith subject, let yi be a continuous response and xi the set of p continuous features that have a nonlinear effect on yi. 
We assume that the outcome yi depends on the features xi through the data-generating model below:

yi | h = µ + h(xi) + εi    where    εi ~iid N(0, λ)    (1)

and h : Rp → R follows the Gaussian process prior GP(0, k) governed by the positive definite kernel function k, such that the function evaluated at the observed records follows the multivariate normal (MVN) distribution:

h = [h(x1), . . . , h(xn)] ~ MVN(0, K)

with covariance matrix Ki,j = k(xi, xj). Under the above construction, the predictive distribution of h evaluated at the samples is also multivariate normal:

h | {yi, xi}i=1..n ~ MVN(h∗, K∗)
h∗ = K(K + λI)−1(y − µ)
K∗ = K − K(K + λI)−1K

To understand the impact of λ and k on h∗, recall that the Gaussian process can be understood as the Bayesian version of kernel machine regression, where h∗ equivalently arises from the optimization problem below:

h∗ = argmin_{h ∈ Hk} ||y − µ − h(x)||2 + λ||h||2_Hk

where Hk is the RKHS generated by the kernel function k. From this perspective, h∗ is the element in a spherical ball in Hk that best approximates the observed data y. The norm of h∗, ||h∗||2_Hk, is constrained by the tuning parameter λ, and the mathematical properties (e.g. smoothness, spectral density, etc.) of h∗ are governed by the kernel function k. It should be noticed that although h∗ may arise from Hk, the probability that the Gaussian process h ∈ Hk is 0 [14].

Gaussian Process as Linear Mixed Model. [13] argued that if we define τ = σ2/λ, h∗ can arise exactly from a linear mixed model (LMM):

y = µ + h + ε    where    h ~ N(0, τK),    ε ~ N(0, σ2I)    (2)

Therefore λ can be treated as part of the LMM's variance component parameters. 
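As a quick numerical sketch of the posterior mean formula h∗ = K(K + λI)−1(y − µ) above, the snippet below computes h∗ on toy data (the RBF kernel and all numbers are illustrative choices, not the paper's setup) and checks it against the algebraically equivalent form (y − µ) − λ(K + λI)−1(y − µ), which follows from K(K + λI)−1 = I − λ(K + λI)−1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 25
X = rng.normal(size=(n, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
mu, lam = y.mean(), 0.5

# illustrative RBF kernel matrix K_ij = exp(-0.5 * ||x_i - x_j||^2)
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq)

resid = y - mu
# posterior mean h* = K (K + lam I)^{-1} (y - mu)
h_star = K @ np.linalg.solve(K + lam * np.eye(n), resid)
# equivalent "shrunken residual" form from the matrix identity above
h_star2 = resid - lam * np.linalg.solve(K + lam * np.eye(n), resid)
```

The second form makes the role of λ transparent: larger λ shrinks h∗ harder toward zero, which is exactly the penalization view of the kernel machine regression above.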
If K is correctly specified, then the variance component parameters (τ, σ2) can be estimated unbiasedly by maximizing the Restricted Maximum Likelihood (REML) [12]:

LREML(µ, τ, σ2 | K) = −log|V| − log|1T V−1 1| − (y − µ)T V−1 (y − µ)    (3)

where V = τK + σ2I, and 1 is an n × 1 vector whose elements are all 1. However, it is worth noting that REML is a model-based procedure. Therefore improper estimates for λ = σ2/τ may arise when the family of kernel functions is mis-specified.

3 A recipe for the general hypothesis h ∈ H0

The GP-LMM connection introduced in Section 2 opens up the arsenal of statistical tools from linear mixed models for inference tasks in Gaussian processes. Here, we use the classical variance component test [12] to construct a testing procedure for a hypothesis about the Gaussian process function:

H0 : h ∈ H0.    (4)

We first translate the above hypothesis into a hypothesis in terms of model parameters. The key of our approach is to assume that h lies in a RKHS generated by a garrote kernel function kδ(z, z′) [15], which is constructed by including an extra garrote parameter δ in a given kernel function. When δ = 0, the garrote kernel function k0(x, x′) = kδ(x, x′)|δ=0 generates exactly H0, the space of functions under the null hypothesis. In order to adapt this general hypothesis to their hypothesis of interest, practitioners need only specify the form of the garrote kernel so that H0 corresponds to the null hypothesis. For example, if kδ(x) = k(δ ∗ x1, x2, . . . , xp), then δ = 0 corresponds to the null hypothesis H0 : h(x) = h(x2, . . . , xp), i.e. the function h(x) does not depend on x1. (As we'll see in Section 5, identifying such a k0 is not always straightforward.) 
As a result, the general hypothesis (4) is equivalent to

H0 : δ = 0.    (5)

We now construct a test statistic ˆT0 for (5) by noticing that the garrote parameter δ can be treated as a variance component parameter in the linear mixed model. This is because the Gaussian process under the garrote kernel can be formulated as the LMM below:

y = µ + h + ε    where    h ~ N(0, τKδ),    ε ~ N(0, σ2I)

where Kδ is the kernel matrix generated by kδ(z, z′). Consequently, we can derive a variance component test for H0 by calculating the squared derivative of LREML with respect to δ under H0 [12]:

ˆT0 = ˆτ ∗ (y − ˆµ)T V0−1 [∂K0] V0−1 (y − ˆµ)    (6)

where V0 = ˆσ2I + ˆτK0. In this expression, K0 = Kδ|δ=0, and ∂K0 is the null derivative kernel matrix whose (i, j)th entry is (∂/∂δ) kδ(x, x′)|δ=0.

As discussed previously, misspecifying the null kernel function k0 negatively impacts the performance of the resulting hypothesis test. To better understand the mechanism at play, we express the test statistic ˆT0 from (6) in terms of the model residual ˆε = y − ˆµ − ˆh:

ˆT0 = (ˆτ / ˆσ4) ∗ ˆεT [∂K0] ˆε,    (7)

where we have used the fact V0−1(y − ˆµ) = (ˆσ2)−1 ˆε [10]. As shown, the test statistic ˆT0 is a scaled quadratic-form statistic that is a function of the model residual. 
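The equivalence between the quadratic form (6) and the residual form (7) can be checked numerically. In the sketch below, the PSD matrices K0, ∂K0 and the parameter values are arbitrary illustrative choices (not REML estimates from data); the fitted ˆh is the usual LMM predictor τK0 V0−1(y − µ), so that ˆε = σ2 V0−1(y − µ) and the two forms of the statistic coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
A = rng.normal(size=(n, n)); K0 = A @ A.T / n    # illustrative PSD null kernel matrix
B = rng.normal(size=(n, n)); dK0 = B @ B.T / n   # illustrative PSD derivative kernel matrix
y = rng.normal(size=n)
mu, tau, sigma2 = 0.3, 0.7, 0.5                  # illustrative parameter values

V0 = sigma2 * np.eye(n) + tau * K0
V0_inv = np.linalg.inv(V0)
r = y - mu

# statistic as the quadratic form in (y - mu), i.e. (6)
T_quad = tau * r @ V0_inv @ dK0 @ V0_inv @ r

# statistic via the model residual, i.e. (7)
h_hat = tau * K0 @ V0_inv @ r                    # fitted h under the null
eps = r - h_hat                                  # model residual, equals sigma2 * V0^{-1} r
T_resid = (tau / sigma2 ** 2) * eps @ dK0 @ eps
```

Because ∂K0 is PSD, the statistic is non-negative, which is what makes a one-sided test appropriate.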
If k0 is too restrictive, model estimates will underfit the data even under the null hypothesis, introducing extraneous correlation among the ˆεi's, therefore leading to an overestimated ˆT0 and eventually an underestimated p-value under the null. In this case, the test procedure will frequently reject the null hypothesis (i.e. suggest the existence of a nonlinear interaction) even when there is in fact no interaction, yielding an invalid test due to inflated Type I error. On the other hand, if k0 is too flexible, model estimates will likely overfit the data in small samples, producing underestimated residuals, an underestimated test statistic, and overestimated p-values. In this case, the test procedure will too frequently fail to reject the null hypothesis (i.e. suggest there is no interaction) when there is in fact an interaction, yielding an insensitive test with diminished power.

The null distribution of ˆT0 can be approximated using a scaled chi-square distribution κχ2ν via the Satterthwaite method [20], by matching the first two moments of T:

κ ∗ ν = E(T),    2 ∗ κ2 ∗ ν = Var(T)

with solution (see Appendix for derivation):

ˆκ = ˆIδδ / [ˆτ ∗ tr(V0−1 ∂K0)],    ˆν = [ˆτ ∗ tr(V0−1 ∂K0)]2 / (2 ∗ ˆIδδ)

where ˆIδδ and ˆIδθ are submatrices of the REML information matrix. 
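The moment-matching system above has a simple closed-form solution: κ ∗ ν = E(T) and 2 ∗ κ2 ∗ ν = Var(T) give κ = Var(T)/(2E(T)) and ν = 2E(T)2/Var(T); the paper's ˆκ, ˆν then follow by plugging in REML-based expressions for the moments. The sketch below solves the generic system with illustrative moment values (not REML output):

```python
def satterthwaite(mean_T, var_T):
    # solve kappa * nu = E(T), 2 * kappa^2 * nu = Var(T)
    kappa = var_T / (2.0 * mean_T)
    nu = 2.0 * mean_T ** 2 / var_T
    return kappa, nu

# illustrative moments: E(T) = 3, Var(T) = 4
kappa, nu = satterthwaite(3.0, 4.0)
```

Substituting back confirms the match: κν = 3 = E(T) and 2κ2ν = 4 = Var(T).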
Numerically more accurate, but computationally less efficient, approximation methods are also available [2]. Finally, the p-value of this test is calculated by examining the tail probability of ˆκχ2ˆν:

p = P(ˆκχ2ˆν > ˆT0) = P(χ2ˆν > ˆT0/ˆκ)

A complete summary of the proposed testing procedure is available in Algorithm 1.

Algorithm 1 Variance Component Test for h ∈ H0
1: procedure VCT FOR INTERACTION
     Input: Null Kernel Matrix K0, Derivative Kernel Matrix ∂K0, Data (y, X)
     Output: Hypothesis Test p-value p
2:   # Step 1: Estimate Null Model using REML
     (ˆµ, ˆτ, ˆσ2) = argmax LREML(µ, τ, σ2 | K0) as in (3)
3:   # Step 2: Compute Test Statistic and Null Distribution Parameters
     ˆT0 = ˆτ ∗ (y − Xˆβ)T V0−1 [∂K0] V0−1 (y − Xˆβ)
4:   ˆκ = ˆIδδ / [ˆτ ∗ tr(V0−1 ∂K0)],    ˆν = [ˆτ ∗ tr(V0−1 ∂K0)]2 / (2 ∗ ˆIδδ)
5:   # Step 3: Compute p-value and reach conclusion
     p = P(ˆκχ2ˆν > ˆT0) = P(χ2ˆν > ˆT0/ˆκ)
6: end procedure

In light of the discussion about model misspecification in the Introduction, we highlight the fact that our proposed test (6) is robust against model misspecification under the alternative [12], since the calculation of the test statistic does not require a detailed parametric assumption about kδ. 
However, the\ntest is NOT robust against model misspeci\ufb01cation under the null, since the expression of both test\nstatistic \u02c6T0 and the null distribution parameters (\u02c6\u03ba, \u02c6\u03bd) still involve the kernel matrices generated by\nk0 (see Algorithm 1). In the next section, we address this problem by proposing a robust estimation\nprocedure for the kernel matrices under the null.\n\n4 Estimating Null Kernel Matrix using Cross-validated Kernel Ensemble\n\nObservation in (7) motivates the need for a kernel estimation strategy that is \ufb02exible so that it does\nnot under\ufb01t under the null, yet stable so that it does not over\ufb01t under the alternative. To this end, we\npropose estimating h using the ensemble of a library of \ufb01xed base kernels {kd}D\n2 = 1},\n\nu \u2208 \u2206 = {u|u \u2265 0,||u||2\n\nD(cid:88)\n\nud\n\n\u02c6hd(x)\n\n\u02c6h(x) =\n\nd=1:\n\n(8)\n\nd=1\n\nwhere \u02c6hd is the kernel predictor generated by dth base kernel kd. In order to maximize model stability,\nthe ensemble weights u are estimated to minimize the overall cross-validation error of \u02c6h. We term\nthis method the Cross-Validated Kernel Ensemble (CVEK). Our proposed method belongs to the\nwell-studied family of algorithms known as ensembles of kernel predictors (EKP) [7, 8, 3, 4], but with\nspecialized focus in maximizing the algorithm\u2019s cross-validation stability. Furthermore, in addition to\nproducing ensemble estimates \u02c6h, CVEK will also produce the ensemble estimate of the kernel matrix\n\u02c6K0 that is required by Algorithm 1. 
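The three CVEK stages described below can be sketched end-to-end on toy data. In the sketch, the two-kernel RBF library, the λ grid, the grid search over the weight constraint ||u||2 = 1 (a quarter-circle suffices for D = 2), and the choice λ = 1 when inverting the hat matrix in Stage 3 are all illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def rbf_kernel(X, sigma):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sigma * sq)

def loocv_residual(K, y, lam):
    # LOOCV residual vector (I - diag(A))^{-1} (y - A y), A = K (K + lam I)^{-1}
    A = K @ np.linalg.inv(K + lam * np.eye(len(y)))
    return (y - A @ y) / (1.0 - np.diag(A))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=30)

# Stage 1: tune lambda_d for each base kernel by minimizing the LOOCV error
kernels = [rbf_kernel(X, 0.5), rbf_kernel(X, 2.0)]
resids, hats = [], []
for K in kernels:
    grid = [10.0 ** p for p in range(-3, 2)]
    errs = [np.sum(loocv_residual(K, y, lam) ** 2) for lam in grid]
    lam = grid[int(np.argmin(errs))]
    resids.append(loocv_residual(K, y, lam))
    hats.append(K @ np.linalg.inv(K + lam * np.eye(len(y))))

# Stage 2: weights on {u >= 0, ||u||_2 = 1} minimizing ||sum_d u_d eps_d||^2
thetas = np.linspace(0.0, np.pi / 2, 181)
obj = [np.sum((np.cos(t) * resids[0] + np.sin(t) * resids[1]) ** 2) for t in thetas]
t = thetas[int(np.argmin(obj))]
u = np.array([np.cos(t), np.sin(t)])
A_hat = u[0] * hats[0] + u[1] * hats[1]          # ensemble hat matrix

# Stage 3: recover the ensemble kernel matrix from A_hat = K (K + I)^{-1},
# using the eigenvalue map delta -> delta / (1 - delta)
evals, U = np.linalg.eigh((A_hat + A_hat.T) / 2)
K_ens = U @ np.diag(evals / (1.0 - evals)) @ U.T
recon = K_ens @ np.linalg.inv(K_ens + np.eye(len(y)))   # should reproduce A_hat
```

The Stage 3 recovery is exact by construction: each eigenvalue δ of the hat matrix maps to the kernel eigenvalue δ/(1 − δ), which maps back to δ under K(K + I)−1.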
The exact algorithm proceeds in three stages as follows:

Stage 1: For each base kernel in the library {kd}d=1..D, we first estimate ˆhd = Kd(Kd + ˆλdI)−1 y, the prediction based on the dth kernel, where the tuning parameter ˆλd is selected by minimizing the leave-one-out cross validation (LOOCV) error [6]:

LOOCV(λ | kd) = (I − diag(Ad,λ))−1 (y − ˆhd,λ)    where    Ad,λ = Kd(Kd + λI)−1.    (9)

We denote the final LOOCV error for the dth kernel by ˆεd = LOOCV(ˆλd | kd).

Stage 2: Using the estimated LOOCV errors {ˆεd}d=1..D, estimate the ensemble weights u = {ud}d=1..D such that they minimize the overall LOOCV error:

ˆu = argmin_{u ∈ ∆} || Σd=1..D ud ˆεd ||2    where    ∆ = {u | u ≥ 0, ||u||22 = 1},

and produce the final ensemble prediction:

ˆh = Σd=1..D ˆud ˆhd = Σd=1..D ˆud Ad,ˆλd y = ˆA y,

where ˆA = Σd=1..D ˆud Ad,ˆλd is the ensemble hat matrix.

Stage 3: Using the ensemble hat matrix ˆA, estimate the ensemble kernel matrix ˆK by solving:

ˆK(ˆK + λI)−1 = ˆA.

Specifically, if we denote by UA and {δA,k}k=1..n the eigenvectors and eigenvalues of ˆA, then ˆK adopts the form (see Appendix for derivation):

ˆK = UA diag(δA,k / (1 − δA,k)) UAT

5 Application: Testing for Nonlinear Interaction

Recall that in Section 3 we assumed we are given a kδ that generates exactly H0. However, depending on the exact hypothesis of interest, identifying such a k0 is not always straightforward. 
In this section, we revisit the example of interaction testing discussed under Challenge 1 in the Introduction, and consider how to build a k0 for the hypothesis of interest below:

H0 : h(x) = h1(x1) + h2(x2)
Ha : h(x) = h1(x1) + h2(x2) + h12(x1, x2)

where h12 is the "pure interaction" function that is orthogonal to the main effect functions h1 and h2. Recall that, as discussed previously, this hypothesis is difficult to formulate with Gaussian process models, since the kernel functions k(x, x′) in general do not explicitly separate the main and the interaction effects. Therefore, rather than directly defining k0, we need to first construct the H0 and Ha that correspond to the null and alternative hypotheses, and then identify the garrote kernel function kδ such that it generates exactly H0 when δ = 0 and Ha when δ > 0.

We build H0 using the tensor-product construction of RKHSs on the product domain (x1,i, x2,i) ∈ Rp1 × Rp2 [9], due to this approach's unique ability to explicitly characterize the space of "pure interaction" functions. Let 1 = {f | f ∝ 1} be the RKHS of constant functions, and H1, H2 be the RKHSs of centered functions of x1 and x2, respectively. We can then define the full space as H = ⊗m=1..2 (1 ⊕ Hm). H describes the space of functions that depend jointly on {x1, x2}, and adopts the orthogonal decomposition below:

H = (1 ⊕ H1) ⊗ (1 ⊕ H2)
  = 1 ⊕ {H1 ⊕ H2} ⊕ {H1 ⊗ H2}
  = 1 ⊕ H⊥12 ⊕ H12

where we have denoted H⊥12 = H1 ⊕ H2 and H12 = H1 ⊗ H2, respectively. We see that H12 is indeed the space of "pure interaction" functions, since H12 contains functions on the product domain Rp1 × Rp2, but is orthogonal to the space of additive main effect functions H⊥12. 
To summarize, we have identified two function spaces H0 and Ha that have the desired interpretation:

H0 = H⊥12,    Ha = H⊥12 ⊕ H12

We are now ready to identify the garrote kernel kδ(x, x′). To this end, we notice that both H0 and H12 are composite spaces built from basis RKHSs using direct sums and tensor products. If we denote by km(xm, x′m) the reproducing kernel associated with Hm, we can construct kernel functions for the composite spaces H0 and H12 as [1]:

k0(x, x′) = k1(x1, x′1) + k2(x2, x′2)
k12(x, x′) = k1(x1, x′1) k2(x2, x′2)

and consequently, the garrote kernel function for Ha:

kδ(x, x′) = k0(x, x′) + δ ∗ k12(x, x′).    (10)

Finally, using the chosen form of the garrote kernel function, the (i, j)th element of the null derivative kernel matrix is (∂/∂δ) kδ(x, x′) = k12(x, x′), i.e. the null derivative kernel matrix ∂K0 is simply the kernel matrix K12 that corresponds to the interaction space. Therefore the score test statistic ˆT0 in (6) simplifies to:

ˆT0 = ˆτ ∗ (y − Xˆβ)T V0−1 K12 V0−1 (y − Xˆβ)    (11)

where V0 = ˆσ2I + ˆτK0.

6 Simulation Experiment

We evaluated the finite-sample performance of the proposed interaction test in a simulation study that is analogous to a real nutrition-environment interaction study. We generate two groups of input features (xi,1, xi,2) ∈ Rp1 × Rp2 independently from the standard Gaussian distribution, representing normalized data on a subject's level of exposure to p1 environmental pollutants and on the subject's intake of p2 nutrients during the study. Throughout the simulation scenarios, we keep n = 100 and p1 = p2 = 5. 
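For data of this shape, the kernel matrices of Section 5 can be assembled directly: the additive null kernel gives K0 = K1 + K2, while the tensor-product interaction kernel k12(x, x′) = k1(x1, x′1) k2(x2, x′2) gives K12 as the elementwise (Hadamard) product of K1 and K2, so that ∂K0 = K12. The RBF base kernels below are an illustrative choice, not the paper's specification:

```python
import numpy as np

def rbf(X, sigma=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sigma * sq)

rng = np.random.default_rng(2)
n, p1, p2 = 100, 5, 5
X1 = rng.normal(size=(n, p1))   # group 1 features (e.g. pollutant exposures)
X2 = rng.normal(size=(n, p2))   # group 2 features (e.g. nutrient intakes)

K1, K2 = rbf(X1), rbf(X2)
K0 = K1 + K2                    # null kernel matrix: additive main effects
K12 = K1 * K2                   # derivative kernel matrix dK0 = K12 (Hadamard product)
# garrote kernel matrix: K_delta = K0 + delta * K12
```

By the Schur product theorem, K12 inherits positive semi-definiteness from K1 and K2, so it is a valid kernel matrix for the interaction space.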
We generate the outcome yi as:

yi = h1(xi,1) + h2(xi,2) + δ ∗ h12(xi,1, xi,2) + εi    (12)

where h1, h2, h12 are sampled from the RKHSs H1, H2 and H1 ⊗ H2, generated using a ground-truth kernel ktrue. We standardize all sampled functions to have unit norm, so that δ represents the strength of the interaction relative to the main effect.

For each simulation scenario, we first generated data using δ and ktrue as above, then selected a kmodel to estimate the null model and obtain the p-value using Algorithm 1. We repeated each scenario 1000 times, and evaluate the test performance using the empirical probability ˆP(p ≤ 0.05). Under the null hypothesis H0 : δ = 0, ˆP(p ≤ 0.05) estimates the test's Type I error, and should be smaller than or equal to the significance level 0.05. Under the alternative hypothesis Ha : δ > 0, ˆP(p ≤ 0.05) estimates the test's power, and should ideally approach 1 quickly as the strength of interaction δ increases.

In this study, we varied ktrue to produce data-generating functions hδ(xi,1, xi,2) with different smoothness and complexity properties, and varied kmodel to reflect different common modeling strategies for the null model in addition to using CVEK. We then evaluated how these two aspects impact the hypothesis test's Type I error and power.

Data-generating Functions. We sampled the data-generating functions using ktrue from the Matérn kernel family [16]:

k(r | ν, σ) = (21−ν / Γ(ν)) ∗ (√(2ν) σ ||r||)ν Kν(√(2ν) σ ||r||),    where    r = x − x′,

with two non-negative hyperparameters (ν, σ). For a function h sampled using a Matérn kernel, ν determines the function's smoothness, since h is k-times mean square differentiable if and only if ν > k. 
In the case ν → ∞, the Matérn kernel reduces to the Gaussian RBF kernel, which is infinitely differentiable. σ determines the function's complexity: this is because in Bochner's spectral decomposition [16]

k(r | ν, σ) = ∫ e2πi sT r dS(s | ν, σ),    (13)

σ determines how much weight the spectral density S(s) puts on the slow-varying, low-frequency basis functions. In this work, we vary ν ∈ {3/2, 5/2, ∞} to generate once-, twice-, and infinitely-differentiable functions, and vary σ ∈ {0.5, 1, 1.5} to generate functions with varying degrees of complexity.

Modeling Strategies

Polynomial kernels are a family of simple parametric kernels that are equivalent to the polynomial ridge regression models favored by statisticians for their interpretability. In this work, we use the linear kernel klinear(x, x′ | p) = xT x′ and the quadratic kernel kquad(x, x′ | p) = (1 + xT x′)2, which are common choices from this family.

Gaussian RBF kernels kRBF(x, x′ | σ) = exp(−σ||x − x′||2) are a general-purpose kernel family that generates nonlinear, but infinitely differentiable (therefore very smooth) functions. Under this kernel, we consider two hyperparameter selection strategies common in machine learning applications: RBF-Median, where we set σ to the sample median of {||xi − xj||}i≠j, and RBF-MLE, which estimates σ by maximizing the model likelihood.

Matérn and Neural Network kernels are two flexible kernel families favored by machine learning practitioners for their expressiveness. Matérn kernels generate functions that are more flexible than those of the Gaussian RBF due to the relaxed smoothness constraint [17]. 
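For the half-integer orders used in this study, the Bessel-function form of the Matérn kernel simplifies to well-known closed forms, sketched below in the paper's parameterization (σ acting as an inverse length-scale; the evaluation points are illustrative):

```python
import math

def matern_32(r, sigma=1.0):
    # nu = 3/2: k(r) = (1 + a) * exp(-a), a = sqrt(2*nu)*sigma*|r| = sqrt(3)*sigma*|r|
    a = math.sqrt(3.0) * sigma * abs(r)
    return (1.0 + a) * math.exp(-a)

def matern_52(r, sigma=1.0):
    # nu = 5/2: k(r) = (1 + a + a^2/3) * exp(-a), a = sqrt(5)*sigma*|r|
    a = math.sqrt(5.0) * sigma * abs(r)
    return (1.0 + a + a * a / 3.0) * math.exp(-a)
```

Both forms satisfy k(0) = 1 and decay monotonically in ||r||; increasing σ speeds up the decay, matching its role above as the complexity parameter that shifts spectral mass toward high frequencies.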
In order to investigate the consequences of added flexibility relative to the true model, we use Matérn 1/2, Matérn 3/2 and Matérn 5/2, corresponding to Matérn kernels with ν = 1/2, 3/2, and 5/2. Neural network kernels [16]

knn(x, x′ | σ) = (2/π) ∗ sin−1( 2σ x̃T x̃′ / √((1 + 2σ x̃T x̃)(1 + 2σ x̃′T x̃′)) ),

on the other hand, represent a 1-layer Bayesian neural network with infinitely many hidden units and a probit link function, with σ being the prior variance on the hidden weights. Therefore knn is flexible in the sense that it is a universal approximator for arbitrary continuous functions on a compact domain [11]. In this work, we use NN 0.1, NN 1 and NN 10 to denote Bayesian networks with different prior constraints σ ∈ {0.1, 1, 10}.

Result

The simulation results are presented graphically in Figure 1 and documented in detail in the Appendix.

Figure 1: Estimated ˆP(p < 0.05) (y-axis) as a function of interaction strength δ ∈ [0, 1] (x-axis), with panels (a)-(i) corresponding to ktrue ∈ {Matérn 3/2, Matérn 5/2, Gaussian RBF} crossed with σ ∈ {0.5, 1, 1.5}. Sky blue: Linear (solid) and Quadratic (dashed) kernels; black: RBF-Median (solid) and RBF-MLE (dashed); dark blue: Matérn kernels with ν = 1/2, 3/2, 5/2; purple: Neural Network kernels with σ = 0.1, 1, 10; red: CVEK based on RBF (solid) and Neural Networks (dashed). The horizontal line marks the test's significance level (0.05). When δ = 0, ˆP should be below this line.

We first observe that for a reasonably specified kmodel, the proposed hypothesis test always has the correct Type I error and reasonable power. We also observe that the complexity of the data-generating function hδ (12) plays a role in test performance, in the sense that the power of the hypothesis test increases as the Matérn ktrue's complexity parameter σ becomes larger, which corresponds to functions that put more weight on the complex, fast-varying eigenfunctions in (13).

We observed clear differences in test performance between the different estimation strategies. In general, polynomial models (linear and quadratic) are too restrictive and appear to underfit the data under both the null and the alternative, producing inflated Type I error and diminished power. On the other hand, lower-order Matérn kernels (Matérn 1/2 and Matérn 3/2, dark blue lines) appear to be too flexible. Whenever data are generated from a smoother ktrue, Matérn 1/2 and 3/2 overfit the data and produce deflated Type I error and severely diminished power, even if the hyperparameter σ is fixed at the true value. Therefore, unless there is strong evidence that h exhibits behavior consistent with that described by these kernels, we recommend avoiding the use of either polynomial or lower-order Matérn kernels for hypothesis testing. Comparatively, the Gaussian RBF works well for a wider range of ktrue's, but only if the hyperparameter σ is selected carefully. Specifically, RBF-Median (black dashed line) works generally well, despite being slightly conservative (i.e. lower power) when the data-generating function is smooth and of low complexity. RBF-MLE (black solid line), on the other hand, tends to underfit the data under the null and exhibits inflated Type I error, possibly because σ is not strongly identified when the sample size is too small [18]. 
The situation becomes more severe as h_δ becomes rougher and more complex; in the moderately extreme case of a non-differentiable h with σ = 1.5, the Type I error is inflated to as high as 0.238. Neural network kernels also perform well for a wide range of ktrue's, and their Type I error is more robust to the specification of hyperparameters. Finally, the two ensemble estimators CVEK-RBF (based on kRBF's with log(σ) ∈ {−2, −1, 0, 1, 2}) and CVEK-NN (based on kNN's with σ ∈ {0.1, 1, 10, 50}) perform as well as or better than the non-ensemble approaches for all ktrue's, despite being slightly conservative under the null. Compared to CVEK-NN, CVEK-RBF appears to be slightly more powerful.

References

[1] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

[2] D. A. Bodenham and N. M. Adams. A comparison of efficient approximations for a weighted sum of chi-squared random variables. Statistics and Computing, 26(4):917–928, July 2016.

[3] C. Cortes, M. Mohri, and A. Rostamizadeh. Two-Stage Learning Kernel Algorithms. 2010.

[4] C. Cortes, M. Mohri, and A. Rostamizadeh. Ensembles of Kernel Predictors. arXiv:1202.3712 [cs, stat], Feb. 2012. arXiv: 1202.3712.

[5] A. Davies and Z. Ghahramani. The Random Forest Kernel and other kernels for big data from random partitions. arXiv:1402.4293 [cs, stat], Feb. 2014. arXiv: 1402.4293.

[6] A. Elisseeff and M. Pontil. Leave-one-out Error and Stability of Learning Algorithms with Applications. In J. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, editors, Learning Theory and Practice. IOS Press, 2002.

[7] T. Evgeniou, L. Perez-Breva, M.
Pontil, and T. Poggio. Bounds on the Generalization Performance of Kernel Machine Ensembles. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, pages 271–278, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

[8] T. Evgeniou, M. Pontil, and A. Elisseeff. Leave One Out Error, Stability, and Generalization of Voting Combinations of Classifiers. Machine Learning, 55(1):71–97, Apr. 2004.

[9] C. Gu. Smoothing Spline ANOVA Models. Springer Science & Business Media, Jan. 2013.

[10] D. A. Harville. Maximum Likelihood Approaches to Variance Component Estimation and to Related Problems. Journal of the American Statistical Association, 72(358):320–338, 1977.

[11] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

[12] X. Lin. Variance component testing in generalised linear models with random effects. Biometrika, 84(2):309–326, June 1997.

[13] D. Liu, X. Lin, and D. Ghosh. Semiparametric Regression of Multidimensional Genetic Pathway Data: Least-Squares Kernel Machines and Linear Mixed Models. Biometrics, 63(4):1079–1088, Dec. 2007.

[14] M. N. Lukić and J. H. Beder. Stochastic Processes with Sample Paths in Reproducing Kernel Hilbert Spaces. Transactions of the American Mathematical Society, 353(10):3945–3969, 2001.

[15] A. Maity and X. Lin. Powerful tests for detecting a gene effect in the presence of possible gene-gene interactions using garrote kernel machines. Biometrics, 67(4):1271–1284, Dec. 2011.

[16] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. University Press Group Limited, Jan. 2006.

[17] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. arXiv:1206.2944 [cs, stat], June 2012.
arXiv: 1206.2944.

[18] G. Wahba. Spline Models for Observational Data. SIAM, Sept. 1990.

[19] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep Kernel Learning. arXiv:1511.02222 [cs, stat], Nov. 2015. arXiv: 1511.02222.

[20] D. Zhang and X. Lin. Hypothesis testing in semiparametric additive mixed models. Biostatistics (Oxford, England), 4(1):57–74, Jan. 2003.