{"title": "DeepPINK: reproducible feature selection in deep neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 8676, "page_last": 8686, "abstract": "Deep learning has become increasingly popular in both supervised and unsupervised machine learning thanks to its outstanding empirical performance. However, because of their intrinsic complexity, most deep learning methods are largely treated as black box tools with little interpretability. Even though recent attempts have been made to facilitate the interpretability of deep neural networks (DNNs), existing methods are susceptible to noise and lack of robustness. Therefore, scientists are justifiably cautious about the reproducibility of the discoveries, which is often related to the interpretability of the underlying statistical models. In this paper, we describe a method to increase the interpretability and reproducibility of DNNs by incorporating the idea of feature selection with controlled error rate. By designing a new DNN architecture and integrating it with the recently proposed knockoffs framework, we perform feature selection with a controlled error rate, while maintaining high power. 
This new method, DeepPINK (Deep feature selection using Paired-Input Nonlinear Knockoffs), is applied to both simulated and real data sets to demonstrate its empirical utility.", "full_text": "DeepPINK: reproducible feature selection in deep\n\nneural networks\n\nYang Young Lu \u2217\n\nDepartment of Genome Sciences\n\nUniversity of Washington\n\nSeattle, WA 98195\nylu465@uw.edu\n\nYingying Fan \u2217\n\nData Sciences and Operations Department\n\nMarshall School of Business\n\nUniversity of Southern California\n\nLos Angeles, CA 90089\n\nfanyingy@marshall.usc.edu\n\nJinchi Lv\n\nData Sciences and Operations Department\n\nMarshall School of Business\n\nUniversity of Southern California\n\nLos Angeles, CA 90089\n\njinchilv@marshall.usc.edu\n\nDepartment of Genome Sciences and Department of Computer Science and Engineering\n\nWilliam Stafford Noble\n\nUniversity of Washington\n\nSeattle, WA 98195\n\nwilliam-noble@uw.edu\n\nAbstract\n\nDeep learning has become increasingly popular in both supervised and unsuper-\nvised machine learning thanks to its outstanding empirical performance. However,\nbecause of their intrinsic complexity, most deep learning methods are largely treat-\ned as black box tools with little interpretability. Even though recent attempts have\nbeen made to facilitate the interpretability of deep neural networks (DNNs), exist-\ning methods are susceptible to noise and lack of robustness. Therefore, scientists\nare justi\ufb01ably cautious about the reproducibility of the discoveries, which is often\nrelated to the interpretability of the underlying statistical models. In this paper, we\ndescribe a method to increase the interpretability and reproducibility of DNNs by\nincorporating the idea of feature selection with controlled error rate. By designing\na new DNN architecture and integrating it with the recently proposed knockoffs\nframework, we perform feature selection with a controlled error rate, while main-\ntaining high power. 
This new method, DeepPINK (Deep feature selection using\nPaired-Input Nonlinear Knockoffs), is applied to both simulated and real data sets\nto demonstrate its empirical utility. 2\n\n1\n\nIntroduction\n\nRapid advances in machine learning techniques have revolutionized our everyday lives and had\nprofound impacts on many contemporary domains such as decision making, healthcare, and \ufb01nance\n\n\u2217These authors contributed equally to this work.\n2 All code and data will be available here: github.com/younglululu/DeepPINK.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f[28]. In particular, deep learning has received much attention in recent years thanks to its outstanding\nempirical performance in both supervised and unsupervised machine learning tasks. However, due to\nthe complicated nature of deep neural networks (DNNs) and other deep learning methods, they have\nbeen mostly treated as black box tools. In many scienti\ufb01c areas, interpretability of scienti\ufb01c results is\nbecoming increasingly important, as researchers aim to understand why and how a machine learning\nsystem makes a certain decision. For example, if a clinical image is predicted to be either benign\nor malignant, then the doctors are eager to know which parts of the image drive such a prediction.\nAnalogously, if a \ufb01nancial transaction is \ufb02agged to be fraudulent, then the security teams want to\nknow which activities or behaviors led to the \ufb02agging. Therefore, an explainable and interpretable\nsystem to reason about why certain decisions are made is critical to convince domain experts [27].\nThe interpretability of conventional machine learning models, such as linear regression, random\nforests, and support vector machines, has been studied for decades. Recently, identifying explainable\nfeatures that contribute the most to DNN predictions has received much attention. 
Existing methods\neither \ufb01t a simple model in the local region around the input [33, 38] or locally perturb the input to see\nhow the prediction changes [35, 6, 34, 37]. Though these methods can yield insightful interpretations,\nthey focus on speci\ufb01c architectures of DNNs and can be dif\ufb01cult to generalize. Worse still, Ghorbani\net al. [22] systematically revealed the fragility of these widely-used methods and demonstrated that\neven small random perturbation can dramatically change the feature importance. For example, if a\nparticular region of a clinical image is highlighted to explain a malignant classi\ufb01cation, the doctors\nwould then focus on that region for further investigation. However, it would be highly problematic if\nthe choice of highlighted region varied dramatically in the presence of very small amounts of noise.\nIn such a setting, it is desirable for practitioners to select explanatory features in a fashion that is\nrobust and reproducible, even in the presence of noise. Though considerable work has been devoted\nto creating feature selection methods that select relevant features, it is less common to carry out\nfeature selection while explicitly controlling the associated error rate. Among different feature\nselection performance measures, the false discovery rate (FDR) [4] can be exploited to measure the\nperformance of feature selection algorithms. Informally, the FDR is the expected proportion of falsely\nselected features among all selected features, where a false discovery is a feature that is selected but is\nnot truly relevant (For a formal de\ufb01nition of FDR, see section 2.1). 
Commonly used procedures, such\nas the Benjamini\u2013Hochberg (BHq) procedure [4], achieve FDR control by working with p-values\ncomputed against some null hypothesis, indicating observations that are less likely to be null.\nIn the feature selection setting, existing methods for FDR control utilize the p-values produced\nby an algorithm for evaluating feature importance, under the null hypothesis that the feature is\nnot relevant. Speci\ufb01cally, for each feature, one tests the signi\ufb01cance of the statistical association\nbetween the speci\ufb01c feature and the response either jointly or marginally and obtains a p-value. These\np-values are then used to rank the feature importance for FDR control. However, for DNNs, how to\nproduce meaningful p-values that can re\ufb02ect feature importance is still completely unknown. See,\ne.g., [19] for the nonuniformity of p-values for the speci\ufb01c case of diverging-dimensional generalized\nlinear models. Without a way to produce appropriate p-values, performing feature selection with a\ncontrolled error rate in deep learning is highly challenging.\nTo bypass the use of p-values but still achieve FDR control, Cand\u00e8s et al. [10] proposed the model-X\nknockoffs framework for feature selection with controlled FDR. The salient idea is to generate\nknockoff features that perfectly mimic the arbitrary dependence structure among the original features\nbut are conditionally independent of the response given the original features. Then these knockoff\nfeatures can be used as control in feature selection by comparing the feature importance between\noriginal features and their knockoff counterpart. See more details in section 2.2 for a review of the\nmodel-X knockoffs framework and [2, 3, 10] for more details on different knockoff \ufb01lters and their\ntheoretical guarantees.\nIn this paper, we integrate the idea of a knockoff \ufb01lter with DNNs to enhance the interpretability\nof the learned network model. 
Through simulation studies, we discover, surprisingly, that naively combining the knockoffs idea with a multilayer perceptron (MLP) yields extremely low power in most cases (though the FDR is still controlled). To resolve this issue, we propose a new DNN architecture named DeepPINK (Deep feature selection using Paired-Input Nonlinear Knockoffs). DeepPINK is built upon an MLP, with the major distinction that it has a plugin pairwise-coupling layer containing p filters, one per input feature, where each filter connects the original feature and its knockoff counterpart. We demonstrate empirically that DeepPINK achieves FDR control with much higher power than many state-of-the-art methods in the literature. We also apply DeepPINK to two real data sets to demonstrate its empirical utility. It is also worth mentioning that the idea of DeepPINK may be generalized to other deep neural networks such as CNNs and RNNs, which is the subject of our ongoing research.

2 Background

2.1 Model setting

Consider a supervised learning task where we have n independent and identically distributed (i.i.d.) observations (xi, Yi), i = 1, ..., n, with xi ∈ R^p the feature vector and Yi the scalar response. Here we consider the high-dimensional setting where the feature dimensionality p can be much larger than the sample size n. Assume that there exists a subset S0 ⊂ {1, ..., p} such that, conditional on the features in S0, the response Yi is independent of the features in the complement S0^c.
Our goal is to learn the dependence structure of Yi on xi so that effective prediction can be made with the fitted model and, meanwhile, to achieve accurate feature selection in the sense of identifying the features in S0 with a controlled error rate.

2.2 False discovery rate control and the knockoff filter

To measure the accuracy of feature selection, various performance measures have been proposed. The false discovery rate (FDR) is among the most popular ones. For a set of features Ŝ selected by some feature selection procedure, the FDR is defined as

FDR = E[FDP] with FDP = |Ŝ ∩ S0^c| / |Ŝ|,

where |·| stands for the cardinality of a set. Many methods have been proposed to achieve FDR control [5, 36, 1, 15, 16, 40, 14, 23, 20, 44, 17]. However, as discussed in Section 1, most of these existing methods rely on p-values and cannot be adapted to the setting of DNNs.

In this paper, we focus on the recently introduced model-X knockoffs framework [10]. The model-X knockoffs framework provides an elegant way to achieve FDR control in a feature selection setting at some target level q, in finite samples and with an arbitrary dependency structure between the response and the features. The idea of knockoff filters was originally proposed in Gaussian linear models [2, 3]. The model-X knockoffs framework generalizes the original method to work in arbitrary, nonlinear models. In brief, the knockoff filter achieves FDR control in two steps: 1) construction of knockoff features, and 2) filtering using knockoff statistics.

Definition 1 ([10]).
Model-X knockoff features for the family of random features x = (X1, ..., Xp)^T are a new family of random features x̃ = (X̃1, ..., X̃p)^T that satisfies two properties: (1) (x, x̃)_swap(S) is equal in distribution to (x, x̃) for any subset S ⊂ {1, ..., p}, where swap(S) means swapping Xj and X̃j for each j ∈ S; and (2) x̃ is independent of the response Y given the features x.

According to Definition 1, the construction of knockoffs is totally independent of the response Y. If we can construct a set of model-X knockoff features, then by comparing the original features with these control features, the FDR can be controlled at the target level q. See [10] for theoretical guarantees of FDR control with knockoff filters.

Clearly, the construction of model-X knockoff features plays a key role in FDR control. In some special cases, such as x ∼ N(0, Σ) with Σ ∈ R^{p×p} the covariance matrix, the model-X knockoff features can be constructed easily. More specifically, if x ∼ N(0, Σ), then a valid construction of knockoff features is

x̃ | x ∼ N(x − diag{s}Σ^{−1}x, 2 diag{s} − diag{s}Σ^{−1}diag{s}).   (1)

Here diag{s}, with all components of s positive, is a diagonal matrix chosen subject to the requirement that the conditional covariance matrix in Equation 1 is positive definite.
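The Gaussian construction in Equation 1 can be sketched in a few lines of NumPy. This is a minimal illustration under our own naming (`gaussian_knockoffs` is not from the paper's code), with s chosen just small enough to keep the conditional covariance positive definite:

```python
import numpy as np

def gaussian_knockoffs(X, Sigma, s, rng):
    """Sample model-X knockoffs for rows of X ~ N(0, Sigma) via Equation 1."""
    Sigma_inv = np.linalg.inv(Sigma)
    D = np.diag(s)
    cond_mean = X - X @ Sigma_inv @ D        # rowwise x - diag{s} Sigma^{-1} x
    cond_cov = 2 * D - D @ Sigma_inv @ D     # 2 diag{s} - diag{s} Sigma^{-1} diag{s}
    noise = rng.standard_normal(X.shape)
    return cond_mean + noise @ np.linalg.cholesky(cond_cov).T

rng = np.random.default_rng(0)
p = 10
# An illustrative covariance matrix (AR(1)-type correlations):
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=5000)
# Equal entries of s, scaled so that cond_cov stays positive definite:
s = np.full(p, 0.9 * min(1.0, 2 * np.linalg.eigvalsh(Sigma).min()))
X_tilde = gaussian_knockoffs(X, Sigma, s, rng)
```

With this choice of s, one can check empirically that the sample covariance of X̃ is close to Σ and the cross-covariance of (X, X̃) is close to Σ − diag{s}, matching the joint distribution in Equation 2.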
Following the above knockoffs construction, the original features and the model-X knockoff features have the following joint distribution:

(x, x̃) ∼ N( (0, 0)^T, [[Σ, Σ − diag{s}], [Σ − diag{s}, Σ]] ).   (2)

Intuitively, a larger s implies that the constructed knockoff features are more different from the original features and thus can increase the power of the method.

With the constructed knockoff features x̃, we quantify feature importance via the knockoffs by resorting to the knockoff statistics Wj = gj(Zj, Z̃j) for 1 ≤ j ≤ p, where Zj and Z̃j represent feature importance measures for the jth feature Xj and its knockoff counterpart X̃j, respectively, and gj(·,·) is an antisymmetric function satisfying gj(Zj, Z̃j) = −gj(Z̃j, Zj). Note that the feature importance measures, as well as the knockoff statistics, depend on the specific algorithm used to fit the model. For example, in linear regression models one can choose Zj and Z̃j as the Lasso regression coefficients of Xj and X̃j, respectively, and a valid knockoff statistic is then Wj = |Zj| − |Z̃j|. In principle, the knockoff statistics Wj should satisfy a coin-flip property such that swapping an arbitrary pair Xj and its knockoff counterpart X̃j only changes the sign of Wj but keeps the signs of the other Wk (k ≠ j) unchanged [10]. A desirable property for the knockoff statistics Wj is that important features are expected to have large positive values, whereas unimportant ones should have small magnitudes symmetric around 0.

Given the knockoff statistics as feature importance measures, we sort the |Wj| in decreasing order and select features whose Wj exceed some threshold T.
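The statistic Wj = |Zj| − |Z̃j| and the filtering step can be sketched as follows. For simplicity this illustration uses ordinary least-squares coefficient magnitudes on the augmented design as a stand-in for the Lasso coefficients mentioned above, and applies the knockoff+ threshold T+ defined in Equation 3; the function name and the OLS substitution are our own, not part of the paper's procedure:

```python
import numpy as np

def knockoff_filter(X, X_tilde, y, q=0.2):
    """Compute Wj = |Zj| - |Z~j| from coefficients on [X, X~] and apply
    the knockoff+ threshold T+ (Equation 3)."""
    p = X.shape[1]
    # OLS stand-in for the Lasso fit on the augmented design:
    coef, *_ = np.linalg.lstsq(np.hstack([X, X_tilde]), y, rcond=None)
    W = np.abs(coef[:p]) - np.abs(coef[p:])
    T_plus = np.inf
    for t in np.sort(np.unique(np.abs(W[W != 0]))):
        # Conservative FDP estimate at candidate threshold t:
        if (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t)) <= q:
            T_plus = t
            break
    return W, T_plus, np.flatnonzero(W >= T_plus)

# Illustrative check with i.i.d. N(0, I) features, for which independent
# N(0, I) draws are valid knockoffs (Equation 1 with Sigma = I and s = 1):
rng = np.random.default_rng(42)
n, p = 1000, 10
X = rng.standard_normal((n, p))
X_tilde = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:6] = 3.0                         # six strong signals
y = X @ beta + rng.standard_normal(n)
W, T_plus, selected = knockoff_filter(X, X_tilde, y, q=0.2)
```

Features with Wj ≥ T+ are selected; in this toy run the six signal features all clear the threshold.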
In particular, two choices of threshold are suggested [2, 10]:

T = min{ t ∈ W : |{j : Wj ≤ −t}| / (1 ∨ |{j : Wj ≥ t}|) ≤ q },
T+ = min{ t ∈ W : (1 + |{j : Wj ≤ −t}|) / (1 ∨ |{j : Wj ≥ t}|) ≤ q },   (3)

where W = {|Wj| : 1 ≤ j ≤ p} \ {0} is the set of unique nonzero values attained by the |Wj| and q ∈ (0, 1) is the desired FDR level specified by the user. It was proved in [10] that, under mild conditions on the Wj, the knockoff filter with T controls a slightly modified version of the FDR and the knockoff+ filter with T+ controls the exact FDR.

When the joint distribution of x is unknown, to construct knockoff features one needs to estimate this distribution from data. For the case of Gaussian design, approximate knockoff features can be constructed by replacing Σ^{−1} in Equation 1 with the estimated precision matrix Ω̂. Following [18], we exploited ISEE [21] for scalable precision matrix estimation when implementing DeepPINK.

3 Knockoffs inference for deep neural networks

In this paper, we integrate the idea of knockoff filters with DNNs to achieve feature selection with controlled FDR. We restrict ourselves to a Gaussian design so that the knockoff features can be constructed easily by following Equation 1. In practice, [10] shows how to generate knockoff variables from a general joint distribution of covariates.

We initially experimented with the multilayer perceptron by naively feeding the augmented feature vector (x^T, x̃^T)^T directly into the network. However, we discovered that although the FDR is controlled at the target level q, the power is extremely low and even 0 in many scenarios; see Table 1 and the discussions in a later section.
To resolve the power issue, we propose a new flexible framework named DeepPINK for reproducible feature selection in DNNs, as illustrated in Figure 1. The main idea of DeepPINK is to feed the input to the network through a plugin pairwise-coupling layer containing p filters, F1, ..., Fp, where the jth filter connects feature Xj and its knockoff counterpart X̃j. Initialized equally, the corresponding filter weights Zj and Z̃j compete against each other during training. Thus, intuitively, Zj being much larger than Z̃j in magnitude provides some evidence that the jth feature is important, whereas similar values of Zj and Z̃j indicate that the jth feature is not important.

In addition to the competition of each feature against its own knockoff counterpart, features also compete against each other. To encourage these competitions, we use a linear activation function in the pairwise-coupling layer.

Figure 1: A graphical illustration of DeepPINK. DeepPINK is built upon an MLP with a plugin pairwise-coupling layer containing p filters, one per input feature, where each filter connects the original feature and its knockoff counterpart. The filter weights Zj and Z̃j for the jth feature and its knockoff counterpart are initialized equally for fair competition. The outputs of the filters are fed into a fully connected MLP with 2 hidden layers, each containing p neurons. Both ReLU activation and L1-regularization are used in the MLP.

The outputs of the filters are then fed into a fully connected multilayer perceptron (MLP) to learn a mapping to the response Y. The MLP network has multiple alternating linear transformation and nonlinear activation layers. Each layer learns a mapping from its input to a hidden space, and the last layer learns a mapping directly from the hidden space to the response Y. In this work, we use an MLP with 2 hidden layers, each containing p neurons, as illustrated in Figure 1. We use L1-regularization in the MLP with the regularization parameter set to O(√(2 log p / n)). We use Adam [24] to train the deep learning model with respect to the mean squared error loss, using an initial learning rate of 0.001 and batch size 10.

Let W^(0) ∈ R^{p×1} be the weight vector connecting the filters to the MLP, and let W^(1) ∈ R^{p×p}, W^(2) ∈ R^{p×p}, and W^(3) ∈ R^{p×1} be the weight matrices connecting the input vector to the first hidden layer, the first hidden layer to the second hidden layer, and the second hidden layer to the output, respectively.

The importance measures Zj and Z̃j are determined by two factors: (1) the relative importance between Xj and its knockoff counterpart X̃j, encoded by the filter weights z = (z1, ..., zp)^T and z̃ = (z̃1, ..., z̃p)^T, respectively; and (2) the relative importance of the jth feature among all p features, encoded by the weight matrices through w = W^(0) ⊙ (W^(1)W^(2)W^(3)), where ⊙ denotes the entrywise matrix product. Therefore, we define Zj and Z̃j as

Zj = zj × wj and Z̃j = z̃j × wj.   (4)

With the above introduced importance measures Zj and Z̃j, the knockoff statistic can be defined as Wj = Zj² − Z̃j², and the filtering step can be applied to the Wj to select features. Our definition of feature importance measures naturally applies to deep neural nets with more hidden layers.
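As a small numerical sketch of Equation 4 and the statistic Wj = Zj² − Z̃j², the snippet below uses randomly generated stand-ins for trained weights; none of these values come from an actual trained DeepPINK model:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 6

# Stand-ins for trained parameters (illustrative only):
z, z_tilde = rng.normal(size=p), rng.normal(size=p)   # pairwise-coupling filter weights
W0 = rng.normal(size=(p, 1))                          # filter outputs -> MLP input
W1 = rng.normal(size=(p, p))                          # input -> first hidden layer
W2 = rng.normal(size=(p, p))                          # first -> second hidden layer
W3 = rng.normal(size=(p, 1))                          # second hidden layer -> output

# w = W(0) (entrywise) (W(1) W(2) W(3)): each feature's aggregate path weight.
w = (W0 * (W1 @ W2 @ W3)).ravel()

Z, Z_tilde = z * w, z_tilde * w       # Equation 4
W_stat = Z**2 - Z_tilde**2            # knockoff statistics fed to the filtering step
```

Note that swapping zj and z̃j flips the sign of Wj while leaving the other statistics unchanged, which is exactly the coin-flip property required of knockoff statistics.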
The choice of 2 hidden layers for the MLP is only for illustration purposes.

4 Simulation studies

4.1 Model setups and simulation settings

We use synthetic data to compare the performance of DeepPINK to existing methods in the literature. Since the original knockoff filter [2] and the high-dimensional knockoff filter [3] were only designed for Gaussian linear regression models, we first simulate data from

y = Xβ + ε,   (5)

where y = (Y1, ..., Yn)^T ∈ R^n is the response vector, X ∈ R^{n×p} is a random design matrix, β = (β1, β2, ..., βp)^T ∈ R^p is the coefficient vector, and ε = (ε1, ..., εn)^T ∈ R^n is a vector of noise.

Since nonlinear models are more pervasive than linear models in real applications, we also consider the following Single-Index model

Yi = g(xi^T β) + εi,   i = 1, ..., n,   (6)

where g is some unknown link function and xi is the feature vector corresponding to the ith observation.

We simulate the rows of X independently from N(0, Σ) with precision matrix Σ^{−1} = (ρ^{|j−k|})_{1≤j,k≤p} with ρ = 0.5. The noise distribution is chosen as ε ∼ N(0, σ²In) with σ = 1.
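The two simulation designs above can be generated as follows. This is a sketch under our own naming; note that the paper specifies the precision matrix, not the covariance matrix, as (ρ^{|j−k|}), which the code follows:

```python
import numpy as np

def simulate(n=1000, p=200, s=30, rho=0.5, amp=1.5, link=None, seed=0):
    """Draw X with precision matrix (rho^{|j-k|}) and a sparse coefficient vector."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Omega = rho ** np.abs(idx[:, None] - idx[None, :])   # Sigma^{-1}
    Sigma = np.linalg.inv(Omega)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)       # s signals, random locations
    beta[support] = rng.choice([-amp, amp], size=s)      # magnitudes from {-amp, +amp}
    eta = X @ beta
    # Linear model if link is None; otherwise apply the link (Single-Index model).
    y = (link(eta) if link is not None else eta) + rng.standard_normal(n)
    return X, y, np.sort(support)

# Linear model (Equation 5) and Single-Index model (Equation 6) with g(x) = x^3 / 2:
X_lin, y_lin, S0_lin = simulate(p=100, s=30)
X_si, y_si, S0_si = simulate(p=100, s=10, link=lambda t: t**3 / 2)
```

The returned support plays the role of S0 when computing empirical FDR and power of a selection procedure.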
For all simulation examples, we set the target FDR level to q = 0.2.

For the linear model, we experiment with sample size n = 1000 and consider the high-dimensional scenario with the number of features p = 50, 100, 200, 400, 600, 800, 1000, 1500, 2000, 2500, and 3000. The true regression coefficient vector β0 ∈ R^p is sparse with s = 30 nonzero signals randomly located, and the nonzero coefficients are randomly chosen from {±1.5}.

For the Single-Index model, we fix the sample size n = 1000 and vary the feature dimensionality as p = 50, 100, 200, 400, 600, 800, 1000, 1500, 2000, 2500, and 3000. We set the true link function g(x) = x^3/2. The true regression coefficient vector is generated similarly with s = 10.

4.2 Simulation results

We compare the performance of DeepPINK with four popularly used methods: MLP, DeepLIFT [34], random forest (RF) [9], and support vector regression (SVR) with linear kernel. For RF, the feature importance is measured by Gini importance [8]. For MLP, the feature importance is measured similarly to DeepPINK but without the pairwise-coupling layer. For DeepLIFT, the feature importance is measured by the multiplier score. For SVR with linear kernel, the feature importance is measured by the coefficients of the primal problem [11]. We did not compare with SVR with a nonlinear kernel because we could not find any feature importance measure for nonlinear kernels in existing software packages. We would like to point out that with the linear kernel, the model is misspecified when observations are generated from the Single-Index model. Due to the very high computational cost of various methods in high dimensions, for each simulation example we set the number of repetitions to 20. For DeepPINK and MLP, the neural networks are trained 5 times with different random seeds within each repetition.
When implementing the knockoff filter, the threshold T+ was used, and the approximate knockoff features were constructed using the estimated precision matrix Ω̂ (see [18] for more details).

The empirical FDR and power are reported in Table 1. We see that DeepPINK consistently controls the FDR well below the target level, even though we choose a fairly loose threshold, and it has the highest power among the competing methods in almost all settings. MLP, DeepLIFT, and RF control the FDR at the target level, but in each case the power is very sensitive to dimensionality and model nonlinearity. For SVR with the linear kernel, the FDR is occasionally above the target level q = 0.2, which could be caused by the very small number of repetitions. It is worth mentioning that the superiority of DeepPINK over MLP and DeepLIFT lies in the pairwise-coupling layer, which directly encourages the importance competition between original and knockoff features. The results on FDR control are consistent with the general theory of knockoffs inference in [10] and [18]. It is also worth mentioning that DeepPINK uses the same network architecture across the different model settings.

5 Real data analysis

In addition to the simulation examples presented in Section 4, we also demonstrate the practical utility of DeepPINK in two real applications. For all studies the target FDR level is set to q = 0.2.

5.1 Application to HIV-1 data

We first apply DeepPINK to the task of identifying mutations associated with drug resistance in HIV-1 [32].
Separate data sets are available for resistance against different drugs from three classes: 7 protease inhibitors (PIs), 6 nucleoside reverse-transcriptase inhibitors (NRTIs), and 3 nonnucleoside reverse-transcriptase inhibitors (NNRTIs).

[Table 1: Simulation results for the linear model and the Single-Index model, reporting empirical FDR and power for DeepPINK, MLP, DeepLIFT, RF, and SVR across p = 50 to 3000. The table layout could not be recovered from the extraction.]
The response Y is the log-transformed drug resistance level to the drug, and the jth column of the design matrix X indicates the presence or absence of the jth mutation.

We compare the identified mutations within each drug class against the treatment-selected mutations (TSM) list, which contains mutations associated with treatment by the drugs from that class. Following [2], for each drug class we consider the jth mutation a discovery for that class as long as it is selected for any of the drugs in that class. Before running the selection procedure, we remove patients with missing drug resistance information and keep only those mutations that appear at least three times in the sample. We then apply three different methods to the data set: DeepPINK with the model-X knockoffs discussed earlier in this paper, the original fixed-X knockoff filter based on a Gaussian linear model (Knockoff) proposed in [2], and the Benjamini–Hochberg (BHq) procedure [4]. Following [2], z-scores rather than p-values are used in BHq to facilitate comparison with the other methods.

Figure 2 summarizes the discovered mutations of all methods within each drug class for PI and NRTI. In this experiment, we see several differences among the methods. Compared to Knockoff, DeepPINK obtains equivalent or better power in 9 out of 13 cases with a comparable false discovery proportion. In particular, DeepPINK and BHq are the only methods that can identify mutations for DDI, TDF, and X3TC among the NRTIs. Compared to BHq, DeepPINK obtains equivalent or better power in 10 out of 13 cases with a much better controlled false discovery proportion. In particular, DeepPINK shows remarkable performance for APV, where it recovers 18 mutations without any mutation falling outside the TSM list.
Finally, it is worth mentioning that DeepPINK does not make assumptions about the underlying models.

5.2 Application to gut microbiome data

We next apply DeepPINK to the task of identifying the important nutrient intakes as well as bacteria genera in the gut that are associated with body-mass index (BMI). We use a cross-sectional study of n = 98 healthy volunteers to investigate the dietary effect on the human gut microbiome [12, 26, 45]. The nutrient intake consists of p1 = 214 micronutrients, whose values are first normalized using the residual method to adjust for caloric intake and then standardized [12]. Furthermore, the composition of p2 = 87 genera is extracted using 16S rRNA sequencing from stool samples. The compositional data is first log-ratio transformed to remove the sum-to-one constraint and then centralized. Following [26], 0s are replaced with 0.5 before converting the data to compositional form. We then treat BMI as the response and the nutrient intake together with the genera composition as predictors.

Table 2 shows the eight identified nutrient intakes and bacteria genera using DeepPINK with the target FDR level q = 0.2. Among them, three overlap with the four genera (Acidaminococcus, Alistipes, Allisonella, Clostridium) identified in [26] using the Lasso. At the phylum level, it is known that Firmicutes affect human obesity [26], which is consistent with the identified Firmicutes-dominated genera. In view of the associations with BMI, all genera and nutrient intakes identified by DeepPINK are supported by literature evidence, as shown in Table 2.

Figure 2: Performance on the HIV-1 drug resistance data.
For each drug class, we show the number of mutation positions for (A) PI and (B) NRTI identified by DeepPINK, Knockoff, and BHq at target FDR level q = 0.2.

Table 2: Major nutrient intakes and gut microbiome genera identified by DeepPINK, with supporting literature evidence. Each row lists a micronutrient (with reference) and a phylum/genus pair (with reference):

1. Linoleic [7]; Firmicutes, Clostridium [26]
2. Dairy Protein [29]; Firmicutes, Acidaminococcus [26]
3. Omega 6 [31]; Firmicutes, Allisonella [26]
4. Choline, Phosphatidylcholine [31]; Firmicutes, Megamonas [25]
5. Choline, Phosphatidylcholine w/o suppl. [39]; Firmicutes, Megasphaera [43]
6. Phenylalanine, Aspartame [41]; Firmicutes, Mitsuokella [43]
7. Aspartic Acid, Aspartame [41]; Firmicutes, Holdemania [30]
8. Theaflavin 3-gallate, flavan-3-ol(2) [42]; Proteobacteria, Sutterella [13]

6 Conclusion

We have introduced a new method, DeepPINK, that can help to interpret a deep neural network model by identifying a subset of relevant input features, subject to FDR control. DeepPINK employs a particular DNN architecture, namely a plugin pairwise-coupling layer, to encourage competition between each original feature and its knockoff counterpart. DeepPINK achieves FDR control with much higher power than the naive combination of the knockoffs idea with a vanilla MLP. Extending DeepPINK to other neural networks such as CNNs and RNNs would be an interesting direction for future work. Reciprocally, DeepPINK in turn improves on the original knockoffs inference framework. It is worth mentioning that feature selection always relies on some importance measure, explicitly or implicitly (e.g., Gini importance [8] in random forests). One drawback is that such feature importance measures heavily depend on the specific algorithm used to fit the model.
Given the universal approximation property of DNNs, practitioners can now avoid having to handcraft feature importance measures with the help of DeepPINK.

[Figure 2 bar charts omitted: true and false discoveries per drug for DeepPINK, Knockoff, and BHq, with per-drug sample sizes ranging from n = 328, p = 147 to n = 842, p = 207 for the PI drugs and from n = 351, p = 215 to n = 629, p = 283 for the NRTI drugs.]

Acknowledgments

This work was supported by NIH awards 1R01GM131407-01 and R01GM121818, a grant from the Simons Foundation, and an Adobe Data Science Research Award.

References

[1] Felix Abramovich, Yoav Benjamini, David L. Donoho, and Iain M. Johnstone. Adapting to unknown sparsity by controlling the false discovery rate. The Annals of Statistics, 34:584-653, 2006.

[2] Rina Foygel Barber and Emmanuel J. Candès. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055-2085, 2015.

[3] Rina Foygel Barber and Emmanuel J. Candès. A knockoff filter for high-dimensional selective inference. arXiv preprint arXiv:1602.03574, 2016.

[4] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), pages 289-300, 1995.

[5] Yoav Benjamini and Daniel Yekutieli.
The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29:1165-1188, 2001.

[6] Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, Klaus-Robert Müller, and Wojciech Samek. Layer-wise relevance propagation for neural networks with local renormalization layers. In International Conference on Artificial Neural Networks, pages 63-71. Springer, 2016.

[7] Henrietta Blankson, Jacob A. Stakkestad, Hans Fagertun, Erling Thom, Jan Wadstein, and Ola Gudmundsen. Conjugated linoleic acid reduces body fat mass in overweight and obese humans. The Journal of Nutrition, 130(12):2943-2948, 2000.

[8] Leo Breiman. Classification and Regression Trees. 1984.

[9] Leo Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

[10] Emmanuel J. Candès, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold: model-X knockoffs for high-dimensional controlled variable selection. Journal of the Royal Statistical Society, Series B, 2018. To appear.

[11] Yin-Wen Chang and Chih-Jen Lin. Feature ranking using linear SVM. In Causation and Prediction Challenge, pages 53-64, 2008.

[12] Jun Chen and Hongzhe Li. Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis. The Annals of Applied Statistics, 7(1), 2013.

[13] Chih-Min Chiu, Wei-Chih Huang, Shun-Long Weng, Han-Chi Tseng, Chao Liang, Wei-Chi Wang, Ting Yang, Tzu-Ling Yang, Chen-Tsung Weng, Tzu-Hao Chang, et al. Systematic analysis of the association between gut flora and obesity through high-throughput sequencing and bioinformatics approaches. BioMed Research International, 2014, 2014.

[14] Sandy Clarke and Peter Hall. Robustness of multiple testing procedures against dependence. The Annals of Statistics, 37:332-358, 2009.

[15] Bradley Efron. Correlation and large-scale simultaneous significance testing.
Journal of the American Statistical Association, 102:93-103, 2007.

[16] Jianqing Fan, Peter Hall, and Qiwei Yao. To how many simultaneous hypothesis tests can normal, Student's t or bootstrap calibration be applied? Journal of the American Statistical Association, 102:1282-1288, 2007.

[17] Jianqing Fan, Han Xu, and Weijie Gu. Control of the false discovery rate under arbitrary covariance dependence (with discussion). Journal of the American Statistical Association, 107:1019-1045, 2012.

[18] Yingying Fan, Emre Demirkaya, Gaorong Li, and Jinchi Lv. RANK: large-scale inference with graphical nonlinear knockoffs. arXiv preprint arXiv:1709.00092, 2017.

[19] Yingying Fan, Emre Demirkaya, and Jinchi Lv. Nonuniformity of p-values can occur early in diverging dimensions. arXiv preprint arXiv:1705.03604, 2017.

[20] Yingying Fan and Jianqing Fan. Testing and detecting jumps based on a discretely observed process. Journal of Econometrics, 164:331-344, 2011.

[21] Yingying Fan and Jinchi Lv. Innovated scalable efficient estimation in ultra-large Gaussian graphical models. The Annals of Statistics, 44(5):2098-2126, 2016.

[22] Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. arXiv preprint arXiv:1710.10547, 2017.

[23] Peter Hall and Qiying Wang. Strong approximations of level exceedences related to multiple hypothesis testing. Bernoulli, 16:418-434, 2010.

[24] Diederik P. Kingma and Jimmy Ba. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[25] Ya-Shu Kuang, Jin-Hua Lu, Sheng-Hui Li, Jun-Hua Li, Ming-Yang Yuan, Jian-Rong He, Nian-Nian Chen, Wan-Qing Xiao, Song-Ying Shen, Lan Qiu, et al. Connections between the human gut microbiome and gestational diabetes mellitus. GigaScience, 6(8):1-12, 2017.

[26] Wei Lin, Pixu Shi, Rui Feng, and Hongzhe Li.
Variable selection in regression with compositional covariates. Biometrika, 101(4):785-797, 2014.

[27] Zachary C. Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.

[28] Ziad Obermeyer and Ezekiel J. Emanuel. Predicting the future - big data, machine learning, and clinical medicine. The New England Journal of Medicine, 375(13):1216, 2016.

[29] Laura Pimpin, Susan Jebb, Laura Johnson, Jane Wardle, and Gina L. Ambrosini. Dietary protein intake is associated with body mass index and weight up to 5 y of age in a prospective cohort of twins. The American Journal of Clinical Nutrition, 103(2):389-397, 2016.

[30] Sylvie Rabot, Mathieu Membrez, Florence Blancher, Bernard Berger, Déborah Moine, Lutz Krause, Rodrigo Bibiloni, Aurélia Bruneau, Philippe Gérard, Jay Siddharth, et al. High fat diet drives obesity regardless the composition of gut microbiota in mice. Scientific Reports, 6:32484, 2016.

[31] Dominic N. Reeds, B. Selma Mohammed, Samuel Klein, Craig Brian Boswell, and V. Leroy Young. Metabolic and structural effects of phosphatidylcholine and deoxycholate injections on subcutaneous fat: a randomized, controlled trial. Aesthetic Surgery Journal, 33(3):400-408, 2013.

[32] Soo-Yon Rhee, Jonathan Taylor, Gauhar Wadhera, Asa Ben-Hur, Douglas L. Brutlag, and Robert W. Shafer. Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences, 103(46):17355-17360, 2006.

[33] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135-1144, 2016.

[34] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences.
arXiv preprint arXiv:1704.02685, 2017.

[35] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[36] John D. Storey, Jonathan E. Taylor, and David Siegmund. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society, Series B, 66:187-205, 2004.

[37] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.

[38] Ryan Turner. A model explanation system. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1-6, 2016.

[39] Mauno Vanhala, Juha Saltevo, Pasi Soininen, Hannu Kautiainen, Antti J. Kangas, Mika Ala-Korpela, and Pekka Mäntyselkä. Serum omega-6 polyunsaturated fatty acids and the metabolic syndrome: a longitudinal population-based cohort study. American Journal of Epidemiology, 176(3):253-260, 2012.

[40] Wei Biao Wu. On false discovery control under dependence. The Annals of Statistics, 36:364-380, 2008.

[41] Qing Yang. Gain weight by "going diet?" Artificial sweeteners and the neurobiology of sugar cravings: Neuroscience 2010. The Yale Journal of Biology and Medicine, 83(2):101, 2010.

[42] Yoon Jung Yang, You Jin Kim, Yoon Kyoung Yang, Ji Yeon Kim, and Oran Kwon. Dietary flavan-3-ols intake and metabolic syndrome risk in Korean adults. Nutrition Research and Practice, 6(1):68-77, 2012.

[43] Yeojun Yun, Han-Na Kim, Song E. Kim, Seong Gu Heo, Yoosoo Chang, Seungho Ryu, Hocheol Shin, and Hyung-Lae Kim. Comparative analysis of gut microbiota associated with body mass index in a large Korean cohort. BMC Microbiology, 17(1):151, 2017.

[44] Yu Zhang and Jun S. Liu.
Fast and accurate approximation to significance tests in genome-wide association studies. Journal of the American Statistical Association, 106:846-857, 2011.

[45] Zemin Zheng, Jinchi Lv, and Wei Lin. Nonsparse learning with latent variables. arXiv preprint arXiv:1710.02704, 2017.
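The compositional preprocessing described in Section 5.2 (zero replacement with 0.5, conversion to proportions, log-ratio transform to remove the sum-to-one constraint, centering) can be sketched as follows. This is a minimal illustration only: it assumes the centered log-ratio variant of the transform, and the function name is ours, not taken from the paper's code.

```python
import numpy as np

def preprocess_genera(counts, pseudo_count=0.5):
    """Sketch of the compositional preprocessing in Section 5.2.

    counts: (n_samples, n_genera) array of 16S read counts.
    Returns a centered log-ratio matrix with per-genus centering.
    """
    counts = np.array(counts, dtype=float)  # copy so the caller's data is untouched
    # Following [26], replace zeros with 0.5 before forming compositions.
    counts[counts == 0] = pseudo_count
    # Convert to compositional form: each row sums to one.
    comp = counts / counts.sum(axis=1, keepdims=True)
    # Log-ratio transform removes the sum-to-one constraint
    # (centered log-ratio variant assumed here).
    log_comp = np.log(comp)
    clr = log_comp - log_comp.mean(axis=1, keepdims=True)
    # Finally, center each genus (column) across samples.
    return clr - clr.mean(axis=0, keepdims=True)
```

The resulting matrix would then be concatenated with the standardized micronutrient features to form the predictor matrix passed to DeepPINK.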