{"title": "Finding significant combinations of features in the presence of categorical covariates", "book": "Advances in Neural Information Processing Systems", "page_first": 2279, "page_last": 2287, "abstract": "In high-dimensional settings, where the number of features p is typically much larger than the number of samples n, methods which can systematically examine arbitrary combinations of features, a huge 2^p-dimensional space, have recently begun to be explored. However, none of the current methods is able to assess the association between feature combinations and a target variable while conditioning on a categorical covariate, in order to correct for potential confounding effects. We propose the Fast Automatic Conditional Search (FACS) algorithm, a significant discriminative itemset mining method which conditions on categorical covariates and only scales as O(k log k), where k is the number of states of the categorical covariate. Based on the Cochran-Mantel-Haenszel Test, FACS demonstrates superior speed and statistical power on simulated and real-world datasets compared to the state of the art, opening the door to numerous applications in biomedicine.", "full_text": "Finding signi\ufb01cant combinations of features in the\n\npresence of categorical covariates\n\nLaetitia Papaxanthos\u2217, Felipe Llinares-L\u00f3pez\u2217, Dean Bodenham, Karsten Borgwardt\n\nMachine Learning and Computational Biology Lab\n\nD-BSSE, ETH Zurich\n\n*Equally contributing authors.\n\nAbstract\n\nIn high-dimensional settings, where the number of features p is much larger than the\nnumber of samples n, methods that systematically examine arbitrary combinations\nof features have only recently begun to be explored. However, none of the current\nmethods is able to assess the association between feature combinations and a target\nvariable while conditioning on a categorical covariate. 
As a result, many false\ndiscoveries might occur due to unaccounted-for confounding effects.\nWe propose the Fast Automatic Conditional Search (FACS) algorithm, a significant\ndiscriminative itemset mining method which conditions on categorical covariates\nand only scales as O(k log k), where k is the number of states of the categorical\ncovariate. Based on the Cochran-Mantel-Haenszel Test, FACS demonstrates superior\nspeed and statistical power on simulated and real-world datasets compared to\nthe state of the art, opening the door to numerous applications in biomedicine.\n\n1 Introduction\n\nIn the last 10 years, the amount of available data has been growing at an unprecedented rate. However, in\nmany application domains, such as computational biology and healthcare, the number of features is\ngrowing much faster than typical sample sizes. Therefore, statistical inference in high-dimensional\nspaces has become a tool of the utmost importance for practitioners in those fields. Despite the great\nsuccess of approaches based on sparsity-inducing regularizers [16, 2], the development of methods to\nsystematically explore arbitrary combinations of features and assess their statistical association with\na target of interest has been less studied. Exploring all combinations of p features is equivalent to\nhandling a 2^p-dimensional space; thus, combinatorial feature discovery exacerbates the challenges for\nstatistical inference in high-dimensional spaces even for moderate p.\nUnder the assumption that features and targets are binary random variables, recent work in the field\nof significant discriminative itemset mining offers tools to solve the computational and statistical\nchallenges incurred by combinatorial feature discovery. However, all state-of-the-art approaches [15,\n10, 13, 7, 8] share a key limitation: no method exists to assess the conditional association between\nfeature combinations and the target. 
The ability to condition the associations on an observed covariate\nis fundamental to correct for confounding effects. If confounding is left unaccounted for, one may find many false\npositives that are actually associated with the covariate and not the class of interest [17]. For example,\nin medical case/control association studies, it is common to search for combinations of genetic\nvariants that are associated with a disease of interest. In this setting, the class labels are the health\nstatus of individuals, sick or healthy. The features represent binary genetic variants, encoded as 1\nif the variant is altered and as 0 if not. Often, in high-order association studies, a subset of genetic\nvariants is combined to form a binary variable whose value is 1 if the subset only contains altered\ngenetic variants and is 0 otherwise. A subset of genetic variants is associated with the class label if the\nfrequencies of altered combinations in each class are statistically different. However, it is often the\ncase that the studied samples belong to several subpopulations, for example African-American, East\nAsian or European Caucasian, which show differences in the prevalence of some altered combinations\nof genetic variants because of systematic ancestry differences. When, additionally, the subpopulation\nclusters are unevenly distributed across classes, this can result in false associations with the disease of\ninterest [12].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n
This is the reason why it is necessary to model ancestry differences between cases and\ncontrols in the presence of population structure or to correct for covariates in more general settings.\nHence our goal in this article is to present the first approach to significant discriminative itemset\nmining that allows one to correct for a confounding categorical covariate.\nTo reach this goal, we present the novel algorithm FACS, which enables significant discriminative\nitemset mining with categorical covariates through the Cochran-Mantel-Haenszel test [9] in\nO(k log k) time, where k is the number of states of the categorical covariate, compared to the standard\nimplementation which is exponential in k.\nThe rest of this article is organized as follows: In Section 2 we define the problem to be solved\nand introduce the main theoretical concepts from related work that FACS is based on, namely the\nCochran-Mantel-Haenszel test and Tarone\u2019s testability criterion. In Section 3, we describe in detail\nour contribution, the FACS algorithm and its efficient implementation. Finally, Section 4 validates the\nperformance of our method on a set of simulated and biomedical datasets.\n2 Problem statement and related work\nIn this section we introduce the necessary background, notation and terminology for the remainder\nof this article. First, in Section 2.1 we rigorously define the problem we solve in this paper. Next,\nin Sections 2.2 and 2.3 we describe two key elements on which our method is based: the\nCochran-Mantel-Haenszel (CMH) test and Tarone\u2019s testability criterion.\n2.1 Discovering significant feature combinations in the presence of a categorical covariate\nWe consider a dataset of n observations $D = \\{(u_i, y_i, c_i)\\}_{i=1}^{n}$, where the ith observation consists of:\n(I) a feature vector ui consisting of p binary features, ui,j \u2208 {0, 1} for j = 1, . . . 
, p; (II) a binary class\nlabel, yi \u2208 {0, 1}; and (III) a categorical covariate ci, which has k categories, i.e. ci \u2208 {1, 2, . . . , k}.\nGiven any subset of features S \u2286 {1, 2, . . . , p}, we define its induced feature combination for the ith\nobservation as $z_{i,S} = \\prod_{j \\in S} u_{i,j}$, such that zi,S takes value 1 if and only if ui,j = 1 for all features j in\nS. Now, we use ZS to denote the feature combination induced by S, of which zi,S is the realization\nfor the ith observation. Similarly, we use Y to denote the label, and C to denote the covariate, of\nwhich yi and ci are realizations, respectively, for i = 1, 2, . . . , n. Below we use the standard notation\nA \u22a5\u22a5 B to denote \u201cA is statistically independent of B\u201d.\nTypically, significant discriminative itemset mining aims to find all feature subsets S for which a\nstatistical association test rejects the null hypothesis, namely ZS \u22a5\u22a5 Y , after a rigorous correction for\nmultiple hypothesis testing. However, for any feature subset such that ZS and Y are dependent, i.e. not (ZS \u22a5\u22a5 Y ), but ZS \u22a5\u22a5 Y | C, the\nassociation between ZS and Y is exclusively mediated by the covariate C, which acts in this case as\na confounder creating spurious associations.\nOur goal: In this work, the aim is to find all feature subsets S for which a statistical association\ntest rejects the null hypothesis ZS \u22a5\u22a5 Y | C, thus allowing one to correct for a confounding categorical\ncovariate while keeping the computational efficiency, statistical power and the ability to correct for\nmultiple hypothesis testing of existing methods.\nIn the remainder of this section we will introduce two fundamental concepts our work relies upon.\nThe first one is the Cochran-Mantel-Haenszel (CMH) test, which offers a principled way to test if a\nfeature combination ZS is conditionally dependent on the class labels Y given the covariate C, that\nis, to test the null hypothesis ZS \u22a5\u22a5 Y | C. The second concept is Tarone\u2019s testability criterion, which\nallows a correction for multiple hypothesis testing while retaining large statistical power, in scenarios\nsuch as ours where billions or trillions of association tests must be performed.\nTarone\u2019s testability criterion has only been successfully applied to unconditional association tests,\nsuch as Fisher\u2019s exact test [6] or Pearson\u2019s \u03c7\u00b2 test [11]. Thus, the state of the art in significant\ndiscriminative itemset mining forces one to choose between: (a) using Bonferroni\u2019s correction,\nresulting in very low statistical power or an arbitrary limit in the cardinality of feature subsets (e.g.\n[18]), or (b) using Tarone\u2019s testability criterion, losing the ability to account for covariates and\nresulting in potentially many confounded patterns being deemed significant [15, 13, 7, 8].\nOur contribution: In this paper, we propose FACS, a novel algorithm that allows applying Tarone\u2019s\ntestability criterion to the CMH test, making it possible to correct for a categorical covariate in significant\ndiscriminative itemset mining for the first time. 
FACS will be introduced in detail in Section 3.\n2.2 Conditional association testing with the Cochran-Mantel-Haenszel (CMH) test\nTo test if ZS \u22a5\u22a5 Y | C, the CMH test [9] arranges the n realisations $\\{(z_{i,S}, y_i, c_i)\\}_{i=1}^{n}$ into k\ndistinct 2 \u00d7 2 contingency tables, one table for each possible value j of the covariate c, as:\n\nVariables     zS = 1          zS = 0                  Row totals\ny = 1         aS,j            n1,j \u2212 aS,j             n1,j\ny = 0         xS,j \u2212 aS,j     n2,j \u2212 xS,j + aS,j      n2,j\nCol totals    xS,j            nj \u2212 xS,j               nj\n\nwhere: (I) nj is the number of observations with c = j, n1,j of which have class label y = 1 and n2,j\nof which have class label y = 0; (II) xS,j is the number of observations with c = j and zi,S = 1;\n(III) aS,j is the number of observations with c = j, class label y = 1 and zi,S = 1. Based on\n$\\{n_j, n_{1,j}, x_{S,j}, a_{S,j}\\}_{j=1}^{k}$, a p-value pS for feature combination ZS is computed as:\n\n$$p_S = 1 - F_{\\chi^2_1}\\left( \\frac{\\left( \\sum_{j=1}^{k} \\left( a_{S,j} - \\frac{x_{S,j} n_{1,j}}{n_j} \\right) \\right)^2}{\\sum_{j=1}^{k} \\frac{n_{1,j}}{n_j} \\left( 1 - \\frac{n_{1,j}}{n_j} \\right) x_{S,j} \\left( 1 - \\frac{x_{S,j}}{n_j} \\right)} \\right) \\qquad (1)$$\n\nwhere $F_{\\chi^2_1}(\\cdot)$ is the distribution function of a \u03c7\u00b2 random variable with 1 degree of freedom. Finally,\nthe feature combination ZS and its corresponding feature subset S will be deemed significantly\nassociated if the p-value pS falls below a corrected significance threshold \u03b4, that is, if pS \u2264 \u03b4.\nThe CMH test can be understood as a form of meta-analysis applied to k disjoint datasets $\\{D_j\\}_{j=1}^{k}$,\nwhere Dj = {(ui, yi)| ci = j} contains only observations for which the covariate c takes value j.\nFor confounded feature combinations, the association might be large in the entire dataset D, but small\nfor conditional datasets Dj. 
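As a minimal sketch of Equation (1), the snippet below computes the CMH p-value from per-category counts and contrasts it with the same statistic on the pooled table. The function name cmh_pvalue, the tuple layout (a, x, n1, n) and the toy counts are illustrative assumptions and are not taken from the FACS code; the tail of a chi-squared variable with 1 degree of freedom is evaluated as erfc(sqrt(T/2)).

```python
import math

def cmh_pvalue(tables):
    '''CMH p-value of Equation (1).

    tables: one (a, x, n1, n) tuple per covariate category j, standing for
    (a_{S,j}, x_{S,j}, n_{1,j}, n_j). This layout is an assumption of the sketch.
    '''
    num = sum(a - x * n1 / n for a, x, n1, n in tables)
    den = sum((n1 / n) * (1 - n1 / n) * x * (1 - x / n) for a, x, n1, n in tables)
    if den == 0:
        return 1.0  # degenerate margins: no evidence against the null
    stat = num * num / den
    # Survival function of a chi-squared variable with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical toy data: within each category Z_S and Y are independent,
# but both are more frequent in the first category.
strata = [(64, 80, 80, 100), (4, 20, 20, 100)]
pooled = [(68, 100, 100, 200)]

p_conditional = cmh_pvalue(strata)  # conditions on the covariate
p_pooled = cmh_pvalue(pooled)       # ignores the covariate
print(p_conditional, p_pooled)
```

In this toy example the pooled table yields a very small p-value, while the stratified statistic is exactly zero because the association vanishes inside each category.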
Thus, the CMH test will not deem such feature combinations significant.\n2.3 The multiple testing problem in discriminative itemset mining\nIn our setup, one must perform 2^p \u2212 1 association tests, one for each possible non-empty subset of features. Even\nfor moderate p, this leads to an enormous number of tests, resulting in a large multiple hypothesis\ntesting problem. To produce statistically reliable results, the significance threshold \u03b4 will be chosen to\nguarantee that the Family-Wise Error Rate (FWER), defined as the probability of producing any false\npositives, is upper-bounded by a significance level \u03b1. FWER control is most commonly achieved with\nBonferroni\u2019s correction [3, 5], which in our setup would imply using \u03b4 = \u03b1/(2^p \u2212 1) as significance\nthreshold. However, Bonferroni\u2019s correction tends to be overly conservative, resulting in very low\nstatistical power when the number of tests performed is large. In contrast, recent work in significant\ndiscriminative itemset mining [15, 10, 13, 7] showed that, in this setting, Bonferroni\u2019s correction can\nbe outperformed in terms of statistical power by Tarone\u2019s testability criterion [14].\nTarone\u2019s testability criterion is based on the observation that, for some discrete test statistics based on\ncontingency tables, a minimum attainable p-value can be computed as a function of the table margins.\nLet \u03a8(S) denote the minimum attainable p-value corresponding to the contingency table of feature\ncombination ZS. By definition, pS \u2265 \u03a8(S), therefore \u03a8(S) > \u03b4 implies that feature combination\nZS can never be deemed significantly associated, and hence it cannot cause a false positive. In other\nwords, feature subsets S for which \u03a8(S) > \u03b4 are irrelevant as far as the FWER is concerned. In\nTarone\u2019s terminology, S is said to be untestable. 
Thus, defining the set of testable feature subsets at\nlevel \u03b4 as IT(\u03b4) = {S | \u03a8(S) \u2264 \u03b4}, Tarone\u2019s testability criterion obtains the corrected significance\nthreshold as \u03b4tar = max{\u03b4 : FWERtar(\u03b4) \u2264 \u03b1}, where FWERtar(\u03b4) = \u03b4|IT(\u03b4)|. Note that this\namounts to applying a Bonferroni correction to feature subsets S in IT(\u03b4) only. FWER control\nfollows from the fact that untestable feature subsets cannot affect the FWER. Since in practice\n|IT(\u03b4)| \u226a 2^p \u2212 1, Tarone\u2019s testability criterion often outperforms Bonferroni\u2019s correction in terms\nof statistical power by a large margin.\nThe main practical limitation of Tarone\u2019s testability criterion is its computational complexity. Naively\ncomputing \u03b4tar would involve explicitly enumerating all 2^p \u2212 1 feature subsets and evaluating their\nrespective minimum attainable p-values, which is infeasible even for moderate p. Existing work in\nsignificant discriminative pattern mining overcomes that limitation by exploiting specific properties of\ncertain test statistics, such as Fisher\u2019s Exact Test or Pearson\u2019s \u03c7\u00b2 test, that allow branch-and-bound\nalgorithms to be used to evaluate \u03b4tar. However, the properties those algorithms rely on do not apply to\nconditional statistical association tests, such as the CMH test. In the next section, we present in detail\nour novel approach to apply Tarone\u2019s method to the CMH test.\n3 Our contribution: The FACS algorithm\nThis section introduces the Fast Automatic Conditional Search (FACS) algorithm, the first approach\nthat allows the application of Tarone\u2019s testability criterion to the CMH test in a computationally\nefficient manner. Section 3.1 discusses the main challenges facing FACS and summarizes how\nFACS improves the state of the art. 
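To make the corrected threshold of Section 2.3 concrete, the sketch below computes a Tarone-style threshold by brute force from an explicit list of minimum attainable p-values. It restricts the candidates to delta = alpha / m for integer m, which is a simplifying assumption of this example rather than the exact maximisation stated in Section 2.3, and the function name and toy values are hypothetical. It is only meant for small inputs, since explicitly enumerating all feature subsets is exactly what FACS avoids.

```python
def tarone_threshold(min_pvalues, alpha):
    '''Largest delta of the form alpha / m (m a positive integer) such that
    delta * |I_T(delta)| <= alpha, where I_T(delta) holds the testable subsets.'''
    m = 1
    while True:
        delta = alpha / m
        n_testable = sum(1 for psi in min_pvalues if psi <= delta)
        if n_testable <= m:  # then delta * n_testable <= alpha
            return delta, n_testable
        m += 1

# Hypothetical minimum attainable p-values of six feature subsets.
psi_values = [0.001, 0.002, 0.004, 0.2, 0.5, 1.0]
delta_tar, n_testable = tarone_threshold(psi_values, alpha=0.05)
bonferroni = 0.05 / len(psi_values)
print(delta_tar, n_testable, bonferroni)
```

Here only three of the six subsets are testable, so the resulting threshold is twice as large as the Bonferroni threshold computed over all six.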
Section 3.2 provides a high-level description of the algorithm.\nFinally, Sections 3.3 and 3.4 detail the two key steps of FACS, which are also the main algorithmic\ncontributions of this work.\n3.1 Overview and Contributions\nThe main objective of the FACS algorithm, described in Section 3.2 below, can be summarised as:\nObjective: Given a dataset $D = \\{(u_i, y_i, c_i)\\}_{i=1}^{n}$, the goal of FACS is to:\n\n1. Compute Tarone\u2019s corrected significance threshold \u03b4tar.\n2. Retrieve all feature subsets S whose p-value pS is below \u03b4tar.\n\nFor both (1) and (2), the test statistic of choice will be the CMH test, thus allowing one to correct for a\nconfounding categorical covariate as described in Section 2.2.\nThe key contribution of our work is to bridge the gap between Tarone\u2019s testability criterion and the\nCMH test. Firstly, in Section 3.3, we show for the first time that Tarone\u2019s method can be applied to\nthe CMH test. More importantly, in Section 3.4 we introduce a novel branch-and-bound algorithm to\nefficiently compute \u03b4tar without requiring the function \u03a8 computing Tarone\u2019s minimum attainable\np-value to be monotonic. 
This allows us not only to apply Tarone\u2019s testability criterion to the CMH\ntest, but to do so as efficiently as existing methods that cannot handle confounding covariates.\n3.2 High-level description of FACS\nAs shown in the pseudocode in Algorithm 1, conceptually, FACS performs two main operations:\n\nAlgorithm 1 FACS\nInput: Dataset $D = \\{(u_i, y_i, c_i)\\}_{i=1}^{n}$, target FWER \u03b1\nOutput: {S | pS \u2264 \u03b4tar}\n1: Initialize global variables \u03b4tar = 1 and IT(\u03b4tar) = \u2205\n2: \u03b4tar, IT(\u03b4tar) \u2190 tarone_cmh(\u2205)\n3: Return {S \u2208 IT(\u03b4tar) | pS \u2264 \u03b4tar}\n\nAlgorithm 2 tarone_cmh\nInput: Current feature subset being processed S\n1: if is_testable_cmh(S, \u03b4tar) then {see Section 3.3}\n2:     Append S to IT(\u03b4tar)\n3:     FWERtar(\u03b4tar) \u2190 \u03b4tar|IT(\u03b4tar)|\n4:     while FWERtar(\u03b4tar) > \u03b1 do\n5:         Decrease \u03b4tar\n6:         IT(\u03b4tar) \u2190 {S \u2208 IT(\u03b4tar) : is_testable_cmh(S, \u03b4tar)}\n7:         FWERtar(\u03b4tar) \u2190 \u03b4tar|IT(\u03b4tar)|\n8: if not is_prunable_cmh(S, \u03b4tar) then {see Section 3.4}\n9:     for S\u2032 \u2208 Children(S) do\n10:        tarone_cmh(S\u2032)\n\nFirstly, Line 2 invokes the routine tarone_cmh, described in Algorithm 2. 
This routine uses our\nnovel branch-and-bound approach to efficiently compute Tarone\u2019s corrected significance threshold\n\u03b4tar and the set of testable feature subsets IT(\u03b4tar).\nSecondly, using the significance threshold \u03b4tar obtained in the previous step, Line 3 evaluates the\nconditional association of the feature combination ZS of each testable feature subset S \u2208 IT(\u03b4tar)\nwith the class labels, given the categorical covariate, using the CMH test as shown in Section 2.2.\nNote that, according to Tarone\u2019s testability criterion, untestable feature subsets S \u2209 IT(\u03b4tar) cannot\nbe significant and therefore do not need to be considered in this step. Since in practice |IT(\u03b4tar)| \u226a\n2^p \u2212 1, the procedure tarone_cmh is the most critical part of FACS.\nThe routine tarone_cmh uses the enumeration scheme first proposed in [10, 13]. All 2^p feature\nsubsets are arranged in an enumeration tree such that S\u2032 \u2208 Children(S) \u21d2 S \u2282 S\u2032. In other words,\nthe children of a feature subset S in the enumeration tree are obtained by adding an additional feature\nto S. Before invoking tarone_cmh, in Line 1 of Algorithm 1 the significance threshold \u03b4tar is\ninitialized to 1, the largest value it can take, and the set of testable feature combinations IT(\u03b4tar) is\ninitialized to the empty set. The enumeration procedure is started by calling tarone_cmh with the\nempty feature subset S = \u2205, which acts as the root of the enumeration tree (footnote 1). All 2^p \u2212 1 non-empty\nfeature subsets will then be explored recursively by traversing the enumeration tree depth-first.\nEvery time a feature subset S in the tree is visited, Line 1 of Algorithm 2 checks if it is testable, as\ndetailed in Section 3.3. 
If it is, S is appended to the set of testable feature subsets IT (\u03b4tar) in Line 2.\nThe FWER condition for Tarone\u2019s testability criterion is checked in Lines 3 and 4. If it is found\nto be violated, the signi\ufb01cance threshold \u03b4tar is decreased in Line 5 until the condition is satis\ufb01ed\nagain, removing from IT (\u03b4tar) any feature subsets made untestable by decreasing \u03b4tar in Line 6 and\nre-evaluating the FWER condition accordingly in Line 7. Before continuing the traversal of the tree\nby exploring the children of the current feature subset S, Line 8 checks if our novel pruning criterion\napplies, as described in Section 3.4. Only if it does not apply are all children of S visited recursively\nin Lines 9 and 10. The testability and pruning conditions in Lines 1 and 8 become more stringent\nas \u03b4tar decreases. Because of this, as \u03b4tar decreases along the enumeration procedure (see Line 5),\nincreasingly larger parts of the search space are pruned. Thus, the algorithm terminates when, for the\ncurrent value of \u03b4tar and IT (\u03b4tar), all feature subsets that cannot be pruned have been visited.\nThe two most challenging steps in FACS are the design of an appropriate testability criterion,\nis_testable_cmh(S, \u03b4), and an ef\ufb01cient pruning criterion, is_prunable_cmh(S, \u03b4), that circum-\nvent the limitations of the current state of the art. These are now each described in detail.\n3.3 A testability criterion for the CMH test\nAs mentioned in Section 2.3, Tarone\u2019s testability criterion has only been applied to test statistics such\nas Fisher\u2019s exact test, Pearson\u2019s \u03c72 test and the Mann-Whitney U Test, none of which allows for\nincorporating covariates. 
However, the following proposition shows that the CMH test also has a\nminimum attainable p-value \u03a8cmh(S):\nProposition 1 The CMH test has a minimum attainable p-value \u03a8cmh(S), which can be computed\nin O(k) time as a function of the margins $\\{n_j, n_{1,j}, x_{S,j}\\}_{j=1}^{k}$ of the k 2 \u00d7 2 contingency tables.\nThe proof of Proposition 1, provided in the Supp. Material, involves showing that \u03a8cmh(S) can be\ncomputed from the k 2 \u00d7 2 contingency tables corresponding to ZS (see Section 2.2) by optimising\nthe p-value pS with respect to $\\{a_{S,j}\\}_{j=1}^{k}$ while keeping the table margins $\\{n_j, n_{1,j}, x_{S,j}\\}_{j=1}^{k}$ fixed.\n3.4 A pruning criterion for the CMH test\nState-of-the-art methods [15, 8], all of which are limited to unconditional association testing, exploit\nthe fact that the minimum attainable p-value function \u03a8(S), using either Fisher\u2019s exact test or\nPearson\u2019s \u03c7\u00b2 test on a single contingency table, obeys a simple monotonicity property: S \u2286 S\u2032 \u21d2\n\u03a8(S) \u2264 \u03a8(S\u2032) provided that xS \u2264 min(n1, n2). This leads to a remarkably simple pruning criterion:\nif a feature subset S is non-testable, i.e. \u03a8(S) > \u03b4, and its support xS is smaller than or equal to\nmin(n1, n2), then all children S\u2032 of S, which satisfy S \u2282 S\u2032 by construction of the enumeration tree,\nwill also be non-testable and can be pruned from the search space. However, such a monotonicity\nproperty does not hold for the CMH minimum attainable p-value function \u03a8cmh(S), severely\ncomplicating the development of an effective pruning criterion.\nIn Section 3.4.1 we show how to circumvent this limitation by introducing a novel pruning criterion\nbased on defining a monotonic lower envelope \u03a8\u0303cmh(S) \u2264 \u03a8cmh(S) of the original minimum\nattainable p-value function \u03a8cmh(S) and prove that it leads to a valid pruning strategy. Finally, in\nSection 3.4.2, we provide an efficient algorithm to evaluate \u03a8\u0303cmh(S) in O(k log k) time, instead of a\nnaive implementation whose computational complexity would scale exponentially with k, the number\nof categories for the covariate. Due to space constraints, all proofs are in the Supp. Material.\n\n3.4.1 Definition and correctness of the pruning criterion\n\nAs mentioned above, existing unconditional significant discriminative pattern mining methods\nonly consider feature subsets S with support xS \u2264 min(n1, n2) to be potentially prunable.\nAnalogously, we consider as potentially prunable the set of feature subsets IPP =\n{S | xS,j \u2264 min(n1,j, n2,j) \u2200 j = 1, . . . , k}. Note that for k = 1, our definition reduces to that\nof existing work. In itemset mining, a very large proportion of all feature subsets will have small\nsupports. Therefore, restricting the application of the pruning criterion to potentially prunable patterns\ndoes not cause a loss of performance in practice. We can now state the definition of the lower envelope\nfor the CMH minimum attainable p-value:\nDefinition 1 Let S \u2208 IPP be a potentially prunable feature subset. The lower envelope \u03a8\u0303cmh(S) is\ndefined as \u03a8\u0303cmh(S) = min{\u03a8cmh(S\u2032) | S\u2032 \u2287 S}.\nNote that, by construction, \u03a8\u0303cmh(S) satisfies \u03a8\u0303cmh(S) \u2264 \u03a8cmh(S) for all feature subsets S in the\nset of potentially prunable patterns. Next, we show that unlike for the minimum attainable p-value\nfunction \u03a8cmh(S), the monotonicity property holds for the lower envelope \u03a8\u0303cmh(S):\nLemma 1 Let S, S\u2032 \u2208 IPP be two potentially prunable feature subsets such that S \u2286 S\u2032. Then,\n\u03a8\u0303cmh(S) \u2264 \u03a8\u0303cmh(S\u2032) holds.\nNext, we state the main result of this section, which establishes our search space pruning criterion:\nTheorem 1 Let S \u2208 IPP be a potentially prunable feature subset such that \u03a8\u0303cmh(S) > \u03b4. Then,\n\u03a8cmh(S\u2032) > \u03b4 for all S\u2032 \u2287 S, i.e. all feature subsets containing S are non-testable at level \u03b4 and\ncan be pruned from the search space.\nTo summarize, the pruning criterion is_prunable_cmh in Line 8 of Algorithm 2 evaluates to true\nif and only if S \u2208 IPP, i.e. xS,j \u2264 min(n1,j, n2,j) for all j = 1, . . . , k, and \u03a8\u0303cmh(S) > \u03b4tar.\n\nFootnote 1: We define zi,\u2205 = 1 for all observations, so this artificial feature combination will never be significant.\n\n3.4.2 Evaluating the pruning criterion in O(k log k) time\n\nIn FACS, the pruning criterion stated above will be applied to all enumerated feature subsets. Hence,\nit is mandatory to have an efficient algorithm to compute the lower envelope for the CMH minimum\nattainable p-value \u03a8\u0303cmh(S) for any potentially prunable feature subset S \u2208 IPP.\nAs shown in the proof of Proposition 1 in the Supp. Material, \u03a8cmh(S) depends on the pattern S\nthrough its k-dimensional vector of supports xS = (xS,1, . . . , xS,k). Also, the condition S\u2032 \u2287 S\nimplies that xS\u2032,j \u2264 xS,j \u2200 j = 1, . . . , k. As a consequence, one can rewrite Definition 1 as\n$\\tilde{\\Psi}_{cmh}(S) = \\min_{x_{S'} \\le x_S} \\Psi_{cmh}(x_{S'})$, where the vector inequality xS\u2032 \u2264 xS holds component-wise. Thus,\nnaively computing \u03a8\u0303cmh(S) would require optimizing \u03a8cmh over a set of size $\\prod_{j=1}^{k} x_{S,j} = O(m^k)$,\nwhere m is the geometric mean of $\\{x_{S,j}\\}_{j=1}^{k}$. This scaling is clearly impractical, as even for moderate\nk it would result in an overhead large enough to outweigh the benefits of pruning.\nBecause of this, in the remainder of this section we propose the last key part of FACS: an efficient\nalgorithm which evaluates \u03a8\u0303cmh(S) in only O(k log k) time. We will arrive at our final result in two\nsteps, contained in Lemma 2 and Theorem 2.\nLemma 2 Let S \u2208 IPP be a potentially prunable feature subset. The optimum $x^*_{S'}$ of the discrete\noptimization problem $\\min_{x_{S'} \\le x_S} \\Psi_{cmh}(x_{S'})$ satisfies $x^*_{S',j} = 0$ or $x^*_{S',j} = x_{S,j}$ for each j = 1, . . . , k.\nIn short, Lemma 2 shows that the optimum $x^*_{S'} = \\arg\\min \\{\\Psi_{cmh}(x_{S'}) \\mid x_{S'} \\le x_S\\}$ of the discrete\noptimization problem defining \u03a8\u0303cmh(S) is always a vertex of the discrete hypercube \u27e60, xS\u27e7. Thus, the\ncomputational complexity of evaluating \u03a8\u0303cmh(S) can be reduced from O(m^k) to O(2^k), where\nm \u226b 2 for most patterns. Finally, building upon the result of Lemma 2, Theorem 2 below shows that\none can in fact find the optimal vertex out of all O(2^k) vertices in O(k log k) time.\nTheorem 2 Let S \u2208 IPP be a potentially prunable feature subset and define\n$\\beta^{l}_{S,j} = \\frac{n_{2,j}}{n_j} \\left( 1 - \\frac{x_{S,j}}{n_j} \\right)$ and $\\beta^{r}_{S,j} = \\frac{n_{1,j}}{n_j} \\left( 1 - \\frac{x_{S,j}}{n_j} \\right)$\nfor j = 1, . . . , k. Let \u03c0l and \u03c0r be permutations \u03c0l, \u03c0r : \u27e61, k\u27e7 \u2192 \u27e61, k\u27e7 such that\n$\\beta^{l}_{S,\\pi_l(1)} \\le \\ldots \\le \\beta^{l}_{S,\\pi_l(k)}$ and $\\beta^{r}_{S,\\pi_r(1)} \\le \\ldots \\le \\beta^{r}_{S,\\pi_r(k)}$, respectively.\nThen, there exists an integer \u03ba \u2208 \u27e61, k\u27e7 such that the optimum $x^*_{S'} = \\arg\\min_{x_{S'} \\le x_S} \\Psi_{cmh}(x_{S'})$ satisfies\none of the two possible conditions: (I) $x^*_{S',\\pi_l(j)} = x_{S,\\pi_l(j)}$ for all j \u2264 \u03ba and $x^*_{S',\\pi_l(j)} = 0$ for all\nj > \u03ba or (II) $x^*_{S',\\pi_r(j)} = x_{S,\\pi_r(j)}$ for all j \u2264 \u03ba and $x^*_{S',\\pi_r(j)} = 0$ for all j > \u03ba.\n\nFigure 1: (a) Runtime as a function of the number of features, p. (b) Runtime as a function of the\nnumber of categories of the covariate, k. (c) Precision as a function of the true signal strength, \u03c1true.\n(d) False detection proportion as a function of the strength of the signal \u03c1conf. n = 200 samples\nwere used in (a), (b) and n = 500 in (c), (d). Also, we set \u03c1true = \u03c1conf = \u03c1.\n\nIn summary, Theorem 2 above implies that the 2^k candidates to be the optimum $x^*_{S'}$ according to\nLemma 2 can be narrowed down to only 2k vertices: k candidates satisfying the first condition and k\nthe second condition. Moreover, evaluating \u03a8cmh for all k candidates satisfying the first condition\n(resp. the second condition) can be done in O(k) time rather than O(k^2). This is due to the fact that\neach of the k candidate vertices for each condition can be obtained by changing a single dimension\nwith respect to the previous one. Therefore, the operation dominating the computational complexity\nis the sorting of the two k-vectors $(\\beta^{l}_{S,1}, \\ldots, \\beta^{l}_{S,k})$ and $(\\beta^{r}_{S,1}, \\ldots, \\beta^{r}_{S,k})$. 
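The search described by Lemma 2 and Theorem 2 can be sketched as follows. This is an illustrative reimplementation, not the authors' code: it assumes that, for a potentially prunable support vector x, the largest attainable CMH statistic is the larger of (sum_j x_j n1_j/n_j)^2 and (sum_j x_j n2_j/n_j)^2 divided by the denominator of Equation (1), which is what one obtains by setting every a_j to 0 or to x_j (the paper's own derivation is in its Supp. Material). All function names and the toy margins are hypothetical. A brute-force scan over the 2^k vertices of Lemma 2 is included only as a cross-check.

```python
import math
from itertools import product

def chi2_sf(stat):
    '''P(X > stat) for a chi-squared variable with 1 degree of freedom.'''
    return math.erfc(math.sqrt(stat / 2.0))

def best_prefix_stat(x, n, w):
    '''Largest (sum_j x_j*w_j)^2 / denominator over prefixes of the beta-sorted
    categories, where w_j is n1_j/n_j (first condition) or n2_j/n_j (second).'''
    # beta_j = (1 - w_j) * (1 - x_j / n_j), sorted in ascending order.
    order = sorted(range(len(x)), key=lambda j: (1 - w[j]) * (1 - x[j] / n[j]))
    num = den = best = 0.0
    for j in order:  # running sums: one O(1) update per candidate vertex
        num += x[j] * w[j]
        den += w[j] * (1 - w[j]) * x[j] * (1 - x[j] / n[j])
        if den > 0:
            best = max(best, num * num / den)
    return best

def lower_envelope(x, n1, n2):
    '''Sketch of the lower envelope: two sorts plus two linear scans.'''
    n = [a + b for a, b in zip(n1, n2)]
    left = best_prefix_stat(x, n, [a / t for a, t in zip(n1, n)])
    right = best_prefix_stat(x, n, [b / t for b, t in zip(n2, n)])
    return chi2_sf(max(left, right))

def lower_envelope_bruteforce(x, n1, n2):
    '''Reference: scan all 2^k vertices allowed by Lemma 2.'''
    n = [a + b for a, b in zip(n1, n2)]
    best = 0.0
    for mask in product([0, 1], repeat=len(x)):
        v = [xj * m for xj, m in zip(x, mask)]
        den = sum((a / t) * (b / t) * vj * (1 - vj / t)
                  for a, b, t, vj in zip(n1, n2, n, v))
        if den > 0:
            lo = sum(vj * a / t for vj, a, t in zip(v, n1, n))
            hi = sum(vj * b / t for vj, b, t in zip(v, n2, n))
            best = max(best, lo * lo / den, hi * hi / den)
    return chi2_sf(best)

# Hypothetical margins for k = 4 categories, with x_j <= min(n1_j, n2_j).
n1, n2, x = [30, 10, 25, 40], [20, 35, 25, 15], [12, 9, 3, 14]
fast = lower_envelope(x, n1, n2)
slow = lower_envelope_bruteforce(x, n1, n2)
print(fast, slow)
```

Both routines should agree up to floating-point rounding; the prefix scan visits only 2k candidate vertices, whereas the reference visits 2^k.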
As a consequence, the\nruntime required to evaluate the lower envelope \u03a8\u0303cmh(S), and thus our novel pruning criterion\nis_prunable_cmh, scales as O(k log k) with the number of categories of the covariate.\n\n4 Experiments\nIn Section 4.1 we describe a set of experiments on simulated datasets, evaluating the performance of\nFACS in terms of runtime, precision and its ability to correct for confounding. Next, in Section 4.2,\nwe use our method in two applications in computational biology. Due to space constraints, only a\nhigh-level summary of the experimental setup and results will be presented here. Additional details\ncan be found in the Supp. Material and code for FACS is available on GitHub (footnote 2).\n\n4.1 Runtime and power comparisons on simulated datasets\n\nWe compare FACS with four significant discriminative itemset mining methods: LAMP-\u03c7\u00b2, Bonf-CMH,\n2^k-FACS and m^k-FACS. (1) LAMP-\u03c7\u00b2 [15, 10] is the state of the art in significant discriminative\nitemset mining. It uses Tarone\u2019s testability criterion but is based on Pearson\u2019s \u03c7\u00b2 test and thus cannot\naccount for covariates; (2) Bonf-CMH uses the CMH test, being able to correct for confounders, but\nuses Bonferroni\u2019s correction, resulting in a considerable loss of statistical power; (3) and (4) 2^k-FACS\nand m^k-FACS are two suboptimal versions of FACS, which implement the pruning criterion using the\napproach shown in Lemma 2, which scales as O(2^k), or via brute-force search, scaling as O(m^k).\nRuntime evaluations: Figure 1(a) shows that FACS scales like the state-of-the-art LAMP-\u03c7\u00b2 when\nincreasing the number of features p, while the Bonferroni-based method Bonf-CMH scales considerably\nworse. This both indicates that FACS is able to correct for covariates with virtually no runtime\noverhead with respect to LAMP-\u03c7\u00b2 and confirms the efficacy of Tarone\u2019s testability criterion. 
Figure 1(b) shows that FACS can handle categorical covariates of high cardinality k with almost no overhead,\nin contrast to m^k-FACS and 2^k-FACS, which are only applicable for low k. This demonstrates the\nimportance of our efficient implementation of the pruning criterion.\nPrecision and false positive detection evaluations: We generated synthetic datasets with one truly\nassociated feature subset Strue and one confounded feature subset Sconf to evaluate precision and\nthe ability to correct for confounders. Figure 1(c) shows that the precision of FACS is similar to that of LAMP-\u03c7\u00b2,\nbeing slightly worse for weak signals and slightly better for stronger signals. Again, the performance\nof the Bonferroni-based method Bonf-CMH is drastically worse. Most importantly, Figure 1(d)\nindicates that, unlike LAMP-\u03c7\u00b2, FACS is able to greatly reduce false positive detections by\nconditioning on an appropriate categorical covariate.\n\nFootnote 2: https://github.com/BorgwardtLab/FACS\n\n[Figure 1 plot panels: (a) runtime (in seconds) vs. number of features; (b) runtime (in seconds) vs. number of categories k; (c) precision vs. strength \u03c1; (d) false positive detection vs. strength \u03c1. Legend entries: FACS, 2^k-CMH, m^k-CMH, Bonf-CMH, LAMP-\u03c7\u00b2.]\n\nTable 1: Total number of significant combinations (hits) found by LAMP-\u03c7\u00b2, FACS and Bonf-CMH and\naverage genomic inflation factor \u03bb.
\u03bb for Bonf-CMH is similar to that of FACS since both use the CMH test.\n\nDataset | LAMP-\u03c7\u00b2 hits | LAMP-\u03c7\u00b2 \u03bb | FACS hits | FACS \u03bb | Bonf-CMH hits\nLY | 100,883 | 3.18 | 433 | 1.17 | 19\navrB | 546 | 2.38 | 43 | 1.21 | 1\n\n4.2 Applications to computational biology\nIn this section, we look for significant feature combinations in two widely investigated biological\napplications: Genome-Wide Association Studies (GWAS), using two A. thaliana datasets, and a study\nof combinatorial regulation of gene expression in breast cancer cells.\nA. thaliana GWAS: We apply FACS, LAMP-\u03c7\u00b2 and Bonf-CMH to two datasets from the plant model\norganism A. thaliana [1], which contain 84 and 95 samples, respectively. The labels of each dataset\nindicate the presence/absence of a plant defense-related phenotype: LY and avrB. In the two datasets,\neach plant sample is represented by a sequence of approximately 214,000 genetic bases. The genetic\nbases are encoded as binary features which indicate if the base at a specific locus is standard or altered.\nTo minimize the effect of the evolutionary correlations between nearby bases (< 10 kilobases),\nwe downsampled each of the five chromosomes of each dataset evenly by a factor of 20, using 20\ndifferent offsets. This resulted in complementary datasets containing between 1,423 and 2,661 features.\nOur results for all methods are aggregated across all downsampled versions. In GWAS, one needs to\ncorrect for the confounding effect of population structure to avoid many spurious associations. For\nboth datasets we condition on the ancestry, resulting in k = 5 and k = 3 categories for the covariate.\nTable 1 shows the number of feature combinations (cf. Section 2.1) reported as significant by each\nmethod, as well as the corresponding genomic inflation factor \u03bb [4], a popular criterion in statistical\ngenetics to quantify confounding.
When compared to LAMP-\u03c7\u00b2, we observe a severe reduction in the\nnumber of feature combinations deemed significant by FACS, as well as a sharp decrease in \u03bb. This\nstrongly indicates that many feature combinations reported by LAMP-\u03c7\u00b2 are affected by confounding.\nThe \u03bb values of LAMP-\u03c7\u00b2 show strong marginal associations between many feature combinations\nand labels, inflating the corresponding Pearson \u03c7\u00b2-test statistic values compared to the expected \u03c7\u00b2\nnull distribution and resulting in many spurious associations. However, since most of those feature\ncombinations are independent of the labels given the covariates, the CMH test statistic values are\nmuch closer to the \u03c7\u00b2 distribution, leading to a lower \u03bb and resulting in hits that are corrected for the\ncovariate. Moreover, the lack of power of Bonf-CMH results in a very small number of hits.\nCombinatorial regulation of gene expression in breast cancer cells: The breast cancer dataset,\nas used in [15], includes 12,773 genes classified into up-regulated or not up-regulated. Each gene is\nrepresented by 397 binary features which indicate the presence/absence of a sequence motif in the\nneighborhood of this gene. We aim to find combinations of motifs that are enriched in up-regulated\ngenes. Two sets of experiments were conducted, conditioning on 8 and 16 categories respectively. In\nthis case, the covariate groups together genes sharing similar sets of motifs. As in the GWAS experiments,\nFACS reports far fewer hits: LAMP-\u03c7\u00b2 reports 1,214 motif combinations as significant, while FACS\nreports only 26, a reduction of over 97%. Further studies shown in the Supp.
Material strongly suggest that most motif combinations\nfound by LAMP-\u03c7\u00b2 but not FACS are indeed due to confounding.\n\n5 Conclusions\nThis article has presented FACS, the first approach to significant discriminative itemset mining that (i)\nallows conditioning on a categorical covariate, (ii) corrects for the inherent multiple testing problem\nand (iii) retains high statistical power. Furthermore, we (iv) proved that the runtime of FACS scales\nas O(k log k), where k is the number of states of the categorical covariate. Regarding future work,\ngeneralizing the state of the art to handle continuous data is a key open problem in significant\ndiscriminative itemset mining. Solving it would greatly help make the framework applicable to new\ndomains. Another interesting improvement would be to combine FACS with the approach in [8]. In\nthat work, Tarone\u2019s testability criterion is used along with permutation testing to increase statistical\npower by taking the redundancy between feature combinations into account. By using a similar\napproach in combination with the CMH test, one could further increase statistical power while\nretaining the ability to correct for a categorical covariate.\nAcknowledgments: This work was funded in part by the SNSF Starting Grant \u2018Significant Pattern\nMining\u2019 (KB) and the Marie Curie ITN MLPM2012, Grant No. 316861 (KB, FLL).\n\nReferences\n[1] S. Atwell, Y. S. Huang, B. J. Vilhj\u00e1lmsson, G. Willems, M. Horton, Y. Li, D. Meng, A. Platt, A. M. Tarone,\nT. T. Hu, et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines.\nNature, 465(7298):627\u2013631, 2010.\n\n[2] C.-A. Azencott, D. Grimm, M. Sugiyama, Y. Kawahara, and K. M. Borgwardt. Efficient network-guided\nmulti-locus association mapping with graph cuts. Bioinformatics, 29(13):i171\u2013i179, 2013.\n\n[3] C. E. Bonferroni. Teoria statistica delle classi e calcolo delle probabilit\u00e0.
Pubblicazioni del R Istituto\nSuperiore di Scienze Economiche e Commerciali di Firenze, 8:3\u201362, 1936.\n\n[4] B. Devlin and K. Roeder. Genomic control for association studies. Biometrics, 55(4):997\u20131004, 1999.\n\n[5] O. J. Dunn. Estimation of the medians for dependent variables. Ann. Math. Statist., 30(1):192\u2013197, 1959.\n\n[6] R. A. Fisher. On the Interpretation of \u03c7\u00b2 from Contingency Tables, and the Calculation of P. Journal of\nthe Royal Statistical Society, 85(1):87\u201394, 1922.\n\n[7] F. Llinares-L\u00f3pez, D. Grimm, D. A. Bodenham, U. Gieraths, M. Sugiyama, B. Rowan, and K. M. Borgwardt.\nGenome-wide detection of intervals of genetic heterogeneity associated with complex traits. Bioinformatics,\n31(12):240\u2013249, 2015.\n\n[8] F. Llinares-L\u00f3pez, M. Sugiyama, L. Papaxanthos, and K. M. Borgwardt. Fast and Memory-Efficient\nSignificant Pattern Mining via Permutation Testing. In Proceedings of the 21st ACM SIGKDD International\nConference on Knowledge Discovery and Data Mining, Sydney, 2015, pages 725\u2013734. ACM, 2015.\n\n[9] N. Mantel and W. Haenszel. Statistical aspects of the analysis of data from retrospective studies of disease.\nJournal of the National Cancer Institute, 22(4):719, 1959.\n\n[10] S. Minato, T. Uno, K. Tsuda, A. Terada, and J. Sese. A fast method of statistical assessment for combinatorial\nhypotheses based on frequent itemset enumeration. In ECMLPKDD, volume 8725 of LNCS, pages\n422\u2013436, 2014.\n\n[11] K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated\nsystem of variables is such that it can be reasonably supposed to have arisen from random sampling.\nPhilosophical Magazine, 50:157\u2013175, 1900.\n\n[12] A. L. Price, N. J. Patterson, R. M. Plenge, M. E. Weinblatt, N. A. Shadick, and D. Reich. Principal\ncomponents analysis corrects for stratification in genome-wide association studies.
Nat Genet, 38(8):904\u2013909, 2006.\n\n[13] M. Sugiyama, F. Llinares L\u00f3pez, N. Kasenburg, and K. M. Borgwardt. Mining significant subgraphs with\nmultiple testing correction. In SIAM Data Mining (SDM), 2015.\n\n[14] R. E. Tarone. A modified Bonferroni method for discrete data. Biometrics, 46(2):515\u2013522, 1990.\n\n[15] A. Terada, M. Okada-Hatakeyama, K. Tsuda, and J. Sese. Statistical significance of combinatorial\nregulations. Proceedings of the National Academy of Sciences, 110(32):12996\u201313001, 2013.\n\n[16] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society.\nSeries B (Methodological), pages 267\u2013288, 1996.\n\n[17] B. J. Vilhj\u00e1lmsson and M. Nordborg. The nature of confounding in genome-wide association studies.\nNature Reviews Genetics, 14(1):1\u20132, Jan. 2013.\n\n[18] G. Webb. Discovering significant rules. In Proceedings of the 12th ACM SIGKDD International Conference\non Knowledge Discovery and Data Mining, New York, 2006, pages 434\u2013443. ACM, 2006.\n", "award": [], "sourceid": 1173, "authors": [{"given_name": "Laetitia", "family_name": "Papaxanthos", "institution": "ETH Zurich"}, {"given_name": "Felipe", "family_name": "Llinares-L\u00f3pez", "institution": "ETH Zurich"}, {"given_name": "Dean", "family_name": "Bodenham", "institution": "ETH Zurich"}, {"given_name": "Karsten", "family_name": "Borgwardt", "institution": "ETH Zurich"}]}