{"title": "Toward Provably Correct Feature Selection in Arbitrary Domains", "book": "Advances in Neural Information Processing Systems", "page_first": 1240, "page_last": 1248, "abstract": "In this paper we address the problem of provably correct feature selection in arbitrary domains. An optimal solution to the problem is a Markov boundary, which is a minimal set of features that make the probability distribution of a target variable conditionally invariant to the state of all other features in the domain. While numerous algorithms for this problem have been proposed, their theoretical correctness and practical behavior under arbitrary probability distributions is unclear. We address this by introducing the Markov Boundary Theorem that precisely characterizes the properties of an ideal Markov boundary, and use it to develop algorithms that learn a more general boundary that can capture complex interactions that only appear when the values of multiple features are considered together. We introduce two algorithms: an exact, provably correct one as well a more practical randomized anytime version, and show that they perform well on artificial as well as benchmark and real-world data sets. Throughout the paper we make minimal assumptions that consist of only a general set of axioms that hold for every probability distribution, which gives these algorithms universal applicability.", "full_text": "Toward Provably Correct Feature Selection in\n\nArbitrary Domains\n\nDimitris Margaritis\n\nDepartment of Computer Science\n\nIowa State University\nAmes, IA 50010, USA\n\ndmarg@cs.iastate.edu\n\nAbstract\n\nIn this paper we address the problem of provably correct feature selection in arbi-\ntrary domains. An optimal solution to the problem is a Markov boundary, which\nis a minimal set of features that make the probability distribution of a target vari-\nable conditionally invariant to the state of all other features in the domain. 
While numerous algorithms for this problem have been proposed, their theoretical correctness and practical behavior under arbitrary probability distributions is unclear. We address this by introducing the Markov Boundary Theorem, which precisely characterizes the properties of an ideal Markov boundary, and use it to develop algorithms that learn a more general boundary that can capture complex interactions that only appear when the values of multiple features are considered together. We introduce two algorithms: an exact, provably correct one, as well as a more practical randomized anytime version, and show that they perform well on artificial as well as benchmark and real-world data sets. Throughout the paper we make minimal assumptions that consist of only a general set of axioms that hold for every probability distribution, which gives these algorithms universal applicability.

1 Introduction and Motivation
The problem of feature selection has a long history due to its significance in a wide range of important problems, from early ones like pattern recognition to recent ones such as text categorization, gene expression analysis and others. In such domains, using all available features may be prohibitively expensive, unnecessarily wasteful, and may lead to poor generalization performance, especially in the presence of irrelevant or redundant features. Thus, selecting a subset of features of the domain for use in subsequent application of machine learning algorithms has become a standard preprocessing step. A typical task of these algorithms is learning a classifier: given a number of input features and a quantity of interest, called the target variable, choose a member of a family of classifiers that can predict the target variable's value as well as possible.
Another task is understanding the domain and the quantities that interact with the target quantity.

Many algorithms have been proposed for feature selection. Unfortunately, little attention has been paid to the issue of their behavior under the variety of application domains that can be encountered in practice. In particular, it is known that many can fail under certain probability distributions, such as ones that contain a (near) parity function [1], which contain interactions that only appear when the values of multiple features are considered together. There is therefore an acute need for algorithms that are widely applicable and can be theoretically proven to work under any probability distribution. In this paper we present two such algorithms, an exact one and a more practical randomized approximate one. We use the observation (first made in Koller and Sahami [2]) that an optimal solution to the problem is a Markov boundary, defined to be a minimal set of features that make the probability distribution of a target variable conditionally invariant to the state of all other features in the domain (a more precise definition is given later in Section 3), and present a family of algorithms for learning the Markov boundary of a target variable in arbitrary domains. We first introduce a theorem that exactly characterizes the minimal set of features necessary for probabilistically isolating a variable, and then relax this definition to derive a family of algorithms that learn a parameterized approximation of the ideal boundary and that are provably correct under a minimal set of assumptions, including a set of axioms that hold for any probability distribution.

In the following section we present related work on feature selection, followed by notation and definitions in Section 3.
We subsequently introduce an important theorem and the aforementioned parameterized family of algorithms in Sections 4 and 5 respectively, including a practical anytime version. We evaluate these algorithms in Section 6 and conclude in Section 7.

2 Related Work
Numerous algorithms have been proposed for feature selection. At the highest level, algorithms can be classified as filter, wrapper, or embedded methods. Filter methods work without consulting the classifier (if any) that will make use of their output, i.e., the resulting set of selected features. They therefore typically have wider applicability, since they are not tied to any particular classifier family. In contrast, wrappers make the classifier an integral part of their operation, repeatedly invoking it to evaluate each of a sequence of feature subsets, and selecting the subset that results in minimum estimated classification error (for that particular classifier). Finally, embedded algorithms are classifier-learning algorithms that perform feature selection implicitly during their operation, e.g., decision tree learners.

Early work was motivated by the problem of pattern recognition, which inherently contains a large number of features (pixels, regions, signal responses at multiple frequencies, etc.). Narendra and Fukunaga [3] first cast feature selection as a problem of maximization of an objective function over the set of features to use, and proposed a number of search approaches including forward selection and backward elimination. Later work by machine learning researchers includes the FOCUS algorithm of Almuallim and Dietterich [4], which is a filter method for deterministic, noise-free domains. The RELIEF algorithm [5] instead uses a randomized selection of data points to update a weight assigned to each feature, selecting the features whose weight exceeds a given threshold.
A large number of additional algorithms have appeared in the literature, too many to list here; good surveys are included in Dash and Liu [6], Guyon and Elisseeff [1], and Liu and Motoda [7]. An important concept for feature subset selection is relevance. Several notions of relevance are discussed in a number of important papers, such as Blum and Langley [8] and Kohavi and John [9]. The argument that the problem of feature selection can be cast as the problem of Markov blanket discovery was first made convincingly in Koller and Sahami [2], who also presented an algorithm for learning an approximate Markov blanket using mutual information. Other algorithms include the GS algorithm [10], originally developed for learning the structure of the Bayesian network of a domain, and extensions to it [11], including the recent MMMB algorithm [12]. Meinshausen and Bühlmann [13] recently proposed an optimal theoretical solution to the problem of learning the neighborhood of a Markov network when the distribution of the domain can be assumed to be a multidimensional Gaussian, i.e., linear relations among features with Gaussian noise. This assumption implies that the Composition axiom holds in the domain (see Pearl [14] for a definition of Composition); the difference with our work is that we address here the problem in general domains, where it may not necessarily hold.

3 Notation and Preliminaries
In this section we present notation, fundamental definitions and axioms that will be used in the rest of the paper. We use the terms "feature" and "variable" interchangeably, and denote variables by capital letters (X, Y, etc.) and sets of variables by bold letters (S, T, etc.). We denote the set of all variables/features in the domain (the "universe") by U.
All algorithms presented are independence-based, learning the Markov boundary of a given target variable using the truth values of a number of conditional independence statements. The use of conditional independence for feature selection subsumes many other criteria proposed in the literature. In particular, the use of classification accuracy of the target variable can be seen as a special case of testing for its conditional independence with some of its predictor variables (conditional on the subset selected at any given moment). A benefit of using conditional independence is that, while classification error estimates depend on the classifier family used, conditional independence does not. In addition, algorithms utilizing conditional independence for feature selection are applicable to all domain types, e.g., discrete, ordinal, continuous with non-linear or arbitrary non-degenerate associations, or mixed domains, as long as a reliable estimate of probabilistic independence is available.

We denote probabilistic independence by the symbol "⊥⊥", i.e., (X ⊥⊥ Y | Z) denotes the fact that the variables in set X are (jointly) conditionally independent of those in set Y given the values of the variables in set Z; (X ⊥̸⊥ Y | Z) denotes their conditional dependence. We assume the existence of a probabilistic independence query oracle that is available to answer any query of the form (X, Y | Z), corresponding to the question "Is the set of variables in X independent of the variables in Y given the values of the variables in Z?" (This is similar to the approach of learning from statistical queries of Kearns and Vazirani [15].) In practice, however, such an oracle does not exist, but it can be approximated by a statistical independence test on a data set.
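For concreteness, such an approximation can be sketched as follows. This is a minimal pure-Python illustration (the function names are ours, not from the paper): it computes the Pearson chi-square statistic of X versus Y within each stratum of the conditioning set Z and pools statistics and degrees of freedom across strata; a real implementation would additionally convert the pooled statistic into a p-value and compare it against a significance threshold.

```python
from collections import Counter, defaultdict

def chi2_statistic(pairs):
    """Pearson chi-square statistic for independence of two discrete
    variables, from a list of (x, y) observations.
    Returns (statistic, degrees_of_freedom)."""
    n = len(pairs)
    joint = Counter(pairs)
    x_marg = Counter(x for x, _ in pairs)
    y_marg = Counter(y for _, y in pairs)
    stat = 0.0
    for x, nx in x_marg.items():
        for y, ny in y_marg.items():
            expected = nx * ny / n        # expected count under independence
            observed = joint.get((x, y), 0)
            stat += (observed - expected) ** 2 / expected
    dof = (len(x_marg) - 1) * (len(y_marg) - 1)
    return stat, dof

def conditional_chi2(xs, ys, zs):
    """Chi-square statistic of X vs. Y given Z: stratify the data on the
    value of Z and pool per-stratum statistics and degrees of freedom."""
    strata = defaultdict(list)
    for x, y, z in zip(xs, ys, zs):
        strata[z].append((x, y))
    total_stat, total_dof = 0.0, 0
    for pairs in strata.values():
        s, d = chi2_statistic(pairs)
        total_stat += s
        total_dof += d
    return total_stat, total_dof
```

To condition on a set of variables, z can be the tuple of their values; with many conditioning variables the strata quickly become sparse, which is the practical limit on conditioning-set size that any such test faces.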
Many tests of independence have appeared and been studied extensively in the statistical literature over the last century; in this work we use the χ² (chi-square) test of independence [16].

A Markov blanket of a variable X is a set of variables such that, after fixing (by "knowing") the value of all of its members, the set of remaining variables in the domain, taken together as a single set-valued variable, is statistically independent of X. More precisely, we have the following definition.

Definition 1. A set of variables S ⊆ U is called a Markov blanket of variable X if and only if (X ⊥⊥ U − S − {X} | S).

Intuitively, a Markov blanket S of X captures all the information in the remaining domain variables U − S − {X} that can affect the probability distribution of X, making their values redundant as far as X is concerned (given S). The blanket therefore captures the essence of the feature selection problem for target variable X: by completely "shielding" X, a Markov blanket precludes the existence of any possible information about X that can come from variables not in the blanket, making it an ideal solution to the feature selection problem. A minimal Markov blanket is called a Markov boundary.

Definition 2. A set of variables S ⊆ U − {X} is called a Markov boundary of variable X if it is a minimal Markov blanket of X, i.e., none of its proper subsets is a Markov blanket.

Pearl [14] proved that the axioms of Symmetry, Decomposition, Weak Union, and Intersection are sufficient to guarantee a unique Markov boundary.
These are shown below, together with the axiom of Contraction.

(Symmetry)      (X ⊥⊥ Y | Z) ⇒ (Y ⊥⊥ X | Z)
(Decomposition) (X ⊥⊥ Y ∪ W | Z) ⇒ (X ⊥⊥ Y | Z) ∧ (X ⊥⊥ W | Z)
(Weak Union)    (X ⊥⊥ Y ∪ W | Z) ⇒ (X ⊥⊥ Y | Z ∪ W)
(Contraction)   (X ⊥⊥ Y | Z) ∧ (X ⊥⊥ W | Y ∪ Z) ⇒ (X ⊥⊥ Y ∪ W | Z)
(Intersection)  (X ⊥⊥ Y | Z ∪ W) ∧ (X ⊥⊥ W | Z ∪ Y) ⇒ (X ⊥⊥ Y ∪ W | Z)    (1)

The Symmetry, Decomposition, Contraction and Weak Union axioms are very general: they are necessary axioms for the probabilistic definition of independence, i.e., they hold in every probability distribution, as their proofs are based on the axioms of probability theory. Intersection is not universal, but it holds in distributions that are positive, i.e., ones in which any value combination of the domain variables has a non-zero probability of occurring.

4 The Markov Boundary Theorem
According to Definition 2, a Markov boundary is a minimal Markov blanket. We first introduce a theorem that provides an alternative, equivalent definition of the concept of a Markov boundary, which we will relax later in the paper to produce a more general boundary definition.

Theorem 1 (Markov Boundary Theorem). Assuming that the Decomposition and Contraction axioms hold, S ⊆ U − {X} is a Markov boundary of variable X ∈ U if and only if

∀ T ⊆ U − {X}, { T ⊆ U − S ⇔ (X ⊥⊥ T | S − T) }.    (2)

A detailed proof cannot be included here due to space constraints, but a proof sketch appears in Appendix A. According to the above theorem, a Markov boundary S partitions the powerset of U − {X} into two parts: (a) a set P1 that contains all subsets of U − S, and (b) a set P2 containing the remaining subsets.
All sets in P1 are conditionally independent of X, and all sets in P2 are conditionally dependent on X.

Intuitively, the two directions of the logical equivalence relation of Eq. (2) correspond to the concept of a Markov blanket and to its minimality. The equation

∀ T ⊆ U − {X}, { T ⊆ U − S ⇒ (X ⊥⊥ T | S − T) }

or, equivalently, (∀ T ⊆ U − S − {X}, (X ⊥⊥ T | S)) (as T and S are disjoint), corresponds to the definition of a Markov blanket, as it includes T = U − S − {X}. In the opposite direction, the contrapositive form is

∀ T ⊆ U − {X}, { T ⊈ U − S ⇒ (X ⊥̸⊥ T | S − T) }.

This corresponds to the minimality of the Markov boundary: it states that no set that contains a part of S can be independent of X given the remainder of S. Informally, this is because if there existed some set T that contained a non-empty subset T′ of S such that (X ⊥⊥ T | S − T), then one would be able to shrink S by T′ (by the property of Contraction) and therefore S would not be minimal (more details in Appendix A).

Algorithm 1 The abstract GS(m)(X) algorithm. Returns an m-Markov boundary of X.
1: S ← ∅
2: /* Growing phase. */
3: for all Y ⊆ U − S − {X} such that 1 ≤ |Y| ≤ m do
4:   if (X ⊥̸⊥ Y | S) then
5:     S ← S ∪ Y
6:     goto line 3    /* Restart loop. */
7: /* Shrinking phase. */
8: for all Y ∈ S do
9:   if (X ⊥⊥ Y | S − {Y}) then
10:    S ← S − {Y}
11:    goto line 8    /* Restart loop. */
12: return S

5 A Family of Algorithms for Arbitrary Domains
Theorem 1 defines conditions that precisely characterize a Markov boundary and can thus be thought of as an alternative definition of a boundary. By relaxing these conditions we can produce a more general definition.
In particular, an m-Markov boundary is defined as follows.

Definition 3. A set of variables S ⊆ U − {X} of a domain U is called an m-Markov boundary of variable X ∈ U if and only if

∀ T ⊆ U − {X} such that |T| ≤ m, { T ⊆ U − S ⇔ (X ⊥⊥ T | S − T) }.

We call the parameter m of an m-Markov boundary the Markov boundary margin. Intuitively, an m-boundary S guarantees that (a) all subsets of its complement (excluding X) of size m or smaller are independent of X given S, and (b) all sets T of size m or smaller that are not subsets of its complement are dependent on X given the part of S that is not contained in T. This definition is a special case of the properties of a boundary stated in Theorem 1, with each set T mentioned in the theorem now restricted to having size m or smaller. For m = n − 1, where n = |U|, the condition |T| ≤ m is always satisfied and can be omitted; in this case the definition of an (n − 1)-Markov boundary coincides exactly with Eq. (2) of Theorem 1.

We now present an algorithm called GS(m), shown in Algorithm 1, that provably correctly learns an m-boundary of a target variable X. GS(m) operates in two phases, a growing and a shrinking phase (hence the acronym). During the growing phase it examines sets of variables of size up to m, where m is a user-specified parameter. During the shrinking phase, single variables are examined for conditional independence and possible removal from S (examining sets in the shrinking phase is not necessary for provably correct operation; see Appendix B). The orders of examination of the sets for possible addition to and deletion from the candidate boundary are left intentionally unspecified in Algorithm 1; one can therefore view it as an abstract representative of a family of algorithms, with each member specifying one such ordering.
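To make the abstract procedure concrete, here is one way the GS(m) skeleton might look in Python. This is our own minimal sketch, not the paper's implementation: `indep(X, Y, Z)` stands in for the independence oracle (in practice a statistical test), the candidate ordering is simply lexicographic by increasing size, and the goto-style restarts of Algorithm 1 are expressed with flags.

```python
from itertools import combinations

def gs_m(X, universe, indep, m):
    """Abstract GS(m): grow-shrink search for an m-Markov boundary of X.
    `indep(x, ys, zs)` answers conditional independence queries (the oracle)."""
    S = set()
    # Growing phase: add any subset Y, |Y| <= m, found dependent on X given S,
    # then restart the scan over the remaining variables.
    changed = True
    while changed:
        changed = False
        rest = sorted(universe - S - {X})
        for size in range(1, m + 1):
            for Y in combinations(rest, size):
                if not indep(X, set(Y), S):
                    S |= set(Y)
                    changed = True
                    break
            if changed:
                break
    # Shrinking phase: remove single variables independent of X given the rest.
    changed = True
    while changed:
        changed = False
        for Y in sorted(S):
            if indep(X, {Y}, S - {Y}):
                S.discard(Y)
                changed = True
                break
    return S
```

With m = 1 this reduces to a GS-style search over single variables; larger margins let the growing phase detect variables that are only jointly dependent on X (parity-like interactions), at a combinatorial cost in the number of tests.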
All members of this family are m-correct, as the proof of correctness does not depend on the ordering. In practice numerous choices for the ordering exist; one possibility is to examine the sets in the growing phase in order of increasing set size and, for each such size, in order of decreasing conditional mutual information I(X, Y, S) between X and Y given S. The rationale for this heuristic choice is that (usually) tests with smaller conditioning sets tend to be more reliable, and sorting by mutual information tends to lessen the chance of adding false members of the Markov boundary. We used this implementation in all our experiments, presented later in Section 6.

Intuitively, the margin represents a trade-off between sample and computational complexity on the one hand and completeness on the other: for m = n − 1 = |U| − 1, the algorithm returns a Markov boundary in unrestricted (arbitrary) domains. For 1 ≤ m < n − 1, GS(m) may recover the correct boundary depending on the characteristics of the domain.

Algorithm 2 The RGS(m,k)(X) algorithm, a randomized anytime version of the GS(m) algorithm, utilizing k random subsets for the growing phase.
1: S ← ∅
2: /* Growing phase. */
3: repeat
4:   Schanged ← false
5:   Y ← subset of U − S − {X} of size 1 ≤ |Y| ≤ m of maximum dependence out of k random subsets
6:   if (X ⊥̸⊥ Y | S) then
7:     S ← S ∪ Y
8:     Schanged ← true
9: until Schanged = false
10: /* Shrinking phase. */
11: for all Y ∈ S do
12:   if (X ⊥⊥ Y | S − {Y}) then
13:    S ← S − {Y}
14:    goto line 11    /* Restart loop. */
15: return S
For example, it will recover the correct boundary in domains containing embedded parity functions such that the number of variables involved in every k-bit parity function is m + 1 or less, i.e., if k ≤ m + 1 (parity functions are corner cases in the space of probability distributions that are known to be hard to learn [17]). The proof of m-correctness of GS(m) is included in Appendix B. Note that it is based on Theorem 1 and the universal axioms of Eqs. (1) only, i.e., Intersection is not needed, and thus it is widely applicable (to any domain).

A Practical Randomized Anytime Version
While GS(m) is provably correct even in difficult domains such as those that contain parity functions, it may be impractical with a large number of features, as its asymptotic complexity is O(n^m). We therefore also provide here a more practical randomized version called RGS(m,k) (Randomized GS(m)), shown in Algorithm 2. The RGS(m,k) algorithm has an additional parameter k that limits its computational requirements: instead of exhaustively examining all possible subsets of U − S − {X} (as GS(m) does), it samples k subsets from the set of all possible subsets of U − S − {X}, where k is user-specified. It is therefore a randomized algorithm that becomes equivalent to GS(m) given a large enough k. Many possibilities for the method of random selection of the subsets exist; in our experiments we select a subset Y = {Y_i} (1 ≤ |Y| ≤ m) with probability proportional to Σ_{i=1}^{|Y|} (1/p(X, Y_i | S)), where p(X, Y_i | S) is the p-value of the corresponding (univariate) test between X and Y_i given S, which has a low computational cost.

The RGS(m,k) algorithm is useful in situations where the amount of time to produce an answer may be limited and/or the limit unknown beforehand: it is easy to show that the growing phase of GS(m) produces an upper bound of the m-boundary of X.
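The randomized growing phase can be sketched as follows. This is our own illustration, with a generic `dependence` score (positive meaning dependence) standing in for the p-value-based selection scheme described above, rather than the paper's exact sampling distribution.

```python
import random

def rgs_growing_phase(X, universe, dependence, m, k, seed=0):
    """Randomized growing phase of RGS(m,k): at each step, score k random
    subsets of size <= m and add the most dependent one, if it is dependent.
    `dependence(x, ys, zs)` returns a score > 0 for dependence, <= 0 otherwise."""
    rng = random.Random(seed)
    S = set()
    changed = True
    while changed:
        changed = False
        rest = sorted(universe - S - {X})
        if not rest:
            break
        # Draw k candidate subsets of size between 1 and m.
        candidates = []
        for _ in range(k):
            size = rng.randint(1, min(m, len(rest)))
            candidates.append(tuple(rng.sample(rest, size)))
        best = max(candidates, key=lambda Y: dependence(X, set(Y), S))
        if dependence(X, set(best), S) > 0:
            S |= set(best)
            changed = True
    return S
```

The shrinking phase is identical to that of GS(m), so interrupting after any growing step leaves a candidate superset that can still be shrunk later.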
Therefore, the RGS(m,k) algorithm, if interrupted, will return an approximation of this upper bound. Moreover, if there exists time for the shrinking phase to be executed (which conducts a number of tests linear in n and is thus fast), extraneous variables will be removed and a minimal blanket (boundary) approximation will be returned. These features make it an anytime algorithm, which is a more appropriate choice for situations where critical events may occur that require the interruption of computation, e.g., during the planning phase of a robot, which may be interrupted at any time due to an urgent external event that requires a decision to be made based on the present state's feature values.

6 Experiments
We evaluated the GS(m) and the RGS(m,k) algorithms on synthetic as well as real-world and benchmark data sets. We first systematically examined their performance on the task of recovering near-parity functions, which are known to be hard to learn [17]. We compared GS(m) and RGS(m,k) with respect to accuracy of recovery of the original boundary as well as computational cost. We generated domains of sizes ranging from 10 to 100 variables, of which 4 variables (X1 to X4) were related through a near-parity relation with bit probability 0.60 and various degrees of noise. The remaining independent variables (X5 to Xn) act as "distractors" and had randomly assigned probabilities, i.e., the correct boundary of X1 is B1 = {X2, X3, X4}. In such domains, learning the boundary of X1 is difficult because of the large number of distractors and because each Xi ∈ B1 is independent of X1 given any proper subset of B1 − {Xi} (they only become dependent when including all of them in the conditioning set).

Figure 2: Left: F1 measure of GS(m), RGS(m,k) and RELIEVED vs. noise level (50 variables, true Markov boundary size = 3, Bernoulli probability = 0.6, 1000 data points). Middle: Probabilistic isolation performance comparison between GS(3) and RELIEVED on real-world and benchmark data sets (Balance scale, Balloons, Car evaluation, Credit screening, Monks, Nursery, Tic-tac-toe, Breast cancer, Chess, Audiology). Right: Same for GS(3) and RGS(3,1000).

To measure an algorithm's feature selection performance, accuracy (the fraction of variables correctly included or excluded) is inappropriate, as the accuracy of trivial algorithms such as returning the empty set will tend to 1 as n increases. Precision and recall are therefore more appropriate, with precision defined as the fraction of features returned that are in the correct boundary (3 features for X1), and recall as the fraction of the features present in the correct boundary that are returned by the algorithm. A convenient and frequently used measure that combines precision and recall is the F1 measure, defined as the harmonic mean of precision and recall [18].

In Fig. 1 (top) we report 95% confidence intervals for the F1 measure and execution time of GS(m) (margins m = 1 to 3) and RGS(m,k) (margins 1 to 3 and k = 1000 random subsets), using 20 data sets containing 10 to 100 variables, with the target variable X1 perturbed (inverted) by noise with 10% probability. As can be seen, RGS(m,k) and GS(m) using the same value for the margin perform comparably with respect to F1, up to their 95% confidence intervals.
With respect to execution time, however, RGS(m,k) exhibits much greater scalability (Fig. 1 bottom, log scale); for example, it executes in about 10 seconds on average in domains containing 100 variables, while GS(m) executes in 1,000 seconds on average for this domain size.

Figure 1: GS(m) and RGS(m,k) performance with respect to domain size (number of variables), for true Markov boundary size = 3, 1000 data points, Bernoulli probability = 0.6, noise probability = 0.1. Top: F1 measure, reflecting accuracy. Bottom: Execution time in seconds (log scale).

We also compared GS(m) and RGS(m,k) to RELIEF [5], a well-known algorithm for feature selection that is known to be able to recover parity functions in certain cases [5]. RELIEF learns a weight for each variable and compares it to a threshold τ to decide on its inclusion in the set of relevant variables. As it has been reported [9] that RELIEF can exhibit large variance due to randomization that is necessary only for very large data sets, we instead used a deterministic variant called RELIEVED [9], whose behavior corresponds to RELIEF at the limit of infinite execution time. We calculated the F1 measure for GS(m), RGS(m,k) and RELIEVED in the presence of varying amounts of noise, with noise probability ranging from 0 (no noise) to 0.4. We used domains containing 50 variables, as GS(m) becomes computationally demanding in larger domains.
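For reference, the precision, recall, and F1 quantities used in these comparisons can be computed with a few lines (our own helper, using the usual convention of 0 when a ratio is undefined):

```python
def precision_recall_f1(returned, correct):
    """Precision, recall, and F1 of a returned feature set against
    the true Markov boundary."""
    returned, correct = set(returned), set(correct)
    tp = len(returned & correct)                      # true positives
    precision = tp / len(returned) if returned else 0.0
    recall = tp / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1
```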
In Figure 2 (left) we show the performance of GS(m) and RGS(m,k) for m equal to 1 and 3, k = 1000, and RELIEVED for thresholds τ = 0.01 and 0.03, for various amounts of noise on the target variable. Again, each experiment was repeated 20 times to generate 95% confidence intervals. We can observe that even though m = 1 (equivalent to the GS algorithm) performs poorly, increasing the margin m makes it more likely to recover the correct Markov boundary, and GS(3) (m = 3) recovers the exact blanket even with few (1,000) data points. RELIEVED does comparably to GS(3) for little noise and for a large threshold, but appears to deteriorate in noisier domains. As we can see, it is difficult to choose the "right" threshold for RELIEVED: a τ that performs better at low noise can become worse in noisy environments; in particular, small τ tend to include irrelevant variables while large τ tend to miss actual members.

We also evaluated GS(m), RGS(m,k), and RELIEVED on benchmark and real-world data sets from the UCI Machine Learning repository. As the true Markov boundary for these is impossible to know, we used as performance measure the probabilistic isolation, by the returned Markov boundary, of subsets outside the boundary. For each domain variable X, we measured the independence of subsets Y of size 1, 2 and 3 given the blanket S of X returned by GS(3) and RELIEVED for τ = 0.03 (as this value seemed to do better in the previous set of experiments), as measured by the average p-value of the χ² test between X and Y given S (with p-values of 0 and 1 indicating ideal dependence and independence, respectively). Due to the large number of subsets outside the boundary when the boundary is small, we limited the estimation of isolation performance to 2,000 subsets per variable. We plot the results in Figure 2 (middle and right). Each point represents a variable in the corresponding data set.
Points under the diagonal indicate better probabilistic isolation performance for that variable for GS(3) compared to RELIEVED (middle plot) or to RGS(3,1000) (right plot). To obtain a statistically significant comparison, we used the non-parametric Wilcoxon paired signed-rank test, which indicated that GS(3) and RGS(3,1000) are statistically equivalent to each other, while both outperformed RELIEVED at the 99.99% significance level (α < 10⁻⁷).

7 Conclusion
In this paper we presented algorithms for the problem of feature selection in unrestricted (arbitrary distribution) domains that may contain complex interactions that only appear when the values of multiple features are considered together. We introduced two algorithms: an exact, provably correct one, as well as a more practical randomized anytime version, and evaluated them on artificial, benchmark and real-world data, demonstrating that they perform well, even in the presence of noise. We also introduced the Markov Boundary Theorem that precisely characterizes the properties of a boundary, and used it to prove the m-correctness of the exact family of algorithms presented. We made minimal assumptions, consisting of only a general set of axioms that hold for every probability distribution, giving our algorithms universal applicability.

Appendix A: Proof sketch of the Markov Boundary Theorem
Proof sketch. (⇒ direction) We need to prove that if S is a Markov boundary of X then (a) for every set T ⊆ U − S − {X}, (X ⊥⊥ T | S − T), and (b) for every set T′ ⊈ U − S that does not contain X, (X ⊥̸⊥ T′ | S − T′). Case (a) is immediate from the definition of the boundary and the Decomposition axiom.
Case (b) can be proven by contradiction: Assume the independence of a set T′ that contains a non-empty part T′1 in S and a part T′2 in U − S. From Decomposition we get (X⊥⊥ T′1 | S − T′1). We can then use Contraction to show that the set S − T′1 satisfies the independence property of a Markov boundary, i.e., that (X⊥⊥ U − (S − T′1) − {X} | S − T′1), which contradicts the assumption that S is a boundary (and thus minimal).
(⇐= direction) We need to prove that if Eq. (2) holds, then S is a minimal Markov blanket. The proof that S is a blanket is immediate. We can prove minimality by contradiction: Assume S = S1 ∪ S2 with S1 a blanket and S2 ≠ ∅, i.e., S1 is a blanket strictly smaller than S. Then (X⊥⊥ S2 | S1) = (X⊥⊥ S2 | S − S2). However, since S2 ⊈ U − S, from Eq. (2) we get (X ⊥̸⊥ S2 | S − S2), which is a contradiction.
Appendix B: Proof of m-Correctness of GS(m)
Let the value of the set S at the end of the growing phase be SG, its value at the end of the shrinking phase SS, and their difference S∆ = SG − SS. The following two observations are immediate.
Observation 1. For every Y ⊆ U − SG − {X} such that 1 ≤ |Y| ≤ m, (X⊥⊥ Y | SG).
Observation 2. For every Y ∈ SS, (X ⊥̸⊥ Y | SS − {Y}).
Lemma 2. Consider variables Y1, Y2, . . . , Yt for some t ≥ 1 and let Y = {Y1, . . . , Yt}. Assuming that Contraction holds, if (X⊥⊥ Yi | S − {Y1, . . . , Yi}) for all i = 1, . . . , t, then (X⊥⊥ Y | S − Y).
Proof. By induction on i = 1, 2, . . . , t, using Contraction to decrease the conditioning set from S down to S − {Y1, . . . , Yi} for each i. Since Y = {Y1, . . . , Yt}, we immediately obtain the desired relation (X⊥⊥ Y | S − Y).
Lemma 2 can be used to show that the variables found individually independent of X during the shrinking phase are actually jointly independent of X, given the final set SS. Let S∆ = {Y1, Y2, . . . , Yt} be the set of variables removed (in that order) from SG to form the final set SS, i.e., S∆ = SG − SS. Using the above lemma, the following is immediate.
Corollary 3. Assuming that the Contraction axiom holds, (X⊥⊥ S∆ | SS).
Lemma 4. If the Contraction, Decomposition and Weak Union axioms hold, then for every set T ⊆ U − SG − {X} such that (X⊥⊥ T | SG),

    (X⊥⊥ T ∪ (SG − SS) | SS).    (3)

Furthermore, SS is minimal, i.e., there does not exist a subset of SS for which Eq. (3) is true.
Proof. From Corollary 3, (X⊥⊥ S∆ | SS). Also, by the hypothesis, (X⊥⊥ T | SG) = (X⊥⊥ T | SS ∪ S∆), where S∆ = SG − SS as usual. From these two relations and Contraction we obtain (X⊥⊥ T ∪ S∆ | SS).
To prove minimality, let us assume that SS ≠ ∅ (if SS = ∅ then it is already minimal). We prove by contradiction: Assume that there exists a set S′ ⊂ SS such that (X⊥⊥ T ∪ (SG − S′) | S′). Let W = SS − S′ ≠ ∅. Note that W and S′ are disjoint.
We have that

    SS ⊆ SS ∪ S∆ =⇒ SS − S′ ⊆ SS ∪ S∆ − S′ ⊆ T ∪ (SS ∪ S∆ − S′)
                 =⇒ W ⊆ T ∪ (SS ∪ S∆ − S′) = T ∪ (SG − S′).

• Since (X⊥⊥ T ∪ (SG − S′) | S′) and W ⊆ T ∪ (SS ∪ S∆ − S′), from Decomposition we get (X⊥⊥ W | S′).
• From (X⊥⊥ W | S′) and Weak Union we have that for every Y ∈ W, (X⊥⊥ Y | S′ ∪ (W − {Y})).
• Since S′ and W are disjoint and since Y ∈ W, Y ∉ S′. Applying the set equality (A − B) ∪ C = (A ∪ C) − (B − C) to S′ ∪ (W − {Y}) we obtain (W ∪ S′) − ({Y} − S′) = SS − {Y}.
• Therefore, ∀ Y ∈ W, (X⊥⊥ Y | SS − {Y}).

However, at the end of the shrinking phase, all variables Y in SS (and therefore in W, as W ⊆ SS) have been evaluated for independence and found dependent (Observation 2). Thus, since W ≠ ∅, there exists at least one Y such that (X ⊥̸⊥ Y | SS − {Y}), producing a contradiction.
Theorem 5. Assuming that the Contraction, Decomposition, and Weak Union axioms hold, Algorithm 1 is m-correct with respect to X.
Proof. We use the Markov Boundary Theorem.
We first prove that

    ∀ T ⊆ U − {X} such that |T| ≤ m:  { T ⊆ U − SS =⇒ (X⊥⊥ T | SS − T) }

or, equivalently, ∀ T ⊆ U − SS − {X} such that |T| ≤ m, (X⊥⊥ T | SS). Since U − SS − {X} = S∆ ∪ (U − SG − {X}), and S∆ and U − SG − {X} are disjoint, there are three kinds of sets of size m or less to consider: (i) all sets T ⊆ S∆, (ii) all sets T ⊆ U − SG − {X}, and (iii) all sets (if any) T = T′ ∪ T′′, T′ ∩ T′′ = ∅, that have a non-empty part T′ ⊆ S∆ and a non-empty part T′′ ⊆ U − SG − {X}.

(i) From Corollary 3, (X⊥⊥ S∆ | SS). Therefore, from Decomposition, for any set T ⊆ S∆, (X⊥⊥ T | SS).
(ii) By Observation 1, for every set T ⊆ U − SG − {X} such that |T| ≤ m, (X⊥⊥ T | SG). By Lemma 4 we get (X⊥⊥ T ∪ S∆ | SS), from which we obtain (X⊥⊥ T | SS) by Decomposition.
(iii) Since |T| ≤ m, we have that |T′′| ≤ m. Since T′′ ⊆ U − SG − {X}, by Observation 1, (X⊥⊥ T′′ | SG). Therefore, by Lemma 4, (X⊥⊥ T′′ ∪ S∆ | SS). Since T′ ⊆ S∆ implies T′′ ∪ T′ ⊆ T′′ ∪ S∆, we apply Decomposition to obtain (X⊥⊥ T′′ ∪ T′ | SS) = (X⊥⊥ T | SS).

To complete the proof we need to prove that

    ∀ T ⊆ U − {X} such that |T| ≤ m:  { T ⊈ U − SS =⇒ (X ⊥̸⊥ T | SS − T) }.

Let T = T1 ∪ T2, with T1 ⊆ SS and T2 ⊆ U − SS. Since T ⊈ U − SS, T1 contains at least one variable Y ∈ SS. From Observation 2, (X ⊥̸⊥ Y | SS − {Y}).
From this and (the contrapositive of) Weak Union, we get (X ⊥̸⊥ {Y} ∪ (T1 − {Y}) | SS − {Y} − (T1 − {Y})) = (X ⊥̸⊥ T1 | SS − T1). From (the contrapositive of) Decomposition we get (X ⊥̸⊥ T1 ∪ T2 | SS − T1) = (X ⊥̸⊥ T | SS − T1), which is equal to (X ⊥̸⊥ T | SS − T1 − T2) = (X ⊥̸⊥ T | SS − T), as SS and T2 are disjoint.

References

[1] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
[2] Daphne Koller and Mehran Sahami. Toward optimal feature selection. In Proceedings of the Thirteenth International Conference on Machine Learning (ICML), pages 284–292, 1996.
[3] P. M. Narendra and K. Fukunaga. A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, C-26(9):917–922, 1977.
[4] H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the National Conference of the American Association for Artificial Intelligence (AAAI), 1991.
[5] K. Kira and L. A. Rendell. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the National Conference of the American Association for Artificial Intelligence (AAAI), pages 129–134, 1992.
[6] M. Dash and H. Liu. Feature selection for classification. Intelligent Data Analysis, 1(3):131–156, 1997.
[7] Huan Liu and Hiroshi Motoda, editors. Feature Extraction, Construction and Selection: A Data Mining Perspective, volume 453 of The Springer International Series in Engineering and Computer Science. Springer, 1998.
[8] Avrim Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2):245–271, 1997.
[9] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.
[10] Dimitris Margaritis and Sebastian Thrun. Bayesian network induction via local neighborhoods. In Advances in Neural Information Processing Systems 12 (NIPS), 2000.
[11] I. Tsamardinos, C. Aliferis, and A. Statnikov. Algorithms for large scale Markov blanket discovery. In Proceedings of the 16th International FLAIRS Conference, 2003.
[12] I. Tsamardinos, C. Aliferis, and A. Statnikov. Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 673–678, 2003.
[13] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436–1462, 2006.
[14] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[15] Michael Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
[16] A. Agresti. Categorical Data Analysis. John Wiley and Sons, 1990.
[17] M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983–1006, 1998.
[18] C. J. van Rijsbergen. Information Retrieval. Butterworth-Heinemann, London, 1979.