{"title": "Discrete R\u00e9nyi Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 3276, "page_last": 3284, "abstract": "Consider the binary classification problem of predicting a target variable Y from a discrete feature vector X = (X1,...,Xd). When the probability distribution P(X,Y) is known, the optimal classifier, leading to the minimum misclassification rate, is given by the Maximum A-posteriori Probability (MAP) decision rule. However, in practice, estimating the complete joint distribution P(X,Y) is computationally and statistically impossible for large values of d. Therefore, an alternative approach is to first estimate some low order marginals of the joint probability distribution P(X,Y) and then design the classifier based on the estimated low order marginals. This approach is also helpful when the complete training data instances are not available due to privacy concerns. In this work, we consider the problem of designing the optimum classifier based on some estimated low order marginals of (X,Y). We prove that for a given set of marginals, the minimum Hirschfeld-Gebelein-R\u00b4enyi (HGR) correlation principle introduced in [1] leads to a randomized classification rule which is shown to have a misclassification rate no larger than twice the misclassification rate of the optimal classifier. Then, we show that under a separability condition, the proposed algorithm is equivalent to a randomized linear regression approach which naturally results in a robust feature selection method selecting a subset of features having the maximum worst case HGR correlation with the target variable. Our theoretical upper-bound is similar to the recent Discrete Chebyshev Classifier (DCC) approach [2], while the proposed algorithm has significant computational advantages since it only requires solving a least square optimization problem. 
Finally, we numerically compare our proposed algorithm with the DCC classifier and show that the proposed algorithm results in better misclassification rate over various UCI data repository datasets.", "full_text": "Discrete R\u00b4enyi Classi\ufb01ers\n\nMeisam Razaviyayn\u2217\n\nmeisamr@stanford.edu\n\nFarzan Farnia\u2217\n\nfarnia@stanford.edu\n\nDavid Tse\u2217\n\ndntse@stanford.edu\n\nAbstract\n\nConsider the binary classi\ufb01cation problem of predicting a target variable Y from\na discrete feature vector X = (X1, . . . , Xd). When the probability distribution\nP(X, Y ) is known, the optimal classi\ufb01er, leading to the minimum misclassi\ufb01-\ncation rate, is given by the Maximum A-posteriori Probability (MAP) decision\nrule. However, in practice, estimating the complete joint distribution P(X, Y ) is\ncomputationally and statistically impossible for large values of d. Therefore, an\nalternative approach is to \ufb01rst estimate some low order marginals of the joint prob-\nability distribution P(X, Y ) and then design the classi\ufb01er based on the estimated\nlow order marginals. This approach is also helpful when the complete training\ndata instances are not available due to privacy concerns.\nIn this work, we consider the problem of \ufb01nding the optimum classi\ufb01er based on\nsome estimated low order marginals of (X, Y ). We prove that for a given set of\nmarginals, the minimum Hirschfeld-Gebelein-R\u00b4enyi (HGR) correlation principle\nintroduced in [1] leads to a randomized classi\ufb01cation rule which is shown to have\na misclassi\ufb01cation rate no larger than twice the misclassi\ufb01cation rate of the opti-\nmal classi\ufb01er. Then, under a separability condition, it is shown that the proposed\nalgorithm is equivalent to a randomized linear regression approach. 
In addition,\nthis method naturally results in a robust feature selection method selecting a sub-\nset of features having the maximum worst case HGR correlation with the target\nvariable. Our theoretical upper-bound is similar to the recent Discrete Chebyshev\nClassi\ufb01er (DCC) approach [2], while the proposed algorithm has signi\ufb01cant com-\nputational advantages since it only requires solving a least square optimization\nproblem. Finally, we numerically compare our proposed algorithm with the DCC\nclassi\ufb01er and show that the proposed algorithm results in better misclassi\ufb01cation\nrate over various UCI data repository datasets.\n\n1\n\nIntroduction\n\nStatistical classi\ufb01cation, a core task in many modern data processing and prediction problems, is\nthe problem of predicting labels for a given feature vector based on a set of training data instances\ncontaining feature vectors and their corresponding labels. From a probabilistic point of view, this\nproblem can be formulated as follows: given data samples (X1, Y 1), . . . , (Xn, Y n) from a proba-\nbility distribution P(X, Y ), predict the target label ytest for a given test point X = xtest.\nMany modern classi\ufb01cation problems are on high dimensional categorical features. For exam-\nple, in the genome-wide association studies (GWAS), the classi\ufb01cation task is to predict a trait\nof interest based on observations of the SNPs in the genome. In this problem, the feature vector\nX = (X1, . . . , Xd) is categorical with Xi \u2208 {0, 1, 2}.\nWhat is the optimal classi\ufb01er leading to the minimum misclassi\ufb01cation rate for such a classi\ufb01cation\nproblem with high dimensional categorical feature vectors? 
When the joint probability distribution of the random vector (X, Y) is known, the MAP decision rule, defined by

δ^MAP(x) ≜ argmax_y P(Y = y | X = x),

achieves the minimum misclassification rate. However, in practice the joint probability distribution P(X, Y) is not known. Moreover, estimating the complete joint probability distribution is not possible due to the curse of dimensionality. For example, in the above GWAS problem, the dimension of the feature vector X is d ≈ 3,000,000, which leads to an alphabet size of 3^3,000,000 for the feature vector X. Hence, a practical approach is to first estimate some low order marginals of P(X, Y), and then use these low order marginals to build a classifier with a low misclassification rate. This approach, which is the spirit of various machine learning and statistical methods [2–6], is also useful when the complete data instances are not available due to privacy concerns in applications such as medical informatics.

In this work, we consider the above problem of building a classifier for a given set of low order marginals. First, we formally state the problem of finding the robust classifier with the minimum worst case misclassification rate. Our goal is to find a (possibly randomized) decision rule which has the minimum worst case misclassification rate over all probability distributions satisfying the given low order marginals. Then a surrogate objective function, obtained from the minimum HGR correlation principle [1], is used to propose a randomized classification rule. The proposed classification method has a worst case misclassification rate no more than twice the misclassification rate of the optimal classifier.

* Department of Electrical Engineering, Stanford University, Stanford, CA 94305.
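For concreteness, with a known joint distribution the MAP rule is a one-line computation; a minimal sketch for a single ternary feature (the probability table below is illustrative, not from the paper):

```python
import numpy as np

# Toy joint distribution P(X = x, Y = y) for one ternary feature x in {0,1,2}
# and a binary label y in {0,1}; rows are x, columns are y (illustrative numbers).
P = np.array([[0.10, 0.25],
              [0.20, 0.05],
              [0.15, 0.25]])

def map_rule(x):
    # argmax_y P(Y = y | X = x); the conditional is proportional to the joint
    # for a fixed x, so comparing joint probabilities suffices.
    return int(np.argmax(P[x]))

print([map_rule(x) for x in range(3)])  # -> [1, 0, 1]
```

The difficulty addressed in the paper is precisely that for large d the table P is too large to estimate or store, so this direct computation is unavailable.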
When only pairwise marginals are estimated, it is shown that this classifier is indeed a randomized linear regression classifier on indicator variables under a separability condition. Then, we formulate a feature selection problem based on the knowledge of pairwise marginals which leads to the minimum misclassification rate. Our analysis provides a theoretical justification for using the group lasso objective function for feature selection over the discrete set of features. Finally, we conclude by presenting numerical experiments comparing the proposed classifier with the discrete Chebyshev classifier [2], Tree Augmented Naive Bayes [3], and the Minimax Probabilistic Machine [4]. In short, the contributions of this work are as follows.

• Providing a rigorous theoretical justification for using the minimum HGR correlation principle for the binary classification problem.

• Proposing a randomized classifier with misclassification rate no larger than twice the misclassification rate of the optimal classifier.

• Introducing a computationally efficient method for calculating the proposed randomized classifier when pairwise marginals are estimated and a separability condition is satisfied.

• Providing a mathematical justification, based on maximal correlation, for using the group lasso problem for feature selection in categorical data.

Related Work: The idea of learning structures in data through low order marginals/moments is popular in machine learning and statistics. For example, the maximum entropy principle [7], which is the spirit of the variational method in graphical models [5] and tree augmented naive Bayes [3], is based on the idea of fixing the marginal distributions and fitting a probabilistic model which maximizes the Shannon entropy.
Although these methods fit a probabilistic model satisfying the low order marginals, they do not directly optimize the misclassification rate of the resulting classifier. Another related information theoretic approach is the minimum mutual information principle [8], which finds the probability distribution with the minimum mutual information between the feature vector and the target variable. This approach is closely related to the framework of this paper; however, unlike the minimum HGR principle, there is no known computationally efficient approach for calculating the probability distribution with the minimum mutual information. In the continuous setting, the idea of minimizing the worst case misclassification rate leads to the minimax probability machine [4]. This algorithm and its analysis are not easily extendible to the discrete scenario. The most closely related algorithm to this work is the recent Discrete Chebyshev Classifier (DCC) algorithm [2]. The DCC is based on the minimization of the worst case misclassification rate over the class of probability distributions with the given marginals of the form (Xi, Xj, Y). Similar to our framework, the DCC method achieves a misclassification rate no larger than twice the misclassification rate of the optimum classifier. However, computation of the DCC classifier requires solving a non-separable non-smooth optimization problem which is computationally demanding, while the proposed algorithm results in a least squares optimization problem with a closed form solution. Furthermore, in contrast to [2], which only considers deterministic decision rules, in this work we consider the class of randomized decision rules.
Finally, it is worth noting that the algorithm in [2] requires the tree structure to be tight, while our proposed algorithm works on non-tree structures as long as the separability condition is satisfied.

2 Problem Formulation

Consider the binary classification problem with d discrete features X1, X2, ..., Xd ∈ X and a target variable Y ∈ Y ≜ {0, 1}. Without loss of generality, let us assume that X ≜ {1, 2, ..., m} and the data points (X, Y) are coming from an underlying probability distribution P̄_{X,Y}(x, y). If the joint probability distribution P̄(x, y) is known, the optimal classifier is given by the maximum a posteriori probability (MAP) estimator, i.e.,

ŷ^MAP(x) ≜ argmax_{y∈{0,1}} P̄(Y = y | X = x).

However, the joint probability distribution P̄(x, y) is often not known in practice. Therefore, in order to utilize the MAP rule, one should first estimate P̄(x, y) using the training data instances. Unfortunately, estimating the joint probability distribution requires estimating the value of P̄(X = x, Y = y) for all (x, y) ∈ X^d × Y, which is intractable for large values of d. Therefore, as mentioned earlier, our approach is to first estimate some low order marginals of the joint probability distribution P̄(·), and then utilize the minimax criterion for classification. Let C be the class of probability distributions satisfying the estimated marginals.
For example, when only the pairwise marginals of the ground-truth distribution P̄ are estimated, the set C is the class of distributions satisfying the given pairwise marginals, i.e.,

C_pairwise ≜ { P_{X,Y}(·,·) | P_{Xi,Xj}(xi, xj) = P̄_{Xi,Xj}(xi, xj), P_{Xi,Y}(xi, y) = P̄_{Xi,Y}(xi, y), ∀xi, xj ∈ X, ∀y ∈ Y, ∀i, j }.   (1)

In general, C could be any class of probability distributions satisfying a set of estimated low order marginals. Let us also define δ to be a randomized classification rule with

δ(x) = 0 with probability q_δ^x, and δ(x) = 1 with probability 1 − q_δ^x,

for some q_δ^x ∈ [0, 1], ∀x ∈ X^d. Given a randomized decision rule δ and a joint probability distribution P_{X,Y}(x, y), we can extend P(·) to include our randomized decision rule. Then the misclassification rate of the decision rule δ, under the probability distribution P(·), is given by P(δ(X) ≠ Y). Hence, under the minimax criterion, we are looking for a decision rule δ* which minimizes the worst case misclassification rate. In other words, the robust decision rule is given by

δ* ∈ argmin_{δ∈D} max_{P∈C} P(δ(X) ≠ Y),   (2)

where D is the set of all randomized decision rules. Notice that the optimal decision rule δ* may not be unique in general.

3 Worst Case Error Minimization

In this section, we propose a surrogate objective for (2) which leads to a decision rule with misclassification rate no larger than twice that of the optimal decision rule δ*. Later we show that the proposed surrogate objective is connected to the minimum HGR principle [1]. Let us start by rewriting (2) as an optimization problem over real valued variables.
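For a fixed randomized rule, the inner maximization in (2) is a linear program over the probability vector describing P. A minimal sketch for a toy instance (two binary features; SciPy assumed; the reference distribution supplying the pairwise marginals is randomly generated for illustration). For the coin-flip rule q_δ^x = 1/2 the error is 1/2 under every distribution, so the LP should return exactly 0.5:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

m, d = 2, 2                                   # toy sizes: binary alphabet, two features
xs = list(itertools.product(range(m), repeat=d))
idx = {(x, y): i for i, (x, y) in enumerate(itertools.product(xs, range(2)))}

rng = np.random.default_rng(0)
p_bar = rng.random(len(idx)); p_bar /= p_bar.sum()   # illustrative reference distribution

rows, rhs = [], []
def match_marginal(pred):
    """Add an equality constraint fixing one marginal to its value under p_bar."""
    row = np.array([1.0 if pred(x, y) else 0.0 for (x, y) in idx])
    rows.append(row); rhs.append(row @ p_bar)

for i, j in itertools.combinations(range(d), 2):      # P(X_i = a, X_j = b)
    for a, b in itertools.product(range(m), repeat=2):
        match_marginal(lambda x, y, i=i, j=j, a=a, b=b: x[i] == a and x[j] == b)
for i, a, yy in itertools.product(range(d), range(m), range(2)):  # P(X_i = a, Y = yy)
    match_marginal(lambda x, y, i=i, a=a, yy=yy: x[i] == a and y == yy)
rows.append(np.ones(len(idx))); rhs.append(1.0)       # total probability is one

# Misclassification rate of the rule: sum_x q_x p_{x,1} + (1 - q_x) p_{x,0};
# maximize it over C_pairwise by minimizing its negation.
q = {x: 0.5 for x in xs}                              # coin-flip rule (illustrative)
cost = np.array([q[x] if y == 1 else 1.0 - q[x] for (x, y) in idx])
res = linprog(-cost, A_eq=np.array(rows), b_eq=np.array(rhs), bounds=(0, 1))
print(f"worst-case error over C_pairwise: {-res.fun:.3f}")   # -> 0.500
```

This brute-force LP enumerates all m^d feature values, which is exactly what becomes intractable for large d and motivates the surrogate developed below.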
Notice that each probability distribution P_{X,Y}(·,·) can be represented by a probability vector p = [p_{x,y} | (x, y) ∈ X^d × Y] ∈ R^{2m^d} with p_{x,y} = P(X = x, Y = y) and Σ_{x,y} p_{x,y} = 1. Similarly, every randomized rule δ can be represented by a vector q_δ = [q_δ^x | x ∈ X^d] ∈ R^{m^d}. Adopting these notations, the set C can be rewritten in terms of the probability vector p as

C ≜ { p | Ap = b, 1^T p = 1, p ≥ 0 },

where the system of linear equations Ap = b represents all the low order marginal constraints, and the notation 1 denotes the vector of all ones. Therefore, problem (2) can be reformulated as

q_δ* ∈ argmin_{0≤q_δ≤1} max_{p∈C} Σ_x ( q_δ^x p_{x,1} + (1 − q_δ^x) p_{x,0} ),   (3)

where p_{x,0} and p_{x,1} denote the elements of the vector p corresponding to the probability values P(X = x, Y = 0) and P(X = x, Y = 1), respectively. A simple application of the minimax theorem [9] implies that the saddle point of the above optimization problem exists and, moreover, the optimal decision rule is a MAP rule for a certain probability distribution P* ∈ C. In other words, there exists a pair (δ*, P*) for which

P(δ*(X) ≠ Y) ≤ P*(δ*(X) ≠ Y), ∀P ∈ C,   and   P*(δ(X) ≠ Y) ≥ P*(δ*(X) ≠ Y), ∀δ ∈ D.

Although the above observation characterizes the optimal decision rule to some extent, it does not provide a computationally efficient approach for finding the optimal decision rule. Notice that it is NP-hard to verify the existence of a probability distribution satisfying a given set of low order marginals [10].
Based on this observation and the result in [11], we conjecture that, in general, solving (2) is NP-hard in the number of variables and the alphabet size, even when the set C is non-empty. Hence, here we focus on developing a framework to find an approximate solution of (2). Let us continue by utilizing the minimax theorem [9] and obtaining the worst case probability distribution in (3) by

p* ∈ argmax_{p∈C} min_{0≤q_δ≤1} Σ_x ( q_δ^x p_{x,1} + (1 − q_δ^x) p_{x,0} ),

or equivalently,

p* ∈ argmax_{p∈C} Σ_x min{ p_{x,0}, p_{x,1} }.   (4)

Despite the convexity of the above problem, there are two sources of hardness which make the problem intractable for moderate and large values of d. Firstly, the objective function is non-smooth. Secondly, the number of optimization variables is 2m^d and grows exponentially with the alphabet size. To deal with the first issue, notice that the function inside the summation is the max-min fairness objective between the two quantities p_{x,1} and p_{x,0}. Replacing this objective with the harmonic average leads to the following smooth convex optimization problem:

p̃ ∈ argmax_{p∈C} Σ_x  p_{x,1} p_{x,0} / (p_{x,1} + p_{x,0}).   (5)

It is worth noting that the harmonic mean of the two quantities is intuitively a reasonable surrogate for the original objective function since

p_{x,1} p_{x,0} / (p_{x,1} + p_{x,0})  ≤  min{ p_{x,0}, p_{x,1} }  ≤  2 p_{x,1} p_{x,0} / (p_{x,1} + p_{x,0}).   (6)

Although this inequality suggests that the objective functions in (5) and (4) are close to each other, it is not clear whether the distribution p̃ leads to any classification rule having a low misclassification rate for all distributions in C. In order to obtain a classification rule from p̃, the first naive approach is to use the MAP decision rule based on p̃.
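The sandwich inequality (6) is straightforward to check numerically; a minimal sketch, with random positive pairs standing in for (p_{x,0}, p_{x,1}):

```python
import numpy as np

rng = np.random.default_rng(1)
p0 = rng.random(1000) + 1e-6      # stand-ins for p_{x,0} (kept strictly positive)
p1 = rng.random(1000) + 1e-6      # stand-ins for p_{x,1}

half_harmonic = p0 * p1 / (p0 + p1)   # the surrogate term used in (5)
smaller = np.minimum(p0, p1)          # the original term in (4)

# Inequality (6): p0*p1/(p0+p1) <= min(p0, p1) <= 2*p0*p1/(p0+p1)
assert np.all(half_harmonic <= smaller + 1e-12)
assert np.all(smaller <= 2 * half_harmonic + 1e-12)
print("inequality (6) verified on 1000 random pairs")
```

The factor-of-two slack between the two sides is exactly what produces the factor-two approximation guarantees in the theorems that follow.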
However, the following result shows that this decision rule does not achieve the factor-two misclassification rate obtained in [2].

Theorem 1 Let us define δ̃^map(x) ≜ argmax_{y∈Y} p̃_{x,y}, with the worst case error probability ẽ^map ≜ max_{P∈C} P(δ̃^map(X) ≠ Y). Then, e* ≤ ẽ^map ≤ 4e*, where e* is the worst case misclassification rate of the optimal decision rule δ*, that is, e* ≜ max_{P∈C} P(δ*(X) ≠ Y).

Proof The proof is similar to the proof of the next theorem and hence omitted here.

Next we show that, surprisingly, one can obtain a randomized decision rule based on the solution of (5) which has a misclassification rate no larger than twice that of the optimal decision rule δ*. Given p̃ as the optimal solution of (5), define the random decision rule δ̃ as

δ̃(x) = 0 with probability p̃²_{x,0} / (p̃²_{x,0} + p̃²_{x,1}), and δ̃(x) = 1 with probability p̃²_{x,1} / (p̃²_{x,0} + p̃²_{x,1}).   (7)

Let ẽ be the worst case classification error of the decision rule δ̃, i.e., ẽ ≜ max_{P∈C} P(δ̃(X) ≠ Y). Clearly, e* ≤ ẽ according to the definition of the optimal decision rule δ*. The following theorem shows that ẽ is also upper-bounded by twice the optimal misclassification rate e*.

Theorem 2 Define

θ ≜ max_{p∈C} Σ_x  p_{x,1} p_{x,0} / (p_{x,1} + p_{x,0}).   (8)

Then, θ ≤ ẽ ≤ 2θ ≤ 2e*. In other words, the worst case misclassification rate of the decision rule δ̃ is at most twice that of the optimal decision rule δ*.

Proof The proof is relegated to the supplementary materials.

So far, we have resolved the non-smoothness issue in solving (4) by using a surrogate objective function. In the next section, we resolve the second issue by establishing the connection between problem (5) and the minimum HGR correlation principle [1]. Then, we use the existing result in [1] to develop a computationally efficient approach for calculating the decision rule δ̃(·) for C_pairwise.

4 Connection to Hirschfeld-Gebelein-Rényi Correlation

A commonplace approach to infer models from data is to employ the maximum entropy principle [7]. This principle states that, given a set of constraints on the ground-truth distribution, the distribution with the maximum (Shannon) entropy under those constraints is a proper representer of the class. To extend this rule to the classification problem, the authors in [8] suggest to pick the distribution maximizing the target entropy conditioned on the features, or equivalently minimizing the mutual information between the target and the features. Unfortunately, this approach does not lead to a computationally efficient approach for model fitting, and there is no guarantee on the misclassification rate of the resulting classifier. Here we study an alternative approach, the minimum HGR correlation principle [1]. This principle suggests to pick the distribution in C minimizing the HGR correlation between the target variable and the features.
The HGR correlation coefficient between two random objects X and Y, which was first introduced by Hirschfeld and Gebelein [12, 13] and then studied by Rényi [14], is defined as

ρ(X, Y) ≜ sup_{f,g} E[ f(X) g(Y) ],

where the supremum is taken over the class of all measurable functions f(·) and g(·) with E[f(X)] = E[g(Y)] = 0 and E[f²(X)] = E[g²(Y)] = 1. The HGR correlation coefficient has many desirable properties. For example, it is normalized to be between 0 and 1. Furthermore, this coefficient is zero if and only if the two random variables are independent; and it is one if there is a strict dependence between X and Y. For other properties of the HGR correlation coefficient see [14, 15] and the references therein.

Lemma 1 Assume the random variable Y is binary and define q ≜ P(Y = 0). Then,

ρ(X, Y) = √( 1 − (1 / (q(1 − q))) Σ_x [ P_{X,Y}(x, 0) P_{X,Y}(x, 1) / (P_{X,Y}(x, 0) + P_{X,Y}(x, 1)) ] ).

Proof The proof is relegated to the supplementary material.

This lemma leads to the following observation.

Observation: Assume the marginal distributions P(Y = 0) and P(Y = 1) are fixed for any distribution P ∈ C. Then, the distribution in C with the minimum HGR correlation between X and Y is the distribution P̃ obtained by solving (5). In other words, ρ(X, Y; P̃) ≤ ρ(X, Y; P), ∀P ∈ C, where ρ(X, Y; P) denotes the HGR correlation coefficient under the probability distribution P.

Based on the above observation, from now on, we call the classifier δ̃(·) in (7) the "Rényi classifier". In the next section, we use the result of the recent work [1] to compute the Rényi classifier δ̃(·) for a special class of marginals C = C_pairwise.

5 Computing the Rényi Classifier Based on Pairwise Marginals

In many practical problems, the number of features d is large and therefore it is only computationally tractable to estimate marginals of order at most two. Hence, hereafter, we restrict ourselves to the case where only the first and second order marginals of the distribution P̄ are estimated, i.e., C = C_pairwise. In this scenario, in order to predict the output of the Rényi classifier for a given data point x, one needs to find the values of p̃_{x,0} and p̃_{x,1}. Next, we state a result from [1] which sheds light on the computation of p̃_{x,0} and p̃_{x,1}. To state the theorem, we need the following definitions: Let the matrix Q ∈ R^{dm×dm} and the vector d ∈ R^{dm×1} be defined through their entries as

Q_{mi+k, mj+ℓ} = P̄(X_{i+1} = k, X_{j+1} = ℓ),   d_{mi+k} = P̄(X_{i+1} = k, Y = 1) − P̄(X_{i+1} = k, Y = 0),

for every i, j = 0, ..., d − 1 and k, ℓ = 1, ..., m. Also define the function h(z): R^{md×1} → R as h(z) ≜ Σ_{i=1}^d max{ z_{mi−m+1}, z_{mi−m+2}, ..., z_{mi} }. Then, we have the following theorem.

Theorem 3 (Rephrased from [1]) Assume C_pairwise ≠ ∅. Let

γ ≜ min_{z∈R^{md×1}}  z^T Q z − d^T z + 1/4.   (9)

Then, √( 1 − γ / (q(1 − q)) ) ≤ min_{P∈C_pairwise} ρ(X, Y; P), where the inequality holds with equality if and only if there exists a solution z* to (9) such that h(z*) ≤ 1/2 and h(−z*) ≤ 1/2; or equivalently, if and only if the following separability condition is satisfied for some P ∈ C_pairwise:

E_P[Y | X = x] = Σ_{i=1}^d ζ_i(x_i), ∀x ∈ X^d,   (10)

for some functions ζ_1, ..., ζ_d. Moreover, if the separability condition holds with equality, then

P̃(Y = y | X = (x_1, ..., x_d)) = 1/2 − (−1)^y Σ_{i=1}^d z*_{(i−1)m + x_i}.   (11)

Combining the above theorem with the equality

P̃²(Y = 0, X = x) / ( P̃²(Y = 0, X = x) + P̃²(Y = 1, X = x) ) = P̃²(Y = 0 | X = x) / ( P̃²(Y = 0 | X = x) + P̃²(Y = 1 | X = x) )

implies that the decision rules δ̃ and δ̃^map can be computed in a computationally efficient manner under the separability condition. Notice that when the separability condition is not satisfied, the approach proposed in this section still provides a classification rule whose error rate is bounded by 2γ; however, this error rate no longer provides a 2-factor approximation gap. It is also worth mentioning that the separability condition is a property of the class of distributions C_pairwise and is independent of the classifier at hand. Moreover, this condition is satisfied with a positive measure over the simplex of all probability distributions, as discussed in [1]. Two remarks are in order:

Inexact knowledge of the marginal distribution: The optimization problem (9) is equivalent to solving the stochastic optimization problem

z* = argmin_z E[ ( W^T z − C )² ],

where W ∈ {0,1}^{md×1} is a random vector with W_{m(i−1)+k} = 1 if X_i = k and W_{m(i−1)+k} = 0 otherwise. Also define the random variable C ∈ {−1/2, 1/2} with C = 1/2 if the random variable Y = 1 and C = −1/2 otherwise. Here the expectation could be calculated with respect to any distribution in C.
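The stochastic least-squares formulation above suggests a simple empirical implementation: one-hot encode the features, regress C = Y − 1/2 on the indicator variables (with a ridge term), and plug the fitted additive model into (11) and the randomized rule (7). A minimal end-to-end sketch on synthetic data (all data and parameter choices are illustrative; the label depends only on X_1, so E[Y|X] is additive and the separability condition (10) holds for this toy distribution):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 500, 5, 3                      # samples, features, alphabet size (toy)

# Synthetic categorical data; E[Y|X] = 1{X_1 = 0} is additive in the features.
X = rng.integers(0, m, size=(n, d))
Y = (X[:, 0] == 0).astype(int)

# Indicator vector W in {0,1}^{md} per sample; regression target C = Y - 1/2.
W = np.zeros((n, m * d))
W[np.arange(n)[:, None], np.arange(d) * m + X] = 1.0
C = Y - 0.5

# Ridge-regularized sample average approximation of (9): closed-form solution.
lam = 1e-3
z = np.linalg.solve(W.T @ W / n + lam * np.eye(m * d), W.T @ C / n)

# By (11), P(Y = 1 | X = x) = 1/2 + sum_i z_{(i-1)m + x_i}; then apply rule (7),
# which outputs label 1 with probability p1^2 / (p0^2 + p1^2).
p1 = np.clip(0.5 + W @ z, 1e-9, 1 - 1e-9)
p0 = 1.0 - p1
prob1 = p1**2 / (p0**2 + p1**2)
y_hat = (rng.random(n) < prob1).astype(int)
print(f"training error of the randomized Renyi rule: {np.mean(y_hat != Y):.3f}")
```

Since the toy label admits an exact additive representation in the indicators, the ridge fit drives the rule's probabilities close to 0 or 1 and the training error is near zero; on real data the fit, and hence the randomization, is less extreme.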
Hence, in practice, the above optimization problem can be estimated\nusing Sample Average Approximation (SAA) method [16, 17] through the optimization problem\n\n2} with C = 1\n\n2 , 1\n\n(cid:98)z = argmin\n\nz\n\n1\nn\n\n(cid:0)(wi)T z \u2212 ci(cid:1)2\n\n,\n\nn(cid:88)\n\ni=1\n\n6\n\n\fwhere (wi, ci) corresponds to the i-th training data point (xi, yi). Clearly, this is a least square\nproblem with a closed form solution. Notice that in order to bound the SAA error and avoid over\ufb01t-\n\nting, one could restrict the search space for(cid:98)z [18]. This could also be done using regularizers such\n\nas ridge regression by solving\n\n(cid:98)z ridge = argmin\n\nz\n\n1\nn\n\nn(cid:88)\n(cid:0)(wi)T z \u2212 ci(cid:1)2\n(cid:110)(cid:101)Xij = (Xi, Xj) | i (cid:54)= j\n\ni=1\n\n(cid:111)\n\n+ \u03bbridge(cid:107)z(cid:107)2\n2.\n\nBeyond pairwise marginals: When d is small, one might be interested in estimating higher order\nmarginals for predicting Y . In this scenario, a simple modi\ufb01cation for the algorithm is to de\ufb01ne\nthe new set of feature random variables\n; and apply the algorithm to the\nnew set of feature variables. It is not hard to see that this approach utilizes the marginal information\nP(Xi, Xj, Xk, X(cid:96)) and P(Xi, Xj, Y ).\n\n6 Robust R\u00b4enyi Feature Selection\n\nThe task of feature selection for classi\ufb01cation purposes is to preselect a subset of features for use in\nmodel \ufb01tting in prediction. Shannon mutual information, which is a measure of dependence between\ntwo random variables, is used in many recent works as an objective for feature selection [19, 20].\nIn these works, the idea is to select a small subset of features with maximum dependence with the\ntarget variable Y . In other words, the task is to \ufb01nd a subset of variables S \u2286 {1, . . . 
, d\}$ with $|S| \le k$ based on the following optimization problem:
$$S^{\mathrm{MI}} \triangleq \operatorname*{argmax}_{S \subseteq \{1,\ldots,d\},\ |S| \le k} \; I(X_S; Y), \qquad (12)$$
where $X_S \triangleq (X_i)_{i \in S}$ and $I(X_S; Y)$ denotes the mutual information between the random variables $X_S$ and $Y$. Almost all of the existing approaches for solving (12) are heuristic and greedy in nature, aiming to find a sub-optimal solution of (12). Here, we suggest replacing the mutual information with the maximal correlation. Furthermore, since estimating the joint distribution of $X$ and $Y$ is computationally and statistically impossible for a large number of features $d$, we suggest estimating some low order marginals of the ground-truth distribution $\bar{\mathbb{P}}(X, Y)$ and then solving the following robust Rényi feature selection problem:
$$S^{\mathrm{RFS}} \triangleq \operatorname*{argmax}_{S \subseteq \{1,\ldots,d\},\ |S| \le k} \; \min_{\mathbb{P} \in \mathcal{C}} \, \rho(X_S, Y; \mathbb{P}). \qquad (13)$$
When only pairwise marginals are estimated from the training data, i.e., $\mathcal{C} = \mathcal{C}_{\mathrm{pairwise}}$, maximizing the lower-bound $\sqrt{1 - \frac{\gamma}{q(1-q)}}$ instead of (13) leads to the following optimization problem:
$$\widehat{S}^{\mathrm{RFS}} \triangleq \operatorname*{argmax}_{|S| \le k} \; \sqrt{1 - \frac{1}{q(1-q)} \left( \min_{z \in \mathcal{Z}_S} \, z^T Q z - d^T z + \frac{1}{4} \right)},$$
or, equivalently,
$$\widehat{S}^{\mathrm{RFS}} \triangleq \operatorname*{argmin}_{|S| \le k} \; \min_{z \in \mathcal{Z}_S} \; z^T Q z - d^T z,$$
where $\mathcal{Z}_S \triangleq \big\{ z \in \mathbb{R}^{md} \,\big|\, \sum_{k=1}^{m} |z_{mi-m+k}| = 0, \ \forall i \notin S \big\}$. This problem is of combinatorial nature; however, using the standard group Lasso regularizer leads to the feature selection procedure in Algorithm 1.

Algorithm 1 Robust Rényi Feature Selection
1. Choose a regularization parameter $\lambda > 0$ and define $h(z) \triangleq \sum_{i=1}^{d} \max\{z_{mi-m+1}, \ldots, z_{mi}\}$.
2. Let $\widehat{z}^{\mathrm{RFS}} \in \operatorname*{argmin}_{z} \; z^T Q z - d^T z + \lambda h(|z|)$.
3. Set $S = \big\{ i \,\big|\, \sum_{k=1}^{m} |\widehat{z}^{\mathrm{RFS}}_{mi-m+k}| > 0 \big\}$.

Notice that, when the pairwise marginals are estimated from a set of training data points, the above feature selection procedure is equivalent to applying the group Lasso regularizer to the standard linear regression problem over the domain of indicator variables. Our framework provides a justification for this approach based on the robust maximal correlation feature selection problem (13).

Remark 1 Another natural approach to defining the feature selection procedure is to select a subset of features $S$ minimizing the worst case classification error, i.e., solving the following optimization problem:
$$\min_{|S| \le k} \; \min_{\delta \in \mathcal{D}_S} \; \max_{\mathbb{P} \in \mathcal{C}} \; \mathbb{P}(\delta(X) \neq Y), \qquad (14)$$
where $\mathcal{D}_S$ is the set of randomized decision rules which only use the feature variables in $S$. Define $F(S) \triangleq \min_{\delta \in \mathcal{D}_S} \max_{\mathbb{P} \in \mathcal{C}} \mathbb{P}(\delta(X) \neq Y)$. It can be shown that $F(S) \le \min_{z \in \mathcal{Z}_S} z^T Q z - d^T z + \frac{1}{4}$, and hence $\min_{|S| \le k} F(S) \le \min_{|S| \le k} \min_{z \in \mathcal{Z}_S} z^T Q z - d^T z + \frac{1}{4}$. Therefore, another justification for Algorithm 1 is that it minimizes an upper-bound of (14) instead of (14) itself.

Remark 2 The Alternating Direction Method of Multipliers (ADMM) algorithm [21] can be used for solving the optimization problem in Algorithm 1; see the supplementary material for more details.

7 Numerical Results

We evaluated the performance of the Rényi classifiers $\widetilde{\delta}$ and $\widetilde{\delta}^{\mathrm{map}}$ on five different binary classification datasets from the UCI machine learning data repository. The results are compared with five different benchmarks used in [2]: the Discrete Chebyshev Classifier [2], greedy DCC [2], Tree Augmented Naive Bayes [3], the Minimax Probabilistic Machine [4], and support vector machines (SVM).

In addition to the classifiers $\widetilde{\delta}$ and $\widetilde{\delta}^{\mathrm{map}}$, which only use pairwise marginals, we also use higher order marginals in $\widetilde{\delta}_2$ and $\widetilde{\delta}^{\mathrm{map}}_2$. These classifiers are obtained by defining the new feature variables $\{\widetilde{X}_{ij} = (X_i, X_j)\}$ as discussed in Section 5. Since in this scenario the number of features is large, we combine our Rényi classifier with the proposed group Lasso feature selection. In other words, we first select a subset of $\{\widetilde{X}_{ij}\}$ and then find the maximum correlation classifier for the selected features. The values of $\lambda_{\mathrm{ridge}}$ and $\lambda$ are determined through cross validation. The results are averaged over 100 Monte Carlo runs, each using 70% of the data for training and the rest for testing. The results are summarized in the table below, where each number shows the percentage error of the corresponding method. The boldface numbers denote the best performance on each dataset.

As can be seen in this table, in four of the tested datasets at least one of the proposed methods outperforms the other benchmarks. Furthermore, it can be seen that the classifier $\widetilde{\delta}^{\mathrm{map}}$ on average performs better than $\widetilde{\delta}$.
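The group Lasso feature selection used above for the $\mathrm{FS}$ classifiers (Algorithm 1) reduces to a regularized least-squares problem once $Q$ and $d$ are formed from the estimated pairwise marginals. The following sketch is our own minimal illustration, not the paper's implementation: it solves Algorithm 1 by proximal gradient with the block-wise $\ell_\infty$ proximal operator matching $h(|z|)$, whereas Remark 2 suggests ADMM. The function names are ours, and a symmetric positive semidefinite $Q$ and vector $d$ are assumed to be given.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto the l1-ball of the given radius."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                  # sorted magnitudes, descending
    cumsum = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > cumsum - radius)[0][-1]
    theta = (cumsum[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def prox_group_linf(z, lam, m):
    """Proximal operator of lam * sum_i max_k |z_{mi-m+k}| (i.e., lam * h(|z|)),
    applied block by block via the Moreau decomposition:
    prox_{lam*||.||_inf}(v) = v - projection of v onto the l1-ball of radius lam."""
    out = z.reshape(-1, m).copy()                 # one row per feature block
    for i in range(out.shape[0]):
        out[i] = out[i] - project_l1_ball(out[i], lam)
    return out.ravel()

def renyi_feature_selection(Q, d_vec, m, lam, n_iters=2000):
    """Proximal-gradient sketch of Algorithm 1:
    minimize z^T Q z - d^T z + lam * h(|z|), then keep features whose block is nonzero.
    Q is assumed symmetric positive semidefinite."""
    md = Q.shape[0]
    # Step size 1/L, where L = 2*lambda_max(Q) is the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * np.linalg.eigvalsh(Q)[-1] + 1e-12)
    z = np.zeros(md)
    for _ in range(n_iters):
        grad = 2.0 * Q @ z - d_vec
        z = prox_group_linf(z - step * grad, step * lam, m)
    block_mass = np.abs(z.reshape(-1, m)).sum(axis=1)
    selected = np.flatnonzero(block_mass > 1e-8)  # features with a nonzero block
    return selected, z
```

Each consecutive block of $m$ coordinates of $z$ corresponds to the indicator variables of one feature, and a feature is selected exactly when its block is nonzero at the solution; the tolerance for declaring a block nonzero is an implementation detail of this sketch.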
The advantage of $\widetilde{\delta}^{\mathrm{map}}$ over $\widetilde{\delta}$ could be due to the specific properties of the underlying probability distribution in each dataset.

| Dataset | $\widetilde{\delta}^{\mathrm{map}}$ | $\widetilde{\delta}$ | $\widetilde{\delta}^{\mathrm{map}}_2$ | $\widetilde{\delta}_2$ | $\widetilde{\delta}^{\mathrm{map}}_{\mathrm{FS},2}$ | $\widetilde{\delta}_{\mathrm{FS},2}$ | DCC | gDCC | MPM | TAN | SVM |
|-----------|----|----|--------|----|--------|----|----|--------|--------|----|-------|
| adult     | 17 | 21 | **16** | 20 | **16** | 20 | 18 | 18     | 22     | 18 | 22    |
| credit    | **13** | 16 | 16 | 17 | 16 | 17 | 14 | **13** | **13** | 17 | 16    |
| kr-vs-kp  | 5  | 10 | 5      | 14 | 5      | 14 | 10 | 10     | 5      | 7  | **3** |
| promoters | 6  | 16 | **3**  | 4  | **3**  | 4  | 5  | **3**  | 6      | 44 | 9     |
| votes     | 3  | 4  | 3      | 4  | **2**  | 4  | 3  | 3      | 4      | 8  | 5     |

In order to evaluate the computational efficiency of the Rényi classifier, we compare its running time with that of SVM over a synthetic dataset with $d = 10{,}000$ features and $n = 200$ data points. Each feature $X_i$ is generated by an i.i.d. Bernoulli distribution with $\mathbb{P}(X_i = 1) = 0.7$. The target variable $y$ is generated by $y = \mathrm{sign}(\alpha^T X + n)$ with noise $n \sim \mathcal{N}(0, 1)$; the vector $\alpha \in \mathbb{R}^d$ is generated with 30% nonzero elements, each drawn from the standard Gaussian distribution $\mathcal{N}(0, 1)$. The results are averaged over 1000 Monte Carlo runs of generating the dataset, each using 85% of the data points for training and 15% for testing. The Rényi classifier is obtained by the gradient descent method with regularizer $\lambda_{\mathrm{ridge}} = 10^4$. The numerical experiment shows a 19.7% average misclassification rate for SVM and 19.9% for the Rényi classifier. However, the average training time of the Rényi classifier is 0.2 seconds, while the training time of SVM (with the Matlab SVM command) is 1.25 seconds.

Acknowledgments: The authors are grateful to Stanford University for supporting a Stanford Graduate Fellowship, and to the Center for Science of Information (CSoI), an NSF Science and Technology Center under grant agreement CCF-0939370, for the support during this research.

References

[1] F. Farnia, M. Razaviyayn, S. Kannan, and D. Tse.
Minimum HGR correlation principle: From marginals to joint distribution. arXiv preprint arXiv:1504.06010, 2015.

[2] E. Eban, E. Mezuman, and A. Globerson. Discrete Chebyshev classifiers. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1233–1241, 2014.

[3] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29(2-3):131–163, 1997.

[4] G. R. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan. A robust minimax approach to classification. The Journal of Machine Learning Research, 3:555–582, 2003.

[5] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

[6] T. Roughgarden and M. Kearns. Marginals-to-models reducibility. In Advances in Neural Information Processing Systems, pages 1043–1051, 2013.

[7] E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.

[8] A. Globerson and N. Tishby. The minimum information principle for discriminative learning. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 193–200. AUAI Press, 2004.

[9] M. Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958.

[10] J. De Loera and S. Onn. The complexity of three-way statistical tables. SIAM Journal on Computing, 33(4):819–836, 2004.

[11] D. Bertsimas and J. Sethuraman. Moment problems and semidefinite optimization. In Handbook of Semidefinite Programming, pages 469–509. Springer, 2000.

[12] H. O. Hirschfeld. A connection between correlation and contingency. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 31, pages 520–524. Cambridge Univ. Press, 1935.

[13] H. Gebelein. Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung. ZAMM-Journal of Applied Mathematics and Mechanics/Zeitschrift für Angewandte Mathematik und Mechanik, 21(6):364–379, 1941.

[14] A. Rényi. On measures of dependence. Acta Mathematica Hungarica, 10(3):441–451, 1959.

[15] V. Anantharam, A. Gohari, S. Kamath, and C. Nair. On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover. arXiv preprint arXiv:1304.6133, 2013.

[16] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory, volume 16. SIAM, 2014.

[17] A. Shapiro. Monte Carlo sampling methods. Handbooks in Operations Research and Management Science, 10:353–425, 2003.

[18] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pages 793–800, 2009.

[19] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238, 2005.

[20] R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4):537–550, 1994.

[21] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.