{"title": "Gradient-based kernel method for feature extraction and variable selection", "book": "Advances in Neural Information Processing Systems", "page_first": 2114, "page_last": 2122, "abstract": "We propose a novel kernel approach to dimension reduction for supervised learning: feature extraction and variable selection; the former constructs a small number of features from predictors, and the latter finds a subset of predictors. First, a method of linear feature extraction is proposed using the gradient of regression function, based on the recent development of the kernel method. In comparison with other existing methods, the proposed one has wide applicability without strong assumptions on the regressor or type of variables, and uses computationally simple eigendecomposition, thus applicable to large data sets. Second, in combination of a sparse penalty, the method is extended to variable selection, following the approach by Chen et al. (2010). Experimental results show that the proposed methods successfully find effective features and variables without parametric models.", "full_text": "Gradient-based kernel method for feature extraction\n\nand variable selection\n\nKenji Fukumizu\n\nChenlei Leng\n\nThe Institute of Statistical Mathematics\n\nNational University of Singapore\n\n10-3 Midori-cho, Tachikawa, Tokyo 190-8562 Japan\n\n6 Science Drive 2, Singapore, 117546\n\nfukumizu@ism.ac.jp\n\nstalc@nus.edu.sg\n\nAbstract\n\nWe propose a novel kernel approach to dimension reduction for supervised learn-\ning: feature extraction and variable selection; the former constructs a small num-\nber of features from predictors, and the latter \ufb01nds a subset of predictors. 
First, a method of linear feature extraction is proposed using the gradient of the regression function, based on recent developments in the kernel method. In comparison with other existing methods, the proposed one has wide applicability without strong assumptions on the regressor or the type of variables, and uses a computationally simple eigendecomposition, so it is applicable to large data sets. Second, in combination with a sparsity penalty, the method is extended to variable selection, following the approach of Chen et al. [2]. Experimental results show that the proposed methods successfully find effective features and variables without parametric models.

1 Introduction

Dimension reduction is involved in most modern data analysis, in which high-dimensional data must be handled. There are two categories of dimension reduction: feature extraction, in which a linear or nonlinear mapping to a low-dimensional space is pursued, and variable selection, in which a subset of variables is selected. This paper discusses both methods in the supervised learning setting.

Let (X, Y) be a random vector such that X = (X^1, . . . , X^m) ∈ R^m. The domain of Y can be arbitrary: continuous, discrete, or structured. The goal of dimension reduction in the supervised setting is to find features or a subset of variables of X that explain Y as effectively as possible. This paper focuses on linear dimension reduction, in which linear combinations of the components of X are used to make effective features. 
Although there are many methods for extracting nonlinear features, this paper confines its attention to linear features: linear methods are more stable than nonlinear feature extraction, which depends strongly on the choice of the nonlinearity, and once a linear method is established, the extension to a nonlinear one is not difficult.

We first develop a method for linear feature extraction with kernels, and extend it to variable selection with a sparsity penalty. The most significant point of the proposed methods is that we do not assume any parametric model for the conditional probability, or make strong assumptions on the distribution of the variables. This differs from many other methods, particularly for variable selection, where a specific parametric model is often assumed. Beyond classical approaches to linear dimension reduction such as Fisher Discriminant Analysis and Canonical Correlation Analysis, the modern approach is based on the notion of conditional independence: we assume for the distribution

    Y ⊥⊥ X | B^T X,   or equivalently   p(Y|X) = p̃(Y|B^T X),    (1)

where B is a projection matrix (B^T B = I_d) onto a d-dimensional subspace (d < m) of R^m, and we wish to estimate B. For variable selection, we further assume that some rows of B may be zero. The subspace spanned by the columns of B is called the effective direction for regression, or EDR space [14]. Our goal is thus to estimate B without a specific parametric model for p(y|x).

First, consider linear feature extraction based on Eq. (1). The first method using this formulation is sliced inverse regression (SIR, [13]), which employs the fact that the inverse regression E[X|Y] lies in the EDR space under some assumptions. Many methods have been proposed in this vein of inverse regression ([4, 12] among others). 
While these methods are computationally simple, they often need strong assumptions on the distribution of X, such as elliptic symmetry.

Two lines of work are most relevant to this paper. The first is dimension reduction with the gradient of the regressor E[Y|X = x] [11, 17]. As explained in Sec. 2.1, under Eq. (1) the gradient is contained in the EDR space, so the space can be estimated by standard nonparametric methods. This approach has limitations, however: nonparametric gradient estimation in high-dimensional spaces is challenging, and the method may not work unless the noise is additive. The second is kernel dimension reduction (KDR, [8, 9, 28]), which uses the kernel method to characterize conditional independence and thereby overcomes various limitations of existing methods. While KDR applies to a wide class of problems without any strong assumptions on the distributions or types of X and Y, and shows high estimation accuracy for small data sets, its optimization is problematic: the gradient descent used for KDR may have local optima, and needs many matrix inversions, which prohibits application to high-dimensional or large data.

We propose a kernel method for linear feature extraction using the gradient-based approach, but unlike the existing ones [11, 17], the gradient is estimated based on the recent development of the kernel method [9, 19]. This solves the problems of the existing methods: by virtue of the kernel method, Y can be of arbitrary type, and the kernel estimator is stable without a careful decrease of bandwidth. It also solves the problem of KDR: the estimator, given by an eigenproblem, needs no numerical optimization. The method is thus applicable to large and high-dimensional data, as we demonstrate experimentally.

Second, by using the above feature extraction in conjunction with a sparsity penalty, we propose a novel method for variable selection. 
Recently, extensive studies have been conducted on variable selection with sparsity penalties such as the LASSO [23] and SCAD [6]. It is also known that, with an appropriate choice of regularization coefficients, these methods have the oracle property [6, 25, 30]. They assume, however, a specific regression model such as linear regression, which is a limitation. Chen et al. [2] proposed a novel method for sparse variable selection based on the objective function of a linear feature extraction method formulated as an eigenproblem, such as SIR. We follow this approach to derive our method for variable selection. Unlike the methods used in [2], the proposed one does not require strong assumptions on the regressor or distribution, and thus provides a variable selection method based on conditional independence irrespective of the regression model.

2 Gradient-based kernel dimension reduction

2.1 Gradient of a regression function and dimension reduction

We review the basic idea of the gradient-based method [11, 17] for dimension reduction. Suppose Y is an R-valued random variable. If the assumption of Eq. (1) holds, we have

    (∂/∂x) E[Y|X = x] = (∂/∂x) ∫ y p(y|x) dy = ∫ y (∂/∂x) p̃(y|B^T x) dy = B ∫ y (∂/∂z) p̃(y|z)|_{z=B^T x} dy,

which implies that the gradient (∂/∂x) E[Y|X = x] at any x is contained in the EDR space. Based on this fact, the average derivative estimates method (ADE, [17]) was proposed to estimate B. In the more recent method [11], standard local linear least squares with a smoothing kernel (not necessarily positive definite, [5]) is used for estimating the gradient, and the dimensionality of the projection is reduced continuously to the desired one over the iterations. Since gradient estimation for high-dimensional data is difficult in general, the iterative reduction is expected to give more accurate estimation. 
We call the method in [11] iterative average derivative estimates (IADE) in the sequel.

2.2 Kernel method for estimating the gradient of regression

For a set Ω, an (R-valued) positive definite kernel k on Ω is a symmetric kernel k : Ω × Ω → R such that Σ_{i,j=1}^n c_i c_j k(x_i, x_j) ≥ 0 for any x_1, . . . , x_n in Ω and c_1, . . . , c_n ∈ R. It is known that a positive definite kernel on Ω uniquely defines a Hilbert space H consisting of functions on Ω, in which the reproducing property ⟨f, k(·, x)⟩_H = f(x) (∀f ∈ H) holds, where ⟨·, ·⟩_H is the inner product of H. The Hilbert space H is called the reproducing kernel Hilbert space (RKHS) associated with k. We assume that an RKHS is always separable.

In deriving a kernel method based on the approach in Sec. 2.1, the fundamental tool is the reproducing property for the derivative of a function. It is known (e.g., [21] Sec. 4.3) that if a positive definite kernel k(x, y) on an open set in Euclidean space is continuously differentiable with respect to x and y, every f in the corresponding RKHS H is continuously differentiable. If further (∂/∂x) k(·, x) ∈ H, we have

    ∂f/∂x = ⟨f, (∂/∂x) k(·, x)⟩_H.    (2)

This reproducing property, combined with the following kernel estimator of the conditional expectation (see [8, 9, 19] for details), will provide a method for dimension reduction. Let (X, Y) be a random variable on X × Y with probability P. We always assume that the p.d.f. p(x, y) and the conditional p.d.f. p(y|x) exist, and that a positive definite kernel is measurable and bounded. 
Let kX and kY be positive definite kernels on X and Y, respectively, with respective RKHSs HX and HY. The (uncentered) covariance operator C_YX : HX → HY is defined by the equation

    ⟨g, C_YX f⟩_HY = E[f(X) g(Y)] = E[⟨f, Φ_X(X)⟩_HX ⟨Φ_Y(Y), g⟩_HY]    (3)

for all f ∈ HX, g ∈ HY, where Φ_X(x) = kX(·, x) and Φ_Y(y) = kY(·, y). Similarly, C_XX denotes the operator on HX that satisfies ⟨f2, C_XX f1⟩ = E[f2(X) f1(X)] for any f1, f2 ∈ HX. These definitions are straightforward extensions of the ordinary covariance matrices, if we consider the covariance of the random vectors Φ_X(X) and Φ_Y(Y) on the RKHSs. One of the advantages of the kernel method is that estimation with finite data is straightforward. Given an i.i.d. sample (X1, Y1), . . . , (Xn, Yn) with law P, the covariance operators are estimated by

    Ĉ^(n)_YX f = (1/n) Σ_{i=1}^n kY(·, Yi) ⟨kX(·, Xi), f⟩_HX,   Ĉ^(n)_XX f = (1/n) Σ_{i=1}^n kX(·, Xi) ⟨kX(·, Xi), f⟩_HX.    (4)

It is known [8] that if E[g(Y)|X = ·] ∈ HX holds for g ∈ HY, then we have C_XX E[g(Y)|X = ·] = C_XY g. If further C_XX is injective¹, this relation can be expressed as

    E[g(Y)|X = ·] = C_XX^{-1} C_XY g.    (5)

While the assumption E[g(Y)|X = ·] ∈ HX may not hold in general, we can nonetheless obtain an empirical estimator based on Eq. (5), namely,

    (Ĉ^(n)_XX + ε_n I)^{-1} Ĉ^(n)_XY g,

where ε_n is a regularization coefficient for Tikhonov-type regularization. Note that the above expression is the kernel ridge regression of g(Y) on X. 
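To make the regularized estimator concrete: with a Gaussian kernel, the finite-sample version of (Ĉ^(n)_XX + ε_n I)^{-1} Ĉ^(n)_XY g evaluated at a point x reduces to kernel ridge regression of g(Y) on X with the matrix (G_X + nε_n I)^{-1}. The following NumPy sketch is our own illustration (function names and the toy parameter values are ours, not from the paper):

```python
import numpy as np

def gauss_gram(A, B, sigma):
    """Gaussian Gram matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def cond_exp_estimator(X, gY, sigma, eps):
    """Return a function x -> estimate of E[g(Y)|X = x] via kernel ridge regression.

    X: (n, m) predictors; gY: (n,) values g(Y_i);
    eps plays the role of the regularization coefficient eps_n in the text.
    """
    n = len(X)
    GX = gauss_gram(X, X, sigma)
    # alpha = (G_X + n eps I)^{-1} (g(Y_1), ..., g(Y_n))^T
    alpha = np.linalg.solve(GX + n * eps * np.eye(n), gY)
    return lambda x: gauss_gram(np.atleast_2d(x), X, sigma) @ alpha

# toy check: E[Y|X = x] = sin(x) plus small additive noise
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))
Y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(200)
f = cond_exp_estimator(X, Y, sigma=0.5, eps=1e-3)
print(float(f(np.array([1.0]))))  # close to sin(1)
```

The lambda returned by `cond_exp_estimator` is exactly the empirical counterpart of Eq. (5) with Tikhonov regularization; the operator inverse never has to be formed explicitly.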
As we discuss in the Supplements, we can in fact prove rigorously that this estimator converges to E[g(Y)|X = ·].

Assume now that X = R^m, C_XX is injective, kX(x, x̃) is continuously differentiable, E[g(Y)|X = x] ∈ HX for any g ∈ HY, and (∂/∂x) kX(·, x) ∈ R(C_XX), where R denotes the range of the operator. From Eqs. (5) and (2),

    (∂/∂x) E[g(Y)|X = x] = ⟨C_XX^{-1} C_XY g, ∂kX(·, x)/∂x⟩ = ⟨g, C_YX C_XX^{-1} ∂kX(·, x)/∂x⟩.

With g = kY(·, ỹ), we obtain the gradient of the regression of the feature vector Φ_Y(Y) on X as

    (∂/∂x) E[Φ_Y(Y)|X = x] = C_YX C_XX^{-1} ∂kX(·, x)/∂x.    (6)

2.3 Gradient-based kernel method for linear feature extraction

It follows from the same argument as in Sec. 2.1 that (∂/∂x) E[kY(·, y)|X = x] = Ξ(x) B with an operator Ξ(x) from R^m to HY, where we use a slight abuse of notation by identifying the operator Ξ(x) with a matrix. In combination with Eq. (6), we have

    B^T ⟨Ξ(x), Ξ(x)⟩_HY B = ⟨∂kX(·, x)/∂x, C_XX^{-1} C_XY C_YX C_XX^{-1} ∂kX(·, x)/∂x⟩_HX =: M(x),    (7)

which shows that the eigenvectors for non-zero eigenvalues of the m × m matrix M(x) are contained in the EDR space. This fact is the basis of our method. In contrast to the conventional gradient-based method described in Sec. 2.1, this method incorporates the high- (or infinite-) dimensional regressor E[Φ_Y(Y)|X = x].

¹Noting ⟨C_XX f, f⟩ = E[f(X)²], it is easy to see that C_XX is injective if kX is a continuous kernel on a topological space X and P_X is a Borel probability measure such that P(U) > 0 for any open set U in X.

Given an i.i.d. sample (X1, Y1), . . .
, (Xn, Yn) from the true distribution, based on the empirical covariance operators of Eq. (4) and regularized inversions, the matrix M(x) is estimated by

    M̂_n(x) = ⟨∂kX(·, x)/∂x, (Ĉ^(n)_XX + ε_n I)^{-1} Ĉ^(n)_XY Ĉ^(n)_YX (Ĉ^(n)_XX + ε_n I)^{-1} ∂kX(·, x)/∂x⟩
            = ∇kX(x)^T (G_X + nε_n I)^{-1} G_Y (G_X + nε_n I)^{-1} ∇kX(x),    (8)

where G_X and G_Y are the Gram matrices (kX(Xi, Xj)) and (kY(Yi, Yj)), respectively, and ∇kX(x) = (∂kX(X1, x)/∂x, · · · , ∂kX(Xn, x)/∂x)^T is n × m.

As the eigenvectors of M(x) are contained in the EDR space for any x, we propose to use the average of M̂_n(Xi) over all data points Xi, and define

    M̃_n := (1/n) Σ_{i=1}^n M̂_n(Xi) = (1/n) Σ_{i=1}^n ∇kX(Xi)^T (G_X + nε_n I_n)^{-1} G_Y (G_X + nε_n I_n)^{-1} ∇kX(Xi).

We call dimension reduction with the matrix M̃_n gradient-based kernel dimension reduction (gKDR). For linear feature extraction, the projection matrix B in Eq. (1) is then estimated simply by the top d eigenvectors of M̃_n. We call this method gKDR-FEX.

The proposed method applies to a wide class of problems; in contrast to many existing methods, gKDR-FEX can handle any type of data for Y, including multinomial or structured variables, and makes no strong assumptions on the regressor or the distribution of X. Additionally, since gKDR incorporates the high-dimensional feature vector Φ_Y(Y), it works for any regression relation, including multiplicative noise, for which many existing methods such as SIR and IADE fail.

As in all kernel methods, the results of gKDR depend on the choice of kernels. We use cross-validation (CV) for choosing kernels and parameters, combined with some regression or classification method. 
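As an illustration of Eq. (8) and the averaged matrix M̃_n, the following NumPy sketch (our own; Gaussian kernels for both X and Y, and the toy bandwidths/regularization are illustrative choices, not values from the paper) estimates B on a toy one-dimensional EDR problem:

```python
import numpy as np

def gauss_gram(A, B, sigma):
    """Gaussian Gram matrix exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def gkdr_fex(X, Y, d, sigma_x, sigma_y, eps):
    """Sketch of gKDR-FEX: top-d eigenvectors of the averaged matrix M~_n."""
    n, m = X.shape
    GX = gauss_gram(X, X, sigma_x)
    GY = gauss_gram(Y, Y, sigma_y)
    Ginv = np.linalg.inv(GX + n * eps * np.eye(n))
    F = Ginv @ GY @ Ginv                  # (GX + n eps I)^-1 GY (GX + n eps I)^-1
    # N[i, j, :] = d kX(X_j, x)/dx at x = X_i for the Gaussian kernel
    N = GX[:, :, None] * (X[None, :, :] - X[:, None, :]) / sigma_x**2
    # M~_n = (1/n) sum_i N_i^T F N_i
    Mt = np.einsum('ija,jk,ikb->ab', N, F, N) / n
    w, V = np.linalg.eigh(Mt)
    return V[:, ::-1][:, :d]              # columns: estimated basis of the EDR space

# toy check: Y depends on X only through the first coordinate
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
Y = (np.sin(X[:, 0]) + 0.1 * rng.standard_normal(300)).reshape(-1, 1)
B = gkdr_fex(X, Y, d=1, sigma_x=3.0, sigma_y=0.7, eps=1e-3)
print(np.abs(B[0, 0]))  # close to 1: the first axis dominates the estimate
```

Note that only Gram matrices, one regularized inverse, and an m × m eigendecomposition are involved, which is the computational simplicity claimed in the text.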
In this paper, k-nearest neighbor (kNN) regression / classification is used in CV for its simplicity: for each candidate kernel or parameter, we compute the CV error of the kNN method applied to (B^T Xi, Yi), where B is given by gKDR, and choose the candidate that gives the least error.

The time complexity of the matrix inversions and the eigendecomposition for gKDR is O(n³), which is prohibitive for large data sets. We can, however, apply low-rank approximations of the Gram matrices, such as incomplete Cholesky decomposition. The space complexity may also be a problem for gKDR, since (∇kX(Xi))_{i=1}^n has n² × m dimensions. In the case of the Gaussian kernel, where (∂/∂x^a) kX(Xj, x)|_{x=Xi} = (1/σ²)(X^a_j − X^a_i) exp(−‖Xj − Xi‖²/(2σ²)), we have a way of reducing the necessary memory by low-rank approximation. Let GX ≈ RR^T and GY ≈ HH^T be low-rank approximations with rx = rk R, ry = rk H (rx, ry < n, m). With the notation F := (GX + nε_n I_n)^{-1} H and Θ^{as}_i = (1/σ²) X^a_i R_{is}, we have, for 1 ≤ a, b ≤ m,

    M̃_{n,ab} = (1/n) Σ_{i=1}^n Σ_{t=1}^{ry} Γ^t_{ia} Γ^t_{ib},
    Γ^t_{ia} = Σ_{s=1}^{rx} R_{is} (Σ_{j=1}^n Θ^{as}_j F_{jt}) − Σ_{s=1}^{rx} Θ^{as}_i (Σ_{j=1}^n R_{js} F_{jt}).

With this method, the complexity is O(nmr) in space and O(nm²r) in time (r = max{rx, ry}), which is much more memory-efficient than the straightforward implementation.

We introduce two variants of gKDR-FEX. First, since accurate nonparametric estimation with high-dimensional X is not easy, we propose a method that decreases the dimensionality iteratively. Using gKDR-FEX, we first find a matrix B1 of dimensionality d1 larger than the target d, project the data Xi onto the subspace as Z^(1)_i = B1^T Xi, find the projection matrix B2 (a d1 × d2 matrix) projecting Z^(1)_i onto a d2 (d2 < d1) dimensional subspace, and repeat this process. 
We call this method gKDR-FEXi.

Second, if Y takes only L distinct values, as in classification, the Gram matrix GY, and thus M̃_n, is of rank at most L (see Eq. (8)), which is a strong limitation of gKDR. Note that this problem is shared by many linear dimension reduction methods, including CCA and slice-based methods. To solve this problem, we propose to use the variation of M̂_n(x) over the points x = Xi instead of the average M̃_n. Partitioning {1, . . . , n} into T1, . . . , Tℓ, the projection matrices B̂[a] given by the eigenvectors of M̂[a] = Σ_{i∈Ta} M̂(Xi) are used to define P̂ = (1/ℓ) Σ_{a=1}^ℓ B̂[a] B̂[a]^T. The estimator of B is then given by the top d eigenvectors of P̂. We call this method gKDR-FEXv.

Table 1: gKDR-FEX for synthetic data: mean discrepancies over 100 runs.

              gKDR-FEX  gKDR-FEXi  gKDR-FEXv  IADE    SIR II  KDR     gKDR-FEX+KDR
(A) n = 100   0.1989    0.1639     0.2002     0.1372  0.2986  0.2807  0.0883
(A) n = 200   0.1264    0.0995     0.1287     0.0857  0.2077  0.1175  0.0501
(B) n = 100   0.1500    0.1358     0.1630     0.1690  0.3137  0.2138  0.1076
(B) n = 200   0.0755    0.0750     0.0802     0.0940  0.2129  0.1440  0.0506
(C) n = 200   0.1919    0.2322     0.1930     0.7724  0.7326  0.1479  0.1285
(C) n = 400   0.1346    0.1372     0.1369     0.7863  0.7167  0.0897  0.0893

2.4 Theoretical analysis of gKDR

We have derived the gKDR method from a necessary condition for the EDR space. The following theorem shows that the condition is also sufficient if kY is characteristic. A positive definite kernel k on a measurable space is characteristic if E_P[k(·, X)] = E_Q[k(·, X)] implies P = Q, i.e., the mean of the feature vector uniquely determines the probability [9, 20]. Examples include the Gaussian kernel.

In the following theoretical results, we assume (i) ∂kX(·, x)/∂x^a ∈ R(C_XX) (a = 1, . . .
, m), (ii) E[kY(y, X)|X = ·] ∈ HX for any y ∈ Y, and (iii) E[g(Y)|B^T X = z] is a differentiable function of z for any g ∈ HY, and the linear functional g ↦ ∂E[g(Y)|B^T X = z]/∂z is continuous for any z. In the sequel, the subspace spanned by the columns of B is denoted by Span(B), and the Frobenius norm of a matrix M by ‖M‖_F. The proofs are given in the Supplements.

Theorem 1. In addition to the above assumptions (i)-(iii), assume that the kernel kY is characteristic. If the eigenspaces for the non-zero eigenvalues of E[M(X)] are included in Span(B), then Y and X are conditionally independent given B^T X.

We can obtain the rate of consistency for M̂_n(x) and M̃_n.

Theorem 2. In addition to (i)-(iii), assume that ∂kX(·, x)/∂x^a ∈ R(C_XX^{β+1}) (a = 1, . . . , m) for some β ≥ 0, and E[kY(y, Y)|X = ·] ∈ HX for every y ∈ Y. Then, for ε_n = n^{−max{1/3, 1/(2β+2)}}, we have

    M̂_n(x) − M(x) = O_p(n^{−min{1/3, (2β+1)/(4β+4)}})

for every x ∈ X as n → ∞. If further E[‖M(X)‖_F²] < ∞ and ∂kX(·, x)/∂x^a = C_XX^{β+1} h^a_x with E‖h^a_X‖_HX < ∞, then M̃_n → E[M(X)] at the same rate as above.

Note that, assuming the eigenvalues of M(x) or E[M(X)] are all distinct, the convergence of the matrices implies the convergence of their eigenvectors [22]; thus the estimator of gKDR-FEX converges to the subspace given by the top eigenvectors of E[M(X)].

2.5 Experiments with gKDR-FEX

We always use the Gaussian kernel k(x, x̃) = exp(−‖x − x̃‖²/(2σ²)) in the kernel methods below. First we use three synthetic data sets to verify the basic performance of gKDR-FEX(i,v). 
The data are generated by

(A): Y = Z sin(√5 Z) + W, with Z = (1/√5)(1, 2, 0, . . . , 0)^T X;
(B): Y = (Z1³ + Z2)(Z1 − Z2³) + W, with Z1 = (1/√2)(1, 1, 0, . . . , 0)^T X and Z2 = (1/√2)(1, −1, 0, . . . , 0)^T X, where the 10-dimensional X is generated by the uniform distribution on [−1, 1]^10 and W is independent noise with N(0, 10⁻²); and
(C): Y = Z⁴ E, with Z = (1, 0, . . . , 0)^T X, where each component of the 10-dimensional X is independently generated by the truncated normal distribution N(0, 1/4) · I_{[−1,1]} and E ∼ N(0, 1) is multiplicative noise.

The discrepancy between the estimator B and the true projector B0 is measured by ‖B0 B0^T (I_m − B B^T)‖_F / d. For choosing the parameter σ of the Gaussian kernel and the regularization parameter ε_n, the CV of Sec. 2.3 with kNN (k = 5, chosen manually to optimize the results) is used with 8 different values c·σ_med (0.5 ≤ c ≤ 10), where σ_med is the median of pairwise distances of the data [10], and ℓ = 4, 5, 6, 7 for ε_n = 10^{−ℓ} (a similar strategy is used for the CV below). We compare the results with those of IADE, SIR II [13], and KDR. IADE has seven parameters [11], and we tuned two of them (h1 and ρ_min) manually to optimize the performance. For SIR II, we tried several numbers of slices and chose the one that gave the best result. From Table 1, we see that gKDR-FEX(i,v) give much better results than SIR II in all cases. IADE works better than these methods for (A), while for (B) and (C) it works worse. Since (C) has multiplicative noise, IADE does not obtain meaningful estimation. 
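For reference, the discrepancy measure ‖B0 B0^T (I_m − B B^T)‖_F / d used in Table 1 is only a few lines of NumPy; the sketch below is our own (the function name is illustrative):

```python
import numpy as np

def edr_discrepancy(B0, B):
    """Discrepancy between the true EDR basis B0 and an estimate B.

    Both are m x d with orthonormal columns; returns ||P0 (I - P)||_F / d,
    which is 0 for identical subspaces.
    """
    m, d = B0.shape
    P0 = B0 @ B0.T                   # projector onto the true subspace
    P_perp = np.eye(m) - B @ B.T     # projector onto the estimated complement
    return np.linalg.norm(P0 @ P_perp, 'fro') / d

# identical subspaces give 0; orthogonal one-dimensional subspaces give 1
e1 = np.eye(3)[:, :1]
e2 = np.eye(3)[:, 1:2]
print(edr_discrepancy(e1, e1), edr_discrepancy(e1, e2))  # 0.0 1.0
```

The measure is invariant to the choice of basis within each subspace, which is why it is suitable for comparing estimators that return B only up to rotation.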
[Figure 1 omitted: three panels plotting classification rate (%) against dimensionality for gKDR-v, KDR, and all variables.]

Figure 1: Classification accuracy with gKDR-v and KDR for binary classification problems: (B) Breast-cancer-Wisconsin (m: 30, ntr: 200, ntest: 369), (H) Heart Disease (m: 13, ntr: 129, ntest: 148), (I) Ionosphere (m: 34, ntr: 151, ntest: 200). m, ntr and ntest are the dimension of X, the training data size, and the test data size, respectively.

Table 2: Left: ISOLET, classification errors for test data (%). Right: Amazon Reviews, 10-fold cross-validation errors (%) for classification.

Dim.            10     20     30     40     50
gKDR + kNN      13.53  4.55   -      -      -
gKDR-v + kNN    13.15  4.55   4.81   5.26   5.58
CCA + kNN       22.77  6.74   -      -      -
SIR-II + kNN    77.42  70.11  63.44  52.66  50.61
gKDR + SVM      14.43  5.00   -      -      -
gKDR-v + SVM    16.87  4.75   3.85   3.59   3.08
CCA + SVM       13.09  6.54   -      -      -

L    gKDR+SVM  Corr+SVM (500)  Corr+SVM (2000)
10   12.0      15.7            8.3
20   16.2      30.2            18.0
30   18.0      29.2            24.0
40   21.8      35.4            25.0
50   19.5      41.1            29.0

The KDR attains higher accuracy for (C), but is less accurate for (A) and (B) with n = 100; this undesired result is caused by failure of optimization in
We also used the results of gKDR-FEX as the initial\nstate for KDR, which improved the accuracy signi\ufb01cantly for (A) and (B). Note however that these\ndata sets are very small in size and dimension, and KDR is not applicable to large data used later.\n\nOne way of evaluating dimension reduction methods in supervised learning is to consider the classi-\n\ufb01cation or regression accuracy after projecting data onto the estimated subspaces. We next used three\ndata sets for binary classi\ufb01cation, heart-disease (H), ionoshpere (I), and breast-cancer-Wisconsin\n(B), from UCI repository [7], and evaluated the classi\ufb01cation rates of gKDR-FEXv with kNN clas-\nsi\ufb01ers (k = 7). We compared them with KDR, as KDR shows high accuracy for small data sets.\nFrom Fig. 1, we see gKDR-FEXv shows competitive accuracy with KDR: slightly worse for (I), and\nslightly better for (B). The computation of gKDR-FEXv for these data sets can be much faster than\nthat of KDR. For each parameter set, the computational time of gKDR vs KDR was, in (H) 0.044\nsec / 622 sec (d = 11), in (I) 0.l03 sec / 84.77 sec (d = 20), and in (B) 0.116 sec / 615 sec (d = 20).\nThe next two data sets taken from UCI repository are larger in the sample size and dimensionality,\nfor which the optimization of KDR is dif\ufb01cult to apply. The \ufb01rst one is ISOLET, which provides\n617 dimensional continuous features of speech signals to classify 26 alphabets. In addition to 6238\ntraining data, 1559 test data are separately provided. We evaluate the classi\ufb01cation errors with the\nkNN classi\ufb01er (k = 5) and 1-vs-1 SVM to see the effectiveness of the estimated subspaces (see\nTable 2). From the information on the data at the UCI repository, the best performance with neural\nnetworks and C4.5 with ECOC are 3.27% and 6.61%, respectively. 
In comparison with these results, the low-dimensional subspaces found by gKDR-FEX and gKDR-FEXv maintain the information for classification effectively. SIR-II does not find meaningful features.

The second data set is author identification of Amazon commerce reviews with 10000-dimensional linguistic features. The total number of authors is 50, and 30 reviews were collected for each author; the total size of the data is thus 1500. We varied the number of authors used (L) to make tasks of different levels of difficulty. The dimensionality reduced to by gKDR-FEX is set to L, and the 10-fold CV errors with data projected on the estimated EDR space are evaluated using 1-vs-1 SVM. As a comparison, the squared sum of variable-wise Pearson correlations, Σ_{ℓ=1}^L Corr[X^a, Y^ℓ]², is also used for choosing explanatory variables (a = 1, . . . , 10000). Such variable selection methods with Pearson correlation are popularly used for very high-dimensional data. The variables with the top 500 and 2000 correlations are used to make SVM classifiers. As we can see from Table 2, gKDR-FEX gives much more effective subspaces for regression than the Pearson correlation method when the number of authors is large. The creator of the data set has also reported classification results with a neural network model [15]; for 50 authors, the 10-fold CV error with 2000 selected variables is 19.51%, which is similar to the gKDR-FEX result with only 50 linear features.

3 Variable selection with gKDR

In recent years, extensive studies have been done on variable selection with sparsity penalties ([6, 16, 18, 23-27, 29, 30] among many others). In the supervised setting, these studies often consider a specific model for the regression, such as least squares or logistic regression. 
While consistency and the oracle property have been established for many of these methods, the assumption that the model contains the true parameter may not hold in practice, and is thus a strong restriction. It is therefore important to consider more flexible ways of variable selection that do not assume any parametric model for the regression. The gKDR approach is appealing for this problem, since it realizes conditional independence without strong assumptions on the regressor or the distribution of the variables.

Chen et al. [2] recently proposed the Coordinate-Independent Sparse (CIS) method, a semiparametric method for sparse variable selection. In CIS, the linear feature B^T X is assumed, with some rows of B zero, but no parametric model is specified for the regression. We wish to estimate B so that the zero rows are estimated as zeros. This is achieved by imposing the sparsity penalty of the group LASSO [29] in combination with an objective function of linear feature extraction written in the form of an eigenproblem, such as SIR and PFC [3].

We follow the CIS method for our variable selection with gKDR; since gKDR is given by the eigenproblem with matrix M̃_n, the CIS method applies straightforwardly. The significance of our method is that gKDR formulates the conditional independence of Y and X given B^T X, while the existing CIS-based methods in [2] realize only weaker conditions under strong assumptions.

3.1 Sparse variable selection with gKDR

Throughout this section, it is assumed that the true probability satisfies Eq. (1) with B = B0 = (v_{01}^T, . . . , v_{0m}^T)^T, and that for some 1 ≤ q ≤ m the j-th row v_{0j} is non-zero for j ≤ q and v_{0j} = 0 for j ≥ q + 1. The projection matrix is written B = (b1, . . . , bd) = (v1^T, . . . , vm^T)^T, where bi is the i-th column and vj is the j-th row. 
The proposed variable selection method, gKDR-VS, estimates B by

    B̂_λ = arg min_{B : B^T B = I_d} { −Tr[B^T M̃_n B] + Σ_{i=1}^m λ_i ‖v_i‖ },    (9)

where ‖v_j‖ is the Euclidean norm and λ = (λ1, . . . , λm) ∈ R^m_+ are the regularization coefficients. To optimize Eq. (9), as in [2], we used the local quadratic approximation [6], which is simple and fast. We used the Matlab code provided on the homepage of X. Chen.

The choice of λ is crucial for the practical performance of sparse variable selection. As a theoretical guarantee, we will show that an asymptotic condition on λ provides model consistency. In practice, as in the Adaptive Lasso [30], it is suitable to consider λ = λ(θ) defined by

    λ_i = θ ‖ṽ_i‖^{−r},

where θ and r are positive numbers, and ṽ_i is the i-th row of B̃0, the solution to gKDR without the penalty, i.e., B̃0 = arg min_{B^T B = I_d} −Tr[B^T M̃_n B]. We used r = 1/2 for all of our experiments. To choose the parameter θ, a BIC-based method is often used in sparse variable selection [27, 31] with a theoretical guarantee of model consistency. We use a BIC-type method, choosing θ by minimizing

    BIC_θ = −Tr[B̂_{λ(θ)}^T M̃_n B̂_{λ(θ)}] + C_n df_θ (log n)/n,    (10)

where df_θ = d(p − d) is the degree of freedom of B̂_{λ(θ)}, with p the number of non-zero rows in B̂_{λ(θ)}, and C_n is a positive number of O_p(1). 
We used Cn = \u03b11 log log(m) with \u03b11 is the largest\ndiscussed, and \u03b11 is introduced to adjust the scale of Tr[bBT\n\u02dcMnbB\u03bb]; we use CV for choosing the\nhyperparameters (kernel and regularization coef\ufb01cient), in which the values of Tr[bBT\n\u02dcMnbB\u03bb] is not\n\nnormalized well for different choices.\n\n\u03bb\n\n\u03bb\n\n7\n\n\fgKDR\n-VS\n\n.94/.99/75\n1.0/1.0/98\n.92/.84/63\n.98/.89/75\n\nCIS\n-SIR\n\n.89/1.0/65\n.99/1.0/97\n.19/.85/1\n.18/.85/1\n\n(A) n = 60\n(A) n = 120\n(B) n = 100\n(B) n = 200\n\nTable 3: gKDR-VS and CIS-SIR\nwith synthetic data (ratio of non-\nzeros in 1 \u2264 j \u2264 q / ratio of ze-\nros in q + 1 \u2264 j \u2264 m / number of\ncorrect models among 100 runs).\n\nMethod\nCRIM\n\nZN\n\nINDUS\nCHAS\nNOX\nRM\nAGE\nDIS\nRAD\nTAX\n\nPTRATIO\n\nB\n\nLSTAT\n\ngKDR-VS\n\n0\n0\n0\n0\n0\n0.896\n0\n-0.169\n0.018\n0\n-0.376\n0\n-0.165\n\n0\n0\n0\n0\n0\n0.393\n0\n0.022\n-0.000\n0\n0.919\n0\n0.017\n\nCIS-SIR\n0\n0\n-0.008\n-0.000\n0\n0\n0\n0\n0\n0\n-1.00\n-1.253\n-0.022\n0.005\n0\n0\n0\n0\n-0.001\n0.001\n0.003\n0.049\n-0.001\n0.002\n-0.114\n0.043\n\nCIS-PFC\n0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n1.045\n-1.390\n-0.011\n-0.003\n0\n0\n0\n0\n-0.005\n-0.001\n0.007\n-0.038\n0.001\n0.005\n-0.113\n-0.043\n\nTable 4: Boston Housing Data: estimated sparse EDR.\n\n3.2 Theoretical results on gKDR-VS\n\n1 \u2212 B2BT\n\n2 k, where k \u00b7 k is the operator norm.\n\nThis subsection shows the model consistency of the gKDR-VS. All the proofs are shown in Supple-\nments. Let \u03b1n = max{\u03bbj | 1 \u2264 j \u2264 q} and \u03b2n = min{\u03bbj | q + 1 \u2264 j \u2264 m}. The eigenvalues of\nM = E[M (X)] are \u03b71 \u2265 . . . \u2265 \u03b7m \u2265 0. For two m \u00d7 d matrices Bi (i = 1, 2) with BT\ni Bi = Id,\nwe de\ufb01ne D(B1, B2) = kB1BT\nTheorem 3. Suppose k \u02dcMn \u2212 MkF = Op(n\u2212\u03c4 ) for some \u03c4 > 0. If n\u03c4 \u03b1n \u2192 0 as n \u2192 \u221e and\n\u03b7q > \u03b7q+1, then the estimator bB\u03bb in Eq. 
(9) satisfies D(B̂_λ, B_0) = O_p(n^{−τ}) as n → ∞.

We saw in Theorem 2 that, under some conditions, M̃_n converges to M at the rate O_p(n^{−τ}) with 1/4 ≤ τ ≤ 1/3. Theorem 3 thus shows that B̂_λ is also consistent at the same rate.

Theorem 4. In addition to the assumptions in Theorem 3, assume n^τ β_n → ∞ as n → ∞. Then, for all q + 1 ≤ j ≤ m, Pr(v̂_j = 0) → 1 as n → ∞, where v̂_j is the j-th row of B̂_λ.

3.3 Experiments with gKDR-VS

We first apply gKDR-VS with d = 1 to synthetic data generated by the following two models: (A): Y = X^1 + X^2 + X^3 + W and (B): Y = (X^1 + X^2 + X^3)^4 W, where the noise W follows N(0, 1). For (A), X = (X^1, . . . , X^24) is generated by N(0, Σ) with Σ_{ij} = (1/2)^{|i−j|} (1 ≤ i, j ≤ 24), and for (B), X = (X^1, . . . , X^10) by N(0, 4 I_10). Note that (B) includes multiplicative noise, which cannot be handled by many dimension reduction methods. For comparison, the CIS method with SIR is applied to the same data; the regularization parameter of CIS-SIR is chosen by the BIC described in [2]. The results are shown in Table 3. While both methods work effectively for (A), only gKDR-VS can handle the multiplicative noise of (B).

The next experiment uses the Boston Housing data, which has often been used for variable selection. The response Y is the median value of homes in each tract, and thirteen variables are used to explain it. The details of the variables are described in the Supplements, Sec. E. The results of gKDR-VS and CIS-SIR / CIS-PFC with d = 2 are shown in Table 4. The variables selected by gKDR-VS are RM, DIS, RAD, PTRATIO, and LSTAT, which are slightly different from those of the CIS methods.
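For concreteness, model (B) above, with its multiplicative noise, can be simulated as follows (a sketch of the data-generating process only, not of the gKDR-VS estimator; the sample size n = 200 matches one of the settings in Table 3):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 10

# Model (B): X = (X^1, ..., X^10) ~ N(0, 4 I_10), W ~ N(0, 1),
# and Y = (X^1 + X^2 + X^3)^4 * W (multiplicative noise).
X = rng.normal(scale=2.0, size=(n, m))   # each coordinate has variance 4
W = rng.normal(size=n)
Y = (X[:, 0] + X[:, 1] + X[:, 2]) ** 4 * W

# With d = 1, the true projection B_0 has non-zero rows only for j <= q = 3.
b0 = np.zeros((m, 1))
b0[:3, 0] = 1.0 / np.sqrt(3)             # normalized so that b0^T b0 = 1

# Since E[W] = 0, E[Y | X] = 0 identically: the relevant direction shows up
# only through the conditional variance of Y, which is why methods based on
# the regression mean alone fail on this model.
print(X.shape, Y.shape)
```

This makes explicit why model (B) is hard for many dimension reduction methods: there is no signal in the conditional mean to exploit.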
In a previous study [1], the four variables RM, TAX, PTRATIO, and LSTAT were considered to have the major contribution.

4 Conclusions

We have proposed a gradient-based kernel approach to dimension reduction in supervised learning. The method is based on the general kernel formulation of conditional independence, and thus has wide applicability without strong restrictions on the model or variables. The linear feature extraction method, gKDR-FEX, finds effective features with a simple eigendecomposition, even when other conventional methods are not applicable because of multiplicative noise or high dimensionality; its consistency is also guaranteed. We have extended the method to variable selection (gKDR-VS) with a sparseness penalty, and demonstrated its promising performance on synthetic and real-world data. Its model consistency has also been proved.

Acknowledgements. KF has been supported in part by JSPS KAKENHI (B) 22300098.

References

[1] L. Breiman and J. Friedman. Estimating optimal transformations for multiple regression and correlation. J. Amer. Stat. Assoc., 80:580–598, 1985.
[2] X. Chen, C. Zou, and R. Dennis Cook. Coordinate-independent sparse sufficient dimension reduction and variable selection. Ann. Stat., 38(6):3696–3723, 2010.
[3] R. Dennis Cook and L. Forzani. Principal fitted components for dimension reduction in regression. Statistical Science, 23(4):485–501, 2008.
[4] R. Dennis Cook and S. Weisberg. Discussion of Li (1991). J. Amer. Stat. Assoc., 86:328–332, 1991.
[5] J. Fan and I. Gijbels. Local Polynomial Modelling and its Applications. Chapman and Hall, 1996.
[6] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Stat. Assoc., 96(456):1348–1360, 2001.
[7] A. Frank and A. Asuncion. UCI machine learning repository, [http://archive.ics.uci.edu/ml].
Irvine, CA: University of California, School of Information and Computer Science, 2010.
[8] K. Fukumizu, F.R. Bach, and M.I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. JMLR, 5:73–99, 2004.
[9] K. Fukumizu, F.R. Bach, and M.I. Jordan. Kernel dimension reduction in regression. Ann. Stat., 37(4):1871–1905, 2009.
[10] A. Gretton, K. Fukumizu, C.H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Advances in NIPS 20, pages 585–592, 2008.
[11] M. Hristache, A. Juditsky, J. Polzehl, and V. Spokoiny. Structure adaptive approach for dimension reduction. Ann. Stat., 29(6):1537–1566, 2001.
[12] B. Li, H. Zha, and F. Chiaromonte. Contour regression: A general approach to dimension reduction. Ann. Stat., 33(4):1580–1616, 2005.
[13] K.-C. Li. Sliced inverse regression for dimension reduction (with discussion). J. Amer. Stat. Assoc., 86:316–342, 1991.
[14] K.-C. Li. On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. J. Amer. Stat. Assoc., 87:1025–1039, 1992.
[15] S. Liu, Z. Liu, J. Sun, and L. Liu. Application of synergetic neural network in online writeprint identification. Intern. J. Digital Content Technology and its Applications, 5(3):126–135, 2011.
[16] L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. J. Royal Stat. Soc.: Ser. B, 70(1):53–71, 2008.
[17] A.M. Samarov. Exploring regression structure using nonparametric functional estimation. J. Amer. Stat. Assoc., 88(423):836–847, 1993.
[18] S. K. Shevade and S. S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.
[19] L. Song, J. Huang, A. Smola, and K. Fukumizu.
Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proc. ICML 2009, pages 961–968, 2009.
[20] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G.R.G. Lanckriet. Hilbert space embeddings and metrics on probability measures. JMLR, 11:1517–1561, 2010.
[21] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
[22] G.W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, 1990.
[23] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal Stat. Soc.: Ser. B, 58(1):267–288, 1996.
[24] H. Wang and C. Leng. Unified lasso estimation by least squares approximation. J. Amer. Stat. Assoc., 102(479):1039–1048, 2007.
[25] H. Wang, G. Li, and C.-L. Tsai. Regression coefficient and autoregressive order shrinkage and selection via the lasso. J. Royal Stat. Soc.: Ser. B, 69(1):63–78, 2007.
[26] H. Wang, G. Li, and C.-L. Tsai. On the consistency of SCAD tuning parameter selector. Biometrika, 94:553–558, 2007.
[27] H. Wang, B. Li, and C. Leng. Shrinkage tuning parameter selection with a diverging number of parameters. J. Royal Stat. Soc.: Ser. B, 71(3):671–683, 2009.
[28] M. Wang, F. Sha, and M. Jordan. Unsupervised kernel dimension reduction. In Advances in NIPS 23, pages 2379–2387, 2010.
[29] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. Royal Stat. Soc.: Ser. B, 68(1):49–67, 2006.
[30] H. Zou. The adaptive lasso and its oracle properties. J. Amer. Stat. Assoc., 101:1418–1429, 2006.
[31] C. Zou and X. Chen. On the consistency of coordinate-independent sparse estimation with BIC. J. Multivariate Analysis, 112:248–255, 2012.