{"title": "Feature selection in functional data classification with recursive maxima hunting", "book": "Advances in Neural Information Processing Systems", "page_first": 4835, "page_last": 4843, "abstract": "Dimensionality reduction is one of the key issues in the design of effective machine learning methods for automatic induction. In this work, we introduce recursive maxima hunting (RMH) for variable selection in classification problems with functional data. In this context, variable selection techniques are especially attractive because they reduce the dimensionality, facilitate the interpretation and can improve the accuracy of the predictive models. The method, which is a recursive extension of maxima hunting (MH), performs variable selection by identifying the maxima of a relevance function, which measures the strength of the correlation of the predictor functional variable with the class label. At each stage, the information associated with the selected variable is removed by subtracting the conditional expectation of the process. The results of an extensive empirical evaluation are used to illustrate that, in the problems investigated, RMH has comparable or higher predictive accuracy than standard simensionality reduction techniques, such as PCA and PLS, and state-of-the-art feature selection methods for functional data, such as maxima hunting.", "full_text": "Feature selection in functional data classi\ufb01cation with\n\nrecursive maxima hunting\n\nJos\u00b4e L. Torrecilla\n\nComputer Science Department\n\nUniversidad Aut\u00b4onoma de Madrid\n\n28049 Madrid, Spain\n\nAlberto Su\u00b4arez\n\nComputer Science Department\n\nUniversidad Aut\u00b4onoma de Madrid\n\n28049 Madrid, Spain\n\njoseluis.torrecilla@uam.es\n\nalberto.suarez@uam.es\n\nAbstract\n\nDimensionality reduction is one of the key issues in the design of effective machine\nlearning methods for automatic induction. 
In this work, we introduce recursive maxima hunting (RMH) for variable selection in classification problems with functional data. In this context, variable selection techniques are especially attractive because they reduce the dimensionality, facilitate the interpretation and can improve the accuracy of the predictive models. The method, which is a recursive extension of maxima hunting (MH), performs variable selection by identifying the maxima of a relevance function, which measures the strength of the correlation of the predictor functional variable with the class label. At each stage, the information associated with the selected variable is removed by subtracting the conditional expectation of the process. The results of an extensive empirical evaluation are used to illustrate that, in the problems investigated, RMH has comparable or higher predictive accuracy than standard dimensionality reduction techniques, such as PCA and PLS, and state-of-the-art feature selection methods for functional data, such as maxima hunting.\n\n1 Introduction\n\nIn many important prediction problems from different areas of application (medicine, environmental monitoring, etc.) the data are characterized by a function, instead of by a vector of attributes, as is commonly assumed in standard machine learning problems. Some examples of these types of data are functional magnetic resonance imaging (fMRI) (Grosenick et al., 2008) and near-infrared spectra (NIR) (Xiaobo et al., 2010). Therefore, it is important to develop methods for automatic induction that take into account the functional structure of the data (infinite dimension, high redundancy, etc.) (Ramsay and Silverman, 2005; Ferraty and Vieu, 2006). In this work, the problem of classification of functional data is addressed. For simplicity, we focus on binary classification problems (Baíllo et al., 2011). 
Nonetheless, the proposed method can be readily extended to a multiclass setting. Let X(t), t ∈ [0, 1], be a continuous stochastic process in a probability space (Ω, F, P). A functional datum Xn(t) is a realization of this process (a trajectory). Let {(Xn(t), Yn)}_{n=1}^{Ntrain}, t ∈ [0, 1], be a set of trajectories labeled by the dichotomous variable Yn ∈ {0, 1}. These trajectories come from one of two different populations; either P0, when the label is Yn = 0, or P1, when the label is Yn = 1. For instance, the data could be the ECGs from either healthy or sick persons (P0 and P1, respectively). The classification problem consists in deciding to which population a new unlabeled observation Xtest(t) belongs (e.g., to decide from his or her ECG whether a person is healthy or not). Specifically, we are interested in the problem of dimensionality reduction for functional data classification. The goal is to achieve the optimal discrimination performance using only a finite, small set of values from the trajectory as input to a standard classifier (in our work, k-nearest neighbors).\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn general, to properly handle functional data, some kind of reduction of information is necessary. Standard dimensionality reduction methods in functional data analysis (FDA) are based on principal component analysis (PCA) (Ramsay and Silverman, 2005) or partial least squares (PLS) (Preda et al., 2007). In this work, we adopt a different approach based on variable selection (Guyon et al., 2006). The goal is to replace the complete function X(t) by a d-dimensional vector (X(t1), . . . , X(td)) for a set of “suitably chosen” points {t1, . . . , td} (for instance, instants in a heartbeat in ECGs), where d is small.\n\nMost previous work on feature selection in supervised learning with functional data is quite recent and focuses on regression problems; for instance, on the analysis of fMRI images (Grosenick et al., 2008; Ryali et al., 2010) and NIR spectra (Xiaobo et al., 2010). In particular, adaptations of the lasso and other embedded methods have been proposed to this end (see, e.g., Kneip and Sarda (2011); Zhou et al. (2013); Aneiros and Vieu (2014)). In most cases, functional data are simply treated as high-dimensional vectors to which the standard methods apply. Specifically, Gómez-Verdejo et al. (2009) propose feature extraction from the functional trajectories before applying a multivariate variable selector based on measuring the mutual information. Similarly, Fernandez-Lozano et al. (2015) compare different standard feature selection techniques for image texture classification. The method of minimum Redundancy Maximum Relevance (mRMR) introduced by Ding and Peng (2005) has been applied to functional data in Berrendero et al. (2016a). In that work, distance correlation (Székely et al., 2007) is used instead of mutual information to measure nonlinear dependencies, with good results. A fully functional perspective is adopted in Ferraty et al. (2010) and Delaigle et al. (2012). In these articles, a wrapper approach is used to select the optimal set of instants at which the trajectories should be monitored by minimizing a cross-validation estimate of the classification error. Berrendero et al. (2015) introduce a filter selection procedure based on computing the Mahalanobis distance and Reproducing Kernel Hilbert Space techniques. 
Logistic regression models have been applied to the problem of binary classification with functional data in Lindquist and McKeague (2009) and McKeague and Sen (2010), assuming Brownian and fractional Brownian trajectories, respectively. Finally, the selection of intervals or elementary functions instead of variables is addressed in Li and Yu (2008); Fraiman et al. (2016) or Tian and James (2013).\n\nFrom the analysis of previous work one concludes that, in general, it is preferable, both in terms of accuracy and interpretability, to adopt a fully functional approach to the problem. In particular, if the data are characterized by functions that are continuous, values of the trajectory that are close to each other tend to be highly redundant and convey similar information. Therefore, if the value of the process at a particular instant has high discriminant capacity, one could think of discarding nearby values. This idea is exploited in maxima hunting (MH) (Berrendero et al., 2016b).\n\nIn this work, we introduce recursive maxima hunting (RMH), a novel variable selection method for feature selection in functional data classification that takes advantage of the good properties of MH while addressing some of its deficiencies. The extension of MH consists in removing the information conveyed by each selected local maximum before searching for the next one in a recursive manner.\n\nThe rest of the paper is organized as follows: Maxima hunting for feature selection in classification problems with functional data is introduced in Section 2. Recursive maxima hunting, which is the method proposed in this work, is described in Section 3. 
The improvements that can be obtained with this novel feature selection method are analyzed in an exhaustive empirical evaluation, whose results are presented and discussed in Section 4.\n\n2 Maxima Hunting\n\nMaxima hunting (MH) is a method for feature selection in functional classification based on measuring dependencies between values selected from {X(t), t ∈ [0, 1]} and the response variable (Berrendero et al., 2016b). In particular, one selects the values {X(t1), . . . , X(td)} whose dependence with the class label (i.e., the response variable) is locally maximal. Different measures of dependency can be used for this purpose. In Berrendero et al. (2016b), the authors propose the distance correlation (Székely et al., 2007). The distance covariance between the random variables X ∈ R^p and Y ∈ R^q, whose components are assumed to have finite first-order moments, is\n\nV^2(X, Y) = ∫_{R^{p+q}} |ϕ_{X,Y}(u, v) − ϕ_X(u) ϕ_Y(v)|^2 w(u, v) du dv,   (1)\n\nwhere ϕ_{X,Y}, ϕ_X and ϕ_Y are the characteristic functions of (X, Y), X and Y, respectively, w(u, v) = (c_p c_q |u|_p^{1+p} |v|_q^{1+q})^{-1}, c_d = π^{(1+d)/2} / Γ((1+d)/2) is half the surface area of the unit sphere in R^{d+1}, and |·|_d stands for the Euclidean norm in R^d. In terms of V^2(X, Y), the square of the distance correlation is\n\nR^2(X, Y) = V^2(X, Y) / √(V^2(X, X) V^2(Y, Y)) if V^2(X) V^2(Y) > 0, and R^2(X, Y) = 0 if V^2(X) V^2(Y) = 0.   (2)\n\nThe distance correlation is a measure of statistical independence; that is, R^2(X, Y) = 0 if and only if X and Y are independent. Besides being defined for random variables of different dimensions, it has other valuable properties. In particular, it is rotationally invariant and scale equivariant (Székely and Rizzo, 2012). 
A further advantage over other measures of independence, such as the mutual information, is that the distance correlation can be readily estimated using a plug-in estimator that does not involve any parameter tuning. The almost sure convergence of the estimator V^2_n is proved in Székely et al. (2007, Thm. 2).\n\nTo summarize, in maxima hunting, one selects the d different local maxima of the distance correlation between X(t), the values of the random process at different instants t ∈ [0, 1], and the response variable:\n\nX(ti) = argmax_{t ∈ [0,1]} R^2(X(t), Y), i = 1, 2, . . . , d.   (3)\n\nMaxima hunting is easy to interpret. It is also well-motivated from the point of view of FDA, because it takes advantage of functional properties of the data, such as continuity, which implies that similar information is conveyed by the values of the function at neighboring points. In spite of the simplicity of the method, it naturally accounts for the relevance-redundancy trade-off in feature selection (Yu and Liu, 2004): the local maxima (3) are relevant for discrimination, while points around them, which do not maximize the distance correlation with the class label, are automatically excluded. Furthermore, it is also possible to derive a uniform convergence result, which provides additional theoretical support for the method. Finally, the empirical investigation carried out in Berrendero et al. (2016b) shows that MH performs well in standard benchmark classification problems for functional data. In fact, for some problems, one can show that the optimal (Bayes) classification rule depends only on the maxima of R^2(X(t), Y).\n\nHowever, maxima hunting also presents some limitations. First, it is not always a simple task to estimate the local maxima, especially in functions that are very smooth or that vary abruptly. Furthermore, there is no guarantee that different maxima are not redundant. 
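For discretized trajectories, this plug-in estimator is straightforward to implement. The following sketch (our own illustrative code, not the implementation used in the paper; the function name is an assumption) computes the empirical squared distance correlation between two univariate samples, which is how R^2(X(t), Y) can be estimated at each discretization point t:

```python
import numpy as np

def dist_corr_sq(x, y):
    """Empirical squared distance correlation R^2_n between two 1-D samples.

    Plug-in estimator of Szekely et al. (2007): double-center the pairwise
    distance matrices and average their elementwise product.
    """
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    a = np.abs(x - x.T)  # pairwise distances |x_i - x_j|
    b = np.abs(y - y.T)
    # double centering: subtract row and column means, add back the grand mean
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                 # squared distance covariance
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    if dvar_x * dvar_y == 0:               # degenerate case of eq. (2)
        return 0.0
    return dcov2 / np.sqrt(dvar_x * dvar_y)
```

As the text notes, no parameter tuning is involved: the estimator is a closed-form function of the pairwise distances.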
In most cases, the local maxima of R^2(X(t), Y) are indeed relevant for classification. However, there are important points at which this quantity does not attain a maximum.\n\nAs an example, consider the family of classification problems introduced in Berrendero et al. (2016b, Prop. 3), in which the goal is to discriminate trajectories generated by a standard Brownian motion process, B(t), from trajectories of the process B(t) + Φ_{m,k}(t), where\n\nΦ_{m,k}(t) = ∫_0^t √(2^{m−1}) [ I_{((2k−2)/2^m, (2k−1)/2^m]}(s) − I_{((2k−1)/2^m, 2k/2^m]}(s) ] ds,  m, k ∈ N, 1 ≤ k ≤ 2^{m−1}.   (4)\n\nAssuming a balanced class distribution (P(Y = 0) = P(Y = 1) = 1/2), the optimal classification rule is g*(x) = 1 if and only if (X((2k−1)/2^m) − X((2k−2)/2^m)) + (X((2k−1)/2^m) − X(2k/2^m)) > 1/√(2^{m+1}). The optimal classification error is L* = 1 − normcdf(‖Φ′_{m,k}‖/2) = 1 − normcdf(1/2) ≈ 0.3085, where ‖·‖ denotes the L^2[0, 1] norm and normcdf(·) is the cumulative distribution function of the standard normal. The relevance function has a single maximum, at X((2k−1)/2^m). However, the Bayes classification rule involves three relevant variables, two of which are clearly not maxima of R^2(X(t), Y). In spite of the simplicity of these types of functional classification problems, they are important to analyze, because the set of functions Φ_{m,k}, with m > 0 and k > 0, forms an orthonormal basis of the Dirichlet space D[0, 1], the space of continuous functions whose derivatives are in L^2[0, 1]. Furthermore, this space is the reproducing kernel Hilbert space associated with Brownian motion and plays an important role in functional classification (Mörters and Peres, 2010; Berrendero et al., 2015). In fact, any trend in the Brownian process can be approximated by a linear combination or by a mixture of Φ_{m,k}(t).\n\nFigure 1: First row: Individual and average trajectories for the classification of B(t) vs. B(t) + 2Φ_{3,3}(t) initially (left) and after the first (center) and second (right) corrections. Second row: Values of R^2(X(t), Y) as a function of t. The variables required for optimal classification are marked with vertical dashed lines.\n\nTo illustrate the workings of maxima hunting and its limitations, we analyze in detail the classification problem B(t) vs. B(t) + 2Φ_{3,3}(t), which is of the type considered above. In this case, the optimal classification rule depends on the maximum X(5/8), and on X(1/2) and X(3/4), which are not maxima and would therefore not be selected by the MH algorithm. The optimal error is L* = 15.87%. To illustrate the importance of selecting all the relevant variables, we perform simulations in which we compare the accuracy of the linear Fisher discriminant with the maxima hunting selection and with the optimal variable selection procedure. In these experiments, independent training and test samples of size 1000 are generated. The values reported are averages over 100 independent runs. Standard deviations are given between parentheses. The average prediction error when only the maximum of the trajectories is considered is 37.63% (1.44%). When all three variables are used, the empirical error is 15.98% (1%), which is close to the Bayes error. 
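This synthetic family is easy to reproduce numerically. Below is a minimal sketch (our own illustrative code; the function names and the uniform discretization grid are assumptions, not part of the paper) that evaluates Φ_{m,k} of Equation (4) and generates labeled trajectories for the example B(t) vs. B(t) + 2Φ_{3,3}(t):

```python
import numpy as np

def phi(m, k, t):
    """Phi_{m,k}(t) of eq. (4): integral of a scaled Haar-type step function."""
    a, b, c = (2 * k - 2) / 2**m, (2 * k - 1) / 2**m, (2 * k) / 2**m
    amp = np.sqrt(2.0 ** (m - 1))
    up = np.clip(t, a, b) - a      # length of [0, t] inside (a, b] (rising part)
    down = np.clip(t, b, c) - b    # length of [0, t] inside (b, c] (falling part)
    return amp * (up - down)

def sample_problem(n_per_class, t, m=3, k=3, scale=2.0, rng=None):
    """Trajectories of B(t) (class 0) vs. B(t) + scale*Phi_{m,k}(t) (class 1),
    discretized on the grid t (assumed to start at 0)."""
    rng = np.random.default_rng(rng)
    dt = np.diff(t, prepend=0.0)
    # Brownian motion: cumulative sum of independent N(0, dt) increments
    B = (rng.normal(size=(2 * n_per_class, len(t))) * np.sqrt(dt)).cumsum(axis=1)
    y = np.repeat([0, 1], n_per_class)
    B[y == 1] += scale * phi(m, k, t)
    return B, y
```

For m = 3, k = 3, the mean difference 2Φ_{3,3} is a tent-shaped function supported on [1/2, 3/4] with its peak at t = 5/8, matching the averages shown in Figure 1.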
When other points in addition to the maximum are used (i.e., (X(t1), X(5/8), X(t2)), with t1 and t2 randomly chosen so that 0 ≤ t1 < 5/8 < t2 ≤ 1), the average classification error is 22.32% (2.18%). In the top leftmost plot of Figure 1, trajectories from both classes, together with the corresponding averages (thick lines), are shown. The relevance function R^2(X(t), Y) is plotted below. The relevant variables, which are required for optimal classification, are marked by dashed vertical lines.\n\n3 Recursive Maxima Hunting\n\nAs a variable selection process, MH avoids, at least partially, the redundancy introduced by the continuity of the functions that characterize the instances. However, this local approach cannot detect redundancies among different local maxima. Furthermore, there could be points in the trajectory that do not correspond to maxima of the relevance function, but which are relevant when considered jointly with the maxima. The goal of recursive maxima hunting (RMH) is to select the maxima of R^2(X(t), Y) in a recursive manner by removing at each step the information associated with the most recently selected maximum. This avoids the influence of previously selected maxima, which can obscure ulterior dependencies. The influence of a selected variable X(t0) on the rest of the trajectory can be eliminated by subtracting the conditional expectation E(X(t) | X(t0)) from X(t). Assuming that the underlying process is Brownian,\n\nE(X(t) | X(t0)) = (min(t, t0) / t0) X(t0), t ∈ [0, 1].   (5)\n\nIn the subsequent iterations, there are two intervals: [0, t0] and [t0, 1]. Conditioned on the value of X(t0), the process in the interval [t0, 1] is still a Brownian motion. 
By contrast, in the interval [0, t0] the process is a Brownian bridge, whose conditional expectation is\n\nE(X(t) | X(t0)) = ((min(t, t0) − t t0) / (t0 (1 − t0))) X(t0) = (t / t0) X(t0) if t < t0, and ((1 − t) / (1 − t0)) X(t0) if t > t0.   (6)\n\nAs illustrated by the results in the experimental section, the Brownian hypothesis is a robust assumption. Nevertheless, if additional information on the underlying stochastic processes is available, it can be incorporated into the algorithm during the calculation of the conditional expectation in Equations (5) and (6).\n\nThe center and right plots in Figure 1 illustrate the behavior of RMH in the example described in the previous section. The top center plot displays the trajectories and the corresponding averages (thick lines) for both classes after applying the correction (5) with t0 = 5/8, which is the first maximum of the distance correlation function (bottom leftmost plot in Figure 1). The variable X(5/8) is clearly uninformative once this correction has been applied. The distance correlation R^2(X(t), Y) for the corrected trajectories is displayed in the bottom center plot. Also in this plot, the relevant variables are marked by vertical dashed lines. It is clear that the subsequent local maxima, at t = 1/2 in the subinterval [0, 5/8] and at t = 3/4 in the subinterval [5/8, 1], correspond to the remaining relevant variables. The last column shows the corresponding plots after the correction is applied anew (Equation (6) with t0 = 1/2 in [0, 5/8] and Equation (5) with t0 = 3/4 in [5/8, 1]). After this second correction, the discriminant information has been removed. 
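On discretized trajectories, both corrections amount to subtracting a weighted copy of the selected column from every other column. A possible sketch of Equations (5) and (6) (our own illustration; the array layout and function names are assumptions):

```python
import numpy as np

def brownian_correction(X, t, t0_idx):
    """Subtract E(X(t) | X(t0)) assuming Brownian motion, eq. (5).

    X: (n_samples, n_points) discretized trajectories; t: grid in [0, 1];
    t0_idx: index of the selected point (t0 must be > 0)."""
    t0 = t[t0_idx]
    weights = np.minimum(t, t0) / t0               # min(t, t0) / t0
    return X - X[:, [t0_idx]] * weights

def bridge_correction(X, t, t0_idx):
    """Subtract E(X(t) | X(t0)) for a Brownian bridge on [0, 1], eq. (6).

    t0 must be strictly inside (0, 1)."""
    t0 = t[t0_idx]
    weights = (np.minimum(t, t0) - t * t0) / (t0 * (1.0 - t0))
    return X - X[:, [t0_idx]] * weights
```

In both cases the corrected trajectories vanish exactly at t0, which is what makes the selected variable "uninformative once this correction has been applied".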
In consequence, the distance correlation function, up to sample fluctuations, is zero.\n\nAn important issue in the application of this method is how to decide when to stop the recursive search. The goal is to avoid including irrelevant and/or redundant variables. To address the first problem, we only include maxima that are sufficiently prominent, i.e., such that R^2(X(tmax), Y) > s, where 0 < s < 1 can be used to gauge the relative importance of the maximum. Redundancy is avoided by excluding points around a selected maximum tmax for which R^2(X(tmax), X(t)) ≥ r, for some redundancy threshold 0 < r < 1, which is typically close to one. As a result of these two conditions, only a finite (typically small) number of variables is selected. This data-driven stopping criterion avoids the need to set the number of selected variables beforehand or to determine this number by a costly validation procedure. The sensitivity of the results to the values of r and s is studied in Section 4. Nonetheless, RMH exhibits good and robust performance for a wide range of reasonable values of these parameters (r close to 1 and s close to 0). 
The pseudocode of the RMH algorithm is given in Algorithm 1.\n\nAlgorithm 1 Recursive Maxima Hunting\n1: function RMH(X(t), Y)\n2:   t* ← [ ]  ▷ Vector of selected points, initially empty\n3:   RMH_rec(X(t), Y, 0, 1)  ▷ Recursive search of the maxima of R^2(X(t), Y)\n4:   return t*  ▷ Vector of selected points\n5: end function\n6: procedure RMH_rec(X(t), Y, tinf, tsup)\n7:   tmax ← argmax_{tinf ≤ t ≤ tsup} {R^2(X(t), Y)}\n8:   if R^2(X(tmax), Y) > s then\n9:     t* ← [t* tmax]  ▷ Include tmax in t*, the vector of selected points\n10:    X(t) ← X(t) − E(X(t) | X(tmax)), t ∈ [tinf, tsup]  ▷ Correction of type (5) or (6), as required\n11:  else\n12:    return\n13:  end if\n14:  ▷ Exclude redundant points to the left of tmax\n15:  t−max ← max_{tinf ≤ t < tmax} {t : R^2(X(tmax), X(t)) ≤ r}\n16:  if t−max > tinf then\n17:    RMH_rec(X(t), Y, tinf, t−max)\n18:  end if\n19:  ▷ Exclude redundant points to the right of tmax\n20:  t+max ← min_{tmax < t ≤ tsup} {t : R^2(X(tmax), X(t)) ≤ r}\n21:  if t+max < tsup then\n22:    RMH_rec(X(t), Y, t+max, tsup)\n23:  end if\n24: end procedure
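For concreteness, the whole procedure can be sketched in a few lines of Python for discretized trajectories. This is our own illustrative implementation, not the authors' code: in place of the closed-form corrections (5)-(6) it subtracts the empirical least-squares projection onto the selected value, which coincides with the conditional expectation for centered Gaussian processes; the names and default thresholds are assumptions.

```python
import numpy as np

def _dcor2(x, y):
    """Empirical squared distance correlation between two 1-D samples."""
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    vx, vy = (A * A).mean(), (B * B).mean()
    return 0.0 if vx * vy == 0 else (A * B).mean() / np.sqrt(vx * vy)

def rmh(X, y, s=0.1, r=0.9):
    """Recursive maxima hunting on discretized trajectories.

    X: (n, p) array, one trajectory per row; y: binary class labels.
    Returns the grid indices of the selected points, in selection order."""
    X = np.array(X, dtype=float)
    y = np.asarray(y, dtype=float)
    selected = []

    def rec(lo, hi):
        if hi < lo:
            return
        rel = [_dcor2(X[:, j], y) for j in range(lo, hi + 1)]
        jmax = lo + int(np.argmax(rel))
        if rel[jmax - lo] <= s:        # maximum not prominent enough: stop
            return
        selected.append(jmax)
        xm = X[:, jmax].copy()
        # redundancy bounds: first points on each side with R^2(X(tmax), X(t)) <= r
        tl = jmax - 1
        while tl >= lo and _dcor2(xm, X[:, tl]) > r:
            tl -= 1
        tr = jmax + 1
        while tr <= hi and _dcor2(xm, X[:, tr]) > r:
            tr += 1
        # remove the information of the selected value: empirical stand-in for
        # E(X(t) | X(tmax)), exact for centered Gaussian processes
        var = xm @ xm
        if var > 0:
            X[:, lo:hi + 1] -= np.outer(xm, xm @ X[:, lo:hi + 1] / var)
        rec(lo, tl)          # recurse on the subinterval to the left
        rec(tr, hi)          # and to the right of the selected maximum
    rec(0, X.shape[1] - 1)
    return selected
```

As in Algorithm 1, the recursion stops on its own once no remaining maximum exceeds the relevance threshold s, so the number of selected variables is determined by the data rather than fixed in advance.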