{"title": "Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity", "book": "Advances in Neural Information Processing Systems", "page_first": 996, "page_last": 1002, "abstract": "", "full_text": "Nearest Neighbor Based Feature Selection for\n\nRegression and its Application to Neural Activity\n\nAmir Navot12 Lavi Shpigelman12 Naftali Tishby12 Eilon Vaadia23\n\n1School of computer Science and Engineering\n\n2Interdisciplinary Center for Neural Computation\n3Dept. of Physiology, Hadassah Medical School\nThe Hebrew University Jerusalem, 91904, Israel\n\nEmail for correspondence: {anavot,shpigi}@cs.huji.ac.il\n\nAbstract\n\nWe present a non-linear, simple, yet effective, feature subset selection\nmethod for regression and use it in analyzing cortical neural activity. Our\nalgorithm involves a feature-weighted version of the k-nearest-neighbor\nalgorithm. It is able to capture complex dependency of the target func-\ntion on its input and makes use of the leave-one-out error as a natural\nregularization. We explain the characteristics of our algorithm on syn-\nthetic problems and use it in the context of predicting hand velocity from\nspikes recorded in motor cortex of a behaving monkey. By applying fea-\nture selection we are able to improve prediction quality and suggest a\nnovel way of exploring neural data.\n\n1 Introduction\n\nIn many supervised learning tasks the input is represented by a very large number of fea-\ntures, many of which are not needed for predicting the labels. Feature selection is the task\nof choosing a small subset of features that is suf\ufb01cient to predict the target labels well. Fea-\nture selection reduces the computational complexity of learning and prediction algorithms\nand saves on the cost of measuring non selected features. 
In many situations, feature selection can also enhance the prediction accuracy by improving the signal-to-noise ratio. Another benefit of feature selection is that the identity of the selected features can provide insight into the nature of the problem at hand. Feature selection is therefore an important step in efficient learning of large multi-featured data sets.

Feature selection (variously known as subset selection, attribute selection or variable selection) has been studied extensively both in statistics and by the machine learning community over the last few decades. In the most common selection paradigm an evaluation function is used to assign scores to subsets of features and a search algorithm is used to search for a subset with a high score. The evaluation function can be based on the performance of a specific predictor (wrapper model, [1]) or on some general (typically cheaper to compute) relevance measure of the features to the prediction (filter model). In any case, an exhaustive search over all feature sets is generally intractable due to the exponentially large number of possible sets. Therefore, search methods are employed which apply a variety of heuristics, such as hill climbing and genetic algorithms. Other methods simply rank individual features, assigning a score to each feature independently. These methods are usually very fast, but inevitably fail in situations where only a combined set of features is predictive of the target function. See [2] for a comprehensive overview of feature selection and [3] for a discussion of selection methods for linear regression.

A possible choice of evaluation function is the leave-one-out (LOO) mean square error (MSE) of the k-nearest-neighbor (kNN) estimator ([4, 5]). This evaluation function has the advantage that it both gives a good approximation of the expected generalization error and can be computed quickly.
[6] used this criterion on small synthetic problems (up to 12 features). They searched for good subsets using forward selection, backward elimination and an algorithm (called schemata) that races feature sets against each other (eliminating poor sets, keeping the fittest) in order to find a subset with a good score. All these algorithms perform a local search by flipping one or more features at a time. Since the space is discrete, the direction of improvement is found by trial and error, which slows the search and makes it impractical for large scale real world problems involving many features.

In this paper we develop a novel selection algorithm. We extend the LOO-kNN-MSE evaluation function to assign scores to weight vectors over the features, instead of just to feature subsets. This results in a smooth ("almost everywhere") function over a continuous domain, which allows us to compute the gradient analytically and to employ a stochastic gradient ascent to find a locally optimal weight vector. The resulting weights provide a ranking of the features, which we can then threshold in order to produce a subset. In this way we can apply an easy-to-compute, gradient directed search, without relearning a regression model at each step, while employing a strong non-linear function estimator (kNN) that can capture complex dependency of the function on its features1.

Our motivation for developing this method is to address a major computational neuroscience question: which features of the neural code are relevant to the observed behavior. This is an important element of enabling interpretability of neural activity, and feature selection is a promising tool for this task. Here, we apply our feature selection method to the task of reconstructing hand movements from neural activity, which is one of the main challenges in implementing brain computer interfaces [8].
We look at neural population spike counts, recorded in the motor cortex of a monkey while it performed hand movements, and locate the most informative subset of neural features. We show that it is possible to improve prediction results by wisely selecting a subset of cortical units and their time lags relative to the movement. Our algorithm, which considers feature subsets, outperforms methods that consider features on an individual basis, suggesting that complex dependency on a set of features exists in the code.

The remainder of the paper is organized as follows: we describe the problem setting in section 2. Our method is presented in section 3. Next, we demonstrate its ability to cope with a complicated dependency of the target function on groups of features using synthetic data (section 4). The results of applying our method to the hand movement reconstruction problem are presented in section 5.

2 Problem Setting

First, let us introduce some notation. Vectors in R^n are denoted by boldface small letters (e.g. x, w). Scalars are denoted by small letters (e.g. x, y). The i'th element of a vector x is denoted by x_i. Let f(x), f : R^n -> R be a function that we wish to estimate. Given a set S \subset R^n, the empirical mean square error (MSE) of an estimator \hat{f} for f is defined as

MSE_S(\hat{f}) = \frac{1}{|S|} \sum_{x \in S} \left( f(x) - \hat{f}(x) \right)^2 .

1The design of this algorithm was inspired by work done by Gilad-Bachrach et al. ([7]), which used a large margin based evaluation function to derive feature selection algorithms for classification.

kNN Regression. k-Nearest-Neighbor (kNN) is a simple, intuitive and efficient way to estimate the value of an unknown function at a given point using its values at other (training) points. Let S = {x_1, . . . , x_m} be a set of training points.
The kNN estimator is defined as the mean function value of the nearest neighbors: \hat{f}(x) = \frac{1}{k} \sum_{x' \in N(x)} f(x'), where N(x) \subset S is the set of k nearest points to x in S and k is a parameter ([4, 5]). A softer version takes a weighted average, where the weight of each neighbor is proportional to its proximity. One specific way of doing this is

\hat{f}(x) = \frac{1}{Z} \sum_{x' \in N(x)} f(x') e^{-d(x,x')/\beta}     (1)

where d(x, x') = ||x - x'||_2^2 is the squared l2 norm, Z = \sum_{x' \in N(x)} e^{-d(x,x')/\beta} is a normalization factor and \beta is a parameter. The soft kNN version will be used in the remainder of this paper. This regression method is a special form of locally weighted regression (see [5] for an overview of the literature on this subject). It has the desirable property that no learning (other than storage of the training set) is required for the regression. Also note that the Gaussian radial basis function has the form of a kernel ([9]) and can be replaced with any operator on two data points that decays as a function of the difference between them (e.g. kernel induced distances). As will be seen in the next section, we use the MSE of a modified kNN regressor to guide the search for a set of features F \subset {1, . . . , n} that achieves a low MSE. However, the MSE and the Gaussian kernel can be replaced by other loss measures and kernels (respectively) as long as they are differentiable almost everywhere.

3 The Feature Selection Algorithm

In this section we present our selection algorithm, called RGS (Regression, Gradient guided, feature Selection). It can be seen as a filter method for general regression algorithms or as a wrapper for estimation by the kNN algorithm. Our goal is to find subsets of features that induce a small estimation error.
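The soft kNN estimator of equation 1 can be written in a few lines. The following is a minimal Python illustration (the function and variable names are ours, not taken from the paper's Matlab code):

```python
import numpy as np

def soft_knn_predict(X_train, y_train, x, k=5, beta=1.0):
    """Soft kNN estimate (equation 1): a Gaussian-weighted average of the
    function values of the k nearest training points."""
    d = np.sum((X_train - x) ** 2, axis=1)      # squared l2 distances to x
    nn = np.argsort(d)[:k]                      # indices of the k nearest neighbors
    a = np.exp(-d[nn] / beta)                   # Gaussian kernel weights
    return np.dot(a, y_train[nn]) / np.sum(a)   # normalized weighted mean
```

Note that the hard kNN estimator is recovered as beta grows large, since all k kernel weights then approach equality.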
As in most supervised learning problems, we wish to find subsets that induce a small generalization error, but since it is not known, we use an evaluation function on the training set. This evaluation function is defined not only for subsets but for any weight vector over the features. This is more general, because a feature subset can be represented by a binary weight vector that assigns a value of one to features in the set and zero to the rest of the features.

For a given weight vector over the features w \in R^n, we consider the weighted squared l2 norm induced by w, defined as ||z||_w^2 = \sum_i z_i^2 w_i^2. Given a training set S, we denote by \hat{f}_w(x) the value assigned to x by a weighted kNN estimator, defined in equation 1, using the weighted squared l2 norm as the distance d(x, x'), where the nearest neighbors are found among the points of S excluding x. The evaluation function is defined as the negative (halved) square error of the weighted kNN estimator:

e(w) = -\frac{1}{2} \sum_{x \in S} \left( f(x) - \hat{f}_w(x) \right)^2 .     (2)

This evaluation function scores weight vectors (w). A change of weights will cause a change in the distances and, possibly, the identity of each point's nearest neighbors, which will change the function estimates. A weight vector that induces a distance measure in which neighbors have similar labels receives a high score. The mean 1/|S| is replaced with 1/2 to ease later differentiation. Note that there is no explicit regularization term in e(w). This is justified by the fact that for each point, the estimate of its function value does not include that point as part of the training set. Thus, equation 2 is a leave-one-out cross validation error.
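Equation 2 translates directly into code. The sketch below is a brute-force O(m^2 n) evaluation of e(w) (names and the numpy-array data layout are our assumptions, not the paper's implementation):

```python
import numpy as np

def loo_eval(X, y, w, k=5, beta=1.0):
    """Negative halved leave-one-out squared error of the weighted kNN
    estimator (equation 2), with distances ||z||_w^2 = sum_i z_i^2 w_i^2."""
    err = 0.0
    for i in range(len(X)):
        z = (X - X[i]) * w                  # weighted differences to point i
        d = np.sum(z ** 2, axis=1)          # weighted squared distances
        d[i] = np.inf                       # leave-one-out: exclude the point itself
        nn = np.argsort(d)[:k]
        a = np.exp(-d[nn] / beta)
        fhat = np.dot(a, y[nn]) / np.sum(a)
        err += (y[i] - fhat) ** 2
    return -0.5 * err
```

As expected, a weight vector that emphasizes a relevant feature scores higher than one that emphasizes an irrelevant feature.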
Clearly, it is impossible to go over all the weight vectors (or even over all the feature subsets), and therefore some search technique is required.

Algorithm 1 RGS(S, k, \beta, T)

1. initialize w = (1, 1, . . . , 1)
2. for t = 1 . . . T
   (a) pick randomly an instance x from S
   (b) calculate the gradient of e(w):

       \nabla e(w) = \sum_{x \in S} \left( f(x) - \hat{f}_w(x) \right) \nabla_w \hat{f}_w(x)

       \nabla_w \hat{f}_w(x) = -\frac{4}{\beta} \, \frac{\sum_{x'', x' \in N(x)} f(x'') \, a(x', x'') \, u(x', x'')}{\sum_{x'', x' \in N(x)} a(x', x'')}

       where a(x', x'') = e^{-(||x - x'||_w^2 + ||x - x''||_w^2)/\beta} and u(x', x'') \in R^n is a vector with u_i = w_i [(x_i - x'_i)^2 + (x_i - x''_i)^2].
   (c) w = w + \eta_t \nabla e(w), where \eta_t is a (possibly decaying) step size.

Our method finds a weight vector w that locally maximizes e(w) as defined in (2) and then uses a threshold in order to obtain a feature subset. The threshold can be set either by cross validation or by finding a natural cutoff in the weight values. However, we later show that using the distance measure induced by w in the regression stage compensates for taking too many features. Since e(w) is defined over a continuous domain and is smooth almost everywhere, we can use gradient ascent in order to maximize it. RGS (algorithm 1) is a stochastic gradient ascent over e(w). In each step the gradient is evaluated using one sample point and is added to the current weight vector. RGS considers the weights of all the features at the same time and thus it can handle dependency on a group of features. This is demonstrated in section 4. In this respect, it is superior to selection algorithms that score each feature independently.
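The RGS loop can be sketched as follows. For brevity, this illustration approximates the gradient of e(w) by central finite differences on the full evaluation function of equation 2 (repeated here so the sketch is self-contained) rather than using the analytic per-sample gradient of Algorithm 1, so it is a slow but transparent stand-in; all names are ours:

```python
import numpy as np

def loo_eval(X, y, w, k=5, beta=1.0):
    """e(w) of equation 2: negative halved LOO error of weighted kNN."""
    err = 0.0
    for i in range(len(X)):
        d = np.sum(((X - X[i]) * w) ** 2, axis=1)
        d[i] = np.inf                       # leave the query point out
        nn = np.argsort(d)[:k]
        a = np.exp(-d[nn] / beta)
        err += (y[i] - np.dot(a, y[nn]) / np.sum(a)) ** 2
    return -0.5 * err

def rgs_sketch(X, y, T=30, k=5, beta=1.0, eta=0.5, eps=0.05):
    """Gradient ascent on e(w), with a finite-difference gradient."""
    n = X.shape[1]
    w = np.ones(n)                          # step 1: uniform initial weights
    for _ in range(T):
        grad = np.empty(n)
        for i in range(n):                  # central difference per weight
            wp, wm = w.copy(), w.copy()
            wp[i] += eps
            wm[i] -= eps
            grad[i] = (loo_eval(X, y, wp, k, beta)
                       - loo_eval(X, y, wm, k, beta)) / (2 * eps)
        w = w + eta * grad                  # ascent step (cf. step 2(c))
    return w                                # larger weights = more relevant
```

The finite-difference version costs two full evaluations per feature per iteration, which is exactly the overhead the analytic gradient of Algorithm 1 avoids.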
It is also faster than methods that try to find a good subset directly by trial and error. Note, however, that convergence to a global optimum is not guaranteed, and standard techniques for avoiding local optima can be used.

The parameters of the algorithm are k (number of neighbors), \beta (Gaussian decay factor), T (number of iterations) and {\eta_t}, t = 1 . . . T (step size decay scheme). The value of k can be tuned by cross validation; however, a proper choice of \beta can compensate for a k that is too large. It makes sense to tune \beta to a value that places most neighbors in an active zone of the Gaussian. In our experiments, we set \beta to half of the mean distance between points and their k neighbors. It usually makes sense to use an \eta_t that decays over time to ensure convergence; however, on our data, convergence was also achieved with \eta_t = 1.

The computational complexity of RGS is \Theta(TNm), where T is the number of iterations, N is the number of features and m is the size of the training set S. This holds for a naive implementation which finds the nearest neighbors and their distances from scratch at each step by measuring the distances between the current point and all the other points. RGS is basically an online method which can be used in batch mode by running it in epochs on the training set. When it is run for only one epoch, T = m and the complexity is \Theta(m^2 N). Matlab code for this algorithm (and those that we compare with) is available at http://www.cs.huji.ac.il/labs/learning/code/fsr/

4 Testing on synthetic data

The use of synthetic data, where we can control the importance of each feature, allows us to illustrate the properties of our algorithm.
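The synthetic problems used in this section can be generated as follows. This is a sketch; the generator API and the target-name keys are our own, but the four target functions and the noise level match the ones described below (and listed in the caption of figure 2):

```python
import numpy as np

def make_synthetic(m, n=50, target="xor", noise_var=1/7, seed=0):
    """Draw m points from the [-1, 1]^n cube and label them with one of the
    four synthetic targets, plus zero-mean Gaussian noise of variance 1/7.
    Only feature 0 (and, for the last two targets, feature 1) is relevant."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, size=(m, n))
    targets = {
        "square": X[:, 0] ** 2,
        "sin":    np.sin(2 * np.pi * X[:, 0] + np.pi / 2),
        "sum":    np.sin(2 * np.pi * X[:, 0] + np.pi / 2) + X[:, 1],
        "xor":    np.sin(2 * np.pi * X[:, 0]) * np.sin(2 * np.pi * X[:, 1]),
    }
    y = targets[target] + rng.normal(0.0, np.sqrt(noise_var), size=m)
    return X, y
```

The "xor" target is the smoothed parity problem: each relevant feature alone is uncorrelated with the label, so only methods that weigh features jointly can detect it.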
We compare our algorithm with other common selection methods: infoGain [10], correlation coefficients (corrcoef) and forward selection (see [2]).

Figure 1: (a)-(d): Illustration of the four synthetic target functions. The plots show the function value as a function of the first two features. (e),(f): demonstration of the effect of feature selection on estimating the second function using kNN regression (k = 5, \beta = 0.05). (e) using both features (mse = 0.03), (f) using the relevant feature only (mse = 0.004).

infoGain and corrcoef simply rank features according to the mutual information2 or the correlation coefficient (respectively) between each feature and the labels (i.e. the target function value). Forward selection (fwdSel) is a greedy method in which features are iteratively added into a growing subset. In each step, the feature showing the greatest improvement (given the previously selected subset) is added. This is a search method that can be applied to any evaluation function, and we use our criterion (equation 2 on feature subsets). This well known method has the advantage of considering feature subsets, and it can be used with non-linear predictors. Another algorithm we compare with scores each feature independently using our evaluation function (2). This helps us in analyzing RGS, as it may single out the respective contributions of the evaluation function and the search method to performance. We refer to this algorithm as SKS (Single feature, kNN regression, feature Selection).

We look at four different target functions over R^50.
The training sets include 20 to 100 points chosen randomly from the [-1, 1]^50 cube. The target functions are given in the top row of figure 2 and are illustrated in figure 1(a-d). Random Gaussian noise with zero mean and a variance of 1/7 was added to the function value of the training points. Clearly, only the first feature is relevant for the first two target functions, and only the first two features are relevant for the last two target functions. Note also that the last function is a smoothed version of the parity function learning problem and is considered hard for many feature selection algorithms [2].

First, to illustrate the importance of feature selection for regression quality, we use kNN to estimate the second target function. Figure 1(e-f) shows the regression results for target (b), using either only the relevant feature or both the relevant and an irrelevant feature. The addition of one irrelevant feature degrades the MSE ten fold. Next, to demonstrate the capabilities of the various algorithms, we run them on each of the above problems with varying training set size. We measure their success by counting the number of times that the relevant features were assigned the highest rank (repeating the experiment 250 times by re-sampling the training set). Figure 2 presents the success rate as a function of training set size. We can see that all the algorithms succeed on the first function, which is monotonic and depends on one feature alone. infoGain and corrcoef fail on the second, non-monotonic function. The three kNN based algorithms succeed because they depend only on local properties of the target function. We see, however, that RGS needs a larger training set to achieve a high success rate. The third target function depends on two features, but the dependency is simple, as each of them alone is highly correlated with the function value.
The fourth, XOR-like function exhibits a complicated dependency that requires consideration of the two relevant features simultaneously. SKS, which considers features separately, sees the effect of all other features as noise and therefore has only marginal success on the third function and fails on the fourth altogether. RGS and fwdSel apply different search methods. fwdSel considers subsets but can evaluate only one additional feature in each step, giving it some advantage over RGS on the third function but causing it to fail on the fourth. RGS takes a step in all features simultaneously. Only such an approach can succeed on the fourth function.

2Feature and function values were "binarized" by comparing them to the median value.

Figure 2: Success rate of the different algorithms on 4 synthetic regression tasks, (a) x_1^2, (b) sin(2\pi x_1 + \pi/2), (c) sin(2\pi x_1 + \pi/2) + x_2, (d) sin(2\pi x_1) sin(2\pi x_2), averaged over 250 repetitions, as a function of the number of training examples. Success is measured by the percent of the repetitions in which the relevant feature(s) received first place(s).

5 Hand Movements Reconstruction from Neural Activity

To suggest an interpretation of neural coding, we apply RGS and compare it with the alternatives presented in the previous section3 on the hand movement reconstruction task. The data sets were collected while a monkey performed a planar center-out reaching task with one or both hands [11].
16 electrodes, inserted daily into novel positions in primary motor cortex, were used to detect and sort spikes in up to 64 channels (4 per electrode). Most of the channels detected isolated neuronal spikes by template matching. Some, however, had templates that were not tuned, producing spikes during only a fraction of the session. Others (about 25%) contained unused templates (resulting in a constantly zero channel or, possibly, a few random spikes). The rest of the channels (one per electrode) produced spikes by threshold crossing. We construct a labeled regression data set as follows. Each example corresponds to one time point in a trial. It consists of the spike counts that occurred in the 10 previous consecutive 100ms long time bins from all 64 channels (64 x 10 = 640 features), and the label is the X or Y component of the instantaneous hand velocity. We analyze data collected over 8 days. Each data set has an average of 5050 examples collected during the movement periods of the successful trials.

In order to evaluate the different feature selection methods we separate the data into training and test sets. Each selection method is used to produce a ranking of the features. We then apply kNN (based on the training set) to the test set, using different sized groups of top ranking features. We use the resulting MSE (or the correlation coefficient between true and estimated movement) as our measure of quality. To test the significance of the results we apply 5-fold cross validation and repeat the process 5 times on different permutations of the trial ordering. Figure 3 shows the average (over permutations, folds and velocity components) MSE as a function of the number of selected features on four of the data sets (results on the rest are similar and omitted due to lack of space)4.
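The construction of the 640-feature examples described above (10 lagged 100 ms bins from each of 64 channels) can be sketched as follows; the array shapes and names are our assumptions about the preprocessed recordings, not the actual data format:

```python
import numpy as np

def build_lagged_features(counts, velocity, n_lags=10):
    """Build a lagged regression data set: each example holds the spike
    counts of the n_lags previous time bins from every channel, and the
    label is one instantaneous hand-velocity component.
    counts: array (n_bins, n_channels); velocity: array (n_bins,)."""
    X, y = [], []
    for t in range(n_lags, len(counts)):
        X.append(counts[t - n_lags:t].ravel())  # n_channels * n_lags features
        y.append(velocity[t])
    return np.asarray(X), np.asarray(y)
```

With 64 channels and 10 lags this yields the 640 features per example used in the experiments; the same data set is built twice, once for each velocity component.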
It is clear that RGS achieves better results than the other methods throughout the range of feature numbers. To test whether the performance of RGS was consistently better than that of the other methods, we counted winning percentages (the percent of the times in which RGS achieved a lower MSE than another algorithm) in all folds of all data sets and as a function of the number of features used.

3fwdSel was not applied due to its intractably high run time complexity. Note that its run time is at least r times that of RGS, where r is the size of the optimal set, and is longer in practice.

4We use k = 50 (approximately 1% of the data points). \beta is set automatically as described in section 3. These parameters were manually tuned for good kNN results and were not optimized for any of the feature selection algorithms. The number of epochs for RGS was set to 1 (i.e. T = m).

Figure 3: MSE results for the different feature selection methods on the neural activity data sets. Each sub-figure is a different recording day. MSEs are presented as a function of the number of features used. Each point is a mean over all 5 cross validation folds, 5 permutations of the data and the two velocity component targets. Note that some of the data sets are harder than others.

Figure 4 shows the winning percentages of RGS versus the other methods. For a very low number of features, while the error is still high, RGS winning scores are only slightly better than chance, but once there are enough features for good predictions the winning percentages are higher than 90%. In figure 3 we see that the MSE achieved when using only approximately 100 features selected by RGS is better than when using all the features.
This difference is indeed statistically significant (win score of 92%). If the MSE is replaced by the correlation coefficient as the measure of quality, the average results (not shown due to lack of space) are qualitatively unchanged.

RGS not only ranks the features but also gives them weights that achieve locally optimal results when using kNN regression. It therefore makes sense not only to select the features but also to weigh them accordingly. Figure 5 shows the winning percentages of RGS using the weighted features versus RGS using uniformly weighted features. The corresponding MSEs (with and without weights) on the first data set are also displayed. It is clear that using the weights improves the results in a manner that becomes increasingly significant as the number of features grows, especially when the number of features is greater than the optimal number. Thus, using weighted features can compensate for choosing too many, by diminishing the effect of the surplus features.

To take a closer look at which features are selected, figure 6 shows the 100 highest ranking features for all algorithms on one data set. Similar selection results were obtained in the rest of the folds. One would expect to find that well isolated cells (template matching) are more informative than threshold based spikes. Indeed, all the algorithms select isolated cells more frequently within the top 100 features (RGS does so 95% of the time and the rest 70%-80%). A human selection of channels, based only on looking at raster plots and selecting channels with stable firing rates, was also available to us. This selection was independent of the template/threshold categorization. Once again, the algorithms selected the humanly preferred channels more frequently than the other channels.
Another, more interesting, observation that can also be seen in the figure is that while corrcoef, SKS and infoGain tend to select all time lags of a channel, RGS's selections are more scattered (more channels and only a few time bins per channel). Since RGS achieves the best results, we conclude that this selection pattern is useful. Apparently RGS found these patterns thanks to its ability to evaluate complex dependency on feature subsets. This suggests that such dependency of the behavior on the neural activity does exist.

Figure 4: Winning percentages of RGS over the other algorithms. RGS achieves better MSEs consistently.

Figure 5: Winning percentages of RGS with and without weighting of features (black). Gray lines are the corresponding MSEs of these methods on the first data set.

Figure 6: 100 highest ranking features (grayed out) selected by the algorithms. Results are for one fold of one data set. In each sub-figure the bottom row is the (100ms) time bin with the least delay and the higher rows correspond to longer delays. Each column is a channel (silent channels omitted).

6 Summary

In this paper we present a new method for selecting features for function estimation and use it to analyze neural activity during a motor control task. We use the leave-one-out mean squared error of the kNN estimator and minimize it by gradient ascent on an "almost everywhere" smooth function.
This yields a selection method which can handle a complicated dependency of the target function on groups of features, yet can be applied to large scale problems. This is valuable since many common selection methods lack one of these properties. By comparing the results of our method with those of other selection methods on the motor control task, we show that consideration of complex dependency helps to achieve better performance. These results suggest that this is an important property of the code.

Our future work is aimed at a better understanding of neural activity through the use of feature selection. One possibility is to perform feature selection on other kinds of neural data, such as local field potentials or retinal activity. Another promising option is to explore the temporally changing properties of neural activity. Motor control is a dynamic process in which the input-output relation has a temporally varying structure. RGS can be used in online (rather than batch) mode to identify these structures in the code.

References

[1] R. Kohavi and G.H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273-324, 1997.

[2] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. JMLR, 2003.

[3] A.J. Miller. Subset Selection in Regression. Chapman and Hall, 1990.

[4] L. Devroye. The uniform convergence of nearest neighbor regression function estimators and their application in optimization. IEEE Transactions on Information Theory, 24(2), 1978.

[5] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning. AI Review, 11.

[6] O. Maron and A. Moore. The racing algorithm: Model selection for lazy learners. In Artificial Intelligence Review, volume 11, pages 193-225, April 1997.

[7] R. Gilad-Bachrach, A. Navot, and N. Tishby. Margin based feature selection - theory and algorithms. In Proc. 21st International Conference on Machine Learning (ICML), pages 337-344, 2004.

[8] D. M. Taylor, S. I.
Tillery, and A. B. Schwartz. Direct cortical control of 3D neuroprosthetic devices. Science, 296(7):1829-1832, 2002.

[9] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.

[10] J. R. Quinlan. Induction of decision trees. In Jude W. Shavlik and Thomas G. Dietterich, editors, Readings in Machine Learning. Morgan Kaufmann, 1990. Originally published in Machine Learning 1:81-106, 1986.

[11] R. Paz, T. Boraud, C. Natan, H. Bergman, and E. Vaadia. Preparatory activity in motor cortex reflects learning of local visuomotor skills. Nature Neuroscience, 6(8):882-890, August 2003.
", "award": [], "sourceid": 2848, "authors": [{"given_name": "Amir", "family_name": "Navot", "institution": null}, {"given_name": "Lavi", "family_name": "Shpigelman", "institution": null}, {"given_name": "Naftali", "family_name": "Tishby", "institution": null}, {"given_name": "Eilon", "family_name": "Vaadia", "institution": null}]}