{"title": "Data Visualization and Feature Selection: New Algorithms for Nongaussian Data", "book": "Advances in Neural Information Processing Systems", "page_first": 687, "page_last": 693, "abstract": "", "full_text": "Data Visualization and Feature Selection: \n\nNew Algorithms for Nongaussian Data \n\nHoward Hua Yang and John Moody \n\nOregon Graduate Institute of Science and Technology \n20000 NW, Walker Rd., Beaverton, OR97006, USA \n\nhyang@ece.ogi.edu, moody@cse.ogi.edu, FAX:503 7481406 \n\nAbstract \n\nData visualization and feature selection methods are proposed \nbased on the )oint mutual information and ICA. The visualization \nmethods can find many good 2-D projections for high dimensional \ndata interpretation, which cannot be easily found by the other ex(cid:173)\nisting methods. The new variable selection method is found to be \nbetter in eliminating redundancy in the inputs than other methods \nbased on simple mutual information. The efficacy of the methods \nis illustrated on a radar signal analysis problem to find 2-D viewing \ncoordinates for data visualization and to select inputs for a neural \nnetwork classifier. \nKeywords: feature selection, joint mutual information, ICA, vi(cid:173)\nsualization, classification. \n\n1 \n\nINTRODUCTION \n\nVisualization of input data and feature selection are intimately related. A good \nfeature selection algorithm can identify meaningful coordinate projections for low \ndimensional data visualization. Conversely, a good visualization technique can sug(cid:173)\ngest meaningful features to include in a model. \n\nInput variable selection is the most important step in the model selection process. \nGiven a target variable, a set of input variables can be selected as explanatory \nvariables by some prior knowledge. However, many irrelevant input variables cannot \nbe ruled out by the prior knowledge. 
Too many input variables irrelevant to the target variable will not only severely complicate the model selection/estimation process but also damage the performance of the final model. \n\nSelecting input variables after model specification is a model-dependent approach [6]. However, such methods can be very slow if the model space is large. To reduce the computational burden of the estimation and selection processes, we need model-independent approaches that select input variables before model specification. One such approach is the δ-test [7]. Other approaches are based on the mutual information (MI) [2, 3, 4], which is very effective for evaluating the relevance of each input variable but fails to eliminate redundant variables. \n\nIn this paper, we focus on a model-independent approach to input variable selection based on the joint mutual information (JMI). The increment from MI to JMI is the conditional MI. Although the conditional MI was used in [4] to show the monotonic property of the MI, it was not used for input selection. \n\nData visualization is very important for humans to understand the structural relations among variables in a system. It is also a critical step for eliminating unrealistic models. We give two methods for data visualization. One is based on the JMI and the other on Independent Component Analysis (ICA). Both methods perform better on nongaussian data than existing methods based on PCA and canonical correlation analysis (CCA). \n\n2 Joint mutual information for input/feature selection \n\nLet Y be a target variable and the Xi's be inputs. The relevance of a single input is measured by the MI \n\nI(Xi; Y) = K(p(Xi, Y) || p(Xi)p(Y)) \n\nwhere K(p||q) is the Kullback-Leibler divergence between two probability functions p and q, defined by K(p(x) || q(x)) = sum_x p(x) log(p(x)/q(x)). 
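The definitions above can be estimated directly from data. As an illustration (our own sketch, not code from the paper), the following estimates I(X; Y) for two continuous variables by binning them into a 2-D histogram and applying the Kullback-Leibler formula:

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram estimate of I(X; Y) = K(p(x, y) || p(x)p(y)), in bits."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                   # empirical joint probability table
    px = pxy.sum(axis=1, keepdims=True)     # marginal p(x), shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)     # marginal p(y), shape (1, bins)
    mask = pxy > 0                          # 0 * log 0 = 0 by convention
    return float((pxy[mask] * np.log2(pxy[mask] / (px * py)[mask])).sum())
```

For a fully dependent pair such as y = x the estimate approaches log2(bins) bits, while for independent samples it approaches zero (plug-in histogram MI estimates carry a small positive bias).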
The relevance of a set of inputs is defined by the joint mutual information \n\nI(X1, ..., Xk; Y) = K(p(X1, ..., Xk, Y) || p(X1, ..., Xk)p(Y)). \n\nGiven two selected inputs Xj and Xk, the conditional MI is defined by \n\nI(Xi; Y|Xj, Xk) = sum_{xj, xk} p(xj, xk) K(p(xi, y|xj, xk) || p(xi|xj, xk) p(y|xj, xk)). \n\nI(Xi; Y|Xj, ..., Xk), conditioned on more than two variables, is defined similarly. \n\nThe conditional MI is always non-negative, since it is a weighted average of Kullback-Leibler divergences. It has the following property: \n\nI(X1, ..., Xn-1, Xn; Y) - I(X1, ..., Xn-1; Y) = I(Xn; Y|X1, ..., Xn-1) >= 0. \n\nTherefore I(X1, ..., Xn-1, Xn; Y) >= I(X1, ..., Xn-1; Y), i.e., adding the variable Xn never decreases the mutual information. The information gained by adding a variable is measured by the conditional MI. \n\nWhen Xn and Y are conditionally independent given X1, ..., Xn-1, the conditional MI between Xn and Y is \n\nI(Xn; Y|X1, ..., Xn-1) = 0, (1) \n\nso Xn provides no extra information about Y when X1, ..., Xn-1 are known. In particular, when Xn is a function of X1, ..., Xn-1, equality (1) holds. This is the reason why the joint MI can be used to eliminate redundant inputs. \n\nThe conditional MI is useful when the input variables cannot be distinguished by the mutual information I(Xi; Y). For example, assume I(X1; Y) = I(X2; Y) = I(X3; Y), and the problem is to select (X1, X2), (X1, X3), or (X2, X3). Since \n\nI(X1, X2; Y) - I(X1, X3; Y) = I(X2; Y|X1) - I(X3; Y|X1), \n\nwe should choose (X1, X2) rather than (X1, X3) if I(X2; Y|X1) > I(X3; Y|X1); otherwise, we should choose (X1, X3). All possible comparisons are represented by a binary tree in Figure 1. \n\nTo estimate I(X1, ..., Xk; Y), we need to estimate the joint probability p(X1, ..., Xk, Y). This suffers from the curse of dimensionality when k is large. 
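The identity and the redundancy property above can be checked exactly on a small discrete joint table. The sketch below (ours, not the authors' code) computes I(X1, X2; Y) and I(X1; Y), and obtains the conditional MI as their difference, for a distribution in which X2 = X1 is fully redundant:

```python
import numpy as np

def kl_bits(p, q):
    """Kullback-Leibler divergence K(p || q) in bits over matching tables."""
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

def joint_mi(pxxy):
    """I(X1, X2; Y) from an exact joint table pxxy[x1, x2, y]."""
    px1x2 = pxxy.sum(axis=2, keepdims=True)
    py = pxxy.sum(axis=(0, 1), keepdims=True)
    return kl_bits(pxxy, px1x2 * py)

def mi_x1(pxxy):
    """I(X1; Y) after marginalizing out X2."""
    pxy = pxxy.sum(axis=1)
    return kl_bits(pxy, pxy.sum(axis=1, keepdims=True) * pxy.sum(axis=0, keepdims=True))

# X1 uniform on {0, 1}, X2 = X1 (redundant), Y = X1:
p = np.zeros((2, 2, 2))
p[0, 0, 0] = p[1, 1, 1] = 0.5
cond_mi = joint_mi(p) - mi_x1(p)   # I(X2; Y | X1), by the identity above
```

Here joint_mi(p) and mi_x1(p) both equal 1 bit, so I(X2; Y|X1) = 0: the redundant input adds nothing, which is exactly how equality (1) lets the joint MI eliminate redundant inputs.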
Sometimes we may not be able to estimate a high dimensional MI because of sample shortage. Further work is needed to estimate high dimensional joint MI based on parametric and non-parametric density estimation when the sample size is not large enough. \n\nIn some real world problems, such as mining large databases and radar pulse classification, the sample size is large. Since the parametric densities of the underlying distributions are unknown, it is better to use non-parametric methods such as histograms to estimate the joint probability and the joint MI, to avoid the risk of specifying a wrong or overly complicated model for the true density function. \n\n[Figure 1 appears here: a binary tree in which each internal node compares two conditional MIs, e.g. I(X2; Y|X1) vs I(X3; Y|X1), and each leaf is the resulting input pair (x1, x2), (x2, x3), or (x1, x3).] \n\nFigure 1: Input selection based on the conditional MI. \n\nIn this paper, we use the joint mutual information I(Xi, Xj; Y) instead of the mutual information I(Xi; Y) to select inputs for a neural network classifier. Another application is to select the two inputs most relevant to the target variable for data visualization. \n\n3 Data visualization methods \n\nWe present supervised data visualization methods based on the joint MI and discuss unsupervised methods based on ICA. \n\nThe most natural way to visualize high-dimensional input patterns is to display them using two of the existing coordinates, where each coordinate corresponds to one input variable. The inputs most relevant to the target variable correspond to the best coordinates for data visualization. Let (i*, j*) = arg max_{(i, j)} I(Xi, Xj; Y). 
Then the coordinate axes (Xi*, Xj*) should be used for visualizing the input patterns, since the corresponding inputs achieve the maximum joint MI. To find the maximum I(Xi*, Xj*; Y), we need to evaluate every joint MI I(Xi, Xj; Y) for i < j. The number of evaluations is O(n^2). \n\nNoticing that I(Xi, Xj; Y) = I(Xi; Y) + I(Xj; Y|Xi), we can first maximize the MI I(Xi; Y) and then maximize the conditional MI. This algorithm is suboptimal, but it requires only n - 1 evaluations of the joint MIs. Sometimes it is even equivalent to exhaustive search; one such example is given in the next section. \n\nSome existing methods for visualizing high-dimensional patterns are based on dimensionality reduction techniques such as PCA and CCA, which find new coordinates in which to display the data. The new coordinates found by PCA and CCA are orthogonal in Euclidean space and in the space with the Mahalanobis inner product, respectively. However, these two methods are not suitable for visualizing nongaussian data, because the projections onto the PCA or CCA coordinates are not statistically independent for nongaussian vectors. Since the JMI method is model-independent, it is better for analyzing nongaussian data. \n\nBoth CCA and maximum joint MI are supervised methods, while PCA is unsupervised. An alternative to these methods is ICA for visualizing clusters [5]. ICA is a technique that transforms a set of variables into a new set of variables whose statistical dependency is minimized. The version of ICA that we use here is based on the algorithms in [1, 8]. It discovers a non-orthogonal basis that minimizes the mutual information between projections onto the basis vectors. We shall compare these methods in a real world application. 
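The two search strategies can be put side by side in a compact sketch (ours, not the authors' code; discrete inputs assumed, e.g. histogram bin indices):

```python
import numpy as np

def discrete_mi(*cols):
    """Empirical I(cols[:-1]; cols[-1]) in bits; all columns discrete."""
    n = len(cols[0])
    joint, feat, targ = {}, {}, {}
    for row in zip(*cols):                       # count joint occurrences
        joint[row] = joint.get(row, 0) + 1
    for row, c in joint.items():                 # marginal counts
        feat[row[:-1]] = feat.get(row[:-1], 0) + c
        targ[row[-1]] = targ.get(row[-1], 0) + c
    return sum(c / n * np.log2(c * n / (feat[row[:-1]] * targ[row[-1]]))
               for row, c in joint.items())

def best_pair_exhaustive(X, y):
    """Maximize I(Xi, Xj; Y) over all pairs: O(n^2) joint-MI evaluations."""
    n = X.shape[1]
    return max(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda ij: discrete_mi(X[:, ij[0]], X[:, ij[1]], y))

def best_pair_greedy(X, y):
    """First maximize I(Xi; Y), then the conditional MI: O(n) evaluations."""
    n = X.shape[1]
    i1 = max(range(n), key=lambda i: discrete_mi(X[:, i], y))
    # maximizing I(Xi1, Xj; Y) over j is equivalent to maximizing
    # I(Xj; Y | Xi1), since I(Xi1, Xj; Y) = I(Xi1; Y) + I(Xj; Y | Xi1)
    j1 = max((j for j in range(n) if j != i1),
             key=lambda j: discrete_mi(X[:, i1], X[:, j], y))
    return tuple(sorted((i1, j1)))
```

On data where each informative input is individually relevant, the two searches agree; the greedy search can miss pairs that are only jointly informative (an XOR-like target), which is why it is suboptimal in general.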
4 Application to Signal Visualization and Classification \n\n4.1 Joint mutual information and visualization of radar pulse patterns \n\nOur goal is to design a classifier for radar pulse recognition. Each radar pulse pattern is a 15-dimensional vector. We first compute the joint MIs, then use them to select inputs for the visualization and classification of radar pulse patterns. \n\nA set of radar pulse patterns is denoted by D = {(z^i, y^i) : i = 1, ..., N}, which consists of patterns in three different classes. Here each z^i ∈ R^15 and each y^i ∈ {1, 2, 3}. \n\n[Figure 2 appears here: panel (a) plots the mutual information and the conditional MI given X2 against the input variable index; panel (b) plots the joint MIs as bars grouped by bundle number.] \n\nFigure 2: (a) MI vs conditional MI for the radar pulse data; maximizing the MI and then the conditional MI with O(n) evaluations gives I(Xi1, Xj1; Y) = 1.201 bits. (b) The joint MI for the radar pulse data; maximizing the joint MI gives I(Xi*, Xj*; Y) = 1.201 bits with O(n^2) evaluations of the joint MI. (i1, j1) = (i*, j*) in this case. \n\nLet i1 = arg max_i I(Xi; Y) and j1 = arg max_{j ≠ i1} I(Xj; Y|Xi1). From Figure 2(a), we obtain (i1, j1) = (2, 9) and I(Xi1, Xj1; Y) = I(Xi1; Y) + I(Xj1; Y|Xi1) = 1.201 bits. If the total number of inputs is n, then the number of evaluations for computing the mutual information I(Xi; Y) and the conditional mutual information I(Xj; Y|Xi1) is O(n). \n\nTo find the maximum I(Xi*, Xj*; Y), we evaluate every I(Xi, Xj; Y) for i < j. These MIs are shown by the bars in Figure 2(b), where the i-th bundle displays the MIs I(Xi, Xj; Y) for j = i + 1, ..., 15. 
In order to compute the joint MIs, the MI and the conditional MI are evaluated O(n) and O(n^2) times respectively. The maximum joint MI is I(Xi*, Xj*; Y) = 1.201 bits. In general we only know I(Xi1, Xj1; Y) <= I(Xi*, Xj*; Y), but in this particular application the equality holds. This suggests that sometimes we can use an efficient algorithm with only linear complexity to find the optimal coordinate axis view (Xi*, Xj*). The joint MI also gives other good sets of coordinate axis views with high joint MI values. \n\n[Figure 3 appears here: four scatter plots of the radar pulse patterns labeled by class (1, 2, 3): (a) the first two principal components, (b) the coordinate axes found via the joint MI, (c) the CCA coordinates, (d) the ICA coordinates.] \n\nFigure 3: (a) Data visualization by two principal components; the spatial relation between patterns is not clear. 
(b) Using the optimal coordinate axis view (Xi*, Xj*) found via the joint MI to project the radar pulse data; the patterns are well spread, giving a better view of the spatial relation between patterns and of the boundary between classes. (c) The CCA method. (d) The ICA method. \n\nEach bar in Figure 2(b) is associated with a pair of inputs. The pairs with high joint MI give good coordinate axis views for data visualization. Figure 3 shows that the data visualizations by the maximum JMI and the ICA methods are better than those by PCA and CCA, because the data is nongaussian. \n\n4.2 Radar pulse classification \n\nNow we train a two-layer feed-forward network to classify the radar pulse patterns. Figure 3 shows that it is very difficult to separate the patterns using just two inputs. We shall use all inputs or four selected inputs. The data set D is divided into a training set D1 and a test set D2, the latter consisting of 20 percent of the patterns in D. The network trained on the data set D1 using all input variables is denoted by \n\nY = f(X1, ..., Xn; W1, W2, θ) \n\nwhere W1 and W2 are weight matrices and θ is a vector of thresholds for the hidden layer. \n\nFrom the data set D, we estimate the mutual information I(Xi; Y) and select i1 = arg max_i I(Xi; Y). Given Xi1, we estimate the conditional mutual information I(Xj; Y|Xi1) for j ≠ i1 and choose the three inputs Xi2, Xi3, and Xi4 with the largest conditional MI. We found the quartet (i1, i2, i3, i4) = (1, 2, 3, 9). The two-layer feed-forward network trained on D1 with the four selected inputs is denoted by \n\nY = g(X1, X2, X3, X9; W1', W2', θ'). \n\nThere are 1365 ways to select 4 input variables out of 15. To set a reference performance for networks with four inputs, we choose 20 quartets from the set Q = {(j1, j2, j3, j4) : 1 <= j1 < j2 < j3 < j4 <= 15}. 
For each quartet (j1, j2, j3, j4), a two-layer feed-forward network is trained using the inputs (Xj1, Xj2, Xj3, Xj4). These networks are denoted by \n\nY = h_i(Xj1, Xj2, Xj3, Xj4; W1'', W2'', θ''), i = 1, 2, ..., 20. \n\n[Figure 4 appears here: training and testing error rate curves per epoch for the JMI-selected quartet (X1, X2, X3, X9), the randomly selected quartets, and the network with all inputs.] \n\nFigure 4: (a) The error rates of the network with the four inputs (X1, X2, X3, X9) selected by the joint MI are well below the average error rates (with error bars attached) of the 20 networks with randomly selected input quartets; this shows that the input quartet (X1, X2, X3, X9) is rare but informative. (b) The network with the inputs (X1, X2, X3, X9) converges faster than the network with all inputs. The former uses 65% fewer parameters (weights and thresholds) and 73% fewer inputs than the latter. The classifier with the four best inputs is less expensive to construct and use, in terms of data acquisition costs, training time, and computing costs for real-time application. \n\nThe mean and the variance of the error rates of the 20 networks are then computed. All networks have seven hidden units. The training and testing error rates of the networks at each epoch are shown in Figure 4, where we see that the network with the four inputs selected by the joint MI performs better than the networks with randomly selected input quartets and converges faster than the network with all inputs. 
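The selection procedure used above, picking i1 by the MI and then growing the set by the largest conditional MI, generalizes to any number of inputs. A sketch of that greedy loop (ours, assuming discrete inputs; it mirrors the procedure without claiming to reproduce the paper's exact quartet):

```python
import math
from collections import Counter

def joint_mi_bits(cols, y):
    """Empirical I(cols; y) in bits; cols is a list of discrete columns."""
    n = len(y)
    joint = Counter(zip(*cols, y))   # joint counts over (features..., target)
    pf = Counter(zip(*cols))         # marginal counts of the feature tuple
    py = Counter(y)                  # marginal counts of the target
    return sum(c / n * math.log2(c * n / (pf[k[:-1]] * py[k[-1]]))
               for k, c in joint.items())

def greedy_select(X, y, k):
    """Grow the input set by the largest conditional MI at each step."""
    selected = []
    for _ in range(k):
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        # adding the j that maximizes the joint MI also maximizes the
        # conditional MI I(Xj; Y | selected), since their difference
        # I(selected; Y) does not depend on j
        best = max(candidates, key=lambda j: joint_mi_bits(
            [X[:, s] for s in selected] + [X[:, j]], y))
        selected.append(best)
    return selected
```

Because a redundant input (a function of inputs already chosen) has zero conditional MI, this loop never prefers it over a genuinely informative candidate, which is the advantage over ranking inputs by I(Xi; Y) alone.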
The network with fewer inputs is not only faster to compute but also less expensive in data collection. \n\n5 CONCLUSIONS \n\nWe have proposed data visualization and feature selection methods based on the joint mutual information and ICA. \n\nThe maximum JMI method can find many good 2-D projections for visualizing high dimensional data which cannot be easily found by the other existing methods. Both the maximum JMI method and the ICA method are very effective for visualizing nongaussian data. \n\nThe variable selection method based on the JMI is found to be better at eliminating redundancy in the inputs than other methods based on simple mutual information. Input selection methods based on the mutual information (MI) have been useful in many applications, but they have two disadvantages. First, they cannot distinguish inputs when all of them have the same MI. Second, they cannot eliminate redundancy in the inputs when one input is a function of other inputs. In contrast, our new input selection method based on the joint MI offers significant advantages in these two respects. \n\nWe have successfully applied these methods to visualize radar patterns and to select inputs for a neural network classifier that recognizes radar pulses. Using the JMI, we found a smaller yet more robust neural network for radar signal analysis. \n\nAcknowledgement: This research was supported by grant ONR N00014-96-1-0476. \n\nReferences \n\n[1] S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. In Advances in Neural Information Processing Systems 8, eds. David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, MIT Press: Cambridge, MA, pages 757-763, 1996. \n\n[2] G. Barrows and J. Sciortino. A mutual information measure for feature selection with application to pulse classification. In IEEE International 
Symposium on Time-Frequency and Time-Scale Analysis, pages 249-253, 1996. \n\n[3] R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. on Neural Networks, 5(4):537-550, July 1994. \n\n[4] B. Bonnlander. Nonparametric Selection of Input Variables for Connectionist Learning. PhD thesis, University of Colorado, 1996. \n\n[5] C. Jutten and J. Herault. Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24:1-10, 1991. \n\n[6] J. Moody. Prediction risk and architecture selection for neural networks. In V. Cherkassky, J. H. Friedman, and H. Wechsler, editors, From Statistics to Neural Networks: Theory and Pattern Recognition Applications. NATO ASI Series F, Springer-Verlag, 1994. \n\n[7] H. Pi and C. Peterson. Finding the embedding dimension and variable dependencies in time series. Neural Computation, 6:509-520, 1994. \n\n[8] H. H. Yang and S. Amari. Adaptive on-line learning algorithms for blind separation: Maximum entropy and minimum mutual information. Neural Computation, 9(7):1457-1482, 1997. \n", "award": [], "sourceid": 1779, "authors": [{"given_name": "Howard", "family_name": "Yang", "institution": null}, {"given_name": "John", "family_name": "Moody", "institution": null}]}