{"title": "Constructing Heterogeneous Committees Using Input Feature Grouping: Application to Economic Forecasting", "book": "Advances in Neural Information Processing Systems", "page_first": 921, "page_last": 927, "abstract": null, "full_text": "Constructing Heterogeneous Committees \n\nUsing Input Feature Grouping: \n\nApplication to Economic Forecasting \n\nYuansong Liao and John Moody \n\nDepartment of Computer Science, Oregon Graduate Institute, \n\nP.O.Box 91000, Portland, OR 97291-1000 \n\nAbstract \n\nThe committee approach has been proposed for reducing model \nuncertainty and improving generalization performance. The ad(cid:173)\nvantage of committees depends on (1) the performance of individ(cid:173)\nual members and (2) the correlational structure of errors between \nmembers. This paper presents an input grouping technique for de(cid:173)\nsigning a heterogeneous committee. With this technique, all input \nvariables are first grouped based on their mutual information. Sta(cid:173)\ntistically similar variables are assigned to the same group. Each \nmember's input set is then formed by input variables extracted \nfrom different groups. Our designed committees have less error cor(cid:173)\nrelation between its members, since each member observes different \ninput variable combinations. The individual member's feature sets \ncontain less redundant information, because highly correlated vari(cid:173)\nables will not be combined together. The member feature sets con(cid:173)\ntain almost complete information, since each set contains a feature \nfrom each information group. An empirical study for a noisy and \nnonstationary economic forecasting problem shows that commit(cid:173)\ntees constructed by our proposed technique outperform committees \nformed using several existing techniques. \n\nIntroduction \n\n1 \nThe committee approach has been widely used to reduce model uncertainty and \nimprove generalization performance. 
Developing methods for generating candidate committee members is an important direction of committee research. Good candidate members of a committee should have (1) good (not necessarily excellent) individual performance and (2) small residual error correlations with other members.

Many techniques have been proposed to reduce residual correlations between members. These include resampling the training and validation data [3], adding randomness to the data [7], and decorrelation training [8]. These approaches are only effective for certain models and problems. Genetic algorithms have also been used to generate good and diverse members [6].

Input feature selection is one of the most important stages of the model learning process. It has a crucial impact on both the learning complexity and the generalization performance. It is essential that a feature vector give sufficient information for estimation. However, too many redundant input features not only burden the whole learning process, but also degrade the achievable generalization performance.

Input feature selection for individual estimators has received a lot of attention because of its importance. However, there has not been much research on feature selection for estimators in the context of committees. Previous research found that giving committee members different input features is very useful for improving committee performance [4], but is difficult to implement [9]. The feature selection problem for committee members is conceptually different than for single estimators. When using committees for estimation, as stated above, committee members not only need to have reasonable performance themselves, but should also make decisions independently.
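The value of low error correlation between members can be illustrated with a small simulation. The sketch below is hypothetical and not from the paper: it draws unit-variance member errors with a chosen pairwise correlation `rho` (the committee size `K = 18` and sample size `n` are illustrative) and compares the MSE of the averaged committee output with the average member MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 18, 10_000  # illustrative committee size and number of test points

def committee_vs_average_mse(rho):
    """Compare the committee MSE with the average member MSE when member
    errors have unit variance and pairwise correlation rho (0 <= rho <= 1)."""
    shared = rng.standard_normal(n)           # error component common to all members
    individual = rng.standard_normal((K, n))  # independent component per member
    errors = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * individual
    e_ave = float(np.mean(errors ** 2))                     # average member MSE
    e_committee = float(np.mean(errors.mean(axis=0) ** 2))  # MSE of averaged output
    return e_committee, e_ave

for rho in (0.0, 0.5, 0.9):
    e_c, e_ave = committee_vs_average_mse(rho)
    print(f"rho = {rho:.1f}: E_c / E_ave = {e_c / e_ave:.3f}")
```

With uncorrelated errors the ratio approaches 1/K, while strongly correlated errors leave the committee barely better than its average member, which is the trade-off the rest of the paper quantifies.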
\n\nWhen all committee members are trained to model the same underlying function, \nit is difficult for committee members to optimize both criteria at the same time. In \norder to generate members that provide a good balance between the two criteria, we \npropose a feature selection approach, called input feature grouping, for commit(cid:173)\ntee members. The idea is to give each member estimator of a committee a rich but \ndistinct feature sets, in the hope that each member will generalize independently \nwith reduced error correlations. \n\nThe proposed method first groups input features using a hierarchical clustering \nalgorithm based on their mutual information, such that features in different groups \nare less related to each other and features within a group are statistically similar \nto each other. Then the feature set for each committee member is formed by \nselecting a feature from each group. Our empirical results demonstrate that forming \na heterogeneous committee using input feature grouping is a promising approach. \n\n2 Committee Performance Analysis \n\nThere are many ways to construct a committee. \nIn this paper, we are mainly \ninterested in heterogeneous committees whose members have different input feature \nsets. Committee members are given different subsets of the available feature set. \nThey are trained independently, and the committee output is either a weighted or \nunweighted combination of individual members' outputs. \n\nIn the following, we analyze the relationship between committee errors and average \nmember errors from the regression point of view and discuss how the residual cor(cid:173)\nrelations between members affect the committee error. We define the training data \nV = {(X.B, y.B);;:3 = 1,2, . . . N} and the test data T = {(XI', YI'); JL = 1,2, ... oo}, \nwhere both are assumed to be generated by the model: Y = t(X) + f. \nf. '\" \nN(o, (72) . 
The data $\mathcal{D}$ and $\mathcal{T}$ are independent, and inputs are drawn from an unknown distribution. Assume that a committee has $K$ members. Denote the available input features as $X = [x_1, x_2, \ldots, x_m]$, the feature sets for the $i$th and $j$th members as $X_i = [x_{i1}, x_{i2}, \ldots, x_{i m_i}]$ and $X_j = [x_{j1}, x_{j2}, \ldots, x_{j m_j}]$ respectively, where $X_i \subset X$, $X_j \subset X$ and $X_i \neq X_j$, and the mapping functions of the $i$th and $j$th member models trained on data from $\mathcal{D}$ as $f_i(X_i)$ and $f_j(X_j)$. Define the model error $e_i^\mu = t^\mu - f_i(X_i^\mu)$, for all $\mu = 1, 2, 3, \ldots, \infty$ and $i = 1, 2, \ldots, K$.

The MSE of a committee is

$$E_c = \frac{1}{K^2} \sum_{i=1}^{K} \mathcal{E}\left[(e_i)^2\right] + \frac{1}{K^2} \sum_{i \neq j} \mathcal{E}\left[e_i e_j\right] , \quad (1)$$

and the average MSE made by the committee members acting individually is

$$E_{ave} = \frac{1}{K} \sum_{i=1}^{K} \mathcal{E}\left[(e_i)^2\right] , \quad (2)$$

where $\mathcal{E}[\cdot]$ denotes the expectation over all test data $\mathcal{T}$. Using Jensen's inequality, we get $E_c \leq E_{ave}$, which indicates that the performance of a committee is always equal to or better than the average performance of its members.

We define the average model error correlation as $C = \frac{1}{K(K-1)} \sum_{i \neq j} \mathcal{E}\left[e_i e_j\right]$, and then have

$$E_c = \frac{1}{K} E_{ave} + \frac{K-1}{K} C = \left( \frac{1}{K} + \frac{K-1}{K} q \right) E_{ave} , \quad (3)$$

where $q = C / E_{ave}$. We consider the following four cases of $q$:

• Case 1: $-\frac{1}{K-1} \leq q < 0$. In this case, the model errors between members are anti-correlated, which might be achieved through decorrelation training.

• Case 2: $q = 0$. In this case, the model errors between members are uncorrelated, and we have $E_c = \frac{1}{K} E_{ave}$. That is to say, a committee can do much better than the average performance of its members.

• Case 3: $0 < q < 1$. If $E_{ave}$ is bounded above, then as the committee size $K \to \infty$, we have $E_c = q E_{ave}$. This gives the asymptotic limit of a committee's performance.
As the size of a committee goes to infinity, the committee error approaches the average model error correlation $C$. The difference between $E_c$ and $E_{ave}$ is determined by the ratio $q$.

• Case 4: $q = 1$. In this case, $E_c$ is equal to $E_{ave}$. This happens only when $e_i = e_j$ for all $i, j = 1, \ldots, K$. It is obvious that there is no advantage to combining a set of models that act identically.

It is clear from the analyses above that a committee shows its advantage when the ratio $q$ is less than one. The smaller the ratio $q$ is, the better the committee performs compared to the average performance of its members. For the committee to achieve substantial improvement over a single model, committee members not only should have small errors individually, but also should have small residual correlations with each other.

3 Input Feature Grouping

One way to construct a feature subset for a committee member is to randomly pick a certain number of features from the original feature set. The advantage of this method is its simplicity. However, randomly selecting subsets gives us no control over each member's performance or over the residual correlations between members.

Instead of randomly picking a subset of features for a member, we propose an input feature grouping method for forming committee member feature sets. The input grouping method first groups features based on a relevance measure, such that features in different groups are less related to one another and features within a group are more related to one another.

After grouping, there are two ways to form member feature sets. One method is to construct the feature set for each member by selecting a feature from each group. Formed this way, each member's feature set provides enough information to make decisions and has little redundancy.
This is the method we use in this paper.

Another way is to use each group as the feature set for a committee member. In this method each member has only partial information, which is likely to hurt individual members' performance. However, because the input features of different members are less dependent, these members tend to make decisions more independently. There is always a trade-off between increasing members' independence and hurting individual members' performance. If there is no redundancy among input feature representations, removing several features may hurt individual members' performance badly, and the overall committee performance will suffer even though members make decisions independently. This method is currently under investigation.

The mutual information $I(x_i; x_j)$ between two input variables $x_i$ and $x_j$ is used as the relevance measure to group inputs. The mutual information, defined in equation 4, measures the dependence between the two random variables:

$$I(x_i; x_j) = \sum_{x_i, x_j} P(x_i, x_j) \log \frac{P(x_i, x_j)}{P(x_i) P(x_j)} . \quad (4)$$

If features $x_i$ and $x_j$ are highly dependent, $I(x_i; x_j)$ will be large. Because the mutual information measures arbitrary dependencies between random variables, it has been used effectively for feature selection in complex prediction tasks [1], where methods based on linear relations, such as the correlation, are likely to make mistakes. The fact that the mutual information is independent of the coordinates chosen permits a robust estimation.

4 Empirical Studies

We apply the input grouping method to predict the one-month rate of change of the Index of Industrial Production (IP), one of the key measures of economic activity. It is computed and published monthly. Figure 1 plots monthly IP data from 1967 to 1993.

Nine macroeconomic time series, whose names are given in Table 1, are used for forecasting IP.
Macroeconomic forecasting is a difficult task because data are usually limited, and these series are intrinsically very noisy and nonstationary. The series are preprocessed before they are applied to the forecasting models. The representation used for each input series is the first difference of the logged series on a one-month time scale. For example, the notation IP.L.D1 represents IP.L.D1 = ln(IP(t)) - ln(IP(t-1)). The target series is IP.L.FD1, which is defined as IP.L.FD1 = ln(IP(t+1)) - ln(IP(t)). The data set has been one of our benchmarks for various studies [5, 10].

Figure 1: U.S. Index of Industrial Production (IP) for the period 1967 to 1993. Shaded regions denote official recessions, while unshaded regions denote official expansions. The boundaries for recessions and expansions are determined by the National Bureau of Economic Research based on several macroeconomic series. As is evident for IP, business cycles are irregular in magnitude, duration, and structure, making prediction of IP an interesting challenge.

Series  Description
IP      Index of Industrial Production
SP      Standard & Poor's 500
DL      Index of Leading Indicators
M2      Money Supply
CP      Consumer Price Index
CB      Moody's Aaa Bond Yield
HS      Housing Starts
TB3     3-month Treasury Bill Yield
Tr      Yield Curve Slope: (10-Year Bond Composite) - (3-Month Treasury Bill)

Table 1: Input data series. Data are taken from the Citibase database.

During the grouping procedure, measures of mutual information between all pairs of input variables are computed first. A simple histogram method is used to calculate these estimates. Then a hierarchical clustering algorithm [2] is applied to these values to group inputs.
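The histogram step just described can be sketched as follows. This is an illustrative implementation, not the authors' code: the bin count and the synthetic series are assumptions, and it estimates the mutual information of equation 4 in nats from a 2-D histogram.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram estimate of I(x; y) in nats for two 1-D series."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()             # joint probability table
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal of y, shape (1, bins)
    nz = p_xy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])))

# Illustrative check on synthetic data: a near-copy of a series carries
# much more mutual information with it than an independent series does.
rng = np.random.default_rng(1)
a = rng.standard_normal(5000)
b = a + 0.1 * rng.standard_normal(5000)  # nearly a copy of a
c = rng.standard_normal(5000)            # independent of a
print(mutual_information(a, b), mutual_information(a, c))
```

In the paper's setting, such pairwise estimates over the nine input series form the relevance matrix that the hierarchical clustering step consumes.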
Hierarchical clustering proceeds by a series of successive fusions of the nine input variables into groups. At any particular stage, the process fuses the variables or groups of variables which are closest, based on their mutual information estimates. The distance between two groups is defined as the average of the distances between all pairs of individuals in the two groups. The result is presented as a tree which illustrates the fusions made at each successive level (see Figure 2). From the clustering tree, it is clear that we can break the input variables into four groups: (IP.L.D1, DL.L.D1) measure recent economic changes, (SP.L.D1) reflects recent stock market momentum, (CB.D1, TB3.D1, Tr.D1) give interest rate information, and (M2.L.D1, CP.L.D1, HS.L.D1) provide inflation information. The grouping algorithm meaningfully clusters the nine input series.

Figure 2: Variable grouping based on mutual information. The y-axis gives the distance.

Eighteen different subsets of features can be generated from the four groups by selecting a feature from each group. Each subset is given to a committee member. For example, the subsets (IP.L.D1, SP.L.D1, CB.D1, M2.L.D1) and (DL.L.D1, SP.L.D1, TB3.D1, M2.L.D1) are used as feature sets for different committee members. A committee thus has eighteen members in total. Data from Jan. 1950 to Dec. 1979 are used for training and validation, and data from Jan. 1980 to Dec. 1989 are used for testing. Each member is a linear model that is trained using neural net techniques.

We compare the input grouping method with three other committee member generating methods: baseline, random selection, and bootstrapping. The baseline method trains each committee member using all the input variables; members differ only in their initial weights.
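The eighteen member feature sets described above, one variable drawn from each of the four groups, can be enumerated mechanically. A minimal sketch using the groups reported in the paper (the enumeration itself is illustrative, not the authors' code):

```python
from itertools import product

# The four input groups found by the clustering procedure (from the paper).
groups = [
    ["IP.L.D1", "DL.L.D1"],             # recent economic changes
    ["SP.L.D1"],                        # stock market momentum
    ["CB.D1", "TB3.D1", "Tr.D1"],       # interest rate information
    ["M2.L.D1", "CP.L.D1", "HS.L.D1"],  # inflation information
]

# One committee member per combination: 2 * 1 * 3 * 3 = 18 feature sets.
member_feature_sets = [list(combo) for combo in product(*groups)]
print(len(member_feature_sets))  # 18
```

Each resulting list would serve as the input set of one committee member.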
The bootstrapping method also trains each member using all the input features, but each member has different bootstrap replicates of the original training data as its training and validation sets. The random selection method constructs a feature set for a member by randomly picking a subset from the available features. For comparison with the grouping method, each committee generated by these three methods has 18 members.

Twenty runs are performed for each of the four methods in order to get reliable performance measures. Figure 3 shows boxplots of the normalized MSE for the four methods. The grouping method gives the best result, and the performance improvement is significant compared to the other methods. The grouping method outperforms the random selection method by meaningfully grouping the input features. It is interesting to note that the heterogeneous committee methods, grouping and random selection, perform better than the homogeneous methods on this data set. One reason is that giving different members different input sets increases their model independence. Another reason could be that the problem becomes easier to model because of the smaller feature sets.

5 Conclusions

The performance of a committee depends on both the performance of individual members and the correlational structure of errors between members. An empirical study on a noisy and nonstationary economic forecasting problem has demonstrated that committees constructed by input variable grouping outperform committees formed by randomly selecting member input variables. They also outperform committees without any input variable manipulation.
\n\n9 I \n\n-L \n\n1 \n\n2 \n\n1 : Gr~lng. 2:Random \u2022\u2022 lectlon. 3.BaM\"n ..... Bootatrtpplng \n\n3 \n\nFigure 3: Comparison between four different committee member generating methods. \nThe proposed grouping method gives the best result, and the performance improvement \nis significant compared to the other three methods. \n\nReferences \n\n[1] R. Battiti. Using mutual information for selecting features in supervised neural net \n\nlearning. IEEE TI-ans. on Neural Networks, 5(4), July 1994. \n\n[2] B.Everitt. Cluster Analysis. Heinemann Educational Books, 1974. \n\n[3] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-40, 1996. \n\n[4] K.J. Cherkauer. Human expert-level performance on a scientific image analysis task \nby a system using combined artificai neural networks. In P. Chan, editor, Working \nNotes of the AAAI Workshop on Integrating Multiple Learned Models, pages 15-2l. \n1996. \n\n[5] J. Moody, U. Levin, and S. Rehfuss. Predicting the U.S. index of industrial pro(cid:173)\n\nduction. In proceedings of the 1993 Parallel Applications in Statistics and Economics \nConference, Zeist, The Netherlands. Special issue of Neural Network World, 3(6):791-\n794, 1993. \n\n[6] D. Opitz and J. Shavlik. Generating accurate and diverse members of a neural(cid:173)\n\nnetwork ensemble. In D . Touretzky, M. Mozer, and M. Hasselmo, editors, Advances \nin Neural Information Processing Systems 8. MIT Press, Cambridge, MA, 1996. \n\n[7] Y. Raviv and N. Intrator. Bootstrapping with noise: An effective regularization \n\ntechnique. Connection Science, 8(3-4):355-72, 1996. \n\n[8] B. E. Rosen. Ensemble learning using decorrelated neural networks. Connection \n\nScience, 8(3-4):373-83, 1996. \n\n[9] K. Tumer and J. Ghosh. Error correlation and error reduction in ensemble classifiers. \n\nConnection Science, 8(3-4):385-404, December 1996. \n\n[10] L. Wu and J . Moody. A smoothing regularizer for feedforward and recurrent neural \n\nnetworks. 
Neural Computation, 8(3):463-491, 1996.", "award": [], "sourceid": 1704, "authors": [{"given_name": "Yuansong", "family_name": "Liao", "institution": null}, {"given_name": "John", "family_name": "Moody", "institution": null}]}