{"title": "Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging", "book": "Advances in Neural Information Processing Systems", "page_first": 705, "page_last": 712, "abstract": null, "full_text": "Optimal Aggregation of Classifiers and Boosting\n Maps in Functional Magnetic Resonance\n Imaging\n\n\n\n Vladimir Koltchinskii\n Department of Mathematics and Statistics\n University of New Mexico\n Albuquerque, NM, 87131\n\n\n Manel Martinez-Ramon\n Department of Electrical and Computer Engineering\n University of New Mexico\n Albuquerque, NM, 87131\n\n\n Stefan Posse\n Department of Psychiatry and The Mind Institute\n University of New Mexico\n Albuquerque, NM, 87131\n\n\n\n Abstract\n\n We study a method of optimal data-driven aggregation of classifiers in a\n convex combination and establish tight upper bounds on its excess risk\n with respect to a convex loss function under the assumption that the so-\n lution of optimal aggregation problem is sparse. We use a boosting type\n algorithm of optimal aggregation to develop aggregate classifiers of ac-\n tivation patterns in fMRI based on locally trained SVM classifiers. The\n aggregation coefficients are then used to design a \"boosting map\" of the\n brain needed to identify the regions with most significant impact on clas-\n sification.\n\n\n1 Introduction\n\nWe consider a problem of optimal aggregation (see [1]) of a finite set of base classifiers in\na complex aggregate classifier. The aggregate classifiers we study are convex combinations\nof base classifiers and we are using boosting type algorithms as aggregation tools. Building\nupon recent developments in learning theory, we show that such boosting type aggregation\nyields a classifier with a small value of excess risk in the case when optimal aggregate\nclassifiers are sparse and that, moreover, the procedure provides reasonably good estimates\nof aggregation coefficients. 
Our primary goal is to use this approach in the problem of\nclassification of activation patterns in functional Magnetic Resonance Imaging (fMRI) (see,\ne.g., [2]).\n In these problems it is of interest not only to classify the patterns, but also to determine\nareas of the brain that are relevant for a particular classification task. Our approach is based\n\n\f\non splitting the image into a number of functional areas, training base classifiers locally in\neach area and then combining them into a complex aggregate classifier. The aggregation\ncoefficients are used to create a special representation of the image we call the boosting\nmap of the brain. It is needed to identify the functional areas with the most significant\nimpact on classification.\n Previous work has focused on classifying patterns within subject [2] and these pat-\nterns were located in the occipital lobe. Here we are considering a different problem, that\nis widely distributed patterns in multiple brain regions across groups of subjects. We use\nprior knowledge from functional neuroanatomical brain atlases to subdivide the brain into\nRegions of Interest, which makes this problem amenable to boosting. 
Classification across\nsubjects requires spatial normalization to account for inter-subject differences in brain size\nand shape, but also needs to be robust with respect to inter-subject differences in the shape\nand amplitude of activation patterns.\n Since fMRI patterns are very high dimensional and the amount of training data is\ntypically limited, some form of the \"bet on sparsity\" principle (\"use a procedure that does well\nin sparse problems, since no procedure does well in dense problems\", see [3]) becomes\nalmost unavoidable, and our theoretical analysis shows that boosting maps might have a\ngood chance of success in sparse problems (when only a few functional areas are relevant for\nclassification).\n\n\n2 Optimal aggregation of classifiers\n\nAlthough we developed a multiclass extension of the method, for simplicity we deal\nhere with standard binary classification. Let (X, Y) be a random couple with distribution\nP, X being an instance in some space S (e.g., it might be an fMRI pattern) and Y ∈\n{-1, 1} being a binary label. Here and in what follows all the random variables are defined\non a probability space (Ω, Σ, P), and E denotes the expectation. Functions f : S → R will\nbe used as classifiers, sign(f(x)) being a predictor of the label for an instance x ∈ S (no\ndecision is made if f(x) = 0). The quantity P{(x, y) : yf(x) ≤ 0} (the probability\nof misclassification or abstaining) is called the generalization error or the risk of f. Suppose\nthat H := {h_1, . . . , h_N} is a given family of classifiers taking values in [-1, 1]. Let\n\n conv(H) := { ∑_{j=1}^N λ_j h_j : ∑_{j=1}^N |λ_j| ≤ 1 }\n\nbe the symmetric convex hull of H. One version of the optimal aggregation problem\nis to find a convex combination f ∈ conv(H) that minimizes the generalization\nerror over conv(H). For a given f ∈ conv(H) its quality is measured by\n\n ℰ(f) := P{(x, y) : yf(x) ≤ 0} - inf_{g ∈ conv(H)} P{(x, y) : yg(x) ≤ 0},\n\nwhich is often called the excess risk of f. 
Since the true distribution P of (X, Y) is unknown,\nthe solution of the optimal aggregation problem is to be found based on the training\ndata (X_1, Y_1), . . . , (X_n, Y_n) consisting of n independent copies of (X, Y).\n Let P_n denote the empirical measure based on the training data, i.e., P_n(A) represents\nthe frequency of training examples in a set A ⊂ S × {-1, 1}. In what follows, we denote by P h\nor P_n h the integrals of a function h on S × {-1, 1} with respect to P or P_n, respectively.\nWe use the same notation for functions on S with an obvious meaning.\n Since the generalization error is not known, it is tempting to try to estimate the optimal\nconvex aggregate classifier by minimizing the training error P_n{(x, y) : yf(x) ≤ 0}\nover the convex hull conv(H). However, this minimization problem is not computationally\nfeasible and, moreover, the accuracy of empirical approximation (approximation of P by\nP_n) over the class of sets {{(x, y) : yf(x) ≤ 0} : f ∈ conv(H)} is not good enough\nwhen H is a large class. An approach that allows one to overcome both difficulties and that\nhas proved to be very successful in recent years is to replace the minimization of the training\nerror by the minimization of the empirical risk with respect to a convex loss function. To\nbe specific, let ℓ be a nonnegative decreasing convex function on R such that ℓ(u) ≥ 1 for\nu ≤ 0. We will denote (ℓ • f)(x, y) := ℓ(yf(x)). The quantity\n\n P(ℓ • f) = ∫ (ℓ • f) dP = E ℓ(Y f(X))\n\nis called the risk of f with respect to the loss ℓ, or the ℓ-risk of f. We will call a function\n\n f^0 := ∑_{j=1}^N λ^0_j h_j ∈ conv(H)\n\nan ℓ-optimal aggregate classifier if it minimizes the ℓ-risk over conv(H). 
Similarly to the\nexcess risk, one can define the excess ℓ-risk of f as\n\n ℰ_ℓ(f) := P(ℓ • f) - inf_{g ∈ conv(H)} P(ℓ • g).\n\n Although we concentrate in what follows on optimizing the excess ℓ-risk\n(ℓ-optimal aggregation), this often also provides a reasonably good solution of the problem\nof minimizing the generalization error (optimal aggregation), as follows from simple\ninequalities relating the two risks, proved in [4].\n As before, since P is unknown, the minimization of the ℓ-risk has to be replaced by the\ncorresponding empirical risk minimization problem\n\n P_n(ℓ • f) = (1/n) ∑_{j=1}^n ℓ(Y_j f(X_j)) → min, f ∈ conv(H),\n\nwhose solution f̂ := ∑_{j=1}^N λ̂_j h_j is called an empirical ℓ-optimal aggregate classifier.\n\n We will show that if f^0 and f̂ are \"sparse\" (i.e., λ^0_j and λ̂_j are small for most values\nof j), then the excess ℓ-risk of the empirical ℓ-optimal aggregate classifier is small and,\nmoreover, the coefficients of f̂ are close to the coefficients of f^0 in ℓ_1-distance.\n The sparsity assumption is almost unavoidable in many problems because of the \"bet\non sparsity\" principle (see the Introduction).\n At a more formal level, if there exists a small subset J ⊂ {1, 2, . . . , N} such that the\nsets of random variables {Y, h_j(X), j ∈ J} and {h_j(X), j ∉ J} are independent and, in\naddition, E h_j(X) = 0, j ∉ J, then, using Jensen's inequality, it is easy to check that in an\nℓ-optimal aggregate classifier f^0 one can take λ^0_j = 0, j ∉ J.\n We will define a measure of sparsity of a function f := ∑_{j=1}^N λ_j h_j ∈ conv(H) that\nis somewhat akin to sparsity characteristics considered in [5, 6]. For 0 ≤ d ≤ N, let\n\n γ(f; d) := min { ∑_{j ∉ J} |λ_j| : J ⊂ {1, . . . , N}, card(J) = d }\n\nand let ε_n(d) := d log(Nn/d) / n.\n Define\n\n d_n(f) := min { d : 1 ≤ d ≤ N, ε_n(d) ≥ γ(f; d) }.\n\n Of course, if there exists J ⊂ {1, . . . 
, N} such that λ_j = 0 for all j ∉ J and\ncard(J) = d, then d_n(f) ≤ d.\n We will also need the following measure of linear independence of functions in H:\n\n β(d) := β(H; d) = ( inf_{J ⊂ {1,...,N}, card(J)=d} inf_{∑_{j∈J} |α_j| = 1} ‖ ∑_{j∈J} α_j h_j ‖_{L2(P)} )^{-1}.\n\n Finally, we need some standard conditions on the loss function ℓ (as, for instance, in\n[4]). Assume that ℓ is Lipschitz on [-1, 1] with some constant L, |ℓ(u) - ℓ(v)| ≤ L|u -\nv|, u, v ∈ [-1, 1], and the following condition on the convexity modulus of ℓ holds with\nΛ ≤ L:\n\n (ℓ(u) + ℓ(v))/2 - ℓ((u + v)/2) ≥ Λ|u - v|^2, u, v ∈ [-1, 1].\n\n In fact, ℓ(u) is often replaced by a function ℓ(uM) with a large enough M (in other\nwords, the ℓ-risk is minimized over M conv(H)). This is the case, for instance, for so called\nregularized boosting [7]. The theorem below applies to this case as well, only a simple\nrescaling of the constants is needed.\n\nTheorem 1 There exist constants K1, K2 > 0 such that for all t > 0\n\n L2 log N t\n P E( ^\n f ) K1 + e-t\n n(dn( ^\n f )) n n\nand\n N L t\n P |^\n j - 0j| K2 (d e-t.\n n( ^\n f ) + dn(f0)) n(dn( ^\n f ) + dn(f0)) + n\n j=1\n\n Our proof requires some background material on localized Rademacher complexities\nand their role in bounding the excess risk (see [8]). We defer it to the full version of the\npaper. Note that the first bound depends only on d_n(f̂) and the second on d_n(f̂) and d_n(f^0).\nBoth quantities can be much smaller than N despite the fact that empirical risk minimization\noccurs over the whole N-dimensional convex hull. However, the approach to convex\naggregation based on minimization of the empirical ℓ-risk over the convex hull does not\nguarantee that f̂ is sparse even if f^0 is. 
To address this problem, we also studied another\napproach based on minimization of the penalized empirical -risk with the penalty based\non the number of nonzero coefficients of the classifier, but the size of the paper does not\nallow us to discuss it.\n\n3 Classification of fMRI patterns and boosting maps\n\nWe are using optimal aggregation methods described above in the problem of classification\nof activation patterns in fMRI. Our approach is based on dividing the training data into\ntwo parts: for local training and for aggregation. Then, we split the image into N func-\ntional areas and train N local classifiers h1, . . . , hN based on the portions of fMRI data\ncorresponding to the areas. The data reserved for aggregation is then used to construct an\naggregate classifier. In applications, we are often replacing direct minimization of empirical\nrisk with convex loss by the standard AdaBoost algorithm (see, e.g., [9]), which essentially\nmeans choosing the loss function as (u) = e-u. A weak (base) learner for AdaBoost sim-\nply chooses in this case a local classifier among h1, . . . , hN with the smallest weighted\ntraining error [in more sophisticated versions, we choose a local classifier at random with\nprobability depending on the size of its weighted training error] and after a number of\nrounds AdaBoost returns a convex combination of local classifiers. The coefficients of this\naggregate classifier are then used to create a new visual representation of the brain (the\nboosting map) that highlights the functional areas with significant impact on classification.\nIn principle, it is also possible to use the same data for training of local classifiers and for\naggregation (retraining the local classifiers at each round of boosting), but this approach is\ntime consuming.\n We use statistical parametric model (SPM) t-maps of MRI scans [10]. 
Statistical parametric\nmaps (SPMs) are image processes with voxel values (a voxel is the amplitude of a position\nin the 3-D MRI image matrix) that are, under the null hypothesis,\ndistributed according to a known probability density function, usually the Student's\nT or F distributions. These are known colloquially as t- or f-maps. Namely, one analyzes\neach and every voxel using any standard (univariate) statistical test. The resulting statistical\nparameters are assembled into an image, the SPM.\n\nFigure 1: Masks used to split the image into functional areas in multi-slice and 3 orthogonal\nslice display representations.\n\n The classification system essentially transforms the t-map of the image into the boosting\nmap and at the same time it returns the aggregate classifier. The system consists of the\ndata preprocessing block that splits the image into functional areas based on specified\nmasks, and also splits the data into portions corresponding to the areas. In one of our examples,\nwe use the main functional areas: brainstem, cerebellum, occipital, temporal, parietal,\nsubcortical and frontal. We split these masks into left and right, giving 14 of them in total.\nThe classifier block then trains local classifiers based on local data (in the current version\nwe are using SVM classifiers). Finally, the aggregation or boosting block computes and\noutputs the aggregate classifier and the boosting map of the image. We developed a version\nof the system that deals with multi-class problems in the spirit of [11], but the details go\nbeyond the scope of this paper. The architecture of the network also allows us to train it\nsequentially. Let f be a classifier produced by the network in the previous round of work,\nlet (X1, Y1), . . . , (Xn, Yn) be either the same or a new training data set and let h1, . . . , hN\nbe local classifiers (based either on the same, or on a new set of masks). 
Then one can\nassign to the training examples the initial weights w_j = e^{-Y_j f(X_j)} / Z, where Z is a standard\nnormalizing constant, instead of the usually chosen uniform weights. After this, AdaBoost\ncan proceed in the normal fashion, creating at the end an aggregate of f and of the new local\nclassifiers. The process can be repeated recursively, updating both the classifier and the boosting\nmap.\n\nFigure 2: Left and center: Patterns corresponding to two classes of data. Right: Locations\nof the learners chosen by the boosting procedure (white spots). The background image\ncorresponds to the two patterns of left and center figures superimposed.\n\nFigure 3: Patterns corrupted with noise in the gaussian parameters, artifacts, and additive\nnoise used in the synthetic data experiment.\n\nFigure 4: Two t-maps corresponding to visual (left) and motor (right) activations in the same\nsubject used in the real data experiment.\n\n As a synthetic data example, we generate 40 × 40 pixel images of two classes. Each\nclass of images consists of three gaussian clusters placed in different positions. We generate\nthe set of images by adding gaussian noise of standard deviation 0.1 to the standard deviation\nand position of the clusters. Then, we add 10 more clusters with random parameters,\nand finally, additive noise of standard deviation 0.1. Figure 2 (left and center) shows the\naverages of class 1 and class 2 images respectively. Two samples of the images can be seen\nin Figure 3.\n We apply a base learner to each one of the 1600 pixels of the images. 
Learners have\nbeen trained with 200 samples, 100 of each class, and the aggregation has been trained with\n200 more samples. The classifier has been tested with 200 previously unseen samples. The error\naveraged over 100 trials is 9.5%. The same experiment has been run with a single linear\nSVM, producing an error that exceeds 20%, although this rate can be slightly improved\nby selecting C by cross validation.\n The resulting boosting map can be seen in Fig. 2 (right). As a proof of concept, we\nremark that the map is able to focus on the areas where the clusters corresponding to each\nclass are located, discarding those areas in which only randomly placed clusters are present.\n\nFigure 5: Boosting map of the brain corresponding to the classification problem with visual\nand motor activations. Darker regions correspond to higher values.\n\n Area          Left    Right\n brainstem     0       0\n cerebellum    0.15    0.16\n parietal      0.02    0.06\n temporal      0.03    0.15\n occipital     0.29    0.15\n subcortical   0       0\n frontal       0       0\n\n Table 1: Values of the convex aggregation.\n\n In order to test the algorithm in a real fMRI experiment, we use 20 images taken from\n10 healthy subjects on a 1.5 Tesla Siemens Sonata scanner. Stimuli were presented via MR\ncompatible LCD goggles and headphones. The paradigm consists of four interleaved tasks:\nvisual (8 Hz checkerboard stimulation), motor (2 Hz right index finger tapping), auditory\n(syllable discrimination) and cognitive (mental calculation). These tasks are arranged in\nrandomized blocks (8 s per block). Finger tapping in the motor task was regulated with\nan auditory tone; subjects were asked to tap on a button-response pad. During the auditory\ntask, subjects were asked to respond on a button-response pad for each \"Ta\" (25% of\nsounds), but not to similar syllables. 
Mental calculation stimuli consisted of three single-digit\nnumbers heard via headphones. Participants had to sum them and divide by three,\nresponding by button press when there was no remainder (50% of trials).\n Functional MRI data were acquired using single-shot echo-planar imaging with TR:\n2 s, TE: 50 ms, flip angle: 90 degrees, matrix size: 64 × 64 pixels, FOV: 192 mm. Slices\nwere 6 mm thick, with 25% gap; 66 volumes were collected for a total measurement time\nof 132 s per run. Statistical parametric mapping was performed to generate t-maps that\nrepresent brain activation changes.\n The t-maps are lowpass filtered and undersampled to obtain 32 × 32 × 24 t-maps (Fig.\n4). The resulting t-maps are masked to obtain 14 subimages, then the data is normalized\nin amplitude. We proceed as described above to train a set of 14 Support Vector Machines. The\nkernel used is a gaussian one with σ = 2 and C = 10. These parameters have been chosen\nto provide an acceptable generalization. A convex aggregation of the classifier outputs is\nthen trained.\n We tested the algorithm in binary classification of visual against auditory activations.\nWe train the base learners with 10 images, and the boosting with 9. Then, we train the\nbase learners again with 19, leaving one for testing. We repeat the experiment leaving\nout a different image each trial. None of the images was misclassified. The values for the\naggregation are in Table 1. The corresponding boosting map is shown in Fig. 5. It highlights\nthe right temporal and both occipital areas, where the motor and visual activations are\npresent (see Fig. 4). Also, there is activation in the cerebellum area in some of the motor\nt-maps, which is highlighted by the boosting map.\n In experiments for the six binary combinations of activation stimuli, the average error\nwas less than 10%. 
This is an acceptable result if we take into account that the data included\nten different subjects, whose brain activation patterns present noticeable differences.\n\n\n4 Future goals\n\nThe boosting maps we introduced in this paper might become a useful tool in solving classification\nproblems for fMRI data, but a number of questions have to be answered before this\nis the case. The most difficult problem is the choice of functional areas and local classifiers\nso that the \"true\" boosting map is identifiable based on the data. As our theoretical analysis\nshows, this is related to the degree of linear independence of local classifiers quantified by\nthe function β(d). If β(d) is too large for d = d_n(f^0) ∨ d_n(f̂), the empirical boosting map\ncan become very unstable and misleading. In such cases, there is a challenging model selection\nproblem (how to choose a \"good\" subset of H or how to split H into \"almost linearly\nindependent clusters\" of functions) that has to be addressed to develop this methodology\nfurther.\n\nAcknowledgments\n\nWe thank Jeremy Bockholt (MIND Institute) for providing the brain\nmasks, generated with BRAINS2. Partially supported by NSF Grant DMS-0304861 and\nNIH Grant NIBIB 1 RO1 EB002618-01, Dept. of Mathematics and Statistics, Dept. of\nElectrical and Computer Engineering, Dept. of Psychiatry and The MIND Institute.\n\n\nReferences\n\n [1] Tsybakov, A. (2003) Optimal rates of aggregation. In: COLT2003, Lecture Notes in Artificial\n Intelligence, Eds.: M. Warmuth and B. Schoelkopf, Springer.\n\n [2] Cox, D.D., Savoy, R.L. (2003) Functional magnetic resonance imaging (fMRI) \"brain reading\":\n detecting and classifying distributed patterns of fMRI activity in human visual cortex,\n Neuroimage 19, 2, 261-70.\n\n [3] Friedman, J., Hastie, T., Rosset, S., Tibshirani, R. and Zhu, J. (2004) Discussion on Boosting,\n Annals of Statistics, 32, 1, 102-107.\n\n [4] Bartlett, P. L., Jordan, M.I., McAuliffe, J. D. 
(2003) Convexity, classification, and risk\n bounds. Technical Report 638, Department of Statistics, U.C. Berkeley, 2003. Journal of\n the American Statistical Association. To appear.\n\n [5] Koltchinskii, V., Panchenko, D. and Lozano, F. (2003) Bounding the generalization error of\n combined classifiers: balancing the dimensionality and the margins. Ann. Appl. Prob., 13, 1.\n\n [6] Koltchinskii, V., Panchenko, D. and Andonova, S. (2003) Generalization bounds for voting\n classifiers based on sparsity and clustering. In: COLT2003, Lecture Notes in Artificial\n Intelligence, Eds.: M. Warmuth and B. Schoelkopf, Springer.\n\n [7] Blanchard, G., Lugosi, G. and Vayatis, N. (2003) On the rates of convergence of regularized\n boosting classifiers. Journal of Machine Learning Research 4, 861-894.\n\n [8] Koltchinskii, V. (2003) Local Rademacher Complexities and Oracle Inequalities in Risk\n Minimization. Preprint.\n\n [9] Schapire, R. E. (1999) A brief introduction to boosting. In: Proc. of the 6th Intl. Conf. on\n Artificial Intelligence.\n\n [10] Friston, K., Frith, C., Liddle, P. and Frackowiak, R. (1991) Comparing functional (PET)\n images: the assessment of significant change. J. Cereb. Blood Flow Met. 11, 690-699.\n\n [11] Allwein, E. L., Schapire, R. E., and Singer, Y. (2000) Reducing multiclass to binary: A\n unifying approach for margin classifiers. J. Machine Learning Research, 1, 113-141.\n", "award": [], "sourceid": 2699, "authors": [{"given_name": "Vladimir", "family_name": "Koltchinskii", "institution": null}, {"given_name": "Manel", "family_name": "Mart\u00ednez-ram\u00f3n", "institution": null}, {"given_name": "Stefan", "family_name": "Posse", "institution": null}]}