{"title": "Boosting with Spatial Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 2107, "page_last": 2115, "abstract": "By adding a spatial regularization kernel to a standard loss function formulation of the boosting problem, we develop a framework for spatially informed boosting. From this regularized loss framework we derive an efficient boosting algorithm that uses additional weights/priors on the base classifiers. We prove that the proposed algorithm exhibits a ``grouping effect, which encourages the selection of all spatially local, discriminative base classifiers. The algorithms primary advantage is in applications where the trained classifier is used to identify the spatial pattern of discriminative information, e.g. the voxel selection problem in fMRI. We demonstrate the algorithms performance on various data sets.", "full_text": "Boosting with Spatial Regularization\n\nZhen James Xiang1 Yongxin Taylor Xi1 Uri Hasson2\n\nPeter J. Ramadge1\n\n1: Department of Electrical Engineering, Princeton University, Princeton NJ, USA\n\n2: Department of Psychology, and Neuroscience Institute, Princeton University, Princeton NJ, USA\n\n{zxiang, yxi, hasson, ramadge} @ princeton.edu\n\nAbstract\n\nBy adding a spatial regularization kernel to a standard loss function formulation\nof the boosting problem, we develop a framework for spatially informed boosting.\nFrom this regularized loss framework we derive an ef\ufb01cient boosting algorithm\nthat uses additional weights/priors on the base classi\ufb01ers. We prove that the pro-\nposed algorithm exhibits a \u201cgrouping effect\u201d, which encourages the selection of\nall spatially local, discriminative base classi\ufb01ers. The algorithm\u2019s primary advan-\ntage is in applications where the trained classi\ufb01er is used to identify the spatial\npattern of discriminative information, e.g. 
the voxel selection problem in fMRI.\nWe demonstrate the algorithm\u2019s performance on various data sets.\n\n1 Introduction\n\nWhen applying off-the-shelf machine learning algorithms to data with spatial dimensions (images,\ngeo-spatial data, fMRI, etc) a central question arises: how to incorporate prior information on the\nspatial characteristics of the data? For example, if we feed a boosting or SVM algorithm with\nindividual image voxels as features, the voxel spatial information is ignored. Indeed, if we randomly\nshuf\ufb02ed the voxels, the algorithm would not notice any difference. Yet in many cases the spatial\narrangement of the voxels together with prior information about expected spatial characteristics of\nthe data may be very helpful. We are particularly interested in the situation when the trained classi\ufb01er\nis used to identify relevant spatial regions. To make this more concrete, consider the problem of\ntraining a classi\ufb01er to distinguish two different brain states based on fMRI responses. Successful\nclassi\ufb01cation suggests that the voxels used are important in discriminating between the two classes.\nHence we could use a successful classi\ufb01er to learn a set of discriminative voxels. We expect that\nthese voxels will be spatially compact and clustered. How can this prior knowledge be incorporated\ninto the training of the classi\ufb01er? In summary, our primary objective is improving the ability of\nthe trained classi\ufb01er to usefully identify the spatial pattern of discriminative information. However,\nincorporating spatial information into boosting may also improve classi\ufb01cation accuracy.\nOur key contribution is the development of a framework for spatially regularized boosting. We\ndo this by adding a spatial regularization kernel to the standard loss minimization formulation of\nboosting. We then design an associated boosting algorithm by using coordinate descent on the\nregularized loss. 
We show that the algorithm minimizes the regularized loss function and has a natural interpretation of boosting with additional adaptive priors/weights on both spatial locations and training examples. We also show that it exhibits a natural grouping effect on nearby spatial locations with similar discriminative power.\nWe believe our contributions are fundamental and relevant to a variety of applications where base classifiers are attributed with a known auxiliary variable and prior information is known about this auxiliary variable. However, since our study is motivated by the particular problem of voxel selection in fMRI analysis, we briefly review the state of the art in this domain so as to put our contribution into a concrete context.\n\nBriefly, the fMRI voxel selection problem is to use the fMRI signal to identify a subset of voxels that are key in discriminating between two stimuli. One expects such voxels to be spatially compact and clustered. Traditionally this is done by thresholding a statistical univariate test score on each voxel [1]. Spatial smoothing prior to this analysis is commonly employed to integrate activity from neighboring voxels. An extreme case is hypothesis testing on clusters of voxels rather than on the voxels themselves [2]. The problem with these methods is that they greatly sacrifice the spatial resolution of the results, and averaging can hide fine patterns in the data. An alternative is to spatially average the univariate test scores, e.g. by thresholding in some transformed domain (e.g. the wavelet domain) [3, 4]. However, this also compromises the spatial accuracy of the result because one selects discriminating wavelet components, not voxels. A more promising spatially aware approach selects voxels with tree-based spatial regularization of a univariate statistic [5, 6]. This can achieve both spatial precision and smoothness but uses a complex regularization method. 
Our proposed\nmethod also selects single voxels with the help of spatial regularization but operates in a multivariate\nclassi\ufb01er framework using a simpler form of regularization.\nRecent research has suggested that multivariate analysis has potential advantages over univariate\ntests [7, 8], e.g. it brings in machine learning algorithms (such as boosting, SVM, etc.) and there-\nfore might capture more intricate activation patterns involving multiple voxels. To ensure spatial\nclustering of selected voxels, one can run a searchlight (a spherical mask) [9] to pre-select clustered\ninformative features. In each searchlight location, a multivariate analysis is performed to see whether\nthe masked area contains informative data. One can then train a classi\ufb01er on the pre-selected voxels.\nA variant of this two-stage framework is to train classi\ufb01ers on a few prede\ufb01ned masks, and then\naggregate these classi\ufb01ers by boosting [10, 11]. This is faster but assumes detailed prior knowledge\nto select the prede\ufb01ned masks. Unlike two-stage approaches, [12] directly uses AdaBoost to train\nclassi\ufb01ers with \u201crich features\u201d (features involving the values of several adjacent voxels) to capture\nspatial structure in the data. Although exhibiting superior performance, this method selects \u201crich\nfeatures\u201d rather than individual discriminating voxels. Moreover, there is no control on the spatial\nsmoothness of the results. Our method is similar to [12] in that we combine the feature selection\nand classi\ufb01cation into one boosting process. But our algorithm operates on single voxels and uses\nsimple spatial regularization to incorporate spatial information.\nThe remainder of the paper is organized as follows. After introducing notation in \u00a72, we formu-\nlate our spatial regularization approach in \u00a73 and derive an associated spatially regularized boosting\nalgorithm in \u00a74. 
We prove an interesting property of the algorithm in §5 that guarantees the simultaneous selection of equivalent locations that are spatially close. In §6, we test the algorithm on face gender detection, OCR image classification, and fMRI experiments.\n\n2 Boosting Preliminaries\n\nIn a supervised learning setting, we are given m training instances X = {x_i ∈ R^n, i = 1, …, m} and corresponding binary labels Y = {y_i = ±1, i = 1, …, m}. Using the training instances X, we select a pool of base classifiers H = {h_j : R^n → {−1, +1}, j = 1, …, p}. Our objective is to train a composite binary classifier of the form h_α(x_i) = sgn(∑_{j=1}^p α_j h_j(x_i)). We can further assume that h_j ∈ H ⇒ −h_j ∈ H, thus all values in α can be assumed to be nonnegative. Boosting is a technique for constructing from X, Y and H the weight vector α of a composite classifier to best predict the labels. This can be done by seeking α to minimize a loss function of the form:\n\nL(X, Y, α) = ∑_{i=1}^m l(y_i, h_α(x_i)).   (1)\n\nVarious boosting algorithms can be derived as iterative greedy coordinate descent procedures to minimize (1) [13]. In particular, AdaBoost [14] is of this form with l(y_i, h_α(x_i)) = e^{−y_i h_α(x_i)}.\nThe result of a conventional boosting algorithm is determined by the m × p matrix M = [y_i h_j(x_i)] [15]. Under a component permutation x̂_i = P x_i, the base classifiers become ĥ_j = h_j · P^{−1}; so M̂ = [y_i ĥ_j(x̂_i)] = [y_i h_j(x_i)] = M. Hence training on {P x_i, y_i} or {x_i, y_i} yields the same α, i.e., the arrangement of the components can be arbitrary as long as it is consistent.\nThe weights α of a composite classifier not only indicate how to construct the classifier, but also the relative reliance of the classifier on each of the n instance components. 
To see this, assume each h_j depends on only a single component of x ∈ R^n, i.e., for some standard basis vector e_k and function g_j : R → {−1, +1}, h_j(x) = g_j(e_k^T x) (the base classifiers are decision stumps). To make the association between base classifiers and components explicit, let s be the function with s(j) = k if h_j(x) = g_j(e_k^T x), and let Q = [q_kj] be the n × p matrix with q_kj = 1_[s(j)=k]. Then the vector β = Qα indicates the relative importance the classifier assigns to each instance component. Although we used decision stumps above for simplicity, more complex base classifiers such as decision trees could be used with proper modification of the mapping from α to β. We call β the component importance map. Suppose the instance components reflect spatial structure in the data, e.g. the components are samples along an interval or pixels in an image. Then the component importance map indicates the spatial distribution of weights that the classifier employs. Presumably a good classifier distributes the weights in accordance with the discriminative power of the components; in which case, the map indicates how discriminative information is spatially distributed. It is in this aspect of the classifier that we are particularly interested. Now, as shown above, conventional boosting ignores spatial information. Our objective, pursued in the next sections, is to incorporate prior information on spatial structure, e.g. a prior on the component importance map, into the boosting problem.\n\n3 Adding Spatial Regularization\n\nTo incorporate spatial information we add spatial regularization of the form β^T Kβ to the loss (1), where the kernel K ∈ R^{n×n} is positive definite. For concreteness, we employ the exponential loss l(y_i, h_α(x_i)) = e^{−y_i h_α(x_i)}. 
Thus the regularized loss is:\n\nL^exp_reg(X, Y, α) = ∑_{i=1}^m exp(−y_i ∑_{j=1}^p α_j h_j(x_i)) + λβ^T Kβ   (2)\n= ∑_{i=1}^m exp(−y_i ∑_{j=1}^p α_j h_j(x_i)) + λα^T Q^T KQα.   (3)\n\nThe term β^T Kβ imposes a spatial smoothness constraint on β. To see this, consider the eigendecomposition K = UΣU^T, where the columns {u_j} of U are the orthonormal eigenvectors, σ_j is the eigenvalue of u_j, and Σ = diag(σ_1, σ_2, …, σ_n). Then the regularizing term can be rewritten as λ‖Σ^{1/2} U^T β‖_2^2, where U^T β is the “spectrum” of β under the orthogonal transformation U^T. Rather than standard Tikhonov regularization with ‖β‖_2^2 = ‖U^T β‖_2^2, we penalize the variation in direction u_j proportionally to the eigenvalue σ_j. By doing so we encourage β to be close to the eigenvectors u_j with small eigenvalues. This encodes our prior spatial knowledge.\n\nFigure 1: Each graph is the eigenimage of size d × d corresponding to an eigenvector of K = µI − G.\n\nAs an example, consider the kernel K = µI − G, where G is a Gaussian kernel matrix:\n\nG_ij = e^{−‖v_i − v_j‖_2^2 / (2r^2)},   (4)\n\nwith v_j the spatial location of component j, ‖v_i − v_j‖_2 the Euclidean distance (other distances can also be used) between components i and j, and r the radius parameter of the Gaussian kernel. For the 2D case, i = (i_1, i_2) ranges over (1, 1), (1, 2), …, (d, d), and j = (j_1, j_2) ranges over the same coordinates, so G is a d^2 × d^2 matrix. We plot the 6 eigenimages of K with smallest eigenvalues in Figure 1. The regularization imposes a spatial smoothness constraint by encouraging β to give more weight to the eigenimages with smaller eigenvalues, e.g. 
the patterns shown in Figure 1.\n\n4 A Spatially Regularized Boosting Algorithm\n\nWe now derive a spatially regularized boosting algorithm (abbreviated as SRB) using coordinate descent on (3). In particular, in each iteration we choose a coordinate of α with the largest negative gradient and increase the weight of that coordinate by a step size ε. This results in an algorithm similar to AdaBoost, but with additional consideration of spatial location.\nTo begin, we take the partial derivative of (3) w.r.t. α_j':\n\n−∂L^exp_reg(X, Y, α)/∂α_j' = ∑_{i=1}^m y_i h_j'(x_i) exp(−y_i ∑_{j=1}^p α_j h_j(x_i)) − 2e_j'^T λQ^T KQα.\n\nHere e_j' is the j'-th standard basis vector, so e_j'^T λQ^T KQα is the j'-th element of λQ^T KQα. By the definition of Q, (e_j'^T Q^T)λKQα is the s(j')-th element of λKQα. Therefore, if we define γ = −2λKβ, and let w_i = exp(−y_i ∑_{j=1}^p α_j h_j(x_i)), 1 ≤ i ≤ m, be the unnormalized weight on training instance x_i, then the partial derivative above can be written as:\n\n−∂L^exp_reg(X, Y, α)/∂α_j' = ∑_{i=1}^m y_i h_j'(x_i) w_i + γ_s(j').\n\nThe term ∑_{i=1}^m y_i h_j'(x_i) w_i is the weighted performance of base classifier h_j' on the training examples. Normally, we choose h_j' to maximize this term. This corresponds to choosing the best base classifier under the current weight distribution. However, here we have an additional term: the performance of base classifier h_j' is enhanced by a weight γ_s(j') on its corresponding component s(j'). We call γ the spatial compensation weight. 
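For the kernel K = µI − G, the compensation weight reduces to a Gaussian smoothing of β minus a multiple of β itself. A minimal numpy sketch of this computation (the function names are ours, not the paper's; the default µ = max_j ∑_i G_ij is the choice used in the experiments of §6):

```python
import numpy as np

def gaussian_kernel(d, r):
    """Gaussian kernel G over a d x d grid: G_ij = exp(-||v_i - v_j||^2 / (2 r^2))."""
    coords = np.array([(a, b) for a in range(d) for b in range(d)], dtype=float)
    sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dist / (2 * r ** 2))

def compensation_weights(beta, d, r, lam, mu=None):
    """Spatial compensation weight gamma = -2*lam*K*beta with K = mu*I - G,
    i.e. gamma = 2*lam*(G @ beta - mu*beta): smoothed beta minus mu*beta."""
    G = gaussian_kernel(d, r)
    if mu is None:
        mu = G.sum(axis=1).max()  # makes K diagonally dominant, hence positive semidefinite
    K = mu * np.eye(d * d) - G
    return -2 * lam * (K @ beta)
```

With this choice, an inactive component next to an active one receives a positive weight, while the active component itself receives a negative weight, matching the discussion below.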
To proceed, we choose a base classifier h_j' to maximize the sum of these two terms and then increase the weight of that base classifier by a step size ε. This gives Algorithm 1 shown in Figure 2. The key differences from AdaBoost are: (a) the new algorithm maintains a new set of “spatial compensation weights” γ; (b) the weights w_i on training examples are not normalized at the end of each iteration.\n\nAlgorithm 1 The SRB algorithm\n1: w_i ← 1, 1 ≤ i ≤ m\n2: α ← 0\n3: for t = 1 to T do\n4: β ← Qα\n5: γ ← −2λKβ\n6: find the “best” base classifier in the following sense: j' ← arg max_j {Ω(h_j, w) + γ_s(j)}\n7: choose a step size ε, α_j' ← α_j' + ε\n8: adjust weights: w_i ← w_i e^ε if y_i h_j'(x_i) = −1, w_i ← w_i e^{−ε} if y_i h_j'(x_i) = 1, for 1 ≤ i ≤ m\n9: end for\n10: Output result: h_α(x) = ∑_{j=1}^p α_j h_j(x)\n\nAlgorithm 2 SRB algorithm with backward steps\n1: w_i ← 1, 1 ≤ i ≤ m\n2: α ← 0\n3: for t = 1 to T do\n4: β ← Qα\n5: γ ← −2λKβ\n6: find the “best” base classifier in the following sense: j' ← arg max_j {Ω(h_j, w) + γ_s(j)}\n7: choose a step size ε_1, α_j' ← α_j' + ε_1\n8: adjust weights: w_i ← w_i e^{ε_1} if y_i h_j'(x_i) = −1, w_i ← w_i e^{−ε_1} if y_i h_j'(x_i) = 1\n9: find the “worst” active classifier in the following sense: j'' ← arg min_{j: α_j > 0} {Ω(h_j, w) + γ_s(j)}\n10: α_j'' ← α_j'' − ε_2\n11: adjust weights again: w_i ← w_i e^{−ε_2/2} if y_i h_j''(x_i) = −1, w_i ← w_i e^{ε_2/2} if y_i h_j''(x_i) = 1, for 1 ≤ i ≤ m\n12: end for\n13: Output result: h_α(x) = ∑_{j=1}^p α_j h_j(x)\n\nIn both algorithms, Ω(h_j, w) is defined to be:\n\nΩ(h_j, w) = ∑_{i=1}^m y_i h_j(x_i) w_i,\n\nwhich is a performance measure of classifier h_j under the weight distribution w on the training examples.\n\nFigure 2: The SRB (spatially regularized boosting) algorithms.\n\nTo elucidate the effect of the compensation weights, consider the kernel K = µI − G, with G defined in (4). In this case, γ = 2λ(β̄ − µβ), where β̄ = Gβ is the Gaussian smoothing of β. Therefore, a component receives a high compensation weight γ_k = 2λ(β̄_k − µβ_k) if some neighboring spatial locations have already been selected (i.e., made “active”) by the composite classifier. On the other hand, the weight of a component is reduced (in proportion to the magnitude of the parameter µ) if it is already “active”, i.e., β_k > 0. So the algorithm encourages the selection of base classifiers associated with “inactive” locations that are close to “active” locations.\nWe can enhance the algorithm by including a backward step in each iteration: α_j'' ← α_j'' − ε', where\n\nj'' = arg min_{1 ≤ j ≤ p, α_j > 0} {∑_{i=1}^m y_i h_j(x_i) w_i + γ_s(j)}.   (5)\n\nThis helps remove prematurely selected base classifiers [16, 17]. This is Algorithm 2 in Figure 2.\nSpatial regularization brings no significant computational overhead: compared to AdaBoost, SRB has the additional steps 4 and 5, which can be computed in time O(n) per iteration. 
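A minimal Python sketch of Algorithm 1, under stated assumptions: decision-stump base classifiers acting on single components (so s(j) is explicit), a fixed step size ε, and a precomputed kernel K. All function and variable names here are ours, not the paper's:

```python
import numpy as np

def srb(X, y, K, lam=0.1, eps=0.05, T=100, thresholds=(0.0,)):
    """Spatially regularized boosting (sketch of Algorithm 1).
    Base classifiers are stumps h(x) = sign * (2*(x[k] > t) - 1),
    so each stump acts on a single component k, i.e. s(j) = k."""
    m, n = X.shape
    # Enumerate the stump pool: (component k, threshold t, sign).
    stumps = [(k, t, s) for k in range(n) for t in thresholds for s in (+1, -1)]
    H = np.array([s * (2 * (X[:, k] > t) - 1) for (k, t, s) in stumps]).T  # m x p
    alpha = np.zeros(len(stumps))
    w = np.ones(m)                                   # unnormalized example weights
    for _ in range(T):
        beta = np.zeros(n)                           # beta = Q alpha
        for j, (k, _, _) in enumerate(stumps):
            beta[k] += alpha[j]
        gamma = -2 * lam * (K @ beta)                # spatial compensation weights
        score = (y[:, None] * H * w[:, None]).sum(0)           # Omega(h_j, w)
        score = score + np.array([gamma[k] for (k, _, _) in stumps])
        j_best = int(np.argmax(score))               # best stump under both weight sets
        alpha[j_best] += eps
        w = w * np.exp(-eps * y * H[:, j_best])      # no normalization, unlike AdaBoost
    return alpha, stumps
```

The weight update is exactly w_i ← w_i e^{−ε y_i h_j'(x_i)}, which increases the weight of misclassified examples as in the algorithm listing above.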
Adaptive weight γ incurs no additional complexity for step 6 in our current implementation.\nWe now briefly discuss the choice of step size ε in Algorithm 1 (ε_1 and ε_2 in Algorithm 2 can be chosen similarly). ε could be a fixed (small) step size at each iteration. This is not greedy but may necessitate a large number of iterations. Alternatively, one can be greedy and select ε to minimize the value of the loss function (3) after the change α_j' ← α_j' + ε:\n\nW_− e^ε + W_+ e^{−ε} + λ(β + εe_k')^T K(β + εe_k'),   (6)\n\nwhere W_− = ∑_{i: y_i h_j'(x_i) = −1} exp(−y_i h_α(x_i)), W_+ = ∑_{i: y_i h_j'(x_i) = 1} exp(−y_i h_α(x_i)), and k' = s(j'). Setting the derivative of (6) to 0 yields:\n\nW_− e^ε − W_+ e^{−ε} − γ_k' + 2λεK_k'k' = 0.   (7)\n\nUsing e^{±ε} ≈ 1 ± ε gives the solution ε̂ = (W_+ − W_− + γ_k') / (W_+ + W_− + 2λK_k'k'), which can be used as a step size. However, for the following slightly more conservative step size we can prove algorithm convergence:\n\nε̃ = min{ 3(W_+ − W_−)/(W_+ + 1.36W_−), (W_+ − W_− + γ_k')/(W_+ + W_− + 2λK_k'k'), 1 }.   (8)\n\nTheorem 1. The step size (8) ensures convergence of Algorithm 1.\nProof. (6) is convex, so its minimum point ε* is the unique solution of (7): f_1(ε*) + f_2(ε*) = 0, where f_1(ε) = W_− e^ε − W_+ e^{−ε} and f_2(ε) = 2λK_k'k' ε − γ_k'. We have the inequality chain:\n\nf_1(ε̃) + f_2(ε̃) ≤ g_1(ε̃) + f_2(ε̃) ≤ g_1(ε̂) + f_2(ε̂) = 0 = f_1(ε*) + f_2(ε*),   (9)\n\nwhere g_1(ε) = W_−(1 + ε) − W_+(1 − ε). So ε̃ is on the descending slope of (6), which is a sufficient condition for ε̃ to reduce the objective (6). Since the objective (3) is nonnegative and each iteration of the algorithm reduces (3), the algorithm converges. The second inequality in (9) uses monotonicity, while the first inequality in (9) uses the following lemma, proved in the supplementary material:\nLemma: If 0 < ε ≤ min{3(W_+ − W_−)/(W_+ + 1.36W_−), 1}, then f_1(ε) − g_1(ε) ≤ 0.\n\n5 The Grouping Effect: Asymptotic Analysis\n\nRecall our objective of using the component importance map of the trained classifier to ascertain the spatial distribution of informative components in the data. Ideally, we would like β to faithfully represent this information. In general, however, a boosting algorithm will select a sufficient but incomplete collection of base classifiers (and hence components) to accomplish the classification. For example, after selecting one base classifier h_j, AdaBoost will adjust the weights of the training examples to make the weighted training error of h_j exactly 1/2 (totally uninformative), thus preventing the selection of any classifiers similar to h_j in the next iteration. In fact, for AdaBoost we can prove that in the optimal solution α*, we can transfer coefficient weights between any two equivalent base classifiers without impacting optimality. So minimizing the loss function (1) does not require any particular distribution among the β coefficients of identical components. This is the content of the following proposition.\n\nProposition 1. 
Assume h_j1 and h_j2, j_1 < j_2, are base classifiers with s(j_1) ≠ s(j_2), and h_j1(x_i) = h_j2(x_i) for all x_i ∈ X. If α* minimizes the loss function (1), then for any η ∈ [0, min{α*_j1, α*_j2}], α† also minimizes the loss function (1), where α† = α* − ηe_j1 + ηe_j2 and e_j denotes the j-th standard basis vector in R^p.\nProof. h_j1(x_i) = h_j2(x_i) implies that h_α*(x_i) = h_α†(x_i) for all x_i ∈ X.\n\nWhat is desirable is a “grouping effect”, in which components with similar behavior under H receive similar β weights. We will prove that asymptotically, SRB exhibits a “grouping effect”. In particular, for the kernel K = µI − G, with G defined in (4), we will look at the minimizer β* = Qα* of the loss function (2) and, in the spirit of [18], establish a bound on the difference |β*_k1 − β*_k2| of the coefficients on two similar components.\nTo proceed, let α* minimize (3), with β* = Qα*, γ* = −2λKβ*, and the corresponding training instance weights w*. Let H_k denote the subset of base classifiers acting on component k, i.e., H_k = {h_j ∈ H : s(j) = k}. The following lemma is proved in the supplementary material:\nLemma: For any k, 1 ≤ k ≤ n, −γ*_k ≥ max_{h_j ∈ H_k} ∑_{i=1}^m y_i h_j(x_i) w*_i, with equality if β*_k > 0.\nAssuming K = µI − G, with G defined in (4), we have the following result:\nTheorem 2. Let β̄* = Gβ* be the smoothed version of the vector β*. Then for any k_1 and k_2:\n\n|β*_k1 − β*_k2| ≤ (1/µ)|β̄*_k1 − β̄*_k2| + (1/(2λµ)) d(k_1, k_2),   (10)\n\nwhere d(k_1, k_2) = | max_{h_j ∈ H_k1} ∑_{i=1}^m y_i h_j(x_i) w*_i − max_{h_j ∈ H_k2} ∑_{i=1}^m y_i h_j(x_i) w*_i |.\nProof. We prove the following three cases separately:\n(1) β*_k1 and β*_k2 are both positive. In this case, using the lemma on γ*_k1 and γ*_k2 (both with equality) yields |γ*_k1 − γ*_k2| = d(k_1, k_2). Substituting the definition of γ, γ* = 2λGβ* − 2λµβ* = 2λβ̄* − 2λµβ*, yields |(2λβ̄*_k1 − 2λµβ*_k1) − (2λβ̄*_k2 − 2λµβ*_k2)| = d(k_1, k_2). We can then use the triangle inequality on the left-hand side to obtain the result.\n(2) One of β*_k1 and β*_k2 is zero and the other is positive. WLOG assume β*_k1 = 0. Then −γ*_k1 ≥ max_{h_j ∈ H_k1} ∑_{i=1}^m y_i h_j(x_i) w*_i and −γ*_k2 = max_{h_j ∈ H_k2} ∑_{i=1}^m y_i h_j(x_i) w*_i. This gives:\n\nγ*_k1 − γ*_k2 ≤ max_{h_j ∈ H_k2} ∑_{i=1}^m y_i h_j(x_i) w*_i − max_{h_j ∈ H_k1} ∑_{i=1}^m y_i h_j(x_i) w*_i ≤ d(k_1, k_2).\n\nSubstituting the definition of γ yields (2λβ̄*_k1 − 2λµ·0) − (2λβ̄*_k2 − 2λµβ*_k2) ≤ d(k_1, k_2). Therefore 2λµβ*_k2 ≤ (2λβ̄*_k2 − 2λβ̄*_k1) + d(k_1, k_2). 
Using the triangle inequality on the right-hand side of the previous expression yields the result.\n(3) β*_k1 = β*_k2 = 0. In this case, the inequality is obvious.\n\nThe theorem upper bounds the difference in the importance coefficients of two components by the sum of two terms. The first, |β̄*_k1 − β̄*_k2|, takes into account the importance weight of nearby locations. This term is small when the two locations are spatially close, or when they are in two neighborhoods that contain a similar amount of important voxels. The second term reflects the dissimilarity between the two voxels: it measures the difference in the weighted performances of each location's best base classifier. Clearly, d(k_1, k_2) = 0 when components k_1 and k_2 are identical under H over the training instances. More generally, we can sort all the training examples by the activation level on a single component. If sorting on locations k_1 and k_2 yields the same results, then d(k_1, k_2) = 0.\n\n6 Experiments\n\nThe first experiment is gender classification using features located on 58 annotated landmark points in the IMM face data set [19] (Figure 3(a)). For each point we extract the first 3 principal components of a 15×15 window as features. We randomly choose 7 males and 7 females to do leave-one-out 7-fold cross-validation for 100 trials. AdaBoost yields an average classification accuracy of τ = 78.8% with a standard deviation of σ = 19.9%. SRB (λ = 0.1, r = 10 pixel-lengths) achieves τ = 80.5% and σ = 18.7%. The component importance map β of SRB reveals both eyes as discriminating areas and demonstrates the grouping effect. (All experiments in this section use µ = max_j(∑_i G_ij). By (10), a larger µ makes this grouping effect more dominant.) The β for AdaBoost is less smooth and less interpretable, with the most important component on the left chin (Figure 3(b,c)).\n\nFigure 3: Experiment 1. 
(a): an example showing the annotated points; (b-c): the average component importance map β (indicated by the sizes of the circles) after running (b) AdaBoost and (c) SRB for 50 iterations.\n\nFigure 4: Experiment 2. (a-d): example images; (e): example training image with noise; (f): ground truth of discriminative pixels; (g-h): pixels selected by (g) AdaBoost and (h) SRB.\n\nThe second experiment is a binary image classification task. Each image contains the handwritten digits 1, 1, 0, 3 and a random digit, all in fixed locations. Digits 0 and 1 are swapped between the classes (Figure 4(a-d)). The handwritten digit images are from the OCR digits data set [20]. To obtain the training/testing instances we add noise to the images (Figure 4(e)). We test the ability of several algorithms to: (a) find the discriminating pixels, and (b) if a classification algorithm, accurately classify the classes. The quality of pixel selection is measured by a precision-recall curve, with ground truth pixels (Figure 4(f)) selected by a t-test on the two classes of noiseless images. This curve is plotted for the following methods: (1) SRB (λ = 0.5, r = 1/√2 pixel-length); (2) AdaBoost; (3) thresholding the univariate t-test score; (4) thresholding the first one or two principal component(s); (5) thresholding the pixel coefficients in an LDA model with diagonal covariance (Gaussian naive Bayes classifier); (6) the level-set method [6] on a Z-statistics map. We plot the precision-recall curve by varying the number of iterations (for (1), (2)) or the value of the threshold (for (3)-(6)). We also tried all methods with Gaussian spatial pre-smoothing as a preprocessing step. The classification accuracies are measured for methods (1), (2) and (5) on separate test data.\nThe results, averaged over 100 noise realizations, are plotted in Figure 5. 
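To make the pixel-selection metric concrete, here is a small hypothetical helper (not from the paper) that scores one operating point of such a precision-recall curve by selecting the top-k pixels by importance weight β against a ground-truth mask; the paper sweeps boosting iterations or thresholds rather than k, but the scoring is the same:

```python
import numpy as np

def pixel_selection_pr(beta, ground_truth, k):
    """Precision and recall of selecting the top-k pixels by importance
    weight beta against a boolean ground-truth mask of discriminative pixels."""
    selected = np.zeros(beta.shape, dtype=bool)
    selected[np.argsort(beta)[-k:]] = True        # top-k pixels by weight
    tp = np.sum(selected & ground_truth)          # correctly selected pixels
    precision = tp / max(selected.sum(), 1)
    recall = tp / max(ground_truth.sum(), 1)
    return precision, recall
```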
SRB showed no loss of classification accuracy or convergence speed (it usually converged within 100 iterations), and achieved the best pixel selection among all methods. It is better than the Gaussian naive Bayes and PCA methods, even when the noise matches the i.i.d. Gaussian assumption of these methods (Figure 5(a,d)). In all cases, local spatial averaging deteriorates the classification performance of boosting.\nIn the third experiment, subjects watch a movie during the fMRI scan. The classification task is to discriminate two types of scenes (faces and objects) based on the fMRI responses. Each fMRI response is a single TR scan of the brain volume. We divide the data (14 subjects, 26 face and 18 object fMRI responses) into 10 cross-validation groups and average the classification accuracies. SRB (λ = 0.1, r = 5 voxel-lengths) trained for 100 iterations yields accuracy τ = 73.3% with σ = 9.3% across the 14 subjects. AdaBoost yields τ = 75.5% with σ = 4.9%. To make sure this is significant, we repeated the training with shuffled labels. After shuffling, τ = 49.7% with σ = 4.6%, which is effectively chance. We note that spatially regularized boosting yields a more clustered and interpretable selection of voxels. The result for one subject (Figure 6) shows that standard boosting (AdaBoost) selects voxels scattered in the brain, while SRB selects clustered voxels and nicely highlights the relevant FFA area [21] and the posterior central sulcus [22, 23].\n\n7 Conclusions\n\nThe proposed SRB algorithm is applicable to a variety of situations in which one needs to boost the performance of base classifiers with spatial structure. The mechanism of the algorithm has a\n\nFigure 5: Experiment 2. (a-c): test classification accuracy under (a) i.i.d. Gaussian noise, (b) Poisson noise, (c) spatially correlated Gaussian noise. (b,c) share the legend of (a). 
(d-f): pixel selection performances: (d) i.i.d.\nGaussian noise, (e) poisson noise, (f) spatial correlated Gaussian noise. (e,f) share the legend of (d).\n\n(a)\n\n(b)\n\n(c)\n\nFigure 6: Experiment 3: an example: sets of voxels selected by (a) univariate t-test (b) AdaBoost and (c) SRB\n\nnatural interpretation: in each iteration, the algorithm selects a base classi\ufb01er with the best perfor-\nmance evaluated under two sets of weights: weights on training examples (as in AdaBoost) and\nweights on locations. The additional set of location weights encourages or discourages the selection\nof certain base classi\ufb01ers based on the spatial location of base classi\ufb01ers that have already been se-\nlected. Computationally, SRB is as effective as AdaBoost. We demonstrated the effectiveness of the\nalgorithm both by providing a theoretical analysis of the \u201cgrouping effect\u201d and by experiments on\nthree data sets. The grouping effect is clearly demonstrated in the face gender detection experiment.\nIn the OCR classi\ufb01cation experiment, the algorithm shows superior performance in pixel selection\naccuracy without loss of classi\ufb01cation accuracy. The algorithm matches the performance of the\nstate-of-the-art set estimation methods [6] that use a more complex spatial regularization and cycle\nspinning technique. In the fMRI experiment, the algorithm yields a clustered selection of voxels in\npositions relevant to the task. An alternative approach, being explored, is to combine searchlight [9]\nwith a strong learning algorithm (e.g. SVM) to integrate spatial locality and accurate classi\ufb01cation.\n\n8 Acknowledgments\n\nThe authors thank Princeton University\u2019s J. 
Insley Blair Pyne Fund for seed research funding.

References

[1] K.J. Friston, J. Ashburner, J. Heather, et al. Statistical parametric mapping. Neuroscience Databases: A Practical Guide, page 237, 2003.

[2] R. Heller, D. Stanley, D. Yekutieli, N. Rubin, and Y. Benjamini. Cluster-based analysis of fMRI data. NeuroImage, 33(2):599-608, 2006.

[3] D. Van De Ville, T. Blu, and M. Unser. Integrated wavelet processing and spatial statistical testing of fMRI data. NeuroImage, 23(4):1472-1485, 2004.

[4] D. Van De Ville, M.L. Seghier, F. Lazeyras, T. Blu, and M. Unser. WSPM: Wavelet-based statistical parametric mapping. NeuroImage, 37(4):1205-1217, 2007.

[5] Z. Harmany, R. Willett, A. Singh, and R. Nowak. Controlling the error in fMRI: Hypothesis testing or set estimation? In Biomedical Imaging, 5th IEEE International Symposium on, pages 552-555, 2008.

[6] R.M. Willett and R.D. Nowak.
Minimax optimal level-set estimation. IEEE Transactions on Image Processing, 16(12):2965-2979, 2007.

[7] J.V. Haxby, M.I. Gobbini, M.L. Furey, A. Ishai, J.L. Schouten, and P. Pietrini. Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science, 293(5539):2425-2430, 2001.

[8] K.A. Norman, S.M. Polyn, G.J. Detre, and J.V. Haxby. Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends in Cognitive Sciences, 10(9):424-430, 2006.

[9] N. Kriegeskorte, R. Goebel, and P. Bandettini. Information-based functional brain mapping. Proceedings of the National Academy of Sciences, 103(10):3863-3868, 2006.

[10] V. Koltchinskii, M. Martínez-Ramón, and S. Posse. Optimal aggregation of classifiers and boosting maps in functional magnetic resonance imaging. Advances in Neural Information Processing Systems, 17:705-712, 2005.

[11] M. Martínez-Ramón, V. Koltchinskii, G.L. Heileman, and S. Posse. fMRI pattern classification using neuroanatomically constrained boosting. NeuroImage, 31(3):1129-1141, 2006.

[12] M.K. Carroll, K.A. Norman, J.V. Haxby, and R.E. Schapire. Exploiting spatial information to improve fMRI pattern classification. In 12th Annual Meeting of the Organization for Human Brain Mapping, Florence, Italy, 2006.

[13] J.H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5):1189-1232, 2001.

[14] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23-37, 1995.

[15] C. Rudin, I. Daubechies, and R.E. Schapire. The dynamics of AdaBoost: Cyclic behavior and convergence of margins. Journal of Machine Learning Research, 5(2):1557, 2005.

[16] Z.J. Xiang and P.J. Ramadge. Sparse boosting.
In IEEE International Conference on Acoustics, Speech and Signal Processing, 2009.

[17] T. Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In Proc. Neural Information Processing Systems, 2008.

[18] H. Zou and T. Hastie. Regression shrinkage and selection via the elastic net, with applications to microarrays. J. R. Statist. Soc. B, 2004.

[19] M.M. Nordstrøm, M. Larsen, J. Sierakowski, and M.B. Stegmann. The IMM face database: an annotated dataset of 240 face images. Technical report, DTU Informatics, Building 321, 2004.

[20] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007.

[21] N. Kanwisher, J. McDermott, and M.M. Chun. The fusiform face area: a module in human extrastriate cortex specialized for face perception. Journal of Neuroscience, 17(11):4302-4311, 1997.

[22] U. Hasson, M. Harel, I. Levy, and R. Malach. Large-scale mirror-symmetry organization of human occipito-temporal object areas. Neuron, 37(6):1027-1041, 2003.

[23] U. Hasson, Y. Nir, I. Levy, G. Fuhrmann, and R. Malach. Intersubject synchronization of cortical activity during natural vision. Science, 303(5664):1634-1640, 2004.