{"title": "Learning brain regions via large-scale online structured sparse dictionary learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4610, "page_last": 4618, "abstract": "We propose a multivariate online dictionary-learning method for obtaining decompositions of brain images with structured and sparse components (aka atoms). Sparsity is to be understood in the usual sense: the dictionary atoms are constrained to contain mostly zeros. This is imposed via an $\\ell_1$-norm constraint. By \"structured\", we mean that the atoms are piece-wise smooth and compact, thus making up blobs, as opposed to scattered patterns of activation. We propose to use a Sobolev (Laplacian) penalty to impose this type of structure. Combining the two penalties, we obtain decompositions that properly delineate brain structures from functional images. This non-trivially extends the online dictionary-learning work of Mairal et al. (2010), at the price of only a factor of 2 or 3 on the overall running time. Just like the Mairal et al. (2010) reference method, the online nature of our proposed algorithm allows it to scale to arbitrarily sized datasets. 
Experiments on brain data show that our proposed method extracts structured and denoised dictionaries that are more interpretable and better capture inter-subject variability in small, medium, and large-scale regimes alike, compared to state-of-the-art models.", "full_text": "Learning brain regions via large-scale online structured sparse dictionary-learning\n\nElvis Dohmatob, Arthur Mensch, Gael Varoquaux, Bertrand Thirion\n\nParietal Team, INRIA / CEA, Neurospin, Universit\u00e9 Paris-Saclay, France\n\nfirstname.lastname@inria.fr\n\nAbstract\n\nWe propose a multivariate online dictionary-learning method for obtaining decompositions of brain images with structured and sparse components (aka atoms). Sparsity is to be understood in the usual sense: the dictionary atoms are constrained to contain mostly zeros. This is imposed via an $\\ell_1$-norm constraint. By \"structured\", we mean that the atoms are piece-wise smooth and compact, thus making up blobs, as opposed to scattered patterns of activation. We propose to use a Sobolev (Laplacian) penalty to impose this type of structure. Combining the two penalties, we obtain decompositions that properly delineate brain structures from functional images. This non-trivially extends the online dictionary-learning work of Mairal et al. (2010), at the price of only a factor of 2 or 3 on the overall running time. Just like the Mairal et al. (2010) reference method, the online nature of our proposed algorithm allows it to scale to arbitrarily sized datasets. Preliminary experiments on brain data show that our proposed method extracts structured and denoised dictionaries that are more interpretable and better capture inter-subject variability in small, medium, and large-scale regimes alike, compared to state-of-the-art models.\n\n1 Introduction\n\nIn neuro-imaging, inter-subject variability is often handled as a statistical residual and discarded. 
Yet there is evidence that it displays structure and contains important information. Univariate models are ineffective both computationally and statistically, due to the large number of voxels compared to the number of subjects. Likewise, statistical analysis of weak effects on medical images often relies on defining regions of interest (ROIs). For instance, pharmacology with Positron Emission Tomography (PET) often studies metabolic processes in specific organ sub-parts that are defined from anatomy. Population-level tests of tissue properties, such as diffusion, or simply their density, are performed on ROIs adapted to the spatial impact of the pathology of interest. Also, in functional brain imaging, e.g. functional magnetic resonance imaging (fMRI), ROIs must be adapted to the cognitive process under study, and are often defined by the very activation elicited by a closely related process [18]. ROIs can boost statistical power by reducing the multiple comparisons that plague image-based statistical testing. If they are defined to match spatially the differences to detect, they can also improve the signal-to-noise ratio by averaging related signals. However, the crux of the problem is how to define these ROIs in a principled way. Indeed, standard approaches to region definition imply a segmentation step. Segmenting structures in individual statistical maps, as in fMRI, typically yields meaningful units, but is limited by the noise inherent to these maps. Relying on a different imaging modality hits cross-modality correspondence problems.\n\nSketch of our contributions. In this manuscript, we propose to use the variability of the statistical maps across the population to define regions. 
This idea is reminiscent of clustering approaches, which have been employed to define spatial units for quantitative analysis of information as diverse as brain fiber tracking, brain activity, brain structure, or even imaging-genetics. See [21, 14] and references therein. The key idea is to group together features \u2013voxels of an image, vertices on a mesh, fiber tracts\u2013 based on the quantity of interest, to create regions \u2013or fiber bundles\u2013 for statistical analysis.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nHowever, unlike clustering, which models each observation as an instance of a cluster, we use a model closer to the signal, where each observation is a linear mixture of several signals. The model is closer to mode finding, as in principal component analysis (PCA) or independent component analysis (ICA), often used in brain imaging to extract functional units [5]. Yet, an important constraint is that the modes should be sparse and spatially localized. For this purpose, the problem can be reformulated as a linear decomposition problem like ICA/PCA, with appropriate spatial and sparse penalties [25, 1]. We propose a multivariate online dictionary-learning method for obtaining decompositions with structured and sparse components (aka atoms). Sparsity is to be understood in the usual sense: the atoms contain mostly zeros. This is imposed via an $\\ell_1$ penalty on the atoms. By \"structured\", we mean that the atoms are piece-wise smooth and compact, thus making up blobs, as opposed to scattered patterns of activation. We impose this type of structure via a Laplacian penalty on the dictionary atoms. Combining the two penalties, we therefore obtain decompositions that are closer to the known functional organization of the brain. This non-trivially extends the online dictionary-learning work [16], with only a factor of 2 or 3 on the running time. 
By means of experiments on a large public dataset, we show the improvements brought by the spatial regularization with respect to traditional $\\ell_1$-regularized dictionary learning. We also provide a concise study of the impact of hyper-parameter selection on this problem and describe the optimality regime, based on relevant criteria (reproducibility, captured variability, explanatory power in prediction problems).\n\n2 Smooth Sparse Online Dictionary-Learning (Smooth-SODL)\n\nConsider a stack $X \\in \\mathbb{R}^{n \\times p}$ of $n$ subject-level brain images $X_1, X_2, \\ldots, X_n$, each of shape $n_1 \\times n_2 \\times n_3$, seen as $p$-dimensional row vectors, with $p = n_1 \\times n_2 \\times n_3$ the number of voxels. These could be images of fMRI activity patterns like statistical parametric maps of brain activation, raw pre-registered (into a common coordinate space) fMRI time-series, PET images, etc. We would like to decompose these images as a mixture of $k \\le \\min(n, p)$ component maps (aka latent factors or dictionary atoms) $V_1, \\ldots, V_k \\in \\mathbb{R}^{p \\times 1}$ and modulation coefficients $U_1, \\ldots, U_n \\in \\mathbb{R}^{k \\times 1}$ called codes (one $k$-dimensional code per sample point), i.e.\n\n$$X_i \\approx V U_i, \\text{ for } i = 1, 2, \\ldots, n, \\quad (1)$$\n\nwhere $V := [V_1 | \\ldots | V_k] \\in \\mathbb{R}^{p \\times k}$ is an unknown dictionary to be estimated. Typically, $p \\sim 10^5$\u2013$10^6$ (in full-brain high-resolution fMRI) and $n \\sim 10^2$\u2013$10^5$ (for example, when considering all 500 subjects and all the functional tasks of the Human Connectome Project dataset [20]). Our work handles the extreme case where both $n$ and $p$ are large (massive-data setting). It is reasonable then to only consider under-complete dictionaries: $k \\le \\min(n, p)$. Typically, we use $k \\sim 50$ or 100 components. 
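For concreteness, the shapes involved in the factorization above can be sketched in a few lines of numpy (toy sizes and variable names are purely illustrative, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 1000, 10  # toy sizes; the paper has p ~ 1e5-1e6, k ~ 50-100

V = rng.standard_normal((p, k))  # dictionary: one atom V_j per column
U = rng.standard_normal((k, n))  # codes: one k-dimensional code U_i per subject
X = (V @ U).T                    # stack of n subject images, one p-vector per row

# Row i of X is exactly V @ U_i here; real data only satisfy X_i ~ V U_i.
assert X.shape == (n, p)
assert np.allclose(X[0], V @ U[:, 0])
```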
It should be noted that online optimization is not only crucial in the case where n/p is big; it is relevant whenever $n$ is large, leading to prohibitive memory issues irrespective of how big or small $p$ is.\n\nAs explained in Section 1, we want the component maps (aka dictionary atoms) $V_j$ to be sparse and spatially smooth. A principled way to achieve such a goal is to impose a boundedness constraint on $\\ell_1$-like norms of these maps to achieve sparsity, and simultaneously impose smoothness by penalizing their Laplacian. Thus, we propose the following penalized dictionary-learning model:\n\n$$\\min_{\\substack{V \\in \\mathbb{R}^{p \\times k} \\\\ V_1, \\ldots, V_k \\in C}} \\; \\lim_{n \\to \\infty} \\frac{1}{n} \\sum_{i=1}^{n} \\left( \\min_{U_i \\in \\mathbb{R}^k} \\frac{1}{2} \\|X_i - V U_i\\|_2^2 + \\frac{1}{2} \\alpha \\|U_i\\|_2^2 \\right) + \\gamma \\sum_{j=1}^{k} \\Omega_{\\text{Lap}}(V_j). \\quad (2)$$\n\nThe ingredients in the model can be broken down as follows:\n\n\u2022 Each of the terms $\\min_{U_i \\in \\mathbb{R}^k} \\frac{1}{2} \\|X_i - V U_i\\|_2^2 + \\frac{1}{2} \\alpha \\|U_i\\|_2^2$ measures how well the current dictionary $V$ explains the data $X_i$ from subject $i$. The Ridge penalty term $\\varphi(U_i) \\equiv \\frac{1}{2} \\alpha \\|U_i\\|_2^2$ on the codes amounts to assuming that the energy of the decomposition is spread across the different samples. In the context of a specific neuro-imaging problem, if there are good grounds to assume that each sample / subject should be sparsely encoded across only a few atoms of the dictionary, then we can use the $\\ell_1$ penalty $\\varphi(U_i) := \\alpha \\|U_i\\|_1$ as in [16]. We note that, in contrast to the $\\ell_1$ penalty, the Ridge leads to stable codes. 
The parameter $\\alpha > 0$ controls the amount of penalization on the codes.\n\n\u2022 The constraint set $C$ is a sparsity-inducing, compact, simple (mainly in the sense that the Euclidean projection onto $C$ should be easy to compute) convex subset of $\\mathbb{R}^p$, like an $\\ell_1$-ball $B_{p,\\ell_1}(\\tau)$ or a simplex $S_p(\\tau)$, defined respectively as\n\n$$B_{p,\\ell_1}(\\tau) := \\{ v \\in \\mathbb{R}^p \\text{ s.t. } |v_1| + \\ldots + |v_p| \\le \\tau \\}, \\quad \\text{and} \\quad S_p(\\tau) := B_{p,\\ell_1}(\\tau) \\cap \\mathbb{R}^p_+. \\quad (3)$$\n\nOther choices (e.g. the ElasticNet ball) are of course possible. The radius parameter $\\tau > 0$ controls the amount of sparsity: smaller values lead to sparser atoms.\n\n\u2022 Finally, $\\Omega_{\\text{Lap}}$ is the 3D Laplacian regularization functional defined by\n\n$$\\Omega_{\\text{Lap}}(v) := \\frac{1}{2} \\sum_{l=1}^{p} (\\nabla_x v)_l^2 + (\\nabla_y v)_l^2 + (\\nabla_z v)_l^2 = \\frac{1}{2} v^T \\Delta v \\ge 0, \\; \\forall v \\in \\mathbb{R}^p, \\quad (4)$$\n\n$\\nabla_x$ being the discrete spatial gradient operator along the x-axis (a p-by-p matrix), $\\nabla_y$ along the y-axis, etc., and $\\Delta := \\nabla^T \\nabla$ the p-by-p matrix representing the discrete Laplacian operator. This penalty is meant to impose blobs. The regularization parameter $\\gamma \\ge 0$ controls how much regularization we impose on the atoms, compared to the reconstruction error.\n\nThe above formulation, which we dub Smooth Sparse Online Dictionary-Learning (Smooth-SODL), is inspired by, and generalizes, the standard online dictionary-learning framework of [16] \u2013henceforth referred to as Sparse Online Dictionary-Learning (SODL)\u2013 which corresponds to the special case $\\gamma = 0$.\n\n3 Estimating the model\n\n3.1 Algorithms\n\nThe objective function in problem (2) is separately convex and block-separable w.r.t. each of $U$ and $V$, but is not jointly convex in $(U, V)$. Also, it is continuously differentiable on the constraint set, which is compact and convex. 
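To make the Laplacian penalty concrete, here is a minimal numerical sketch of Omega_Lap for a flattened 3D image, using forward finite differences via numpy (the finite-difference and boundary conventions are simplifying assumptions of this illustration, not the exact operator of the paper):

```python
import numpy as np

def omega_lap(v, shape):
    """0.5 * (||grad_x v||^2 + ||grad_y v||^2 + ||grad_z v||^2) for a
    flattened 3D image v, with forward differences (np.diff) along each
    axis; the boundary handling here is a simplification."""
    img = np.asarray(v).reshape(shape)
    return 0.5 * sum(np.sum(np.diff(img, axis=ax) ** 2)
                     for ax in range(img.ndim))

# A spatially constant image incurs zero penalty; a rough one does not.
assert omega_lap(np.ones(64), (4, 4, 4)) == 0.0
assert omega_lap(np.random.default_rng(0).standard_normal(64), (4, 4, 4)) > 0.0
# 1D sanity check: v = (0, 1, 0, 0) has squared forward differences (1, 1, 0).
assert omega_lap(np.array([0.0, 1.0, 0.0, 0.0]), (1, 1, 4)) == 1.0
```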
Thus, by classical results (e.g. Bertsekas [6]), the problem can be solved via Block-Coordinate Descent (BCD) [16]. Reasoning along the lines of [15], we derive that the BCD iterates are as given in Alg. 1, in which, for each incoming sample point $X_t$, the loading vector $U_t$ is computed by solving a ridge-regression problem (5) with the current dictionary $V_t$ held fixed, and the dictionary atoms are then updated sequentially via Alg. 2. A crucial advantage of using a BCD scheme is that it is parameter-free: there is no step size to tune. The resulting algorithm, Alg. 1, is adapted from [16]. It relies on Alg. 2 for performing the structured dictionary updates, the details of which are discussed below.\n\nAlgorithm 1 Online algorithm for the dictionary-learning problem (2)\nRequire: Regularization parameters $\\alpha, \\gamma > 0$; initial dictionary $V \\in \\mathbb{R}^{p \\times k}$; number of passes / iterations $T$ on the data.\n1: $A_0 \\leftarrow 0 \\in \\mathbb{R}^{k \\times k}$, $B_0 \\leftarrow 0 \\in \\mathbb{R}^{p \\times k}$ (historical \u201csufficient statistics\u201d)\n2: for t = 1 to T do\n3: Empirically draw a sample point $X_t$ at random.\n4: Code update (ridge regression, via SVD of the current dictionary $V$): $U_t \\leftarrow \\operatorname{argmin}_{u \\in \\mathbb{R}^k} \\frac{1}{2} \\|X_t - Vu\\|_2^2 + \\frac{1}{2} \\alpha \\|u\\|_2^2$. (5)\n5: Rank-1 updates: $A_t \\leftarrow A_{t-1} + U_t U_t^T$, $B_t \\leftarrow B_{t-1} + X_t U_t^T$.\n6: BCD dictionary update: compute the update for the dictionary $V$ using Alg. 2.\n7: end for\n\nUpdate of the codes: Ridge-coding. The Ridge sub-problem for updating the codes,\n\n$$U_t = (V^T V + \\alpha I)^{-1} V^T X_t, \\quad (6)$$\n\nis computed via an SVD of the current dictionary $V$. For $\\alpha \\approx 0$, $U_t$ reduces to the orthogonal projection of $X_t$ onto the image of the current dictionary $V$. 
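The code update (6) and the rank-1 updates of the sufficient statistics in Alg. 1 can be sketched as follows (a plain linear solve stands in for the SVD-based computation; names and sizes are illustrative):

```python
import numpy as np

def ridge_code(x, V, alpha):
    """Closed-form ridge code U_t = (V^T V + alpha I)^{-1} V^T x_t, cf. (6).
    (The paper computes this via an SVD of V; a direct solve is used here.)"""
    k = V.shape[1]
    return np.linalg.solve(V.T @ V + alpha * np.eye(k), V.T @ x)

rng = np.random.default_rng(0)
p, k, alpha = 50, 5, 0.1
V = rng.standard_normal((p, k))
x = rng.standard_normal(p)          # one incoming sample X_t

u = ridge_code(x, V, alpha)
# u solves the ridge normal equations (V^T V + alpha I) u = V^T x:
assert np.allclose((V.T @ V + alpha * np.eye(k)) @ u, V.T @ x)

# Rank-1 "sufficient statistics" updates (line 5 of Alg. 1):
A = np.zeros((k, k)) + np.outer(u, u)   # A_t = A_{t-1} + U_t U_t^T
B = np.zeros((p, k)) + np.outer(x, u)   # B_t = B_{t-1} + X_t U_t^T
```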
As in [16], we speed up the overall algorithm by sampling mini-batches of $\\eta$ samples $X_1, \\ldots, X_\\eta$ and computing the corresponding codes $U_1, U_2, \\ldots, U_\\eta$ at once. We typically use mini-batches of size $\\eta = 20$.\n\nBCD dictionary update for the dictionary atoms. Let us define the time-varying matrices $A_t := \\sum_{i=1}^{t} U_i U_i^T \\in \\mathbb{R}^{k \\times k}$ and $B_t := \\sum_{i=1}^{t} X_i U_i^T \\in \\mathbb{R}^{p \\times k}$, where $t = 1, 2, \\ldots$ denotes time. We fix the matrix of codes $U$, and for each $j$, consider the update of the $j$th dictionary atom, with all the other atoms $V_{k \\ne j}$ kept fixed. The update for the atom $V_j$ can then be written as\n\n$$V_j = \\operatorname{argmin}_{v \\in C,\\, V = [V_1|\\ldots|v|\\ldots|V_k]} \\frac{1}{t} \\sum_{i=1}^{t} \\frac{1}{2} \\|X_i - V U_i\\|_2^2 + \\gamma\\, \\Omega_{\\text{Lap}}(v) = \\operatorname{argmin}_{v \\in C} F_{\\gamma (A_t[j,j]/t)^{-1}}\\left(v,\\; V_j + A_t[j,j]^{-1}(B_t^j - V A_t^j)\\right), \\quad (7)$$\n\nwhere $F_{\\tilde{\\gamma}}(v, a) \\equiv \\frac{1}{2} \\|v - a\\|_2^2 + \\tilde{\\gamma}\\, \\Omega_{\\text{Lap}}(v) = \\frac{1}{2} \\|v - a\\|_2^2 + \\frac{1}{2} \\tilde{\\gamma}\\, v^T \\Delta v$; refer to [16] for the details of the derivation.\n\nAlgorithm 2 BCD dictionary update with Laplacian prior\nRequire: $V = [V_1 | \\ldots | V_k] \\in \\mathbb{R}^{p \\times k}$ (input dictionary); $A_t = [A_t^1 | \\ldots | A_t^k] \\in \\mathbb{R}^{k \\times k}$, $B_t = [B_t^1 | \\ldots | B_t^k] \\in \\mathbb{R}^{p \\times k}$ (history)\n1: while stopping criteria not met, do\n2: for j = 1 to k do\n3: Fix the code $U$ and all atoms $k \\ne j$ of the dictionary $V$, and then update $V_j$ as follows:\n$V_j \\leftarrow \\operatorname{argmin}_{v \\in C} F_{\\gamma (A_t[j,j]/t)^{-1}}(v, V_j + A_t[j,j]^{-1}(B_t^j - V A_t^j))$ (8)\n(see below for details on the derivation and the resolution of this problem).\n4: end for\n5: end while\n\nProblem (7) is the compactly-constrained minimization of the 1-strongly-convex quadratic function $F_{\\tilde{\\gamma}}(\\cdot, a) : \\mathbb{R}^p \\to \\mathbb{R}$ defined above. This problem can further be identified with a denoising instance (i.e. one in which the design matrix / deconvolution operator is the identity operator) of the GraphNet model [11, 13]. 
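This denoising sub-problem can be solved by accelerated projected gradient descent. The following sketch illustrates the idea on a 1D stand-in for the 3D Laplacian, with a sorting-based l1-ball projection as the projection step (in practice the exact linear-time projections of [7, 9] would be used; this is an illustration under those simplifying assumptions, not the authors' implementation):

```python
import numpy as np

def project_l1_ball(v, tau):
    """Euclidean projection onto {v : ||v||_1 <= tau}. Sorting-based
    O(p log p) variant; exact linear-time algorithms exist [7, 9]."""
    if np.abs(v).sum() <= tau:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    j = np.arange(1, len(u) + 1)
    rho = np.nonzero(u - (css - tau) / j > 0)[0][-1]
    theta = (css[rho] - tau) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def graphnet_denoise(a, gamma, tau, n_iter=200):
    """Projected FISTA for min_{||v||_1 <= tau} 0.5*||v - a||^2
    + 0.5 * gamma * v^T Delta v, with Delta the 1D discrete Laplacian
    (a 1D stand-in for the paper's 3D operator)."""
    p = len(a)
    D = np.diff(np.eye(p), axis=0)   # forward differences, shape (p-1, p)
    lap = D.T @ D                    # Delta = grad^T grad
    L = 1.0 + 4.0 * gamma            # Lipschitz bound 1 + 4*D*gamma, D=1 here
    v = np.zeros(p)
    z = v.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = (z - a) + gamma * (lap @ z)
        v_new = project_l1_ball(z - grad / L, tau)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = v_new + ((t - 1.0) / t_new) * (v_new - v)
        v, t = v_new, t_new
    return v

a = np.random.default_rng(0).standard_normal(30)
v = graphnet_denoise(a, gamma=0.1, tau=3.0)
assert np.abs(v).sum() <= 3.0 + 1e-8          # iterate stays feasible
# With gamma = 0 the update reduces to a plain projection of a onto the ball:
assert np.allclose(graphnet_denoise(a, 0.0, 3.0), project_l1_ball(a, 3.0))
```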
Fast first-order methods like FISTA [4], with the optimal $O(\\sqrt{L/\\epsilon})$ iteration complexity, are available\u00b9 for solving such problems to arbitrary precision $\\epsilon > 0$. One computes the Lipschitz constant to be $L_{F_{\\tilde{\\gamma}}(\\cdot, a)} \\equiv 1 + \\tilde{\\gamma} L_{\\Omega_{\\text{Lap}}} = 1 + 4D\\tilde{\\gamma}$, where, as before, $D$ is the number of spatial dimensions ($D = 3$ for volumetric images). One should also mention that, under certain circumstances, it is possible to perform the dictionary updates in the Fourier domain, via FFT. This alternative approach is detailed in the supplementary materials.\n\nFinally, one notes that, since the constraints in problem (2) are separable in the dictionary atoms $V_j$, the BCD dictionary-update algorithm Alg. 2 is guaranteed to converge to a global optimum at each iteration [6, 16].\n\nHow difficult is the dictionary update for our proposed model? A favorable property of vanilla dictionary-learning [16] is that the BCD dictionary updates amount to Euclidean projections onto the constraint set $C$, which can be easily computed for a variety of choices (simplexes, closed convex balls, etc.). One may then ask: do we retain a comparable algorithmic simplicity even with the additional Laplacian terms $\\Omega_{\\text{Lap}}(V_j)$? Yes: empirically, we found that 1 or 2 iterations of FISTA [4] are sufficient to reach an accuracy of $10^{-6}$ in problem (7), which is sufficient to obtain a good decomposition in the overall algorithm.\n\nHowever, choosing $\\gamma$ \u201ctoo large\u201d will provably cause the dictionary updates to eventually take forever to run. 
Indeed, the Lipschitz constant in problem (7) is $L_t = 1 + 4D\\gamma (A_t[j,j]/t)^{-1}$, which will blow up (leading to arbitrarily small step-sizes) unless $\\gamma$ is chosen so that\n\n$$\\gamma = \\gamma_t = O\\left( \\max_{1 \\le j \\le k} A_t[j,j]/t \\right) = O\\left( \\max_{1 \\le j \\le k} \\frac{1}{t} \\sum_{i=1}^{t} U_i[j]^2 \\right) = O(\\|A_t\\|_{\\infty,\\infty}/t). \\quad (9)$$\n\n\u00b9For example, see [8, 24], implemented as part of the Nilearn open-source Python library [2].\n\nFinally, the Euclidean projections onto the $\\ell_1$-ball $C$ can be computed exactly in linear time $O(p)$ (see for example [7, 9]). The dictionary atoms $j = 1, \\ldots, k$ are repeatedly cycled through and problem (7) solved. All in all, in practice we observe that a single iteration is sufficient for the dictionary-update sub-routine in Alg. 2 to converge to a qualitatively good dictionary.\n\nConvergence of the overall algorithm. The convergence of our algorithm (to a local optimum) is guaranteed since all hypotheses of [16] are satisfied. For example, assumption (A) is satisfied because fMRI data are naturally compactly supported. Assumption (C) is satisfied since the ridge-regression problem (5) has a unique solution. More details are provided in the supplementary materials.\n\n3.2 Practical considerations\n\nHyper-parameter tuning. Parameter-selection in dictionary-learning is known to be a difficult, unsolved problem [16, 15], and our proposed model (2) is no exception to this rule. We did an extensive study of how the quality of the estimated dictionary varies with the model hyper-parameters $(\\alpha, \\gamma, \\tau)$. The experimental setup is described in Section 5. The results are presented in Fig. 1. We make the following observations: taking the sparsity parameter $\\tau$ in (2) too large leads to dense atoms that perfectly explain the data but are not very interpretable. 
Taking it too small leads to overly sparse maps that barely explain the data. The normalized sparsity metric (smaller is better, ceteris paribus) is defined as the mean ratio $\\|V_j\\|_1 / \\|V_j\\|_2$ over the dictionary atoms.\n\nFigure 1: Influence of model parameters. In the experiments, $\\alpha$ was chosen according to (10). Left: percentage explained variance of the decomposition, measured on a left-out data split. Right: average normalized sparsity of the dictionary atoms.\n\nConcerning the $\\alpha$ parameter, inspired by [26], we have found the following time-varying, data-adaptive choice to work very well in practice:\n\n$$\\alpha = \\alpha_t \\sim t^{-1/2}. \\quad (10)$$\n\nLikewise, care must be taken in selecting the Laplacian regularization parameter $\\gamma$. Indeed, taking it too small amounts to doing vanilla dictionary-learning [16]. Taking it too large can lead to degenerate maps, as the spatial regularization then dominates the reconstruction error (data-fidelity) term. We find that there is a safe range of the parameter pair $(\\gamma, \\tau)$ in which a good compromise between the sparsity of the dictionary (and thus its interpretability) and its power to explain the data can be reached. See Fig. 1. K-fold cross-validation with the explained-variance metric was retained as a good strategy for setting the Laplacian regularization parameter $\\gamma$ and the sparsity parameter $\\tau$.\n\nInitialization of the dictionary. Problem (2) is non-convex jointly in $(U, V)$, and so initialization might be a crucial issue. 
However, in our experiments, we have observed that even randomly initialized dictionaries eventually produce sensible results that do not jitter much across different runs of the same experiment.\n\n4 Related works\n\nWhile there exist algorithms for online sparse dictionary-learning that are very efficient in large-scale settings (for example [16], or more recently [17]), imposing spatial structure introduces couplings in the corresponding optimization problem [8]. So far, spatially-structured decompositions have been solved by very slow alternated optimization [25, 1]. Notably, structured priors such as TV-$\\ell_1$ minimization [3] were used by [1] to extract data-driven state-of-the-art atlases of brain function. However, alternated minimization is very slow, and large-scale medical imaging has shifted to online solvers for dictionary-learning like [16] and [17]. These do not readily integrate structured penalties. As a result, the use of structured decompositions has been limited so far by the computational cost of the resulting algorithms. Our approach instead uses a Laplacian penalty to impose spatial structure at a very minor cost and adapts the online dictionary-learning framework [16], resulting in a fast and scalable structured decomposition. Second, the approach in [1], though very novel, is mostly heuristic. 
In contrast, our method enjoys the same convergence guarantees and comparable numerical complexity as the basic unstructured online dictionary-learning [16].\n\nFinally, one should also mention [23], which introduced an online group-level functional brain mapping strategy for differentiating regions reflecting the variety of brain network configurations observed in the population, by learning a sparse representation of these in the spirit of [16].\n\n5 Experiments\n\nSetup. Our experiments were done on task fMRI data from 500 subjects from the HCP \u2013Human Connectome Project\u2013 dataset [20]. These task fMRI data were acquired in an attempt to assess major domains that are thought to sample the diversity of neural systems of interest in functional connectomics. We studied the activation maps related to a task that involves language (story understanding) and mathematics (mental computation). This particular task is expected to outline number, attentional, and language networks, but the variability modes observed in the population cover even wider cognitive systems. For the experiments, mass-univariate General Linear Models (GLMs) [10] for $n = 500$ subjects were estimated for the Math vs Story contrast (language protocol), and the corresponding full-brain Z-score maps, each containing $p = 2.6 \\times 10^5$ voxels, were used as the input data $X \\in \\mathbb{R}^{n \\times p}$; we sought a decomposition into a dictionary of $k = 40$ atoms (components). The input data $X$ were shuffled and then split into two groups of the same size.\n\nModels compared and metrics. We compared our proposed Smooth-SODL model (2) against both Canonical ICA \u2013CanICA [22], a single-batch multi-subject PCA/ICA-based method\u2013 and the standard SODL (sparse online dictionary-learning) [16]. While the CanICA model accounts for subject-to-subject differences, one of its major limitations is that it does not model spatial variability across subjects. 
Thus we estimated the CanICA components on smoothed data (isotropic FWHM of 6mm), a necessary preprocessing step for such methods. In contrast, we did not perform pre-smoothing for the SODL or Smooth-SODL models. The different models were compared across a variety of qualitative and quantitative metrics: visual quality of the dictionaries obtained, explained variance, stability of the dictionary atoms, their reproducibility, and performance of the dictionaries in predicting behavioral scores (IQ, picture vocabulary, reading proficiency, etc.) shipped with the HCP data [20]. For both SODL [16] and our proposed Smooth-SODL model, the constraint set for the dictionary atoms was taken to be a simplex $C := S_p(\\tau)$ (see Section 2 for the definition). The results of these experiments are presented in Fig. 2 and Tab. 1.\n\n6 Results\n\nRunning time. On the computational side, the vanilla dictionary-learning SODL algorithm [16] with a batch size of $\\eta = 20$ took about 110s ($\\approx$ 1.7 minutes) to run, whilst with the same batch size, our proposed Smooth-SODL model (2) implemented in Alg. 1 took 340s ($\\approx$ 5.6 minutes), which is slightly less than 3 times slower than SODL. Finally, CanICA [22] for this experiment took 530s ($\\approx$ 8.8 minutes) to run, which is about 5 times slower than the SODL model and 1.6 times slower than our proposed Smooth-SODL (2) model. All experiments were run on a single CPU of a laptop.\n\nQualitative assessment of dictionaries. As can be seen in Fig. 2(a), all methods recover dictionary atoms that represent known functional brain organization; notably, the dictionaries all contain the well-known executive control and attention networks, at least in part. Vanilla dictionary-learning leverages the denoising properties of the $\\ell_1$ sparsity constraint, but the voxel clusters are not very structured. For example, most blobs are surrounded with a thick ring of very small nonzero values. 
In contrast, our proposed regularization model yields dictionary atoms that are both sparse and structured: more spatially coherent and less noisy.\n\nIn contrast to both SODL and Smooth-SODL, CanICA [22] is an ICA-based method that enforces no notion of sparsity whatsoever. The results are therefore dense and noisy dictionary atoms that explain the data very well (Fig. 2(b)) but which are completely uninterpretable. In a futile attempt to remedy the situation, in practice such PCA/ICA-based methods (including FSL\u2019s MELODIC tool [19]) are hard-thresholded in order to see information. For CanICA, the hard-thresholded version has been named tCanICA in Fig. 2.\n\n(a) Qualitative comparison of the estimated dictionaries. Each column represents an atom of the estimated dictionary, where atoms from the different models (the rows of the plots) have been matched via a Hungarian algorithm. Here, we only show a limited number of the most \u201cinterpretable\u201d atoms. Notice how the major structures in each atom are reproducible across the different models. Maps corresponding to hard-thresholded CanICA [22] components have also been included, and have been called tCanICA. In contrast, the maps from SODL [16] and our proposed Smooth-SODL (2) have not been thresholded.\n\n(b) Mean explained variance of the different models on both training data and test (left-out) data. N.B.: bold bars represent performance on the test set, while faint bars in the background represent performance on the train set.\n\n(c) Predicting behavioral variables of the HCP [20] dataset using subject-level Z-maps. N.B.: bold bars represent performance on the test set, while faint bars in the background represent performance on the train set.\n\nFigure 2: Main results. Benchmarking our proposed Smooth-SODL (2) model against competing state-of-the-art methods like SODL (sparse online dictionary-learning) [16] and CanICA [22].\n\n
That notwithstanding, notice how the major structures (parietal lobes, sulci, etc.) in each atom are reproducible across the different models.\n\nStability-fidelity trade-offs. PCA/ICA-based methods like CanICA [22] and MELODIC [19] are the optimal linear decomposition methods for maximizing explained variance on a dataset. On the training set, CanICA [22] out-performs all other algorithms, with about 66% (resp. 50% for SODL [16] and 58% for Smooth-SODL) explained variance on the training set, and 60% (resp. 49% for SODL and 55% for Smooth-SODL) on left-out (test) data. See Fig. 2(b). However, as noted in the above paragraph, such methods lead to dictionaries that are hardly interpretable, and thus the user must resort to some kind of post-processing hard-thresholding step, which destroys the estimated model. Moreover, when assessing the stability of the dictionaries, measured by the mean correlation between corresponding atoms across different splits of the data, CanICA [22] scores a meager 0.1, whilst the hard-thresholded version tCanICA obtains 0.2, compared to 0.4 for Smooth-SODL and 0.1 for SODL.\n\nIs spatial regularization really needed? As rightly pointed out by one of the reviewers, one does not need spatial regularization if data are abundant (as in the HCP). So we computed learning curves of mean explained variance (EV) on test data, as a function of the amount of training data seen by both Smooth-SODL and SODL [16] (Table 1). At the beginning of the curve, our proposed spatially regularized Smooth-SODL model starts off with more than 31% explained variance (computed on 241 subjects), after having pooled only 17 subjects. 
In contrast, the vanilla SODL model [16] scores a meager 2% explained variance; this corresponds to a 14-fold gain of Smooth-SODL over SODL. As more and more data are pooled, both models explain more variance, the gap between Smooth-SODL and SODL reduces, and both models perform comparably asymptotically.\n\nNb. subjects pooled | mean EV for vanilla SODL | Smooth-SODL (2) | gain factor\n17 | 2% | 31% | 13.8\n92 | 37% | 50% | 1.35\n167 | 47% | 54% | 1.15\n241 | 49% | 55% | 1.11\n\nTable 1: Learning curve for the boost in explained variance of our proposed Smooth-SODL model over the reference SODL model. Note the reduction in the explained-variance gain as more data are pooled. Thus our proposed Smooth-SODL method extracts structured, denoised dictionaries that better capture inter-subject variability in small, medium, and large-scale regimes alike.\n\nPrediction of behavioral variables. If Smooth-SODL captures the patterns of inter-subject variability, then it should be possible to predict cognitive scores $y$ like picture vocabulary, reading proficiency, math aptitude, etc. (the behavioral variables are explained in the HCP wiki [12]) by projecting new subjects\u2019 data into this learned low-dimensional space (via solving the ridge problem (5) for each sample $X_t$), without loss of performance compared with using the raw Z-score values $X$. Let RAW refer to the direct prediction of targets $y$ from $X$, using the top 2000 voxels most correlated with the target variable. Results for the comparison are shown in Fig. 2(c). Only variables predicted with a positive mean (across the different methods and across subjects) $R^2$-score are reported. We see that the RAW model, as expected, over-fits drastically, scoring an $R^2$ of 0.3 on training data and only 0.14 on test data. Overall, for this metric, CanICA performs better than all the other models in predicting the different behavioral variables on test data. 
However, our proposed Smooth-SODL model outperforms both SODL [16] and tCanICA, the thresholded version of CanICA.

7 Concluding remarks

To extract structured, functionally discriminating patterns from massive brain data (i.e., data-driven atlases), we have extended the online dictionary-learning framework first developed in [16] to learn structured regions representative of brain organization. To this end, we have successfully augmented [16] with a Laplacian penalty on the component maps, while conserving the low numerical complexity of the latter. Through experiments, we have shown that the resulting model, Smooth-SODL (2), extracts structured and denoised dictionaries that are more interpretable and better capture inter-subject variability in small, medium, and large-scale regimes alike, compared to state-of-the-art models. We believe such online multivariate methods will become the de facto way to perform dimensionality reduction and ROI extraction in the future.

Implementation. The authors' implementation of the proposed Smooth-SODL (2) model will soon be made available as part of the Nilearn package [2].

Acknowledgment. This work has been funded by EU FP7/2007-2013 under grant agreement no. 604102, Human Brain Project (HBP), and the iConnectome Digiteo project. We would also like to thank the Human Connectome Project for making their wonderful data publicly available.

References

[1] A. Abraham et al. "Extracting brain regions from rest fMRI with Total-Variation constrained dictionary learning". In: MICCAI. 2013.
[2] A. Abraham et al. "Machine learning for neuroimaging with scikit-learn". In: Frontiers in Neuroinformatics (2014).
[3] L. Baldassarre, J. Mourao-Miranda, and M. Pontil. "Structured sparsity models for brain decoding from fMRI data". In: PRNI. 2012.
[4] A. Beck and M. Teboulle.
\u201cA Fast Iterative Shrinkage-Thresholding Algorithm for Linear\n\nInverse Problems\u201d. In: SIAM J. Imaging Sci. 2 (2009).\n\n[5] C. F. Beckmann and S. M. Smith. \u201cProbabilistic independent component analysis for functional\n\nmagnetic resonance imaging\u201d. In: Trans Med. Im. 23 (2004).\n\n[6] D. P. Bertsekas. Nonlinear programming. Athena Scienti\ufb01c, 1999.\n[7] L. Condat. \u201cFast projection onto the simplex and the (cid:96)1 ball\u201d. In: Math. Program. (2014).\n[8] E. Dohmatob et al. \u201cBenchmarking solvers for TV-l1 least-squares and logistic regression in\n\nbrain imaging\u201d. In: PRNI. IEEE. 2014.\nJ. Duchi et al. \u201cE\ufb03cient projections onto the l 1-ball for learning in high dimensions\u201d. In:\nICML. ACM. 2008.\n\n[9]\n\n[10] K. J. Friston et al. \u201cStatistical Parametric Maps in Functional Imaging: A General Linear\n\nApproach\u201d. In: Hum Brain Mapp (1995).\n\n[11] L. Grosenick et al. \u201cInterpretable whole-brain prediction analysis with GraphNet\u201d. In: Neu-\n\nroImage 72 (2013).\n\n[12] HCP wiki. https://wiki.humanconnectome.org/display/PublicData/HCP+Data+\n\nDictionary+Public-+500+Subject+Release. Accessed: 2010-09-30.\n\n[13] M. Hebiri and S. van de Geer. \u201cThe Smooth-Lasso and other (cid:96)1 + (cid:96)2-penalized methods\u201d. In:\n\n[14] D. P. Hibar et al. \u201cGenetic clustering on the hippocampal surface for genome-wide association\n\nElectron. J. Stat. 5 (2011).\n\nstudies\u201d. In: MICCAI. 2013.\n\n[15] R. Jenatton, G. Obozinski, and F. Bach. \u201cStructured sparse principal component analysis\u201d. In:\n\nAISTATS. 2010.\nJ. Mairal et al. \u201cOnline learning for matrix factorization and sparse coding\u201d. In: Journal of\nMachine Learning Research 11 (2010).\n\n[16]\n\n[17] A. Mensch et al. \u201cDictionary Learning for Massive Matrix Factorization\u201d. In: ICML. ACM.\n\n[18] R. Saxe, M. Brett, and N. Kanwisher. 
\u201cDivide and conquer: a defense of functional localizers\u201d.\n\n2016.\n\nIn: Neuroimage 30 (2006).\n\nas FSL\u201d. In: Neuroimage 23 (2004).\n\nNeuroImage 62 (2012).\n\n[19] S. M. Smith et al. \u201cAdvances in functional and structural MR image analysis and implementation\n\n[20] D. van Essen et al. \u201cThe Human Connectome Project: A data acquisition perspective\u201d. In:\n\n[21] E. Varol and C. Davatzikos. \u201cSupervised block sparse dictionary learning for simultaneous\nclustering and classi\ufb01cation in computational anatomy.\u201d eng. In: Med Image Comput Comput\nAssist Interv 17 (2014).\n\n[22] G. Varoquaux et al. \u201cA group model for stable multi-subject ICA on fMRI datasets\u201d. In:\n\n[23] G. Varoquaux et al. \u201cCohort-level brain mapping: learning cognitive atoms to single out\n\nNeuroimage 51 (2010).\n\nspecialized regions\u201d. In: IPMI. 2013.\n\n[24] G. Varoquaux et al. \u201cFAASTA: A fast solver for total-variation regularization of ill-conditioned\n\nproblems with application to brain imaging\u201d. In: arXiv:1512.06999 (2015).\n\n[25] G. Varoquaux et al. \u201cMulti-subject dictionary learning to segment an atlas of brain spontaneous\n\nactivity\u201d. In: Inf Proc Med Imag. 2011.\n\n[26] Y. Ying and D.-X. Zhou. \u201cOnline regularized classi\ufb01cation algorithms\u201d. In: IEEE Trans. Inf.\n\nTheory 52 (2006).\n\n9\n\n\f", "award": [], "sourceid": 2296, "authors": [{"given_name": "Elvis", "family_name": "DOHMATOB", "institution": "Inria"}, {"given_name": "Arthur", "family_name": "Mensch", "institution": "inria"}, {"given_name": "Gael", "family_name": "Varoquaux", "institution": "Parietal Team, INRIA"}, {"given_name": "Bertrand", "family_name": "Thirion", "institution": "INRIA"}]}