{"title": "Learning the Morphology of Brain Signals Using Alpha-Stable Convolutional Sparse Coding", "book": "Advances in Neural Information Processing Systems", "page_first": 1099, "page_last": 1108, "abstract": "Neural time-series data contain a wide variety of prototypical signal waveforms (atoms) that are of significant importance in clinical and cognitive research. One of the goals for analyzing such data is hence to extract such `shift-invariant' atoms. Even though some success has been reported with existing algorithms, they are limited in applicability due to their heuristic nature. Moreover, they are often vulnerable to artifacts and impulsive noise, which are typically present in raw neural recordings. In this study, we address these issues and propose a novel probabilistic convolutional sparse coding (CSC) model for learning shift-invariant atoms from raw neural signals containing potentially severe artifacts. In the core of our model, which we call $\\alpha$CSC, lies a family of heavy-tailed distributions called $\\alpha$-stable distributions. We develop a novel, computationally efficient Monte Carlo expectation-maximization algorithm for inference. The maximization step boils down to a weighted CSC problem, for which we develop a computationally efficient optimization algorithm. Our results show that the proposed algorithm achieves state-of-the-art convergence speeds. 
Besides, $\\alpha$CSC is significantly more robust to artifacts when compared to three competing algorithms: it can extract spike bursts, oscillations, and even reveal more subtle phenomena such as cross-frequency coupling when applied to noisy neural time series.", "full_text": "Learning the Morphology of Brain Signals Using\n\nAlpha-Stable Convolutional Sparse Coding\n\nMainak Jas1, Tom Dupr\u00e9 La Tour1, Umut \u00b8Sim\u00b8sekli1, Alexandre Gramfort1,2\n\n1: LTCI, Telecom ParisTech, Universit\u00e9 Paris-Saclay, Paris, France\n\n2: INRIA, Universit\u00e9 Paris-Saclay, Saclay, France\n\nAbstract\n\nNeural time-series data contain a wide variety of prototypical signal waveforms\n(atoms) that are of signi\ufb01cant importance in clinical and cognitive research. One of\nthe goals for analyzing such data is hence to extract such \u2018shift-invariant\u2019 atoms.\nEven though some success has been reported with existing algorithms, they are\nlimited in applicability due to their heuristic nature. Moreover, they are often\nvulnerable to artifacts and impulsive noise, which are typically present in raw\nneural recordings. In this study, we address these issues and propose a novel\nprobabilistic convolutional sparse coding (CSC) model for learning shift-invariant\natoms from raw neural signals containing potentially severe artifacts. In the core of\nour model, which we call \u03b1CSC, lies a family of heavy-tailed distributions called\n\u03b1-stable distributions. We develop a novel, computationally ef\ufb01cient Monte Carlo\nexpectation-maximization algorithm for inference. The maximization step boils\ndown to a weighted CSC problem, for which we develop a computationally ef\ufb01cient\noptimization algorithm. Our results show that the proposed algorithm achieves\nstate-of-the-art convergence speeds. 
Besides, \u03b1CSC is signi\ufb01cantly more robust to\nartifacts when compared to three competing algorithms: it can extract spike bursts,\noscillations, and even reveal more subtle phenomena such as cross-frequency\ncoupling when applied to noisy neural time series.\n\n1\n\nIntroduction\n\nNeural time series data, either non-invasive such as electroencephalograhy (EEG) or invasive such as\nelectrocorticography (ECoG) and local \ufb01eld potentials (LFP), are fundamental to modern experimental\nneuroscience. Such recordings contain a wide variety of \u2018prototypical signals\u2019 that range from beta\nrhythms (12\u201330 Hz) in motor imagery tasks and alpha oscillations (8\u201312 Hz) involved in attention\nmechanisms, to spindles in sleep studies, and the classical P300 event related potential, a biomarker for\nsurprise. These prototypical waveforms are considered critical in clinical and cognitive research [1],\nthereby motivating the development of computational tools for learning such signals from data.\nDespite the underlying complexity in the morphology of neural signals, the majority of the computa-\ntional tools in the community are based on representing the signals with rather simple, prede\ufb01ned\nbases, such as the Fourier or wavelet bases [2]. While such bases lead to computationally ef\ufb01cient\nalgorithms, they often fall short at capturing the precise morphology of signal waveforms, as demon-\nstrated by a number of recent studies [3, 4]. An example of such a failure is the disambiguation of\nthe alpha rhythm from the mu rhythm [5], both of which have a component around 10 Hz but with\ndifferent morphologies that cannot be captured by Fourier- or wavelet-based representations.\nRecently, there have been several attempts for extracting more realistic and precise morphologies\ndirectly from un\ufb01ltered electrophysiology signals, via dictionary learning approaches [6\u20139]. 
These\nmethods all aim to extract certain shift-invariant prototypical waveforms (called \u2018atoms\u2019 in this\ncontext) to better capture the temporal structure of the signals. As opposed to using generic bases\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fthat have prede\ufb01ned shapes, such as the Fourier or the wavelet bases, these atoms provide a more\nmeaningful representation of the data and are not restricted to narrow frequency bands.\nIn this line of research, Jost et al. [6] proposed the MoTIF algorithm, which uses an iterative strategy\nbased on generalized eigenvalue decompositions, where the atoms are assumed to be orthogonal to\neach other and learnt one by one in a greedy way. More recently, the \u2018sliding window matching\u2019\n(SWM) algorithm [9] was proposed for learning time-varying atoms by using a correlation-based\napproach that aims to identify the recurring patterns. Even though some success has been reported\nwith these algorithms, they have several limitations: SWM uses a slow stochastic search inspired by\nsimulated annealing and MoTIF poorly handles correlated atoms, simultaneously activated, or having\nvarying amplitudes; some cases which often occur in practical applications.\nA natural way to cast the problem of learning a dictionary of shift-invariant atoms into an optimization\nproblem is a convolutional sparse coding (CSC) approach [10]. This approach has gained popularity\nin computer vision [11\u201315], biomedical imaging [16] and audio signal processing [10, 17], due to its\nability to obtain compact representations of the signals and to incorporate the temporal structure of\nthe signals via convolution. In the neuroscience context, Barth\u00e9lemy et al. [18] used an extension\nof the K-SVD algorithm using convolutions on EEG data. 
In a similar spirit, Brockmeier and\nPr\u00edncipe [7] used the matching pursuit algorithm combined with a rather heuristic dictionary update,\nwhich is similar to the MoTIF algorithm. In a very recent study, Hitziger et al. [8] proposed the\nAWL algorithm, which presents a mathematically more principled CSC approach for modeling\nneural signals. Yet, as opposed to classical CSC approaches, the AWL algorithm imposes additional\ncombinatorial constraints, which limit its scope to certain data that contain spike-like atoms. Also,\nsince these constraints increase the complexity of the optimization problem, the authors had to resort\nto dataset-speci\ufb01c initializations and many heuristics in their inference procedure.\nWhile the current state-of-the-art CSC methods have a strong potential for modeling neural signals,\nthey might also be limited as they consider an (cid:96)2 reconstruction error, which corresponds to assuming\nan additive Gaussian noise distribution. While this assumption could be reasonable for several signal\nprocessing tasks, it turns out to be very restrictive for neural signals, which often contain heavy noise\nbursts and have low signal-to-noise ratio.\nIn this study, we aim to address the aforementioned concerns and propose a novel probabilistic\nCSC model called \u03b1CSC, which is better-suited for neural signals. \u03b1CSC is based on a family\nof heavy-tailed distributions called \u03b1-stable distributions [19] whose rich structure covers a broad\nrange of noise distributions. The heavy-tailed nature of the \u03b1-stable distributions renders our model\nrobust to impulsive observations. We develop a Monte Carlo expectation maximization (MCEM)\nalgorithm for inference, with a weighted CSC model for the maximization step. We propose ef\ufb01cient\noptimization strategies that are speci\ufb01cally designed for neural time series. 
We illustrate the benefits of the proposed approach on both synthetic and real datasets.\n\n2 Preliminaries\n\nNotation: For a vector v \u2208 R^n we denote the $\\ell_p$ norm by $\\|v\\|_p = (\\sum_i |v_i|^p)^{1/p}$. The convolution of two vectors v_1 \u2208 R^N and v_2 \u2208 R^M is denoted by v_1 \u2217 v_2 \u2208 R^{N+M\u22121}. We denote by x the observed signals, d the temporal atoms, and z the sparse vector of activations. The symbols U, E, N, S denote the univariate uniform, exponential, Gaussian, and \u03b1-stable distributions, respectively.\nConvolutional sparse coding: The CSC problem formulation adopted in this work follows the Shift Invariant Sparse Coding (SISC) model from [10]. It is defined as follows:\n\n$$\\min_{d,z} \\sum_{n=1}^{N} \\Big( \\frac{1}{2} \\Big\\| x_n - \\sum_{k=1}^{K} d_k \\ast z_n^k \\Big\\|_2^2 + \\lambda \\sum_{k=1}^{K} \\|z_n^k\\|_1 \\Big) \\quad \\text{s.t. } \\|d_k\\|_2^2 \\le 1 \\text{ and } z_n^k \\ge 0, \\ \\forall n, k, \\qquad (1)$$\n\nwhere x_n \u2208 R^T denotes one of the N observed segments of signals, also referred to as trials in this paper. We denote by T the length of a trial, and by K the number of atoms. The aim in this model is to approximate the signals x_n by the convolution of certain atoms and their respective activations, which are sparse. Here, d_k \u2208 R^L denotes the kth atom of the dictionary d \u2261 {d_k}_k, and z_n^k \u2208 R_+^{T\u2212L+1} denotes the activation of the kth atom in the nth trial. We denote z \u2261 {z_n^k}_{n,k}.\nThe objective function (1) has two terms, an $\\ell_2$ data fitting term that corresponds to assuming an additive Gaussian noise model, and a regularization term that promotes sparsity with an $\\ell_1$ norm. The\n\nFigure 1: (a) PDFs of \u03b1-stable distributions. (b) Illustration of two trials from the striatal LFP data, which contain severe artifacts. 
The artifacts are illustrated with dashed rectangles.\n\nregularization parameter is called \u03bb > 0. Two constraints are also imposed. First, we ensure that d_k lies within the unit sphere, which prevents the scale ambiguity between d and z. Second, a positivity constraint on z is imposed to obtain physically meaningful activations and to avoid sign ambiguities between d and z. This positivity constraint is not present in the original SISC model [10].\n\u03b1-Stable distributions: The \u03b1-stable distributions have become increasingly popular for modeling signals that might incur large variations [20\u201324] and have a particular importance in statistics, since they appear as the limiting distributions in the generalized central limit theorem [19]. They are characterized by four parameters, \u03b1, \u03b2, \u03c3, and \u00b5: (i) \u03b1 \u2208 (0, 2] is the characteristic exponent and determines the tail thickness of the distribution: the distribution becomes heavier-tailed as \u03b1 gets smaller. (ii) \u03b2 \u2208 [\u22121, 1] is the skewness parameter. If \u03b2 = 0, the distribution is symmetric. (iii) \u03c3 \u2208 (0, \u221e) is the scale parameter and measures the spread of the random variable around its mode (similar to the standard deviation of a Gaussian distribution). Finally, (iv) \u00b5 \u2208 (\u2212\u221e, \u221e) is the location parameter (for \u03b1 > 1, it is simply the mean).\nThe probability density function of an \u03b1-stable distribution cannot be written in closed form except for certain special cases; however, the characteristic function can be written as follows:\n\n$$x \\sim \\mathcal{S}(\\alpha, \\beta, \\sigma, \\mu) \\iff \\mathbb{E}[\\exp(i \\omega x)] = \\exp\\big( -|\\sigma \\omega|^{\\alpha} \\, [1 + i \\, \\mathrm{sign}(\\omega) \\, \\beta \\, \\psi_{\\alpha}(\\omega)] + i \\mu \\omega \\big),$$\n\nwhere $\\psi_{\\alpha}(\\omega) = \\log |\\omega|$ for \u03b1 = 1, $\\psi_{\\alpha}(\\omega) = \\tan(\\pi \\alpha / 2)$ for \u03b1 \u2260 1, and i = \u221a\u22121. As an important special case of the \u03b1-stable distributions, we obtain the Gaussian distribution when \u03b1 = 2 and \u03b2 = 0, i.e. S(2, 0, \u03c3, \u00b5) = N(\u00b5, 2\u03c3^2). In Fig. 1(a), we illustrate the (approximately computed) probability density functions (PDF) of the \u03b1-stable distribution for different values of \u03b1 and \u03b2. The distribution becomes heavier-tailed as we decrease \u03b1, whereas the tails vanish quickly when \u03b1 = 2.\nThe moments of the \u03b1-stable distributions can only be defined up to the order \u03b1, i.e. E[|x|^p] < \u221e if and only if p < \u03b1, which implies that the distribution has infinite variance when \u03b1 < 2. Furthermore, despite the fact that the PDFs of \u03b1-stable distributions do not admit an analytical form, it is straightforward to draw random samples from them [25].\n\n3 Alpha-Stable Convolutional Sparse Coding\n\n3.1 The Model\n\nFrom a probabilistic perspective, the CSC problem can also be formulated as a maximum a-posteriori (MAP) estimation problem on the following probabilistic generative model:\n\n$$z_{n,t}^k \\sim \\mathcal{E}(\\lambda), \\quad x_{n,t} \\,|\\, z, d \\sim \\mathcal{N}(\\hat{x}_{n,t}, 1), \\quad \\text{where } \\hat{x}_n \\triangleq \\sum_{k=1}^{K} d_k \\ast z_n^k. \\qquad (2)$$\n\nHere, z_{n,t}^k denotes the tth element of z_n^k. We use the same notation for x_{n,t} and \u02c6x_{n,t}. It is easy to verify that the MAP estimate for this probabilistic model, i.e. max_{d,z} log p(d, z|x), is identical to the original optimization problem defined in (1)\u00b9.\nIt has been long known that, due to their light-tailed nature, Gaussian models often fail at handling noisy high amplitude observations or outliers [26]. 
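These two properties, the Gaussian special case at \u03b1 = 2 and the heavier tails as \u03b1 decreases, can be checked numerically. Below is a minimal sketch using scipy.stats.levy_stable, assuming scipy's default parameterization (its scale corresponds to \u03c3 here; the sign convention for \u03b2 is immaterial in the symmetric case). The sample size and tail threshold are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import levy_stable

n = 50_000

# alpha = 2: S(2, 0, sigma, mu) coincides with N(mu, 2 sigma^2), so with
# sigma = 1 the sample standard deviation should be close to sqrt(2).
gauss_like = levy_stable.rvs(2.0, 0.0, loc=0.0, scale=1.0, size=n, random_state=0)
print(np.std(gauss_like))  # close to sqrt(2) ~ 1.414

# Smaller alpha: heavier tails, i.e. much more mass far from the mode.
heavy = levy_stable.rvs(1.3, 0.0, loc=0.0, scale=1.0, size=n, random_state=1)
tail_gauss = np.mean(np.abs(gauss_like) > 5)
tail_heavy = np.mean(np.abs(heavy) > 5)
print(tail_gauss, tail_heavy)  # the second fraction is far larger
```

The tail comparison is exactly the effect visible in Fig. 1(a): for \u03b1 = 1.3 a non-negligible fraction of samples lies beyond five scale units, while the Gaussian case places almost no mass there.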
As a result, the \u2018vanilla\u2019 CSC model turns out\nto be highly sensitive to outliers and impulsive noise that frequently occur in electrophysiological\n\n1Note that the positivity constraint on the activations is equivalent to an exponential prior for the regularization\n\nterm rather than the more common Laplacian prior.\n\n3\n\n-30-20-100102030x10-510-410-310-210-1100p(x)\u03b1=2.0, \u03b2=0\u03b1=1.9, \u03b2=0\u03b1=1.8, \u03b2=0\u03b1=0.9, \u03b2=15001000150020002500Time(t)-10010Trial1(xn,t)5001000150020002500Time(t)-505Trial2(xn,t)\frecordings, as illustrated in Fig. 1(b). Possible origins of such artifacts are movement, muscle\ncontractions, ocular blinks or electrode contact losses.\nIn this study, we aim at developing a probabilistic CSC model that would be capable of modeling\nchallenging electrophysiological signals. We propose an extension of the original CSC model de\ufb01ned\nin (2) by replacing the light-tailed Gaussian likelihood (corresponding to the (cid:96)2 reconstruction loss\nin (1)) with heavy-tailed \u03b1-stable distributions. We de\ufb01ne the proposed probabilistic model (\u03b1CSC)\nas follows:\n(3)\nwhere S denotes the \u03b1-stable distribution. While still being able to capture the temporal structure\nof the observed signals via convolution, the proposed model has a richer structure and would allow\nlarge variations and outliers, thanks to the heavy-tailed \u03b1-stable distributions. 
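To make the generative model concrete, the following sketch simulates trials from it: unit-norm random atoms, a few nonnegative activations per trial with exponential amplitudes, and additive symmetric \u03b1-stable noise of scale 1/\u221a2, as in the model definition. The function name, trial dimensions, and the choice of three activations per trial are illustrative assumptions, not the paper's experimental protocol.

```python
import numpy as np
from scipy.stats import levy_stable

def simulate_trials(n_trials=5, T=512, L=64, K=2, lam=1.0, alpha=1.8, seed=0):
    """Draw trials from the alphaCSC-style generative model: sparse nonnegative
    activations, convolved with unit-norm atoms, plus alpha-stable noise."""
    rng = np.random.default_rng(seed)
    atoms = rng.standard_normal((K, L))
    atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)  # ||d_k||_2 = 1
    X = np.zeros((n_trials, T))
    for n in range(n_trials):
        for k in range(K):
            z = np.zeros(T - L + 1)
            idx = rng.integers(0, T - L + 1, size=3)     # a few activation instants
            z[idx] = rng.exponential(1.0 / lam, size=3)  # nonnegative amplitudes
            X[n] += np.convolve(atoms[k], z)             # length (T-L+1) + L - 1 = T
        # symmetric alpha-stable noise of scale 1/sqrt(2), per the model
        X[n] += levy_stable.rvs(alpha, 0.0, loc=0.0, scale=1.0 / np.sqrt(2),
                                size=T, random_state=rng)
    return X, atoms

X, atoms = simulate_trials()
print(X.shape)  # (5, 512)
```

With \u03b1 close to 2 the trials look like the standard CSC model; lowering \u03b1 injects the occasional extreme samples that mimic the artifacts of Fig. 1(b).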
Note that the vanilla\nCSC de\ufb01ned in (2) appears as a special case of \u03b1CSC, as the \u03b1-stable distribution coincides with the\nGaussian distribution when \u03b1 = 2.\n\n\u221a\nxn,t|z, d \u223c S(\u03b1, 0, 1/\n\nn,t \u223c E(\u03bb),\nzk\n\n2, \u02c6xn,t) ,\n\n3.2 Maximum A-Posteriori Inference\n\n(cid:88)\n\n(cid:16)\n\n(d(cid:63), z(cid:63)) = arg max\n\nd,z\n\nn,t\n\n(cid:17)\n\n(cid:88)\n\nk\n\nGiven the observed signals x, we are interested in the MAP estimates, de\ufb01ned as follows:\n\nlog p(xn,t|d, z) +\n\nlog p(zk\n\nn,t)\n\n.\n\n(4)\n\n(cid:17)\n\nxn,t|z, d, \u03c6 \u223c N(cid:16)\n\n(cid:17)\n\n1\n2\n\nn,t \u223c E(\u03bb), \u03c6n,t \u223c S(cid:16) \u03b1\n\nAs opposed to the Gaussian case, unfortunately, this optimization problem is not amenable to classical\noptimization tools, since the PDF of the \u03b1-stable distributions does not admit an analytical expression.\nAs a remedy, we use the product property of the symmetric \u03b1-stable densities [19, 27] and re-express\nthe \u03b1CSC model as conditionally Gaussian. It leads to:\n\n\u03c0\u03b1\n4\n\n,\n\n,\n\n2\n\nzk\n\n\u03c6n,t\n\n\u02c6xn,t,\n\n)2/\u03b1, 0\n\n, 1, 2(cos\n\n(5)\nwhere \u03c6 is called the impulse variable that is drawn from a positive \u03b1-stable distribution (i.e. \u03b2 = 1),\nwhose PDF is illustrated in Fig. 1(a). It can be shown that both formulations of the \u03b1CSC model are\nidentical by marginalizing the joint distribution p(x, d, z, \u03c6) over \u03c6 [19, Proposition 1.3.1].\nThe impulsive structure of the \u03b1CSC model becomes more prominent in this formulation: the\nvariances of the Gaussian observations are modulated by stable random variables with in\ufb01nite\nvariance, where the impulsiveness depends on the value of \u03b1. 
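The product property behind this augmentation (Proposition 1.3.1 of the cited reference) can be checked by simulation: multiplying the square root of a positive (\u03b1/2)-stable variable with an independent Gaussian should reproduce a symmetric \u03b1-stable draw. A minimal sketch, assuming scipy.stats.levy_stable's default 'S1' parameterization matches the one used in that proposition (for \u03c3 = 1, the Gaussian factor then has variance 2):

```python
import numpy as np
from scipy.stats import levy_stable, ks_2samp

alpha, sigma, n = 1.5, 1.0, 100_000
rng = np.random.default_rng(0)

# Direct draws from the symmetric alpha-stable distribution S(alpha, 0, sigma, 0).
x_direct = levy_stable.rvs(alpha, 0.0, loc=0.0, scale=sigma, size=n, random_state=rng)

# Scale-mixture draws: phi is positive (alpha/2)-stable (beta = 1),
# and x | phi is Gaussian with variance proportional to phi.
phi = levy_stable.rvs(alpha / 2, 1.0, loc=0.0,
                      scale=np.cos(np.pi * alpha / 4) ** (2 / alpha),
                      size=n, random_state=rng)
x_mixture = np.sqrt(phi) * rng.normal(0.0, np.sqrt(2) * sigma, size=n)

# If the product property holds, the two samples are statistically indistinguishable.
stat, _ = ks_2samp(x_direct, x_mixture)
print(stat)  # small two-sample KS statistic
```

The two empirical distributions should agree up to Monte Carlo error, which is the marginalization over \u03c6 referred to in the text.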
It is also worth noting that when \u03b1 = 2,\n\u03c6n,t becomes deterministic and we can again verify that \u03b1CSC coincides with the vanilla CSC.\nThe conditionally Gaussian structure of the augmented model has a crucial practical implication: if\nthe impulse variable \u03c6 were to be known, then the MAP estimation problem over d and z in this\nmodel would turn into a \u2018weighted\u2019 CSC problem, which is a much easier task compared to the\noriginal problem. In order to be able to exploit this property, we propose an expectation-maximization\n(EM) algorithm, which iteratively maximizes a lower bound of the log-posterior log p(d, z|x), and\nalgorithmically boils down to computing the following steps in an iterative manner:\n\n(6)\n\nE-Step:\n\nB(i)(d, z) = E [log p(x, \u03c6, z|d)]p(\u03c6|x,z(i),d(i)) ,\n(d(i+1), z(i+1)) = arg maxd,z B(i)(d, z).\n\nM-Step:\n\n(7)\nwhere E[f (x)]q(x) denotes the expectation of a function f under the distribution q, i denotes the\niterations, and B(i) is a lower bound to log p(d, z|x) and it is tight at the current iterates z(i), d(i).\nThe E-Step: In the \ufb01rst step of our algorithm, we need to compute the EM lower bound B that has\nthe following form:\n\nB(i)(d, z) =+ \u2212 N(cid:88)\n\n(cid:113)\n(cid:16)(cid:107)\n\nn (cid:12) (xn \u2212 K(cid:88)\n\nw(i)\n\ndk \u2217 zk\n\nn)(cid:107)2\n\n(cid:107)zk\n\nn(cid:107)1\n\nn=1\n\n2 + \u03bb\n\n(8)\nwhere =+ denotes equality up to additive constants, (cid:12) denotes the Hadamard (element-wise) product,\nand the square-root operator is also de\ufb01ned element-wise. Here, w(i)\n+ are the weights that are\n(cid:44) E [1/\u03c6n,t]p(\u03c6|x,z(i),d(i)). As the variables \u03c6n,t are expected to be large\nde\ufb01ned as follows: w(i)\nn,t\nwhen \u02c6xn,t cannot explain the observation xn,t \u2013 typically due to a corruption or a high noise \u2013 the\nweights will accordingly suppress the importance of the particular point xn,t. 
Therefore, the overall\napproach will be more robust to corrupted data than the Gaussian models where all weights would be\ndeterministic and equal to 0.5.\n\nn \u2208 RT\n\nk=1\n\nk=1\n\n,\n\nK(cid:88)\n\n(cid:17)\n\n4\n\n\fj=1 1/\u03c6(i,j)\n\nn,t , where \u03c6(i,j)\n\n(1/J)(cid:80)J\n\n/* E-step: */\nfor j = 1 to J do\n\nUnfortunately, the weights w(i) cannot be\ntherefore we need\ncomputed analytically,\nto resort to approximate methods.\nIn this\nstudy, we develop a Markov chain Monte\nCarlo (MCMC) method to approximately\ncompute the weights, where we approxi-\nmate the intractable expectations with a \ufb01nite\nn,t \u2248\nsample average, given as follows: w(i)\nn,t are some\nsamples that are ideally drawn from the pos-\nterior distribution p(\u03c6|x, z(i), d(i)). Unfor-\ntunately, directly drawing samples from the\nposterior distribution of \u03c6 is not tractable ei-\nther, and therefore, we develop a Metropolis-\nHastings algorithm [28], that asymptotically\ngenerates samples from the target distribution\np(\u03c6|\u00b7) in two steps. In the j-th iteration of this\nalgorithm, we \ufb01rst draw a random sample for each n and t from the prior distribution (cf. (5)), i.e.,\nn,t \u223c p(\u03c6n,t). We then compute an acceptance probability for each \u03c6(cid:48)\n\u03c6(cid:48)\nn,t that is de\ufb01ned as follows:\n\nAlgorithm 1 \u03b1-stable Convolutional Sparse Coding\nRequire: Regularization: \u03bb \u2208 R+, Num. atoms:\nK, Atom length: L, Num. 
iterations: I , J, M\n1: for i = 1 to I do\n2:\n3:\n4:\n5:\n6:\n7:\n8:\n9:\n10:\n11:\n12: end for\n13: return w(I), d(I), z(I)\n\nend for\nw(i)\n/* M-step: */\nfor m = 1 to M do\n\nz(i) = L-BFGS-B on (10)\nd(i) = L-BFGS-B on the dual of (11)\n\nn,t via MCMC (9)\nj=1 1/\u03c6(i,j)\n\nn,t \u2248 (1/J)(cid:80)J\n\nDraw \u03c6(i,j)\n\nend for\n\nn,t\n\nacc(\u03c6(i,j)\n\nn,t \u2192 \u03c6(cid:48)\n\nn,t) (cid:44) min\n\n1, p(xn,t|d(i), z(i), \u03c6(cid:48)\n\nn,t)/p(xn,t|d(i), z(i), \u03c6(i,j)\nn,t )\n\n(9)\n\n(cid:110)\n\n(cid:111)\n\nn,t \u2192 \u03c6(cid:48)\nn,t = \u03c6(i)\n\nwhere j denotes the iteration number of the MCMC algorithm. Finally, we draw a uniform random\nnumber un,t \u223c U([0, 1]) for each n and t. If un,t < acc(\u03c6(i)\nn,t), we accept the sample and set\nn,t = \u03c6(cid:48)\n\u03c6(i+1)\nn,t; otherwise we reject the sample and set \u03c6(i+1)\nn,t. This procedure forms a Markov\nchain that leaves the target distribution p(\u03c6|\u00b7) invariant, where under mild ergodicity conditions, it\ncan be shown that the \ufb01nite-sample averages converge to their true values when J goes to in\ufb01nity\n[29]. More detailed explanation of this procedure is given in the supplementary document.\nThe M-Step: Given the weights wn that are estimated during the E-step, the objective of the M-\nstep (7) is to solve a weighted CSC problem, which is much easier when compared to our original\nproblem. This objective function is not jointly convex in d and z, yet it is convex if one \ufb01x either d\nor z. 
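Returning to the E-step for a moment, the Metropolis-Hastings weight computation described above can be sketched as follows. Because proposals are drawn from the prior, the acceptance ratio reduces to the likelihood ratio of (9); it is evaluated here in the log domain for numerical stability. The helper name estimate_weights and its defaults (\u03b1 = 1.2, chain length, burn-in) are illustrative assumptions, not the released implementation.

```python
import numpy as np
from scipy.stats import levy_stable

def estimate_weights(x, x_hat, alpha=1.2, n_iter=300, burn_in=100, seed=0):
    """Sketch of the E-step: independence Metropolis-Hastings proposing phi from
    the positive-stable prior, then averaging 1/phi to approximate E[1/phi | x]."""
    rng = np.random.default_rng(seed)
    prior_scale = 2 * np.cos(np.pi * alpha / 4) ** (2 / alpha)
    r2 = (x - x_hat) ** 2

    def log_lik(phi):  # log N(x ; x_hat, phi/2), up to additive constants
        return -r2 / phi - 0.5 * np.log(phi)

    def draw():        # proposal = prior over phi (positive stable, beta = 1)
        return levy_stable.rvs(alpha / 2, 1.0, scale=prior_scale,
                               size=x.shape, random_state=rng)

    phi = draw()
    inv_sum = np.zeros_like(x)
    for j in range(n_iter):
        prop = draw()
        accept = np.log(rng.uniform(size=x.shape)) < log_lik(prop) - log_lik(phi)
        phi = np.where(accept, prop, phi)
        if j >= burn_in:
            inv_sum += 1.0 / phi
    return inv_sum / (n_iter - burn_in)

x_hat = np.zeros(2)
x = np.array([0.1, 20.0])   # the second sample mimics a strong artifact
w = estimate_weights(x, x_hat)
print(w)
```

As described in the text, the weight of the artifact-like sample comes out much smaller than that of the well-explained sample, which is what suppresses corrupted points in the weighted M-step.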
Here, similarly to the vanilla CSC approaches [9, 10], we develop a block coordinate descent\nstrategy, where we solve the problem in (7) for either d or z, by keeping respectively z and d \ufb01xed.\nWe \ufb01rst focus on solving the problem for z while keeping d \ufb01xed, given as follows:\n\nN(cid:88)\n\n(cid:16)(cid:107)\u221a\n\nwn (cid:12) (xn \u2212 K(cid:88)\n\nmin\n\nz\n\n(cid:88)\n\n(cid:17)\n\nn=1\n\nk=1\n\nk\n\nDk \u00afzk\n\nn)(cid:107)2\n\n2 + \u03bb\n\n(cid:107)zk\n\nn(cid:107)1\n\ns.t. zk\n\nn \u2265 0,\u2200n, k .\n\n(10)\n\n(cid:44) [(zk\n\nn)(cid:62), 0\u00b7\u00b7\u00b7 0](cid:62) \u2208 RT\n\nn as the inner product of the zero-padded activations\nHere, we expressed the convolution of dk and zk\n+, with a Toeplitz matrix Dk \u2208 RT\u00d7T , that is constructed from dk. The\n\u00afzk\nn\nmatrices Dk are never constructed in practice, and all operations are carried out using convolutions.\nThis problem can be solved by various constrained optimization algorithms. Here, we choose the\nquasi-Newton L-BFGS-B algorithm [30] with a box constraint: 0 \u2264 zk\nn,t \u2264 \u221e. This approach\nonly requires the simple computation of the gradient of the objective function with respect to z (cf.\nsupplementary material). Note that, since each trial is independent from each other, we can solve this\nproblem for each zn in parallel.\nWe then solve the problem for the atoms d while keeping z \ufb01xed. This optimization problem turns out\nto be a constrained weighted least-squares problem. In the non-weighted case, this problem can be\nsolved either in the time domain or in the Fourier domain [10\u201312]. The Fourier transform simpli\ufb01es\nthe convolutions that appear in least-squares problem, but it also induces several dif\ufb01culties, such\nas that the atom dk have to be in a \ufb01nite support L, an important issue ignored in the seminal work\nof [10] and addressed with an ADMM solver in[11, 12]. 
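For the single-atom case (K = 1), the weighted z-update above can be sketched with scipy's L-BFGS-B: the smooth part of (10) and its gradient are handed to the solver together with the box constraint z \u2265 0, under which the \u2113\u2081 penalty is simply the sum of the entries. The helper name z_step and the toy signal are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def z_step(x, d, w, lam):
    """Weighted z-update for one trial and one atom (sketch of problem (10)):
    L-BFGS-B with the box constraint z >= 0."""
    T, L = len(x), len(d)

    def obj_grad(z):
        r = x - np.convolve(d, z)                  # residual, length T
        f = np.sum(w * r ** 2) + lam * z.sum()     # l1 = sum(z) since z >= 0
        g = -2 * np.correlate(w * r, d, mode='valid') + lam  # D^T (w . r)
        return f, g

    z0 = np.zeros(T - L + 1)
    res = minimize(obj_grad, z0, jac=True, method='L-BFGS-B',
                   bounds=[(0, None)] * (T - L + 1))
    return res.x

rng = np.random.default_rng(0)
d = rng.standard_normal(16)
d /= np.linalg.norm(d)
z_true = np.zeros(100 - 16 + 1)
z_true[[20, 60]] = [1.0, 0.7]
x = np.convolve(d, z_true) + 0.01 * rng.standard_normal(100)
z_hat = z_step(x, d, w=np.full(100, 0.5), lam=0.1)
print(np.abs(z_hat - z_true).max())  # small: spikes recovered up to l1 shrinkage
```

The gradient uses a correlation to apply the adjoint of the convolution, so the Toeplitz matrix D_k is never formed, mirroring the remark above. Since the trials are independent, this update parallelizes over n.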
In the weighted case, it is not clear how to\nsolve this problem in the Fourier domain. We thus perform all the computations in the time domain.\nFollowing the traditional \ufb01lter identi\ufb01cation approach [31], we need to embed the one-dimensional\nn,i+j\u2212L+1 if L \u2212 1 \u2264\nsignals zk\n\nn into a matrix of delayed signals Z k\n\nn \u2208 RT\u00d7L, where (Z k\n\nn)i,j = zk\n\n5\n\n\f(a) K = 10, L = 32.\n\n(b) Time to reach a relative precision of 0.01.\n\nFigure 2: Comparison of state-of-the-art methods with our approach. (a) Convergence plot with the\nobjective function relative to the obtained minimum, as a function of computational time. (b) Time\ntaken to reach a relative precision of 10\u22122, for different settings of K and L.\n\ni + j < T and 0 elsewhere. Equation (1) then becomes:\n\nN(cid:88)\n\nmin\n\nd\n\nwn (cid:12) (xn \u2212 K(cid:88)\n\n(cid:107)\u221a\n\nndk)(cid:107)2\nZ k\n2,\n\ns.t. (cid:107)dk(cid:107)2\n\n2 \u2264 1 .\n\n(11)\n\nn=1\n\nk=1\n\nDue to the constraint, we must resort to an iterative approach. The options are to use (accelerated)\nprojected gradient methods such as FISTA [32] applied to (11), or to solve a dual problem as done\nin [10]. The dual is also a smooth constraint problem yet with a simpler positivity box constraint\n(cf. supplementary material). The dual can therefore be optimized with L-BFGS-B. Using such a\nquasi-Newton solver turned out to be more ef\ufb01cient than any accelerated \ufb01rst order method in either\nthe primal or the dual (cf. benchmarks in supplementary material).\nOur entire EM approach can be summarized in the Algorithm 1. Note that during the alternating\nminimization, thanks to convexity we can warm start the d update and the z update using the solution\nfrom the previous update. 
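The delayed-signal embedding can be sketched as follows; an efficient implementation would never materialize these matrices, but building one makes the equivalence with convolution easy to verify. delay_matrix is a hypothetical helper name for illustration.

```python
import numpy as np

def delay_matrix(z, L, T):
    """Embed activations z (length T - L + 1) into Z in R^{T x L} so that
    Z @ d equals np.convolve(d, z) for any atom d of length L."""
    Z = np.zeros((T, L))
    for j in range(L):
        # column j holds z delayed by j samples: Z[t, j] = z[t - j]
        Z[j:j + len(z), j] = z
    return Z

rng = np.random.default_rng(0)
T, L = 50, 8
z = rng.random(T - L + 1)
d = rng.standard_normal(L)
Z = delay_matrix(z, L, T)
print(np.allclose(Z @ d, np.convolve(d, z)))  # True
```

With this embedding, the d-update becomes the constrained weighted least-squares problem of (11), with the stacked matrices Z_n^k playing the role of the design matrix.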
This signi\ufb01cantly speeds up the convergence of the L-BFGS-B algorithm,\nparticularly in the later iterations of the overall algorithm.\n\n4 Experiments\n\nIn order to evaluate our approach, we conduct several experiments on both synthetic and real\ndata. First, we show that our proposed optimization scheme for the M-step provides signi\ufb01cant\nimprovements in terms of convergence speed over the state-of-the-art CSC methods. Then, we provide\nempirical evidence that our algorithm is more robust to artifacts and outliers than three competing\nCSC methods [6, 7, 12]. Finally, we consider LFP data, where we illustrate that our algorithm can\nreveal interesting properties in electrophysiological signals without supervision, even in the presence\nof severe artifacts. The source code is publicly available at https://alphacsc.github.io/.\nSynthetic simulation setup: In our synthetic data experiments, we simulate N trials of length T by\n\ufb01rst generating K zero mean and unit norm atoms of length L. The activation instants are integers\n\ndrawn from a uniform distribution in(cid:74)0, T \u2212 L(cid:75). The amplitude of the activations are drawn from a\n\nuniform distribution in [0, 1]. Atoms are activated only once per trial and are allowed to overlap. The\nactivations are then convolved with the generated atoms and summed up as in (1).\nM-step performance: In our \ufb01rst set of synthetic experiments, we illustrate the bene\ufb01ts of our\nM-step optimization approach over state-of-the-art CSC solvers. We set N = 100, T = 2000 and\n\u03bb = 1, and use different values for K and L. To be comparable, we set \u03b1 = 2 and add Gaussian\nnoise to the synthesized signals, where the standard deviation is set to 0.01. In this setting, we\nhave wn,t = 1/2 for all n, t, which reduces the problem to a standard CSC setup. We monitor the\nconvergence of ADMM-based methods by Heide et al. 
[11] and Wohlberg [12] against our M-step\nalgorithm, using both a single-threaded and a parallel version for the z-update. As the problem is\nnon-convex, even if two algorithms start from the same point, they are not guaranteed to reach the\nsame local minimum2. Hence, for a fair comparison, we use a multiple restart strategy with averaging\nacross 24 random seeds.\n\n2Note that the M-step can be viewed as a biconvex problem, for which global convergence guarantees can be\nshown under certain assumptions [33, 34]. However, we have observed that it is required to use multiple restarts\neven for vanilla CSC, implying that these assumptions are not satis\ufb01ed in this particular problem.\n\n6\n\n020004000Time (s)103102101100101(objective - best) / bestHeide et al (2015)Wohlberg (2016)M-step M-step - 4 parallelK = 2, L = 32K = 2, L = 128K = 10, L = 320100020003000400050006000Time (s)Heide et al (2015)Wohlberg (2016)M-step M-step - 4 parallel\f(a) No corruption.\n\nFigure 3: Simulation to compare state-of-the-art methods against \u03b1CSC.\n\n(b) 10% corruption.\n\n(c) 20% corruption\n\nDuring our experiments we have observed that the ADMM-based methods do not guarantee the\nfeasibility of the iterates. In other words, the norms of the estimated atoms might be greater than 1\nduring the iterations. To keep the algorithms comparable, when computing the objective value, we\nproject the atoms to the unit ball and scale the activations accordingly. To be strictly comparable,\nwe also imposed a positivity constraint on these algorithms. This is easily done by modifying the\nsoft-thresholding operator to be a recti\ufb01ed linear function. In the benchmarks, all algorithms use a\nsingle thread, except \u201cM-step - 4 parallel\u201d which uses 4 threads during the z update.\nIn Fig. 2, we illustrate the convergence behaviors of the different methods. Note that the y-axis is the\nprecision relative to the objective value obtained upon convergence. 
In other words, each curve is\nrelative to its own local minimum (see supplementary document for details). In the right subplot, we\nshow how long it takes for the algorithms to reach a relative precision of 0.01 for different settings\n(cf. supplementary material for more benchmarks). Our method consistently performs better and the\ndifference is even more striking for more challenging setups. This speed improvement on the M-step\nis crucial for us as this step will be repeatedly executed.\nRobustness to corrupted data: In our second synthetic data experiment, we illustrate the robustness\nof \u03b1CSC in the presence of corrupted observations. In order to simulate the likely presence of high\namplitude artifacts, one way would be to directly simulate the generative model in (3). However,\nthis would give us an unfair advantage, since \u03b1CSC is speci\ufb01cally designed for such data. Here,\nwe take an alternative approach, where we corrupt a randomly chosen fraction of the trials (10% or\n20%) with strong Gaussian noise of standard deviation 0.1, i.e. one order of magnitude higher than\nin a regular trial. We used a regularization parameter of \u03bb = 0.1. In these experiments, by CSC we\nrefer to \u03b1CSC with \u03b1 = 2, that resembles using only the M-step of our algorithm with deterministic\nweights wn,t = 1/2 for all n, t. We used a simpler setup where we set N = 100, T = 512, and\nL = 64. We used K = 2 atoms, as shown in dashed lines in Fig. 3.\nFor \u03b1CSC, we set the number of outer iterations I = 5, the number of iterations of the M-step to\nM = 50, and the number of iterations of the MCMC algorithm to J = 10. We discard the \ufb01rst 5\nsamples of the MCMC algorithm as burn-in. To enable a fair comparison, we run the standard CSC\nalgorithm for I \u00d7 M iterations, i.e. the total number of M-step iterations in \u03b1CSC. 
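The projection-and-rescaling step used earlier to keep the ADMM baselines comparable can be sketched as follows: shrinking an atom onto the unit \u2113\u2082 ball while multiplying its activations by the same factor leaves the reconstruction unchanged, since convolution is linear in each argument. project_and_rescale is an illustrative helper, not code from the benchmarked implementations.

```python
import numpy as np

def project_and_rescale(d, z):
    """Project each atom (row of d) onto the unit l2 ball and rescale its
    activations so that every reconstruction d_k * z_k is unchanged."""
    norms = np.maximum(np.linalg.norm(d, axis=1), 1.0)  # only shrink if norm > 1
    return d / norms[:, None], z * norms[:, None]

rng = np.random.default_rng(0)
d = 3.0 * rng.standard_normal((2, 16))  # infeasible atoms (norm > 1)
z = rng.random((2, 100))
d_p, z_p = project_and_rescale(d, z)
print(np.linalg.norm(d_p, axis=1))  # all <= 1
```

Because the product d_k \u2217 z_k is invariant under this rescaling, the objective values reported for the feasibility-violating iterates remain meaningful.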
We also compared αCSC against competing state-of-the-art methods previously applied to neural time series: Brockmeier and Príncipe [7] and MoTIF [6]. Starting from multiple random initializations, the estimated atoms with the smallest ℓ2 distance to the true atoms are shown in Fig. 3.

In the artifact-free scenario, all algorithms perform equally well, except for MoTIF, which suffers from the presence of activations with varying amplitudes. This is because it aligns the data using correlations before performing the eigenvalue decomposition, without taking into account the strength of the activations in each trial. The performance of Brockmeier and Príncipe [7] and CSC degrades as the level of corruption increases. On the other hand, αCSC is clearly more robust to the increasing level of corruption and recovers reasonable atoms even when 20% of the trials are corrupted.

Figure 4: Atoms learnt by αCSC on LFP data containing epileptiform spikes, with α = 2. (a) LFP spike data from [8]. (b) Estimated atoms.

Figure 5: (a) Three atoms learnt from a rodent striatal LFP channel, using CSC on cleaned data, and both CSC and αCSC on the full data. The atoms capture the cross-frequency coupling of the data (dashed rectangle). (b) Comodulogram, presenting the cross-frequency coupling intensity computed between pairs of frequency bands on the entire cleaned signal, following [37].

Results on LFP data: In our last set of experiments, we consider real neural data from two different datasets.
We first applied αCSC to an LFP dataset previously used in [8], containing epileptiform spikes as shown in Fig. 4(a). The data was recorded in the rat cortex and is free of artifacts. Therefore, we used the standard CSC with our optimization scheme (i.e., αCSC with α = 2). As a standard preprocessing procedure, we applied a high-pass filter at 1 Hz in order to remove drifts in the signal, and then applied a tapered cosine window to down-weight the samples near the edges. We set λ = 6, N = 300, T = 2500, L = 350, and K = 3. The atoms recovered by our algorithm are shown in Fig. 4(b). We can observe that the estimated atoms resemble the spikes in Fig. 4(a). These results show that, without using any heuristics, our approach can recover atoms similar to the ones reported in [8], even though it neither makes any assumptions on the shapes of the waveforms, nor initializes the atoms with template spikes in order to ease the optimization.

The second dataset is an LFP channel in a rodent striatum from [35]. We segmented the data into 70 trials of length 2500 samples, windowed each trial with a tapered cosine function, and detrended the data with a high-pass filter at 1 Hz. We set λ = 10 and initialized the weights wn to the inverse of the variance of each trial xn. In all experiments, the atoms are initialized with Gaussian white noise.

As opposed to the first LFP dataset, this dataset contains strong artifacts, as shown in Fig. 1(b). In order to illustrate the potential of CSC on this data, we first manually identified and removed the trials that were corrupted by artifacts. In Fig. 5(a), we illustrate the atoms estimated by CSC on the manually-cleaned data. We observe that the estimated atoms correspond to canonical waveforms found in the signal.
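The preprocessing applied to both LFP datasets (a 1 Hz high-pass filter to remove drifts, followed by a tapered cosine window to down-weight the trial edges) can be sketched as follows; the Butterworth filter order and the taper fraction are our assumptions, not values taken from the paper:

```python
import numpy as np
from scipy.signal import butter, filtfilt, windows

def preprocess_trials(X, sfreq, hp_freq=1.0, taper=0.1):
    """High-pass filter trials at `hp_freq` Hz to remove slow drifts,
    then apply a tapered cosine (Tukey) window to down-weight the
    samples near the edges. X has shape (n_trials, n_times)."""
    # Zero-phase Butterworth high-pass (order 4 is an assumption).
    b, a = butter(4, hp_freq / (sfreq / 2.0), btype="highpass")
    X = filtfilt(b, a, X, axis=-1)
    # Tukey window: flat in the middle, cosine tapers at both edges.
    return X * windows.tukey(X.shape[-1], alpha=taper)
```

For the second dataset this would be applied to the 70 trials of 2500 samples each before learning the atoms.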
In particular, the high-frequency oscillations around 80 Hz are modulated in amplitude by the low-frequency oscillation around 3 Hz, a phenomenon known as cross-frequency coupling (CFC) [36]. We can observe this by computing a comodulogram [37] on the entire signal (Fig. 5(b)), which measures the correlation between the amplitude of the high-frequency band and the phase of the low-frequency band.

Even though CSC is able to provide these excellent results on the cleaned dataset, its performance heavily relies on the manual removal of the artifacts. Finally, we repeated the previous experiment on the full data, without removing the artifacts, and compared CSC with αCSC, where we set α = 1.2. The results are shown in the middle and right sub-figures of Fig. 5(a). It can be observed that, in the presence of strong artifacts, CSC is no longer able to recover the atoms. On the contrary, we observe that αCSC can still recover atoms as observed in the artifact-free regime. In particular, the cross-frequency coupling phenomenon is still visible.

5 Conclusion

We address the present need in the neuroscience community to better capture the complex morphology of brain waves. Our approach is based on a probabilistic formulation of a CSC model. We propose an inference strategy based on MCEM to deal efficiently with heavy-tailed noise and to take into account the polarity of neural activations with a positivity constraint. Our problem formulation allows the use of fast quasi-Newton methods that outperform previously proposed state-of-the-art ADMM-based algorithms, even when not making use of our parallel implementation.
Results on LFP data demonstrate that such algorithms can be robust to the presence of transient artifacts in data and reveal insights on neural time series without supervision.

6 Acknowledgements

The work was supported by the French National Research Agency grants ANR-14-NEUC-0002-01, ANR-13-CORD-0008-02, and ANR-16-CE23-0014 (FBIMATRIX), as well as the ERC Starting Grant SLAB ERC-YStG-676943.

References

[1] S. R. Cole and B. Voytek. Brain oscillations and the importance of waveform shape. Trends Cogn. Sci., 2017.

[2] M. X. Cohen. Analyzing neural time series data: Theory and practice. MIT Press, 2014. ISBN 9780262319560.

[3] S. R. Jones. When brain rhythms aren't 'rhythmic': implication for their mechanisms and meaning. Curr. Opin. Neurobiol., 40:72–80, 2016.

[4] A. Mazaheri and O. Jensen. Asymmetric amplitude modulations of brain oscillations generate slow evoked responses. The Journal of Neuroscience, 28(31):7781–7787, 2008.

[5] R. Hari and A. Puce. MEG-EEG Primer. Oxford University Press, 2017.

[6] P. Jost, P. Vandergheynst, S. Lesage, and R. Gribonval. MoTIF: an efficient algorithm for learning translation invariant dictionaries. In Acoustics, Speech and Signal Processing, ICASSP, volume 5. IEEE, 2006.

[7] A. J. Brockmeier and J. C. Príncipe. Learning recurrent waveforms within EEGs. IEEE Transactions on Biomedical Engineering, 63(1):43–54, 2016.

[8] S. Hitziger, M. Clerc, S. Saillet, C. Benar, and T. Papadopoulo. Adaptive waveform learning: A framework for modeling variability in neurophysiological signals. IEEE Transactions on Signal Processing, 2017.

[9] B. Gips, A. Bahramisharif, E. Lowet, M. Roberts, P. de Weerd, O. Jensen, and J.
van der Eerden. Discovering recurring patterns in electrophysiological recordings. J. Neurosci. Methods, 275:66–79, 2017.

[10] R. Grosse, R. Raina, H. Kwong, and A. Y. Ng. Shift-invariant sparse coding for audio classification. In 23rd Conference on Uncertainty in Artificial Intelligence, UAI'07, pages 149–158. AUAI Press, 2007. ISBN 0-9749039-3-0.

[11] F. Heide, W. Heidrich, and G. Wetzstein. Fast and flexible convolutional sparse coding. In Computer Vision and Pattern Recognition (CVPR), pages 5135–5143. IEEE, 2015.

[12] B. Wohlberg. Efficient algorithms for convolutional sparse representations. IEEE Transactions on Image Processing, 25(1):301–315, 2016.

[13] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In Computer Vision and Pattern Recognition (CVPR), pages 2528–2535. IEEE, 2010.

[14] M. Šorel and F. Šroubek. Fast convolutional sparse coding using matrix inversion lemma. Digital Signal Processing, 2016.

[15] K. Kavukcuoglu, P. Sermanet, Y-L. Boureau, K. Gregor, M. Mathieu, and Y. LeCun. Learning convolutional feature hierarchies for visual recognition. In Advances in Neural Information Processing Systems (NIPS), pages 1090–1098, 2010.

[16] M. Pachitariu, A. M. Packer, N. Pettit, H. Dalgleish, M. Hausser, and M. Sahani. Extracting regions of interest from biological images with convolutional sparse block coding. In Advances in Neural Information Processing Systems (NIPS), pages 1745–1753, 2013.

[17] B. Mailhé, S. Lesage, R. Gribonval, F. Bimbot, and P. Vandergheynst. Shift-invariant dictionary learning for sparse representations: extending K-SVD. In 16th Eur. Signal Process. Conf., pages 1–5. IEEE, 2008.

[18] Q. Barthélemy, C. Gouy-Pailler, Y. Isaac, A. Souloumiac, A. Larue, and J. I. Mars. Multivariate temporal dictionary learning for EEG. J. Neurosci.
Methods, 215(1):19–28, 2013.

[19] G. Samorodnitsky and M. S. Taqqu. Stable non-Gaussian random processes: stochastic models with infinite variance, volume 1. CRC Press, 1994.

[20] E. E. Kuruoglu. Signal processing in α-stable noise environments: a least Lp-norm approach. PhD thesis, University of Cambridge, 1999.

[21] B. B. Mandelbrot. Fractals and scaling in finance: Discontinuity, concentration, risk. Selecta volume E. Springer Science & Business Media, 2013.

[22] U. Şimşekli, A. Liutkus, and A. T. Cemgil. Alpha-stable matrix factorization. IEEE SPL, 22(12):2289–2293, 2015.

[23] Y. Wang, Y. Qi, Y. Wang, Z. Lei, X. Zheng, and G. Pan. Delving into α-stable distribution in noise suppression for seizure detection from scalp EEG. J. Neural Eng., 13(5):056009, 2016.

[24] S. Leglaive, U. Şimşekli, A. Liutkus, R. Badeau, and G. Richard. Alpha-stable multichannel audio source separation. In ICASSP, pages 576–580, 2017.

[25] J. M. Chambers, C. L. Mallows, and B. W. Stuck. A method for simulating stable random variables. Journal of the American Statistical Association, 71(354):340–344, 1976.

[26] P. J. Huber. Robust Statistics. Wiley, 1981.

[27] S. Godsill and E. Kuruoglu. Bayesian inference for time series with heavy-tailed symmetric α-stable noise processes. Proc. Applications of Heavy Tailed Distributions in Economics, Engineering and Statistics, 1999.

[28] S. Chib and E. Greenberg. Understanding the Metropolis-Hastings algorithm. The American Statistician, 49(4):327–335, 1995.

[29] J. S. Liu. Monte Carlo strategies in scientific computing. Springer, 2008.

[30] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.

[31] E. Moulines, P. Duhamel, J.-F. Cardoso, and S. Mayrargue.
Subspace methods for the blind identification of multichannel FIR filters. IEEE Transactions on Signal Processing, 43(2):516–525, 1995.

[32] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[33] A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon. Learning sparsely used overcomplete dictionaries. In Conference on Learning Theory, pages 123–137, 2014.

[34] J. Gorski, F. Pfeuffer, and K. Klamroth. Biconvex sets and optimization with biconvex functions: a survey and extensions. Mathematical Methods of Operations Research, 66(3):373–407, 2007.

[35] G. Dallérac, M. Graupner, J. Knippenberg, R. C. R. Martinez, T. F. Tavares, L. Tallot, N. El Massioui, A. Verschueren, S. Höhn, J. B. Bertolus, et al. Updating temporal expectancy of an aversive event engages striatal plasticity under amygdala control. Nature Communications, 8:13920, 2017.

[36] O. Jensen and L. L. Colgin. Cross-frequency coupling between neuronal oscillations. Trends in Cognitive Sciences, 11(7):267–269, 2007.

[37] A. B. L. Tort, R. Komorowski, H. Eichenbaum, and N. Kopell. Measuring phase-amplitude coupling between neuronal oscillations of different frequencies. J. Neurophysiol., 104(2):1195–1210, 2010.
", "award": [], "sourceid": 753, "authors": [{"given_name": "Mainak", "family_name": "Jas", "institution": "T\u00e9l\u00e9com ParisTech"}, {"given_name": "Tom", "family_name": "Dupr\u00e9 la Tour", "institution": "T\u00e9l\u00e9com ParisTech"}, {"given_name": "Umut", "family_name": "Simsekli", "institution": "Bogazici University"}, {"given_name": "Alexandre", "family_name": "Gramfort", "institution": "LTCI, CNRS, T\u00e9l\u00e9com ParisTech, Universit\u00e9 Paris-Saclay"}]}