{"title": "Multivariate Sparse Coding of Nonstationary Covariances with Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1612, "page_last": 1621, "abstract": "This paper studies statistical characteristics of multivariate observations with irregular changes in their covariance structures across input space. We propose a unified nonstationary modeling framework to jointly encode the observation correlations to generate a piece-wise representation with a hyper-level Gaussian process (GP) governing the overall contour of the pieces. In particular, we couple the encoding process with automatic relevance determination (ARD) to promote sparsity to account for the inherent redundancy. The hyper GP enables us to share statistical strength among the observation variables over a collection of GPs defined within the observation pieces to characterize the variables' respective local smoothness. Experiments conducted across domains show superior performances over the state-of-the-art methods.", "full_text": "Multivariate Sparse Coding of Nonstationary\n\nCovariances with Gaussian Processes\n\nRui Li\n\nGolisano College of Computing and Information Sciences\n\nRochester Institute of Technology\n\nRochester, NY 14623\n\nrxlics@rit.edu\n\nAbstract\n\nThis paper studies statistical characteristics of multivariate observations with ir-\nregular changes in their covariance structures across input space. We propose\na uni\ufb01ed nonstationary modeling framework to jointly encode the observation\ncorrelations to generate a piece-wise representation with a hyper-level Gaussian\nprocess (GP) governing the overall contour of the pieces. In particular, we couple\nthe encoding process with automatic relevance determination (ARD) to promote\nsparsity to account for the inherent redundancy. 
The hyper GP enables us to share statistical strength among the observation variables over a collection of GPs defined within the observation pieces to characterize the variables' respective local smoothness. Experiments conducted across domains show superior performances over the state-of-the-art methods.

1 Introduction

In many real-world applications, multivariate observations exhibit critical irregular changes in their covariance smoothness with sharp transitions. For example, a major challenge to accurately locate a seizure-onset zone (SOZ) through intracranial electroencephalography (iEEG) recordings is to detect different forms of sudden transient electrophysiologic events of SOZ signals [1, 2, 3]. Another scenario is that regional outbursts and geographic features (e.g., parks, rivers) lead to complex spatio-temporal variations of crime occurrences across regions over time [4, 5].

In these scenarios, some segments of the observations exhibit larger variability than others. Stationary methods, which assume the same covariance structure throughout the entire input space, cannot capture such changes in covariance smoothness. Conventional nonstationary modeling methods are limited to modeling univariate observations via two consecutive steps [6, 7, 8]. They first recursively partition the input space into regions, and then define separate local Gaussian processes (GPs) within each region. The GP inference step cannot capture long-range dependence or share statistical information among the independent local GPs. One way to alleviate the problem is to combine the local GPs with a global GP which is fitted to the whole observation [9, 10]. 
However, this tends to over-smooth the local covariance variability.

To address these challenges, we propose a novel nonstationary modeling framework that jointly infers a piece-wise representation of the multivariate observations and a hyper-level GP governing the overall contour of the pieces. In particular, we employ a multilogit regression function to encode the observation correlations, coupled with automatic relevance determination (ARD) priors over the coefficients to promote sparsity. This encoding process transforms the observations into a set of disjoint pieces to model the variability in the covariance smoothness, with the correlations providing between-variable information. Since commonly only a portion of the observations are informative for such a transformation, ARD shrinks the correlation dimensions towards zero to handle the inherent redundancy. Regulated by the hyper GP through their mean functions, a collection of variable-specific local GPs are defined to model the variables' respective smoothness within the pieces. The hyper GP not only shares statistical strength across the local GPs while retaining their distinctive covariance properties, but also induces the observation variables' conditional independence. 

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The piece-wise representation leads to efficient posterior computation with the conjugate priors.

We evaluate our nonstationary modeling method across domains: for seizure onset localization, we achieve robustly better performances than the state-of-the-art competing methods; for crime occurrence prediction, by modeling the evolving covariances of weekly crime rates among the 179 census tracts in Washington, D.C. between 2015 and 2019, we outperform the state-of-the-art methods.

2 Related work

Although iEEG recordings provide critical information to locate areas of the brain to remove for epilepsy patients, pre-surgical examination of between-seizure iEEG signals is a labor-intensive and error-prone process [1, 11]. It becomes increasingly essential to develop effective computational methods to identify the iEEG channels that are most likely to be in the SOZs by identifying different abrupt changes in neurophysiological events [2, 3, 12, 13]. Empirical studies focus on identifying biomarkers (e.g., spectral features, high frequency oscillations (HFO)) related to sub-clinical epileptic bursts [1, 12]. Classical modeling methods make a stationarity assumption without considering covariance change over time. In [13], a Markov switching process is coupled with a stochastic process prior to analyze the iEEG signal dynamics. A factor graphical model is proposed to integrate temporal and spatial information of iEEG channels to infer pathologic brain activity for SOZ localization [3]. Specifically, the spatial property is defined as correlations between channels, and the temporal function measures correlations between a channel's current state and the linear combination of its previous states. A GP with a stationary covariance is applied to model nonlinearity in neonatal EEG signals for seizure detection, and shows a high level of prediction performance [14]. 
For crime prediction, an autoregressive mixture model with Poisson processes is proposed [5]. Its most recent extension PoINAR incorporates a stochastic process prior to group spatial correlation modes across multiple time series, and achieves the state-of-the-art performance [15].

Nonstationary covariance function modeling methods with designed or learned kernels typically assume the same covariance structure as a function of distances from observations throughout the input space [16, 17]. This is a strong modeling assumption for the above applications, where sharp transitions in covariance smoothness play the key role. Partitioning, including Bayesian trees, Voronoi tessellation, and normalized cuts (N-cuts), is widely used for modeling nonstationarity with abrupt changes [6, 7, 8, 9]. The local GPs defined within the recursively partitioned regions are independent. To capture the long-range trend, some methods define a global stationary GP over the entire input space, and combine it with the local GPs. This leads to over-smoothing the complex covariances induced by the local GPs, since the global GP is also independent of the partition inference procedure [9, 10]. Additionally, these methods are subject to constraints such as requiring partition points to lie at observation locations, or balanced binary trees. A mixture of GP experts models nonstationary univariate observations by defining each GP expert over the entire input space [18, 19].

3 Unified nonstationary modeling framework

Our framework encodes observation correlations into a trending piece-wise representation with both ARD and hyper GP priors. By coupling the relevance vectors with the hyper GP, we are able to share statistical information among the pieces. 
Given the hyper GP, each observation variable is modeled by a conditionally independent GP within the pieces for its local covariance smoothness.

3.1 Sparse coding for observation correlations

Let Y = {y_1, ..., y_N} denote a set of multivariate observations at locations {x_1, ..., x_N} with x_i ∈ X as a non-random covariate in the input space X and an observation y_i ∈ R^{D×1}. We encode Y into K pieces with the corresponding inputs X = ∪_k X_k and X_k ∩ X_{k'} = ∅, where k ≠ k'.

Let Z denote a N × K indicator matrix. Its element z_ik = δ(x_i ∈ X_k) is the one-of-K encoding of x_i, where z_ik turns on iff x_i ∈ X_k. Its probability of being 1 is a multilogit regression function:

p(z_{ik} = 1 \mid \theta_k, Q) = \frac{\exp(\theta_k^T Q(\cdot, y_i))}{\sum_{k'=1}^{K} \exp(\theta_{k'}^T Q(\cdot, y_i))}    (1)

where θ_k denotes a N × 1 coefficient vector for the k'th piece, and Q(·, y_i) is the i'th column vector of the observation correlation matrix Q.

We employ the sparse prior ARD to explore how the correlation between any two observations contributes to the encoding. ARD eliminates the irrelevant correlations by encouraging their coefficients to go to zero. Specifically, we define independent, zero-mean, spherically symmetric Gaussian priors on θ_k:

p(\theta_k \mid \alpha_k) = \mathcal{N}(\theta_k \mid 0, A_k^{-1})    (2)

where A_k^{-1} = diag(α_k^{-1}) denotes a diagonal matrix with the components of vector α_k^{-1} on the diagonal. Each component of the precision parameter α_k is given a Γ(a, b) prior. The ARD method penalizes non-zero coefficient components by an amount determined by the precision parameters. 
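The encoding probability in (1) and the shrinkage effect of the ARD prior in (2) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; all function and variable names, as well as the numeric values, are ours.

```python
import numpy as np

def piece_probabilities(theta, q_i):
    """Multilogit encoding of eq. (1): softmax of theta_k^T Q(., y_i) over K pieces.

    theta: (K, N) coefficient vectors; q_i: (N,) correlation column Q(., y_i).
    """
    logits = theta @ q_i
    logits -= logits.max()          # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# ARD intuition from eq. (2): components whose precision alpha grows without
# bound have their coefficient driven to zero, leaving a few relevance vectors.
rng = np.random.default_rng(0)
K, N = 3, 8
theta = rng.normal(size=(K, N))
alpha = rng.choice([1.0, 1e9], size=(K, N))        # hypothetical fitted precisions
theta_sparse = np.where(alpha > 1e6, 0.0, theta)   # alpha -> inf  =>  theta -> 0

p = piece_probabilities(theta_sparse, rng.normal(size=N))
```

Only the observations whose coefficients survive the shrinkage (the relevance vectors) influence which piece an input is assigned to.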
Iterative estimation of α_k and θ_k leads to α_k becoming large for components whose evidence in the correlations is insufficient for overcoming the penalty induced by the prior. Having α_k → ∞ drives θ_k → 0, which implies that the corresponding correlations do not contribute to the encoding. Therefore, ARD identifies a subset of the observations, known as relevance vectors, with non-zero coefficients for each piece.

Let v_k denote the input in X_k whose corresponding observation is the relevance vector with the maximum absolute value of non-zero component of θ_k, where v_k ∈ V ⊂ X. We define a function g : V → R which describes the overall contour of the observation pieces by sharing statistical information among them:

g(v) \sim GP(0, \kappa_g(v, v'))    (3)

where κ_g is a covariance function defined on V. We use a squared-exponential kernel \kappa_g = \sigma_g^2 \exp(-l_g \|v - v'\|_2^2) to encourage a smooth profile of the pieces. We further define a local function f_k : X_k → R for each piece:

f_k(x) \mid g \sim GP(g(v_k), \kappa_k(x, x'))    (4)

where g(v_k) specifies the mean function of the GP prior for the local function f_k. κ_k is a squared-exponential kernel \kappa_k = \sigma_k^2 \exp(-l_k \|x - x'\|_2^2) defining a covariance function. We assume l_k = \frac{l_g}{\|X_k\|_2^2} to let the horizontal lengthscales of the local functions reflect the global smoothness.

3.2 Piece-wise GPs for univariate observations

Let g = g(V) ∈ R^{K×1} and f = [f_1(X_1)^T, ..., f_K(X_K)^T]^T ∈ R^{N×1}; the hyper-level and local GPs define two joint Gaussians for any finite set of observations, respectively:

p(g \mid V) = \mathcal{N}(g \mid 0, \Sigma_g) \qquad p(f \mid g, Z) = \mathcal{N}(f \mid Zg, \Sigma_f)    (5)

where Σ_g is the covariance matrix with κ_g(v, v') as the elements, and Σ_f is a diagonal block covariance matrix in which the elements of the k'th block Σ_f^{(k)} are κ_k of the input pairs in the k'th piece as [\Sigma_f^{(k)}]_{ij} = \kappa_k(x_i^{(k)}, x_j^{(k)}), where x_i^{(k)}, x_j^{(k)} ∈ X_k.

One can analytically marginalize g conditioned on the piece-wise representation Z, yielding

p(f \mid Z) = \mathcal{N}(f \mid 0, Z\Sigma_g Z^T + \Sigma_f)    (6)

A univariate observation y ∈ R^{N×1} with noise is thus generated as:

p(y \mid f, \sigma^2) = \mathcal{N}(y \mid f, \sigma^2 I)    (7)

where I is a N × N identity matrix. Recalling (6), the marginal likelihood conditioned on Z yields

p(y \mid Z) = \int p(y \mid f) \, p(f \mid Z) \, df = \mathcal{N}(y \mid 0, \Sigma_y)    (8)

where \Sigma_y = Z\Sigma_g Z^T + \Sigma_f + \sigma^2 I denotes the induced nonstationary covariance matrix. Σ_y captures
\u03a3y captures\nthe varying covariance structures of the pieces, and the discontinuities between them.\n\n3\n\n\f3.3 Piece-wise GPs for multivariate observations\nTo extend to multivariate observations Y = {y(1),\u00b7\u00b7\u00b7 , y(D)} where y(d) \u2208 RN\u00d71 denotes the\nobservations of the dth variable (e.g., a feature of iEEG recording, a census tract\u2019s crime occurrences),\nwe model each variable\u2019s observations y(d) as a realization from a speci\ufb01c local function f (d) as\n\nThe variable-speci\ufb01c local functions {f (d)} are conditionally independent given g and Z:\n\np(y(d)|f (d), \u03c32) = N (y(d)|f (d), \u03c32I)\n\np(f (1:D)|g, Z) =\n\nD(cid:89)\n\nd=1\n\np(f (d)|g, Z)\nD | 1\n|2\u03c0 \u03a3f\n|2\u03c0\u03a3f| D\n\nexp[\u2212 1\n2\n\n2\n\n2\n\n=\n\n(cid:88)\n\n(f (d) \u2212 \u00aff )T \u03a3\u22121\n\nf (f (d) \u2212 \u00aff ))]N (\u00aff|Zg,\n\ntr(\n\nd\n\n\u03a3f\nD\n\n)\n\n(9)\n\n(10)\n\n(cid:80)\n\n=\n\n(cid:80)\n\nd f (d)\nwhere \u00aff =\nD . The assumption allows to share statistical strength among the observation vari-\nables through g while retaining variable-speci\ufb01c covariance variability, as each variable\u2019s observations\ncan be derived by marginalizing over f (d):\n\np(y(d)|g, Z, \u03c32) =\n\np(y(d)|f (d), \u03c32I)p(f (d)|g, Z)df (d) = N (y(d)|Zg, \u03a3y|g)\n\n(11)\n\n(cid:90)\n\nwhere \u03a3y|g = \u03a3f + \u03c32I. By exploiting the conditional independence of Y , the marginal likelihood\nfor the multivariate observations is:\np(Y |Z, \u03c32) =\n\np(y(d)|g, Z, \u03c32)p(g|V )dg\n\n(cid:90) D(cid:89)\n\nd=1\n\n|2\u03c0 \u03a3y|g\nD | 1\n|2\u03c0\u03a3y|g| D\n\n2\n\n2\n\nexp[\u2212 1\n2\n\ntr(\n\n(cid:88)\n\nd\n\n(y(d) \u2212 \u00afy)(y(d) \u2212 \u00afy)T )\u03a3\u22121\n\ny|g]N (\u00afy|0,\n\n\u03a3y|g\nD\n\n+ Z\u03a3gZ T )\n\n(12)\n\nd y(d)\nwhere \u00afy =\nD . The kth diagonal block of the covariance matrix is [\u03a3(k)]ij = \u03bag(vk, vk) +\n[\u03bak(x(k)\n) + \u03c32\u03b4(i, j)]/D. 
The multivariate case in (12) can be reduced to (8) when D = 1. The computation complexity for (12) is O(KM^3), where M denotes the rough size of each piece. By optimizing (12), we can determine the settings of the hyperparameters {l_g, σ_g^2, σ_f^2}.^1

3.4 Efficient inference

We develop a Gibbs sampling solution to iteratively sample the GP functions and the piece-wise representation given their priors and the observations, and then update the hyper-parameters given the latent functions and the observations.

First, our model's joint probability can be factorized as

p(Y, \{f^{(1:D)}\}, g, Z, \{\theta_k\}, V, X, \sigma^2, Q, \alpha) \propto \prod_{d=1}^{D} [p(y^{(d)} \mid f^{(d)}, \sigma^2) \, p(f^{(d)} \mid g, Z)] \, p(g \mid Z) \prod_{k=1}^{K} \Big[\prod_{i=1}^{N} p(z_{ik} \mid \theta_k, Q(\cdot, y_i)) \, p(\theta_k \mid \alpha_k) \, p(\alpha_k)\Big]    (13)

We propose to adopt the Rao-Blackwellized sampling scheme through analytic marginalization from the joint distribution of {f^{(1:D)}} and g, and sample them from their respective posteriors. This improves the efficiency of our Gibbs sampler by reducing the underlying sample space and the variance of later estimates. The conjugate priors result in closed-form marginalization.

By combining the likelihood marginalized over f^{(d)} in (11) and the prior in (5), we sample g from its posterior as

p(g \mid Y, Z) \propto \mathcal{N}(g \mid \mu_{g|y}, \Sigma_{g|y}), \qquad \Sigma_{g|y}^{-1} = \Sigma_g^{-1} + Z^T \Sigma_{y|g}^{-1} Z, \qquad \mu_{g|y} = \Sigma_{g|y} Z^T \Sigma_{y|g}^{-1} \bar{y}    (14)

^1 See the supplementary material for the derivation of (12) and its gradients.

Figure 1: (a) Plot of mean and ±1 std of the log marginal likelihood in (12) of the true positive iEEG observations in the training set versus different K. 
(b) Empirical mean of the 8 PIB features of a true positive iEEG observation's heldout segment (blue), our method's predictive mean of the corresponding y_*^{(1:D)} in (24) (red), and the predictive mean of the Heinonen et al. method [16] (green). (c) Boxplots of the cross-validation RMSEs summarizing the true positive observations in the training set versus K.

By marginalizing over g, each f^{(d)} has the following posterior distribution:

p(f^{(d)} \mid y^{(d)}, f^{(-d)}, Z, \sigma^2) \propto p(y^{(d)} \mid f^{(d)}, \sigma^2) \, p(f^{(d)} \mid f^{(-d)}, Z)    (15)

where f^{(-d)} denotes the set {f^{(1:D)}} other than f^{(d)}. The first term p(y^{(d)} | f^{(d)}, σ^2) is as in (9), and for the second term, we have

p(f^{(d)} \mid f^{(-d)}, Z) = \int p(f^{(d)} \mid g, Z) \, p(g \mid f^{(-d)}, Z) \, dg    (16)

Recalling (10) and (5), the conditional distribution of g in (16) is

p(g \mid f^{(-d)}, Z) \propto p(f^{(-d)} \mid g, Z) \, p(g \mid Z) = \mathcal{N}(g \mid \mu_{g|f^{(-d)}}, \Sigma_{g|f^{(-d)}})
\Sigma_{g|f^{(-d)}}^{-1} = \Sigma_g^{-1} + Z^T \Big(\frac{\Sigma_f}{D-1}\Big)^{-1} Z, \qquad \mu_{g|f^{(-d)}} = \Sigma_{g|f^{(-d)}} Z^T \Big(\frac{\Sigma_f}{D-1}\Big)^{-1} \bar{f}^{(-d)}    (17)

where \bar{f}^{(-d)} = \frac{\sum_{d' \neq d} f^{(d')}}{D-1}. 
Thus, we have the conditional distribution of f^{(d)} as

p(f^{(d)} \mid f^{(-d)}, Z) = \int p(f^{(d)} \mid g, Z) \, p(g \mid f^{(-d)}, Z) \, dg = \mathcal{N}(f^{(d)} \mid Z\mu_{g|f^{(-d)}}, \Sigma_f + Z\Sigma_{g|f^{(-d)}} Z^T)    (18)

and the posterior distribution of f^{(d)} as

p(f^{(d)} \mid y^{(d)}, f^{(-d)}, Z, \sigma^2) = \mathcal{N}(f^{(d)} \mid \mu_{f^{(d)}|f^{(-d)}}, \Sigma_{f^{(d)}|f^{(-d)}})
\Sigma_{f^{(d)}|f^{(-d)}}^{-1} = (\Sigma_f + Z\Sigma_{g|f^{(-d)}} Z^T)^{-1} + (\sigma^2 I)^{-1}
\mu_{f^{(d)}|f^{(-d)}} = \Sigma_{f^{(d)}|f^{(-d)}} [(\sigma^2 I)^{-1} y^{(d)} + (\Sigma_f + Z\Sigma_{g|f^{(-d)}} Z^T)^{-1} Z\mu_{g|f^{(-d)}}]    (19)

We marginalize over {f^{(1:D)}} to sample z_ik from its posterior by combining the marginal likelihood in (11) and the prior in (1):

p(z_{ik} = 1 \mid Y, Z_{-ik}, g, \sigma^2, \{\theta_k\}, Q) \propto p(Y \mid g, Z, \sigma^2) \prod_i \prod_k p(z_{ik} = 1 \mid \theta_k, Q(\cdot, y_i)) \propto \prod_d \mathcal{N}(y^{(d)} \mid Zg, \Sigma_{y|g}) \prod_i \prod_k \exp(\theta_k^T Q(\cdot, y_i))    (20)

where Z_{-ik} denotes the matrix Z other than element z_ik. For binary random variables, the Metropolis-Hastings (MH) algorithm is shown to mix faster and have greater statistical efficiency than standard Gibbs samplers [20]. To update z_ik given Z_{-ik}, we thus use the posterior of (20) to evaluate a MH proposal which flips the binary variable z_ik with the current value z to its complement value z̄:

z_{ik} \sim \kappa(\bar{z} \mid z) \, \delta(z_{ik}, \bar{z}) + (1 - \kappa(\bar{z} \mid z)) \, \delta(z_{ik}, z), \qquad \kappa(\bar{z} \mid z) = \min\Big\{\frac{p(z_{ik} = \bar{z} \mid Y, Z_{-ik}, \sigma^2, \{\theta_k\}, Q)}{p(z_{ik} = z \mid Y, Z_{-ik}, \sigma^2, \{\theta_k\}, Q)}, 1\Big\}    (21)

Figure 2: (a) Absolute correlation matrix of a heldout iEEG observation with SOZ events. (b) The corresponding posterior covariance matrix of f^{(1:D)} with the diagonal blocks as the local covariance matrices of the observation pieces averaged over the Gibbs samples. 
(c) The blue bands indicate the epileptologist's labels on the SOZ events of the iEEG signal (gray), and the red segments are the encoded pieces predicted to be SOZ events.

To compute the conditional posterior of a coefficient vector θ_k, we fix the set {θ_{-k}} other than θ_k and have

p(\theta_k \mid Z, \{\theta_{-k}\}, \alpha_k, Q) \propto \prod_i p(z_{ik} \mid \{\theta_k\}, Q) \, p(\theta_k \mid \alpha_k) \propto \prod_i \eta_{ik}^{\delta(z_i = k)} (1 - \eta_{ik})^{\delta(z_i \neq k)} \, \mathcal{N}(\theta_k \mid 0, A_k^{-1})    (22)

where \eta_{ik} \propto \exp[\theta_k^T Q(\cdot, y_i) - \log \sum_{k' \neq k} \exp(\theta_{k'}^T Q(\cdot, y_i))]. We adopt the logistic sampling technique with auxiliary variable sampling for its efficiency [21].

Finally, given {θ_k} and recalling that each α_k is gamma distributed, its posterior is

p(\alpha_k \mid \theta_k, a, b) = \Gamma\Big(a + \frac{|S_k|}{2}, \, b + \frac{\sum_{i,k} \theta_{ik}^2}{2}\Big)    (23)

The set S_k contains the indices for which θ_ik has prior precision α_k.

From (11) and (14), the predictive distribution of new observations y_*^{(d)} for the d'th variable is

p(y_*^{(d)} \mid Y, Z) = \int p(y_*^{(d)} \mid g, Z, \sigma^2) \, p(g \mid Y, Z) \, dg = \mathcal{N}(y_*^{(d)} \mid Z\mu_{g|y}, Z\Sigma_{g|y} Z^T + \Sigma_{y|g})    (24)

The computational complexity for the predictive is O(M^2 N) due to the block structure of the covariance matrix. After precomputation, the per-iteration complexity is reduced to O(M^2).

4 Experiments

We test our method across two domains. For seizure onset localization, we leverage our model to detect early seizure discharges characterized by irregular covariance changes in iEEG recordings. For crime occurrence prediction, our model captures the sharp transitions in regional crime occurrence covariances.

4.1 iEEG data description

The dataset of iEEG recordings for SOZ detection is from 83 epilepsy patients.^2 
The patients with different SOZs are surgically implanted with different numbers of iEEG sensors in potentially epileptogenic regions in the brain. Among 4966 electrodes in total, 911 of them identified to be in SOZs by clinical epileptologists are taken as true positive examples. The iEEG data are down-sampled to 5 kHz, and filtered to remove artifacts. We adopt power-in-band (PIB) features measuring the iEEG data's spectral power in 8 frequency bands: delta (0-3 Hz), low-theta (3-6 Hz), high-theta (6-9 Hz), alpha (9-14 Hz), beta (14-25 Hz), low-gamma (30-55 Hz), high-gamma (65-115 Hz), and ripple (125-150 Hz), as in [3]. The PIB features extracted from every second in a 2-hour iEEG recording construct an observation Y with D = 8 feature variables and length N = 7200.

^2 The dataset is available at ftp://msel.mayo.edu/EEG_Data/

4.2 MCMC settings

For each observation, we simulate 3 chains of 7000 Gibbs iterations, and discard the first 3000 as the burn-in phase. Each sampling chain is initialized with parameters sampled from their priors. We set the Γ(a, b) prior on the ARD precisions as a = |S_k| and b = a/1000, where S_k is defined in (23). This prior specification is equally informative for various choices of the effective coefficient number |S_k| by fixing the mean of the prior distribution. Given the number of pieces K fixed, the marginal likelihood in (12) is a function of the hyperparameters {l_g, σ_g^2, σ_f^2}. We use an empirical Bayes approach to determine the optimum hyperparameter values by optimizing the log marginal likelihood.^3 To determine K, we evaluate the marginal likelihood of the true positive iEEG observations in the training set as shown in Figure 1 (a). It suggests K ≈ 1010 is sufficient to capture the covariance variability. We perform the Gelman-Rubin diagnostic [22] to assess convergence by calculating the within-chain and between-chain variances on the Gibbs samples of the posteriors.

Figure 3: The mean and ±1 std of AUROC scores for different lengthscale l_g and K.

4.3 SOZ localization

We evaluate SOZ localization as a binary classification task in terms of SOZ abnormal events predicted to be present or absent in an iEEG channel's observation, and use standard performance metrics to compare with the state-of-the-art methods in Table 1.

We use 10-fold cross-validation (CV) to evaluate predictions with a 30% test set while keeping the same proportion of SOZ and non-SOZ observations in both sets. We first evaluate our model's regression performance as demonstrated in Figure 1 (b) and (c). Figure 1 (b) shows that both the long-range trend and the changes in covariance smoothness are captured without over-smoothing the local GPs. We summarize the regression performance to fine-tune K in Figure 1 (c) based on the discussion in Section 4.2. Our method encodes each observation's correlations into a covariance matrix with diagonal block structures, as illustrated in Figure 2 (a)-(b). In Figure 2 (c) the pieces capturing SOZ abnormal events are identified through a clinical epileptologist's visual inspection of the true positive iEEG signals. We utilize the local posterior covariance matrices, illustrated in Figure 2 (b), associated with SOZ events as the features of the positive examples, and the local covariances without SOZ events as the negative ones. 
The averaged similarities of the local covariances in the test set to the positive and negative examples are calculated via the Wasserstein metric: \|\mu_f - \mu_{f'}\|_2^2 + \mathrm{tr}(\Sigma_f + \Sigma_{f'} - 2(\Sigma_{f'}^{1/2} \Sigma_f \Sigma_{f'}^{1/2})^{1/2}), respectively, and the ratios are used to predict whether an observation consists of SOZ-related pieces.

Table 1: Performance evaluation of the SOZ channel detection

Methods                   AUROC        Precision    Recall (Sensitivity)  F1 score
Our method                0.80 ± 0.05  0.41 ± 0.07  0.75 ± 0.06           0.51 ± 0.07
Factor graph model [3]    0.72 ± 0.03  0.39 ± 0.05  0.74 ± 0.03           0.46 ± 0.04
HFO biomarker [12]        0.66 ± 0.07  0.34 ± 0.05  0.53 ± 0.08           0.41 ± 0.04
Partition-based nonstationary models
N-cuts based mGP [9]      0.65 ± 0.03  0.61 ± 0.02  0.35 ± 0.07           0.43 ± 0.05
Tree based method [6]     0.64 ± 0.08  0.59 ± 0.03  0.38 ± 0.05           0.40 ± 0.03
Nonstationary covariance function models
Paciorek et al. [17]      0.63 ± 0.07  0.41 ± 0.05  0.43 ± 0.09           0.39 ± 0.04
Heinonen et al. [16]      0.67 ± 0.05  0.43 ± 0.03  0.58 ± 0.03           0.42 ± 0.07

Figure 3 further explores the interactions between K and l_g around its optimum in terms of the classification performances. The K leading to the best performance is consistent with the regression performance in Figure 1 (c). In Table 1, the factor graphical model method heuristically divides the iEEG recordings into non-overlapping three-second epochs to accommodate SOZ events [3], whereas our method is more flexible by learning the SOZ pieces with various lengths. We implement the other partition-based methods with the same settings as ours. 
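The similarity computation above uses the closed-form 2-Wasserstein distance between Gaussians. A small sketch, with an eigendecomposition-based matrix square root; the helper names are ours, not the paper's, and the ratio-based SOZ scoring over example sets is omitted.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric PSD matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gaussian_w2_sq(mu1, S1, mu2, S2):
    """Squared 2-Wasserstein distance between N(mu1, S1) and N(mu2, S2):
    ||mu1 - mu2||^2 + tr(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2})."""
    r = psd_sqrt(S2)
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(S1 + S2 - 2.0 * psd_sqrt(r @ S1 @ r)))
```

For identical Gaussians the distance is zero, and for equal covariances it reduces to the squared Euclidean distance between the means, which makes it a natural similarity between the local posterior covariance blocks.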
Since these methods can only model univariate observations, we apply them on each PIB feature and take the average. For the Heinonen et al. method [16], we run 3 chains of 5000 samples of HMC-NUTS sampling to infer the three sets of hyperparameters (noise variance, signal variance, and lengthscale), and initialize the method as suggested. One key to the method is the balance between the signal variance and the nonstationary lengthscale, which is intrinsically related to the partition-based idea. For the Paciorek et al. method [17], we use the Matern covariance function described in the paper. The Matern kernel leads to less smooth functions, but it still assumes the covariance structure is the same throughout the entire input space.

^3 See the supplementary material for the implementation.

Figure 4: (a) RMSE of prediction performance to fine-tune K. (b) Absolute correlation matrix of the crime occurrence rates of 179 CTs in 2015-2019. (c) The corresponding posterior covariance matrix of f^{(1:D)} averaged over the Gibbs samples. (d) Plots of observation mean (blue), our method's posterior and predictive mean (red), and the N-cuts based mGP [9]'s mean (green).

4.4 Crime event prediction

We apply our method to model the nonstationary evolution of crime occurrence rates in the 179 census tracts (CTs) in Washington, D.C. between 2015 and 2019 for crime occurrence prediction.^4 We analyze the crime rates on a weekly basis, with 227 weeks in total. By denoting the crime rates in a CT with a variable, we have the multivariate observation Y with D = 179 and N = 227.

We follow the model setting strategy as in Section 4.2. In particular, we find K = 15 with l_g = 0.5 sufficient to account for the crime rates' nonstationary variations based on the predictive performance, as shown in Figure 4 (a). 
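The selection of K by marginal likelihood or predictive performance described in Sections 4.2 and 4.4 amounts to a sweep over candidate values. Below is a hedged sketch: log_marginal evaluates a zero-mean Gaussian marginal likelihood as in (8), while build_cov is a toy stand-in for the model's covariance construction, which we do not reproduce here.

```python
import numpy as np

def log_marginal(y, Sigma_y):
    """Gaussian log marginal likelihood log N(y | 0, Sigma_y), cf. eq. (8)."""
    n = len(y)
    L = np.linalg.cholesky(Sigma_y)                      # Sigma_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # Sigma_y^{-1} y
    return float(-0.5 * y @ alpha - np.log(np.diag(L)).sum()
                 - 0.5 * n * np.log(2.0 * np.pi))

def select_K(y, candidate_Ks, build_cov):
    """Pick the number of pieces K maximizing the training marginal likelihood,
    mirroring the sweeps behind Figure 1(a) and Figure 4(a)."""
    return max(candidate_Ks, key=lambda K: log_marginal(y, build_cov(K)))

# toy stand-in: build_cov(K) returns an isotropic covariance whose scale depends
# on K; a real build_cov would assemble Z Sigma_g Z^T + Sigma_f + sigma^2 I.
y = np.array([1.0, -1.0, 1.0, -1.0])
best = select_K(y, [1, 2, 4], lambda K: (K / 2.0) * np.eye(len(y)))  # best == 2
```

The Cholesky-based evaluation is the standard O(N^3) route; the paper's block-diagonal structure brings this down to O(KM^3) as noted after (12).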
The results in Figure 4 (b)-(d) indicate that we are able to capture the abrupt changes in covariance structure of the CTs' crime rates over time via the posterior and the predictive estimates of y_*^{(1:D)}. Figure 4 (d) shows that the classic nonstationary method mGP [9] tends to over-smooth the local covariance variability by combining a global GP with the local GPs.

We predict the one-week-ahead crime rates in each tract for the first 16 weeks in 2019 based on the posterior estimates in 2015-2018. We estimate the posterior predictive of the 2019 weekly crime rate in each CT y_*^{(d)} as in (24) by averaging over the Gibbs samples. Table 2 shows the monthly-averaged prediction RMSEs, conditioned on the observations in 2015-2018. For PoINAR, we use the same setting as in [15]. The implementations of the Paciorek et al. [17] and Heinonen et al. [16] methods are the same as in Section 4.3. One major challenge to implementing the Paciorek et al. method is that the number of its hyperparameters increases fast in multivariate cases. In particular, computation of the kernel matrices at each input location is slow because of the matrix decomposition (O(D^3)). In contrast, our method is more computationally efficient by introducing the conditional independence given the hyper-GP as in (10). The results indicate that our method produces lower RMSE. Figure 5 visualizes the RMSEs between our method's predictions and the ground truth by CTs geographically.

Figure 5: 2019 monthly averaged RMSE maps between the ground truth and our model's predictive means of y_*^{(1:D)}.

^4 The crime data are available on http://opendata.dc.gov

Table 2: Monthly average RMSE (± error) of one-week-ahead predictions of the crime rates in 2019.

Methods                  Jan. 2019      Feb. 2019      Mar. 2019      April 2019
Our method               0.638 ± 0.025  0.707 ± 0.023  0.815 ± 0.029  0.817 ± 0.027
N-cuts based mGP [9]     0.657 ± 0.023  0.818 ± 0.019  0.893 ± 0.033  1.071 ± 0.032
PoINAR [15]              0.839 ± 0.017  0.825 ± 0.014  0.912 ± 0.086  1.165 ± 0.006
Paciorek et al. [17]     0.949 ± 0.034  1.122 ± 0.055  1.176 ± 0.209  1.462 ± 0.147
Heinonen et al. [16]     0.704 ± 0.031  0.875 ± 0.118  0.931 ± 0.763  1.069 ± 0.014

5 Conclusions

Our unified nonstationary modeling framework integrates a sparse encoding process that transforms the observations into a piece-wise representation with a hyper GP defined over its relevance vectors. The hyper GP governs a set of local GPs fitted to the pieces through their mean functions. The framework efficiently extends to multivariate observations by inducing conditional independence among variables and between their respective local GPs. It achieves superior performance over the state-of-the-art competitors by effectively capturing both sharp changes in covariance smoothness and the long-range trend.

6 Acknowledgments

This work is funded in part by the National Science Foundation (NSF-1850492).

References

[1] Christopher P. Warren, Sanqing Hu, Matt Stead, Benjamin H. Brinkmann, Mark R. Bower, and Gregory A. Worrell. Synchrony in normal and focal epileptic brain: The seizure onset zone is functionally disconnected. Journal of Neurophysiology, 104(6):3530-3539, October 2010.

[2] Greg A. Worrell, Andrew B. Gardner, S. Matt Stead, Sanqing Hu, Steve Goerss, Gregory J. Cascino, Fredric B. Meyer, Richard Marsh, and Brian Litt. High-frequency oscillations in human temporal lobe: Simultaneous microwire and clinical macroelectrode recordings. Brain, 131(4):928-937, October 2008.

[3] Yogatheesan Varatharajah, Min Jin Chong, Krishnakant Saboo, Brent Berry, Benjamin Brinkmann, Gregory Worrell, and Ravishankar Iyer. 
EEG-GRAPH: A factor-graph-based model for capturing spatial, temporal, and observational relationships in electroencephalograms. In NIPS, pages 5377-5386, December 2017.

[4] David McDowall, Colin Loftin, and Matthew Pate. Seasonal cycles in crime, and their variability. Journal of Quantitative Criminology, 28(3):389-410, September 2012.

[5] Matthew A. Taddy. Autoregressive mixture models for dynamic spatial Poisson processes: Application to tracking intensity of violent crime. Journal of the American Statistical Association, 105(492):1403-1417, January 2010.

[6] Robert B. Gramacy and Herbert K. H. Lee. Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103(483):1119-1130, March 2009.

[7] Hyoung-Moon Kim, Bani K. Mallick, and C. C. Holmes. Analyzing nonstationary spatial data using piecewise Gaussian processes. Journal of the American Statistical Association, 100(470):653-668, June 2005.

[8] Sunho Park and Seungjin Choi. Hierarchical Gaussian process regression. In ACML, pages 95-110, November 2010.

[9] Emily B. Fox and David B. Dunson. Multiresolution Gaussian processes. In NIPS, pages 737-745, December 2012.

[10] Edward Snelson and Zoubin Ghahramani. Local and global sparse Gaussian process approximations. In AISTATS, pages 524-531, March 2007.

[11] Nicholas Michael Wetjen, Greg Worrell, Jeffrey Britton, Gregory Cascino, W. R. Marsh, Fredric B. Meyer, Cheolsu Shin, and Elson So. Intracranial electroencephalography seizure onset patterns and surgical outcomes in nonlesional extratemporal epilepsy. Journal of Neurosurgery, 61(1):1147-1152, July 2009.

[12] Su Liu, Zhiyi Sha, Altay Sencer, Aydin Aydoseli, Nerse Bebek, Aviva Abosch, Thomas Henry, Candan Gurses, and Nuri Firat Ince.
Exploring the time-frequency content of high frequency oscillations for automated identification of seizure onset zone in epilepsy. Journal of Neural Engineering, 13(2):26026-26041, February 2016.

[13] Drausin F. Wulsin, Emily B. Fox, and Brian Litt. Modeling the complex dynamics and changing correlations of epileptic events. Artificial Intelligence, 216(1):55-75, November 2014.

[14] Stephen Faul, Gregor Gregorcic, Geraldine Boylan, William Marnane, Gordon Lightbody, and Sean Connolly. Gaussian process modeling of EEG for the detection of neonatal seizures. IEEE Transactions on Biomedical Engineering, 54(12):2151-2162, December 2007.

[15] Sivan Aldor-Noiman, Lawrence D. Brown, Emily B. Fox, and Robert A. Stine. Spatio-temporal low count processes with application to violent crime events. Statistica Sinica, 26(8):1587-1610, December 2016.

[16] Markus Heinonen, Henrik Mannerstrom, Juho Rousu, Samuel Kaski, and Harri Lahdesmaki. Non-stationary Gaussian process regression with Hamiltonian Monte Carlo. In AISTATS, pages 732-740, June 2016.

[17] Christopher J. Paciorek and Mark J. Schervish. Nonstationary covariance functions for Gaussian process regression. In NIPS, pages 273-280, December 2003.

[18] Carl Edward Rasmussen and Zoubin Ghahramani. Infinite mixtures of Gaussian process experts. In NIPS, pages 881-888, December 2001.

[19] Edward Meeds and Simon Osindero. An alternative infinite mixture of Gaussian process experts. In NIPS, pages 883-890, December 2005.

[20] Arnoldo Frigessi, Patrizia Di Stefano, Chii-Ruey Hwang, and Shuenn-Jyi Sheu. Convergence rates of the Gibbs sampler, the Metropolis algorithm and other single-site updating dynamics. Journal of the Royal Statistical Society, Series B, 55(1):205-219, March 1993.

[21] Chris C. Holmes and Leonhard Held. Bayesian auxiliary variable models for binary and multinomial regression.
Bayesian Analysis, 1(1):145-168, March 2006.

[22] Stephen P. Brooks and Andrew E. Gelman. General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4):434-455, November 1998.
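As a worked illustration of the evaluation protocol in Section 4 (averaging Gibbs predictive draws into a posterior-predictive mean as in (24), then reporting monthly-averaged weekly RMSEs as in Table 2), the following Python sketch shows the arithmetic involved. All array names, sizes, and the synthetic inputs are hypothetical stand-ins, not the paper's implementation or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: S Gibbs samples of one-week-ahead predictive draws
# for D census tracts over W weeks (stand-ins for draws of y*^(1:D)).
S, D, W = 200, 50, 16
gibbs_samples = rng.normal(loc=1.0, scale=0.3, size=(S, W, D))
ground_truth = rng.normal(loc=1.0, scale=0.3, size=(W, D))

# Posterior predictive mean: average over the Gibbs samples, cf. (24).
pred_mean = gibbs_samples.mean(axis=0)                 # shape (W, D)

# Weekly RMSE across tracts, then monthly averages over 4-week blocks
# (Jan.-April, as in Table 2).
weekly_rmse = np.sqrt(((pred_mean - ground_truth) ** 2).mean(axis=1))
monthly_rmse = weekly_rmse.reshape(4, 4).mean(axis=1)  # shape (4,)
```

The per-month error bars in Table 2 would then come from the spread of the weekly RMSEs within each 4-week block.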