{"title": "Multi-scale Graphical Models for Spatio-Temporal Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 316, "page_last": 324, "abstract": "Learning the dependency structure between spatially distributed observations of a spatio-temporal process is an important problem in many fields such as geology, geophysics, atmospheric sciences, oceanography, etc. . However, estimation of such systems is complicated by the fact that they exhibit dynamics at multiple scales of space and time arising due to a combination of diffusion and convection/advection. As we show, time-series graphical models based on vector auto-regressive processes are inef\ufb01cient in capturing such multi-scale structure. In this paper, we present a hierarchical graphical model with physically derived priors that better represents the multi-scale character of these dynamical systems. We also propose algorithms to ef\ufb01ciently estimate the interaction structure from data. We demonstrate results on a general class of problems arising in exploration geophysics by discovering graphical structure that is physically meaningful and provide evidence of its advantages over alternative approaches.", "full_text": "Multi-scale Graphical Models for Spatio-Temporal\n\nProcesses\n\nFirdaus Janoos\u2217\n\nHuseyin Denli\n\nNiranjan Subrahmanya\n\nExxonMobil Corporate Strategic Research\n\nAnnandale, NJ 08801\n\nAbstract\n\nLearning the dependency structure between spatially distributed observations of\na spatio-temporal process is an important problem in many \ufb01elds such as geol-\nogy, geophysics, atmospheric sciences, oceanography, etc. . However, estimation\nof such systems is complicated by the fact that they exhibit dynamics at multiple\nscales of space and time arising due to a combination of diffusion and convec-\ntion/advection [17]. 
As we show, time-series graphical models based on vector\nauto-regressive processes[18] are inef\ufb01cient in capturing such multi-scale struc-\nture.\nIn this paper, we present a hierarchical graphical model with physically\nderived priors that better represents the multi-scale character of these dynamical\nsystems. We also propose algorithms to ef\ufb01ciently estimate the interaction struc-\nture from data. We demonstrate results on a general class of problems arising in\nexploration geophysics by discovering graphical structure that is physically mean-\ningful and provide evidence of its advantages over alternative approaches.\n\n1\n\nIntroduction\n\nConsider the problem of determining the connectivity structure of subsurface aquifers in a large\nground-water system from time-series measurements of the concentration of tracers injected and\nmeasured at multiple spatial locations. This problem has the following features: (i) pressure gra-\ndients driving ground-water \ufb02ow have unmeasured disturbances and changes; (ii) the data contains\nonly concentration of the tracer, not \ufb02ow direction or velocity; (iii) there are regions of high perme-\nability where ground water \ufb02ows at (relatively) high speeds and tracer concentration is conserved\nand transported over large distances (iv) there are regions of low permeability where ground water\ndiffuses slowly into the bed-rock and the tracer is dispersed over small spatial scales and longer\ntime-scales.\n\nReconstructing the underlying network structure from spatio-temporal data occurring at multiple\nspatial and temporal scales arises in a large number of \ufb01elds. An especially important set of ap-\nplications arise in exploration geophysics, hydrology, petroleum engineering and mining where the\naim is to determine the connectivity of a particular geological structure from sparsely distributed\ntime-series readings [16]. 
Examples include exploration of ground-water systems and petroleum reservoirs from tracer concentrations at key locations, or the use of electrical, induced-polarization and electro-magnetic surveys to determine networks of ore deposits, groundwater, petroleum, pollutants and other buried structures [24]. Other examples of multi-scale spatio-temporal phenomena with network structure include: flow of information through neural/brain networks [15]; traffic flow through traffic networks [3]; spread of memes through social networks [23]; diffusion of salinity, temperature, pressure and pollutants in atmospheric sciences and oceanography [9]; transmission networks for genes, populations and diseases in ecology and epidemiology; and the spread of tracers and drugs through biological networks [17], etc.

* Corresponding Author: firdaus@ieee.org

These systems typically exhibit the following features: (i) the physics are linear in the observed / state variables (e.g. pressure, temperature, concentration, current) but non-linear in the unknown parameter that determines interactions (e.g. permeability, permittivity, conductance); (ii) there may be unobserved / unknown disturbances to the system; (iii) (multi-scale structure) there are interactions occurring over large spatial scales versus those primarily in local neighborhoods. Moreover, the large-scale and small-scale processes exhibit characteristic time-scales determined by the balance of convection velocity and diffusivity of the system. A physics-based approach to estimating the structure of such systems from observed data is by inverting the governing equations [1]. However, in most cases inversion is extremely ill-posed [21] due to non-linearity in the model parameters and sparsity of data with respect to the size of the parameter space, necessitating strong priors on the solution which are rarely available. 
In contrast, there is a large body of literature on structure\nlearning for time-series using data-driven methods, primarily developed for econometric and neuro-\nscienti\ufb01c data1. The most common approach is to learn vector auto-regressive (VAR) models, either\ndirectly in the time domain[10] or in the frequency domain[4]. These implicitly assume that all\ndynamics and interactions occur at similar time-scales and are acquired at the same frequency [14],\nalthough VAR models for data at different sampling rates have also been proposed [2]. These mod-\nels, however, do not address the problem of interactions occurring at multiple scales of space and\ntime, and as we show, can be very inef\ufb01cient for such systems. Multi-scale graphical models have\nbeen constructed as pyramids of latent variables, where higher levels aggregate interactions at pro-\ngressively larger scales [25]. These techniques are designed for regular grids such as images, and\nare not directly applicable to unstructured grids, where spatial distance is not necessarily related to\nthe dependence between variables. Also, they construct O(log N ) deep trees thereby requiring an\nextremely large (O(N )) latent variable space.\nIn this paper, we propose a new approach to learning the graphical structure of a multi-scale spatio-\ntemporal system using a hierarchy of VAR models with one VAR system representing the large-\nscale (global) system and one VAR-X model for the (small-scale) local interactions. The main\ncontribution of this paper is to model the global system as a \ufb02ow network in which the observed\nvariable both convects and diffuses between sites. Convection-diffusion (C\u2013D) processes naturally\nexhibit multi-scale dynamics [8] and although at small spatial scales their dynamics are varied and\ntransient, at larger spatial scales these processes are smooth, stable and easy to approximate with\ncoarse models [13]. 
Based on this property, we derive a regularization that replicates the large-scale dynamics of C–D processes. The hierarchical model along with this physically derived prior learns graphical structures that are not only extremely sparse and rich in their description of the data, but also physically meaningful. The multi-scale model both reduces the number of edges in the graph by clustering nodes and has smaller order than an equivalent VAR model. Next, in Section 3, model relaxations that simplify estimation are developed, along with efficient algorithms. In Section 4, we present an application to learning the connectivity structure for a class of problems dealing with flow through a medium under a potential/pressure field and provide theoretical and empirical evidence of its advantages over alternative approaches.

One similar approach is that of clustering variables while learning the VAR structure [12] using sampling-based inference. This method does not, however, model dynamical interactions between the clusters themselves. Alternative techniques such as independent process analysis [20] and AR-PCA [7] have also been proposed, where auto-regressive models are applied to latent variables obtained by ICA or PCA of the original variables. Again, because these are AR not VAR models, the interactions between the latent variables are not captured, and moreover, they do not model the dynamics of the original space. In contrast to these methods, the main aspects of our paper are a hierarchy of dynamical models, where each level explicitly corresponds to a spatio-temporal scale, along with efficient algorithms to estimate their parameters. 
Moreover, as we show in Section 4, the prior derived from the physics of C–D processes is critical to estimating meaningful multi-scale graphical structures.

2 Multi-scale Graphical Model

Notation: Throughout the paper, upper-case letters indicate matrices and lower-case boldface indicates vectors; subscripts index vector components and [t] indexes time.

1 http://clopinet.com/isabelle/Projects/NIPS2009+/

Let y ∈ R^{N×T}, where y[t] = {y1[t] . . . yN[t]}, t = 1 . . . T, be the time-series data observed at N sites over T time-points. To capture the multi-scale structure of interactions at local and global scales, we introduce the K-dimensional (K ≪ N) latent process x[t] = {x1[t] . . . xK[t]}, t = 1 . . . T, to represent K global components that interact with each other. Each observed process yi is then a summation of local interactions along with a global interaction. Specifically:

Global process:  x[t] = Σ_{p=1}^P A[p] x[t−p] + u[t],
Local process:   y[t] = Σ_{q=1}^Q B[q] y[t−q] + Z x[t] + v[t].     (1)

Here Zi,k, i = 1 . . . N, k = 1 . . . K, are binary variables indicating if site yi belongs to global component xk. The N × N matrices B[1] . . . B[Q] capture the graphical structure and dynamics of the local interactions between all yi and yj, while the set of K × K matrices A = {A[1] . . . A[P]} determines the large-scale graphical structure as well as the overall dynamical behavior of the system. The processes u ∼ N(0, σ²u I) and v ∼ N(0, σ²v I) are iid innovations injected into the system at the global and local scales respectively.

Remark: From a graphical perspective, two latent components xk and xl are conditionally independent given all other components xm, ∀m ≠ k, l, if and only if A[p]k,l = 0 for all p = 1 . . . P. Moreover, two nodes yi and yj are conditionally independent given all other nodes ym, m ≠ i, j, and latent components xk, ∀k = 1 . . . K, if and only if B[q]i,j = 0 for all q = 1 . . . Q.

To create the multi-scale hierarchy in the graphical structure, the following two conditions are imposed: (i) each yi belongs to only one global component xk, i.e. Zi,kZi,l = δ[k, l], ∀i = 1 . . . N; and (ii) Bi,j is non-zero only for nodes within the same component, i.e. Bi,j = 0 if yi and yj belong to different global components xk and xk'.

The advantages of this model over a VAR graphical model are two-fold: (i) the hierarchical structure, the fact that K ≪ N, and that yi ↔ yj only if they are in the same global component, result in a very sparse graphical model with a rich multi-scale interpretation; and (ii) as per Theorem 1, the model of eqn. (1) is significantly more parsimonious than an equivalent VAR model for data that is inherently multi-scale.

Theorem 1. The model of eqn. (1) is equivalent to a vector auto-regressive moving-average (VARMA) process y[t] = Σ_{r=1}^R D[r] y[t−r] + Σ_{s=0}^S E[s] ε[t−s], where P ≤ R ≤ P + Q and 0 ≤ S ≤ P, the D[r] are N × N full-rank matrices and the E[s] are N × N matrices with rank less than K. Moreover, the upper bounds are tight if the model of eqn. (1) is minimal. The proof is given in Supplemental Appendix A.

The multi-scale spatio-temporal dynamics are modeled as stable convection–diffusion (C–D) processes governed by hyperbolic–parabolic PDEs of the form ∂y/∂t + ∇·(c⃗ y) = ∇·(κ∇y) + s, where y is the quantity corresponding to y, κ is the diffusivity, c⃗ is the convection velocity and s is an exogenous source. The balance between convection and diffusion is quantified by the Péclet number² of the system [8]. 
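To make the model concrete, the two-level generative process of eqn. (1) can be simulated directly. The sketch below is a minimal NumPy illustration with small, arbitrary dimensions and stand-in coefficient values (none of these numbers come from the paper), using hard binary assignments Z and diagonal local coefficients for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, P, Q, T = 6, 2, 3, 2, 200            # illustrative sizes only; K << N

# Global VAR coefficients A[p] (K x K), damped across lags for stability.
A = [0.3 * rng.standard_normal((K, K)) / (p + 1) for p in range(P)]
# Local coefficients B[q] (N x N); diagonal stand-ins here, whereas the full
# model allows arbitrary couplings within a global component.
B = [np.diag(0.2 * rng.random(N)) for _ in range(Q)]
# Hard assignment matrix Z (N x K): each site belongs to exactly one component.
Z = np.zeros((N, K))
Z[np.arange(N), np.arange(N) % K] = 1.0

x = np.zeros((T, K))
y = np.zeros((T, N))
for t in range(max(P, Q), T):
    # Global process: x[t] = sum_p A[p] x[t-p] + u[t]
    x[t] = sum(A[p] @ x[t - 1 - p] for p in range(P)) + 0.1 * rng.standard_normal(K)
    # Local process: y[t] = sum_q B[q] y[t-q] + Z x[t] + v[t]
    y[t] = sum(B[q] @ y[t - 1 - q] for q in range(Q)) + Z @ x[t] + 0.05 * rng.standard_normal(N)
```

Sites sharing a column of Z then co-move through the common global component, which is the structure the estimation procedure of Section 3 aims to recover.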
These processes are non-linear in diffusivity and velocity, and a full-physics inversion involves estimating κ and c⃗ at each spatial location, which is a highly ill-posed and under-constrained problem [1]. However, because for systems with physically reasonable Péclet numbers the dynamics at larger scales can be accurately approximated on increasingly coarse grids [13], we simplify the model by assuming that, conditioned on the rest of the system, the large-scale dynamics between any two components xi ∼ xj | xk, ∀k ≠ i, j, can be approximated by a 1-d C–D system with constant Péclet number. This approximation allows us to use Theorem 2:

Theorem 2. For the VAR system of eqn. (1), if the dynamics between any two variables xi ∼ xj | xk, ∀k ≠ i, j, are 1-d C–D with infinite boundary conditions and constant Péclet number, then the VAR coefficients Ai,j[t] can be approximated by a Gaussian function Ai,j[t] ≈ exp{−0.5 (t − µi,j)² / σ²i,j} / √(2π σ²i,j), where µi,j is equal to the distance between i and j and σ²i,j is proportional to the product of the distance and the Péclet number. Moreover, this approximation has a multiplicative error of exp(−O(t³)). The proof is given in Supplemental Appendix B.

In effect, the dynamics of a multi-dimensional (i.e. 2-d or 3-d) continuous spatial system are approximated as a network of 1-dimensional point-to-point flows consisting of a combination of advection and diffusion. 

² The Péclet number Pe = Lc/κ is a dimensionless quantity which determines the ratio of advective to diffusive transfer, where L is the characteristic length, c is the advective velocity and κ is the diffusivity of the system.
Although in general the dynamics of higher-dimensional physical systems are not equivalent to a super-position of lower-dimensional systems, as we show in this paper, the stability of C–D physics [13] allows replicating the large-scale graphical structure and dynamics while avoiding the ill-conditioned and computationally expensive inversion of a full-physics model. Moreover, the stability of the C–D impulse response function ensures that the resulting VAR system is also stable.

3 Model Relaxation and Regularization

As the model of eqn. (1) contains non-linear interactions of the real-valued variables x, A and B with the binary Z, along with mixed constraints, direct estimation would require solving a mixed-integer non-linear problem. Instead, in this section we present relaxations and regularizations that allow estimation of the model parameters via convex optimization. The next theorem states that, for a given assignment of measurement sites to global components, the interactions within a component do not affect the interactions between components, which enables replacing the mixed non-linearity due to the constraints on B[q] with a set of unconstrained diagonal matrices C[q], q = 1 . . . Q.

Theorem 3. For a given global-component assignment Z, if A* and x* are local optima to the least-squares problem of eqn. (1), then they are also a local optimum to the least-squares problem for:

x[t] = Σ_{p=1}^P A[p] x[t−p] + u[t]   and   y[t] = Σ_{q=1}^Q C[q] y[t−q] + Z x[t] + v[t],     (2)

where the C[q], q = 1 . . . Q, are diagonal matrices. The proof is given in Supplemental Appendix C.

Furthermore, a LASSO regularization term proportional to ‖C‖1 = Σ_{i=1}^N Σ_{q=1}^Q |C[q]i,i| is added to reduce the number of non-zero coefficients and thereby the effective order of C.

Next, the binary indicator variables Zi,k are relaxed to be real-valued. 
Also, an ℓ1 penalty, which promotes sparsity, combined with an ℓ2 term has been shown to estimate disjoint clusters [19]. Therefore, the spatial disjointedness constraint Zi,kZi,l = δ[k, l], ∀i = 1 . . . N, is relaxed by a penalty proportional to ‖Zi,·‖1, along with the constraint that for each yi the indicator vector Zi,· should lie within the unit sphere, i.e. ‖Zi,·‖2 ≤ 1. This penalty, which also ensures that |Zi,k| ≤ 1, allows interpretation of Zi,· as a soft cluster membership.

One way to regularize Ai,j according to Theorem 2 would be to directly parameterize it as a Gaussian function. Instead, observe that G(t) = exp{−0.5(t − µ)²/σ²}/√(2πσ²) satisfies the equation [∂t + (t − µ)/σ²] G = 0, subject to ∫ G(t) dt = 1. Therefore, defining the discrete version of this operator as the P × P matrix D(γi,j) with D(γi,j)p,p = ∂̂p + γi,j (p − µi,j), the regularization on A is a penalty proportional to ‖D(γ)A‖2,1 = Σi,j ‖D(γi,j)Ai,j‖2, along with the relaxed constraint 0 ≤ Σp Ai,j[p] ≤ 1. Here ∂̂p is an approximation to time-differentiation, µi,j is equal to the distance between i and j, which is known, and γi,j ≥ Γ is inversely proportional to σ²i,j. Importantly, this formulation also admits 0 as a valid solution and has two advantages over direct parametrization: (i) it replaces a problem that is non-linear in σ²i,j, i, j = 1 . . . K, with a penalty that is linear in Ai,j; and (ii) unlike the Gaussian parametrization, it admits the sparse solution Ai,j = 0 for the case when xi does not directly affect xj. 
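A small numerical check of this construction is sketched below. It is an illustration under our own discretization choices (np.gradient as the derivative approximation ∂̂, and γ = 1/σ², consistent with the continuous identity): a Gaussian-shaped coefficient sequence incurs a near-zero penalty, while the all-zero solution is also admitted at zero cost.

```python
import numpy as np

P = 50
t = np.arange(1, P + 1)
mu, sigma = 25.0, 4.0
# Gaussian coefficient sequence of the form suggested by Theorem 2.
a = np.exp(-0.5 * ((t - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

def cd_penalty(coeffs, mu, gamma):
    """||D(gamma) a||_2: a discrete time-derivative plus the gamma*(p - mu) term.
    A Gaussian with gamma = 1/sigma^2 satisfies a' + gamma*(t - mu)*a = 0,
    so its penalty is ~0 up to discretization error."""
    da = np.gradient(coeffs)             # central-difference derivative
    return np.linalg.norm(da + gamma * (t - mu) * coeffs)

penalty_gauss = cd_penalty(a, mu, 1.0 / sigma ** 2)   # near zero
penalty_zero = cd_penalty(np.zeros(P), mu, 1.0 / sigma ** 2)  # exactly zero
```

A generic (non-Gaussian) coefficient sequence incurs a much larger penalty, which is what drives the estimates toward C–D-like dynamics.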
The constant Γ > 0 is a user-specified parameter which prevents γi,j from taking on very small values, thereby avoiding solutions of Ai,j with extremely large variance, i.e. with very small but non-zero values. This penalty, derived from considerations of the dynamics of multi-scale spatio-temporal systems, is the key difference of the proposed method as compared to the sparse time-series graphical model via group LASSO [11].

Putting it all together, the multi-scale graphical model is obtained by optimizing:

[x*, A*, C*, Z*, γ*] = argmin_{x,A,C,Z,γ} f(x, A, C, Z, γ) + g(x, A, C, Z)     (3)

subject to ‖Zi,·‖2² ≤ 1 for all i = 1 . . . N, 0 ≤ Σp Ai,j[p] ≤ 1 for all i, j = 1 . . . K, and γi,j ≥ Γ, ∀i, j = 1 . . . K. The objective function is split into a smooth portion:

f(x, θ) = Σ_{t=1}^T ( ‖y[t] − Σ_{q=1}^Q C[q] y[t−q] − Z x[t]‖2² + λ0 ‖x[t] − Σ_{p=1}^P A[p] x[t−p]‖2² )     (4)

and a non-smooth portion g(θ) = λ1 ‖D(γ)A‖2,1 + λ2 ‖C‖2,1 + λ3 ‖Z‖1. After solving eqn. (4), the local graphical structure within each global component is obtained by solving B* = argmin_B Σ_{t=1}^T ‖y[t] − Σ_{q=1}^Q B[q] y[t−q] − Z* x*[t]‖2² + λ4 ‖B‖2,1, where the zeros of B[q] are pre-determined from Z*.

3.1 Optimization

Given values of [A, Z, C], the problem of eqn. 
(4) is unconstrained and strictly convex in x and γ; given [x, γ], it is unconstrained and strictly convex in C, and convex with constraints in A and Z. Therefore, under these conditions block coordinate descent (BCD) is guaranteed to produce a sequence of solutions that converge to a stationary point [22]. To avoid saddle-points and achieve local minima, a random feasible-direction heuristic is used at stationary points. Defining the blocks of variables to be [x, γ] and [A, C, Z], BCD operates as follows:

1. Initialize x(0) and γ(0).
2. Set n = 0 and repeat until convergence:
   [A(n+1), Z(n+1), C(n+1)] ← argmin_{[A,Z,C]} f(x(n), A, C, Z, γ(n)) + g(x(n), A, C, Z)
   [x(n+1), γ(n+1)] ← argmin_{[x,γ]} f(x, A(n+1), C(n+1), Z(n+1), γ) + g(x, A(n+1), C(n+1), Z(n+1)).

At each iteration, x(n+1) is obtained by directly solving a T × T block tri-diagonal Toeplitz system with blocks of size KP, which has a running time of O(T × KP³) (§ Supplemental Appendix D for details).

Estimating γ(n+1) given A(n+1) amounts to solving min_{γi,j} Σ_{p=1}^P ( ∂̂p Ai,j[p] + γi,j (p − µi,j) Ai,j[p] )², subject to γi,j ≥ Γ, for all i, j = 1 . . . K and i ≠ j. This gives γi,j(n+1) = max( Γ, −Σp ∂̂p Ai,j[p] (p − µi,j) Ai,j[p] / Σp ((p − µi,j) Ai,j[p])² ).

Optimization with respect to A, Z, C is performed using proximal splitting with Nesterov acceleration [5], which produces ε-optimal solutions in O(1/√ε) time, where the constant factor depends on √L(∇θ f), the Lipschitz constant of the gradient of the smooth portion f. 
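The closed-form γ update can be sketched numerically as follows; the central-difference approximation of ∂̂ and the test values are our own illustrative choices. For truly Gaussian coefficients of variance σ², the update recovers γ ≈ 1/σ², and the floor Γ is applied exactly as in the formula above.

```python
import numpy as np

def update_gamma(a, mu, Gamma):
    """Least-squares minimiser of sum_p (da[p] + gamma*(p - mu)*a[p])^2,
    floored at the user-specified lower bound Gamma."""
    p = np.arange(1, len(a) + 1)
    da = np.gradient(a)                  # discrete approximation of d/dt
    w = (p - mu) * a
    return max(Gamma, -np.sum(da * w) / np.sum(w * w))

# Sanity check: Gaussian-shaped coefficients with sigma = 4 give gamma ~ 1/16.
t = np.arange(1, 51)
mu, sigma = 25.0, 4.0
a = np.exp(-0.5 * ((t - mu) / sigma) ** 2)
gamma = update_gamma(a, mu, Gamma=0.0)
```

When the unconstrained minimiser falls below Γ, the update simply returns Γ, which is how the lower bound on γ is enforced within BCD.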
Defining θ = [A, Z, C], the key step in the optimization is a proximal-gradient-descent operation of the form θ(m) = prox_{αm g}( θ(m−1) − αm ∇θ f(x(n), γ(n), θ(m−1)) ), where m is the current gradient-descent iterate, αm is the step size and the proximal operator is defined as prox_g(Θ) = argmin_θ g(x(n), γ(n), θ) + (1/2) ‖θ − Θ‖².

The gradients ∇A f, ∇C f and ∇Z f are straightforward to compute. As shown in Supplemental Appendix E.1, the problem in Z is decomposable into a sum of problems over Zi,· for i = 1 . . . N, where the proximal operator for each Zi,· is prox_g(Zi,·) = max(1, ‖Tλ3(Zi,·)‖2)⁻¹ Tλ3(Zi,·). Here Tλ3(Zi,k) = sign(Zi,k) max(|Zi,k| − λ3, 0) is the element-wise shrinkage operator.

Because A has linear constraints of the form 0 ≤ Σp Ai,j[p] ≤ 1, its proximal operator does not have a closed-form solution and is instead computed using dual ascent [6]. As the problem can be decomposed across the Ai,j for all i, j = 1 . . . K, consider the computation of prox_g(â), where â represents one Ai,j. Defining η as the dual variable, dual ascent proceeds by iterating the following two steps until convergence: 
(i): a(n+1) = (â + η(n)1) − λ (â + η(n)1) / ‖D⁻¹(â + η(n)1)‖2   if ‖D⁻¹(â + η(n)1)‖2 > λ,   and 0 otherwise;
(ii): η(n+1) = η(n) + α(n) (1ᵀ a(n+1) − 1)   if 1ᵀ a(n+1) > 1,   and η(n) − α(n) 1ᵀ a(n+1)   if 1ᵀ a(n+1) < 0.

Here n indexes the dual-ascent inner loop and α(n) is an appropriately chosen step-size. Note that D(γi,j), the P × P matrix approximation to ∂t + γi,j t, is full rank and therefore invertible. Finally, the proximal operator for Ci,i, for all i = 1 . . . N, is Ci,i − λ2 Ci,i/‖Ci,i‖2 if ‖Ci,i‖2 > λ2, and 0 otherwise.

Remark: The hyper-parameters of the system are the multipliers λ0 . . . λ4 and the threshold Γ. The term λ0, which is proportional to σu/σv, implements a trade-off between innovations in the local and global processes. The parameter λ1 penalizes deviation of Ai,j from the expected C–D dynamics, while λ2, λ3 and λ4 control the sparsity of C, Z and B respectively. As explained earlier, Γ > 0, the lower bound on γi,j, prohibits estimates of Ai,j with very high variance and thereby controls the spread / support of A.

Hyper-parameter selection: Hyper-parameter values that minimize the cross-validation error are obtained using grid search. First, solutions over the full regularization path are computed with warm-starting. In our experience, for sufficiently small step sizes warm-starting leads to convergence in a few (< 5) iterations regardless of problem size. Moreover, as B is solved in a separate step, the selection of λ4 is done independently of λ0 . . . λ3. 
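The shrinkage and projection steps above can be sketched as follows. This is a minimal illustration using the standard soft-threshold sign(z)·max(|z| − λ, 0); the function names are ours, not the paper's:

```python
import numpy as np

def soft_threshold(z, lam):
    """Element-wise shrinkage T_lam(z) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def prox_z_row(z, lam):
    """Prox for one indicator row Z_{i,.}: l1-shrink, then scale back into
    the unit l2 ball so that ||Z_{i,.}||_2 <= 1."""
    s = soft_threshold(z, lam)
    return s / max(1.0, np.linalg.norm(s))

def prox_c_group(c, lam):
    """Group soft-threshold for one diagonal series C_{i,i}: zero the whole
    group if its l2 norm is below lam, else shrink its norm by lam."""
    n = np.linalg.norm(c)
    return np.zeros_like(c) if n <= lam else (1.0 - lam / n) * c
```

The row-wise prox for Z is what yields interpretable soft cluster memberships: small entries are zeroed outright and the remainder is kept inside the unit sphere.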
Experimentally, we have observed that an upper limit of Γ = 1 and a step-size of 0.1 are sufficient to explore the space of all solutions. The upper limit on λ3 is the smallest value for which any indicator vector Zi,· becomes all zero. Guidance about the minimum and maximum values of λ0 is obtained using the system-identification technique of auto-correlation least squares.

Initialization: To cold-start the BCD, γi,j(0) is initialized with the upper bound Γ = 1 for all i, j = 1 . . . K. The variables x1(0) . . . xK(0) are initialized as centroids of clusters obtained by K-means on the time-series data y1 . . . yN.

Model order selection: Because of the sparsity penalties, the solutions are relatively insensitive to the model order (P, Q). Therefore, these are typically set to high values and the effective model order is controlled through the sparsity hyper-parameters.

4 Results

In this section we present an application to determining the connectivity structure of a medium from data of flow through it under a potential/pressure field. Such problems include flow of fluids through porous media under pressure gradients, or transmission of electric currents through resistive media due to potential gradients, and commonly arise in exploration geophysics in the study of sub-surface systems like aquifers, petroleum reservoirs, ore deposits and geologic bodies [16]. Specifically, these processes are defined by PDEs of the form:

c⃗ + κ∇p = 0   and   ∇·c⃗ = sq   and   ∂y/∂t + ∇·(y c⃗) = sy,     (5)

with boundary condition n⃗·∇c⃗|∂Ω = 0,     (6)

where y is the state variable (e.g. 
concentration or current), p is the pressure or potential \ufb01eld driving\nthe \ufb02ow, (cid:126)c is the resulting velocity \ufb01eld, \u03ba is the permeability / permittivity, sq is the pressure/poten-\ntial forcing term, sy is the rate of state variable injection into the system. The domain boundary is\ndenoted by \u2202\u2126 and the outward normal by (cid:126)n. The initial condition for tracer is zero over the entire\ndomain.\nIn order to permit evaluation against ground truth, we used the permeability \ufb01eld in Fig. 1(a) based\non a geologic model to study the \ufb02ow of \ufb02uids through the earth subsurface under naturally and\narti\ufb01cially induced pressure gradients. The data were generated by numerical simulation of eqn. (5)\nusing a proprietary high-\ufb01delity solver for T = 12500s with spatially varying pressure loadings\nbetween \u00b1100 units and with random temporal \ufb02uctuations (SNR of 20dB). Random amounts of\ntracer varying between 0 and 5 units were injected and concentration measured at 1s intervals at\nthe 275 sites marked in the image. A video of the simulation is provided as supplemental to the\nmanuscript, and the data and model are available on request . These concentration pro\ufb01les at the\n275 locations are used as the time-series data y input to the multi-scale graphical model of eqn. (1).\nEstimation was done for K = 20, with multiple initializations and hyper-parameter selection as\ndescribed above. The K-means step was initialized by distributing seed locations uniformly at\nrandom. The model orders P and Q were kept constant at 50 and 25 respectively. Labels and colors\nof the sites in Fig. 1(b) indicate the clusters identi\ufb01ed by the K-means step for one initialization\nof the estimation procedure, while the estimated multi-scale graphical structure is shown in Figures\n1(c)\u2013(d). The global graphical structure (\u00a7Fig. 1(c)) correctly captures large-scale features in the\nground truth. 
Furthermore, as seen in Fig. 1(d), the local graphical structure (given by the coefficients of B) is sparse and spatially compact. Importantly, the local graphs are spatially more contiguous than the initial K-means clusters, and only approximately 40% of the labels are conserved between the K-means initialization and the final solution. Furthermore, as shown in Supplemental Appendix F, the estimated graphical structure is fairly robust to initialization, especially in recovering the global graph structure. For all initializations, estimation from a cold start converged in 65–90 BCD iterations, while warm starts converged in < 5 iterations.

Figure 1: Fig.(a). Ground truth permeability (κ) map overlaid with locations where the tracer is injected and measured. Fig.(b). Results of the K-means initialization step. Colors and labels both indicate cluster assignments of the sites. Fig.(c). The global graphical structure for the latent variable x. The nodes are positioned at the centroids of the corresponding local graphs. Fig.(d). The local graphical structure. Again, colors and labels both indicate cluster (i.e. global component) assignments of the sites. Fig.(e). The multi-scale graphical structure obtained when the Gaussian function prior is replaced by group LASSO on A. Fig.(f). The graphical structure estimated using non-hierarchical VAR with group LASSO.

Fig. 1(e) shows the results of estimating the multi-scale model when the penalty term of eqn. (3) for the C–D process prior is replaced by group LASSO. This result highlights the importance of the physically derived prior to reconstructing the graphical structure of the problem. Fig. 1(f) shows the graphical structure estimated using a non-hierarchical VAR model with group LASSO on the coefficients [11] and auto-regressive order P = 10. 
Firstly, this is a significantly larger model, with P × N² coefficients as compared to O(P × N) + O(Q × K²) for the hierarchical model, and is therefore much more expensive to compute. Furthermore, the estimated graph is denser and harder to interpret in terms of the underlying problem, with many long-range edges intermixed with short-range ones. In all cases, model hyper-parameters were selected via the 10-fold cross-validation described in Supplemental Appendix G. Interestingly, in terms of misfit (i.e. training) error (∑_t ‖y[t] − ŷ[t]‖ / ∑_t ‖y[t]‖), the non-hierarchical VAR model performs best (≈ 12.1 ± 4.4% relative error), while the group LASSO and C–D penalized hierarchical models perform equivalently (18.3 ± 5.7% and 17.6 ± 6.2%); this can be attributed to the higher degrees of freedom available to the non-hierarchical VAR. However, in terms of cross-validation (i.e. testing) error, the VAR model was the worst (94.5 ± 8.9%), followed by the group LASSO hierarchical model (48.3 ± 3.7%). The model with the C–D prior performed the best, with a relative error of 31.6 ± 4.5%.

Figure 2: Response functions at a node in component 17 to an impulse in component 1 of Fig. 1(c). Plotted are the impulse responses for eqn. (5) along with 90% bands, the multi-scale model with the C–D prior, the multi-scale model with the group LASSO prior, and the non-hierarchical VAR model with the group LASSO prior.

To characterize the dynamics estimated by the various approaches, we compared the impulse response functions (IRFs) of the graphical models with that of the ground-truth model (eqn. (5)). The IRF for a node i is straightforward to generate for eqn.
(5), while those for the graphical models are obtained by setting v0[i] = 1 and v0[j] = 0 for all j ≠ i, and vt = 0 for t > 0, and then running their equations forward in time. The responses at a node in global component 17 of Fig. 1(c) to an impulse at a node in global component 1 are shown in Fig. 2. As the IRF for eqn. (5) depends on the driving pressure field, which fluctuates over time, the mean IRF along with 90% bands is shown. It can be observed that the multi-scale model with the C–D prior is much better at replicating the dynamical properties of the original system than the model with group LASSO, while the non-hierarchical VAR model with group LASSO fails to capture any relevant dynamics. The results of comparing IRFs for other pairs of sites were qualitatively similar and are therefore omitted.

5 Conclusion

In this paper, we proposed a new approach that combines machine-learning / data-driven techniques with physically derived priors to reconstruct the connectivity / network structure of multi-scale spatio-temporal systems encountered in fields such as exploration geophysics and the atmospheric and ocean sciences. Simple yet computationally efficient algorithms for estimating the model were developed through a set of relaxations and regularizations. The method was applied to the problem of learning the connectivity structure for a general class of problems involving flow through a permeable medium under pressure/potential fields, and the advantages of this method over alternative approaches were demonstrated. Current directions of investigation include incorporating different types of physics, such as hyperbolic (i.e. wave) equations, into the model.
We are also investigating applications of this technique to learning structure in other domains, such as brain networks, traffic networks, and biological and social networks.

References

[1] Akcelik, V., Biros, G., Draganescu, A., Ghattas, O., Hill, J., Bloemen Waanders, B.: Inversion of airborne contaminants in a regional model. In: Computational Science – ICCS 2006, Lecture Notes in Computer Science, vol. 3993, pp. 481–488. Springer Berlin Heidelberg (2006)

[2] Anderson, B., Deistler, M., Felsenstein, E., Funovits, B., Zadrozny, P., Eichler, M., Chen, W., Zamani, M.: Identifiability of regular and singular multivariate autoregressive models from mixed frequency data. In: Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, pp. 184–189 (Dec 2012)

[3] Aw, A., Rascle, M.: Resurrection of "second order" models of traffic flow. SIAM J. Appl. Math. 60(3), 916–938 (2000)

[4] Bach, F.R., Jordan, M.I.: Learning graphical models for stationary time series. IEEE Trans. Sig. Proc. 52(8), 2189–2199 (2004)

[5] Beck, A., Teboulle, M.: Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Image Proc. 18(11), 2419–2434 (Nov 2009)

[6] Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, 2nd edn. (September 1999)

[7] Christmas, J., Everson, R.: Temporally coupled principal component analysis: A probabilistic autoregression method. In: Int. Joint Conf. Neural Networks (2010)

[8] Crank, J.: The Mathematics of Diffusion. Clarendon Press (1975)

[9] Cressie, N., Wikle, C.K.: Statistics for Spatio-Temporal Data. Wiley, Hoboken (2011)

[10] Eichler, M.: Causal inference with multiple time series: principles and problems.
Philosophical Transactions of The Royal Society A 371 (2013)

[11] Haufe, S., Müller, K.R., Nolte, G., Krämer, N.: Sparse causal discovery in multivariate time series. In: Guyon, I., Janzing, D., Schölkopf, B. (eds.) NIPS Workshop on Causality, vol. 1, pp. 1–16 (2008)

[12] Huang, T., Schneider, J.: Learning bi-clustered vector autoregressive models. In: European Conf. Machine Learning (2012)

[13] Hughes, T.: Multiscale phenomena: Green's functions, the Dirichlet-to-Neumann formulation, subgrid-scale models, bubbles and the origin of stabilized methods. Comput. Methods Appl. Mech. Engrg. 127, 387–401 (1995)

[14] Hyvärinen, A., Zhang, K., Shimizu, S., Hoyer, P.O.: Estimation of a structural vector autoregression model using non-Gaussianity. J. Machine Learning Res. 11, 1709–1731 (2010)

[15] Janoos, F., Li, W., Subrahmanya, N., Morocz, I.A., Wells, W.: Identification of recurrent patterns in the activation of brain networks. In: Adv. in Neural Info. Proc. Sys. (NIPS) (2012)

[16] Kearey, P., Brooks, M., Hill, I.: An Introduction to Geophysical Exploration. Blackwell (2011)

[17] Lloyd, C.D.: Exploring Spatial Scale in Geography. Wiley Blackwell (2014)

[18] Moneta, A.: Graphical causal models for time series econometrics: Some recent developments and applications. In: NIPS Mini Symp. Causality and Time Series Analysis (2009)

[19] Panagakis, Y., Kotropoulos, C.: Elastic net subspace clustering applied to pop/rock music structure analysis. Pattern Recognition Letters 38, 46–53 (2014)

[20] Szabó, Z., Lőrincz, A.: Complex independent process analysis. Acta Cybernetica 19, 177–190 (2009)

[21] Tarantola, A.: Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM (2005)

[22] Tseng, P.: Convergence of a block coordinate descent method for nondifferentiable minimization.
Journal of Optimization Theory and Applications 109(3), 475–494 (2001)

[23] Wang, H., Wang, F., Xu, K.: Modeling information diffusion in online social networks with partial differential equations. CoRR abs/1310.0505 (2013)

[24] Wightman, W.E., Jalinoos, F., Sirles, P., Hanna, K.: Application of geophysical methods to highway related problems. Federal Highway Administration, FHWA-IF-04-021 (2003)

[25] Willsky, A.: Multiresolution Markov models for signal and image processing. Proceedings of the IEEE 90(8), 1396–1458 (Aug 2002)