{"title": "Causal discovery with scale-mixture model for spatiotemporal variance dependencies", "book": "Advances in Neural Information Processing Systems", "page_first": 1727, "page_last": 1735, "abstract": "In conventional causal discovery, structural equation models (SEM) are directly applied to the observed variables, meaning that the causal effect can be represented as a function of the direct causes themselves. However, in many real world problems, there are significant dependencies in the variances or energies, which indicates that causality may possibly take place at the level of variances or energies. In this paper, we propose a probabilistic causal scale-mixture model with spatiotemporal variance dependencies to represent a specific type of generating mechanism of the observations. In particular, the causal mechanism including contemporaneous and temporal causal relations in variances or energies is represented by a Structural Vector AutoRegressive model (SVAR). We prove the identifiability of this model under the non-Gaussian assumption on the innovation processes. We also propose algorithms to estimate the involved parameters and discover the contemporaneous causal structure. Experiments on synthetic and real world data are conducted to show the applicability of the proposed model and algorithms.", "full_text": "Causal discovery with scale-mixture model for spatiotemporal variance dependencies\n\nZhitang Chen*, Kun Zhang\u2020, and Laiwan Chan*\n\n*Department of Computer Science and Engineering, Chinese University of Hong Kong, Hong Kong\n\n{ztchen,lwchan}@cse.cuhk.edu.hk\n\n\u2020Max Planck Institute for Intelligent Systems, T\u00fcbingen, Germany\n\nkzhang@tuebingen.mpg.de\n\nAbstract\n\nIn conventional causal discovery, structural equation models (SEM) are directly\napplied to the observed variables, meaning that the causal effect can be represented\nas a function of the direct causes themselves. 
However, in many real world problems,\nthere are significant dependencies in the variances or energies, which indicates\nthat causality may possibly take place at the level of variances or energies. In\nthis paper, we propose a probabilistic causal scale-mixture model with spatiotemporal\nvariance dependencies to represent a specific type of generating mechanism\nof the observations. In particular, the causal mechanism including contemporaneous\nand temporal causal relations in variances or energies is represented by a\nStructural Vector AutoRegressive model (SVAR). We prove the identifiability of\nthis model under the non-Gaussian assumption on the innovation processes. We\nalso propose algorithms to estimate the involved parameters and discover the contemporaneous\ncausal structure. Experiments on synthetic and real world data are\nconducted to show the applicability of the proposed model and algorithms.\n\n1 Introduction\n\nCausal discovery aims to discover the underlying generating mechanism of the observed data, and\nconsequently, the causal relations allow us to predict the effects of interventions on the system\n[15, 19]. For example, if we know the causal structure of a stock market, we are able to predict the\nreactions of other stocks against the sudden collapse of one share price in the market. A traditional\nway to infer the causal structure is by controlled experiments. However, controlled experiments\nare in general expensive, time consuming, technically infeasible and/or ethically prohibited. Thus,\ncausal discovery from non-experimental data is of great importance and has drawn considerable\nattention in the past decades [15, 19, 16, 17, 12, 22, 2]. 
Probabilistic models such as Bayesian\nNetworks (BNs) and Linear Non-Gaussian Acyclic Models (LiNGAM) have been proposed and\napplied to many real world problems [18, 13, 14, 21].\nConventional models such as LiNGAM assume that the causal relations are of a linear form, i.e., if\nthe observed variable x is the cause of another observed variable y, we model the causal relation as\ny = \u03b1x + e, where \u03b1 is a constant coefficient and e is the additive noise independent of x. However,\nin many types of natural signals or time series such as MEG/EEG data [23] and financial data [20],\na common form of nonlinear dependency, as seen from the correlation in variances or energies, is\nfound [5]. This observation indicates that causality may take place at the level of variances or energies\ninstead of the observed variables themselves. Generally speaking, traditional methods cannot\ndetect this type of causal relations. Another restriction of conventional causal models is that these\nmodels assume constant variances of the observations; this assumption is unrealistic for those data\nwith strong heteroscedasticity [1].\nIn this paper, we propose a new probabilistic model called Causal Scale-Mixture model with SpatioTemporal\nVariance Dependencies (CSM-STVD) incorporating the spatial and temporal variance\nor energy dependencies among the observed data. The main feature of the new model is that we\nmodel the spatiotemporal variance dependencies based on the Structural Vector AutoRegressive\n(SVAR) model, in particular the Non-Gaussian SVAR [11]. The contributions of this study are\ntwo-fold. First, we provide an alternative way to model the causal relations among the observations,\ni.e., causality in variances or energies. 
In this model, causality takes place at the level of\nvariances or energies, i.e., the variance or energy of one observed series at time instant t0 is influenced\nby the variances or energies of other variables at time instants t \u2264 t0 and its past values at\ntime instants t < t0. Thus, both contemporaneous and temporal causal relations in variances are\nconsidered. Secondly, we prove the identifiability of this model and more specifically, we show\nthat Non-Gaussianity makes the model fully identifiable. Furthermore, we propose a method which\ndirectly estimates such causal structures without explicitly estimating the variances.\n\n2 Related work\n\nTo model the variance or energy dependencies of the observations, a classic method is to use a scale-mixture\nmodel [5, 23, 9, 8]. Mathematically, we can represent a signal as si = ui\u03c3i, where ui is a\nsignal with zero mean and constant variance, and \u03c3i is a positive factor which is independent of ui\nand modulates the variance or energy of si [5]. For the multivariate case, we have\n\n$$s = u \\odot \\sigma, \\tag{1}$$\n\nwhere \u2299 means element-wise multiplication. In the basic scale-mixture model, u and \u03c3 are statistically\nindependent and the components ui are spatiotemporally independent, i.e., $u_{i,t_{\\tau_1}} \\perp\\!\\!\\!\\perp u_{j,t_{\\tau_2}}, \\forall t_{\\tau_1}, t_{\\tau_2}$.\nHowever, \u03c3i, the standard deviations of the observations, are dependent across i. The observation\nx, in many situations, is assumed to be a linear mixture of the source s, i.e., x = As, where A is a\nmixing matrix.\nIn [5], Hirayama and Hyv\u00e4rinen proposed a two-stage model. The first stage is a classic ICA model\n[3, 10], where the observation x is a linear mixture of the hidden source s, i.e., x = As. 
In the\nsecond stage, the variance dependencies are modeled by applying a linear Non-Gaussian (LiN) SEM\nto the log-energies of the sources:\n\n$$y_i = \\sum_j h_{ij} y_j + h_{i0} + r_i, \\quad i = 1, 2, \\ldots, d,$$\n\nwhere yi = log \u03d5(\u03c3i) are the log-energies of sources si and the nonlinear function \u03d5 is any appropriate\nmeasure of energy; ri are non-Gaussian distributed and independent of yj. To make the problem\ntractable, they assumed that ui are binary, i.e., ui \u2208 {\u22121, 1} and uniformly distributed. The parameters\nof this two-stage model including A and hij are estimated by maximum likelihood without\napproximation due to the uniform binary distribution assumption of u. However, this assumption is\nrestrictive and thus may not fit real world observations well. Furthermore, they assumed that \u03c3i are\nspatially dependent but temporally white. However, many time series, such as financial time series\nand brain signals, show strong heteroscedasticity and temporal variance dependencies. Taking\ntemporal variance dependencies into account would improve the quality of the estimated underlying\nstructure of the observed data.\nAnother two-stage model for magnetoencephalography (MEG) or electroencephalography (EEG)\ndata was proposed earlier in [23]. The first stage also performs linear separation; they proposed\na blind source separation algorithm by exploiting the autocorrelations and time-varying variances\nof the sources.\nIn the second stage, si(t) are modeled by autoregressive processes with L lags\n(AR(L)) driven by innovations ei(t). The innovation processes ei(t) are mutually uncorrelated and\ntemporally white. However, ei(t) are not necessarily independent. 
They modeled ei(t) as follows:\n\n$$e_i(t) = \\sigma_{it} z_i(t), \\quad \\text{where } z_i(t) \\sim \\mathcal{N}(0, 1). \\tag{2}$$\n\nTwo different methods are used to model the dependencies among the variances of the innovations.\nThe first method is causal-in-variance GARCH (CausalVar-GARCH). Specifically, $\\sigma_{it}^2$ are modeled\nby a multivariate GARCH model. The advantage of this model is that we are able to estimate\nthe temporal causal structure in variances. However, this model provides no information about the\ncontemporaneous causal relations among the sources if there indeed exist such causal relations. The\nsecond method to model the variance dependencies is applying a factor model to the log-energies\n($\\log \\sigma_{it}^2$) of the sources. The disadvantage of this method is that we cannot model the causal relations\namong the sources, which are more interesting to us.\nIn many real world observations, there are causal influences in variances among the observed variables.\nFor instance, there are significant mutual influences among the volatilities of the observed\nstock prices. We are more interested in investigating the underlying causal structure among the\nvariances of the observed data. Consequently, in this paper, we consider the situation where the\ncorrelation in the variances of the observed data is interesting. That is, the first stage of [5, 23] is\nnot needed, and we focus on the second stage, i.e., modeling the spatiotemporal variance dependencies\nand causal mechanism among the observations. In the following sections, we propose our\nprobabilistic model based on SVAR to describe the spatiotemporal variance dependencies among\nthe observations. 
Our model is, as shown in later sections, closely related to the models introduced\nin [5, 23], but has significant advantages: (1) both contemporaneous and temporal causal relations\ncan be modeled; (2) this model is fully identifiable under certain assumptions.\n\n3 Causal scale-mixture model with spatiotemporal variance dependencies\n\nWe propose the causal scale-mixture model with spatiotemporal variance dependencies as follows.\nLet z(t) be the m \u00d7 1 observed vector with components zi(t), which are assumed to be generated\naccording to the scale-mixture model:\n\n$$z_i(t) = u_i(t)\\sigma_i(t). \\tag{3}$$\n\nHere we assume that ui(t) are temporally independent processes, i.e., $u_i(t_{\\tau_1}) \\perp\\!\\!\\!\\perp u_j(t_{\\tau_2}), \\forall t_{\\tau_1} \\neq t_{\\tau_2}$,\nbut unlike the basic scale-mixture model, here ui(t) may be contemporaneously dependent, i.e., $u_i(t) \\not\\perp\\!\\!\\!\\perp u_j(t), \\forall i \\neq j$.\n\u03c3(t) is spatially and temporally independent of u(t). Using vector notation,\n\n$$z_t = u_t \\odot \\sigma_t. \\tag{4}$$\n\nHere \u03c3it > 0 are related to the variances or energies of the observations zt and are assumed to be\nspatiotemporally dependent. As in [5, 23], let yt = log \u03c3t. In this paper, we model the spatiotemporal\nvariance dependencies by a Structural Vector AutoRegressive model (SVAR), i.e.,\n\n$$y_t = A_0 y_t + \\sum_{\\tau=1}^{L} B_\\tau y_{t-\\tau} + \\epsilon_t, \\tag{5}$$\n\nwhere A0 contains the contemporaneous causal strengths among the variances of the observations,\ni.e., if [A0]ij \u2260 0, we say that yit is contemporaneously affected by yjt; B\u03c4 contains the temporal\n(time-lag) causal relations, i.e., if [B\u03c4]ij \u2260 0, we say that yi,t is affected by yj,t\u2212\u03c4. Here, \u03f5t are\ni.i.d. mutually independent innovations. 
Let xt = log |zt| (in this model, we assume that ui(t) do\nnot take the value zero) and \u03b7t = log |ut|. Taking the log of the absolute values of both sides of Equation (4),\nwe have the following model:\n\n$$x_t = y_t + \\eta_t, \\qquad y_t = A_0 y_t + \\sum_{\\tau=1}^{L} B_\\tau y_{t-\\tau} + \\epsilon_t. \\tag{6}$$\n\nWe make the following assumptions on the model:\n\nA1 Both \u03b7t and \u03f5t are temporally white with zero means. The components of \u03b7t are not necessarily\nindependent, and we assume that the covariance matrix of \u03b7t is $\\Sigma_\\eta$. The components\nof \u03f5t are independent and $\\Sigma_\\epsilon = I$.\u00b9\nA2 The contemporaneous causal structure is acyclic, i.e., by simultaneous row and column\npermutations, A0 can be permuted to a strictly lower triangular matrix. BL is of full rank.\nA3 The innovations \u03f5t are non-Gaussian, and \u03b7t are either Gaussian or non-Gaussian.\n\n\u00b9 Note that $\\Sigma_\\epsilon = I$ is assumed just for convenience. A0 and B\u03c4 can also be correctly estimated if $\\Sigma_\\epsilon$ is a\ngeneral diagonal covariance matrix. The explanation why the scaling indeterminacy can be eliminated is the\nsame as LiNGAM given in [16].\n\nInspired by the identifiability results of the Non-Gaussian state-space model in [24], we show that\nour model is identifiable. Note that our new model and the state-space model proposed in [24] are\ntwo different models, while interestingly by simple re-parameterization we can prove the following\nLemma 3.1 and Theorem 3.1 following [24].\nLemma 3.1 Given the log-transformed observation xt = log |zt| generated by Equations (6), if the\nassumptions A1 \u223c A2 hold, by solving simple linear equations involving the autocovariances of xt,\nthe covariance $\\Sigma_\\eta$ and AB\u03c4 can be uniquely determined, where $A = (I - A_0)^{-1}$; furthermore, A\nand B\u03c4 can be identified up to some rotation transformations. 
That is, suppose that two models with\nparameters $(A, \\{B_\\tau\\}_{\\tau=1}^{L}, \\Sigma_\\eta)$ and $(\\tilde{A}, \\{\\tilde{B}_\\tau\\}_{\\tau=1}^{L}, \\tilde{\\Sigma}_{\\tilde{\\eta}})$ generate the same observation xt, then we\nhave $\\Sigma_\\eta = \\tilde{\\Sigma}_{\\tilde{\\eta}}$, $\\tilde{A} = AU$, $\\tilde{B}_\\tau = U^T B_\\tau$, where U is an orthogonal matrix.\n\nNon-Gaussianity of the innovations \u03f5t makes the model fully identifiable, as seen in the following\ntheorem.\nTheorem 3.1 Given the log-transformed observation xt = log |zt| generated by Equations (6) and\ngiven L, if assumptions A1 \u223c A3 hold, then the model is identifiable. In other words, suppose\nthat two models with parameters $(A, \\{B_\\tau\\}_{\\tau=1}^{L}, \\Sigma_\\eta)$ and $(\\tilde{A}, \\{\\tilde{B}_\\tau\\}_{\\tau=1}^{L}, \\tilde{\\Sigma}_{\\tilde{\\eta}})$ generate the same\nobservation xt; then these two models are identical, i.e., we have $\\tilde{\\Sigma}_{\\tilde{\\eta}} = \\Sigma_\\eta$, $\\tilde{A} = A$, $\\tilde{B}_\\tau = B_\\tau$,\nand $\\tilde{y}_t = y_t$.\n\n4 Parameter learning and causal discovery\n\nIn this section, we propose an effective algorithm to estimate the contemporaneous causal structure\nmatrix A0 and temporal causal structure matrices B\u03c4, \u03c4 = 1,\u00b7\u00b7\u00b7 , L (see (6)).\n\n4.1 Estimation of AB\u03c4\nWe have shown that AB\u03c4 can be uniquely determined, where $A = (I - A_0)^{-1}$. The proof of\nLemma 3.1 also suggests a way to estimate AB\u03c4, as given below. Readers can refer to the appendix\nfor the detailed mathematical derivation. Although we are aware that this method might not be statistically\nefficient, we adopt this estimation method due to its great computational efficiency. Given\nthe log-transformed observations xt = log |zt|, denote by Rx(k) the autocovariance function of\nxt at lag k, i.e., $R_x(k) = E(x_t x_{t+k}^T)$. 
Based on the model assumptions A1 and A2, we have\nthe following linear equations of the autocovariances of xt:\n\n$$\\begin{bmatrix} R_x(L+1) \\\\ R_x(L+2) \\\\ \\vdots \\\\ R_x(2L) \\end{bmatrix} = \\underbrace{\\begin{bmatrix} R_x(L) & R_x(L-1) & \\cdots & R_x(1) \\\\ R_x(L+1) & R_x(L) & \\cdots & R_x(2) \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ R_x(2L-1) & R_x(2L-2) & \\cdots & R_x(L) \\end{bmatrix}}_{\\triangleq H} \\begin{bmatrix} C_1^T \\\\ C_2^T \\\\ \\vdots \\\\ C_L^T \\end{bmatrix}, \\tag{7}$$\n\nwhere C\u03c4 = AB\u03c4 (\u03c4 = 1,\u00b7\u00b7\u00b7 , L). As shown in the proof of Lemma 3.1, H is invertible. We can\neasily estimate AB\u03c4 by solving the linear Equations (7).\n\n4.2 Estimation of A0\nThe estimations of AB\u03c4 (\u03c4 = 1,\u00b7\u00b7\u00b7 , L) still contain the mixing information of the causal structures\nA0 and B\u03c4. In order to further obtain the contemporaneous and temporal causal relations, we need\nto estimate both A0 and B\u03c4 (\u03c4 = 1,\u00b7\u00b7\u00b7 , L). Here, we show that the estimation of A0 can be reduced\nto solving a Linear Non-Gaussian Acyclic Model with latent confounders.\nSubstituting yt = xt \u2212 \u03b7t into Equations (6), we have\n\n$$x_t - \\eta_t = \\sum_{\\tau=1}^{L} A B_\\tau (x_{t-\\tau} - \\eta_{t-\\tau}) + A\\epsilon_t. \\tag{8}$$\n\nSince AB\u03c4 can be uniquely determined according to Lemma 3.1 or more specifically Equations (7),\nwe can easily obtain $\\xi_t = x_t - \\sum_{\\tau=1}^{L} A B_\\tau x_{t-\\tau}$, and then we have:\n\n$$\\xi_t = A\\epsilon_t + \\eta_t - \\sum_{\\tau=1}^{L} A B_\\tau \\eta_{t-\\tau}. \\tag{9}$$\n\nThis is exactly a Linear Non-Gaussian Acyclic Model with latent confounders and the estimation of\nA is a very challenging problem [6, 2]. 
To make the problem tractable, we further make the following\ntwo assumptions on the model:\n\n\u2022 A4 If the components of \u03b7t are not independent, we assume that \u03b7t follows a factor model:\n\u03b7t = Dft, where the components of ft are spatially and temporally independent Gaussian\nfactors and D is the factor loading matrix (not necessarily square).\n\u2022 A5 The components of \u03f5t are simultaneously super-Gaussian or sub-Gaussian.\n\nBy replacing \u03b7t with Dft, we have:\n\n$$\\xi_t = A\\epsilon_t + \\underbrace{D f_t - \\sum_{\\tau=1}^{L} A B_\\tau D f_{t-\\tau}}_{\\text{confounding effects}}. \\tag{10}$$\n\nTo identify the matrix A which contains the contemporaneous causal information of the observed\nvariables, we treat ft and ft\u2212\u03c4 as latent confounders and the interpretation of assumption A4 is that\nwe can treat the independent factors ft as some external factors outside the system. The Gaussian\nassumption of ft can be interpreted hierarchically as the result of the central limit theorem because these\nfactors themselves represent the ensemble effects of numerous factors from the whole environment.\nOn the contrary, the disturbances \u03f5it are local factors that describe the intrinsic behaviors of the\nobserved variables [4]. Since they are local, they are not regarded as ensembles of a large number\nof factors; accordingly, the disturbances \u03f5it are assumed to be non-Gaussian.\nThe LiNGAM-GC model [2] takes latent confounders into consideration. In that model, the\nconfounders are assumed to follow a Gaussian distribution, which was interpreted as the result of the\ncentral limit theorem. It mainly focuses on the following cause-effect pair:\n\n$$x = e_1 + \\alpha c, \\qquad y = \\rho x + e_2 + \\beta c, \\tag{11}$$\n\nwhere e1 and e2 are non-Gaussian and mutually independent, and c is the latent Gaussian confounder\nindependent of the disturbances e1 and e2. 
To tackle the causal discovery problem of LiNGAM-GC,\nit was first shown that if x and y are standardized to unit absolute kurtosis then |\u03c1| < 1\nbased on the assumption that e1 and e2 are simultaneously super-Gaussian or sub-Gaussian. Note\nthat assumption A5 is a natural extension of this assumption. It holds in many practical problems,\nespecially for financial data. After the standardization, the following cumulant-based measure $\\tilde{R}_{xy}$\nwas proposed [2]:\n\n$$\\tilde{R}_{xy} = (C_{xy} + C_{yx})(C_{xy} - C_{yx}), \\quad \\text{where} \\quad C_{xy} = \\hat{E}\\{x^3 y\\} - 3\\hat{E}\\{xy\\}\\hat{E}\\{x^2\\}, \\quad C_{yx} = \\hat{E}\\{x y^3\\} - 3\\hat{E}\\{xy\\}\\hat{E}\\{y^2\\}, \\tag{12}$$\n\nand $\\hat{E}$ means sample average. It was shown that the causal direction can be identified simply by\nexamining the sign of $\\tilde{R}_{xy}$, i.e., if $\\tilde{R}_{xy} > 0$, x \u2192 y is concluded; otherwise if $\\tilde{R}_{xy} < 0$, y \u2192 x\nis concluded. Once the causal direction has been identified, the estimation of causal strength\nis straightforward. The work can be extended to multivariate causal network discovery following the\nDirectLiNGAM framework [17].\nHere we adopt LiNGAM-GC-UK, the algorithm proposed in [2], to find the contemporaneous causal\nstructure matrix A0. Once A0 has been estimated, B\u03c4 can be easily obtained by $\\hat{B}_\\tau = (I - \\hat{A}_0)\\hat{C}_\\tau$,\nwhere $\\hat{A}_0$ and $\\hat{C}_\\tau$ are the estimations of A0 and AB\u03c4, respectively. 
The algorithm for learning the\nmodel is summarized in Algorithm 1.\n\nAlgorithm 1 Causal discovery with scale-mixture model for spatiotemporal variance dependencies\n1: Given the observations zt, compute xt = log |zt|.\n2: Subtract the mean $\\bar{x}_t$ from xt, i.e., xt = xt \u2212 $\\bar{x}_t$.\n3: Choose an appropriate lag L for the SVAR and then estimate AB\u03c4, where $A = (I - A_0)^{-1}$ and \u03c4 = 1,\u00b7\u00b7\u00b7 , L, using Equations (7).\n4: Obtain the residuals by $\\xi_t = x_t - \\sum_{\\tau=1}^{L} A B_\\tau x_{t-\\tau}$.\n5: Apply LiNGAM-GC algorithms to \u03bet and obtain the estimation of A0 and B\u03c4 (\u03c4 = 1,\u00b7\u00b7\u00b7 , L)\nand the corresponding contemporaneous and temporal causal orderings.\n\n5 Experiment\n\nWe conduct experiments using synthetic data and real world data to investigate the effectiveness of\nour proposed model and algorithms.\n\n5.1 Synthetic data\n\nWe generate the observations according to the following model:\n\n$$z_t = r \\odot u_t \\odot \\sigma_t,$$\n\nwhere r is an m \u00d7 1 scale vector of which the elements are randomly selected from the interval [1.0, 6.0]; ut > 0\nand \u03b7t = log ut follows a factor model:\n\n$$\\eta_t = D f_t,$$\n\nwhere D is m \u00d7 m and the elements of D are randomly selected from [0.2, 0.4]. fit are i.i.d. and\nfit \u223c N(0, 0.5). Denoting yt = log \u03c3t, we model the spatiotemporal variance dependencies of\nthe observations xt by an SVAR(1):\n\n$$y_t = A_0 y_t + B_1 y_{t-1} + \\epsilon_t,$$\n\nwhere A0 is an m \u00d7 m strictly lower triangular matrix of which the elements are randomly selected\nfrom [0.1, 0.2] or [\u22120.2, \u22120.1]; B1 is an m \u00d7 m matrix of which the diagonal elements [B1]ii are\nrandomly selected from [0.7, 0.8], 80% of the off-diagonal elements [B1]ij (i \u2260 j) are zero and the remaining\n20% are randomly selected from [\u22120.1, 0.1]; \u03f5it are i.i.d. 
super-Gaussian generated by\n$\\epsilon_{it} = \\mathrm{sign}(n_{it})|n_{it}|^2$ ($n_{it} \\sim \\mathcal{N}(0, 1)$) and normalized to unit variance. The generated observations\nare permuted to a random order. The task of this experiment is to investigate the performance of our\nalgorithms in estimating the coefficient matrix $(I - A_0)^{-1} B_1$ and also the contemporaneous causal\nordering induced by A0. We estimate the matrix $(I - A_0)^{-1} B_1$ using Lemma 3.1 or specifically\nEquations (7). We use different algorithms: LiNGAM-GC-UK proposed in [2], C-M proposed in\n[7] and LiNGAM [16] to estimate the contemporaneous causal structure. We investigate the performances\nof different algorithms in the scenarios of m = 4 with sample size from 500 to 4000 and\nm = 8 with sample size from 1000 to 10000. For each scenario, we conduct 100 independent\nrandom trials and discard those trials where the SVAR processes are not stable. We calculate the\naccuracies of LiNGAM-GC-UK, C-M and LiNGAM in finding (1) the whole causal ordering and (2) the exogenous\nvariable (root) of the causal network. We also calculate the sum square error Err of the estimated\ncausal strength matrix of different algorithms with respect to the true one. The average SNR, defined\nas $\\mathrm{SNR} = 10 \\log \\frac{\\sum_i \\mathrm{Var}(\\epsilon_i)}{\\sum_i \\mathrm{Var}(f_i)}$, is about 13.85 dB. The experimental results are shown in Figure 1 and\nTable 1. Figure 1 shows the plots of the estimated entries of $(I - A_0)^{-1} B_1$ versus the true ones when\nthe dimension of the observations m = 8. From Figure 1, we can see that the matrix $(I - A_0)^{-1} B_1$\nis estimated well enough when the sample size is only 1000. This confirms the correctness of our\ntheoretical analysis of the proposed model. From Table 1, we can see that when the dimension of\nthe observations is small (m = 4), all algorithms have acceptable performances. The performance\nof LiNGAM is the best when the sample size is small. 
This is because C-M and LiNGAM-GC-UK\nare cumulant-based methods which need a sufficiently large sample size. When the dimension of the\nobservations m increases to 8, we can see that the performances of C-M and LiNGAM degrade\ndramatically. In contrast, LiNGAM-GC-UK still successfully finds the exogenous variable (root) or even\nthe whole contemporaneous causal ordering among the variances of the observations if the sample\nsize is sufficiently large.\n\nFigure 1: Estimated entries of the causal strength matrix $(I - A_0)^{-1} B_1$ vs. the true ones (m = 8)\n\nFigure 2: Contemporaneous causal network of the selected stock indices\n\nTable 1: Accuracy of finding the causal ordering\n\nm | sample size | whole causal ordering: C-M | whole causal ordering: LiNGAM | whole causal ordering: LiNGAM-GC-UK | first variable found: C-M | first variable found: LiNGAM | first variable found: LiNGAM-GC-UK | Err: C-M | Err: LiNGAM | Err: LiNGAM-GC-UK\n4 | 500 | 37% | 70% | 28% | 61% | 85% | 60% | 0.1101 | 0.0326 | 0.0938\n4 | 1000 | 47% | 75% | 25% | 25% | 92% | 72% | 0.0865 | 0.024 | 0.0444\n4 | 2000 | 74% | 86% | 81% | 82% | 90% | 92% | 0.0679 | 0.02 | 0.0199\n4 | 3000 | 67% | 78% | 90% | 79% | 88% | 96% | 0.0716 | 0.0201 | 0.0126\n4 | 4000 | 63% | 83% | 90% | 81% | 92% | 94% | 0.0669 | 0.0193 | 0.0109\n8 | 1000 | 0% | 23.08% | 8.79% | 20.88% | 75.82% | 65.93% | 0.8516 | 0.2318 | 0.3017\n8 | 2000 | 1.14% | 26.14% | 25% | 25% | 70.45% | 75% | 0.7866 | 0.2082 | 0.1396\n8 | 4000 | 0% | 31.87% | 58.24% | 19.78% | 82.41% | 86.81% | 0.7537 | 0.1916 | 0.0634\n8 | 6000 | 0% | 25.29% | 83.91% | 25.29% | 75.86% | 96.55% | 0.7638 | 0.1843 | 0.0341\n8 | 8000 | 2.20% | 30.77% | 80.22% | 17.58% | 79.12% | 91.21% | 0.7735 | 0.1824 | 0.029\n8 | 10000 | 0% | 23.53% | 91.76% | 12.94% | 68.24% | 97.64% | 0.7794 | 0.194 | 0.0199\n\nThis is mainly due to the fact that when the dimension increases,\nthe confounding effects of $D f_t - (I - A_0)^{-1} B_1 D f_{t-1}$ become more problematic such that the performances\nof C-M and LiNGAM are strongly affected by the confounding effects. Table 1 also shows\nthe estimation accuracies of the compared methods. 
Among them, LiNGAM-GC-UK significantly\noutperforms other methods given a sufficiently large sample size.\nIn order to investigate the robustness of our methods against violations of the Gaussian assumption on the external\nfactors ft, we conduct the following experiment. The experimental setting is the same as\nthat in the above experiment but here the external factors ft are non-Gaussian, and more specifically\n$f_{it} = \\mathrm{sign}(n_{it})|n_{it}|^p$, where $n_{it} \\sim \\mathcal{N}(0, 0.5)$. When p > 1, the factor is super-Gaussian\nand when p < 1 the factor is sub-Gaussian. We investigate the performances of LiNGAM-GC-UK,\nLiNGAM and C-M in finding the whole causal ordering in different scenarios where\np = {0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6} with a sample size of 6000. The results in Figure 3 show that\nLiNGAM-GC-UK achieved satisfactory results compared to LiNGAM and C-M. This suggests that although\nLiNGAM-GC is developed based on the assumption that the latent confounders are Gaussian\ndistributed, it is still robust in the scenarios where the latent confounders are mildly non-Gaussian\nwith mild causal strengths.\n\nFigure 3: Robustness against Gaussianity of ft\n\n5.2 Real world data\n\nIn this section, we use our new model to discover the causal relations among five major world stock\nindices: (1) Dow Jones Industrial Average (DJI) (2) FTSE 100 (FTSE) (3) Nasdaq-100 (NDX) (4)\nCAC 40 (FCHI) (5) DAX (GDAXI), 
where DJI and NDX are stock indices in the US, and FTSE, FCHI\nand GDAXI are indices in Europe. Note that because of the time difference, we believe that the\ncausal relations among these stock indices are mainly acyclic, as we assumed in this paper. We\ncollect the adjusted close prices of these selected indices from May 2nd, 2006 to April 12th, 2012,\nand use linear interpolation to estimate the prices on those dates when the data are not available.\nWe apply our proposed model with SVAR(1) to model the spatiotemporal variance dependencies\nof the data. For the contemporaneous causal structure discovery, we use LiNGAM-GC-UK, C-M,\nLiNGAM\u00b2 and Direct-LiNGAM\u00b3 to estimate the causal ordering. The discovered causal orderings\nof different algorithms are shown in Table 2.\n\nTable 2: Contemporaneous causal ordering of the selected stock indices\n\nalgorithm | causal ordering\nLiNGAM-GC-UK | {2} \u2192 {4} \u2192 {5} \u2192 {1} \u2192 {3}\nC-M | {1} \u2192 {2} \u2192 {4} \u2192 {5}; {1} \u2192 {3}\nLiNGAM | {2} \u2192 {5} \u2192 {3} \u2192 {1}; {2} \u2192 {4}\nDirect-LiNGAM | {3} \u2192 {1} \u2192 {5} \u2192 {4} \u2192 {2}\n\nFrom Table 2, we see that in the causal ordering\ndiscovered by LiNGAM-GC-UK and LiNGAM, the stock indices in the US, i.e., DJI and NDX, are contemporaneously\naffected by the indices in Europe. Note that each stock index is given in local time.\nBecause of the time difference between Europe and America and the efficient market hypothesis\n(the market is quick to absorb new information and adjust stock prices relative to that), the contemporaneous\ncausal relations should be from Europe to America, if they exist. This is consistent with\nthe results our method and LiNGAM produced. Another interesting finding is that in the graphs\nobtained by LiNGAM-GC-UK and LiNGAM, we can see that FTSE is the root, which is consistent\nwith the fact that London is the financial centre of Europe and FTSE is regarded as Europe\u2019s most\nimportant index. 
However, in the results by C-M and DirectLiNGAM, we have the opposite direction,\ni.e., the stock indices in the US are contemporaneously the causes of the indices in Europe, which is difficult\nto interpret. The contemporaneous causal network of the stock indices is shown in Figure 2.\nFurther interpretation of the discovered causal strengths requires expert knowledge.\n\n6 Conclusion\n\nIn this paper, we investigate the causal discovery problem where causality takes place at the level\nof variances or energies instead of the observed variables themselves. We propose a causal scale-mixture\nmodel with spatiotemporal variance dependencies to describe this type of causal mechanism.\nWe show that the model is fully identifiable under the non-Gaussian assumption of the\ninnovations. In addition, we propose algorithms to estimate the parameters, especially the contemporaneous\ncausal structure of this model. Experimental results on synthetic data verify the practical\nusefulness of our model and the effectiveness of our algorithms. Results using real world data further\nsuggest that our new model can possibly explain the underlying interaction mechanism of major\nworld stock markets.\n\nAcknowledgments\n\nThe work described in this paper was partially supported by a grant from the Research Grants Council\nof the Hong Kong Special Administrative Region, China.\n\nThe code is available at: http://www.cs.helsinki.fi/group/neuroinf/lingam/\n\n\u00b2 LiNGAM converges to several local optima. We only show one of the discovered causal orderings here.\n\u00b3 http://www.ar.sanken.osaka-u.ac.jp/~inazumi/dlingam.html\n\nReferences\n[1] T. Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of econometrics,\n31(3):307\u2013327, 1986.\n[2] Z. Chen and L. Chan. Causal discovery for linear non-gaussian acyclic models in the presence of latent\ngaussian confounders. 
In Proceedings of the 10th international conference on Latent Variable Analysis\nand Signal Separation, pages 17\u201324. Springer-Verlag, 2012.\n\n[3] P. Comon. Independent component analysis, a new concept? Signal processing, 36(3):287\u2013314, 1994.\n\n[4] R. Henao and O. Winther. Sparse linear identifiable multivariate modeling. Journal of Machine Learning\nResearch, 12:863\u2013905, 2011.\n\n[5] J. Hirayama and A. Hyv\u00e4rinen. Structural equations and divisive normalization for energy-dependent\ncomponent analysis. Advances in Neural Information Processing Systems (NIPS2011), 24, 2012.\n\n[6] P.O. Hoyer, S. Shimizu, A.J. Kerminen, and M. Palviainen. Estimation of causal effects using linear\nnon-gaussian causal models with hidden variables. International Journal of Approximate Reasoning,\n49(2):362\u2013378, 2008.\n\n[7] A. Hyv\u00e4rinen. Pairwise measures of causal direction in linear non-gaussian acyclic models. In JMLR\nWorkshop and Conference Proceedings (Proc. 2nd Asian Conference on Machine Learning), ACML2010,\nvolume 13, pages 1\u201316, 2010.\n\n[8] A. Hyv\u00e4rinen, P. O. Hoyer, and M. Inki. Topographic independent component analysis. Neural Computation,\n13(7):1527\u20131558, 2001.\n\n[9] A. Hyv\u00e4rinen and J. Hurri. Blind separation of sources that have spatiotemporal variance dependencies.\nSignal Processing, 84(2):247\u2013254, 2004.\n\n[10] A. Hyv\u00e4rinen and E. Oja. Independent component analysis: algorithms and applications. Neural networks,\n13(4-5):411\u2013430, 2000.\n\n[11] A. Hyv\u00e4rinen, K. Zhang, S. Shimizu, and P. O. Hoyer. Estimation of a structural vector autoregression\nmodel using non-gaussianity. Journal of Machine Learning Research, 11:1709\u20131731, 2010.\n\n[12] D. Janzing, J. Mooij, K. Zhang, J. Lemeire, J. Zscheischler, P. Daniu\u0161is, B. Steudel, and B. Sch\u00f6lkopf.\nInformation-geometric approach to inferring causal directions. Artificial Intelligence, 2012.\n\n[13] Y. 
Kawahara, S. Shimizu, and T. Washio. Analyzing relationships among arma processes based on\nnon-gaussianity of external influences. Neurocomputing, 2011.\n\n[14] A. Moneta, D. Entner, P.O. Hoyer, and A. Coad. Causal inference by independent component analysis\nwith applications to micro- and macroeconomic data. Jena Economic Research Papers, 2010:031, 2010.\n\n[15] J. Pearl. Causality: models, reasoning, and inference. Cambridge Univ Pr, 2000.\n\n[16] S. Shimizu, P.O. Hoyer, A. Hyv\u00e4rinen, and A. Kerminen. A linear non-gaussian acyclic model for causal\ndiscovery. Journal of Machine Learning Research, 7:2003\u20132030, 2006.\n\n[17] S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyv\u00e4rinen, Y. Kawahara, T. Washio, P.O. Hoyer, and K. Bollen.\nDirectlingam: A direct method for learning a linear non-gaussian structural equation model. Journal of\nMachine Learning Research, 12:1225\u20131248, 2011.\n\n[18] Y. Sogawa, S. Shimizu, T. Shimamura, A. Hyv\u00e4rinen, T. Washio, and S. Imoto. Estimating exogenous\nvariables in data with more variables than observations. Neural Networks, 2011.\n\n[19] P. Spirtes, C.N. Glymour, and R. Scheines. Causation, prediction, and search. The MIT Press, 2000.\n\n[20] K. Zhang and L. Chan. Efficient factor garch models and factor-dcc models. Quantitative Finance,\n9(1):71\u201391, 2009.\n\n[21] K. Zhang and L.W. Chan. Extensions of ica for causality discovery in the hong kong stock market.\nIn Proc. of the 13th international conference on Neural information processing-Volume Part III, pages\n400\u2013409. Springer-Verlag, 2006.\n\n[22] K. Zhang and A. Hyv\u00e4rinen. On the identifiability of the post-nonlinear causal model. In Proceedings of\nthe Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 647\u2013655, 2009.\n\n[23] K. Zhang and A. Hyv\u00e4rinen. 
Source separation and higher-order causal analysis of meg and eeg. In\nProceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 709\u2013716,\n2010.\n\n[24] K. Zhang and A. Hyv\u00e4rinen. A general linear non-gaussian state-space model: Identifiability, identification,\nand applications. In Proceedings of Asian Conference on Machine Learning, JMLR W&CP, pages\n113\u2013128, 2011.", "award": [], "sourceid": 835, "authors": [{"given_name": "Zhitang", "family_name": "Chen", "institution": null}, {"given_name": "Kun", "family_name": "Zhang", "institution": null}, {"given_name": "Laiwan", "family_name": "Chan", "institution": null}]}