{"title": "Causal Inference on Time Series using Restricted Structural Equation Models", "book": "Advances in Neural Information Processing Systems", "page_first": 154, "page_last": 162, "abstract": "Causal inference uses observational data to infer the causal structure of the data generating system. We study a class of restricted Structural Equation Models for time series that we call Time Series Models with Independent Noise (TiMINo). These models require independent residual time series, whereas traditional methods like Granger causality exploit the variance of residuals. This work contains two main contributions: (1) Theoretical: By restricting the model class (e.g. to additive noise) we provide more general identifiability results than existing ones. The results cover lagged and instantaneous effects that can be nonlinear and unfaithful, and non-instantaneous feedbacks between the time series. (2) Practical: If there are no feedback loops between time series, we propose an algorithm based on non-linear independence tests of time series. When the data are causally insufficient, or the data generating process does not satisfy the model assumptions, this algorithm may still give partial results, but mostly avoids incorrect answers. The Structural Equation Model point of view allows us to extend both the theoretical and the algorithmic part to situations in which the time series have been measured with different time delays (as may happen for fMRI data, for example). TiMINo outperforms existing methods on artificial and real data. 
Code is provided.", "full_text": "Causal Inference on Time Series using Restricted Structural Equation Models\n\nJonas Peters\u2217\nSeminar for Statistics\nETH Z\u00fcrich, Switzerland\npeters@math.ethz.ch\n\nDominik Janzing\nMPI for Intelligent Systems\nT\u00fcbingen, Germany\njanzing@tuebingen.mpg.de\n\nBernhard Sch\u00f6lkopf\nMPI for Intelligent Systems\nT\u00fcbingen, Germany\nbs@tuebingen.mpg.de\n\nAbstract\n\nCausal inference uses observational data to infer the causal structure of the data generating system. We study a class of restricted Structural Equation Models for time series that we call Time Series Models with Independent Noise (TiMINo). These models require independent residual time series, whereas traditional methods like Granger causality exploit the variance of residuals. This work contains two main contributions: (1) Theoretical: By restricting the model class (e.g. to additive noise) we provide general identifiability results. They cover lagged and instantaneous effects that can be nonlinear and unfaithful, and non-instantaneous feedbacks between the time series. (2) Practical: If there are no feedback loops between time series, we propose an algorithm based on non-linear independence tests of time series. We show empirically that when the data are causally insufficient or the model is misspecified, the method avoids incorrect answers. We extend the theoretical and the algorithmic part to situations in which the time series have been measured with different time delays. TiMINo is applied to artificial and real data and code is provided.\n\n1 Introduction\n\nWe first introduce the problem of causal inference on iid data, that is, in the case with no time structure. Let therefore X^i, i \u2208 V, be a set of random variables and let G be a directed acyclic graph (DAG) on V describing the causal relationships between the variables. 
Given iid samples from P_{(X^i)_{i \u2208 V}}, we aim at estimating the underlying causal structure of the variables X^i, i \u2208 V. Constraint- or independence-based methods [e.g. Spirtes et al., 2000] assume that the joint distribution is Markov, and faithful with respect to G. The PC algorithm, for example, exploits conditional independences for reconstructing the Markov equivalence class of G (some edges remain undirected). We say P_{(X^i)_{i \u2208 V}} satisfies a Structural Equation Model [Pearl, 2009] w.r.t. DAG G if for all i \u2208 V we can write X^i = f_i(PA^i, N^i), where PA^i are the parents of node i in G. Additionally, we require (N^i)_{i \u2208 V} to be jointly independent. By restricting the function class one can identify the bivariate case: Shimizu et al. [2006] show that if P_{(X,Y)} allows for Y = a \u00b7 X + N_Y with N_Y \u22a5\u22a5 X then P_{(X,Y)} only allows for X = b \u00b7 Y + N_X with N_X \u22a5\u22a5 Y if (X, N_Y) are jointly Gaussian (\u22a5\u22a5 stands for statistical independence). This idea has led to the extensions of nonlinear additive functions f(x, n) = g(x) + n [Hoyer et al., 2009]. Peters et al. [2011b] show how identifiability for two variables generalizes to the multivariate case.\n\n\u2217Significant parts of this research were done when Jonas Peters was at the MPI T\u00fcbingen.\n\nWe now turn to the case of time series data. For each i from a finite V, let therefore (X^i_t)_{t \u2208 N} be a time series. X_t denotes the vector of time series values at time t. We call the infinite graph that contains each variable X^i_t as a node the full time graph. The summary time graph contains all #V components of the time series as vertices and an arrow between X^i and X^j, i \u2260 j, if there is an arrow from X^i_{t\u2212k} to X^j_t in the full time graph for some k. We are given a sample (X_1, . . . , X_T) of a multivariate time series and estimate the true summary time graph. I.i.d. methods are not directly applicable because a common history might introduce complicated dependencies between contemporaneous data X_t and Y_t. Nevertheless several methods dealing with time series data are motivated by the iid setting (Section 2). Many of them encounter similar problems: when the model assumptions are violated (e.g. in the presence of a confounder) the methods draw false causal conclusions. Furthermore, they do not include nonlinear instantaneous effects. In this work, we extend the Structural Equation Model framework to time series data and call this approach time series models with independent noise (TiMINo). These models include nonlinear and instantaneous effects. They assume X_t to be a function of all direct causes and some noise variable, the collection of which is supposed to be jointly independent. This model formulation comes with substantial benefits: In Section 3 we prove that for TiMINo models the full causal structure can be recovered from the distribution. Section 4 introduces an algorithm (TiMINo causality) that recovers the model structure from a finite sample. If the data do not satisfy the model assumptions, TiMINo causality remains mostly undecided instead of drawing wrong causal conclusions. Section 5 deals with time series that have been shifted by different (unknown) time delays. 
Experiments on simulated and real data sets are shown in Section 6. TiMINo causality can be equipped with any algorithm for fitting time series.\n\n2 Existing methods\n\nGranger causality [Granger, 1969] (G-causality for the remainder of the article) is based on the following idea: X^i does not Granger cause X^j if including the past of X^i does not help in predicting X^j_t given the past of all other time series X^k, k \u2260 i. In principle, \u201call other\u201d means all other information in the world. In practice, one is limited to X^k, k \u2208 V. The phrase \u201cdoes not help\u201d is translated into a significance test assuming a multivariate time series model. If the data follow the assumed model, e.g. the VAR model below, G-causality is sometimes interpreted as testing whether X^i_{t\u2212h}, h > 0 is independent of X^j_t given X^k_{t\u2212h}, k \u2208 V \\ {i}, h > 0 [see Florens and Mouchart, 1982, Eichler, 2011, Chu and Glymour, 2008, Quinn et al., 2011, and ANLTSM below].\nLinear G-causality considers a VAR model: X_t = \u03a3_{\u03c4=1}^{p} A(\u03c4) X_{t\u2212\u03c4} + N_t, where X_t and N_t are vectors and A(\u03c4) are matrices. For checking whether X^i G-causes X^j one fits a full VAR model M_full to X_t and a VAR model M_restr to X_t that predicts X^j_t without using X^i (using the constraints A_{\u00b7i}(\u03c4) = 0 for all 1 \u2264 \u03c4 \u2264 p). One tests whether the reduction of the residual sum of squares (RSS) of X^j_t is significant by using the following test statistic: T := [(RSS_restr \u2212 RSS_full) / (p_full \u2212 p_restr)] / [RSS_full / (N \u2212 p_full)], where p_full and p_restr are the number of parameters in the respective models. For the significance test we use T \u223c F_{p_full \u2212 p_restr, N \u2212 p_full}. G-causality has been extended to nonlinear G-causality [e.g. Chen et al., 2004, Ancona et al., 2004]. In this paper we focus on an extension for the bivariate case proposed by Bell et al. [1996]. It is based on generalized additive models (gams) [Hastie and Tibshirani, 1990]: X^i_t = \u03a3_{\u03c4=1}^{p} \u03a3_{j=1}^{n} f_{i,j,\u03c4}(X^j_{t\u2212\u03c4}) + N^i_t, where N_t is a #V dimensional noise vector. Bell et al. [1996] utilize the same F statistic as above using estimated degrees of freedom.\nFollowing Bell et al. [1996], Chu and Glymour [2008] introduce additive nonlinear time series models (ANLTSM for short) for performing relaxed conditional independence tests: If including one variable, e.g. X^1_{t\u22121}, into a model for X^2_t that already includes X^2_{t\u22122}, X^2_{t\u22121}, and X^1_{t\u22122} does not improve the predictability of X^2_t, then X^1_{t\u22121} is said to be independent of X^2_t given X^2_{t\u22122}, X^2_{t\u22121}, X^1_{t\u22122} (if the maximal time lag is 2). Chu and Glymour [2008] propose a method based on constraint-based methods like FCI [Spirtes et al., 2000] in order to infer the causal structure exploiting those conditional independence statements. The instantaneous effects are assumed to be linear and the confounders linear and instantaneous.\nTS-LiNGAM [Hyv\u00e4rinen et al., 2008] is based on LiNGAM [Shimizu et al., 2006] from the iid setting. It allows for instantaneous effects and assumes all relationships to be linear.\nThese approaches encounter some methodological problems. Instantaneous effects: G-causality cannot deal with instantaneous effects. E.g., when X_t is causing Y_t, including any of the two time series helps for predicting the other and G-causality infers X \u2192 Y and Y \u2192 X. ANLTSM and TS-LiNGAM only allow for linear instantaneous effects. Theorem 1 shows that the summary time graph may still be identifiable when the instantaneous effects are linear and the variables are jointly Gaussian. TS-LiNGAM does not work in these situations. Confounders: G-causality might fail when there is a confounder between X_t and Y_{t+1}, say. The path between X_t and Y_{t+1} cannot be blocked by conditioning on any observed variables; G-causality infers X \u2192 Y. We will see empirically that TiMINo remains undecided instead; Entner and Hoyer [2010] and Janzing et al. [2009] provide (partial) results for the iid setting. ANLTSM does not allow for nonlinear confounders or confounders with time structure and TS-LiNGAM may fail, too (Exp. 1). Robustness: Theorem 1 (ii) shows that performing general conditional independence tests suffices. The conditioning sets, however, are too large and the tests are performed under a simple model (e.g. VAR). If the model is misspecified, one may draw wrong conclusions without noticing (e.g. Exp. 3).\nFor TiMINo (defined below), Lemma 1 shows that after fitting and checking the model by using unconditional independence tests, the difficult conditional independences have been checked implicitly. A model check is not new [e.g. Hoyer et al., 2009, Entner and Hoyer, 2010] but is thus an effective tool. We can equip bivariate G-causality with a test for cross-correlations; this is not straightforward for multivariate G-causality. Furthermore, using cross-correlation as an independence test does not always suffice (see Section 2).\n\n3 Structural Equation models for time series: TiMINo\n\nDefinition 1 Consider a time series X_t = (X^i_t)_{i \u2208 V} whose finite dimensional distributions are absolutely continuous w.r.t. a product measure (e.g. there is a pdf or pmf). The time series satisfies a TiMINo if there is a p > 0 and \u2200i \u2208 V there are sets PA^i_0 \u2286 X^{V \\ {i}}, PA^i_k \u2286 X^V, s.t. \u2200t\n\nX^i_t = f_i((PA^i_p)_{t\u2212p}, . . . , (PA^i_1)_{t\u22121}, (PA^i_0)_t, N^i_t),   (1)\n\nwith N^i_t jointly independent over i and t and for each i, N^i_t are identically distributed in t. The corresponding full time graph is obtained by drawing arrows from any node that appears in the right-hand side of (1) to X^i_t. We require the full time graph to be acyclic. Section 6 shows examples.\nTheorem 1 (i) assumes that (1) follows an identifiable functional model class (IFMOC). This means that (I) causal minimality holds, a weak form of faithfulness that assumes a statistical dependence between cause and effect given all other parents [Spirtes et al., 2000]. And (II), all f_i come from a function class that is small enough to make the bivariate case identifiable. Peters et al. [2011b] give a precise definition. Important examples include nonlinear functions with additive Gaussian noise and linear functions with additive non-Gaussian noise. Due to space constraints, proofs are provided in the appendix. In the one-dimensional linear case model (1) is time-reversible if and only if the noise is normally distributed [Peters et al., 2009].\n\nTheorem 1 Suppose that X_t can be represented as a TiMINo (1) with PA(X^i_t) = \u222a_{k=0}^{p} (PA^i_k)_{t\u2212k} being the direct causes of X^i_t and that one of the following holds:\n\n(i) Equations (1) come from an IFMOC (e.g. nonlinear functions f_i with additive Gaussian noise N^i_t or linear functions f_i with additive non-Gaussian noise N^i_t). The summary time graph can contain cycles.\n\n(ii) Each component exhibits a time structure (PA(X^i_t) contains at least one X^i_{t\u2212k}), the joint distribution is faithful w.r.t. the full time graph, and the summary time graph is acyclic.\n\nThen the full time graph can be recovered from the joint distribution of X_t. 
In particular, the true causal summary time graph is identifiable. (Neither of the conditions (i) and (ii) implies the other.)\nMany function classes satisfy (i) [Peters et al., 2013]. To estimate f_i from data (E[X^i_t | X_{t\u2212p}, . . . , X_{t\u22121}] for additive noise) we require stationarity and/or \u03b1 mixing, or geometric ergodicity [e.g. Chu and Glymour, 2008]. Condition (ii) shows how time structure simplifies the causal inference problem. For iid data the true graph is not identifiable in the linear Gaussian case; with time structure it is. We believe that condition (ii) is more difficult to verify in practice; faithfulness is not required for (i). In (ii), the acyclicity prevents the full time graph from being fully connected up to order p.\n\n4 A practical method: TiMINo causality\n\nThe algorithm for TiMINo causality is based on the theoretical finding in Theorem 1. It takes the time series data as input and outputs either a DAG that estimates the summary time graph or remains undecided. It tries to fit a TiMINo model to the data and outputs the corresponding graph. If no model with independent residuals is found, it outputs \u201cI do not know\u201d. This becomes intractable for a time series with many components; for time series without feedback loops, we adapt a method for additive noise models without time structure suggested by Mooij et al. [2009] that avoids enumerating all DAGs. Algorithm 1 shows the modified version. As reported by Mooij et al. [2009], the time complexity is O(d^2 \u00b7 f(n, d) \u00b7 t(n, d)), where d is the number of time series, n the sample size and f(n, d) and t(n, d) the complexity of the user-specific regression method and independence test, respectively. Peters et al. [2013] discuss the algorithm\u2019s correctness. 
We present our choices but do not claim their optimality; any other fitting method and independence test can be used, too.\n\nAlgorithm 1 TiMINo causality\n1: Input: Samples from a d-dimensional time series of length T: (X_1, . . . , X_T), maximal order p\n2: S := (1, . . . , d)\n3: repeat\n4:   for k in S do\n5:     Fit TiMINo for X^k_t using X^k_{t\u2212p}, . . . , X^k_{t\u22121}, X^i_{t\u2212p}, . . . , X^i_{t\u22121}, X^i_t for i \u2208 S \\ {k}\n6:     Test if residuals are indep. of X^i, i \u2208 S.\n7:   end for\n8:   Choose k\u2217 to be the k with the weakest dependence. (If there is no k with independence, break and output: \u201cI do not know - bad model fit\u201d).\n9:   S := S \\ {k\u2217}; pa(k\u2217) := S\n10: until length(S) = 1\n11: For all k remove all parents that are not required to obtain independent residuals.\n12: Output: (pa(1), . . . , pa(d))\n\nDepending on the assumed model class, TiMINo causality has to be provided with a fitting method. Here, we chose the R functions ar for VAR fitting (f_i(p_1, . . . , p_r, n) = a_{i,1} \u00b7 p_1 + . . . + a_{i,r} \u00b7 p_r + n), gam for generalized additive models (f_i(p_1, . . . , p_r, n) = f_{i,1}(p_1) + . . . + f_{i,r}(p_r) + n) [e.g. Bell et al., 1996] and gptk for GP regression (f_i(p_1, . . . , p_r, n) = f_i(p_1, . . . , p_r) + n). We call the methods TiMINo-linear, TiMINo-gam and TiMINo-GP, respectively. For the first two AIC determines the order of the process. All fitting methods are used in a \u201cstandard way\u201d. For gam we used the built-in nonparametric smoothing splines. For the GP we used zero mean, squared exponential covariance function and Gaussian Likelihood. The hyper-parameters are automatically chosen by marginal likelihood optimization. Code is available online.\nTo test for independence between a residual time series N^k_t and another time series X^i_t, i \u2208 S, we shift the latter time series up to the maximal order \u00b1p (but at least up to \u00b14); for each of those combinations we perform HSIC [Gretton et al., 2008], an independence test for iid data. One could also use a test based on cross-correlation that can be derived from Thm 11.2.3. in [Brockwell and Davis, 1991]. This is related to what is done in transfer function modeling [e.g. \u00a713.1 in Brockwell and Davis, 1991], which is restricted to two time series and linear functions. As opposed to the iid setting, testing for cross-correlation is often enough in order to reject a wrong model. Only Experiments 1 and 5 describe situations in which cross-correlations fail. To reduce the running time one can use cross-correlation to determine the graph structure and use HSIC as a final model check. For HSIC we used a Gaussian kernel; as in [Gretton et al., 2008], the bandwidth is chosen such that the median distance of the input data leads to an exponent of one. Testing for non-vanishing autocorrelations in the residuals is not included yet.\nIf the model assumptions only hold in some parts of the summary time graph, we can still try to discover parts of the causal structure. Our code package contains this option. We obtained positive results on simulated data but there is no corresponding identifiability statement.\nOur method has some potential weaknesses. It can happen that one is able to fit a model only in the wrong direction. This, however, requires an \u201cunnatural\u201d fine tuning of the functions [Janzing and Steudel, 2010] and is relevant only when there are time series without time structure or the data are non-faithful (see Theorem 1). 
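
A minimal Python sketch of the independence check described earlier in this section may help make it concrete. It is an illustration under stated assumptions, not the released code of this paper: the function names (gaussian_gram, hsic_statistic, max_shifted_hsic) are hypothetical, the statistic is the biased empirical HSIC, the bandwidth rule is one reading of the median heuristic (the kernel exponent equals minus one at the median pairwise distance), and the significance test, including the Bonferroni correction over shifts, is omitted.

```python
import numpy as np


def gaussian_gram(x):
    # Gram matrix of a Gaussian kernel. Assumed bandwidth rule: the
    # exponent equals -1 at the median pairwise squared distance.
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    med = np.median(sq[np.triu_indices_from(sq, k=1)])
    return np.exp(-sq / med)


def hsic_statistic(x, y):
    # Biased empirical HSIC: trace(K H L H) / n^2, H = centering matrix.
    n = len(x)
    h = np.eye(n) - np.ones((n, n)) / n
    k = gaussian_gram(x)
    l = gaussian_gram(y)
    return float(np.trace(k @ h @ l @ h)) / n ** 2


def max_shifted_hsic(resid, series, max_shift=4):
    # Largest HSIC value between the residuals and the candidate series
    # shifted by -max_shift, ..., max_shift time steps.
    values = []
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            # pairs resid[t] with series[t - s]
            a, b = resid[s:], series[:len(series) - s]
        else:
            # pairs resid[t] with series[t + |s|]
            a, b = resid[:s], series[-s:]
        values.append(hsic_statistic(a, b))
    return max(values)
```

For a residual series r and a candidate series x, max_shifted_hsic(r, x, max_shift=4) returns the largest statistic over all shifts. Turning that number into a decision still requires a null distribution, for example from permutations of one of the two series.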
The null hypothesis of the independence test represents independence, although the scientific discovery of a causal relationship should rather be the alternative hypothesis. This fact may lead to wrong causal conclusions (instead of \u201cI do not know\u201d) on small data sets. The effect is strengthened by the Bonferroni correction of the HSIC based independence test; one may require modifications for a high number of time series components. For large sample sizes, even smallest differences between the true data generating process and the model may lead to rejected independence tests [discussed by Peters et al., 2011a].\n\n5 TiMINo for Shifted Time Series\n\nIn some applications, we observe the components of the time series with varying time delay. Instead of X^i_t we are then working with \u02dcX^i_t = X^i_{t\u2212\u2113}, with 0 \u2264 \u2113 \u2264 k. E.g., in functional magnetic resonance imaging brain activity is measured through an increased blood flow in the corresponding area. It has been reported that these data often suffer from different time delays [e.g. Buxton et al., 1998, Smith et al., 2011]. Given the (shifted) measurements \u02dcX^i_t, we therefore have to cope with causal relationships that go backward in time. This is only resolved when going back to the unobserved true data X^i_t. Measures like Granger causality will fail in these situations. This does not necessarily have to be the case, however. The structure still remains identifiable even if we observe \u02dcX^i_t instead of X^i_t (the following theorem generalizes the second part of Theorem 1 and is proved accordingly)^1:\nTheorem 2 Assume condition (ii) from Theorem 1 with \u02dcX^i_t = X^i_{t\u2212\u2113}, where 0 \u2264 \u2113 \u2264 k are unknown time delays. Then, the full time graph of \u02dcX_t is identifiable from the joint distribution of \u02dcX_t. 
In particular, the summary time graphs of \u02dcX_t and X_t are identical and therefore identifiable.\nAs opposed to Theorem 1 we cannot identify the full time graph of X_t. It may not be possible, for example, to distinguish between a lag two effect from X^1 to X^2 and a corresponding lag one effect with a shifted time series X^2. The method for recovering the network structure stays almost the same as the one for non-shifted time series. Only line 5 of Algorithm 1 has to be updated: we additionally include X^i_{t+\u2113} for 0 \u2264 \u2113 \u2264 k for all i \u2208 S \\ {k}. While TiMINo exploits an asymmetry between cause and effect emerging from restricted structural equations, G-causality exploits the asymmetry of time. The latter asymmetry is broken when considering shifted time series.\n\n6 Experiments\n\n6.1 Artificial Data\n\nWe always included instantaneous effects, fitted models up to order p = 2 or p = 6 and set \u03b1 = 0.05.\nExperiment 1: Confounder with time lag. We simulate 100 data sets (length 1000) from Z_t = a \u00b7 Z_{t\u22121} + N_{Z,t}, X_t = 0.6 \u00b7 X_{t\u22121} + 0.5 \u00b7 Z_{t\u22121} + N_{X,t}, Y_t = 0.6 \u00b7 Y_{t\u22121} + 0.5 \u00b7 Z_{t\u22122} + N_{Y,t}, with a between 0 and 0.95 and N_{\u00b7,t} \u223c 0.4 \u00b7 N(0, 1)^3. Here, Z is a hidden common cause for X and Y. For all a, X_t contains information about Z_{t\u22121} and Y_{t+1} (see Figure 1); G-causality and TS-LiNGAM wrongly infer X \u2192 Y. For large a, Y_t contains additional information about X_{t+1}, which leads to the wrong arrow Y \u2192 X. TiMINo causality does not decide for any a. The nonlinear methods perform very similarly (not shown). For a = 0, a cross-correlation test is not enough to reject X \u2192 Y. Further, all methods fail for a = 0 and Gaussian noise. (Similar results for non-linear confounder.)\nExperiment 2: Linear, Gaussian with instantaneous effects. 
We sample 100 data sets (length 2000) from X_t = A_1 \u00b7 X_{t\u22121} + N_{X,t}, W_t = A_2 \u00b7 W_{t\u22121} + A_3 \u00b7 X_t + N_{W,t}, Y_t = A_4 \u00b7 Y_{t\u22121} + A_5 \u00b7 W_{t\u22121} + N_{Y,t}, Z_t = A_6 \u00b7 Z_{t\u22121} + A_7 \u00b7 W_t + A_8 \u00b7 Y_{t\u22121} + N_{Z,t} and N_{\u00b7,t} \u223c 0.4 \u00b7 N(0, 1) and A_i iid from U([\u22120.8, \u22120.2] \u222a [0.2, 0.8]). We regard the graph containing X \u2192 W \u2192 Y \u2192 Z and W \u2192 Z as correct. TS-LiNGAM and G-causality are not able to recover the true structure (see Table 1). We obtain similar results for non-linear instantaneous interactions.\nExperiment 3: Nonlinear, non-Gaussian without instantaneous effects. We simulate 100 data sets (length 500) from X_t = 0.8 X_{t\u22121} + 0.3 N_{X,t}, Y_t = 0.4 Y_{t\u22121} + (X_{t\u22121} \u2212 1)^2 + 0.3 N_{Y,t}, Z_t = 0.4 Z_{t\u22121} + 0.5 cos(Y_{t\u22121}) + sin(Y_{t\u22121}) + 0.3 N_{Z,t}, with N_{\u00b7,t} \u223c U([\u22120.5, 0.5]) (similar results for other noise distributions, e.g. exponential). Thus, X \u2192 Y \u2192 Z is the ground truth. Nonlinear G-causality fails since the implementation is only pairwise and it thus always infers an effect from X to Z. Linear G-causality cannot remove the nonlinear effect from X_{t\u22122} to Z_t by using Y_{t\u22121}. Also TiMINo-linear assumes a wrong model but does not make any decision. TiMINo-gam and TiMINo-GP work well on this data set (Table 2). This specific choice of parameters shows that a significant difference in performance is possible. For other parameters (e.g. less impact of the nonlinearity), G-causality and TS-LiNGAM still assume a wrong model but make fewer mistakes.\n\n^1 We believe that a corresponding statement for condition (i) holds, too.\n\nFigure 1: Exp.1: Part of the causal full time graph with hidden common cause Z (top left). TiMINo causality does not decide (top right), whereas G-causality and TS-LiNGAM wrongly infer causal connections between X and Y (bottom).\n\nTable 1: Exp.2: Gaussian data and linear instantaneous effects: only TiMINo mostly discovers the correct DAG.\nDAG | G-causal. linear | TiMINo linear | TS-LiNGAM\ncorrect | 13% | 83% | 19%\nwrong | 87% | 7% | 81%\nno dec. | 0% | 10% | 0%\n\nFigure 2: Exp.4: TiMINo-GP (blue) works reliably for long time series. TiMINo-linear (red) and TiMINo-gam (black) mostly remain undecided.\n\nTable 2: Exp.3: Since the data are nonlinear, linear G-causality and TS-LiNGAM give wrong answers, TiMINo-lin does not decide. Nonlinear G-causality fails because it analyzes the causal structure between pairs of time series.\nDAG | Granger_lin | Granger_nonlin | TiMINo_lin | TiMINo_gam | TiMINo_GP | TS-LiNGAM\ncorrect | 69% | 0% | 0% | 95% | 94% | 12%\nwrong | 31% | 100% | 0% | 1% | 1% | 88%\nno dec. | 0% | 0% | 100% | 4% | 5% | 0%\n\nExperiment 4: Non-additive interaction. We simulate 100 data sets with different lengths from X_t = 0.2 \u00b7 X_{t\u22121} + 0.9 N_{X,t}, Y_t = \u22120.5 + exp(\u2212(X_{t\u22121} + X_{t\u22122})^2) + 0.1 N_{Y,t}, with N_{\u00b7,t} \u223c N(0, 1). Figure 2 shows that TiMINo-linear and TiMINo-gam remain mainly undecided, whereas TiMINo-GP performs well. For small sample sizes, one observes two effects: GP regression does not obtain accurate estimates for the residuals, these estimates are not independent and thus TiMINo-GP remains more often undecided. Also, TiMINo-gam gives more correct answers than one would expect due to more type II errors. Linear G-causality and TS-LiNGAM give more than 90% incorrect answers, but non-linear G-causality is most often correct (not shown). 
Bad model assumptions do not always lead to incorrect causal conclusions.\nExperiment 5: Non-linear Dependence of Residuals. In Experiment 1, TiMINo equipped with a cross-correlation test inferred a causal edge, although there was none. The opposite is also possible: X_t = \u22120.5 \u00b7 X_{t\u22121} + N_{X,t}, Y_t = \u22120.5 \u00b7 Y_{t\u22121} + (X_{t\u22121})^2 + N_{Y,t} and N_{\u00b7,t} \u223c 0.4 \u00b7 N(0, 1) (length 1000). TiMINo-gam with cross-correlation infers no causal link between X and Y, whereas TiMINo-gam with HSIC correctly identifies X \u2192 Y.\n\n[Figure 1, plot panels: TS-LiNGAM, TiMINo (linear), G-caus. (linear); x-axis: confounder parameter a, 0 to 0.9; y-axis: 0.0 to 0.8; outcomes: none, Y \u2192 X, both, X \u2192 Y (TS-LiNGAM and G-caus.); no decision, none, Y \u2192 X, X \u2192 Y (TiMINo).]\n[Figure 2, plot: x-axis: length of time series, 100 to 900; y-axis: prop. of correct (\u2212) and incorrect (\u2212.) answers, 0 to 1.]\n\nExperiment 6: Shifted Time Series. We simulate 100 random DAGs with #V = 3 nodes by choosing a random ordering of the nodes and including edges with a probability of 0.6. The structural equations are additive (gam). Each component is of the form f(x) = a \u00b7 max(x, \u22120.1) + b \u00b7 sign(x) \u00b7 \u221a|x|, with a, b iid from U([\u22120.5, \u22120.2] \u222a [0.2, 0.5]). We sample time series (length 1000) from Gaussian noise and observe the sink node time series with a time delay of three. This makes all traditional methods inapplicable. The performance of linear G-causality, for example, drops from an average Structural Hamming Distance (SHD) of 0.38 without time delay to 1.73 with time delay. TiMINo-gam causality recognizes the wrong model assumption. The SHD increases from 0.13 (17 undecided cases) to 0.71 (79 undecided cases). Adjusting for a time delay (Section 5) yields an SHD of 0.70 but many more decisions (18 undecided cases). 
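
The Structural Hamming Distance reported for Experiment 6 is not defined in the text. A minimal Python sketch under one common convention (each pair of nodes whose edge status differs counts once, so a reversed edge costs one) could look as follows. The function name and the adjacency-matrix encoding are assumptions made for illustration, and the evaluation in this paper may use a different convention.

```python
def structural_hamming_distance(est, true):
    # est, true: square 0/1 adjacency matrices given as nested lists,
    # where entry [i][j] == 1 encodes an edge i -> j.
    # Assumed convention: a missing, extra or reversed edge between a
    # pair of nodes adds one to the distance.
    d = len(true)
    dist = 0
    for i in range(d):
        for j in range(i + 1, d):
            if (est[i][j], est[j][i]) != (true[i][j], true[j][i]):
                dist += 1
    return dist
```

With nodes ordered (X, Y, Z) and the true graph X -> Y -> Z, an estimate that reverses Y -> Z has distance one, and the empty graph has distance two.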
Although it is possible to adjust for\ntime delays, the procedure enlarges the model space, which makes rejecting all wrong models more\ndif\ufb01cult. Already #V = 5 leads to worse average SHD: G-causality: 4.5, TiMINo-gam: 1.5 (92\nundecided cases) and TiMINo-gam with time delay adjustment: 2.4 (38 undecided cases).\n\n6.2 Real Data\n\nWe \ufb01tted up to order 6 and included instantaneous effects. For TiMINo, \u201ccorrect\u201d means that\nTiMINo-gam is correct and TiMINo-linear is correct or undecided. TiMINo-GP always remains\nundecided because there are too few data points to \ufb01t such a general model. Again, \u03b1 is set to 0.05.\nExperiment 7: Gas Furnace. [Box et al., 2008, length 296], Xt describes the input gas rate and\nYt the output CO2. We regard X \u2192 Y as being true. TS-LiNGAM, G-causality, TiMINo-lin\nand TiMINo-gam correctly infer X \u2192 Y . Disregarding time information leads to a wrong causal\nconclusion: The method described by Hoyer et al. [2009] leads to a p-value of 4.8% in the correct\nand 9.1% in the false direction.\nExperiment 8: Old Faithful. [Azzalini and Bowman, 1990, length 194] Xt contains the duration\nof an eruption and Yt the time interval to the next eruption of the Old Faithful geyser. We regard\nX \u2192 Y as the ground truth. Although the time intervals [t, t + 1] do not have the same length for all\nt, we model the data as two time series. TS-LiNGAM and TiMINo give correct answers, whereas\nlinear G-causality infers X \u2192 Y , and nonlinear G-causality infers Y \u2192 X.\nExperiment 9: Abalone (no time structure). The abalone data set [Asuncion and Newman, 2007]\ncontains (among others that lead to similar results) age Xt and diameter Yt of a certain shell \ufb01sh.\nIf we model 1000 randomly chosen samples as time series, G-causality (both linear and nonlinear)\ninfers no causal relation as expected. TS-LiNGAM wrongly infers Y \u2192 X, which is probably due\nto the nonlinear relationship. 
TiMINo gives the correct result.\nExperiment 10: Dairy (confounder). We consider 10 years of weekly prices for butter X_t and cheddar cheese Y_t (length 522, http://future.aae.wisc.edu/tab/prices.html). We expect their strong correlation to be due to the (hidden) milk price M_t: X \u2190 M \u2192 Y. TiMINo does not decide, whereas TS-LiNGAM and G-causality wrongly infer X \u2192 Y. This may be due to different time lags of the confounder (cheese has longer storing and maturing times than butter).\nExperiment 11: Temperature in House. We placed temperature sensors in six rooms (1 - Shed, 2 - Outside, 3 - Kitchen Boiler, 4 - Living Room, 5 - WC, 6 - Bathroom) of a house in the Black Forest and recorded the temperature on an hourly basis (length 7265). This house is not inhabited for most of the year and lacks central heating; the few electric radiators start if the temperature drops close to freezing. TiMINo does not decide since the model leads to dependent residuals. Although we do not provide any theory for the following steps, we analyze the model leading to the \u201cleast dependent\u201d residuals by setting the test level \u03b1 to zero. TiMINo causality then outputs a causal ordering of the variables. We applied TiMINo-lin and TiMINo-gam to the data sets using lags up to twelve (half a day) and report the proportion in which node i precedes node j (see matrix). This procedure reveals a sensible causal structure (we arbitrarily refer to entries larger than 0.5 as causation). 2 (outside) causes all other readings, and none of the other temperatures causes 2. 1 (shed) causes all other readings except for 2. This is wrong, but not surprising since the shed\u2019s temperature is rather close to the outside temperature. 4 (living room) does not cause any other reading, but every other reading does cause it (the living room is the only room without any heating). 
The links 5 \u2192 3 and 6 \u2192 3 appear spurious, and come with\nnumbers close to 0.5. These might be erroneous; however, they might also be due to the fact that\nsensor 3 is sitting on top of the kitchen boiler, which acts as a heat reservoir that delays temperature\nchanges. The link 6 \u2192 5 comes with a large number, but it is plausible since unlike the WC, the\nbathroom has a window and is thus affected directly by outside temperature, causing fast regulation\nof its radiator, which is placed on a thin wooden wall facing the WC.\n\nProportion in which node i (row) precedes node j (column):\n\n0     0.25  0.83  1     1     1\n0.75  0     0.83  1     1     1\n0.17  0.17  0     0.75  0.33  0.33\n0     0     0.25  0     0     0\n0     0     0.67  1     0     0\n0     0     0.67  1     1     0\n\nThe phase slope index [Nolte et al., 2008] performed well in Exp. 7; in all other experiments it either\ngave wrong results or did not decide. Due to space constraints we omit details about this method.\nWe did not \ufb01nd any code for ANLTSM.\n\n7 Conclusions and Future Work\n\nThis paper shows how causal inference on time series bene\ufb01ts from the framework of Structural\nEquation Models. The identi\ufb01ability statement is more general than existing results. The algorithm\nis based on unconditional independence tests and is applicable to multivariate, linear, nonlinear\nand instantaneous interactions. It contains the option of remaining undecided. While methods like\nGranger causality are built on the asymmetry of time direction, TiMINo additionally takes into account\nidenti\ufb01ability emerging from restricted structural equation models. This leads to a straightforward\nway of dealing with (unknown) time delays in the different time series. Although an extensive\nevaluation on real data sets is still required, we believe that our results emphasize the potential use\nof causal inference methods. 
They may provide guidance for future interventional experiments.\nAs future work one may use heteroscedastic models [Chen et al., 2012] and systematically preprocess\nthe data (removing trends, periodicities, etc.). This may reduce the number of cases where\nTiMINo causality is undecided. TiMINo causality evaluates a model \ufb01t by checking independence\nof the residuals. As suggested in Mooij et al. [2009] and Yamada and Sugiyama [2010], one may use\nthe independence of the residuals as a criterion for the \ufb01tting process or at least for order selection.\n\n8 Appendix\n\nLemma 1 (Markov Condition for TiMINo) If X_t = (X^i_t)_{i\u2208V} satisfy a TiMINo model, each\nvariable X^i_t is conditionally independent of each of its non-descendants given its parents.\nProof. With S := PA(X^i_t) = \u22c3_{k=0}^{p} (PA^i_k)_{t\u2212k} and Eq. (1) we get X^i_t|_{S=s} = f_i(s, N^i_t) for an s\nwith p(s) > 0. Any non-descendant of X^i_t is a function of all noise variables from its ancestors and\nis thus independent of X^i_t given S = s. This is the only time we assume t \u2208 N in this paper. \u25a1\nProof of Theorem 1 Suppose that X_t allows for two TiMINo representations that lead to different\nfull time graphs G and G'.\n(i) Assume that G and G' do not differ in the instantaneous effects:\nPA^i_0 (in G) = PA^i_0 (in G') \u2200i. Wlog, there is some k > 0 and an edge X^1_{t\u2212k} \u2192 X^2_t, say, that is in\nG but not in G'. From G' and Lemma 1 we have that X^1_{t\u2212k} \u22a5\u22a5 X^2_t | S, where S = ({X^i_{t\u2212l}, 1 \u2264 l \u2264\np, i \u2208 V} \u222a ND_t) \\ {X^1_{t\u2212k}, X^2_t}, and ND_t are all X^i_t that are non-descendants (wrt instantaneous\neffects) of X^2_t. Applied to G, causal minimality leads to a contradiction: X^1_{t\u2212k} \u22a5\u0338\u22a5 X^2_t | S. Now,\nlet G and G' differ in the instantaneous effects and choose S = {X^i_{t\u2212l}, 1 \u2264 l \u2264 p, i \u2208 V}. For\neach s and i we have: X^i_t|_{S=s} = f_i(s, (\u02dcPA^i_0)_t), where \u02dcPA^i_0 are all instantaneous parents of X^i_t\nconditioned on S = s. All X^i_t|_{S=s} with the instantaneous effects describe two different structures\nof an IFMOC. This contradicts the identi\ufb01ability results by Peters et al. [2011b]. (ii) Because of\nLemma 1 and faithfulness G and G' only differ in the instantaneous effects. But each instantaneous\narrow X^i_t \u2192 X^j_t forms a v-structure together with X^j_{t\u2212k} \u2192 X^j_t; X^j_{t\u2212k} cannot be connected with\nX^i_t since this introduces a cycle in the summary time graph. \u25a1\nProof of Theorem 2 Two full time graphs G and G' for \u02dcX_t can differ only in the directions of edges\nbetween time series. Assume X^i_t \u2192 X^j_{t+k} in G and X^i_t \u2190 X^j_{t+k} in G'. Choose the largest k\npossible. Then there is a v-structure X^i_{t\u2212\u2113} \u2192 X^i_t \u2190 X^j_{t+k} for some \u2113. A connection between X^i_{t\u2212\u2113}\nand X^j_{t+k} would lead to a pair as above with a larger k. \u25a1\n\nReferences\nN. Ancona, D. Marinazzo, and S. Stramaglia. Radial basis function approach to nonlinear Granger causality of\n\ntime series. Phys. Rev. E, 70(5):056221, 2004.\n\nA. Asuncion and D. J. Newman. UCI repository. http://archive.ics.uci.edu/ml/, 2007.\nA. Azzalini and A. W. Bowman. A look at some data on the Old Faithful Geyser. Applied Statistics, 39(3):\n\n357\u2013365, 1990.\n\nD. Bell, J. Kay, and J. Malley. A non-parametric approach to non-linear causality testing. Economics Letters,\n\n51(1):7\u201318, 1996.\n\nG. E. P. Box, G. M. Jenkins, and G. C. Reinsel. Time series analysis: forecasting and control. Wiley series in\n\nprobability and statistics. John Wiley, 2008.\n\nP. J. Brockwell and R. A. Davis. Time Series: Theory and Methods. Springer, 2nd edition, 1991.\nR. B. Buxton, E. C. Wong, and L. R. Frank. 
Dynamics of blood \ufb02ow and oxygenation changes during brain\n\nactivation: The balloon model. Magnetic Resonance in Medicine, 39(6):855\u2013864, 1998.\n\nY. Chen, G. Rangarajan, J. Feng, and M. Ding. Analyzing multiple nonlinear time series with extended Granger\n\ncausality. Physics Letters A, 324, 2004.\n\nZ. Chen, K. Zhang, and L. Chan. Causal discovery with scale-mixture model for spatiotemporal variance\n\ndependencies. In NIPS 25, 2012.\n\nT. Chu and C. Glymour. Search for additive nonlinear time series causal models. Journal of Machine Learning\n\nResearch, 9:967\u2013991, 2008.\n\nM. Eichler. Graphical modelling of multivariate time series. Probability Theory and Related Fields, 2011.\nD. Entner and P. Hoyer. Discovering unconfounded causal relationships using linear non-Gaussian models. In\n\nJSAI-isAI Workshops, 2010.\n\nJ. P. Florens and M. Mouchart. A note on noncausality. Econometrica, 50(3):583\u2013591, 1982.\nC. W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Economet-\n\nrica, 37(3):424\u201338, July 1969.\n\nA. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Sch\u00a8olkopf, and A. Smola. A kernel statistical test of indepen-\n\ndence. In NIPS 20, Canada, 2008.\n\nT. J. Hastie and R. J. Tibshirani. Generalized additive models. London: Chapman & Hall, 1990.\nP. Hoyer, D. Janzing, J. Mooij, J. Peters, and B. Sch\u00a8olkopf. Nonlinear causal discovery with additive noise\n\nmodels. In NIPS 21, Canada, 2009.\n\nA. Hyv\u00a8arinen, S. Shimizu, and P. Hoyer. Causal modelling combining instantaneous and lagged effects: an\n\nidenti\ufb01able model based on non-gaussianity. In ICML 25, 2008.\n\nD. Janzing and B. Steudel. Justifying additive-noise-model based causal discovery via algorithmic information\n\ntheory. Open Systems and Information Dynamics, 17:189\u2013212, 2010.\n\nD. Janzing, J. Peters, J.M. Mooij, and B. Sch\u00a8olkopf. Identifying confounders using additive noise models. 
In\n\nUAI 25, 2009.\n\nJ. Mooij, D. Janzing, J. Peters, and B. Sch\u00a8olkopf. Regression by dependence minimization and its application\n\nto causal inference. In ICML 26, 2009.\n\nG. Nolte, A. Ziehe, V. Nikulin, A. Schl\u00a8ogl, N. Kr\u00a8amer, T. Brismar, and K.-R. M\u00a8uller. Robustly Estimating the\n\nFlow Direction of Information in Complex Physical Systems. Physical Review Letters, 100, 2008.\n\nJ. Pearl. Causality: Models, reasoning, and inference. Cambridge Univ. Press, 2nd edition, 2009.\nJ. Peters, D. Janzing, A. Gretton, and B. Sch\u00a8olkopf. Detecting the dir. of causal time series. In ICML 26, 2009.\nJ. Peters, D. Janzing, and B. Sch\u00a8olkopf. Causal inference on discrete data using additive noise models. IEEE\n\nTrans. Pattern Analysis Machine Intelligence, 33(12):2436\u20132450, 2011a.\n\nJ. Peters, J. Mooij, D. Janzing, and B. Sch\u00a8olkopf. Identi\ufb01ability of causal graphs using functional models. In\n\nUAI 27, 2011b.\n\nJ. Peters, J. Mooij, D. Janzing, and B. Sch\u00a8olkopf. Causal discovery with continuous additive noise models,\n\n2013. arXiv:1309.6779.\n\nC. Quinn, T. Coleman, N. Kiyavash, and N. Hatsopoulos. Estimating the directed information to infer causal\nrelationships in ensemble neural spike train recordings. Journal of Comp. Neuroscience, 30(1):17\u201344, 2011.\nS. Shimizu, P. Hoyer, A. Hyv\u00a8arinen, and A. J. Kerminen. A linear non-Gaussian acyclic model for causal\n\ndiscovery. Journal of Machine Learning Research, 7:2003\u20132030, 2006.\n\nS. M. Smith, K. L. Miller, G. Salimi-Khorshidi, M. Webster, C. F. Beckmann, T. E. Nichols, J. D. Ramsey, and\n\nM. W. Woolrich. Network modelling methods for FMRI. NeuroImage, 54(2):875 \u2013 891, 2011.\n\nP. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2000.\nM. Yamada and M. Sugiyama. Dependence minimizing regression with model selection for non-linear causal\n\ninference under non-Gaussian noise. In AAAI. 
AAAI Press, 2010.\n", "award": [], "sourceid": 148, "authors": [{"given_name": "Jonas", "family_name": "Peters", "institution": "ETH Zurich"}, {"given_name": "Dominik", "family_name": "Janzing", "institution": "MPI T\u00fcbingen"}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": "MPI T\u00fcbingen"}]}