{"title": "Likelihood-Free Overcomplete ICA and Applications In Causal Discovery", "book": "Advances in Neural Information Processing Systems", "page_first": 6883, "page_last": 6893, "abstract": "Causal discovery has witnessed significant progress over the past decades. In particular, many recent causal discovery methods make use of independent, non-Gaussian noise to achieve identifiability of the causal models. The existence of hidden direct common causes, or confounders, generally makes causal discovery more difficult; whenever they are present, the corresponding causal discovery algorithms can be seen as extensions of overcomplete independent component analysis (OICA). However, existing OICA algorithms usually make strong parametric assumptions on the distribution of independent components, which may be violated on real data, leading to sub-optimal or even wrong solutions. In addition, existing OICA algorithms rely on the Expectation Maximization (EM) procedure, which requires computationally expensive inference of the posterior distribution of independent components. To tackle these problems, we present a Likelihood-Free Overcomplete ICA algorithm (LFOICA) that estimates the mixing matrix directly by back-propagation without any explicit assumptions on the density function of independent components. Thanks to its computational efficiency, the proposed method makes a number of causal discovery procedures much more practically feasible. 
For illustrative purposes, we demonstrate the computational efficiency and efficacy of our method in two causal discovery tasks on both synthetic and real data.", "full_text": "Likelihood-Free Overcomplete ICA and Applications in Causal Discovery\n\nChenwei Ding\nUBTECH Sydney AI Centre\nSchool of Computer Science, Faculty of Engineering\nUniversity of Sydney\ncdin2224@uni.sydney.edu.au\n\nMingming Gong\nSchool of Mathematics and Statistics\nUniversity of Melbourne\nmingming.gong@unimelb.edu.au\n\nKun Zhang\nDepartment of Philosophy\nCarnegie Mellon University\nkunz1@cmu.edu\n\nDacheng Tao\nUBTECH Sydney AI Centre\nSchool of Computer Science, Faculty of Engineering\nUniversity of Sydney\ndacheng.tao@uni.sydney.edu.au\n\nAbstract\n\nCausal discovery has witnessed significant progress over the past decades. In particular, many recent causal discovery methods make use of independent, non-Gaussian noise to achieve identifiability of the causal models. The existence of hidden direct common causes, or confounders, generally makes causal discovery more difficult; whenever they are present, the corresponding causal discovery algorithms can be seen as extensions of overcomplete independent component analysis (OICA). However, existing OICA algorithms usually make strong parametric assumptions on the distribution of independent components, which may be violated on real data, leading to sub-optimal or even wrong solutions. In addition, existing OICA algorithms rely on the Expectation Maximization (EM) procedure, which requires computationally expensive inference of the posterior distribution of independent components. To tackle these problems, we present a Likelihood-Free Overcomplete ICA algorithm (LFOICA1) that estimates the mixing matrix directly by back-propagation without any explicit assumptions on the density function of independent components. 
Thanks to its computational efficiency, the proposed method makes a number of causal discovery procedures much more practically feasible. For illustrative purposes, we demonstrate the computational efficiency and efficacy of our method in two causal discovery tasks on both synthetic and real data.\n\n1 Introduction\n\nDiscovering causal relations among variables has been an important problem in various fields such as medical science and the social sciences. Because conducting randomized controlled trials is usually expensive or infeasible, discovering causal relations from observational data, i.e., causal discovery [1, 2], has received much attention in the past decades. Classical causal discovery methods, such as PC [2] and GES [3], output multiple causal graphs in the Markov equivalence class. Since the seminal work [4], various methods have achieved complete identifiability of the causal structure by making use of constrained Functional Causal Models (FCMs), such as the linear non-Gaussian model [4], the nonlinear additive noise model [5], and the post-nonlinear model [6]. Some recent studies also consider the heterogeneous case [7, 8, 9, 10, 11].\n\nWhenever there are essentially unobservable direct common causes of two variables (known as confounders), causal discovery can be viewed as learning with hidden variables. With the linearity and non-Gaussian noise constraints, it has been shown that the causal model is identifiable even from data with measurement error [12] or missing common causes [13, 14, 15, 16, 17]. The corresponding causal discovery algorithms can be seen as extensions of overcomplete independent component analysis (OICA). \n\n1Code for LFOICA can be found here\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n
Unlike regular ICA [18], in which the mixing matrix is invertible and the change-of-variables technique yields the joint probability density function of the data as the product of the densities of the independent components (ICs) divided by a factor depending on the mixing matrix, OICA cannot use this technique, so the joint density, and hence the likelihood, is not directly available.\n\nTo perform maximum likelihood learning, existing OICA algorithms typically assume a parametric distribution for the hidden ICs. For example, if each IC is assumed to follow a Mixture of Gaussians (MoG) distribution, the likelihood for the observed data can be derived in closed form. However, the number of Gaussian mixture components increases exponentially with the number of ICs, which poses significant computational challenges. Many existing OICA algorithms rely on the Expectation-Maximization (EM) procedure combined with approximate inference techniques, such as Gibbs sampling [19] and mean-field approximation [20], which usually sacrifice estimation accuracy. Furthermore, the extended OICA algorithms for causal discovery are mostly noiseless OICA because they usually model all the noises as ICs [12, 15]. In order to apply EM, a Gaussian noise with very low variance is usually added to the noiseless OICA model, resulting in very slow convergence [21]. Finally, the parametric assumptions on the ICs might be restrictive for many real-world applications.\n\nTo tackle these problems, we propose a Likelihood-Free OICA (LFOICA) algorithm that makes no explicit assumptions on the density functions of the ICs. In light of recent work on adversarial learning [22], LFOICA utilizes neural networks to learn the distributions of the independent components implicitly. 
By minimizing an appropriate distributional distance between the data generated by the LFOICA model and the observed data, all parameters in LFOICA, including the mixing matrix and the noise-generating network parameters, can be estimated very efficiently via stochastic gradient descent (SGD) [23, 24], without the need to formulate the likelihood function.\n\nAlthough both our work and [25] use a GAN-style approach to solve ICA, the two methods differ substantially. First, the main purpose of [25] is to recover the ICs rather than how the ICs are mixed (i.e., the mixing matrix). It models the mixing and unmixing procedures implicitly with an encoder-decoder architecture; as a consequence of the nonlinearity, there is no guarantee of identifiability. In contrast, we concentrate on estimating the mixing matrix for causal discovery purposes. Second, the encoder-decoder architecture in [25] cannot be easily extended to OICA because the posterior of the ICs cannot be modeled by a deterministic encoder. Third, the adversarial training targets of LFOICA and [25] differ. While [25] aims at matching the joint distribution and the product of the marginal distributions of the recovered ICs (this is also how [25] makes the components independent), LFOICA is trained to match the distributions of the generated mixtures and the true mixtures. Moreover, the ICs estimated by LFOICA are naturally independent because they are generated from independent latent noises with separate networks.\n\nThe proposed LFOICA makes a number of causal discovery procedures much more practically feasible. For illustrative purposes, we extend our LFOICA method to tackle two causal discovery tasks: causal discovery from data with measurement noise [12] and causal discovery from low-resolution time series [15, 16]. 
Experimental results on both synthetic and real data demonstrate the efficacy and efficiency of our proposed method.\n\n2 Likelihood-Free Overcomplete ICA\n\n2.1 General Framework\n\nLinear ICA assumes the following data generation model:\n\nx = As,    (1)\n\nwhere x ∈ R^p, s ∈ R^d, and A ∈ R^(p×d) are known as the mixtures, the independent components (ICs), and the mixing matrix, respectively. The elements of s are assumed to be independent of each other, and each follows a non-Gaussian distribution (or at most one of them is Gaussian). The goal of ICA is to recover both A and s from the observed mixtures x. However, in the context of causal discovery, our main goal is to recover a constrained A matrix. When d > p, the problem is known as overcomplete ICA (OICA).\n\nIn light of recent advances in Generative Adversarial Nets (GANs) [22], we propose to learn the mixing matrix in the OICA model by designing a generator that allows us to draw samples easily. We model the distribution of each source s_i by a function model f_θi that transforms a Gaussian variable z_i into the non-Gaussian source. More specifically, the i-th source can be generated by ŝ_i = f_θi(z_i), where z_i ∼ N(0, 1). Thus, the whole generator that generates x can be written as\n\nx̂ = A[ŝ_1, . . . , ŝ_d]^⊤ = A[f_θ1(z_1), . . . , f_θd(z_d)]^⊤ = G_(A,θ)(z),    (2)\n\nwhere θ = [θ_1, . . . , θ_d]^⊤ and z = [z_1, . . . , z_d]^⊤. Figure 1 shows the graphical structure of our LFOICA generator G_(A,θ) with 4 sources and 3 mixtures. We use a multi-layer perceptron (MLP) to model each f_θi. While most previous algorithms for both the overdetermined [26, 25, 27, 28] and overcomplete [29] scenarios try to minimize the dependence among the recovered components, the components ŝ_i recovered by LFOICA are essentially independent because the noises z_i are independent, according to the generating process.\n\nThe LFOICA generator G_(A,θ) can be learned by minimizing the distributional distance between the data sampled from the generator and the observed x data. Various distributional distances have been applied in training generative networks, including the Jensen-Shannon divergence [22], the Wasserstein distance [30], and Maximum Mean Discrepancy (MMD) [31, 32]. Here we adopt MMD as the distributional distance because it does not require an explicit discriminator network, which simplifies the whole optimization procedure. Specifically, we learn the parameters θ and A in the generator by solving the following optimization problem:\n\nA*, θ* = arg min_(A,θ) M(P(x), P(G_(A,θ)(z))) = arg min_(A,θ) ‖E_(x∼p(x))[φ(x)] − E_(z∼p(z))[φ(G_(A,θ)(z))]‖²,    (3)\n\nwhere φ is the feature map of a kernel function k(·, ·). MMD can be calculated using the kernel trick, without the need for an explicit φ. By choosing a characteristic kernel, such as the Gaussian kernel, MMD is guaranteed to match the distributions [33]. In practice, we optimize an empirical estimator of (3) on minibatches by stochastic gradient descent (SGD). The entire procedure is shown in Algorithm 1.\n\nAlgorithm 1 Likelihood-Free Overcomplete ICA (LFOICA) Algorithm\n1: Get a minibatch of i.i.d. samples z from the Gaussian noise distribution.\n2: Generate mixtures using (2).\n3: Get a minibatch of samples from the distribution of observed mixtures p(x).\n4: Update A and θ by minimizing the empirical estimate of (3) on the minibatch.\n5: Repeat steps 1 to 4 until the maximum number of iterations is reached.\n\nFigure 1: Generator architecture of LFOICA. z_1, z_2, z_3, z_4 are i.i.d. Gaussian noise variables fed into 4 different multi-layer perceptrons, producing the independent non-Gaussian components ŝ_1, . . . , ŝ_4, which are mixed into the generated mixtures x̂_1, x̂_2, x̂_3.\n\nThe identifiability of the mixing matrix A in our model (x = G_(A,θ)(z) = A[f_θ1(z_1), . . . , f_θd(z_d)]^⊤) follows the identifiability results for OICA [34], which are summarized in the following theorem.\n\nTheorem 1 Let x = G_(A,θ)(z) and x′ = G_(A′,θ′)(z′) be two OICA models that specify distributions P(x) and P(x′), respectively. Under the non-Gaussian assumption on the f_i(z_i) (please refer to Theorems 1 & 3 in [34] for precise definitions), if MMD(P(x), P(x′)) = 0, then A′ = AP_pS_p, where P_p is a column permutation matrix and S_p is a scaling matrix.\n\nThe proof is almost the same as that of Theorem 3 in [34], except that in order to guarantee P(x) = P(x′), we use MMD = 0 while [34] uses maximum likelihood (KL divergence). Given the identifiability results, the estimated mixing matrix converges to a scaled and permuted version of the true mixing matrix, and so do the source distributions. 
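To make the training objective concrete, here is a minimal NumPy sketch of the empirical MMD² estimator in (3) with a Gaussian kernel, together with one generator forward pass in the style of Algorithm 1. This is our own illustrative code, not the authors' released implementation; the helper names (`gaussian_kernel`, `mmd2`, `generate_mixtures`) and the tiny one-hidden-unit MLPs are assumptions made for brevity.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all pairs of rows."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Unbiased empirical estimate of MMD^2 between sample sets X and Y."""
    n, m = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    # Drop the diagonal terms for the unbiased estimator.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

def generate_mixtures(A, thetas, z):
    """x_hat = A [f_theta1(z_1), ..., f_thetad(z_d)]^T, each f a tiny 1-hidden-unit MLP."""
    s_hat = np.stack([w * np.tanh(W * z[:, i] + b) + c
                      for i, (W, b, w, c) in enumerate(thetas)], axis=1)
    return s_hat @ A.T

rng = np.random.default_rng(0)
p, d, batch = 3, 4, 256
A = rng.normal(size=(p, d))
thetas = [tuple(rng.normal(size=4)) for _ in range(d)]
z = rng.normal(size=(batch, d))            # step 1 of Algorithm 1
x_gen = generate_mixtures(A, thetas, z)    # step 2 of Algorithm 1
x_obs = rng.normal(size=(batch, p))        # stand-in for a minibatch of observed mixtures
loss = mmd2(x_obs, x_gen)                  # empirical estimate of (3)
```

In a real implementation this loss would be minimized with respect to A and the MLP parameters by SGD (step 4), e.g. with an automatic-differentiation framework; the sketch only shows the loss computation itself.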
The parameters in our MLPs (i.e., θ) are not identifiable (θ ≠ θ′), but we do not need the identifiability of θ to perform certain tasks, such as the two causal discovery tasks studied in this paper.\n\n2.2 Practical Considerations\n\nWe consider two important issues when applying LFOICA to real applications.\n\nSparsity Based on the fact that the mixing matrix is sparse in many real systems, we add a LASSO regularizer [35] to (3), resulting in the loss function M(P(x), P(G_(A,θ)(z))) + λ Σ_i Σ_j |A_ij|. We use the stochastic proximal gradient method [36] to train our model. The proximal mapping for the LASSO regularizer corresponds to the soft-thresholding operator, applied element-wise:\n\nprox_γ(A) = S_λγ(A) = A − λγ if A > λγ;  0 if −λγ ≤ A ≤ λγ;  A + λγ if A < −λγ,\n\nwhere λ and γ are the regularization weight and the learning rate, respectively. The soft-thresholding operator is applied after each gradient descent step:\n\nA^(t) = prox_λγt(A^(t−1) − γ_t ∇M(A^(t−1))),    t = 1, 2, 3, . . . .\n\nInsufficient data When we have rather small datasets, it is beneficial to make certain "parametric" assumptions on the source distributions. Here we use a Mixture of Gaussians (MoG) distribution to model the non-Gaussian distribution of the independent components. Specifically, the distribution of the i-th IC is\n\np(ŝ_i) = Σ_(j=1..m) P(z_i = j) P(ŝ_i | z_i = j) = Σ_(j=1..m) w_(i,j) N(ŝ_i | µ_(i,j), σ²_(i,j)),    i = 1, 2, . . . , d,\n\nwhere m is the number of Gaussian components in the MoG and the w_ij are the mixture proportions satisfying Σ_(j=1..m) w_ij = 1. 
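The soft-thresholding update above is straightforward to implement; the following is a small sketch (function names are ours) of the element-wise operator and one stochastic proximal gradient step:

```python
import numpy as np

def soft_threshold(A, t):
    """Element-wise soft-thresholding S_t(A): shrink each entry toward zero by t,
    setting entries in [-t, t] exactly to zero (this is what produces sparsity)."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def proximal_sgd_step(A, grad, lr, lam):
    """One stochastic proximal gradient step for the LASSO-regularized loss:
    an SGD step on the smooth (MMD) part, then the prox of lam * ||A||_1."""
    return soft_threshold(A - lr * grad, lam * lr)
```

For example, `soft_threshold(np.array([3.0, -0.5, 1.0]), 1.0)` yields `[2.0, 0.0, 0.0]`: entries with magnitude below the threshold are zeroed out, which is why the mixing matrix estimate becomes sparse.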
If we do not wish to learn w_ij, we can first sample z_i from the categorical distribution P(z_i = j) = w_ij, and then use the reparameterization trick [37] to sample from P(ŝ_i | z_i) via ŝ_i = µ_(i,zi) + ε σ_(i,zi), where ε ∼ N(0, 1). In this way, the gradients can be backpropagated to µ_ij and σ_ij. Learning w_ij is harder because z_i is discrete and thus does not allow backpropagation to w_ij. To address this problem, we adopt the Gumbel-softmax trick [38, 39] to sample z_i. Specifically, we use the following softmax function to generate an (approximately) one-hot z̃_i:\n\nz̃_ij = exp((log(w_ij) + g_j)/τ) / Σ_(k=1..m) exp((log(w_ik) + g_k)/τ),    (4)\n\nwhere g_1, . . . , g_m are i.i.d. samples drawn from Gumbel(0, 1), and τ is the temperature parameter that controls how closely the softmax approximates the argmax. By leveraging these two tricks, we can sample ŝ_i from the generator as ŝ_i = u^⊤z̃_i + ε v^⊤z̃_i, where u = [µ_i1, . . . , µ_im]^⊤ and v = [σ_i1, . . . , σ_im]^⊤, which enables learning of all the parameters in the MoG model.\n\n3 Applications in Causal Discovery\n\n3.1 Causal Discovery under Measurement Error\n\nMeasurement error (e.g., noise caused by sensors) in the observed data can lead to wrong results from various causal discovery methods. Recently, it was proven that the causal structure is identifiable from data with measurement error, under the assumption of linear relations and non-Gaussian noise [12]. Based on the identifiability theory in [12], we propose a causal discovery algorithm by extending LFOICA with additional constraints.\n\nFollowing [12], we use the LiNGAM model [4] to represent the causal relations on the data without measurement error. 
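Returning for a moment to the Gumbel-softmax sampler of Section 2.2, here is a minimal sketch of (4) and the resulting reparameterized MoG source sample. This is illustrative code with our own names (`gumbel_softmax`, `sample_mog_source`); the paper's actual implementation may differ.

```python
import numpy as np

def gumbel_softmax(log_w, tau, rng):
    """Relaxed (approximately one-hot) sample z~ from mixture weights w, per (4)."""
    u = rng.uniform(1e-12, 1.0, size=log_w.shape)
    g = -np.log(-np.log(u))                  # Gumbel(0, 1) samples
    y = (log_w + g) / tau
    y = np.exp(y - y.max())                  # numerically stable softmax
    return y / y.sum()

def sample_mog_source(w, mu, sigma, tau, rng):
    """s_hat = u^T z~ + eps * v^T z~ : a reparameterized draw from one IC's MoG,
    so gradients can flow to the means mu, scales sigma, and weights w."""
    z_tilde = gumbel_softmax(np.log(w), tau, rng)
    eps = rng.normal()
    return float(z_tilde @ mu + eps * (z_tilde @ sigma))

rng = np.random.default_rng(0)
w = np.array([0.2, 0.5, 0.3])
z_tilde = gumbel_softmax(np.log(w), tau=0.1, rng=rng)
s = sample_mog_source(w, mu=np.array([-2.0, 0.0, 2.0]),
                      sigma=np.array([0.5, 1.0, 0.5]), tau=0.1, rng=rng)
```

As the temperature `tau` is lowered, the relaxed sample `z_tilde` concentrates on a single component, recovering the discrete categorical draw in the limit.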
More specifically, the causal model is X̃ = BX̃ + Ẽ, where X̃ is the vector of the variables without measurement error, Ẽ is the vector of independent non-Gaussian noise terms, and B is the corresponding causal adjacency matrix, in which B_ij is the coefficient of the direct causal influence from X̃_j to X̃_i and B_ii = 0 (no self-influence). In fact, X̃ is a linear transformation of the noise term Ẽ because the linear model can be rewritten as X̃ = (I − B)⁻¹Ẽ. Then, the model with measurement error E can be written as\n\nX = X̃ + E = (I − B)⁻¹Ẽ + E = [(I − B)⁻¹  I] [Ẽ^⊤ E^⊤]^⊤,    (5)\n\nwhere X is the vector of observable variables, and E is the vector of measurement error terms. Obviously, (5) is a special OICA model with [(I − B)⁻¹  I] as the mixing matrix. Therefore, we can readily extend our LFOICA algorithm to estimate the causal adjacency matrix B.\n\n3.2 Causal Discovery from Subsampled Time Series\n\nGranger causal analysis has been shown to be sensitive to the temporal frequency/resolution of time series. If the temporal frequency is lower than the underlying causal frequency, it is generally difficult to discover the high-frequency causal relations. Recently, it has been shown that the high-frequency causal relations are identifiable from subsampled low-frequency time series under the linearity and non-Gaussianity assumptions [15]. The corresponding model can also be viewed as an extension of OICA, and the model parameters are estimated in the (variational) Expectation Maximization framework [15]. However, with non-Gaussian ICs (e.g., the MoG used in [15]), the EM algorithm is generally intractable, while the variational EM algorithm loses accuracy. 
To make causal discovery from subsampled time series practically feasible, we further extend our LFOICA to discover causal relations from such data.\n\nFollowing [15], we assume that the data at the original causal frequency follow a first-order vector autoregressive process (VAR(1)):\n\nx_t = Cx_(t−1) + e_t,    (6)\n\nwhere x_t ∈ R^n is the high-frequency data and e_t ∈ R^n represents the independent non-Gaussian noise in the causal system. C ∈ R^(n×n) is the causal transition matrix at the true causal frequency, with C_ij representing the temporal causal influence from variable j to variable i. As done in [15], we consider the following subsampling scheme under which the low-frequency data are obtained: for every k consecutive data points, one is kept and the others are dropped. Then the observed data subsampled with factor k admit the following representation [15]:\n\nx̃_(t+1) = C^k x̃_t + Lẽ_(t+1),    (7)\n\nwhere x̃_t ∈ R^n is the observed data subsampled from x_t, L = [I, C, C², . . . , C^(k−1)], and ẽ_t = (e^⊤_(1+tk−0), e^⊤_(1+tk−1), . . . , e^⊤_(1+tk−(k−1)))^⊤ ∈ R^(nk) is a vector containing nk independent noise terms. We are interested in estimating the transition matrix C from the subsampled data. A graphical representation of the subsampled data is given in Figure 2(a). Apparently, (7) extends the OICA model by considering temporal relations between the observed x̃_t.\n\nTo apply our LFOICA to this problem, we propose to model the conditional distribution P(x̃_(t+1) | x̃_t) using the following model:\n\nx̂_(t+1) = G_(C,θ)(x̃_t, z_(t+1)) = C^k x̃_t + L[f_θ1(z_(t+1,1)), . . . , f_θnk(z_(t+1,nk))]^⊤,    (8)\n\nwhere x̂_(t+1) denotes the generated counterpart of x̃_(t+1). This model belongs to the broad class of conditional Generative Adversarial Nets (cGANs) [40]. We call this extension LFOICA-conditional. A graphical representation of (8) is shown in Figure 2(b). 
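To see why (7) holds, one can unroll (6) for k steps: x_(t+k) = C^k x_t + Σ_(j=0..k−1) C^j e_(t+k−j). The following NumPy snippet checks this identity numerically, building L = [I, C, C², . . . , C^(k−1)] exactly as defined above (the helper name `build_L` and the toy dimensions are ours):

```python
import numpy as np

def build_L(C, k):
    """L = [I, C, C^2, ..., C^{k-1}], an n x nk block matrix as in (7)."""
    return np.hstack([np.linalg.matrix_power(C, j) for j in range(k)])

rng = np.random.default_rng(0)
n, k = 3, 4
C = 0.5 * rng.normal(size=(n, n))
x0 = rng.normal(size=n)
noises = [rng.normal(size=n) for _ in range(k)]   # e_{t+1}, ..., e_{t+k}

# Simulate the VAR(1) process of (6) step by step for k steps.
x = x0
for e in noises:
    x = C @ x + e

# One-shot prediction via (7): x_{t+k} = C^k x_t + L e~,
# where e~ stacks the k noise vectors in reverse time order (most recent first),
# matching the index ordering 1+tk-0, 1+tk-1, ..., 1+tk-(k-1).
L = build_L(C, k)
e_tilde = np.concatenate(noises[::-1])
x_oneshot = np.linalg.matrix_power(C, k) @ x0 + L @ e_tilde
```

The two computations agree to machine precision, which is exactly the statement that the subsampled series mixes nk independent noise terms through the overcomplete matrix L.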
To learn the parameters in (8), we minimize the MMD between the joint distributions of the true and generated data:\n\nC*, θ* = arg min_(C,θ) M(P(x̃_t, x̃_(t+1)), P(x̃_t, G_(C,θ)(x̃_t, z_(t+1)))) = arg min_(C,θ) ‖E_((x̃_t,x̃_(t+1))∼p(x̃_t,x̃_(t+1)))[φ(x̃_t) ⊗ φ(x̃_(t+1))] − E_(x̃_t∼p(x̃_t), z_(t+1)∼p(z_(t+1)))[φ(x̃_t) ⊗ φ(G_(C,θ)(x̃_t, z_(t+1)))]‖²,    (9)\n\nwhere ⊗ denotes the tensor product. The empirical estimate of (9) can be obtained by randomly sampling (x̃_t, x̃_(t+1)) pairs from the true data and sampling from P(z_(t+1)). Again, we can use the minibatch SGD algorithm to learn the model parameters efficiently.\n\nFigure 2: (a) Subsampled data with subsampling factor k. (b) LFOICA-conditional model for subsampled data.\n\n4 Experiment\n\nIn this section, we conduct empirical studies on both synthetic and real data to show the effectiveness of our LFOICA algorithm and its extensions in solving causal discovery problems. We first compare the results obtained by LFOICA and several OICA algorithms on synthetic overcomplete mixture data. Then we apply the extensions of LFOICA described in Sections 3.1 and 3.2 to two causal discovery problems using both synthetic and real data.\n\n4.1 Recovering Mixing Matrix from Synthetic OICA Data\n\nWe compare LFOICA with several well-known OICA algorithms on synthetic OICA data. According to [34], the mixing matrix in OICA can be estimated only up to the permutation and scaling indeterminacies (including the sign indeterminacy) of its columns. However, these indeterminacies prevent us from directly comparing the mixing matrices estimated by different OICA algorithms. In order to make the comparison achievable, we need to eliminate these indeterminacies. 
To eliminate the permutation indeterminacy, we make the non-Gaussian distribution of each synthetic IC not only independent but also distinct. With a different distribution for each IC, it is convenient to permute the columns into the same order for all the algorithms according to the recovered distribution of each IC. We use Laplace distributions with a different variance for each IC. To eliminate the scaling indeterminacy, both the ground-truth and the estimated mixing matrices are normalized so that the L2 norm of the first column equals 1. With the permutation and scaling indeterminacies eliminated, we can conveniently compare the mixing matrices obtained by different algorithms. To further avoid local optima, the mixing matrix is initialized by its true value with added noise.\n\nTable 1 compares the mean square error (MSE) between the ground-truth mixing matrix used to generate the data and the mixing matrices estimated by different OICA algorithms.\n\nTable 1: MSE of the recovered mixing matrix by different methods on synthetic OICA data.\nMethods | p=2, d=4 | p=3, d=6 | p=4, d=8 | p=5, d=10\nRICA | 2.26e-2 | 1.54e-2 | 9.03e-3 | 7.54e-3\nMFICA_Gauss | 4.54e-2 | 2.45e-2 | 4.21e-2 | 3.18e-2\nMFICA_MoG | 2.38e-2 | 9.17e-3 | 2.43e-2 | 1.04e-2\nNG-EM | 1.82e-2 | 6.56e-3 | 1.21e-2 | 6.34e-3\nLFOICA | 4.61e-3 | 5.95e-3 | 6.96e-3 | 5.92e-3\n\nTable 2: MSE of the recovered causal adjacency matrix by LFOICA and NG-EM.\nMethods | MSE, n=5 | MSE, n=7 | MSE, n=50 | Time (s), n=5 | Time (s), n=7 | Time (s), n=50\nLFOICA | 1.04e-3 | 5.79e-3 | 1.81e-2 | 75.01 | 76.44 | 1219.34\nNG-EM | 6.98e-3 | 9.85e-3 | - | 1826.60 | 4032.54 | -\n\nIn Table 1, RICA represents reconstruction ICA [29], and MFICA_Gauss and MFICA_MoG represent mean-field ICA [20] with the prior distribution 
of the ICs set to a Gaussian and a mixture of Gaussians, respectively. NG-EM denotes the EM-based ICA [15]. p is the number of mixtures, and d is the number of ICs. For each algorithm, we conduct experiments in 4 settings ([p = 2, d = 4], [p = 3, d = 6], [p = 4, d = 8], and [p = 5, d = 10]). Each experiment is repeated 10 times with randomly generated data, and the results are averaged. As we can see, our LFOICA achieves the best (smallest) error compared with the other methods. We also compare the distributions of the components recovered by LFOICA with the ground truth; the results can be found in Section 2.2 of the Supplementary Material.\n\n4.2 Recovering Causal Relation from Causal Model with Measurement Error\n\nSynthetic Data We generate data with measurement error; details about the generating process can be found in Section 3.1 of the Supplementary Material. NG-EM [15] is a causal discovery algorithm that extends the EM-based OICA method. Table 2 compares the MSE between the ground-truth causal adjacency matrix and those estimated by NG-EM and our LFOICA. The synthetic data we used contain 5000 data points. We test 3 cases, where the number of variables n is 5, 7, and 50, respectively. Each experiment is repeated 10 times with randomly generated data, and the results are averaged. As we can see from the table, LFOICA performs better than NG-EM, with smaller estimation error. We also compare the time taken by the two methods for the same number of iterations. As can be seen, NG-EM is much more time-consuming than LFOICA (because EM needs to calculate the posterior). We found that when n > 7, NG-EM fails to obtain any results because it runs out of memory, while LFOICA can still obtain reasonable results; hence no results for NG-EM are given in the table for n = 50. 
These experiments show that, besides being effective, LFOICA is computationally much more efficient and uses less memory than NG-EM.\n\nReal Data We apply LFOICA to Sachs's data [41] with 11 variables. Sachs's data is a record of various cellular protein concentrations under a variety of exogenous chemical inputs and, inevitably, one can imagine that there is considerable measurement error in the data because of the measuring process. Here we visualize the causal diagram estimated by LFOICA and the ground truth in Figures 3(a) and 3(d). The causal adjacency matrix estimated by LFOICA can be found in Section 3.2 of the Supplementary Material. For comparison, we also visualize the causal diagram estimated by NG-EM and the corresponding ground truth in Figures 3(b) and 3(e). To demonstrate that a regular causal discovery algorithm cannot properly estimate the underlying causal relations under measurement error, we further compare with the result of a regular causal discovery algorithm, the Linear, Non-Gaussian (LiNG) model [42], in Figure 3(c). Unlike LiNGAM, LiNG allows feedback in the causal model. We calculate the precision and recall for the output of the three algorithms. The precisions are 51.22%, 48.94%, and 50.00% for LFOICA, NG-EM, and LiNG, and the recalls are 55.26%, 60.53%, and 23.68%, respectively. As we can see, LiNG fails to recover most of the causal directions, while LFOICA and NG-EM perform clearly better. This makes an important point: measurement error can lead to misleading results from regular causal discovery algorithms, while OICA-based algorithms such as LFOICA and NG-EM are able to produce better results. 
Although the performances of LFOICA and NG-EM are very close, it takes about 48 hours for NG-EM to obtain the result while LFOICA takes only 142.19 s, which further demonstrates the remarkable computational efficiency of LFOICA.\n\nFigure 3: (a)-(c) Causal diagrams estimated by LFOICA, NG-EM, and LiNG. (d)-(f) The three corresponding ground-truth causal diagrams (which are actually the same), with red arrows representing the causal directions missing from the output of the corresponding algorithm. The red arrows in (a)-(c) are falsely discovered causal directions compared with the ground truth. The blue arrows in (a)-(c) are edges with reversed causal directions compared with the ground truth.\n\n4.3 Recovering Causal Relation from Low-Resolution Time Series Data\n\nWe then consider the discovery of time-delayed causal relations at the original high frequency (represented by the VAR model) from their subsampled time series. We conduct experiments on both synthetic and real data.\n\nSynthetic Data Following [15], we generate synthetic time series data at the original causal frequency using the VAR(1) model described by (6). Details about how the data are generated can be found in Section 4.1 of the Supplementary Material. NG-EM and NG-MF were first proposed in [15] as extensions of OICA algorithms to discover causal relations from low-resolution data. Table 3 shows the MSE between the ground-truth transition matrix and those estimated by LFOICA-conditional, NG-EM, and NG-MF when the number of variables is n = 2. We conduct experiments with the subsampling factor set to k = 2, 3, 4, 5 and dataset sizes T = 100 and 300. Each experiment is repeated over 10 random replications and the results are averaged. As one can see from Table 3, LFOICA-conditional achieves results comparable to NG-EM and NG-MF [15]. 
NG-EM has better performance when the number of data points is small (T = 100), probably because the MMD distance measure used in LFOICA-conditional may be inaccurate with a small number of samples. When the number of data points is larger (T = 300), LFOICA-conditional obtains the best results. We also conduct an experiment with larger n (n = 5); the results can be found in Section 4.2 of the Supplementary Material. Again, LFOICA-conditional gives more accurate results and is computationally much more efficient.\n\nTable 3: MSE of the recovered transition matrix by different methods on synthetic subsampled data (n = 2).\nMethods | T=100, k=2 | T=100, k=3 | T=100, k=4 | T=100, k=5 | T=300, k=2 | T=300, k=3 | T=300, k=4 | T=300, k=5\nLFOICA-conditional | 7.25e-3 | 7.88e-3 | 8.45e-3 | 9.00e-3 | 1.12e-3 | 3.87e-3 | 4.07e-3 | 6.23e-3\nNG-EM | 6.50e-3 | 7.32e-3 | 1.02e-2 | 1.04e-2 | 7.24e-3 | 9.11e-3 | 9.54e-3 | 9.98e-3\nNG-MF | 9.09e-3 | 9.89e-3 | 1.24e-2 | 2.19e-2 | 8.46e-3 | 8.76e-3 | 1.01e-2 | 2.20e-2\n\nReal Data Here we use the Temperature Ozone Data [43], which correspond to the 49th, 50th, and 51st cause-effect pairs in the database. These three temperature-ozone pairs were taken at three different places in 2009. Each pair contains two variables, ozone and temperature, with the ground-truth causal direction temperature −→ ozone. To demonstrate the result when n = 2, we use the 50th pair as in [15]. The optimal subsampling factor k can be determined by cross-validation on the log-likelihood of the models; here we use k = 2 according to [15]. The estimated transition matrix is C = [0.9310, 0.1295; −0.0017, 0.9996] (the first variable is ozone and the second is temperature in the matrix), from which we can clearly identify the causal direction from temperature to ozone. We also conduct experiments when n = 6; the results can be found in Section 4.3 of the Supplementary Material.\n\n5 Conclusion\n\nIn this paper, we proposed a Likelihood-Free Overcomplete ICA model (LFOICA), which does not require parametric assumptions on the distributions of the independent sources. By generating the sources using neural networks and directly matching the generated data and the real data with a distance measure other than the Kullback-Leibler divergence, LFOICA can efficiently learn the mixing matrix via backpropagation. We further demonstrated how LFOICA can be extended to solve a number of causal discovery problems that essentially involve confounders, such as causal discovery from measurement-error-contaminated data and from low-resolution time series data. Experimental results show that our LFOICA and its extensions enjoy accurate and efficient learning. Compared with previous methods, the resulting causal discovery methods scale much better to rather high-dimensional problems and open the door to a large number of real applications.\n\n6 Acknowledgements\n\nChenwei Ding and Dacheng Tao would like to acknowledge the support of Australian Research Council Projects FL-170100117 and DP-180103424. Kun Zhang would like to acknowledge the support of the National Institutes of Health under Contract No. NIH-1R01EB022858-01, FAIN-R01EB022858, NIH-1R01LM012087, NIH-5U54HG008540-02, and FAIN-U54HG008540, of the United States Air Force under Contract No. FA8650-17-C-7715, and of National Science Foundation EAGER Grant No. IIS-1829681. The National Institutes of Health, the U.S. Air Force, and the National Science Foundation are not responsible for the views reported in this article.\n\nReferences\n\n[1] Judea Pearl. Causality: Models, Reasoning, and Inference. 
Cambridge University Press, New York, NY, USA, 2000.

[2] Peter Spirtes, Clark N Glymour, Richard Scheines, David Heckerman, Christopher Meek, Gregory Cooper, and Thomas Richardson. Causation, Prediction, and Search. MIT Press, 2000.

[3] David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554, 2002.

[4] Shohei Shimizu, Patrik O Hoyer, Aapo Hyvärinen, and Antti Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030, 2006.

[5] Patrik O Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In NIPS, pages 689–696, 2009.

[6] Kun Zhang and Aapo Hyvärinen. On the identifiability of the post-nonlinear causal model. In UAI, pages 647–655. AUAI Press, 2009.

[7] Biwei Huang, Kun Zhang, Mingming Gong, and Clark Glymour. Causal discovery and forecasting in nonstationary environments with state-space models. arXiv preprint arXiv:1905.10857, 2019.

[8] AmirEmad Ghassami, Negar Kiyavash, Biwei Huang, and Kun Zhang. Multi-domain causal structure learning in linear systems. In Advances in Neural Information Processing Systems, pages 6266–6276, 2018.

[9] Biwei Huang, Kun Zhang, Jiji Zhang, Ruben Sanchez-Romero, Clark Glymour, and Bernhard Schölkopf. Behind distribution shift: Mining driving forces of changes and causal arrows. In 2017 IEEE International Conference on Data Mining (ICDM), pages 913–918. IEEE, 2017.

[10] Kun Zhang, Biwei Huang, Jiji Zhang, Clark Glymour, and Bernhard Schölkopf.
Causal discovery from nonstationary/heterogeneous data: Skeleton estimation and orientation determination. In IJCAI: Proceedings of the Conference, volume 2017, page 1347. NIH Public Access, 2017.

[11] Biwei Huang, Kun Zhang, and Bernhard Schölkopf. Identification of time-dependent causal model: A Gaussian process treatment. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

[12] Kun Zhang, Mingming Gong, Joseph Ramsey, Kayhan Batmanghelich, Peter Spirtes, and Clark Glymour. Causal discovery with linear non-Gaussian models under measurement error: Structural identifiability results. In UAI, 2018.

[13] Patrik O Hoyer, Shohei Shimizu, Antti J Kerminen, and Markus Palviainen. Estimation of causal effects using linear non-Gaussian causal models with hidden variables. International Journal of Approximate Reasoning, 49(2):362–378, 2008.

[14] Philipp Geiger, Kun Zhang, Bernhard Schoelkopf, Mingming Gong, and Dominik Janzing. Causal inference by identification of vector autoregressive processes with hidden components. In ICML, pages 1917–1925, 2015.

[15] Mingming Gong, Kun Zhang, Bernhard Schoelkopf, Dacheng Tao, and Philipp Geiger. Discovering temporal causal relations from subsampled data. In ICML, pages 1898–1906, 2015.

[16] Mingming Gong, Kun Zhang, Bernhard Schölkopf, Clark Glymour, and Dacheng Tao. Causal discovery from temporally aggregated time series. In UAI, volume 2017. NIH Public Access, 2017.

[17] A Tank, E B Fox, and A Shojaie. Identifiability and estimation of structural vector autoregressive models for subsampled and mixed-frequency time series. Biometrika, 106(2):433–452, 2019.

[18] Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent Component Analysis, volume 46. John Wiley & Sons, 2004.

[19] Bruno A Olshausen and K Jarrod Millman. Learning sparse codes with a mixture-of-Gaussians prior.
In NIPS, pages 841–847, 2000.

[20] Pedro ADFR Højen-Sørensen, Ole Winther, and Lars Kai Hansen. Mean-field approaches to independent component analysis. Neural Computation, 14(4):889–918, 2002.

[21] Kaare Brandt Petersen, Ole Winther, and Lars Kai Hansen. On the slow convergence of EM and VBEM in low-noise linear models. Neural Computation, 17(9):1921–1926, 2005.

[22] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.

[23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[24] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[25] Philemon Brakel and Yoshua Bengio. Learning independent features with adversarial nets for non-linear ICA. arXiv preprint arXiv:1710.05050, 2017.

[26] Aapo Hyvarinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634, 1999.

[27] Shun-ichi Amari, Andrzej Cichocki, and Howard Hua Yang. A new learning algorithm for blind signal separation. In NIPS, pages 757–763, 1996.

[28] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.

[29] Quoc V Le, Alexandre Karpenko, Jiquan Ngiam, and Andrew Y Ng. ICA with reconstruction cost for efficient overcomplete feature learning. In NIPS, pages 1017–1025, 2011.

[30] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[31] Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks.
In ICML, pages 1718–1727, 2015.

[32] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

[33] Bharath K Sriperumbudur, Kenji Fukumizu, and Gert RG Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(Jul):2389–2410, 2011.

[34] Jan Eriksson and Visa Koivunen. Identifiability, separability, and uniqueness of linear ICA models. IEEE Signal Processing Letters, 11(7):601–604, 2004.

[35] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

[36] Atsushi Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, NIPS, pages 1574–1582. Curran Associates, Inc., 2014.

[37] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[38] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.

[39] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR. OpenReview.net, 2017.

[40] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[41] Karen Sachs, Omar Perez, Dana Pe’er, Douglas A Lauffenburger, and Garry P Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, 2005.

[42] Gustavo Lacerda, Peter L Spirtes, Joseph Ramsey, and Patrik O Hoyer. Discovering cyclic causal models by independent components analysis.
arXiv preprint arXiv:1206.3273, 2012.

[43] Joris M. Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. Distinguishing cause from effect using observational data: Methods and benchmarks. Journal of Machine Learning Research, 17(32):1–102, 2016.