NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 7490
Title: Specific and Shared Causal Relation Modeling and Mechanism-Based Clustering

Reviewer 1

In many scenarios, the causal relationships over a set of variables vary across groups while still sharing some common structure, so it is better to find a separate causal graph for each individual. This paper addresses the problem by first dividing the set of agents into a number of groups and then finding a causal graph for each group. The authors propose a model over m variables that includes both instantaneous and time-lagged effects. Ideally, one would estimate this model separately for each user, but that may be impossible with a small number of samples, so the model places a mixture-of-Gaussians prior on the effects. The number of mixture components is the number of clusters, and the goal is to estimate the prior probabilities over the clusters and the individual components of the mixture. The authors use the EM algorithm to estimate the model parameters. Because computing the posterior exactly is intractable, they use Monte Carlo integration and stochastic approximation for the E step. Both the simulations and the experiments on real-world data show that the proposed method performs better than various existing methods in terms of F1 score, clustering, and approximating the true model.

I have some suggestions and questions for the authors:

1. Theorem 1 proves an identification result for degenerate distributions. What breaks down in the general case, even if we consider just the instantaneous effects? It would have been nice to see a discussion of this.
2. How did the authors choose the parameter l_p, the number of time steps considered for time-lagged effects?
3. Why did the authors choose a threshold of 0.1 for converting the weights to the presence or absence of edges in the graph? Is there a systematic procedure to guide this choice?

Originality: This paper makes a significant contribution by proposing a new model for sharing causal relations. The proposed algorithm seems to recover individual-specific causal graphs and will be of immense interest to researchers working in the field if it can be scaled to a large number of clusters and a large number of variables.

Quality: I thought the paper makes several significant contributions and will be really helpful for researchers working in the field of causal modeling and causal discovery.

Clarity: I thoroughly enjoyed reading the paper. Both the model and the experiment sections were clear to me. However, I thought that the paper might benefit from a brief discussion of the SAEM algorithm before deriving the steps of the algorithm.

Significance: The modeling contributions of this paper are sound. The proposed algorithm is interesting, seems to perform better than the existing methods, and will be significant if it can be scaled to a large number of variables and a large number of clusters.
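
Question 3 above (the 0.1 threshold) could be answered with a simple sensitivity sweep. The following is a minimal sketch of such a check, not the authors' procedure: the weight matrix, the reference edge set, the threshold grid, and the helper names `edges_from_weights` and `f1_score` are all hypothetical and chosen only for illustration.

```python
def edges_from_weights(weights, threshold):
    """Return directed edges (i, j) whose absolute weight exceeds the threshold."""
    return {
        (i, j)
        for i, row in enumerate(weights)
        for j, w in enumerate(row)
        if i != j and abs(w) > threshold
    }


def f1_score(predicted, true):
    """F1 of a predicted edge set against a reference edge set."""
    if not predicted and not true:
        return 1.0
    true_positives = len(predicted & true)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(true)
    return 2 * precision * recall / (precision + recall)


# Hypothetical estimated weights and reference graph (toy values, not from the paper).
weights = [
    [0.00, 0.80, 0.05],
    [0.00, 0.00, 0.12],
    [0.09, 0.00, 0.00],
]
true_edges = {(0, 1), (1, 2)}

# Sweep a small grid of thresholds and record the F1 score at each.
sweep = {
    t: f1_score(edges_from_weights(weights, t), true_edges)
    for t in (0.01, 0.1, 0.2)
}
```

Reporting such a curve over a grid of thresholds, rather than a single value, would show whether the reported F1 scores depend strongly on the choice of 0.1.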

Reviewer 2

Update after author rebuttal: I greatly appreciate the authors responding point-by-point to my concerns. I have updated my score to a 7. If the paper is ultimately accepted, in addition to the clarifications the authors made in the rebuttal, I stress that the paper would be improved by clarifying the following:

1) The point about direction switching. I am not from a neuroscience/biology background (I believe most of the NeurIPS community is not either) and so the justification is a little counterintuitive here. The authors might want to consider contrasting this work with existing ways one might (inadequately) represent such behavior (e.g. latent confounders when Gaussianity/linearity is not assumed) -- this could be an elaboration of the comments in the rebuttal.
2) From the rebuttal: "In our simulations under non-zero variance settings, we never observed that the procedure converged to wrong solutions, suggesting that the non-zero-variance case is also identifiable" -- just because a simulation did not exhibit non-identifiability does not mean the model is identified. I think the authors' intuition is likely to be correct; however, they should consider softening the language around this point.

--------------------------------------------------------------------------

This paper is well-written. There are some points the authors should clarify in review and in a future draft to enhance the manuscript. Details of my personal confusions below:

0) What is the relation between the present work and the causal relational modeling literature (e.g. Arbour, Marazopoulou, and Jensen 16, Marazopoulou, Maier, and Jensen 15, and other works from David Jensen and associates)? Is "relational" used in the same sense here?
1) In general, how is this approach different from learning a DAG-based model where there is a cluster indicator variable that indexes the rest of the graph? In my mind this would help with statistical concerns (i.e. you can "lump" all subjects together) to learn one graph rather than several graphs (one per group).
2) In the abstract: "...due to possibly omitted factors that affect the quantitative causal effects." To be clear, does this refer _only_ to missing edges and not to unobserved variables (e.g. latent confounders) instead/as well?
3) "in healthcare, individuals may show different responses to the same treatment". In line with #1 above, this seems more like a distributional notion rather than needing a model that learns separate graphs for separate groups, no? Existing causal effect estimation (more or less developed than discovery is up for debate) tends to be more suited for handling a single non-parametric graph for the purpose of identification and then placing assumptions after identification has been obtained. Would it be valuable for discovery methods to try to follow this approach (at least in the vein of learning a single unified graph that represents the distribution of subjects)?
4) "In addition, they do not allow opposite causal directions..." Following #1 and #3, this seems like a weird thing to allow for in the proposed model. Can the authors please clarify why they permit edge direction to switch among subjects? I'm not familiar with areas in the causal literature that have considered this case and, as is, the justification seems light.
5) Sec 2 paragraph 1 (and Conclusion) -- I agree that adding a treatment of partial observations ("hidden confounders" etc.) seems challenging. Can the authors please give a high-level discussion of the steps they might take towards allowing that behavior in their model?
6) Eq. 1: what's p_l in the middle summation? It doesn't seem to be defined.
7) "However, the limited sample size from each individual limits statistical efficiency or even makes discovery impossible" -- is there a formal characterization/proof of this? I understand that statistical identification is challenging. E.g., as n -> 0 you have a tougher and tougher time, but is there a formal characterization of "impossibility" in this setting?
8) Below Thm 1: "we allow that across different groups, some causal directions are reversed" -- along with earlier comments, how does this provide more information about the ground truth graph than, say, learning an equivalence class that leaves direction unspecified when it's unclear?
9) "(1) the cycles are disjoint" -- what does "disjoint" mean here? Why is this requirement necessary?
10) "our empirical results strongly suggest that the causal model is also identifiable" -- To me, this isn't sufficient, for the same reason that association isn't sufficient for effect estimation contexts. Can the authors give justification that their experiments (non-sims) use data that satisfies the theoretical requirements implied by Thm 1?
11) "Imagine an extreme case: if there are enough samples ..." Why is it the case that this is identifiable? Can the authors please formalize this?
12) l_s > 2q - 1 -- What's the intuition behind this bound? How is it used in the proof of Theorem 1? E.g. "We first show..." -- it seems the authors cite Vandermeulen and Scott 2015 rather than proving this is a sufficient bound in their setting.
13) The proof of Theorem 1 seems incomplete. There doesn't appear to be a complete, formal proof in the appendix. It's not obvious from the cited papers that the same conclusions (from those citations) will hold in this setting. Can the authors please give a formal proof?
14) In Sec. 4.1, what is being adapted from SAEM and what is a new contribution? It seems the general EM is from SAEM and the Gibbs sampling procedure is novel?
15) "The computational complexity (...) is O(m^2 n M T0)." This is very slow, especially since it is on a per-iteration level. Can the authors give a characterization of types of data where it would be feasible to run this procedure? Obviously the experiments in the penultimate section provide an example, but it's not clear what scale of data they used there (e.g. was a very small subset of the total available fMRI data used, and would a neuroscientist need to be able to use this procedure on a data set many orders of magnitude larger?). Additionally, how can one be sure that the Gibbs procedure has converged?
16) "It is easy to add prior knowledge of causal connections" -- how is this achieved? An imposed independence or dependence constraint in Eq. 1?
17) For Gibbs and other procedures that require parameter initializations, is there some characterization available of how reliant these procedures' performance is on those initializations?
18) "We randomly generated acyclic causal structures according to the Erdos-Renyi model" -- ER gives undirected graphs. How did the authors choose directions of edges and ensure acyclicity in this sim?
19) "Other parameters were set as follows. sigma^2_{k, i, j} = U(.01, .1)" In Theorem 1 the authors state that the distributions should have no variance -- why does this sim use distributions with non-zero variance? The authors should consider adding a sim that exactly matches the conditions that are understood according to theory.
20) fMRI and Cellular Signaling Networks data: is this non-Gaussian data? How do we know that it is (non-Gaussian)?
21) fMRI: "We assume that the causal relations are fixed on the same day, but may change across different days" -- Can the authors provide some insight from neuroscience that backs up this assumption? To me, it seems a little odd that the brain would be re-wired and electric flows would switch from one day to the next. Is there a well-understood mechanism that explains this behavior?
22) Cellular: "With different interventions in different conditions, the causal relations over the 11 variables may change across them" -- Unless these are different cells in the different conditions, it would seem more reasonable to model this as having unobserved confounding rather than having causal direction switch according to interventions. Can the authors provide justification for this construction?
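
To illustrate point 18: one common way to obtain an acyclic directed graph from an Erdos-Renyi-style construction is to draw a random ordering of the nodes and keep each forward edge independently with a fixed probability. The sketch below shows that construction under stated assumptions; it is not known to be what the authors did, and the function names `random_dag` and `is_acyclic`, the edge probability, and the seed are hypothetical.

```python
import random


def random_dag(num_nodes, edge_prob, rng):
    """Sample a DAG: shuffle the nodes, then keep each forward edge with probability edge_prob."""
    order = list(range(num_nodes))
    rng.shuffle(order)  # random topological order
    edges = set()
    for a in range(num_nodes):
        for b in range(a + 1, num_nodes):
            if rng.random() < edge_prob:
                # Edges only point from earlier to later in the order, so no cycle can form.
                edges.add((order[a], order[b]))
    return edges


def is_acyclic(num_nodes, edges):
    """Kahn's algorithm: the graph is acyclic iff every node can be removed in topological order."""
    indegree = [0] * num_nodes
    children = {v: [] for v in range(num_nodes)}
    for u, v in edges:
        indegree[v] += 1
        children[u].append(v)
    frontier = [v for v in range(num_nodes) if indegree[v] == 0]
    removed = 0
    while frontier:
        u = frontier.pop()
        removed += 1
        for v in children[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                frontier.append(v)
    return removed == num_nodes


rng = random.Random(0)  # fixed seed for reproducibility (arbitrary choice)
dag = random_dag(10, 0.3, rng)
```

Stating the orientation rule explicitly in the paper would make the simulation reproducible.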

Reviewer 3

[Originality] The use of a Gaussian mixture as a specific model for causal discovery from data with group-wise causal mechanisms seems novel and interesting.

[Quality]
- Considering that the essence of the paper is the proposal of using a mixture of Gaussians, experimental assessment on real-world data is quite important.
- I am not entirely convinced by the experiments in the current version of the paper. (Table 1) Without a comparison against plain clustering methods (e.g., k-means), it still seems possible that SSCM takes advantage of other clustering signals, such as distinct data regions, rather than the structural difference. To validate the proposed model, I think it is crucial for the paper to collect convincing evidence that the proposed method really performed mechanism-based clustering. For example, (1) showing the results of a comparison against a plain clustering method, or (2) showing the variability of the estimated graphs for each group (to see if the posterior is concentrated well around the MAP or the posterior mean), or (3) providing an interpretation of the estimated graphs based on domain knowledge (similar to the one in the fMRI experiment) may help. Using biased sampling from each group to create mock "individuals" may also be an option.
- The problem of estimating the number of groups remains to be addressed. It seems to be an inherently difficult problem, but the paper did not specify a concrete method for it and only used the true value in the experiments (line 262).

[Clarity]
- I think the manuscript is very well prepared. All paragraphs are easy and smooth to comprehend.
- The only problem I had with the presentation is in the statement of Theorem 1. The notion of identifiability is often under some form of (hidden) asymptotics. If I understood correctly, for the case of Theorem 1, the identifiability is under the limit of $n \to \infty$. I think it is important to clarify this, especially since there is a "sample size" in the statement of the theorem, which is quite confusing (because in standard estimation problems, the identifiability of a parameter is under the limit of (sample size) $\to \infty$).
- (Typo) Supplementary material p.9: "Adjusted Random Index" should be "Adjusted Rand Index."

[Significance] Considering the nature of the paper (proposing a specific model), its significance largely depends on the experimental results using real-world data. The experimental results are interesting, providing some insights into the proposed model, but not completely satisfactory for the reasons stated above in the Quality section.
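
The baseline comparison requested in the Quality section amounts to scoring any alternative partition (for instance one from k-means) against the reference grouping with the same metric. The following is a minimal, self-contained sketch of the Adjusted Rand Index, assuming that is the clustering metric the supplementary material reports. The function name `adjusted_rand_index` and the example labelings are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the partitions trivially agree

    # Pairs of items placed together in both labelings (contingency table cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of items placed together within each labeling separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are all-in-one or all-singletons
    return (together_both - expected) / (maximum - expected)


# The index ignores label names: a relabeled identical partition scores 1.
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A partition that splits every reference group scores below zero.
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index is corrected for chance, a plain clustering baseline that scores near zero while the proposed method scores near one would support the claim that the clustering is driven by mechanism rather than by distinct data regions.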