{"title": "The Case for Evaluating Causal Models Using Interventional Measures and Empirical Data", "book": "Advances in Neural Information Processing Systems", "page_first": 11722, "page_last": 11732, "abstract": "Causal inference is central to many areas of artificial intelligence, including complex reasoning, planning, knowledge-base construction, robotics, explanation, and fairness. An active community of researchers develops and enhances algorithms that learn causal models from data, and this work has produced a series of impressive technical advances. However, evaluation techniques for causal modeling algorithms have remained somewhat primitive, limiting what we can learn from experimental studies of algorithm performance, constraining the types of algorithms and model representations that researchers consider, and creating a gap between theory and practice. We argue for more frequent use of evaluation techniques that examine interventional measures rather than structural or observational measures, and that evaluate those measures on empirical data rather than synthetic data. We survey the current practice in evaluation and show that the techniques we recommend are rarely used in practice. We show that such techniques are feasible and that data sets are available to conduct such evaluations. We also show that these techniques produce substantially different results than using structural measures and synthetic data.", "full_text": "The Case for Evaluating Causal Models\n\nUsing Interventional Measures and Empirical Data\n\nAmanda Gentzel, Dan Garant, and David Jensen\n\nCollege of Information and Computer Sciences\n\nUniversity of Massachusetts Amherst\n\nAbstract\n\nCausal modeling is central to many areas of arti\ufb01cial intelligence, including com-\nplex reasoning, planning, knowledge-base construction, robotics, explanation, and\nfairness. 
An active community of researchers develops and enhances algorithms that learn causal models from data, and this work has produced a series of impressive technical advances. However, evaluation techniques for causal modeling algorithms have remained somewhat primitive, limiting what we can learn from experimental studies of algorithm performance, constraining the types of algorithms and model representations that researchers consider, and creating a gap between theory and practice. We argue for more frequent use of evaluation techniques that examine interventional measures rather than structural or observational measures, and that evaluate using empirical data rather than synthetic data. We survey the current practice in evaluation and show that the techniques we recommend are rarely used in practice. We show that such techniques are feasible and that data sets are available to conduct such evaluations. We also show that these techniques produce substantially different results than using structural measures and synthetic data.\n\n1 Introduction\n\nEvaluation is central to research in artificial intelligence and machine learning [Cohen, 1995, Langley, 2011]. How we evaluate algorithms determines our perception of the relative effectiveness and usefulness of different approaches, and this knowledge guides choices about future research directions. As Cohen and Howe [1989] explained three decades ago: \u201cIdeally, evaluation should be a mechanism by which AI progresses both within and across individual research projects. It should be something we do as individuals to help our own research and, more importantly, on behalf of the \ufb01eld.\u201d As fields develop, protocols for evaluation need to develop alongside them. In this paper, we offer an empirical analysis of the set of techniques typically used to evaluate algorithms for learning causal models, and we show that this set could be substantially enhanced. 
The ultimate goal of most algorithms for causal modeling is to learn models capable of accurately estimating the effects of interventions in real-world systems. With this goal in mind, we would like to evaluate algorithms by comparing their estimates to actual interventional effects on data produced by a real-world system. In practice, though, many evaluations fall short of this ideal, most frequently using only synthetic data and structural or observational measures. Without the use of empirical data, our evaluations produce little information about whether our algorithms generalize to real-world systems, and this greatly reduces their likelihood of widespread adoption by others outside of the field. Without the use of interventional measures, our evaluations produce little information about whether learned models will accurately estimate the effects of interventions, limiting their real-world utility.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nNote that we do not argue for replacing the prevailing techniques for evaluation. These techniques have substantial value, both in assessing overall performance and in allowing fine-grained experiments to diagnose specific performance issues. Rather, we argue for augmenting the current suite of evaluation techniques to gather experimental evidence that the prevailing techniques cannot. We also do not contend that interventional measures and empirical data are entirely absent from current studies. A very small minority of recent studies use these techniques in combination. Rather, we argue that interventional measures and empirical data should be used routinely, and should be used in combination, for any serious study of algorithms for learning causal models. 
Indeed, the conclusions of most studies that lack such evaluation techniques should be considered exploratory and would benefit from additional evaluation.\n\nWe make the following contributions:\n\nC1 Decomposition of Evaluation Techniques. We decompose evaluation techniques into three interacting components: the data source, the algorithm, and the evaluation measure, allowing for a modular discussion of the interacting components of an evaluation.\n\nC2 Survey of Current Techniques. We provide a detailed survey of recent literature in causal modeling to provide a quantitative understanding of current evaluation practices.\n\nC3 Critique of Current Practice. We provide evidence that increased adoption of both empirical data and interventional measures would be beneficial to the community.\n\n2 Survey of Current Techniques\n\nTo assess how frequently different evaluation techniques are used in practice, we surveyed recent computer science publications on causal modeling. We collected papers from the past \ufb01ve UAI, NeurIPS, AAAI, ICML, and KDD conferences, as well as causality workshops held at UAI. We examined papers whose titles contained the terms \u2018cause\u2019, \u2018causal\u2019, or \u2018causality\u2019 and then narrowed this selection of papers to those that describe, propose, or evaluate a causal modeling algorithm. This resulted in a final set of 111 papers, of which 82% (91) reported any sort of evaluation.1 Citations to all 111 papers are provided in the Supplementary Material.\n\nThe counts of papers included in the final survey are shown in Table 1. 
While some relevant papers may fall outside of our search parameters, this approach captures a reasonably representative sample of recent work within computer science on causal modeling, allowing us to infer which techniques are used in practice and how frequently these techniques are used.\n\n1 When reporting survey results, we follow each percentage with a parenthesized number representing the raw count. The denominator for percentages is 91, except where otherwise noted.\n\nTable 1: Papers included in the survey\n\nVenue      2014  2015  2016  2017  2018  Total\nUAI           2     3     5     3     7     20\nNeurIPS       3     5     4     6    13     31\nAAAI          1     6     2     4     5     18\nICML          1     5     1     3     5     15\nKDD           0     2     3     0     2      7\nUAI-W         2     2     4     3     9     20\nTotal         9    23    19    19    41    111\n\n2.1 Survey Results\n\nTable 2: Number of papers using different evaluation techniques\n\n                                  Data Sources\nEvaluation Measures         Synthetic   Empirical\nStructural                         44          23\nObservational                      22          14\nInterventional                     11           6\nVisual Inspection                   0          19\n\nFor ease of exposition, we decompose evaluation techniques into three components: (1) the data source; (2) the algorithm under evaluation; and (3) the evaluation measure. These dimensions are highly dependent\u2014a choice of one can determine feasible choices for the others. For example, models learned from observational macro-economic data often cannot be compared against a known structure because there exists no ground truth, and models consisting only of non-parameterized structure cannot be compared to interventional effects because the models cannot produce such estimates.\n\nData Sources. The surveyed papers used a wide range of data sources, but they fall into two broad categories: synthetic and empirical. We categorized data as empirical when it was collected from a \u201creal world\u201d system, whether that was a randomized clinical trial, a global financial system, or user interaction with a website. 
The important distinction is that empirical data was collected from a process or a system that exists for some purpose beyond scientific research. Synthetic data includes anything else, including data generated from a randomly instantiated directed graphical model or from a simulation intended to reflect a real-world system. In our survey, we found many examples of both, and while synthetic data is used more frequently, both are still common. 81% (74) of papers surveyed used synthetic data, 67% (61) used empirical data, and 48% (44) used both.\n\nAlgorithms. The algorithm under evaluation is not part of the evaluation technique per se, but aspects of the algorithm strongly influence how evaluation can, and should, be performed. Algorithms fall into two broad categories, bivariate and multivariate, based on the number of variables they consider, although there are many variants.\n\nSome bivariate algorithms infer only the direction of effect (whether A causes B or B causes A). Others estimate the magnitude of effect between treatment and outcome, while adjusting for the effects of a number of covariates. Bivariate methods include Granger causality analysis [Granger, 1969], additive noise models [Peters et al., 2014], and analyses that use the potential outcomes framework [Rubin, 2005]. The most common variety of multivariate algorithm learns a directed acyclic graph (DAG). Multivariate algorithms are significantly more prevalent in the data, accounting for 50% (55/111) of papers surveyed. Bivariate algorithms account for 30% (34/111) of papers surveyed, split between those focused on orientation (10%), magnitude of effect (15%), or both (5%). The remaining papers in the survey fall in between, including those that aim to determine the joint effect of multiple treatment variables on a single outcome.\n\nEvaluation Measures. At the heart of any evaluation technique is a measure of performance. 
At a high level, evaluation measures fall into two categories: structural and distributional. Structural measures include all measures designed to assess whether the structure (including both existence of edges and edge orientation) learned by the algorithm matches the ground truth. Structural measures include structural Hamming distance (SHD), precision, recall, F1-score, true-positive rate, area under the ROC curve (AUROC), and structural intervention distance (SID) [Peters and B\u00fchlmann, 2015]. Distributional measures capture how well the algorithm can estimate quantitative dependence. Such measures can be further subdivided into observational and interventional measures. Observational measures compare the learned distribution with an observational ground truth (i.e. probability queries which do not involve a do operator). This could be a measure of individual edge strengths in a directed graphical model or a measure of the error when predicting a given outcome variable. Interventional measures, on the other hand, compare the learned distribution to ground truth obtained through intervention. Common interventional measures include KL-divergence, total variation distance, and measures of average and conditional treatment effect.\n\nOf the types of evaluation measures, structural measures are the most common, being used in 55% (50) of papers surveyed. Distributional measures are slightly less common, being used in 46% (42) of papers. The vast majority of the distributional measures used, however, are observational rather than interventional; observational measures are used in 32% (29) of papers, while interventional measures are used in only 14% (13).\n\nThe choice of evaluation measure depends on both the data generating process and type of algorithm, which is reflected in our survey. When synthetic data is evaluated, structural measures are used 59% (44/74) of the time. 
However, when empirical data is evaluated, structural measures are used only 38% (23/61) of the time, since empirical data is less likely to have ground truth. This lack of ground truth sometimes prevents any significant evaluation when using empirical data\u201426% (16/61) of empirical evaluations used only visual inspection of the results, with no ground truth. Table 2 summarizes the interaction between data source and evaluation measure in the survey.\n\n2.2 Findings\n\nThe survey makes clear that the vast majority of papers that perform evaluation use either (1) synthetic data; or (2) empirical data combined with non-interventional measures (observational measures, structural measures, or visual inspection). Our proposed ideal evaluation (empirical data and interventional measures) is used in only 7% (6) of papers. This raises an obvious question: Are the most commonly used evaluation techniques sufficient for determining whether algorithms for learning causal models will work effectively in realistic scenarios? As we argue below, they are not.\n\n3 The Case for Empirical Data\n\nAs already noted, nearly all causal modeling algorithms are ultimately designed for use outside of a laboratory, on real systems to infer useful causal knowledge about the world. Despite this, evaluation of such algorithms often uses synthetic rather than empirical data.\n\n3.1 Limitations of Synthetic Data\n\nResearchers have developed several approaches to generating synthetic data. The most common is to use some form of directed graphical model. In some cases, the structure of the model is designed to match the causal structure of a realistic system, either by manually specifying the structure or by learning it from empirical data. 
Large-scale simulators designed for other reasons can also be used. In some cases, simulators can be complex enough to generate data that is effectively equivalent to empirical data, though such simulations vary in quality.\n\nSynthetic data is easy to collect, allows for straightforward comparison with ground truth, and facilitates systematic testing across a variety of data parameters. Its popularity is evident\u201481% (74) of surveyed papers used it in their evaluation, and 41% (30/74) of those used only synthetic data. However, using synthetic data for evaluation also has significant limitations. These include:\n\nUnquestioned assumptions\u2014Synthetic data tends to match the assumptions of the researcher running the study and any algorithms they have created. For example, a researcher developing an algorithm that outputs a DAG will be inclined to generate data from a DAG.\n\nUnknown influences\u2014Even the best data generators can only include the influences already known to researchers. Almost by definition, synthetic data generators cannot include any \u201cunknown unknowns\u201d that may influence the outputs of real-world systems. While latent variables can be added, they are still defined and created by the researcher, limiting the realism of the data.\n\nLack of standardization\u2014Synthetic data is typically generated differently by each researcher, and this lack of standardization impedes comparison between studies.\n\nResearcher degrees-of-freedom\u2014Synthetic data is typically designed and parameterized by the researchers who created the algorithm being evaluated, giving them an enormous range of choices. Such high \u201cresearcher degrees-of-freedom\u201d [Simmons et al., 2011] are a basic challenge to the validity of any study.\n\nThese factors significantly limit the external validity and realism of most synthetic data, making it insufficient as the sole source of data for evaluation. 
Synthetic data is not without value\u2014it can be a powerful way to assess features of an algorithm and test its performance under different conditions. However, it typically falls short in providing insights into how the algorithm will perform on data from a real-world system.\n\n3.2 Benefits of Empirical Data\n\nEmpirical data is almost always more difficult to collect than synthetic data, and information on the effects of interventions is typically also much more difficult to obtain. However, using empirical data has multiple benefits:\n\nRealistic complexity\u2014Empirical data typically has a distribution that is more complex than synthetic data. That distribution is subject to realistic latent factors and measurement error. This creates a learning task that is often significantly harder than synthetic data, but also more closely matches the challenges of real-world settings.\n\nLower potential researcher bias\u2014Empirical data is typically not generated by the researcher who designed the algorithm being evaluated, and thus it is less subject to unintentional biases. In addition, individual data sets are often shared across the community, creating standardization and comparability across studies.\n\nReal-world demonstration\u2014The aim of research on algorithms for causal modeling is to have these algorithms used by others to learn causal models and reason about causal effects in real-world settings. Practitioners considering use of these methods may be legitimately skeptical about their effectiveness until they see successful demonstrations of accurate causal modeling on real-world data.\n\nHowever, using empirical data poses challenges as well. Because it is generally not collected by the person using it, some features of the data may not be fully understood, hindering correct interpretation. Also, ground truth can be challenging to obtain, limiting evaluation to visual inspection or observational measures. 
This is unsatisfying at best and misleading at worst, since, when evaluating without ground truth, it can be easy to see meaning where none exists or to imagine explanations for many possible conflicting outputs. Despite these challenges, empirical data is still used frequently in practice; 67% (61) of surveyed papers use empirical data, and 28% (17/61) used only empirical data.\n\n3.3 Sources of Empirical Data\n\nTypes of empirical data vary depending on the level of ground truth and the source of the ground truth. Purely observational data is the most readily available and is used most often. While this is rarely accompanied by full knowledge of the underlying structure, there are generally some dependencies that are known, either from common sense knowledge (such as temporal ordering) or from dependencies that have already been established by prior work. For a randomized controlled trial, the dependence between the measured treatment and outcome is generally taken as ground truth. The same is true for cases in which multiple potential outcomes can be recorded for each unit. This includes gene regulatory networks, flow cytometry analysis, and software systems, where essentially identical units can receive multiple treatments and thus produce multiple potential outcomes.\n\nBecause interventional measures and empirical data are used so infrequently, one might assume this is because such data sets are difficult to obtain. This is partially true\u2014there are significantly more observational data sets available than interventional data sets. However, a growing community is producing data sets that provide interventional effects. We describe some of them here.\n\nThe cause-effect pairs challenge [Mooij et al., 2016] provides data that is empirical and, while interventional effects are not available, the direction of causality is known. 
The 2016 Atlantic Causal Inference Conference Competition and subsequent competitions [Dorie et al., 2019, Hahn et al., 2019] created semi-synthetic data sets, producing synthetic treatment and outcome functions using covariates from a real-world system. A similar approach was used by Shimoni et al. [2018] for the IBM Causal Inference Benchmarking Framework. Flow cytometry data, measuring protein signaling pathways, is another common choice for interventional data [Sachs et al., 2005]. Dixit et al. [2016] provide data on gene expression, collected using their proposed Perturb-Seq technique to perform gene deletion interventions. There has also been work in partially randomized experiments, where a population is split into randomized and observational groups, creating parallel datasets for evaluation [Shadish et al., 2008]. Other sources of interventional and empirical data include results of advertising campaigns [Sun et al., 2015] and clinical studies [McDonald et al., 1992], as well as multiple challenges organized for machine learning conferences [Guyon et al., 2008, 2010]. Domain-specific simulations are another useful source of data. While technically synthetic, a sufficiently sophisticated simulation falls on a spectrum between purely synthetic and purely empirical data. Such simulations are often highly complex, are created by someone other than the researcher, and are created for a purpose other than evaluation, making them well suited for evaluation. One popular simulation that is used for evaluation is the DREAM in silico data sets, since multiple combinations of single-gene interventions can be performed on identical networks [Schaffter et al., 2011].\n\nWe also introduce an additional source of empirical data where interventions are possible: large-scale software systems. 
These systems have many desirable properties for the purposes of empirical evaluation: (1) They are pre-existing systems created by people other than the researchers for a purpose other than evaluating algorithms for causal modeling; (2) They produce non-deterministic experimental results due to latent variables and natural stochasticity; (3) System parameters provide natural treatment variables; and (4) Each experiment is recoverable, allowing the same experiment to be performed multiple times with different combinations of interventions. Three such data sets are discussed in more detail in Section 5 and in the Supplementary Material.2\n\n2 These data sets are available for download at http://kdl.cs.umass.edu/data.\n\n3.4 How Different are the Results?\n\nReaders may ask: In practice, what\u2019s the difference between using empirical data rather than synthetic data? If that difference is small, then the substantial extra work involved in evaluation with empirical data may not be worth the effort.\n\nFigure 1: Comparison of TVD on empirical data and synthetic data derived from empirical data. (a) and (b): synthetic data with structure obtained from PC or GES. (c): TVD on empirical data.\n\nTo begin addressing this question, we conducted a series of experiments using interventional data from the software systems mentioned above. For these experiments, we used a common approach for generating somewhat realistic synthetic data. This approach uses an empirical data set to learn a causal model and then uses that model to generate synthetic data (and known ground truth effects) for model evaluation. While the final data set is synthetic, its structure may better approximate the empirical system, rather than being entirely defined by the researcher, lending it more credibility. We used this approach to generate synthetic data in the style of the three empirical data sets we generated from software systems. 
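The approach just described, learning a causal model from empirical data and then sampling synthetic data (with known ground truth) from it, can be sketched as follows. This is an illustrative sketch only: the linear-Gaussian parameterization, the hand-coded three-variable chain, and all variable names are our own assumptions, not the models or data used in the paper.

```python
import numpy as np

def fit_linear_sem(data, parents):
    """Fit a linear-Gaussian SEM: regress each node on its parents.
    `data` is an (n, d) array; `parents[j]` lists parent columns of node j.
    Returns per-node (coefficients, intercept, noise_std)."""
    n, d = data.shape
    params = {}
    for j in range(d):
        pa = parents[j]
        if pa:
            X = np.column_stack([data[:, pa], np.ones(n)])
            beta, *_ = np.linalg.lstsq(X, data[:, j], rcond=None)
            resid = data[:, j] - X @ beta
            params[j] = (beta[:-1], beta[-1], resid.std())
        else:
            params[j] = (np.array([]), data[:, j].mean(), data[:, j].std())
    return params

def sample_synthetic(params, parents, order, n, rng):
    """Generate synthetic data by ancestral sampling in topological order."""
    out = np.zeros((n, len(parents)))
    for j in order:
        coef, intercept, noise = params[j]
        mean = intercept + (out[:, parents[j]] @ coef if parents[j] else 0.0)
        out[:, j] = mean + rng.normal(0.0, noise, size=n)
    return out

rng = np.random.default_rng(0)
# Stand-in for an empirical data set: a chain X0 -> X1 -> X2.
n = 5000
x0 = rng.normal(size=n)
x1 = 2.0 * x0 + rng.normal(size=n)
x2 = -1.5 * x1 + rng.normal(size=n)
data = np.column_stack([x0, x1, x2])

# In the paper's setup, the structure would come from PC or GES; here we fix it.
parents = {0: [], 1: [0], 2: [1]}
params = fit_linear_sem(data, parents)
synthetic = sample_synthetic(params, parents, order=[0, 1, 2], n=n, rng=rng)
```

The fitted model then serves as a fully known ground truth: any interventional distribution it implies can be computed exactly, while the synthetic data it generates inherits some, but not all, properties of the empirical system.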
Since we now have both empirical and synthetic data, each with ground truth, we can use causal modeling algorithms to construct a model for both of these data sets and compare the conclusions we would draw from each.\n\nThe synthetic data used was created by first choosing an initial causal modeling algorithm to create a ground truth model from the empirical data. After learning a ground truth model with each of two algorithms that construct causal graphical models (PC and GES),3 we generated synthetic data using the resulting models. We then evaluated the same three algorithms on both the synthetic and empirical data. Figure 1 shows how mean TVD varies for different causal modeling algorithms and different data sets. The results shown are the mean TVD when evaluating PC, GES, and MMHC on two types of synthetic data sets (using the model as ground truth) and on the empirical data (using the known interventional effects). There is significant variability between the two methods of generating the synthetic ground truth model from the empirical data (PC and GES), both in the mean TVD and in the relative ordering of the algorithms. Comparing the synthetic and empirical results, some relative orderings of the algorithms are the same (e.g., network), but other orderings are significantly different (e.g., Postgres). These results suggest that algorithm performance cannot be expected to match between synthetic and empirical data, even when the synthetic data is created in a way that would be expected to match aspects of the empirical data.\n\n4 The Case for Interventional Measures\n\nMany algorithms are currently evaluated based on their ability to learn causal structure. However, the actual desired task is almost never to model structure alone. In practice, estimating the magnitude of interventional effects is vitally important, and an algorithm that cannot distinguish between strong and weak effects is severely limited in scope. 
Despite this, the majority of current evaluations use observational or structural measures rather than measures of interventional effect.\n\n4.1 Limitations of Observational Measures\n\nObservational measures are widely used to evaluate algorithms for associational modeling, where the task of the algorithm is to discern statistical associations between two or more variables. In such applications, the primary focus is effectively modeling the magnitude and form of statistical dependence, rather than explicitly learning causal dependence. This highlights a severe and obvious limitation of observational measures:\n\nNon-causal\u2014Observational measures are, by definition, not causal. They measure the error of estimates of the outcome variable, but they do not measure that error under intervention. They provide a sense of how well an algorithm has learned statistical dependence, but not how well it has learned causal dependence. Despite this, observational measures are the only evaluation used in 23% (21/91) of papers surveyed.\n\n3 We reach similar conclusions based on the results for MMHC, which are reported in the Supplementary Material.\n\n4.2 Limitations of Structural Measures\n\nStructural measures are easy to calculate, and they have a clear intuition. If an algorithm produces a causal structure and we know structural ground truth, it seems sensible to determine if the two structures match. This has led to the widespread adoption of structural measures: 55% (50) of surveyed papers used such measures, and 84% (42/50) of those used only structural measures. However, structural measures have several serious limitations:\n\nRequires known structure\u2014Calculating structural measures requires a full ground-truth graph structure, which is only rarely available for empirical data.\n\nConstrains research directions\u2014The prevalence of structural measures may constrain research to algorithms that can be evaluated with these measures. 
Algorithms that do not produce DAGs are less likely to be developed or favorably reviewed. Since structural measures can only be used by algorithms that produce a directed graphical model as output, they implicitly assume that directed graphical models are capable of accurately representing any causal process being modeled, an unlikely assumption.\n\nOblivious to magnitude and type of dependence\u2014Structural measures, by design, do not account for different magnitudes of dependence, so an error in an edge with a strong effect incurs the same penalty as an error in an edge with a very weak effect. In addition, structural measures are only able to measure which variables in a causal model change as the result of an intervention. In many cases, it is also necessary to determine how much or in what way a given target quantity will change with respect to an intervention.\n\nOblivious to likely treatments and outcomes\u2014In most cases, structural measures do not consider where an edge is located in the overall structure of the DAG, so an edge with many downstream effects is treated the same as a less central edge.\n\n4.3 Benefits of Interventional Measures\n\nIn contrast to observational and structural measures, interventional measures have strong advantages:\n\nCorrespondence to actual use\u2014Interventional measures evaluate how well the model estimates interventional effects, which aligns more closely with the eventual use of nearly all causal models. For example, a directed acyclic graph is not the ultimate artifact of interest for most applications\u2014DAGs are simply a representation that facilitates estimation of interventional effects [Spirtes et al., 2000, Pearl, 2009]. 
Thus, it seems natural to define an evaluation measure in terms of interventional effects rather than graphical structure.\n\nWeighting of different errors\u2014While most structural measures penalize each edge misorientation equally, interventional measures penalize misorientation errors proportionally to their effect on the estimation of interventional effect.\n\n4.4 How Different are the Results?\n\nInterventional measures are intended to capture something different than structural measures, but they are ultimately affected by the structure of the learned model, and we would expect structural errors to lead to errors in interventional effect estimates. Of course, interventional and structural measures are equal when structure and parameterizations are perfect, but they can differ significantly when the learned structure is only approximately correct (which is almost always the case). To assess the extent to which interventional measures capture different information than structural measures in such cases, we ran experiments using synthetic data. This allowed us to produce data where we could calculate both structural measures and interventional measures, since we had the full parameterized ground truth model to compare against.\n\nFor these experiments, we produced data from random DAG structures with conditional probability models drawn from a Dirichlet distribution. We generated 5000 instances, applied a causal modeling algorithm, and calculated various evaluation measures. Figure 2 shows the results for GES. SHD and SID are clearly strongly correlated, suggesting that both structural measures ultimately produce similar quality measures of the algorithm. However, SHD and TVD are only very weakly correlated, with many models scoring highly with one measure and poorly with the other. 
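To make the contrast concrete, a structural measure such as SHD is computed purely from adjacency matrices, with no reference to effect magnitudes. A minimal sketch (our own illustration; the paper does not specify an implementation), using the common convention that a reversed edge counts as a single error:

```python
import numpy as np

def shd(true_adj, learned_adj):
    """Structural Hamming distance between two DAGs given as 0/1
    adjacency matrices (A[i, j] = 1 means an edge i -> j). For each
    unordered pair of nodes, add 1 if the pair's state (no edge,
    i -> j, or j -> i) differs between the two graphs."""
    A = np.asarray(true_adj)
    B = np.asarray(learned_adj)
    d = A.shape[0]
    errors = 0
    for i in range(d):
        for j in range(i + 1, d):
            if (A[i, j], A[j, i]) != (B[i, j], B[j, i]):
                errors += 1
    return errors

# Ground truth: X0 -> X1 -> X2
truth = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0]])
# Learned: X0 -> X1 reversed, X0 -> X2 added, X1 -> X2 kept
learned = np.array([[0, 0, 1],
                    [1, 0, 1],
                    [0, 0, 0]])
print(shd(truth, learned))  # 2: one reversal + one extra edge
```

Note that both errors contribute equally here, regardless of whether the misoriented edge carries a strong or a negligible effect; an interventional measure would weight them differently.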
At least in this case, the interventional measure (TVD) appears to capture substantially different information from the structural measure (SHD). Results for PC and MMHC are reported in the Supplementary Material.

Figure 2: Structural and interventional measures compared on synthetic data with GES.

5 Example of an Evaluation

To further explain what we mean by empirical data and interventional measures, we describe one example of this type of evaluation, shown schematically in Figure 3. This example demonstrates one way that an evaluation with empirical data and interventional measures could be performed, though many other techniques are possible, depending on the algorithm, data source, evaluation measure, and the research question under consideration. In our example, we evaluate the PC algorithm [Spirtes et al., 2000], Greedy Equivalence Search (GES) [Chickering, 2003], and MMHC [Tsamardinos et al., 2006] by measuring total variation distance (an interventional measure defined later) on a data set produced by experimentation with a large-scale software system.
An obvious way to evaluate how well an algorithm can learn causal models from real-world data is to compare the model's estimates to empirical data drawn from a system in which we can perform multiple interventions on the same units, giving us full interventional data in which we can assess every potential outcome for each unit. Large-scale software systems allow this type of intervention because they let us run the same experiments multiple times under different conditions (e.g., different settings of key system parameters). One example is a Postgres database, where we can run the same queries under different settings of key configuration parameters. In this context, each query corresponds to a unit, a set of configuration parameters corresponds to a treatment, and variables such as runtime correspond to outcomes.
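Schematically, such full interventional data might be represented as follows; the query identifiers, the `indexing_enabled` parameter, and the runtimes below are purely hypothetical, chosen only to illustrate that every unit is observed under every treatment:

```python
# Each unit (query) is run under every treatment (configuration), so all
# potential outcomes are observed. All values below are hypothetical.
full_interventional = [
    # (query_id, indexing_enabled, runtime_ms)
    ("q1", 0, 310.0), ("q1", 1, 42.0),
    ("q2", 0, 95.0),  ("q2", 1, 88.0),
    ("q3", 0, 270.0), ("q3", 1, 35.0),
]

def potential_outcomes(rows, query_id):
    """All observed outcomes for one unit, keyed by treatment."""
    return {t: y for q, t, y in rows if q == query_id}

print(potential_outcomes(full_interventional, "q1"))  # {0: 310.0, 1: 42.0}
```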
Details about this data set can be found in the Supplementary Material.
Many algorithms for causal modeling are designed to run on observational data, in which only a single, non-randomized treatment assignment is observed for each unit. In the absence of an observational data set that matches our interventional data, we can create an observational-style data set by sub-sampling the full interventional data in a non-random manner. To do this, we select a single treatment assignment for each query. Selecting each treatment at random would be equivalent to a randomized controlled trial. In most observational contexts, however, treatment assignment would be based on covariates of the units. For example, a database administrator might choose the configuration parameters based on features of each query. We follow a similar process to create observational data, using a measured covariate of each query to probabilistically assign its treatment.

Figure 3: A diagram of one approach to evaluating a causal modeling algorithm.

Given such an observational data set, we can apply a causal modeling algorithm and learn a causal model. A fully parameterized model can produce an estimated interventional distribution P̂ by applying the do-calculus [Galles and Pearl, 1995]. Under this framework, causal quantities take the form of probability queries with do operators, for instance P(O | do(T = 1)). We can also estimate the actual interventional distribution P = P(O = o | do(T = t)) for any outcome o and treatment t, because we can measure the effects of both values of treatment for each query in our data set.
We can then use an interventional measure to compare the true interventional distribution P to the estimated distribution P̂. One example of an interventional measure is total variation distance (TVD) [Lin, 1991], which measures the distance between two probability distributions.
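As a minimal sketch, TVD for discrete outcomes can be computed directly from two distributions represented as mappings from outcomes to probabilities; the distributions below are hypothetical:

```python
def total_variation_distance(p, q):
    """TVD = half the L1 distance between two discrete distributions."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in outcomes)

# True vs. estimated P(O | do(T = 1)) over a discrete outcome O
p_true = {"fast": 0.7, "slow": 0.3}
p_est  = {"fast": 0.5, "slow": 0.5}
print(total_variation_distance(p_true, p_est))  # ≈ 0.2
```

Because the software data provide every potential outcome for each query, both the true and the estimated interventional distributions are available for this comparison.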
For discrete outcomes O, the quality of an estimated interventional distribution relative to a known distribution under TVD is straightforward to compute:

TV_{P,\hat{P},T=t}(O) = \frac{1}{2} \sum_{o \in \Omega(O)} \left| P(O = o \mid do(T = t)) - \hat{P}(O = o \mid do(T = t)) \right|,

where Ω(O) is the domain of O. This gives us a numerical measure of how well the estimated interventional effects match the ground truth. A single TVD value is computed for each causal effect, and these values can then be aggregated for comparison. Results of this evaluation on the software data are shown in Figure 1c. For these data sets, we can conclude that GES has the best overall performance.

6 Conclusion

Evaluation is a key mechanism that determines how algorithms are viewed within the community, what research directions are pursued next, and whether our research has broader impacts outside the community. Our current evaluation techniques aim too low, and they fail to evaluate the full range of questions that our research goals imply.
We are not the first to point out the need for more robust evaluation techniques. Some of the data sets we discuss were created in response to the recognition that better evaluation was necessary [Dorie et al., 2019, Shimoni et al., 2018, Mooij et al., 2016]. In addition, prior work has examined the importance of testing the generalizability of causal inferences drawn from observational data [Zhao et al., 2019, Keane and Wolpin, 2007] and of comparing causal effects estimated from observational and experimental data [Cook et al., 2008, Eckles et al., 2016, Eckles and Bakshy, 2017, Gordon et al., 2019]. Despite this, as our survey shows, empirical evaluation with interventional measures is rarely used by computer science researchers.
We acknowledge that, while the evaluation techniques we advocate are applicable to a wide range of algorithms, data sets may not be available for every task.
The diverse tasks of causal modeling algorithms make it difficult to recommend a single data set and evaluation measure suitable for every algorithm. However, the data sets and measures that are most commonly used are largely insufficient. The community would benefit if more data sets with interventional effects were created and made available for public use, allowing for a breadth of evaluation options.
We do not advocate abandoning synthetic data and structural measures. Both have many uses for evaluating algorithm performance and can be indispensable scientific tools. However, they are insufficient on their own. Instead, they should be viewed as a first step in evaluation. If we want causal modeling algorithms to be adopted outside our research community, we need demonstrations of their utility outside of a laboratory setting. If we do not evaluate on empirical data, we cannot be certain our algorithms will perform well on real-world data, and if we do not evaluate with interventional measures, we cannot be certain that the causal effects our algorithms infer will translate to actual, substantial causal effects in practice. Expanding our routine evaluations will substantially improve the credibility and comparability of results, the external validity and trustworthiness of algorithms, and the efficiency with which we conduct our research.

Acknowledgments

This material is based upon work supported by the United States Air Force under Contract No. FA8750-17-C-0120. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force.

References

David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3, March 2003.

Paul Cohen. Empirical Methods for Artificial Intelligence.
MIT Press, Cambridge, MA, 1995.

Paul Cohen and Adele Howe. Toward AI research methodology: Three case studies in evaluation. IEEE Transactions on Systems, Man, and Cybernetics, 19(3):634–646, 1989.

Thomas D. Cook, William R. Shadish, and Vivian C. Wong. Three conditions under which experiments and observational studies produce comparable causal estimates: New findings from within-study comparisons. Journal of Policy Analysis and Management, 27(4):724–750, 2008.

Atray Dixit, Oren Parnas, Biyu Li, Jenny Chen, Charles P. Fulco, Livnat Jerby-Arnon, Nemanja D. Marjanovic, Danielle Dionne, Tyler Burks, Raktima Raychowdhury, Britt Adamson, Thomas M. Norman, Eric S. Lander, Jonathan S. Weissman, Nir Friedman, and Aviv Regev. Perturb-Seq: Dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell, 167(7):1853–1866, 2016.

Vincent Dorie, Jennifer Hill, Uri Shalit, Marc Scott, and Dan Cervone. Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statistical Science, 34:43–68, 2019.

Dean Eckles and Eytan Bakshy. Bias and high-dimensional adjustment in observational studies of peer effects. arXiv preprint arXiv:1706.04692, 2017.

Dean Eckles, René F. Kizilcec, and Eytan Bakshy. Estimating peer effects in networks with peer encouragement designs. Proceedings of the National Academy of Sciences, 113(27):7316–7322, 2016.

David Galles and Judea Pearl. Testing identifiability of causal effects. In Proceedings of the 11th International Conference on Uncertainty in Artificial Intelligence, pages 185–195, 1995.

Brett R. Gordon, Florian Zettelmeyer, Neha Bhargava, and Dan Chapsky. A comparison of approaches to advertising measurement: Evidence from big field experiments at Facebook.
Marketing Science, 38(2):193–225, 2019.

Clive W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37(3):424–438, 1969.

Isabelle Guyon, Constantin Aliferis, Greg Cooper, and Peter Spirtes. Design and analysis of the causation and prediction challenge. WCCI 2008 Workshop on Causality, pages 1–33, 2008.

Isabelle Guyon, Dominik Janzing, and Bernhard Schölkopf. Causality: Objectives and assessment. NIPS 2008 Workshop on Causality, 6:1–38, 2010.

P. Richard Hahn, Vincent Dorie, and Jared S. Murray. Atlantic Causal Inference Conference (ACIC) data analysis challenge 2017, 2019.

Michael P. Keane and Kenneth I. Wolpin. Exploring the usefulness of a nonrandom holdout sample for model validation: Welfare effects on female behavior. International Economic Review, 48(4):1351–1378, 2007.

Pat Langley. The changing science of machine learning. Machine Learning, 82(3):275–279, March 2011.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, January 1991.

Clement J. McDonald, Siu L. Hui, and William M. Tierney. Effects of computer reminders for influenza vaccination on morbidity during influenza epidemics. MD Computing, 9:304–312, 1992.

Joris M. Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. Distinguishing cause from effect using observational data: Methods and benchmarks. Journal of Machine Learning Research, 17(1):1103–1204, 2016.

Judea Pearl. Causality. Cambridge University Press, 2009.

Jonas Peters and Peter Bühlmann. Structural intervention distance for evaluating causal graphs. Neural Computation, 27(3):771–799, March 2015.

Jonas Peters, Joris M. Mooij, Dominik Janzing, and Bernhard Schölkopf.
Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 15(1):2009–2053, 2014.

Donald B. Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331, 2005.

Karen Sachs, Omar Perez, Dana Pe'er, Douglas A. Lauffenburger, and Garry P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, April 2005.

Thomas Schaffter, Daniel Marbach, and Dario Floreano. GeneNetWeaver: In silico benchmark generation and performance profiling of network inference methods. Bioinformatics, 27(16):2263–2270, 2011.

William R. Shadish, M. H. Clark, and Peter M. Steiner. Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random and nonrandom assignments. Journal of the American Statistical Association, 103(484):1334–1344, 2008.

Yishai Shimoni, Chen Yanover, Ehud Karavani, and Yaara Goldschmidt. Benchmarking framework for performance-evaluation of causal inference analysis. arXiv preprint arXiv:1802.05046, 2018.

Joseph P. Simmons, Leif D. Nelson, and Uri Simonsohn. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11):1359–1366, 2011.

Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2000.

Wei Sun, Pengyuan Wang, Dawei Yin, Jian Yang, and Yi Chang. Causal inference via sparse additive models with application to online advertising. Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 297–303, 2015.

Ioannis Tsamardinos, Laura E. Brown, and Constantin F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm.
Machine Learning, 65(1):31–78, 2006.

Qingyuan Zhao, Luke J. Keele, and Dylan S. Small. Comment: Will competition-winning methods for causal inference also succeed in practice? Statistical Science, 34(1):72–76, February 2019.