{"title": "Theory-Based Causal Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 43, "page_last": 50, "abstract": "", "full_text": "Theory-Based Causal Inference\n\nJoshua B. Tenenbaum & Thomas L. Grif\ufb01ths\n\nDepartment of Brain and Cognitive Sciences\n\nMIT, Cambridge, MA 02139\n\n jbt, gruffydd\n\n@mit.edu\n\nAbstract\n\nPeople routinely make sophisticated causal inferences unconsciously, ef-\nfortlessly, and from very little data \u2013 often from just one or a few ob-\nservations. We argue that these inferences can be explained as Bayesian\ncomputations over a hypothesis space of causal graphical models, shaped\nby strong top-down prior knowledge in the form of intuitive theories. We\npresent two case studies of our approach, including quantitative mod-\nels of human causal judgments and brief comparisons with traditional\nbottom-up models of inference.\n\n1 Introduction\n\nPeople are remarkably good at inferring the causal structure of a system from observations\nof its behavior. Like any inductive task, causal inference is an ill-posed problem: the data\nwe see typically underdetermine the true causal structure. This problem is worse than\nthe usual statistician\u2019s dilemma that \u201ccorrelation does not imply causation\u201d. Many cases of\neveryday causal inference follow from just one or a few observations, where there isn\u2019t even\nenough data to reliably infer correlations! This fact notwithstanding, most conventional\naccounts of causal inference attempt to generate hypotheses in a bottom-up fashion based\non empirical correlations. These include associationist models [12], as well as more recent\nrational models that embody an explicit concept of causation [1,3], and most algorithms\nfor learning causal Bayes nets [10,14,7].\n\nHere we argue for an alternative top-down approach, within the causal Bayes net frame-\nwork. 
In contrast to standard bottom-up approaches to structure learning [10,14,7], which aim to optimize or integrate over all possible causal models (structures and parameters), we propose that people consider only a relatively constrained set of hypotheses determined by their prior knowledge of how the world works. The allowed causal hypotheses not only form a small subset of all possible causal graphs, but also instantiate specific causal mechanisms with constrained conditional probability tables, rather than much more general conditional dependence and independence relations.

The prior knowledge that generates this hypothesis space of possible causal models can be thought of as an intuitive theory, analogous to the scientific theories of classical mechanics or electrodynamics that generate constrained spaces of possible causal models in their domains. Following the suggestions of recent work in cognitive development (reviewed in [4]), we take the existence of strong intuitive theories to be the foundation for human causal inference. However, our view contrasts with some recent suggestions [4,11] that an intuitive theory may be represented as a causal Bayes net model. Rather, we consider a theory to be the underlying principles that generate the range of causal network models potentially applicable in a given domain – the abstractions that allow a learner to construct and reason with appropriate causal network hypotheses about novel systems in the presence of minimal perceptual input.

Given the hypothesis space generated by an intuitive theory, causal inference then follows the standard Bayesian paradigm: weighing each hypothesis according to its posterior probability and averaging their predictions about the system according to those weights.
The combination of Bayesian causal inference with strong top-down knowledge is quite powerful, allowing us to explain people's very rapid inferences about model complexity in both static and temporally extended domains. Here we present two case studies of our approach, including quantitative models of human causal judgments and brief comparisons with more bottom-up accounts.

2 Inferring hidden causal powers

We begin with a paradigm introduced by Gopnik and Sobel for studying causal inference in children [5]. Subjects are shown a number of blocks, along with a machine – the "blicket detector". The blicket detector "activates" – lights up and makes noise – whenever a "blicket" is placed on it. Some of the blocks are "blickets", others are not, but their outward appearance is no guide. Subjects observe a series of trials, on each of which one or more blocks are placed on the detector and the detector activates or not. They are then asked which blocks have the hidden causal power to activate the machine.

Gopnik and Sobel have demonstrated various conditions under which children successfully infer the causal status of blocks from just one or a few observations. Of particular interest is their "backwards blocking" condition [13]: on trial 1 (the "1-2" trial), children observe two blocks (1 and 2) placed on the detector and the detector activates. Most children now say that both block 1 and block 2 are blickets. On trial 2 (the "1 alone" trial), block 1 is placed on the detector alone and the detector activates. Now all children say that block 1 is a blicket, and most say that block 2 is not a blicket. Intuitively, this is a kind of "explaining away": seeing that block 1 is sufficient to activate the detector alone explains away the previously observed association of block 2 with detector activation.

Gopnik et al. [6] suggest that children's causal reasoning here may be thought of in terms of learning the structure of a causal Bayes net. Figure 1a shows a Bayes net, h10, that is consistent with children's judgments after trial 2. Variables X1 and X2 represent whether blocks 1 and 2 are on the detector; E represents whether the detector activates; the existence of an edge X1 -> E but no edge X2 -> E represents the hypothesis that block 1 but not block 2 has the power to turn on the detector. We encode the two observations d1 and d2 as vectors {x1, x2, e}, where x1 = 1 if block 1 is on the detector (else x1 = 0), likewise for x2, and e = 1 if the detector is active (else e = 0). Given only the data d1 = {1,1,1}, d2 = {1,0,1}, standard Bayes net learning algorithms have no way to converge on subjects' choice h10. The data are not sufficient to compute the conditional independence relations required by constraint-based methods [10,14],1 nor to strongly influence the Bayesian structural score using arbitrary conditional probability tables [7]. Standard psychological models of causal strength judgment [12,3], equivalent to maximum-likelihood parameter estimates for the family of Bayes nets in Figure 1a [15], either predict no explaining away here or make no prediction due to insufficient data.

1 Gopnik et al. [6] argue that constraint-based learning could be applied here, if we supplement the observed data with large numbers of fictional observations. However, this account does not explain why subjects make the inferences that they do from the very limited data actually observed, nor why they are justified in doing so. Nor does it generalize to the three experiments we present here.

Alternatively, reasoning on this task could be explained in terms of a simple logical deduction. We require as a premise the activation law: a blicket detector activates if and only if one or more blickets are placed on it. Based on the activation law and the data d1 and d2, we can deduce that block 1 is a blicket but that the status of block 2 remains undetermined. If we further assume a form of Occam's razor, positing the minimal number of hidden causal powers, then we can infer that block 2 is not a blicket, as most children do. Other cases studied by Gopnik et al. can be explained similarly. However, this deductive model cannot explain many plausible but nondemonstrative causal inferences that people make, or people's degrees of confidence in their judgments, or their ability to infer probabilistic causal relationships from noisy data [3,12,15]. It also leaves mysterious the origin and form of Occam's razor. In sum, neither deductive logic nor standard Bayes net learning provides a satisfying account of people's rapid causal inferences.
We now show how a Bayesian structural inference based on strong top-down knowledge can explain the blicket detector judgments, as well as several probabilistic variants that clearly exceed the capacity of deductive accounts.

Most generally, the top-down knowledge takes the form of a causal theory with at least two components: an ontology of object, attribute and event types, and a set of causal principles relating these elements. Here we treat theories only informally; we are currently developing a formal treatment using the tools of probabilistic relational logic (e.g., [9]). In the basic blicket detector domain, we have two kinds of objects, blocks and machines; two relevant attributes, being a blicket and being a blicket detector; and two kinds of events, a block being placed on a machine and a machine activating. The causal principle relating these events and attributes is just the activation law introduced above. Instead of serving as a premise for deductive inference, the causal law now generates a hypothesis space of causal Bayes nets for statistical inference. This space is quite restricted: with two objects and one detector, there are only 4 consistent hypotheses, h00, h10, h01, and h11 (Figure 1a). The conditional probabilities for each hypothesis h are also determined by the theory. Based on the activation law, P(e = 1 | x1, x2; h) equals 1 if x1 = 1 and h contains the edge X1 -> E, or x2 = 1 and h contains the edge X2 -> E; otherwise it equals 0.

Causal inference then follows by Bayesian updating of probabilities over the hypothesis space in light of the observed data d.
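To make the theory-generated hypothesis space concrete, here is a minimal Python sketch (ours, not code from the paper) of the hypothesis enumeration and the activation-law conditional probability table; the tuple encoding of hypotheses is our own convention.

```python
from itertools import product

def hypotheses(n_blocks=2):
    """All assignments of blicket status to n blocks. A hypothesis is a
    tuple of 0/1 flags: h[i] == 1 means the edge X_{i+1} -> E exists."""
    return list(product([0, 1], repeat=n_blocks))

def likelihood(e, x, h):
    """Activation law as a conditional probability table: the detector
    activates (e == 1) iff at least one blicket is on it. Returns 0 or 1."""
    active = any(xi and hi for xi, hi in zip(x, h))
    return 1.0 if e == int(active) else 0.0

# Four hypotheses for two blocks (h00, h01, h10, h11 of Figure 1a):
print(hypotheses(2))                                      # [(0, 0), (0, 1), (1, 0), (1, 1)]
# The "1-2" trial (both blocks on, detector active) rules out only h00:
print([likelihood(1, (1, 1), h) for h in hypotheses(2)])  # [0.0, 1.0, 1.0, 1.0]
```

Because the theory fixes the conditional probability tables, structure learning here reduces to scoring a handful of fully specified models rather than searching over arbitrary parameterizations.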
We assume independent observations so that the total likelihood factors into separate terms for individual trials. For all hypotheses in the space, the individual-trial likelihoods also factor into P(e | x1, x2; h) P(x1) P(x2), and we can ignore the last two terms, assuming that block positions are independent of the causal structure. The remaining term P(e | x1, x2; h) is 1 for any hypothesis consistent with the data and 0 otherwise, because of the deterministic activation law. The posterior for any data set d is then simply the restriction and renormalization of the prior P(h) to the set of hypotheses consistent with d.2

Backwards blocking proceeds as follows. After the "1-2" trial (d1 = {1,1,1}), at least one block must be a blicket: the consistent hypotheses are h10, h01, and h11. After the "1 alone" trial (d2 = {1,0,1}), only h10 and h11 remain consistent. The prior over causal structures can be written as P(h11) = q^2, P(h10) = P(h01) = q(1-q), and P(h00) = (1-q)^2, assuming that each block has some independent probability q of being a blicket. The nonzero posterior probabilities are then given as follows (all others are zero): P(h10 | d1, d2) = 1 - q, and P(h11 | d1, d2) = q. Finally, the probability that block 2 is a blicket may be computed by averaging the predictions of all consistent hypotheses weighted by their posterior probabilities: P(B2 -> E | d1, d2) = P(h11 | d1, d2) = q.

2 More generally, we could allow for some noise in the detector, by letting the likelihood P(e | x1, x2; h) be probabilistic rather than deterministic. For simplicity we consider only the noiseless case here; a low level of noise would give similar results.
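The restriction-and-renormalization computation can be sketched end to end as follows (our illustration; q = 1/3 is an arbitrary value chosen for demonstration, whereas the experiments below fit q to subjects' baseline judgments):

```python
from itertools import product

def posterior(data, q, n=2):
    """Posterior over blicket assignments h = (h1, ..., hn), hi in {0, 1},
    given trials (x, e), with independent prior q that each block is a
    blicket and the deterministic activation-law likelihood."""
    scores = {}
    for h in product([0, 1], repeat=n):
        p = 1.0
        for hi in h:
            p *= q if hi else 1 - q                  # independent prior
        for x, e in data:
            active = any(xi and hi for xi, hi in zip(x, h))
            p *= 1.0 if e == int(active) else 0.0    # activation law
        scores[h] = p
    z = sum(scores.values())                         # renormalize survivors
    return {h: p / z for h, p in scores.items()}

def p_blicket(i, post):
    """P(block i+1 is a blicket): total posterior of hypotheses with an
    edge X_{i+1} -> E."""
    return sum(p for h, p in post.items() if h[i])

q = 1/3
d1 = ((1, 1), 1)   # "1-2" trial: both blocks on, detector activates
d2 = ((1, 0), 1)   # "1 alone" trial: block 1 alone, detector activates

after_d1 = posterior([d1], q)
after_d2 = posterior([d1, d2], q)
print(p_blicket(0, after_d1), p_blicket(1, after_d1))  # both equal 1/(2-q) = 0.6
print(p_blicket(0, after_d2), p_blicket(1, after_d2))  # 1.0, and q = 1/3
```

With q < 1/2, both blocks rise above 1/2 after the "1-2" trial, and block 2 falls back to its base rate q after the "1 alone" trial – the backwards-blocking pattern.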
In comparing with human judgments in the backwards blocking paradigm, the relevant probabilities are P(Bi -> E), the baseline judgments before either block is placed on the detector; P(Bi -> E | d1), judgments after the "1-2" trial; and P(Bi -> E | d1, d2), judgments after the "1 alone" trial. These probabilities depend only on the prior probability q of blickets. Setting q < 1/2 qualitatively matches children's backwards blocking behavior: after the "1-2" trial, both blocks are more likely than not to be blickets (P(Bi -> E | d1) = 1/(2-q) > 1/2); then, after the "1 alone" trial, block 1 is definitely a blicket while block 2 is probably not (P(B2 -> E | d1, d2) = q < 1/2). Thus there is no need to posit a special "Occam's razor" just to explain why block 2 becomes less likely to be a blicket after the "1 alone" trial – this adjustment follows naturally as a rational statistical inference. However, we do have to assume that blickets are somewhat rare (q < 1/2).
Following the "1 alone" trial the probability of block 2 being a blicket returns to baseline (P(B2 -> E | d1, d2) = q), because the unambiguous second trial explains away all the evidence for block 2 from the first trial. Thus for q >= 1/2, block 2 would remain likely to be a blicket even after the "1 alone" trial.

In order to test whether human causal reasoning actually embodies this Bayesian form of Occam's razor, or instead a more qualitative rule such as the classical version, "Entities should not be multiplied beyond necessity", we conducted three new blicket-detector experiments on both adults and 4-year-old children (in collaboration with Sobel & Gopnik). The first two experiments were just like the original backwards blocking studies, except that we manipulated subjects' estimates of q by introducing a pretraining phase. Subjects first saw 12 objects placed on the detector, of which either 2, in the "rare" condition, or 10, in the "common" condition, activated the detector. We hypothesized that this manipulation would lead subjects to set their subjective prior for blickets to either q = 1/6 or q = 5/6, and thus, if guided by the Bayesian Occam's razor, to show strong or weak blocking respectively.

We gave adult subjects a different cover story, involving "super pencils" and a "superlead detector", but here we translate the results into blicket detector terms.
Following the "rare" or "common" training, two new objects (blocks 1 and 2) were picked at random from the same pile and subjects were asked three times to judge the probability that each one could activate the detector: first, before seeing it on the detector, as a baseline; second, after a "1-2" trial; third, after a "1 alone" trial. Probabilities were judged on a 1-7 scale and then rescaled to the range 0-1.

The mean adult probability judgments and the model predictions are shown in Figures 2a (rare) and 2b (common). Wherever two objects have the same pattern of observed contingencies (e.g., blocks 1 and 2 at baseline and after the "1-2" trial), subjects' mean judgments were found not to be significantly different and were averaged together for this analysis. In fitting the model, we adjusted q to match subjects' baseline judgments; the best-fitting values were very close to the true base rates. More interestingly, subjects' judgments tracked the Bayesian model over both trials and conditions. Following the "1-2" trial, mean ratings of both objects increased above baseline, but more so in the rare condition where the activation of the detector was more surprising. Following the "1 alone" trial, all subjects in both conditions were 100% sure that block 1 had the power to activate the detector, and the mean rating of block 2 returned to baseline: low in the rare condition, but high in the common condition. Four-year-old children made "yes"/"no" judgments that were qualitatively similar, across both rare and common conditions [13].

Human causal inference thus appears to follow rational statistical principles, obeying the Bayesian version of Occam's razor rather than the classical logical version.
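The strong-versus-weak blocking prediction follows directly from the two-block posteriors (our sketch; the closed-form expressions are our algebraic simplification of the enumeration described in this section):

```python
def blocking_predictions(q):
    """Closed-form posteriors for the two-block backwards-blocking task
    under the activation law and an independent blicket prior q."""
    baseline = q                  # P(Bi blicket) before any trial
    after_12 = 1 / (2 - q)        # P(Bi blicket | "1-2" trial), i = 1, 2
    after_1_alone = (1.0, q)      # (P(B1 ...), P(B2 ...)) after both trials
    return baseline, after_12, after_1_alone

# "rare" pretraining implies strong blocking; "common" implies weak blocking
for label, q in [("rare", 1/6), ("common", 5/6)]:
    base, a12, (b1, b2) = blocking_predictions(q)
    print(f"{label}: baseline={base:.2f}, after '1-2'={a12:.2f}, "
          f"after '1 alone': B1={b1:.2f}, B2={b2:.2f}")
```

In the rare condition block 2 drops from about 0.55 back to 0.17, while in the common condition it barely moves (about 0.86 to 0.83), matching the strong/weak blocking contrast described above.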
However, an alternative explanation of our data is that subjects are simply employing a combination of logical reasoning and simple heuristics. Following the "1 alone" trial, people could logically deduce that they have no information about the status of block 2 and then fall back on the base rate of blickets as a default, without the need for any genuinely Bayesian computations. To rule out this possibility, our third study tested causal explaining away in the absence of unambiguous data that could be used to support deductive reasoning. Subjects again saw the "rare" pretraining, but now the critical trials involved three objects, blocks 1, 2, and 3. After judging the baseline probability that each object could activate the detector, subjects saw two trials: a "1-2" trial, followed by a "1-3" trial, in which objects 1 and 3 activated the detector together. The Bayesian hypothesis space is analogous to Figure 1a, but now includes eight (2^3) hypotheses, representing all possible assignments of causal powers to the three objects.
As before, the prior over causal structures can be written as a product of independent terms, one per block (q for a blicket, 1-q for a non-blicket); the likelihood reduces to 1 for any hypothesis consistent with the data (under the activation law) and 0 otherwise; and the probability that block i is a blicket may be computed by summing the posterior probabilities of all consistent hypotheses containing the edge Xi -> E.

Figure 2c shows that the Bayesian model's predictions and subjects' mean judgments match well except for a slight overshoot in the model. Following the "1-3" trial, people judge that block 1 probably activates the detector, but now with less than 100% confidence. Correspondingly, the probability that block 2 activates the detector decreases, and the probability that block 3 activates the detector increases, to a level above baseline but below 0.5. All of these predicted effects are statistically significant (one-tailed paired t-tests).

These results provide strong support for our claim that rapid human inferences about causal structure can be explained as theory-guided Bayesian computations. Particularly striking is the contrast between the effects of the "1 alone" trial and the "1-3" trial. In the former case, subjects observe unambiguously that block 1 is a cause and their judgment about block 2 falls completely to baseline; in the latter, they observe only a suspicious coincidence and so explaining away is not complete.
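The three-object predictions can be checked with a brute-force enumeration (our sketch, using q = 1/6 from the rare pretraining; the printed values are the model's idealized posteriors, not the fitted curves of Figure 2c):

```python
from itertools import product

def posterior_over_blickets(data, q, n):
    """Enumerate all 2**n blicket assignments, score each by its independent
    prior and the deterministic activation-law likelihood, renormalize."""
    scores = {}
    for h in product([0, 1], repeat=n):
        p = 1.0
        for hi in h:
            p *= q if hi else 1 - q
        for x, e in data:
            active = any(xi and hi for xi, hi in zip(x, h))
            p *= 1.0 if e == int(active) else 0.0
        scores[h] = p
    z = sum(scores.values())
    return {h: p / z for h, p in scores.items()}

q = 1/6                      # "rare" pretraining
data = [((1, 1, 0), 1),      # "1-2" trial: blocks 1 and 2, detector activates
        ((1, 0, 1), 1)]      # "1-3" trial: blocks 1 and 3, detector activates
post = posterior_over_blickets(data, q, 3)
for i in range(3):           # B1 = 36/41 ~ 0.878; B2 = B3 = 11/41 ~ 0.268
    print(f"P(B{i+1} blicket | data) = "
          f"{sum(p for h, p in post.items() if h[i]):.3f}")
```

Block 1 ends high but short of certainty, and blocks 2 and 3 converge to the same intermediate value, above the 1/6 baseline yet below 0.5 – the incomplete explaining away described above.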
A logical deductive mechanism might generate the all-or-none explaining-away observed in the former case, while a bottom-up associative learning mechanism might generate the incomplete effect seen in the latter case, but only our top-down Bayesian approach naturally explains the full spectrum of one-shot causal inferences, from uncertainty to certainty.

3 Causal inference in perception

Our second case study argues for the importance of causal theories in a very different domain: perceiving the mechanics of collisions and vibrations. Michotte's [8] studies of causal perception showed that a moving ball coming to rest next to a stationary ball would be perceived as the cause of the latter's subsequent motion only if there was essentially no gap in space or time between the end of the first ball's motion and the beginning of the second ball's. The standard explanation is that people have automatic perceptual mechanisms for detecting certain kinds of physical causal relations, such as transfer of force, and these mechanisms are driven by simple bottom-up cues such as spatial and temporal proximity.

Figure 3a shows data from an experiment described in [2] which might appear to support this view. Subjects viewed a computer screen depicting a long horizontal beam. At one end of the beam was a trap door, closed at the beginning of each trial. On each trial, a heavy block was dropped onto the beam at some position x, and after some time t, the trap door opened and a ball flew out. Subjects were told that the block dropping on the beam might have jarred loose a latch that opens the door, and they were asked to judge (on a numerical scale) how likely it was that the block dropping was the cause of the door opening. The distance x and time t separating these two events were varied across trials.
Figure 3a shows that as either x or t increases, the judged probability of a causal link decreases.

Anderson [1] proposed that this judgment could be formalized as a Bayesian inference with two alternative hypotheses: h1, that a causal link exists, and h0, that no causal link exists. He suggested that the likelihood P(x, t | h1) should be a product of decreasing exponentials in space and time, exp(-αx) exp(-βt), while P(x, t | h0) would presumably be constant. This model has three free parameters – the decay constants α and β, and the prior probability P(h1) – plus multiplicative and additive scaling parameters to bring the model outputs onto the same range as the data. Figure 3c shows that this model can be adjusted to fit the broad outlines of the data, but it misses the crossover interaction: in the data, but not the model, the typical advantage of small distances x over large distances disappears and even reverses as t increases.

This crossover may reflect the presence of a much more sophisticated theory of force transfer than is captured by the spatiotemporal decay model. Figure 1b shows a causal graphical structure representing a simplified physical model of this situation. The graph is a dynamic Bayes net (DBN), enabling inferences about the system's behavior over time. There are four basic event types, each indexed by time t. The door state E(t) can be either open (e(t) = 1) or closed (e(t) = 0), and once open it stays open. There is an intrinsic source of noise Z(t) in the door mechanism, which we take to be i.i.d., zero-mean gaussian. At each time step t, the door opens if and only if the noise amplitude |z(t)| exceeds some threshold (which we take to be 1 without loss of generality). The block hits the beam at position X(0) = x, setting up a vibration in the door mechanism with energy V(t). We assume this energy decreases according to an inverse power law with the distance x between the block and the door, V(t) ∝ 1/x^ν. (We can always set the constant of proportionality to 1, absorbing it into the parameter λ below.) For simplicity, we assume that energy propagates instantaneously from the block to the door (plausible given the speed of sound relative to the distances and times used here), and that there is no vibrational damping over time (V(t) = V(0)). Anderson [2] also sketches an account along these lines, although he provides no formal model.

At time t, the door pops open; we denote this event as e(t) = 1. The likelihood of this event depends strictly on the variance of the noise Z(t) – the bigger the variance, the sooner the door should pop open. At issue is whether there exists a causal link between the vibration V(t) – caused by the block dropping – and the noise Z(t) – which causes the door to open. More precisely, we propose that causal inference is based on the probabilities of the observed opening time under the two hypotheses h1 (causal link) and h0 (no causal link).
The noise variance has some low intrinsic level σ0², and under h1 – but not h0 – it is increased by some fraction λ of the vibrational energy: σ²(t) = σ0² + λ/x^ν. We can then solve for the likelihoods P(e(t) = 1, e(t') = 0 for t' < t | x; h) analytically or through simulation. We take the limit as the intrinsic noise level σ0² → 0, leaving three free parameters, ν, λ, and P(h1), plus multiplicative and additive scaling parameters, just as in the spatiotemporal decay model. Figure 3b plots the (scaled) posterior probabilities P(h1 | x, t) for the best fitting parameter values. In contrast to the spatiotemporal decay model, the DBN model captures the crossover interaction between space and time.

This difference between the two models is fundamental, not just an accident of the parameter values chosen. The spatiotemporal decay model can never produce a crossover effect due to its functional form – separable in x and t. A crossover of some form is generic in the DBN model, because its predictions essentially follow an exponential decay function on t with a decay rate that is a nonlinear function of x. Other mathematical models with a nonseparable form could surely be devised to fit this data as well. The strength of our model lies in its combination of rational statistical inference and realistic physical motivation. These results suggest that whatever schema of force transfer is in people's brains, it must embody a more complex interaction between spatial and temporal factors than is assumed in traditional bottom-up models of causal inference, and its functional form may be a rational consequence of a rich but implicit physical theory that underlies people's instantaneous percepts of causality.
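A runnable sketch of this DBN computation (ours; the parameter values ν = 2, λ = 5, the uniform prior, and the near/far distances are arbitrary illustrations, not the paper's fitted values, and a small nonzero intrinsic variance var0 stands in for the σ0² → 0 limit so that the likelihood under h0 does not vanish numerically):

```python
from math import erf, sqrt

def p_open_per_step(var):
    """P(|Z| > 1) for Z ~ N(0, var): per-step chance the gaussian noise
    exceeds the door's threshold (fixed at 1 without loss of generality)."""
    return 1 - erf(1 / sqrt(2 * var))

def first_open_likelihood(t, var):
    """P(door stays closed for t steps and then pops open at step t)."""
    p = p_open_per_step(var)
    return (1 - p) ** t * p

def posterior_h1(x, t, nu=2.0, lam=5.0, var0=0.15, prior=0.5):
    """P(h1 | door first opens at time t, block dropped at distance x).
    Under h1 the noise variance is var0 + lam / x**nu (vibrational energy
    falling off as an inverse power of distance); under h0 it stays var0."""
    l1 = first_open_likelihood(t, var0 + lam / x ** nu)
    l0 = first_open_likelihood(t, var0)
    return prior * l1 / (prior * l1 + (1 - prior) * l0)

# Near drops favor h1 at short delays; far drops favor h1 at long delays.
for t in [1, 3, 10, 30]:
    print(f"t={t:2d}: P(h1 | near, x=1) = {posterior_h1(1.0, t):.3f}, "
          f"P(h1 | far, x=3) = {posterior_h1(3.0, t):.3f}")
```

Because the per-step opening probability under h1 is a nonlinear function of x, the posterior curves for near and far drops cross as t grows – exactly the interaction that a likelihood separable in x and t can never produce.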
It is an interesting open question whether human observers can use this knowledge only by carrying out an online simulation in parallel with their observations, or can access it in a "compiled" form to interpret bottom-up spatiotemporal cues without the need to conduct any explicit internal simulations.

4 Conclusion

In two case studies, we have explored how people make rapid inferences about the causal texture of their environment. We have argued that these inferences can be explained best as Bayesian computations, working over hypothesis spaces strongly constrained by top-down causal theories. This framework allowed us to construct quantitative models of causal judgment – the most accurate models to date in both domains, and in the blicket detector domain, the only quantitatively predictive model to date. Our models make a number of substantive and mechanistic assumptions about aspects of the environment that are not directly accessible to human observers. From a scientific standpoint this might seem undesirable; we would like to work towards models that require the fewest number of a priori assumptions. Yet we feel there is no escaping the need for powerful top-down constraints on causal inference, in the form of intuitive theories. In ongoing work, we are beginning to study the origins of these theories themselves.
We expect that Bayesian learning mechanisms similar to those considered here will also be useful in understanding how we acquire the ingredients of theories: abstract causal principles and ontological types.

References

[1] J. R. Anderson. The Adaptive Character of Thought. Erlbaum, 1990.

[2] J. R. Anderson. Is human cognition adaptive? Behavioral and Brain Sciences, 14, 471–484, 1991.

[3] P. W. Cheng. From covariation to causation: A causal power theory. Psychological Review, 104, 367–405, 1997.

[4] A. Gopnik & C. Glymour. Causal maps and Bayes nets: A cognitive and computational account of theory-formation. In Carruthers et al. (eds.), The Cognitive Basis of Science. Cambridge, 2002.

[5] A. Gopnik & D. M. Sobel. Detecting blickets: How young children use information about causal properties in categorization and induction. Child Development, 71, 1205–1222, 2000.

[6] A. Gopnik, C. Glymour, D. M. Sobel, L. E. Schulz, T. Kushnir, & D. Danks. A theory of causal learning in children: Causal maps and Bayes nets. Psychological Review, in press.

[7] D. Heckerman. A Bayesian approach to learning causal networks. In Proc. Eleventh Conf. on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Francisco, CA, 1995.

[8] A. E. Michotte. The Perception of Causality. Basic Books, 1963.

[9] H. Pasula & S. Russell. Approximate inference for first-order probabilistic languages. In Proc. International Joint Conference on Artificial Intelligence, Seattle, 2001.

[10] J. Pearl. Causality. New York: Oxford University Press, 2000.

[11] B. Rehder. A causal-model theory of conceptual representation and categorization. Submitted for publication, 2001.

[12] D. R. Shanks. Is human learning rational? Quarterly Journal of Experimental Psychology, 48A, 257–279, 1995.

[13] D. Sobel, J. B. Tenenbaum, & A. Gopnik.
The development of causal learning based on indirect evidence: More than associations. Submitted for publication, 2002.

[14] P. Spirtes, C. Glymour, & R. Scheines. Causation, Prediction, and Search (2nd edition, revised). Cambridge, MA: MIT Press, 2001.

[15] J. B. Tenenbaum & T. L. Griffiths. Structure learning in human causal induction. In T. Leen, T. Dietterich, & V. Tresp (eds.), Advances in Neural Information Processing Systems 13. Cambridge, MA: MIT Press, 2001.

[Figure 1 graphics omitted: (a) the four hypotheses h00, h10, h01, h11 relating blocks X1, X2 to detector activation E; (b) a dynamic Bayes net over block position X(0), vibrational energy V(0)...V(n), noise Z(0)...Z(n), and door state E(0)...E(n) at times t = 0...n.]

Figure 1: Hypothesis spaces of causal Bayes nets for (a) the blicket detector and (b) the mechanical vibration domains.

[Figure 2 graphics omitted: bar charts comparing People and Bayes judgments for objects B1, B2 (and B3 in panel c) at baseline and after "12", "13", and "1 alone" trials.]

Figure 2: Human judgments and model predictions (based on Figure 1a) for one-shot backwards blocking with blickets, when blickets are (a) rare or (b) common, or (c) rare and only observed in ambiguous combinations.
Bar height represents the mean judged probability that an object has the causal power to activate the detector.

[Figure 3 graphics omitted: three panels plotting causal strength / P(h1 | T, X) against time (0.1–8.1 sec), with one curve per spatial gap X = 1, 3, 7, 15.]

Figure 3: Probability of a causal connection between two events: a block dropping onto a beam and a trap door opening. Each curve corresponds to a different spatial gap X between these events; each x-axis value to a different temporal gap T. (a) Human judgments. (b) Predictions of the dynamic Bayes net model (Figure 1b). (c) Predictions of the spatiotemporal decay model.