{"title": "Dynamical Causal Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 83, "page_last": 90, "abstract": "", "full_text": "ynamic Causal Learning\n\nDavid Danks\n\nInstitute for Human & Machine Cognition\n\nUniversity of West Florida\n\nPensacola, FL 32501\n\nddanks@ai.uwf.edu\n\nThomas L. Griffiths\n\nDepartment of Psychology\n\nStanford University\n\nStanford, CA 94305-2130\n\ngruffydd@psych.stanford.edu\n\nJoshua B. Tenenbaum\n\nDepartment of Brain & Cognitive Sciences\n\n~AIT\n\nCambridge, MA 02139\n\njbt@rnit.edu\n\nAbstract\n\ntheories of human causal\n\nfocus primarily on long-run predictions:\n\nlearning and\nCurrent psychological\ntwo by\njudgment\nestimating parameters of a causal Bayes nets (though for different\nparameterizations), and a third through structural learning. This\nshort-run behavior by examining\npaper\ndynamical versions of these three theories, and comparing their\npredictions to a real-world dataset.\n\nfocuses on people's\n\n1\n\nIntroduction\n\nCurrently active quantitative models of human causal judgment for single (and\nsometimes multiple) causes include conditional j}JJ [8], power PC [1], and Bayesian\n[9]. All of these theories have some normative\nnetwork structure learning [4],\njustification, and all can be understood rationally in terms of learning causal Bayes\nnets. The first two theories assume a parameterization for a Bayes net, and then\nperform maximum likelihood parameter estimation. Each has been the target of\nnumerous psychological studies (both confirming and disconfirming) over the past\nten years. The third theory uses a Bayesian structural score, representing the log\nlikelihood ratio in favor of the existence of a connection between the potential cause\nand effect pair. 
Recent work found that this structural score gave a generally good account, and fit data that could be fit by neither of the other two models [9].\n\nTo date, all of these models have addressed only the static case, in which judgments are made after observing all of the data (either sequentially or in summary format). Learning in the real world, however, also involves dynamic tasks, in which judgments are made after each trial (or after a small number of trials). Experiments on dynamic tasks, and theories that model human behavior in them, have received surprisingly little attention in the psychological community. In this paper, we explore dynamical variants of each of the above learning models, and compare their results to a real data set (from [7]). We focus only on the case of one potential cause, due to space and theoretical constraints, and a lack of experimental data for the multivariate case.\n\n2 Real-World Data\n\nIn the experiment on which we focus in this paper [7], people's stepwise acquisition curves were measured by asking people to determine whether camouflage makes a tank more or less likely to be destroyed. Subjects observed a sequence of cases in which the tank was either camouflaged or not, and destroyed or not. They were asked after every five cases to judge the causal strength of the camouflage on a [-100, +100] scale, where -100 and +100 respectively correspond to the potential cause always preventing or producing the effect. The learning curves, constructed from average strength ratings, are shown in Figure 1.\n\nFigure 1: Example of learning curves. Mean judgment is plotted against trial number for four conditions: positive contingent, high P(E) non-contingent, low P(E) non-contingent, and negative contingent.\n\nIn this paper, we focus on qualitative features of the learning curves. These learning curves can be divided on the basis of the actual contingencies in the experimental condition. 
There were two contingent conditions: a positive condition in which P(E | C) = .75 (the probability of the effect given the cause) and P(E | ¬C) = .25, and a negative condition in which the opposite was true. There were also two non-contingent conditions, one in which P(E) = .75 and one in which P(E) = .25, irrespective of the presence or absence of the causal variable. We refer to the former non-contingent condition as having a high P(E), and the latter as having a low P(E). There are two salient, qualitative features of the acquisition curves:\n\n1. For contingent cases, the strength rating does not immediately reach the final judgment, but rather converges to it slowly; and\n\n2. For non-contingent cases, there is an initial non-zero strength rating when the probability of the effect, P(E), is high, followed by convergence to zero.\n\n3 Parameter Estimation Theories\n\n3.1 Conditional ΔP\n\nThe conditional ΔP theory predicts that the causal strength rating for a particular factor will be (proportional to) the conditional contrast for that factor [5], [8]. The general form of the conditional contrast for a particular potential cause is given by: ΔP_C.{X} = P(E | C & X) - P(E | ¬C & X), where X ranges over the possible states of the other potential causes. So, for example, if we have two potential causes, C1 and C2, then there are two conditional contrasts for C1: ΔP_C1.{C2} = P(E | C1 & C2) - P(E | ¬C1 & C2) and ΔP_C1.{¬C2} = P(E | C1 & ¬C2) - P(E | ¬C1 & ¬C2). Depending on the probability distribution, some conditional contrasts for a potential cause may be undefined, and the defined contrasts for a particular variable may not agree. The conditional ΔP theory only makes predictions about a potential cause when the underlying probability distribution is \"well-behaved\": at least one of the conditional contrasts for the factor is defined, and all of the defined conditional contrasts for the factor are equal. 
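In code, the conditional contrast for a single potential cause reduces to a difference of two conditional frequencies estimated from trial counts. A minimal sketch in Python (the function and variable names are our own, not from the paper):

```python
def conditional_delta_p(trials):
    """Estimate delta-P = P(E | C) - P(E | not-C) from (cause, effect) pairs.

    `trials` is a sequence of (c, e) tuples with c, e in {0, 1}.
    Returns None when either conditional probability is undefined.
    """
    n_c = sum(1 for c, e in trials if c)   # trials with the cause present
    n_nc = len(trials) - n_c               # trials with the cause absent
    if n_c == 0 or n_nc == 0:
        return None                        # one contrast term is undefined
    p_e_c = sum(1 for c, e in trials if c and e) / n_c
    p_e_nc = sum(1 for c, e in trials if not c and e) / n_nc
    return p_e_c - p_e_nc

# Positive contingent condition from the experiment: P(E|C) = .75, P(E|not-C) = .25
trials = [(1, 1)] * 3 + [(1, 0)] * 1 + [(0, 1)] * 1 + [(0, 0)] * 3
print(conditional_delta_p(trials))  # 0.5
```

Returning None when a contrast is undefined mirrors the "well-behaved" requirement above: with no cause-absent trials, the theory simply makes no prediction.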
For a single cause-effect relationship, calculation of the ΔP value is a maximum likelihood parameter estimator assuming that the cause and the background combine linearly to predict the effect [9].\n\nAny long-run learning model can model sequential data by being applied to all of the data observed up to a particular point. That is, after observing n datapoints, one simply applies the model, regardless of whether n is \"the long-run.\" The behavior of such a strategy for the conditional ΔP theory is shown in Figure 2 (a), and it clearly fails to model the above on-line learning curves accurately. There is no gradual convergence to asymptote in the contingent cases, nor is there differential behavior in the non-contingent cases.\n\nAn alternative dynamical model is the Rescorla-Wagner model [6], which has essentially the same form as the well-known delta rule used for training simple neural networks. The R-W model has been shown to converge to the conditional ΔP value in exactly the situations in which the ΔP theory makes a prediction [2]. The R-W model follows a similar statistical logic as the ΔP theory: ΔP gives the maximum likelihood estimates in closed form, and the R-W model essentially implements gradient ascent on the log-likelihood surface, as the delta rule has been shown to do. The R-W model produces learning curves that qualitatively fit the learning curves in Figure 1, but suffers from other serious flaws. For example, suppose a subject is presented with trials of A, C, and E, followed by trials with only A and E. In such a task, called backwards blocking, the R-W model predicts that C should be viewed as moderately causal, but human subjects rate C as non-causal.\n\nIn the augmented R-W model [10], causal strength estimates (denoted by V_i, and assumed to start at zero) change after each observed case. 
Assuming that δ(X) = 1 if X occurs on a particular trial, and 0 otherwise, strength estimates change by the following equation:\n\nΔV_i = α_i β (λ δ(E) - Σ_j δ(C_j) V_j)\n\nα_i0 and α_i1 are rate parameters (saliences) applied when C_i is present and absent, respectively (so α_i takes whichever value matches the trial), and β_0 and β_1 are the rate parameters when E is present and absent, respectively. By updating the causal strengths of absent potential causes, this model is able to explain many of the phenomena that escape the normal R-W model, such as backwards blocking.\n\nAlthough the augmented R-W model does not always have the same asymptotic behavior as the regular R-W model, it does have the same asymptotic behavior in exactly those situations in which the conditional ΔP theory makes a prediction (under typical assumptions: α_i0 = -α_i1, β_0 = β_1, and λ = 1) [2]. To determine whether the augmented R-W model also captures the qualitative features of people's dynamic learning, we performed a simulation in which 1000 simulated individuals were shown randomly ordered cases that matched the probability distributions used in [7]. The model parameter values were λ = 1.0, α_00 = 0.4, α_10 = 0.7, α_11 = -0.2, β_0 = β_1 = 0.5, with two learned parameters: V_0 for the always-present background cause C_0, and V_1 for the potential cause C_1. The mean values of V_1, multiplied by 100 to match the scale of Figure 1, are shown in Figure 2 (b).\n\nFigure 2: Modeling results. 
(a) is the maximum-likelihood estimate of ΔP, (b) is the augmented R-W model, (c) is the maximum-likelihood estimate of causal power, (d) is the analogue of the augmented R-W model for causal power, (e) shows the Bayesian strength estimate with a uniform prior on all parameters, and (f) does likewise with a beta(1,5) prior on V_0. The line markers follow the conventions of Figure 1.\n\nVariations in λ only change the response scale. Higher values of α_00 (the salience of the background) shift all early values of the learning curves downward, but do not affect the asymptotic values. The initial non-zero values for the non-contingent cases are proportional in size to (α_10 + α_11), and so if the absence of the cause is more salient than the presence, the initial non-zero value will actually be negative. Raising the β values increases the speed of convergence to asymptote, and the absolute values of the contingent asymptotes decrease in proportion to (β_0 - β_1).\n\nFor the chosen parameter values, the learning curves for the contingent cases both gradually curve towards an asymptote, and in the non-contingent, high P(E) case, there is an initial non-zero rating. Despite this qualitative fit and its computational simplicity, the augmented R-W model does not have a strong rational motivation. Its only rational justification is that it is a consistent estimator of ΔP: in the limit of infinite data, it converges to ΔP under the same circumstances that the regular (and well-motivated) R-W model does. But it does not seem to have any of the other properties of a good statistical estimator: it is not unbiased, nor does it seem to be a maximum likelihood or gradient-ascent-on-log-likelihood algorithm (indeed, it sometimes appears to descend in likelihood). 
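The augmented R-W dynamics described above can be simulated directly. A minimal sketch, assuming the update form in which absent cues change by their own (typically negative) rate; the helper names and trial-generation scheme are ours, while the parameter values follow the simulation described above:

```python
import random

def augmented_rw_trial(V, present, effect, alpha0, alpha1, beta0, beta1, lam=1.0):
    """One augmented Rescorla-Wagner update over all cues.

    V: dict cue -> current strength; present: set of cues on this trial;
    effect: 1 if E occurred. Absent cues update with rate alpha1 (negative).
    """
    beta = beta0 if effect else beta1
    # Prediction error is computed from the cues present on the trial.
    error = lam * effect - sum(V[c] for c in present)
    for c in V:
        rate = alpha0[c] if c in present else alpha1[c]
        V[c] += rate * beta * error
    return V

random.seed(0)
V = {"C0": 0.0, "C1": 0.0}            # background C0 is always present
alpha0 = {"C0": 0.4, "C1": 0.7}       # saliences when present
alpha1 = {"C0": 0.0, "C1": -0.2}      # saliences when absent (C0 entry unused)
for _ in range(40):                   # positive contingent condition
    c1 = random.random() < 0.5
    present = {"C0"} | ({"C1"} if c1 else set())
    effect = 1 if random.random() < (0.75 if c1 else 0.25) else 0
    augmented_rw_trial(V, present, effect, alpha0, alpha1, 0.5, 0.5)
# Averaged over many simulated learners, V["C1"] tends toward delta-P = 0.5 [2].
```

A single run is noisy; the curves in Figure 2 (b) correspond to averaging V_1 over many such simulated individuals.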
This raises the question of whether there might be an alternative dynamical model of causal learning that produces the appropriate learning curves but is also a principled, rational statistical estimator.\n\n3.2 Power PC\n\nIn Cheng's power PC theory [1], causal strength estimates are predicted to be (proportional to) perceived causal power: the (unobserved) probability that the potential cause, in the absence of all other causes, will produce the effect. Although causal power cannot be directly observed, it can be estimated from observed statistics given some assumptions. The power PC theory predicts that, when the assumptions are believed to be satisfied, causal power for (potentially) generative or preventive causes will be estimated by the following equations:\n\nGenerative: p = ΔP_C / (1 - P(E | ¬C))    Preventive: p = -ΔP_C / P(E | ¬C)\n\nBecause the power PC theory focuses on the long run, one can easily determine which equation to use: simply wait until asymptote, determine ΔP_C, and then divide by the appropriate factor. Similar equations can also be given for interactive causes. Note that although the preventive causal power equation yields a positive number, we should expect people to report a negative rating for preventive causes.\n\nAs with the ΔP theory, the power PC theory can, in the case of a single cause-effect pair, also be seen as a maximum likelihood parameter estimator [9].\n\n4 Bayesian Structural Inference\n\nA third approach treats learning as Bayesian inference over causal structures: a hypothesis h+ in which the potential cause generates the effect, h- in which it prevents the effect, and h0 in which there is no causal relationship. The causal strength estimate is then the expectation of the strength parameter V_1 over both structures and parameters:\n\nE[V_1] = Σ_{h ∈ H} ∫ V_1 P(V_1 | h, D) P(h | D) dV_1\n\nwhere H = {h+, h0, h-}. The effective value of the strength parameter is 0 in the model where there is no relationship between cause and effect, and should be negative for preventive causes. We thus have:\n\nE[V_1] = P(h+ | D)μ+ - P(h- | D)μ-\n\nwhere μ+ and μ- are the posterior means of V_1 under h+ and h- respectively.\n\nWhile this theory is appealing from a rational and statistical point of view, it has computational drawbacks. 
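To make the computation concrete, the quantities P(h+ | D), μ+, P(h- | D), and μ- can be approximated by brute-force grid integration. A rough sketch under our own implementation assumptions (a noisy-OR parameterization for h+, noisy-AND-NOT for h-, uniform priors over models and parameters, and a coarse grid; these details are illustrative, not prescribed by the text):

```python
import itertools

def likelihood(h, w0, w1, trials):
    """P(D | w0, w1, h) for (cause, effect) trials under three structure hypotheses."""
    L = 1.0
    for c, e in trials:
        if h == "h+":            # generative cause: noisy-OR with background w0
            p = 1 - (1 - w0) * (1 - w1) ** c
        elif h == "h-":          # preventive cause: noisy-AND-NOT
            p = w0 * (1 - w1) ** c
        else:                    # h0: no causal link
            p = w0
        L *= p if e else (1 - p)
    return L

def bayes_strength(trials, n_grid=50):
    """E[V1] = P(h+|D)*mu+ - P(h-|D)*mu-, via grid integration, uniform priors."""
    grid = [(i + 0.5) / n_grid for i in range(n_grid)]   # midpoint grid on (0, 1)
    post, mean = {}, {}
    for h in ("h+", "h0", "h-"):
        z = m = 0.0
        for w0, w1 in itertools.product(grid, repeat=2):
            L = likelihood(h, w0, w1, trials)
            z += L                 # unnormalized P(D | h)
            m += w1 * L            # unnormalized E[w1 * P(D | w, h)]
        post[h], mean[h] = z, m
    total = sum(post.values())     # uniform model prior cancels in the ratio
    return (mean["h+"] - mean["h-"]) / total

pos_trials = [(1, 1)] * 6 + [(1, 0)] * 2 + [(0, 1)] * 2 + [(0, 0)] * 6
print(bayes_strength(pos_trials) > 0)  # True: net evidence for a generative cause
```

Even this toy version evaluates the likelihood over the full parameter grid for every hypothesis, illustrating why the exact computation scales poorly.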
All four terms in the above expression are quite computationally intensive to compute, and they require an amount of information that increases exponentially with the number of causes. Furthermore, the number of different hypotheses we must consider grows exponentially with the number of potential causes, limiting the model's applicability to multivariate cases.\n\nWe applied this model to the data of [7], using a uniform prior over models, and also over parameters. The results, averaged across 200 random orderings of trials, are shown in Figure 2 (e). The predictions are somewhat symmetric with respect to positive and negative contingencies and high and low P(E). This symmetry is a consequence of choosing a uniform (i.e., strongly uninformative) prior for the parameters. If we instead take a uniform prior on V_1 and a beta(1,5) prior on V_0, consistent with a prior belief that effects occur only rarely without an observed cause and similar to starting with zero weights in the algorithms presented above, we obtain the results shown in Figure 2 (f). In both cases, the curvature of the learning curves is a consequence of structural uncertainty, and the asymptotic values reflect the strength of causal relationships. In the contingent cases, the probability distribution over structures rapidly transfers all of its mass to the correct hypothesis, and the result asymptotes at the posterior mean of V_1 in that model, which will be very close to causal power. The initial non-zero ratings in the non-contingent cases result from h+ giving a slightly better account of the data than h-, essentially due to the non-uniform prior on V_0.\n\nThis structural account is only one means of understanding the rational basis for these learning curves. Dayan and Kakade [3] provide a statistical theory of classical conditioning based on Bayesian estimation of the parameters in a linear model similar to that underlying ΔP. 
Their theory accounts for phenomena that the classical R-W theory does not, such as backwards blocking. They also give a neural network learning model that approximates the Bayesian estimate, and that closely resembles the augmented R-W model considered here. Their network model can also produce the learning curves discussed in this paper. However, because it is based on a linear model of causal interaction, it is not a good candidate for modeling human causal judgments, which across various studies of asymptotic behavior seem to be more closely approximated by parameter estimates in noisy logic gates, as instantiated in the power PC model [1] and our Bayesian model.\n\n5 Conclusion\n\nIn this paper, we have outlined a range of dynamical models, from computationally simple ones (such as simply applying conditional ΔP to the observed datapoints) to rationally grounded ones (such as Bayesian structure/parameter estimation). Moreover, there seems to be a tension in this domain between developing a model that is easily implemented in an individual and scales well with additional variables, and one that has a rational statistical basis. Part of our effort here has been aimed at providing a set of models that seem to explain human behavior equally well, but that have different virtues beyond their fit with the data. Human causal learning might not scale up well, or it might not be rational; further discrimination among these possible theories awaits additional data about causal learning curves.\n\nReferences\n\n[1] Cheng, Patricia W. 1997. \"From Covariation to Causation: A Causal Power Theory.\" Psychological Review, 104 (2): 367-405.\n\n[2] Danks, David. Forthcoming. \"Equilibria of the Rescorla-Wagner Model.\" Journal of Mathematical Psychology.\n\n[3] Dayan, Peter, & Kakade, Sham. 2001. \"Explaining Away in Weight Space.\" In Advances in Neural Information Processing Systems 13.\n\n[4] Gopnik, Alison, Clark Glymour, David M. Sobel, Laura E. 
Schulz, Tamar Kushnir, & David Danks. 2002. \"A Theory of Causal Learning in Children: Causal Maps and Bayes Nets.\" Submitted to Psychological Review.\n\n[5] Lober, Klaus, & David R. Shanks. 2000. \"Is Causal Induction Based on Causal Power? Critique of Cheng (1997).\" Psychological Review, 107 (1): 195-212.\n\n[6] Rescorla, Robert A., & Allan R. Wagner. 1972. \"A Theory of Pavlovian Conditioning: Variations in the Effectiveness of Reinforcement and Nonreinforcement.\" In A. H. Black & W. F. Prokasy, eds. Classical Conditioning II: Current Research and Theory. New York: Appleton-Century-Crofts. pp. 64-99.\n\n[7] Shanks, David R. 1995. \"Is Human Learning Rational?\" The Quarterly Journal of Experimental Psychology, 48A (2): 257-279.\n\n[8] Spellman, Barbara A. 1996. \"Conditionalizing Causality.\" In D. R. Shanks, K. J. Holyoak, & D. L. Medin, eds. Causal Learning: The Psychology of Learning and Motivation, Vol. 34. San Diego, Calif.: Academic Press. pp. 167-206.\n\n[9] Tenenbaum, Joshua B., & Thomas L. Griffiths. 2000. \"Structure Learning in Human Causal Induction.\" In Advances in Neural Information Processing Systems 13.\n\n[10] Van Hamme, Linda J., & Edward A. Wasserman. 1994. \"Cue Competition in Causality Judgments: The Role of Nonpresentation of Compound Stimulus Elements.\" Learning and Motivation, 25: 127-151.", "award": [], "sourceid": 2158, "authors": [{"given_name": "David", "family_name": "Danks", "institution": null}, {"given_name": "Thomas", "family_name": "Griffiths", "institution": null}, {"given_name": "Joshua", "family_name": "Tenenbaum", "institution": null}]}