{"title": "Predicting the Optimal Spacing of Study: A Multiscale Context Model of Memory", "book": "Advances in Neural Information Processing Systems", "page_first": 1321, "page_last": 1329, "abstract": "When individuals learn facts (e.g., foreign language vocabulary) over multiple study sessions, the temporal spacing of study has a significant impact on memory retention. Behavioral experiments have shown a nonmonotonic relationship between spacing and retention: short or long intervals between study sessions yield lower cued-recall accuracy than intermediate intervals. Appropriate spacing of study can double retention on educationally relevant time scales. We introduce a Multiscale Context Model (MCM) that is able to predict the influence of a particular study schedule on retention for specific material. MCM's prediction is based on empirical data characterizing forgetting of the material following a single study session. MCM is a synthesis of two existing memory models (Staddon, Chelaru, & Higa, 2002; Raaijmakers, 2003). On the surface, these models are unrelated and incompatible, but we show they share a core feature that allows them to be integrated. MCM can determine study schedules that maximize the durability of learning, and has implications for education and training. MCM can be cast either as a neural network with inputs that fluctuate over time, or as a cascade of leaky integrators. MCM is intriguingly similar to a Bayesian multiscale model of memory (Kording, Tenenbaum, & Shadmehr, 2007), yet MCM is better able to account for human declarative memory.", "full_text": "Predicting the Optimal Spacing of Study: A Multiscale Context Model of Memory

Michael C. Mozer⋆, Harold Pashler†, Nicholas Cepeda◦, Robert Lindsey⋆, & Ed Vul‡

⋆Dept. of Computer Science, University of Colorado
†Dept. of Psychology, UCSD
◦Dept. of Psychology, York University
‡Dept.
of Brain and Cognitive Sciences, MIT

Abstract

When individuals learn facts (e.g., foreign language vocabulary) over multiple study sessions, the temporal spacing of study has a significant impact on memory retention. Behavioral experiments have shown a nonmonotonic relationship between spacing and retention: short or long intervals between study sessions yield lower cued-recall accuracy than intermediate intervals. Appropriate spacing of study can double retention on educationally relevant time scales. We introduce a Multiscale Context Model (MCM) that is able to predict the influence of a particular study schedule on retention for specific material. MCM's prediction is based on empirical data characterizing forgetting of the material following a single study session. MCM is a synthesis of two existing memory models (Staddon, Chelaru, & Higa, 2002; Raaijmakers, 2003). On the surface, these models are unrelated and incompatible, but we show they share a core feature that allows them to be integrated. MCM can determine study schedules that maximize the durability of learning, and has implications for education and training. MCM can be cast either as a neural network with inputs that fluctuate over time, or as a cascade of leaky integrators. MCM is intriguingly similar to a Bayesian multiscale model of memory (Kording, Tenenbaum, & Shadmehr, 2007), yet MCM is better able to account for human declarative memory.

1 Introduction

Students often face the task of memorizing facts such as foreign language vocabulary or state capitals. To retain such information for a long time, students are advised not to cram their study, but rather to study over multiple, well-spaced sessions.
This advice is based on a memory phenomenon known as the distributed practice or spacing effect (Cepeda, Pashler, Vul, Wixted, & Rohrer, 2006). The spacing effect is typically studied via a controlled experimental paradigm in which participants are asked to study unfamiliar paired associates (e.g., English-Japanese vocabulary) in two sessions. The time between sessions, known as the intersession interval or ISI, is manipulated across participants. Some time after the second study session, a cued-recall test is administered to the participants, e.g., “What is ‘rabbit’ in Japanese?” The lag between the second session and the test is known as the retention interval or RI.

Recall accuracy as a function of ISI follows a characteristic curve. The solid line of Figure 1a sketches this curve, which we will refer to as the spacing function. The left edge of the graph corresponds to massed practice, when session two immediately follows session one. Recall accuracy rises dramatically as the ISI increases, reaches a peak, and falls off gradually. The ISI corresponding to the peak—the optimal ISI—depends strongly on RI: a meta-analysis by Cepeda et al. (2006) suggests a power-law relationship. The optimal ISI almost certainly depends on the specific materials being studied and the manner of study as well.

Figure 1: (a) The spacing function (solid line) depicts recall at test following two study sessions separated by a given ISI; the forgetting function (dashed line) depicts recall as a function of the lag between study and test. (b) A sketch of the Multiscale Context Model.
For educationally relevant RIs on the order of weeks and months, the effect of spacing can be tremendous: optimal spacing can double retention over massed practice (Cepeda et al., in press).

The spacing function is related to another observable measure of retention, the forgetting function, which characterizes recall accuracy following a single study session as a function of the lag between study and test. For example, suppose participants in the experiment described above learned material in study session 1, and were then tested on the material immediately prior to study session 2. As the ISI increased, session 1 memories would decay. This decay is shown in the dashed line of Figure 1a. Typical forgetting functions follow a generalized power-law decay, of the form P(recall) = A(1 + Bt)^(−C), where A, B, and C are constants, and t is the study-test lag (Wixted & Carpenter, 2007).

Our goal is to develop a model of long-term memory that characterizes the memory-trace strength of items learned over two or more sessions. The model predicts recall accuracy as a function of the RI, taking into account the study schedule—the ISI or set of ISIs determining the spacing of study sessions. We would like to use this model to prescribe the optimal study schedule.

The spacing effect is among the best known phenomena in cognitive psychology, and many theoretical explanations have been suggested. Two well developed computational models of human memory have been elaborated to explain the spacing effect (Pavlik & Anderson, 2005; Raaijmakers, 2003). These models are necessarily complex: the brain contains multiple, interacting memory systems whose decay and interference characteristics depend on the specific content being stored and its relationship to other content.
Consequently, these computational theories are fairly flexible and can provide reasonable post-hoc fits to spacing effect data, but we question their predictive value.

Rather than developing a general theory of memory, we introduce a model that specifically predicts the shape of the spacing function. Because the spacing function depends not only on the RI, but also on the nature of the material being learned, and the manner and amount of study, the model requires empirical constraints. We propose a novel approach to obtaining a predictive model: we collect behavioral data to determine the forgetting function for the specific material being learned. We then use the forgetting function, which is based on a single study session, to predict the spacing function, which is based on two or more study sessions. Such a predictive model has significant implications for education and training. The model can be used to search for the ISI or set of ISIs that maximizes expected recall accuracy for a fixed RI. Although the required RI is not known in practical settings, one can instead optimize over RI as a random variable with an assumed distribution.

2 Accounts of the spacing effect

We review two existing theories proposed to explain the spacing effect, and then propose a synthesis of these theories. The two theories appear to be unrelated and mutually exclusive on the surface, but in fact share a core unifying feature. In contrast to most modeling work appearing in the NIPS volumes, our model is cast at Marr's implementation level, not at the level of a computational theory. However, after introducing our model and showing its predictive power, we discuss an intriguingly similar Bayesian theory of memory adaptation (Kording et al., 2007).
Although our model has a strong correspondence with the Bayesian model, their points of difference seem to be crucial for predicting behavioral phenomena of human declarative memory.

2.1 Encoding-variability theories

One class of theories proposed to explain the spacing effect focuses on the notion of encoding variability. According to these theories, when an item is studied, a memory trace is formed that incorporates the current psychological context. Psychological context includes conditions of study, internal state of the learner, and recent experiences of the learner. Retrieval of a stored item depends at least in part on the similarity of the contexts at study and test. If psychological context is assumed to fluctuate randomly over time, two study sessions close together in time will have similar contexts. Consequently, at the time of a recall test, either both study contexts will match the test context or neither will. Increasing the ISI can thus prove advantageous because the test context will have higher likelihood of matching one study context or the other. Greater contextual variation enhances memory on this account by making for less redundancy in the underlying memory traces. However, increasing the ISI also incurs a retrieval cost because random drift makes the first-study context increasingly less likely to match the test context. The optimal ISI depends on the tradeoff between the retrieval benefit and cost at test.

Raaijmakers (2003) developed an encoding variability theory by incorporating time-varying contextual drift into the well-known Search of Associative Memory (SAM) model (Raaijmakers & Shiffrin, 1981), and explained a range of data from the spacing literature. In this model, the contextual state is characterized by a high-dimensional binary vector.
Each element of the vector indicates the presence or absence of a particular contextual feature. The contextual state evolves according to a stochastic process in which features flip from absent to present at rate π_01 and from present to absent at rate π_10. If the context is sampled at two points in time with lag Δt, the probability that a contextual feature will be present at both times is

P(feature present at time t and t + Δt) = β² + β(1 − β) exp(−Δt/τ),   (1)

where τ ≡ 1/(π_01 + π_10) and β ≡ π_01 τ is the expected proportion of features present at any instant.

To assist in understanding the mechanisms of SAM, we find it useful to recast the model as a neural network. The input layer to this neural net is a pool of binary valued neurons that represent the contextual state at the current time; the output layer consists of a set of memory elements, one per item to be stored. To simplify notation throughout this paper, we'll describe this model and all others in terms of a single-item memory, allowing us to avoid an explicit index term for the item being stored or retrieved. The memory element for the item under consideration has an activation level, m, which is a linear function of the context unit activities: m = Σ_j w_j c_j, where c_j is the binary activation level of context unit j and w_j is the strength of connection from context unit j. The probability of retrieval of the item is assumed to be monotonically related to m.

When an item is studied, its connection strengths are adjusted according to a Hebbian learning rule with an upper limit on the connection strength:

Δw_j = min(1 − w_j, c_j m̂),   (2)

where m̂ = 1 if the item was just presented for study, or 0 otherwise. When an item is studied, the weights for all contextual features present at the time of study will be strengthened.
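The drift process behind Equation 1 is easy to check by simulation. The sketch below simulates a single binary contextual feature as a two-state continuous-time process and compares a Monte-Carlo estimate against Equation 1; the flip rates and lag are illustrative values we chose, not parameters from the paper.

```python
import math
import random

def joint_present_prob(pi01, pi10, dt, n_trials=100_000):
    """Monte-Carlo estimate of P(feature present at t and t + dt) for one
    binary contextual feature that flips 0 -> 1 at rate pi01 and 1 -> 0 at
    rate pi10, started from its stationary distribution."""
    beta = pi01 / (pi01 + pi10)      # stationary P(feature present)
    hits = 0
    for _ in range(n_trials):
        state = 1 if random.random() < beta else 0
        if state == 0:
            continue                  # the joint event requires "present" at t
        elapsed = 0.0
        while True:
            rate = pi10 if state == 1 else pi01
            elapsed += random.expovariate(rate)
            if elapsed >= dt:
                break
            state = 1 - state         # the feature flips
        hits += state                 # also present at t + dt?
    return hits / n_trials

random.seed(0)
pi01, pi10, dt = 0.3, 0.7, 1.5       # illustrative rates and lag
tau = 1.0 / (pi01 + pi10)
beta = pi01 * tau
predicted = beta ** 2 + beta * (1 - beta) * math.exp(-dt / tau)   # Equation 1
estimated = joint_present_prob(pi01, pi10, dt)
```

With 100,000 trials the Monte-Carlo estimate agrees with Equation 1 to within sampling noise, which illustrates why a longer lag drives the overlap between two contextual snapshots down toward the chance level β².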
Later retrieval is more likely if the context at test matches the context at study: the memory element receives a contribution only when an input is active and its connection strength is nonzero. Thus, after a single study and lag Δt, retrieval probability is directly related to Equation 1. When an item has been studied twice, retrieval will be more robust if the two study opportunities strengthen different weights, which occurs when the ISI is large and the contextual states do not overlap significantly.

One other feature of SAM is crucial for explaining spacing-effect data. After an item has been studied at least once, SAM assumes that the memory trace resulting from further study is influenced by whether the item is accessible to retrieval at the time of study. Specifically, SAM assumes that the weights have effectively decayed to zero if recall fails. Other memory models similarly claim that memory traces are weaker if an item is inaccessible to retrieval at the time of study (e.g., Pavlik & Anderson, 2005), which we label as the retrieval-dependent update assumption.

We have described the key components of SAM that explain the spacing effect, but the model has additional complexity, including a short-term memory store, inter-item interference, and additional context based on associativity and explicit cues. Even with all this machinery, SAM has a serious limitation. Spacing effects occur on many time scales (Cepeda et al., 2006). SAM can explain effects on any one time scale (e.g., hours), but the same model cannot explain spacing effects on a different time scale (e.g., months). The reason is essentially that the exponential decay in context overlap bounds the time scale at which the model operates.

2.2 Predictive-utility theories

We now turn to another class of theories that has been proposed to explain the spacing effect.
These theories, which we will refer to as predictive-utility theories, are premised on the assumption that memory is limited in capacity and/or is imperfect and allows intrusions. To achieve optimal performance, memories should therefore be erased if they are not likely to be needed in the future. Anderson and Milson (1989) proposed a rational analysis of memory from which they estimated the future need probability of a stored trace. When an item is studied multiple times with a given ISI, the rational analysis suggests that the need probability drops off rapidly following the last study once an interval of time greater than the ISI has passed. Consequently, increasing the ISI should lead to a more persistent memory trace. Although this analysis yields a reasonable qualitative match to spacing-effect data, no attempt was made to make quantitative predictions.

The notion of predictive utility is embedded in the multiple time-scale or MTS model of Staddon et al. (2002). In MTS, each item to be stored is represented by a dedicated cascade of N leaky integrators. The activation of integrator i, x_i, decays over time according to:

x_i(t + Δt) = x_i(t) exp(−Δt/τ_i),   (3)

where τ_i is the decay time constant. The probability of retrieving the item is related to the total trace strength, s_N, where s_k = Σ_{j=1..k} x_j. The integrators are ordered from shortest to longest time constant, i.e., τ_i < τ_{i+1} for all i. When an item is studied, the integrators receive a bump in activity according to a cascaded error-correction update,

Δx_i = ε max(0, 1 − s_{i−1}),   (4)

which is based on the idea that an integrator at some time scale τ_i receives a boost only if integrators at shorter time scales fail to represent the item at the time it is studied. The constant ε is a step size.
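A minimal sketch of the MTS dynamics (Equations 3 and 4) makes the cascade concrete. The time constants and schedules below are our own illustrative choices, and we adopt one particular reading of the cascaded update (s_{i−1} is computed on the just-decayed activities); this is a sketch of the mechanism, not the authors' implementation.

```python
import math

def mts_trace(study_times, taus, eps=1.0):
    """Integrator activities after a sequence of study events, following
    Staddon et al. (2002): between studies each x_i decays as
    x_i(t + dt) = x_i(t) * exp(-dt / tau_i)  (Equation 3),
    and at a study event x_i is bumped by eps * max(0, 1 - s_{i-1})
    (Equation 4), where s_k = x_1 + ... + x_k."""
    x = [0.0] * len(taus)
    prev = None
    for t in study_times:
        if prev is not None:
            x = [xi * math.exp(-(t - prev) / tau) for xi, tau in zip(x, taus)]
        s, bumps = 0.0, []
        for xi in x:
            bumps.append(eps * max(0.0, 1.0 - s))  # boost only if shorter
            s += xi                                #   scales under-represent
        x = [xi + b for xi, b in zip(x, bumps)]
        prev = t
    return x

def trace_strength(x, taus, delay):
    """Total trace strength s_N after a further retention delay."""
    return sum(xi * math.exp(-delay / tau) for xi, tau in zip(x, taus))

taus = [1.0, 10.0, 100.0]                 # short to long time constants
massed = mts_trace([0.0, 0.1, 0.2], taus)
spaced = mts_trace([0.0, 20.0, 40.0], taus)
```

Under these assumptions, the spaced schedule leaves the stronger trace at a long retention delay, whereas the massed schedule wins at a very short delay, echoing the interaction of spacing with RI described in the text.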
When an item is repeatedly presented for study with short ISIs, the trace can successfully be represented by the integrators with short time constants, and consequently, the trace will decay rapidly. Increasing the spacing shifts the representation to integrators with slower decay rates.

MTS was designed to explain rate-sensitive habituation data from the animal learning literature: the fact that recovery following spaced stimuli is slower than following massed. We tried fitting MTS to human-memory data and were unable to obtain quantitatively accurate fits.

3 The multiscale context model (MCM)

SAM and MTS are motivated by quite different considerations, and appear to be unrelated mechanisms. Nonetheless, they share a fundamental property: both suppose an exponential decay of internal representations over time (compare Equations 1 and 3). When we establish a correspondence between the mechanisms in SAM and MTS that produce exponential decay, we obtain a synthesis of the two models that incorporates features of each. Essentially, we take from SAM the notion of contextual drift and retrieval-dependent update, and from MTS the multiscale representation and the cascaded error-correction memory update, and we obtain a new model which we call the Multiscale Context Model or MCM. MCM can be described as a neural network whose input layer consists of N pools of time-varying context units. Units in pool i operate with time constant τ_i. The relative size of pool i is γ_i. MCM is thus like SAM with multiple pools of context units. MCM can also be described in terms of N leaky integrators, where integrator i has time constant τ_i and activity scaled by γ_i.
MCM is thus like MTS with the addition of scaling factors.

Before formally describing MCM, we detour to explain the choice of the parameters {τ_i} and {γ_i}. As the reader might infer from our description of SAM and MTS, these parameters characterize memory decay, extending Equation 3 such that the total trace strength at time t is defined as:

s_N(t) = Σ_{i=1..N} γ_i exp(−t/τ_i) x_i(0).

If x_i(0) = 1 for all i—which is the integrator activity following the first study in MTS—the trace strength as a function of time is a mixture of exponentials. To match the form of human forgetting (Figure 1), this mixture must approximate a power function. We can show that a generalized power function can be exactly expressed as an infinite mixture of exponentials:

A(1 + Bt)^(−C) = A ∫_0^∞ Inv-Gamma(τ; C, 1) exp(−Bt/τ) dτ,

where Inv-Gamma(τ; C, 1) is the inverse-gamma probability density function with shape parameter C and scale 1, and the equality is valid for t ≥ 0 and C > 0. We have identified several finite mixture-of-exponential formulations that empirically yield an extremely good approximation to arbitrary power functions over ten orders of magnitude. The formulation we prefer defines τ_i and γ_i in terms of four primitive parameters:

τ_i = μ ν^i and γ_i = ω ξ^i / Σ_{j=1..N} ξ^j.   (5)

With ν > 1 and ξ < 1, the higher-order components (i.e., larger indices) represent exponentially longer time scales with exponentially smaller weighting. As a result, truncating higher-order mixture components has little impact on the approximation on shorter time scales.
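The infinite-mixture identity above can be verified numerically. Substituting u = 1/τ turns the inverse-gamma density into a Gamma(C, 1) density, so the integral is the gamma distribution's Laplace transform evaluated at Bt. The sketch below checks the identity by simple quadrature; the constants A, B, C are illustrative, not values fitted to any forgetting data.

```python
import math

def power_law(t, A, B, C):
    """Generalized power function A * (1 + B*t)**(-C)."""
    return A * (1.0 + B * t) ** (-C)

def exp_mixture(t, A, B, C, n=20_000, u_max=50.0):
    """Numerical evaluation of A * integral_0^inf Inv-Gamma(tau; C, 1) *
    exp(-B*t/tau) dtau. With the substitution u = 1/tau the integrand
    becomes u**(C-1) * exp(-u) / Gamma(C) * exp(-B*t*u), which we sum
    with the midpoint rule on a fine grid over [0, u_max]."""
    du = u_max / n
    total = 0.0
    for k in range(n):
        u = (k + 0.5) * du
        total += u ** (C - 1.0) * math.exp(-(1.0 + B * t) * u) * du
    return A * total / math.gamma(C)

A, B, C = 1.0, 0.2, 1.5   # illustrative constants
errors = [abs(power_law(t, A, B, C) - exp_mixture(t, A, B, C))
          for t in (0.0, 1.0, 10.0, 100.0)]
```

The quadrature matches the power function to a few decimal places over lags spanning several orders of magnitude, which is the property that lets a bank of exponentially decaying integrators mimic power-law forgetting.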
Consequently, we simply need to pick a value of N that allows for a representation of many orders of magnitude of time. Given N and human forgetting data collected in an experiment, we can search for the parameters {μ, ν, ω, ξ} that obtain a least squares fit to the data. Given the human forgetting function, then, we can completely determine the {τ_i} and {γ_i}. In all simulation results we report, we fixed N = 100, although equivalent results are obtained for N = 50 or N = 200.

3.1 Casting MCM as a cascade of leaky integrators

Assume that—as in MTS—a dedicated set of N leaky integrators holds the memory of each item to be learned. Let x_i denote the activity of integrator i associated with the item, and let s_i be the average strength of the first i integrators, weighted by the {γ_j} terms:

s_i = (1/Γ_i) Σ_{j=1..i} γ_j x_j, where Γ_i = Σ_{j=1..i} γ_j.

The recall probability is simply related to the net strength of the item: P(recall) = min(1, s_N). When an item is studied, its integrators receive a boost in activity. Integrator i receives a boost that depends on how close the average strength of the first i integrators is to full strength, i.e.,

Δx_i = ε(1 − s_i),   (6)

where ε is a step size. We adopt the retrieval-dependent update assumption of SAM, and fix ε = 1 for an item that is unsuccessfully recalled at the time of study, and ε = ε_r > 1 for an item that is successfully recalled.

This description of MCM is identical to MTS except the following. (1) MTS weighs all integrators equally when combining the individual integrator activities; MCM uses a γ-weighted average. (2) MTS provides no guidance in setting the τ and γ constants; MCM constrains these parameters based on the human forgetting function. (3) The integrator update magnitude is retrieval dependent, as in SAM.
(4) The MCM update rule (Equation 6) is based on s_i, whereas the MTS rule (Equation 4) is based on s_{i−1}. This modification is motivated by the neural net formulation of MCM, in which using s_i allows the update to be interpreted as performing gradient ascent in prediction ability.

3.2 Casting MCM as a neural network

The neural net conceptualization of MCM is depicted in Figure 1b. The input layer is like that of SAM with the context units arranged in N pools, with γ_i being the relative size of pool i. The activity of unit j in pool i is denoted c_ij. The context units are binary valued and units in pool i flip with time constant τ_i. On average a fraction β are on at any time. (β has no effect on the model's predictions, and is cancelled out in the formulation that follows.)

As depicted in Figure 1b, the model also includes a set of N memory elements for each item to be learned. Memory elements are in one-to-one correspondence with context pools. Activation of memory element i, denoted m_i, indicates strength of retrieval for the item based on context pools 1...i. The activation function is cascaded such that memory element i receives input from context units in pool i as well as memory element i − 1:

m_i = m_{i−1} + Σ_j w_ij c_ij + b,

where w_ij is the connection weight from context unit j to memory element i, m_0 ≡ 0, and b = −β/(1 − β) is a bias weight. The bias simply serves to offset spurious activity reaching the memory elements, activity that is unrelated to the fact that the item was previously studied and stored. The larger the fraction of context units that are on at any time (β), the more spurious activation there will be that needs to be cancelled out.
The probability of recalling the item is related to the activity of memory element N: P(recall) = min(1, m_N).

When the item is studied, the weights from context units in pool i are adjusted according to an update rule that performs gradient descent in an error measure E_i = e_i², where e_i = 1 − m_i/Γ_i. This error is minimized when memory element i reaches activation level Γ_i (defined earlier as the proportion of units in the entire context pool that contributes to activity at stage i). The weight update that performs gradient descent in E_i is

Δw_ij = (ε / (Nβ(1 − β))) e_i c_ij,   (7)

where ε is a learning rate and the denominator of the first term is a normalization constant which can be folded into the learning rate. As in SAM, ε is assumed to be contingent on retrieval success at the start of the study trial, in the manner we described previously.

What is the motivation for minimizing the prediction error at every stage, versus minimizing the prediction error just at the final stage, E_N? To answer this question, note that there are two consequences of reducing the error E_i to zero for any i. First, reducing E_i will also likely serve to reduce E_l for all l > i. Second, achieving this objective will allow the {w_{l,j,k} : l > i} to all be set to zero without any effect on the memory. Essentially, there is no need to store information for longer than it is needed.

This description of MCM is identical to SAM except: (1) SAM has a single temporal scale of representation; MCM has a multiscale representation.
(2) SAM's memory update rule can be interpreted as Hebbian learning; MCM's update can be interpreted as error-correction learning.

3.3 Relating leaky integrator and neural net characterizations of MCM

To make contact with MTS, we have described MCM as a cascade of leaky integrators, and to make contact with SAM, we have described MCM as a neural net. One can easily verify that the leaky-integrator and neural-net descriptions of MCM are equivalent via the following correspondence between variables of the two models, where E[.] denotes the expectation over context representations:

s_i = E[m_i]/Γ_i and x_i = (Σ_j E[w_ij c_ij] + b) / (Nβ(1 − β)).

4 Simulations

Cepeda and colleagues (Cepeda, Vul, Rohrer, Wixted, & Pashler, 2008; Cepeda et al., in press) have recently conducted well-controlled experimental manipulations of spacing involving RIs on educationally relevant time scales of days to months. Most research in the spacing literature involves brief RIs, on the scale of minutes to an hour, and methodological concerns have been raised with the few well-known studies involving longer RIs (Cepeda et al., 2006). In Cepeda's experiments, participants study a set of paired associates over two sessions. In the first session, participants are trained until they reach a performance criterion, ensuring that the material has been successfully encoded. At the start of the second session, participants are tested via a cued-recall paradigm, and then are given a fixed number of study passes through all the pairs. Following a specified RI, a final cued-recall test is administered.
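The leaky-integrator form of MCM (Section 3.1) with the parameterization of Equation 5 can be sketched in a few lines. The numeric parameters below are illustrative placeholders rather than values fitted to any forgetting function, and recall success at the second study is simplified to a deterministic threshold on min(1, s_N) rather than a stochastic recall event; only the step size ε_r = 9 is taken from the paper.

```python
import math

def mcm_params(mu, nu, omega, xi, N=30):
    """Time constants and weights from Equation 5: tau_i = mu * nu**i and
    gamma_i = omega * xi**i / sum_j xi**j (illustrative, unfitted values)."""
    taus = [mu * nu ** i for i in range(1, N + 1)]
    raw = [xi ** i for i in range(1, N + 1)]
    z = sum(raw)
    gammas = [omega * r / z for r in raw]
    return taus, gammas

def mcm_study(study_times, taus, gammas, eps_fail=1.0, eps_success=9.0):
    """Leaky-integrator MCM: between studies x_i decays as exp(-dt/tau_i);
    at a study, x_i is bumped by eps * (1 - s_i) (Equation 6), where s_i is
    the gamma-weighted average strength of the first i integrators. The step
    size is retrieval dependent; here 'successful recall' is simplified to
    min(1, s_N) >= 0.5 instead of a stochastic recall outcome."""
    x = [0.0] * len(taus)
    prev = None
    for t in study_times:
        if prev is not None:
            x = [xi * math.exp(-(t - prev) / tau) for xi, tau in zip(x, taus)]
        s_vals, s_num, s_den = [], 0.0, 0.0
        for xi, g in zip(x, gammas):
            s_num += g * xi
            s_den += g
            s_vals.append(s_num / s_den)
        eps = eps_success if min(1.0, s_vals[-1]) >= 0.5 else eps_fail
        x = [xi + eps * (1.0 - si) for xi, si in zip(x, s_vals)]
        prev = t
    return x

def p_recall(x, taus, gammas, delay):
    """P(recall) = min(1, s_N) after a retention delay since the last study."""
    s_n = sum(g * xi * math.exp(-delay / tau)
              for xi, g, tau in zip(x, gammas, taus)) / sum(gammas)
    return min(1.0, s_n)

taus, gammas = mcm_params(mu=1.0, nu=1.4, omega=1.0, xi=0.85)
recall = {isi: p_recall(mcm_study([0.0, isi], taus, gammas), taus, gammas, 50.0)
          for isi in (0.01, 5.0, 500.0)}
```

Even with these made-up parameters, simulated recall at a long RI is nonmonotonic in the ISI: the intermediate spacing beats both the massed and the very long schedule, which is the qualitative shape of the spacing function the model is built to predict.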
Recall accuracy at the start of the second session provides the basic forgetting function, and recall accuracy at test provides the spacing function.

Figure 2: Modeling and experimental data of Cepeda et al. (in press): (a) Experiment 1 (Swahili-English), (b) Experiment 2a (obscure facts), and (c) Experiment 2b (object names). The four RI conditions of Cepeda et al. (2008) are modeled using (d) MCM and (e) the Bayesian multiscale model of Kording et al. (2007). In panel (e), the peaks of the model's spacing functions are indicated by the triangle pointers.

For each experiment, we optimized MCM's parameters, {μ, ν, ω, ξ}, to obtain a least squares fit to the forgetting function. These four model parameters determine the time constants and weighting coefficients of the mixture-of-exponentials approximation to the forgetting function (Equation 5). The model has only one other free parameter, ε_r, the magnitude of update on a trial when an item is successfully recalled (see Equation 6). We chose ε_r = 9 for all experiments, based on hand tuning the parameter to fit the first experiment reported here. With ε_r, MCM is fully constrained and can make strong predictions regarding the spacing function.

Figure 2 shows MCM's predictions of Cepeda's experiments. Panels a-c show the forgetting function data for the experiments (open blue squares connected by dotted lines), MCM's post-hoc fit to the forgetting function (solid blue line), the spacing function data (solid green points connected by dotted lines), and MCM's parameter-free prediction of the spacing function (solid green line). The individual panels show the ISIs studied and the RI. For each experiment, MCM's prediction of the peak of the spacing function is entirely consistent with the data, and for the most part, MCM's quantitative predictions are excellent.
(In panel c, MCM's predictions are about 20% too low across the range of ISIs.) Interestingly, the experiments in panels b and c explored identical ISIs and RIs with two different types of material. With the coarse range of ISIs explored, the authors of these experiments concluded that the peak ISI was the same independent of the material (28 days). MCM suggests a different peak for the two sets of material, a prediction that can be evaluated empirically. (It would be extremely surprising to psychologists if the peak were in general independent of the material, as content effects pervade the memory literature.)

Panel d presents the results of a complex study involving a single set of items studied with 11 different ISIs, ranging from minutes to months, and four RIs, ranging from a week to nearly a year. We omit the fit to the forgetting function to avoid cluttering the graph. The data and model predictions are color coded by RI, with higher recall accuracy for shorter RIs. MCM predicts the spacing functions with absolutely spectacular precision, considering the predictions are fully constrained and parameter free. Moreover, MCM anticipates the peaks of the spacing functions, with the curvature of the peak decreasing with the RI, and the optimal ISI increasing with the RI.

Figure 2 panels plot % recall against ISI (days): (a) RI = 10 days; (b) RI = 168 days; (c) RI = 168 days; (d) and (e) RIs = 7, 35, 70, 350 days.

Figure 3: A meta-analysis of the literature by Cepeda et al. (2006). Each red circle represents a single spacing experiment in which the ISI was varied for a given RI. The optimal ISI obtained in the experiment is plotted against the RI on a log-log scale. (Note that the data are intrinsically noisy because experiments typically examine only a small set of ISIs, from which the 'optimum' is chosen.) The X's represent the mean from 1000 replications of MCM for a given RI with randomly drawn parameter settings (i.e., random forgetting functions), and the dashed line is the best regression fit to the X's. Both the experimental data and MCM show a power law relationship between optimal ISI and RI.

In addition to these results, MCM also predicts the probability of recall at test conditional on successful or unsuccessful recall during the test at the start of the second study session. As explained in Figure 3, MCM obtains a sensible parameter-free fit to a meta-analysis of the experimental literature by Cepeda et al. (2006). Finally, MCM is able to post-hoc fit classic studies from the spacing literature (for which forgetting functions are not available).

5 Discussion

MCM's blind prediction of 7 different spacing functions is remarkable considering that the domain's complexity (the content, manner and amount of study) is reduced to four parameters, which are fully determined by the forgetting function. Obtaining empirical forgetting functions is straightforward. Obtaining empirical evidence to optimize study schedules, especially when more than two sessions are involved, is nearly infeasible. MCM thus offers a significant practical tool for educators in devising study schedules. Optimizing study schedules with MCM is straightforward, and particularly useful considering that MCM can optimize not only for a known RI but for RI as a random variable.

MCM arose from two existing models, MTS and SAM, and all three models are characterized at Marr's implementation or algorithmic levels, not at the level of a computational theory. Kording et al.
(2007) have proposed a Bayesian memory model that bears intriguing similarities to MCM and has the potential to serve as a complementary computational theory. The model is a Kalman filter (KF) with internal state variables that decay exponentially at different rates. The state predicts the appearance of an item in the temporal stream of experience. The dynamics of MCM can be exactly mapped onto the KF, with τ related to the decay of a variable, and γ to its internal noise level. However, the KF model has a very different update rule, based on the Kalman gain. We have tried to fit experimental data with the KF model, but have not been satisfied with the outcome. For example, Figure 2e shows a least-squares fit of the KF model’s six free parameters to the Cepeda et al. (2008) data. (Two parameters determine the range of time scales; two specify internal and observation noise levels; and two perform an affine transform from internal memory strength to recall probability.) In terms of sum-squared error, the model shows a reasonable fit, but it clearly misses the peaks of the spacing functions and in fact predicts a peak that is independent of RI. Notably, the KF model is a post hoc fit to the spacing functions, whereas MCM produces a true prediction of them, i.e., MCM’s parameters are determined without peeking at the spacing functions. Exploring many parameterizations of the KF model, we find that it generally predicts decreasing or constant optimal ISIs as a function of the RI. In contrast, MCM necessarily produces an increasing optimal ISI as a function of the RI, consistent with all behavioral data.
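To make the KF variant concrete, the following sketch (our own minimal construction for illustration, not Kording et al.’s implementation; the function name, parameter names, and default values are all assumptions) tracks a bank of state variables with exponentially spaced decay time constants, applies a Kalman-gain update at each study event, and maps summed strength to recall probability via an affine transform, mirroring the six-parameter structure described above:

```python
import numpy as np

def kf_recall(study_times, test_time, n_scales=10,
              tau_min=1.0, tau_max=300.0,   # two params: range of time scales (days)
              q=0.01, r=0.1,                # two params: process and observation noise
              a=0.5, b=0.1):                # two params: affine map to recall probability
    """Toy Kalman-filter memory trace: states decay exponentially at
    different rates; each study event updates them via the Kalman gain."""
    taus = np.logspace(np.log10(tau_min), np.log10(tau_max), n_scales)
    x = np.zeros(n_scales)          # state means (memory strengths)
    P = np.eye(n_scales)            # state covariance
    H = np.ones((1, n_scales))      # summed strength predicts the item's appearance
    events = sorted(study_times) + [test_time]
    t_prev = events[0]
    for i, t in enumerate(events):
        dt = t - t_prev
        A = np.diag(np.exp(-dt / taus))              # exponential decay between events
        x = A @ x
        P = A @ P @ A.T + q * dt * np.eye(n_scales)  # decay plus accumulated noise
        if i < len(events) - 1:                      # study event: item observed (y = 1)
            S = H @ P @ H.T + r                      # innovation variance
            K = P @ H.T / S                          # Kalman gain
            x = x + (K * (1.0 - H @ x)).ravel()
            P = P - K @ H @ P
        t_prev = t
    strength = (H @ x).item()
    return float(np.clip(a + b * strength, 0.0, 1.0))  # affine map to recall prob.
```

Under these assumed settings, additional study sessions raise the predicted recall probability at test (e.g., `kf_recall([0, 10, 20], 30)` exceeds `kf_recall([0], 30)`); the interesting question, discussed above, is how such a model’s optimal ISI varies with RI.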
It remains an important and intriguing challenge to unify MCM and the KF model; each has something to offer the other.

References

Anderson, J. R., & Milson, R. (1989). Human memory: An adaptive perspective. Psychological Review, 96, 703–719.

Cepeda, N. J., Coburn, N., Rohrer, D., Wixted, J. T., Mozer, M. C., & Pashler, H. (in press). Optimizing distributed practice: Theoretical analysis and practical implications. Journal of Experimental Psychology.

Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D. (2006). Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin, 132, 354–380.

Cepeda, N. J., Vul, E., Rohrer, D., Wixted, J. T., & Pashler, H. (2008). Spacing effects in learning: A temporal ridgeline of optimal retention. Psychological Science, 19, 1095–1102.

Kording, K. P., Tenenbaum, J. B., & Shadmehr, R. (2007). The dynamics of memory as a consequence of optimal adaptation to a changing body. Nature Neuroscience, 10, 779–786.

Pavlik, P. I., & Anderson, J. R. (2005). Practice and forgetting effects on vocabulary memory: An activation-based model of the spacing effect. Cognitive Science, 29(4), 559–586.

Raaijmakers, J. G. W. (2003). Spacing and repetition effects in human memory: Application of the SAM model. Cognitive Science, 27, 431–452.

Raaijmakers, J. G. W., & Shiffrin, R. M. (1981). Search of associative memory. Psychological Review, 88, 93–134.

Staddon, J. E. R., Chelaru, I. M., & Higa, J. J. (2002). Habituation, memory and the brain: The dynamics of interval timing. Behavioural Processes, 57, 71–88.

Wixted, J. T., & Carpenter, S. K. (2007). The Wickelgren power law and the Ebbinghaus savings function.
Psychological Science, 18, 133–134.