{"title": "Two is better than one: distinct roles for familiarity and recollection in retrieving palimpsest memories", "book": "Advances in Neural Information Processing Systems", "page_first": 1305, "page_last": 1313, "abstract": "Storing a new pattern in a palimpsest memory system comes at the cost of interfering with the memory traces of previously stored items. Knowing the age of a pattern thus becomes critical for recalling it faithfully. This implies that there should be a tight coupling between estimates of age, as a form of familiarity, and the neural dynamics of recollection, something which current theories omit. Using a normative model of autoassociative memory, we show that a dual memory system, consisting of two interacting modules for familiarity and recollection, has best performance for both recollection and recognition. This finding provides a new window onto actively contentious psychological and neural aspects of recognition memory.", "full_text": "Two is better than one: distinct roles for familiarity\nand recollection in retrieving palimpsest memories\n\nCristina Savin1\n\ncs664@cam.ac.uk\n\nPeter Dayan2\n\ndayan@gatsby.ucl.ac.uk\n\nM\u00b4at\u00b4e Lengyel1\n\nm.lengyel@eng.cam.ac.uk\n\n1Computational & Biological Learning Lab, Dept. of Engineering, University of Cambridge, UK\n\n2Gatsby Computational Neuroscience Unit, University College London, UK\n\nAbstract\n\nStoring a new pattern in a palimpsest memory system comes at the cost of inter-\nfering with the memory traces of previously stored items. Knowing the age of\na pattern thus becomes critical for recalling it faithfully. This implies that there\nshould be a tight coupling between estimates of age, as a form of familiarity, and\nthe neural dynamics of recollection, something which current theories omit. 
Using a normative model of autoassociative memory, we show that a dual memory system, consisting of two interacting modules for familiarity and recollection, has best performance for both recollection and recognition. This finding provides a new window onto actively contentious psychological and neural aspects of recognition memory.

1 Introduction

Episodic memory such as that in the hippocampus acts like a palimpsest: each new entity to be stored is overlaid on top of its predecessors, and, in turn, is submerged by its successors. This implies both anterograde interference (existing memories hinder the processing of new ones) and retrograde interference (new memories overwrite information about old ones). Both pose important challenges for the storage and retrieval of information in neural circuits. Some aspects of these challenges have been addressed in two theoretical frameworks: one focusing on anterograde interference through the interaction of novelty and storage [1]; the other on retrograde interference in individual synapses [2]. However, neither fully considered the critical issue of retrieval from palimpsests; this is our focus.
First, [1] made the critical observation that autoassociative memories only work if normal recall dynamics are suppressed on presentation of new patterns that need to be stored. Otherwise, rather than memorizing the new pattern, the memory associated with the existing pattern that most closely matches the new input will be strengthened. This suggests that it is critical to have a mechanism for assessing pattern novelty or, conversely, familiarity, a function that is often ascribed to neocortical areas surrounding the hippocampus.
Second, [2] considered the palimpsest problem of overwriting information in synapses whose efficacies have limited dynamic ranges. 
They pointed out that this can be at least partially addressed through allowing multiple internal states (for instance forming a cascade) for each observable synaptic efficacy level. However, although [2] provide an attractive formalism for analyzing and optimizing synaptic storage, a retrieval mechanism associated with this storage is missing.

Figure 1: a. The cascade model. Internal states of a synapse (circles) can express one of two different efficacies (W, columns). Transitions between states are stochastic and can either be potentiating or depressing, depending on pre- and postsynaptic activities. Probabilities of transitions between states expressing the same efficacy, p, and between states expressing different efficacies, q, decrease geometrically with cascade depth. b. Generative model for the autoassociative memory task. The recall cue x̃ is a noisy version of one of the stored patterns x. Upon storing pattern x, synaptic states changed from V_0 (sampled from the stationary distribution of synaptic dynamics) to V_1. Recall occurs after the presentation of t − 1 intervening patterns, when synapses are in states V_t, with corresponding synaptic efficacies W_t. Only W_t and x̃ are observed at recall.

Although these pieces of work might seem completely unrelated, we show here that they are closely linked via retrieval. The critical fact about recall from memory, in general, is to know how the information should appear at the time of retrieval. In the case of a palimpsest, the trace of a memory in the synaptic efficacies depends critically on the age of the memory, i.e., its relative familiarity. This suggests a central role for novelty (or familiarity) signals during recollection. Indeed, we show retrieval is substantially worse when familiarity is not explicitly represented than when it is.
Dual system models for recognition memory are the topic of a heated debate [3, 4]. 
Our results could provide a computational rationale for them, showing that separating a perirhinal-like network (involved in familiarity) from a hippocampal-like network can be beneficial even when the only task is recollection. We also show that the task of recognition is best accomplished by combining the outputs of both networks, as suggested experimentally [4].

2 Storage in a palimpsest memory

We consider the task of autoassociative recall of binary patterns from a palimpsest memory. Specifically, the neural circuit consists of N binary neurons that enjoy all-to-all connectivity. During storage, network activity is clamped to the presented pattern x, inducing changes in the synapses' 'internal' states V and corresponding observed binary efficacies W (Fig. 1a).
At recall, we seek to retrieve a pattern x that was originally stored, given a noisy cue x̃ and the current weight matrix W. This weight matrix is assumed to result from storing x on top of the stationary distribution of the synaptic efficacies coming from the large number of patterns that had been previously stored, and then subsequently storing a sequence of t − 1 other intervening patterns with the same statistics on top of x (Fig. 
1b).
In more detail, a pattern to be stored has density f, and is drawn from the distribution:

P_store(x) = \prod_i P_store(x_i) = \prod_i f^{x_i} (1 − f)^{1 − x_i}    (1)

The recall cue is a noisy version of the original pattern, modeled using a binary symmetric channel:

P_noise(x̃ | x) = \prod_i P_noise(x̃_i | x_i)    (2)

P_noise(x̃_i | x_i) = ((1 − r)^{x_i} r^{1 − x_i})^{x̃_i} · (r^{x_i} (1 − r)^{1 − x_i})^{1 − x̃_i}    (3)

where r defines the level of input noise.
The recall time t is assumed to come from a geometric distribution with mean t̄:

P_recall(t) = (1/t̄) · (1 − 1/t̄)^{t − 1}    (4)

The synaptic learning rule is local and stochastic, with the probability of an event actually leading to state changes determined by the current state of the synapse V_ij and the activity at the pre- and post-synaptic neurons, x_i and x_j. Hence, learning is specified through a set of transition matrices M(x_i, x_j), with M(x_i, x_j)_{l'l} = P(V'_ij = l' | V_ij = l, x_i, x_j). For convenience, we adopted the cascade model [2] (Fig. 1a), which assumes that the probability of potentiation and depression decays with cascade depth i as a geometric progression, q^±_i = χ^{i−1}, with q^±_n = χ^{n−1}/(1 − χ) to compensate for boundary effects. The transition between metastates is given by p^±_i = ς^± χ^i/(1 − χ), with the correction factors ς^+ = (1 − f)/f and ς^− = f/(1 − f) ensuring that different metastates are equally occupied for different pattern sparseness values f [2]. Furthermore, we assume synaptic changes occur only when the postsynaptic neuron is active, leading to potentiation if the presynaptic neuron is also active and to depression otherwise. 
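As a concrete illustration, the synaptic Markov chain described above can be assembled numerically. The sketch below is ours, not the paper's code: the function names, the state indexing (weak states 0..n−1, strong states n..2n−1, deeper index = more stable), and the clipping of transition probabilities at 1 are all our simplifying assumptions on top of the transition schedule given in the text.

```python
import numpy as np

def event_matrix(n, chi, f, potentiate):
    """Column-stochastic transition matrix for one plasticity event in a
    depth-n, two-efficacy cascade synapse.  Following the schedule in the
    text: switching efficacy from depth i happens with q_i = chi**(i-1)
    (with a 1/(1-chi) boundary correction at depth n), and moving one level
    deeper within the current efficacy with p_i = varsigma * chi**i / (1-chi)."""
    q = np.minimum(1.0, chi ** np.arange(n))
    q[-1] = min(1.0, q[-1] / (1.0 - chi))          # boundary correction
    varsigma = (1 - f) / f if potentiate else f / (1 - f)
    p = np.minimum(1.0, varsigma * chi ** np.arange(1, n + 1) / (1.0 - chi))
    T = np.zeros((2 * n, 2 * n))
    src_off, dst_off = (0, n) if potentiate else (n, 0)
    for i in range(n):
        # states expressing the efficacy opposed to the event: jump to the
        # shallowest state of the other chain with probability q_i
        s = src_off + i
        T[dst_off, s] += q[i]
        T[s, s] += 1.0 - q[i]
        # states already expressing the event's efficacy: move one level
        # deeper with probability p_i (the deepest state can only stay put)
        s = dst_off + i
        T[dst_off + min(i + 1, n - 1), s] += p[i]
        T[s, s] += 1.0 - p[i]
    return T

def average_M(n, chi, f):
    """Average transition matrix under P_store: the synapse changes only when
    the postsynaptic neuron is active (probability f), with potentiation if
    the presynaptic neuron is also active and depression otherwise."""
    P = event_matrix(n, chi, f, potentiate=True)
    D = event_matrix(n, chi, f, potentiate=False)
    return (1 - f) * np.eye(2 * n) + f * (f * P + (1 - f) * D)

def stationary(M):
    """pi_inf: the eigenvector of M for eigenvalue 1, normalised to sum to 1."""
    w, v = np.linalg.eig(M)
    pi = np.real(v[:, np.argmax(np.real(w))])
    return pi / pi.sum()
```

The stationary distribution returned by `stationary(average_M(...))` plays the role of π^∞ in the storage model below.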
The specific form of the learning rule could influence the memory span of the network, but we expect it not to change the results below qualitatively.
The evolution of the distribution over synaptic states after encoding can be described by a Markov process, with a transition matrix M given as the average change in synaptic states expected after storing an arbitrary pattern from the prior P_store(x): M = \sum_{x_i, x_j} P_store(x_i) · P_store(x_j) · M(x_i, x_j). Additionally, we define the column vectors π^V(x_i, x_j) and π^W(x_i, x_j) for the distribution of the synaptic states and observable efficacies, respectively, when one of the patterns stored was (x_i, x_j), such that π^V_l(x_i, x_j) = P(V_ij = l | x_i, x_j) and π^W_l(x_i, x_j) = P(W_ij = l | x_i, x_j). Given these definitions, we can express the final distribution over synaptic states as:

π^V(x_i, x_j) = \sum_t P_recall(t) · ( M^{t−1} · M(x_i, x_j) · π^∞ )    (5)

where we start from the stationary distribution π^∞ (the eigenvector of M for eigenvalue 1), encode pattern (x_i, x_j) and then t − 1 additional patterns from the same distribution. The corresponding weight distribution is π^W(x_i, x_j) = T · π^V(x_i, x_j), where T is a 2 × 2n matrix defining the deterministic mapping from synaptic states to observable efficacies.
The fact that the recency of the pattern to be recalled, t, appears in equation 5 implies that pattern age will strongly influence information retrieval. In the following, we consider two possible solutions to this problem. We first show the limitations of recall dynamics that involve a single, monolithic module which averages over t. 
We then prove the benefits of a dual system with two qualitatively different modules, one of which explicitly represents an estimate of pattern age.

3 A single module recollection system

3.1 Optimal retrieval dynamics

Since information storage by synaptic plasticity is lossy, the recollection task described above is a probabilistic inference problem [5, 6]. Essentially, neural dynamics should represent (aspects of) the posterior over stored patterns, P(x | x̃, W), that expresses the probability of any pattern x being the correct response for the recall query given a noisy recall cue, x̃, and the synaptic efficacies W.
In more detail, the posterior over possible stored patterns can be computed as:

P(x | W, x̃) ∝ P_store(x) · P_noise(x̃ | x) · P(W | x)    (6)

where we assume that evidence from the weights factorizes over synapses1, P(W | x) = \prod_{ij} P(W_ij | x_i, x_j).

1 This assumption is never exactly true in practice, as synapses that share a pre- or post-synaptic partner are bound to be correlated. Here, we assume the intervening patterns cause independent weight changes and ignore the effects of such correlations.

Previous Bayesian recall dynamics derivations assumed learning rules for which the contribution of each pattern to the final weight was the same, irrespective of the order of pattern presentation [5, 6]. By contrast, the Markov chain behaviour of our synaptic learning rule forces us to explicitly consider pattern age. Furthermore, as pattern age is unknown at recall, we need to integrate over all possible t values (Eq. 5). This integral (which is technically a sum, for discrete t) can be computed analytically using the eigenvalue decomposition of the transition matrix M. 
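The marginalization over t has a compact closed form: with a geometric prior, \sum_t P_recall(t) M^{t−1} is a matrix geometric series. The sketch below (ours; it uses a direct matrix inverse, which is equivalent to what the eigenvalue decomposition of M computes term by term) evaluates this sum in one step.

```python
import numpy as np

def geometric_marginal(M, tbar):
    """S = sum_{t>=1} (1/tbar) * (1 - 1/tbar)**(t-1) * M**(t-1).
    Since M is stochastic (spectral radius 1) and 1 - 1/tbar < 1, the series
    converges to (1/tbar) * (I - (1 - 1/tbar) M)^(-1).  Applying the result
    to M(x_i, x_j) @ pi_inf yields pi_V(x_i, x_j) of Eq. 5 directly."""
    d = M.shape[0]
    return (1.0 / tbar) * np.linalg.inv(np.eye(d) - (1.0 - 1.0 / tbar) * M)
```

The same routine also gives the delta-prior case as a limit: if t is known exactly, the sum collapses to a single power of M instead.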
Alternatively, if the value of t is known during recall, the prior is replaced by a delta function, P_recall(t) = δ(t − t*).
There are several possible ways of representing the posterior in Eq. 6 through neural dynamics without reifying t. For consistency, we assume neural states to be binary, with network activity at each step representing a sample from the posterior [7, 8]. An advantage of this approach is that the full posterior is represented in the network dynamics, such that higher decision modules can not only extract the 'best' pattern (for the mean squared error cost function considered here, this would be the mean of the posterior) but also estimate the uncertainty of this solution. Nevertheless, other representations, for example representing the parameters of a mean-field approximation to the true posterior [5, 9, 10], would also be possible and similarly informative about uncertainty.
In particular, we use Gibbs sampling, as it allows for neurally plausible recall dynamics [7]. This results in asynchronous updates, in which the activity of a neuron x_i changes stochastically as a function of its input cue x̃_i, the activity of all other neurons, x_\i, and neighbouring synapses, W_{i,·} and W_{·,i}. 
Specifically, the Gibbs sampler results in a sigmoid transfer function, with the total current to the neuron given by the log-odds ratio:

I^rec_i = log [ P(x_i = 1 | x_\i, W, x̃_i) / P(x_i = 0 | x_\i, W, x̃_i) ] = I^{rec,in}_i + I^{rec,out}_i + a x̃_i + b    (7)

with the terms I^{rec,in/out}_i defining the evidence from the incoming and outgoing synapses of neuron i, and the constants a and b determined by the prior over patterns and the noise model.2 The terms describing the contribution from recurrent interactions have a similar shape:

I^{rec,in}_i = \sum_j ( c^in_1 · W_ij x_j + c^in_2 · W_ij + c^in_3 · x_j + c^in_4 )    (8)

I^{rec,out}_i = \sum_j ( c^out_1 · W_ji x_j + c^out_2 · W_ji + c^out_3 · x_j + c^out_4 )    (9)

The parameters c^{in/out}_k, uniquely determined by the learning rule and the priors for x and t, rescale the contribution of the evidence from the weights as a function of pattern age (see supplementary text). Furthermore, these constants translate into a unique signal, giving a sort of 'sufficient statistic' for the expected memory strength. Note that the optimal dynamics include two homeostatic processes, corresponding to global inhibition, \sum_j x_j, and neuronal excitability regulation, \sum_j W_ij, that stabilize network activity during recall.

3.2 Limitations

Besides the effects of assuming a factorized weight distribution, the neural dynamics derived above should be the best we can do given the available data (i.e. recall cue and synaptic weights). 
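The update defined by Eqs. 7–9 can be sketched as one asynchronous sweep of the network. In the sketch below (ours), the constants c_in, c_out, a, b are left as free parameters, whereas in the model they are fixed by the learning rule and the priors over x and t; we also assume no self-connections.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gibbs_sweep(x, W, cue, c_in, c_out, a, b, rng):
    """One asynchronous Gibbs sweep of the recollection dynamics.
    Each neuron i is resampled with probability sigmoid(I_i), where the
    total current I_i (Eq. 7) sums evidence from incoming synapses W[i, :]
    (Eq. 8), outgoing synapses W[:, i] (Eq. 9), the cue, and a bias."""
    N = len(x)
    for i in rng.permutation(N):
        mask = np.arange(N) != i                     # sum over j != i
        xj, w_in, w_out = x[mask], W[i, mask], W[mask, i]
        I_in = np.sum(c_in[0] * w_in * xj + c_in[1] * w_in
                      + c_in[2] * xj + c_in[3])
        I_out = np.sum(c_out[0] * w_out * xj + c_out[1] * w_out
                       + c_out[2] * xj + c_out[3])
        I = I_in + I_out + a * cue[i] + b
        x[i] = int(rng.random() < sigmoid(I))
    return x
```

As a sanity check, setting the recurrent constants to zero and making the cue term dominant makes the sweep simply copy the cue.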
How well does the network fare in practice?
Performance is as expected when pattern age is assumed known: as the available information from the weights decreases, so does performance, finally converging to control levels, defined by the retrieval performance of a network without plastic recurrent connections, i.e. when inference uses only the recall cue and the prior over stored patterns (Fig. 2a, green). When t is unknown, performance also deteriorates with increasing pattern age, however this time beneath control levels (Fig. 2a, blue). Intuitively, one can see that relying on the prior over t is similar to assuming t fixed to a value close to the mean of this prior. When the pattern that was actually presented is older than this estimate, the resulting memory signal is weaker than expected, suggesting that the initial pattern was very sparse (since a pair of inactive elements does not induce any synaptic changes according to our learning rule). 

2 Real neurons can only receive information from their presynaptic partners, so cannot estimate I^{rec,out}. We therefore ran simulations without this term in the dynamics and found that although it did decrease recall performance, this decrease was similar to that obtained by randomly pruning half of the connections in the network and keeping this term in the dynamics (not shown). This indicated that performance is mostly determined by the number of available synapses used for inference, and not so much by the direction of those synapses. Hence, in the following we use both terms and leave the systematic study of connectivity for future work.

Figure 2: a. Recall performance for a single module memory system. b. Average recollection error comparison for the single and dual memory system. Black lines mark control performance, when ignoring the information from the synaptic weights.

However, less reasonably, when averaging over the prior distribution of recall times t (Eq. 4), performance is worse than this control (Fig. 2b).
One possible reason for this failure is that the sampling procedure used for inference might not work in certain cases. Since Gibbs samplers are known to mix poorly when the shape of the posterior is complex (with strong correlations, as in frustrated Ising models), perhaps our neural dynamics are unable to sample the desired distribution effectively. We confirmed this hypothesis by implementing a more sophisticated sampling procedure using tempered transitions [11] (details in supplementary text). Indeed, with tempered transitions performance becomes significantly better than control, even for the cases where Gibbs sampling fails (Fig. 2b). Unfortunately, there has yet to be a convincing suggestion as to how tempering dynamics (or in fact any other sampling algorithm that works well with correlated posteriors) can be represented neurally since, for example, they require a global acceptance decision to be taken at the end of each temperature cycle.
It is worth noting that with more complex synaptic dynamics (e.g. deeper cascades) simple Gibbs sampling works reasonably well (data not shown), probably because the posterior is smoother and hence easier to sample.

4 A dual memory system

An alternative to implicitly marginalizing over the age of the pattern throughout the inference process is to estimate it at the same time as performing recollection. This suggests the use of dual modules that together estimate the joint posterior P(x, t | x̃, W), with sampling proceeding in a loop: the familiarity module generates a sample from the posterior over the age of the currently estimated pattern, P(t | x, x̃, W); and the recollection module uses this estimated age to compute a new sample from the distribution over possible stored patterns given the age, P(x | x̃, W, t) (Fig. 
3a).
The module that computes familiarity can also be seen as a palimpsest, with each pattern overlaying, and being overlaid by, its predecessors and successors. Formally, it needs to compute the probability P(t | x, x̃, W), as the system continues to implement a Gibbs sampler with t as an additional dimension. As a separate module, the neural network estimating familiarity cannot however access the weights W of the recollection module. A biologically plausible approximation is to assume that the familiarity module uses a separate set of weights, which we call W^fam. Also, it is clear from Fig. 1b that t is independent of x̃ conditioned on x, thus the conditioning on x̃ can be dropped when computing the posterior over t, that is, external input need only feed directly into the recollection but not the familiarity module (Fig. 3a).
In particular, we assume a feedforward network structure in the familiarity module, with each neuron receiving the output of the recollection module as inputs through synapses W^fam.

Figure 3: a. An overview of the dual memory system. The familiarity network has a feedforward structure, with the activity of individual neurons estimating the probability of the true pattern age being a certain value t, see example in inset. The estimated pattern age translates into a familiarity signal, which scales the contribution of the recurrent inputs in the network dynamics. b. Dependence of the familiarity signal on the estimated pattern age.

These synaptic weights change according to the same cascade rule used for recollection.3 For simplicity, we assume that the familiarity neurons are always activated during encoding, so that synapses can change state (either by potentiation or depression) with every storage event.
Concretely, the familiarity module consists of N^fam neurons, each corresponding to a certain pattern age in the range 1–N^fam (the last unit codes for t ≥ N^fam). This forms a localist code for familiarity. The total input to a neuron is given by the log-posterior I^fam_i = log P(t = i | x, W^fam), which translates into a simple linear activation function:

I^fam_i = \sum_j [ c^fam_{1,i} W^fam_ij x_j + c^fam_{2,i} W^fam_ij + c^fam_{3,i} x_j + c^fam_{4,i} ] + log P(t) − log Z    (10)

where the constants c^fam_{k,i} are similar to the parameters c^{in/out} before (albeit different for each neuron because of their tuning to different values of t), and Z is the unknown partition function.
As mentioned above, we treat the activity of the familiarity module as a sample from the posterior over age t. This representation requires lateral competition between different units such that only one can become active at each step. Dynamics of this sort can be implemented using a softmax operator, P(x^fam_i = 1) = e^{I_i} / \sum_j e^{I_j} (thus rendering the evaluation of the partition function Z unnecessary), and are a common feature of a range of neural models [12, 13].
Critically, this familiarity module is not just a convenient theoretical construct associated with retrieval. First, as we mentioned before, the assessment of novelty actually plays a key part in memory storage, in making the decision as to whether a pattern that is presented is novel, and so should be stored, or familiar, and so should have its details be recalled. 
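The localist read-out of Eq. 10 and its softmax competition can be sketched compactly. In the sketch below (ours), the per-unit constants C[i] and the log-prior over ages are placeholders for the values that the learning rule and priors would fix.

```python
import numpy as np

def sample_age(x, W_fam, C, log_prior, rng):
    """Familiarity module: one localist unit per candidate age t = 1..N_fam.
    Unit i's total input I_i is linear in the recollection output x and its
    synapses W_fam[i, :] (as in Eq. 10); a softmax over the population then
    draws one sample from P(t | x, W_fam) without ever evaluating the
    partition function Z explicitly."""
    I = np.array([np.sum(C[i, 0] * W_fam[i] * x + C[i, 1] * W_fam[i]
                         + C[i, 2] * x + C[i, 3]) + log_prior[i]
                  for i in range(len(log_prior))])
    p = np.exp(I - I.max())            # numerically stable softmax
    p /= p.sum()
    return int(rng.choice(len(p), p=p)) + 1   # ages are 1-based
```

The winning unit's index is the sampled age, which the recollection module then treats as the known t in its own update.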
This venerable suggestion [1] has played a central part in the understanding of structure-function relationships in the hippocampus. The graded familiarity module that we have suggested is an obvious extension of this idea; the use for retrieval is new. Second, it is in general accord with substantial data on the role of perirhinal cortex and the activity of neurons in this structure [3]. Recency neurons would be associated with small values of t; novelty neurons with large or effectively infinite values of t [14], although perirhinal cortex appears to adopt a population coding strategy for age, rather than just one-of-n.
The recollection module has the same dynamics as before, with constants c_i computed assuming t fixed to the output of the familiarity module. Thus we predict that familiarity multiplicatively modulates recurrent interactions in the recollection module during recall. Since there is a deterministic mapping between t and this modulatory factor (Fig. 3b), it can be computed using a linear unit pooling the outputs of all the neurons in the familiarity module, with weights given by the corresponding values for c^fam_i(t).

3 There is nothing to say that the learning rule that optimizes the recollection network's ability to recall patterns should be equally appropriate for assessing familiarity. Hence, the familiarity module could have its own learning rule, optimized for its specific task.

Figure 4: a. Decision boundaries for the recognition module. b. Corresponding ROC curve. c. Performance comparison when the decision layer uses signals from the familiarity module, the recollection module, or both. d. Same comparison, when data is restricted to recent stimuli. 
Note that the difference between fam and rec becomes significant compared to c.

In order to compare single and dual module systems fairly, the computational resources employed by each should be the same. We therefore reduced the overall connectivity in the dual system such that the two have the same total number of synapses. Moreover, since elements of W^fam are correlated, the effective number of connections is in fact somewhat lower in the dual system. Regardless, the dual memory system performs significantly better than the single module system (Fig. 2b).

5 Recognition memory

We have so far considered familiarity merely as an instrument for effective recollection. However, there are many practical and experimental tasks in which it is sufficient to make a binary decision about whether a pattern is novel or familiar rather than recalling it in all its gory detail. It is these tasks that have been used to elucidate the role of perirhinal cortex in recognition memory.
In the dual module system, information about recognition is available from both the familiarity module (patterns judged to have young ages are recognized) and the recollection module (patterns recalled with higher certainty are recognized). We therefore construct an additional decision module which takes the outputs of the familiarity and recollection modules and maps them into a binary behavioral response (familiar vs. novel).
Specifically, we use the average of the entropies associated with the activities of neurons in the recollection module and the mean estimate of t from the familiarity module. Since the palimpsest property implicitly assumes that all patterns have been presented at some point, we define a pattern to be familiar if its age is less than a fixed threshold t_th. We train the decision module using a Gaussian process classifier4 [15], which yields as outcome the probability of a hit, P(familiar | t*, x*), shown in Fig. 4a. 
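The decision module can be mimicked with simpler machinery. The sketch below (ours) substitutes a plain logistic unit for the paper's Gaussian process classifier, and traces an ROC curve by sweeping the decision threshold; the two input features (posterior entropy from recollection, estimated age from familiarity) and all function names are our assumptions.

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, steps=3000):
    """Stand-in for the Gaussian process classifier: logistic regression on
    the two recognition signals, trained by gradient descent on log loss."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # absorb the bias term
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def roc_points(scores, labels):
    """Hit and false-alarm rates as the threshold on the recognition score is
    swept from strict to lax, i.e. as relative losses are varied."""
    order = np.argsort(-scores)
    lab = np.asarray(labels, float)[order]
    hits = np.cumsum(lab) / lab.sum()
    fas = np.cumsum(1.0 - lab) / (1.0 - lab).sum()
    return fas, hits
```

Training on features where only one dimension is informative, versus both, gives the single-signal comparison of Fig. 4c in miniature.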
The shape of the resulting discriminator (the fact that it is not parallel to either axis) suggests that the output of both modules is needed for successful recognition, as suggested experimentally [4, 16]. The fact that a classifier trained using only one of the two dimensions cannot match the recognition performance of that using both confirms this observation (Fig. 4c).
Moreover, the ROC curve produced by the classifier, plotting hit rates against false alarms as relative losses are varied, has a similar shape to those obtained for human behavioral data: it has a so-called 'curvilinear' character because of the apparent intersect at a finite hit probability for 0 false alarm rate [17] (Fig. 4b). Lastly, as recognition is known to rely more on familiarity for relatively recent patterns [18], we estimate recognition performance for recent patterns, which we define as having age t ≤ t_th/2. To determine the contribution of each module to recognition outcomes in this case, we estimate the performance of classifiers trained on single input dimensions for this test data. Consistent with experimental data, our analysis reveals that the familiarity signal gives a more reliable estimate of novelty, compared to the recollection output, for relatively recent items (Fig. 4d).

4 The specific classifier was chosen as it allows for an easy estimation of the ROC curves. Future work should explore analytical decision rules.

6 Conclusions and discussion

Knowing the age of a pattern is critical for retrieval from palimpsest memories, a consideration that has so far eluded theoretical inquiry. 
We showed that a memory system could either treat this information implicitly, by marginalizing over all possible ages, or it could estimate age explicitly as a form of familiarity. In principle, both solutions should have similar performance, given the same resources. In practice, however, a system involving dual modules is significantly better.
In our model, the posterior over possible stored patterns was represented in neural activities via samples. We showed that a complex, biologically-questionable sampling procedure would be necessary for the implicit, single module, system. Instead, a dual memory system with two functionally distinct but closely interacting modules yielded the best performance both for efficient recollection and for recognition. Importantly, though Gibbs sampling and tempered transitions provide a useful framework for understanding the performance differences between different memory systems, the presented results are not restricted to a sampling-based implementation. Since age and identity are tightly correlated, a mean field solution that uses factorized distributions [5] shows very similar behavior (see supplementary text). Similarly, the specific details of the familiarity module are not critical for these effects, which should be apparent for any alternative implementation correctly estimating pattern age.
Representing pattern age, t, explicitly essentially amounts to implementing an auxiliary variable for sampling the space of possible patterns, x, more efficiently. Such auxiliary variable methods are widely used to increase sampling efficiency when other, simpler methods fail [19]. 
Moreover, since t in our case specifically modulates the correlated components of the posterior, it can be seen as a 'temperature' parameter, and so we can understand the advantages brought about by the dual system as due to implementing a form of 'simulated tempering', a class of methods known to help mixing in strongly correlated posteriors.
Our proposal provides a powerful new window onto the contentious debate about the neural mechanisms of recognition and recall. The rationale for our familiarity network was improving recollection; however, the form of the network was motivated by the substantial experimental data [14] on recognition, and indeed standard models of perirhinal cortex activity [20]. These, for instance, also rely on some form of inhibition to mediate interactions between different familiarity neurons. Nevertheless, our model is the first to link the computational function of familiarity networks to recall; it is distinct also in that it considers palimpsest synapses, as previous models use purely additive learning rules [20]. Although we only considered pattern age as the basis of familiarity here, the principle of the interaction between familiarity and recollection remains the same in an extended setting, when familiarity characterizes the expected strength of the memory trace more completely, including the effects of retention interval, number of repetitions, and spacing between repetitions. Future work with the extended model should allow us to address familiarity, novelty, and recency neurons in the perirhinal cortex, and indeed provide a foundation for new thinking about this region.
In our model familiarity interacts with recollection by multiplicatively (or divisively) modulating the contribution of recurrent inputs in the recollection module. 
Neurally, this effect could be mediated by shunting inhibition via specific classes of hippocampal interneurons that target the dendritic segment corresponding to recurrent connections, thus rescaling the relative contribution of external versus recurrent inputs [21]. Whether the pathways reaching CA3 from perirhinal cortex through entorhinal cortex preserve sufficient input specificity in their feed-forward inhibition is unknown.
Our theory predicts important systems-level aspects of memory from synaptic-level constraints. In particular, by optimizing our dual system solely for memory recall we also predicted non-trivial ROC curves for recognition that are in at least broad qualitative agreement with experiments. Future work will be needed to explore whether the ROC curves in our model show dissociations, in response to specific lesions of the two modules, similar to those found in recent experiments [22, 23], and to explore the relation to other recognition memory models [24].

Acknowledgements

This work was supported by the Wellcome Trust (CS, ML) and the Gatsby Charitable Foundation (PD).

References

[1] Hasselmo, M.E. The role of acetylcholine in learning and memory. Current Opinion in Neurobiology 16, 710–715 (2006).

[2] Fusi, S., Drew, P.J. & Abbott, L.F. Cascade models of synaptically stored memories. Neuron 45, 599–611 (2005).

[3] Brown, M.W. & Aggleton, J.P. Recognition memory: What are the roles of the perirhinal cortex and hippocampus? Nature Reviews Neuroscience 2, 51–61 (2001).

[4] Wixted, J.T. & Squire, L.R. The medial temporal lobe and the attributes of memory. Trends in Cognitive Sciences 15, 210–217 (2011).

[5] Sommer, F.T. & Dayan, P. Bayesian retrieval in associative memories with storage errors. IEEE Transactions on Neural Networks 9, 705–713 (1998).

[6] Lengyel, M., Kwag, J., Paulsen, O. & Dayan, P.
Matching storage and recall: hippocampal spike timing-dependent plasticity and phase response curves. Nature Neuroscience 8, 1677–1683 (2005).

[7] Ackley, D., Hinton, G. & Sejnowski, T. A learning algorithm for Boltzmann machines. Cognitive Science 9, 147–169 (1985).

[8] Fiser, J., Berkes, P., Orbán, G. & Lengyel, M. Statistically optimal perception and learning: from behavior to neural representations. Trends in Cognitive Sciences 14, 119–130 (2010).

[9] Hinton, G. Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Computation 1, 143–150 (1989).

[10] Lengyel, M. & Dayan, P. Uncertainty, phase and oscillatory hippocampal recall. Advances in Neural Information Processing Systems (2007).

[11] Neal, R.M. Sampling from multimodal distributions using tempered transitions. Statistics and Computing 6, 353–366 (1996).

[12] Fukai, T. & Tanaka, S. A simple neural network exhibiting selective activation of neuronal ensembles: from winner-take-all to winners-share-all. Neural Computation 9, 77–97 (1997).

[13] Bogacz, R. & Gurney, K. The basal ganglia and cortex implement optimal decision making between alternative actions. Neural Computation 19, 442–477 (2007).

[14] Xiang, J.Z. & Brown, M.W. Differential neuronal encoding of novelty, familiarity and recency in regions of the anterior temporal lobe. Neuropharmacology 37, 657–676 (1998).

[15] Rasmussen, C.E. & Williams, C.K.I. Gaussian Processes for Machine Learning (MIT Press, 2006).

[16] Warburton, E.C. & Brown, M.W. Findings from animals concerning when interactions between perirhinal cortex, hippocampus and medial prefrontal cortex are necessary for recognition memory. Neuropsychologia 48, 2262–2272 (2010).

[17] Yonelinas, A.P. Components of episodic memory: the contribution of recollection and familiarity.
Philosophical Transactions of the Royal Society B: Biological Sciences 356, 1363–1374 (2001).

[18] Yonelinas, A. The nature of recollection and familiarity: A review of 30 years of research. Journal of Memory and Language 46, 441–517 (2002).

[19] Iba, Y. Extended ensemble Monte Carlo. Int. J. Mod. Phys. 12, 653–656 (2001).

[20] Bogacz, R. Comparison of computational models of familiarity discrimination in the perirhinal cortex. Hippocampus (2003).

[21] Mitchell, S. Shunting inhibition modulates neuronal gain during synaptic excitation. Neuron (2003).

[22] Fortin, N.J., Wright, S.P. & Eichenbaum, H. Recollection-like memory retrieval in rats is dependent on the hippocampus. Nature 431, 188–191 (2004).

[23] Cowell, R., Winters, B., Bussey, T. & Saksida, L. Paradoxical false memory for objects after brain damage. Science (2010).

[24] Norman, K. & O'Reilly, R. Modeling hippocampal and neocortical contributions to recognition memory: A complementary-learning-systems approach. Psychological Review (2003).