{"title": "Mental Sampling in Multimodal Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 5748, "page_last": 5759, "abstract": "Both resources in the natural environment and concepts in a semantic space are distributed \"patchily\", with large gaps in between the patches. To describe people's internal and external foraging behavior, various random walk models have been proposed. In particular, internal foraging has been modeled as sampling: in order to gather relevant information for making a decision, people draw samples from a mental representation using random-walk algorithms such as Markov chain Monte Carlo (MCMC). However, two common empirical observations argue against people using simple sampling algorithms such as MCMC for internal foraging. First, the distance between samples is often best described by a Levy flight distribution: the probability of the distance between two successive locations follows a power-law on the distances. Second, humans and other animals produce long-range, slowly decaying autocorrelations characterized as 1/f-like fluctuations, instead of the 1/f^2 fluctuations produced by random walks. We propose that mental sampling is not done by simple MCMC, but is instead adapted to multimodal representations and is implemented by Metropolis-coupled Markov chain Monte Carlo (MC3), one of the first algorithms developed for sampling from multimodal distributions. MC3 involves running multiple Markov chains in parallel but with target distributions of different temperatures, and it swaps the states of the chains whenever a better location is found. Heated chains more readily traverse valleys in the probability landscape to propose moves to far-away peaks, while the colder chains make the local steps that explore the current peak or patch. 
We show that MC3 generates distances between successive samples that follow a Levy flight distribution and produce 1/f-like autocorrelations, providing a single mechanistic account of these two puzzling empirical phenomena of internal foraging.", "full_text": "Mental Sampling in Multimodal Representations\n\nJian-Qiao Zhu\n\nDepartment of Psychology\n\nUniversity of Warwick\nj.zhu@warwick.ac.uk\n\nAdam N. Sanborn\n\nDepartment of Psychology\n\nUniversity of Warwick\n\na.n.sanborn@warwick.ac.uk\n\nNick Chater\n\nBehavioural Science Group\nWarwick Business School\nnick.chater@wbs.ac.uk\n\nAbstract\n\nBoth resources in the natural environment and concepts in a semantic space are\ndistributed \u201cpatchily\u201d, with large gaps in between the patches. To describe people\u2019s\ninternal and external foraging behavior, various random walk models have been\nproposed. In particular, internal foraging has been modeled as sampling: in order\nto gather relevant information for making a decision, people draw samples from a\nmental representation using random-walk algorithms such as Markov chain Monte\nCarlo (MCMC). However, two common empirical observations argue against peo-\nple using simple sampling algorithms such as MCMC for internal foraging. First,\nthe distance between samples is often best described by a L\u00e9vy \ufb02ight distribu-\ntion: the probability of the distance between two successive locations follows a\npower-law on the distances. Second, humans and other animals produce long-range,\nslowly decaying autocorrelations characterized as 1/f -like \ufb02uctuations, instead of\nthe 1/f 2 \ufb02uctuations produced by random walks. We propose that mental sampling\nis not done by simple MCMC, but is instead adapted to multimodal representations\nand is implemented by Metropolis-coupled Markov chain Monte Carlo (MC3), one\nof the \ufb01rst algorithms developed for sampling from multimodal distributions. 
MC3\ninvolves running multiple Markov chains in parallel but with target distributions\nof different temperatures, and it swaps the states of the chains whenever a better\nlocation is found. Heated chains more readily traverse valleys in the probability\nlandscape to propose moves to far-away peaks, while the colder chains make the\nlocal steps that explore the current peak or patch. We show that MC3 generates\ndistances between successive samples that follow a L\u00e9vy \ufb02ight distribution and\nproduce 1/f -like autocorrelations, providing a single mechanistic account of these\ntwo puzzling empirical phenomena of internal foraging.\n\n1\n\nIntroduction\n\nIn many complex domains, such as vision, motor control, language, categorization or common-sense\nreasoning, human behavior is consistent with the predictions of Bayesian models (e.g., [4, 39, 8, 3,\n19, 25, 52, 54]). Bayes\u2019 theorem prescribes a simple normative method for combining prior beliefs\nwith new information to make inferences about the world. However, the sheer number of hypotheses\nthat must be considered in complex domains makes exact Bayesian inference intractable. Instead\nit must be that individuals are performing some kind of approximate inference, such as sampling\n[49, 38].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fUsing sampling to approximate Bayesian models of complex problems makes many dif\ufb01cult compu-\ntations easy: instead of integrating over vast hypothesis spaces, samples of hypotheses can be drawn\nfrom the posterior distribution. The computational cost of sample-based approximations only scales\nwith the number of samples rather than the size of the hypothesis space, though using a small number\nof samples results in biased inferences.\n\nInterestingly, the biases in inference that are introduced by using a small number of samples match\nsome of the biases observed in human behavior. 
For example, probability matching [49], anchoring\neffects [26], and many reasoning fallacies [10, 38] can all be explained in this way. However, there\nis as of yet no consensus on the exact nature of the algorithm used to sample from human mental\nrepresentations.\n\nPrevious work has posited that people either use direct sampling or Markov chain Monte Carlo\n(MCMC) to sample from their posterior distribution over hypotheses [49, 26, 10, 38]. In this paper,\nwe demonstrate that these algorithms cannot explain two key empirical effects that have been found in\na wide variety of tasks. In particular, these algorithms do not produce distances between samples that\nfollow a L\u00e9vy \ufb02ight distribution, and separately they do not produce autocorrelations in the samples\nthat follow 1/f scaling. A further issue is that mental representations have been shown to be \u201cpatchy\u201d\nor multimodal \u2013 there are high probability regions separated by large regions of low probability \u2013 and\nMCMC is ill suited for multimodal distributions. We therefore evaluate one of the \ufb01rst algorithms\ndeveloped for sampling from multimodal probability distributions, Metropolis-coupled MCMC\n(MC3), and demonstrate that it produces both key empirical phenomena. Previously, L\u00e9vy \ufb02ight\ndistributions and 1/f scaling have been explained separately as the result of ef\ufb01cient search and the\nsignal of self-organizing behavior respectively [48, 46], and we provide the \ufb01rst account that can\nexplain both phenomena as the result of the same purposeful mental activity.\n\n1.1 Distances between mental samples: L\u00e9vy \ufb02ights\n\nIn the real world, resources are rarely distributed uniformly in the environment. Food, water, and other\ncritical natural resources often occur in spatially isolated patches with large gaps in between. 
As a result, humans' and other animals' foraging behaviors should be adapted to such patchy environments. In fact, foraging behavior has been observed to produce Lévy flights, a class of random walks whose step lengths follow a heavy-tailed power-law distribution [42]. In the Lévy flight distribution, the probability of executing a jump of length l is given by:

P(l) ∼ l^−µ    (1)

where 1 < µ ≤ 3; values µ ≤ 1 do not correspond to normalizable probability distributions. Examples of mobility patterns following the Lévy flight distribution have been recorded in albatrosses [47], marine predators [43], monkeys [35], and humans [18].

Lévy flights are advantageous in patchy environments where resources are sparsely and randomly distributed because the probability of returning to a previously visited target site is smaller than in a standard random walk. In the same patchy environment, Lévy flights can visit more new target sites than a random walk does [6]. More formally, it has been proven that the optimal foraging exponent is µ = 2 regardless of the dimensionality of the space if (a) the target sites are sparse, (b) they can be visited any number of times, and (c) the forager can only detect and remember nearby target sites [48].

It has long been known that mental representations of concepts are also patchy [7], and remarkably the distance between mental samples also follows a Lévy flight distribution. For example, in semantic fluency tasks (e.g., asking participants to name as many distinct animals as they can), the retrieved animals tend to form clusters (e.g., pets, water animals, African animals) [45].
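The step-length law in Equation 1 can be simulated by inverse-transform sampling: with P(l) ∝ l^−µ for l ≥ l_min, drawing u ∼ U(0, 1) and setting l = l_min(1 − u)^(−1/(µ−1)) yields the desired distribution. A minimal sketch (the cutoff l_min, sample size, and seed are illustrative choices, not values from the paper):

```python
import numpy as np

def levy_steps(n, mu=2.0, l_min=1.0, seed=0):
    """Draw n step lengths with P(l) ~ l^-mu for l >= l_min,
    via inverse-transform sampling of the power-law CDF."""
    rng = np.random.default_rng(seed)
    u = rng.random(n)
    return l_min * (1.0 - u) ** (-1.0 / (mu - 1.0))

# with mu = 2, most steps stay near l_min but the heavy tail
# occasionally produces very long jumps
steps = levy_steps(100_000, mu=2.0)
```

The occasional very long jump among many short local steps is exactly what makes Lévy flights efficient in patchy environments.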
This same task has also been found to produce Lévy flight distributions of inter-response intervals (IRI) [37].

As we are interested in sampling, which can retrieve the same item multiple times, rather than destructive foraging, we conducted a new memory retrieval experiment. Ten native English speakers were asked to type animal names in English as they came to mind, and participants were explicitly allowed to revisit animal names. A detailed description of the experimental procedure can be found in the Supplementary Material and a summary of the data appears in Figure 2A: participants showed power-law scaling of their inter-retrieval intervals (IRI), replicating the main finding of [37]. IRIs can be considered a rough measure of distance between samples, assuming that generating a sample takes a fixed amount of time, that there are unreported samples generated between each reported sample, and that the sampler has travelled further the more unreported samples are generated. As further support, we used a standard technique from computational linguistics to measure the distances between mental samples, again finding Lévy flight distributions for these distances (see the Supplementary Material for details and Table 1 for exponents).

1.2 Autocorrelations of mental samples: 1/f noise

Separate from investigations into the distances between mental samples, a number of studies have reported that many cognitive activities contain long-range, slowly decaying autocorrelations in time. These autocorrelations tend to follow a 1/f scaling law [24]:

C(k) ∼ k^−α    (2)

where C(k) is the autocorrelation function at temporal lag k.
The same phenomenon is often expressed in the frequency domain:

S(f) ∼ f^−α    (3)

where f is frequency, S(f) is the spectral power resulting from a Fourier analysis, and α ∈ [0.5, 1.5] is considered 1/f scaling.

1/f noise is also known as pink or flicker noise; its predictability is intermediate between white noise (no serial correlation, S(f) ∼ 1/f^0) and brown noise (no correlation between increments, S(f) ∼ 1/f^2). Note that Lévy flights (i.e., randomly selecting a flight direction and then executing a flight distance with power-law scaling as in Equation 1) are random walks and so produce 1/f^2 noise instead of 1/f noise (see Supplementary Material for details).

1/f-like autocorrelations in human cognition were first reported in time estimation and spatial interval estimation tasks in which participants were asked to repeatedly estimate a pre-determined time interval of 1 second or spatial interval of 1 inch [17].
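The classification in Equation 3 can be checked numerically by fitting the log-log slope of a periodogram: white noise should give α ≈ 0, while a random walk (brown noise) gives α ≈ 2. A minimal sketch (sequence length and seed are illustrative choices):

```python
import numpy as np

def spectral_slope(x):
    """Least-squares slope alpha of log S(f) vs log f, with S(f) ~ f^-alpha."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    power = np.abs(np.fft.rfft(x)) ** 2     # periodogram
    freqs = np.fft.rfftfreq(len(x))
    keep = freqs > 0                        # drop the DC component
    coef = np.polyfit(np.log(freqs[keep]), np.log(power[keep]), 1)
    return -coef[0]

rng = np.random.default_rng(0)
white = rng.standard_normal(2 ** 14)        # no serial correlation: alpha near 0
brown = np.cumsum(white)                    # random walk: alpha near 2
```

1/f scaling corresponds to fitted slopes between these two extremes.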
Subsequent studies have shown 1/f scaling laws in the response times of mental rotation, lexical decision, serial visual search, and parallel visual search [16], as well as in the time to switch between different percepts when looking at a bistable stimulus (i.e., a Necker cube [12]).

Table 1: Empirical evidence for Lévy flights and 1/f noise in human mental samples

Effect        Papers    Experiments                   Main findings
Lévy flight   [37]      Memory retrieval task         Power-law exponents, IRI: µ ∈ [1.37, 1.98]
              Current   Memory retrieval task         Power-law exponents, IRI: µ ∈ [0.77, 2.39];
                                                      distance: µ ∈ [0.76, 1.28]
1/f noise     [17]      Time interval estimation      Power spectra slopes α ∈ [0.90, 1.20]
                        Spatial interval estimation   Power spectra slope α = 1
              [16]      Mental rotation               RT power spectra slope α = 0.7
                        Lexical decision              RT power spectra slope α = 0.9
                        Serial search                 RT power spectra slope α = 0.7
                        Parallel search               RT power spectra slope α = 0.7

2 Mental sampling algorithms

Given that the distances between mental samples follow a Lévy flight distribution and that the samples have 1/f autocorrelations (see Table 1 for a summary), we now investigate which sampling algorithms can capture both aspects of human cognition.

We consider three possible sampling algorithms that might be employed in human cognition: Direct Sampling (DS), Random walk Metropolis (RwM), and Metropolis-coupled MCMC (MC3). We define DS as independently drawing samples in accord with the posterior probability distribution. DS is the most efficient of the three sampling algorithms, but it can only be applied to relatively simple examples as it often requires calculating intractable normalizing constants that scale exponentially with the dimensionality of the hypothesis space [28, 9].
DS has been used to explain biases in human cognition such as probability matching [49].

MCMC algorithms bypass the problem of the normalizing constant by simulating a Markov chain that transitions between states according only to the ratio of the probabilities of hypotheses [28]. We define RwM as a classical Metropolis-Hastings MCMC algorithm, which can be thought of as a random walker exploring the probability landscape of hypotheses, preferentially climbing the peaks of the posterior probability distribution [29, 21]. However, with a limited number of samples, RwM is very unlikely to reach modes in the probability distribution that are separated by large regions of low probability. This leads to biased approximations of the posterior distribution [44, 38]. Random walks have been used to model clustered responses in memory retrieval [1], and RwM in particular has been used to model multistable perception [13], the anchoring effect [26], and various reasoning biases [10, 38]. However, RwM will struggle with multimodal probability distributions.

Our third algorithm, MC3, also known as parallel tempering or replica-exchange MCMC, was one of the first algorithms to successfully tackle the problem of multimodality [14]. MC3 involves running M Markov chains in parallel, each at a different temperature: T_1, T_2, ..., T_M. In general, 1 = T_1 < T_2 < ... < T_M, and T_1 is the temperature of interest, at which the target distribution is unchanged. The purpose of the heated chains is to traverse valleys in the probability landscape to propose moves to far-away peaks (by sampling from heated target distributions π^(1/T)), while the colder chains make the local steps that explore the current probability peak or patch. MC3 decides whether to swap the states of two randomly chosen chains in every iteration [14].
In particular, swapping of chains i and j is accepted or rejected according to a Metropolis rule; hence the name Metropolis-coupled MCMC:

A_swap = min{1, [π(x_j)^(1/T_i) π(x_i)^(1/T_j)] / [π(x_i)^(1/T_i) π(x_j)^(1/T_j)]}    (4)

Coupling induces dependence among the chains, so each chain is no longer Markovian. The stationary distribution of the entire set of chains is thus ∏_{i=1}^{M} π^(1/T_i), but we only use samples from the cold chain (T = 1) to approximate the posterior distribution [14]. Pseudocode for MC3 is presented below. Note that MC3 reduces to RwM when the number of parallel chains M = 1.

Algorithm Metropolis-coupled Markov chain Monte Carlo
1:  Choose a starting point x_1.
2:  for t = 2 to L do
3:      for m = 1 to M do                                 ⊲ update all M chains
4:          Draw a candidate sample x′ ∼ N(x^m_{t−1}, σ)  ⊲ Gaussian proposal distribution
5:          Sample u ∼ U[0, 1]
6:          A_m = min{1, [π(x′)/π(x^m_{t−1})]^(1/T_m)}
7:          if u < A_m then x^m_t = x′ else x^m_t = x^m_{t−1} end if  ⊲ Metropolis acceptance rule
8:      end for
9:      repeat floor(M/2) times                           ⊲ swapping chains
10:         Randomly select two chains i, j without repetition
11:         Sample u ∼ U[0, 1]
12:         A_swap = min{1, [π(x^j_t)^(1/T_i) π(x^i_t)^(1/T_j)] / [π(x^i_t)^(1/T_i) π(x^j_t)^(1/T_j)]}
13:         if u < A_swap then swap(x^i_t, x^j_t) end if
14:     end repeat
15: end for

3 Results

In this section, we evaluate whether the two key empirical effects of Lévy flights and 1/f autocorrelations can be produced by the Direct Sampling (DS), Random walk Metropolis (RwM), and Metropolis-coupled MCMC (MC3) algorithms.

Figure 1: An example of searching behaviors in a 2D patchy environment. Each patch could represent a cluster of animal names. Repeated simulations of the samplers in different environments can be found in Figure 2. (Left Panel) Simulation result for DS.
The top panel shows the trajectory of the first 100 positions (red dots). The bottom panel shows the log-log plot of the flight distance distribution; the raw histogram of flight distances is also included. The power-law exponent is fitted using the LBN method, which corrects for irregular spacing of points [37]. (Middle Panel) The same treatment for the RwM sampler. The Gaussian proposal distribution had an identity covariance matrix. (Right Panel) The same treatment for the MC3 sampler with 8 parallel chains; only the positions of the cold chain are displayed. The Gaussian proposal distributions for all 8 chains had the same identity covariance matrix. For all three samplers considered here, only the first 1024 samples were used to match the length of human experiments.

3.1 Producing Lévy flights with sampling algorithms^1

To simulate the sampling algorithms, we use a spatial representation of semantics (rather than the graph structure used in semantic networks), and we justify this choice in the Supplementary Material. For generality, we first focus on simulating patchy environments without making detailed assumptions about any one participant's semantic space. In particular, we create a series of 2D environments using mixtures of N_mode = 15 Gaussians whose means are generated uniformly from [−r, r] in both dimensions, with r = 9 and the covariance matrix fixed to the identity for all components. This procedure produces patchy environments (for example, the top panel of Figure 1). We ran DS, RwM, and MC3 on this multimodal probability landscape, and the first 100 positions for each algorithm can be found in the top panel of Figure 1. The empirical flight distances were obtained by calculating the Euclidean distance between two consecutive positions of the sampler.
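For concreteness, the MC3 scheme described in Section 2 can be sketched in a few lines. Here a hypothetical one-dimensional two-mode target stands in for the paper's 2D 15-mode environments, and the temperature ladder, proposal width, and seed are illustrative choices:

```python
import numpy as np

def mc3(log_target, n_samples, temps=(1, 3, 9, 27), sigma=1.0, seed=0):
    """Metropolis-coupled MCMC: M tempered chains plus state swaps.
    Returns samples from the cold chain (T = 1) only."""
    rng = np.random.default_rng(seed)
    M = len(temps)
    x = np.zeros(M)                       # current state of each chain
    cold = []
    for _ in range(n_samples):
        for m in range(M):                # within-chain Metropolis updates
            prop = x[m] + sigma * rng.standard_normal()
            log_a = (log_target(prop) - log_target(x[m])) / temps[m]
            if np.log(rng.random()) < log_a:
                x[m] = prop
        if M > 1:                         # propose one swap (Equation 4)
            i, j = rng.choice(M, size=2, replace=False)
            log_a = ((log_target(x[j]) - log_target(x[i])) / temps[i]
                     + (log_target(x[i]) - log_target(x[j])) / temps[j])
            if np.log(rng.random()) < log_a:
                x[i], x[j] = x[j], x[i]
        cold.append(x[0])
    return np.array(cold)

# hypothetical bimodal target: equal unit Gaussians at -10 and +10
def log_target(x):
    return np.logaddexp(-0.5 * (x - 10) ** 2, -0.5 * (x + 10) ** 2)

samples = mc3(log_target, 8000)
```

Setting `temps=(1,)` removes the heated chains and the swap step, recovering plain RwM, per the note in Section 2.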
For MC3, only the positions of the cold chain (T = 1) were used.

Power-law distributions should produce straight lines in a log-log plot. To estimate the power-law exponents of flight distance, we used the normalized logarithmic binning (LBN) method, as it has higher accuracy than other methods [37, 48]. In LBN, flight distances are grouped into bins of logarithmically increasing size and the geometric midpoints of the bins are used for plotting the data. Figure 1 (bottom) shows that only MC3 can reproduce the distributional property of flight distance as a Lévy flight, with estimated power-law exponent ˆµ = 1.14. Both DS (ˆµ = −0.26) and RwM (ˆµ = 0.04) produced values outside the range of power-law exponents found in human data. Indeed, RwM produces a highly non-linear log-log plot, differing in form as well as exponent from a Lévy flight. In the Supplementary Material, we support this result by showing how sampling from a low-dimensional

^1 Relevant code can be found at the Open Science Framework: https://osf.io/26xb5/

Figure 2: (A) Animal naming task as non-destructive mental foraging (10 participants). The estimated power-law exponents for IRI are µ ∈ [0.77, 2.39]. (B) Estimated power-law exponents of the flight distance distributions for the three sampling algorithms across different patchy environments, manipulating the spatial sparsity of the Gaussian mixture. The dashed lines show the range of power-law exponents suggested by our human data. Only MC3 falls in this range. (C) KL divergence of mode visiting from the true distribution for the three sampling algorithms. Red denotes RwM, black denotes MC3, and blue denotes DS. The patchy environments are the same for all three algorithms. The quicker a sampler approaches zero KL divergence, the better it is at searching the patchy environment. The solid lines are medians of the dashed lines.
(D) Simulated standard MCMC with a power-law proposal distribution. The solid line shows the median estimated power-law exponent. The dashed lines show the range of human data.

semantic space representation of animal names with MC3 can produce Lévy flight exponents similar to those produced by participants for distances.

Note that only one run of all three samplers in a patchy environment is shown in Figure 1. We also ran the same samplers in different patchy environments to investigate the impact of spatial sparsity on the estimated power-law exponents (see Figure 2B). In this simulation, the same number of Gaussian components was used but the range r was varied: the higher r, the sparser the environment is likely to be. Spatial sparsity was formally defined as the mean distance between Gaussian modes. With small or moderate spatial sparsity we found a positive relationship between spatial sparsity and the estimated power-law exponents for both DS and MC3 (Figure 2B). In this range, only MC3 produced power-law exponents in the range reported in our mental foraging task, unlike DS and RwM. For all three algorithms, once spatial sparsity was too great, only a single mode was explored and no large jumps were made.

We then varied the hyperparameter values to test whether this result is robust. In particular, we sampled 4 different values each for the temperature spacing {0.5, 3, 7, 10} and the number of parallel chains {2, 4, 6, 10}, resulting in 16 combinations of hyperparameters. Intuitively, larger temperature spacing, more parallel chains, and a greater step size should lead to more explorative behavior of the sampler, and vice versa. Hence, for a given environmental structure, MC3 could tune these hyperparameters to balance between explorative and exploitative searches.
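The LBN procedure described above can be sketched as follows; this is our reading of the method in [37] (bin count, sample size, and seed are illustrative choices), shown recovering µ = 2 from synthetic power-law flights:

```python
import numpy as np

def lbn_exponent(flights, n_bins=20):
    """Estimate a power-law exponent by normalized logarithmic binning:
    bin step lengths into log-spaced bins, normalize counts by bin width,
    and fit a line to log density vs log geometric bin midpoint."""
    flights = np.asarray(flights, dtype=float)
    edges = np.logspace(np.log10(flights.min()),
                        np.log10(flights.max()), n_bins + 1)
    counts, _ = np.histogram(flights, bins=edges)
    density = counts / np.diff(edges)           # normalize by bin width
    mids = np.sqrt(edges[:-1] * edges[1:])      # geometric midpoints
    keep = density > 0                          # drop empty bins before taking logs
    slope, _ = np.polyfit(np.log(mids[keep]), np.log(density[keep]), 1)
    return -slope                               # mu in P(l) ~ l^-mu

# synthetic flights with true mu = 2, via inverse-transform sampling
flights = (1 - np.random.default_rng(1).random(200_000)) ** -1.0
mu_hat = lbn_exponent(flights)
```

The log-spaced bins give the sparse tail of the distribution roughly equal weight in the fit, which is what corrects for the irregular spacing of points on the log-log plot.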
For searches in the semantic space of animal names, we ran MC3 10 times and took the mean of the estimated power-law exponents; 62.50% of the hyperparameter combinations reproduced Lévy flights.

We also checked whether MC3 really is better suited than RwM to exploring patchy mental representations. In our simulated patchy environments, which used Gaussian mixtures with identity covariance matrices, an optimal sampling algorithm should visit each mode equally often, and hence produce a uniform distribution of visit frequencies over all the modes. To this end, the effectiveness of exploring such a mental representation can be examined by computing the Kullback-Leibler (KL) divergence [28] between the relative frequency with which an algorithm visited each mode and a uniform distribution over all modes:

D_KL(H_{1:t} || U) = Σ_{i=1}^{N_mode} H_{1:t}(i) log [ H_{1:t}(i) / (1/N_mode) ]    (5)

Figure 3: (A) Estimates of time duration show 1/f noise. The target durations for participants to estimate are shown next to the scatterplots, and ranged from 10 s (top) to 0.3 s (bottom). Best-fit power-law exponents to the power spectra are α ∈ [0.90, 1.20], which is also the range shown as dashed lines in Figure 3C. Figure adapted from [17]. (B) Power spectra produced by DS (left), RwM (middle), and MC3 (right). Only MC3 with 8 parallel chains can generate 1/f noise. For all the sampling algorithms, the first 1024 samples were used. (C) Estimated power-law exponents of the power spectra are related to the ratio between the Gaussian width and the proposal step size. The power-law exponents for the power spectra (ˆα) were fitted following the methods suggested by [17, 16]. The dashed lines show the range of 1/f^α suggested by [17]. Error bars indicate ±SEM. When the ratio is low, the acceptance rate of proposed samples should be low; the opposite holds when the ratio is high.
The asymptotic behavior of MC3 is 1/f noise, of RwM brown noise, and of DS white noise.

where U is a discrete uniform distribution, N_mode is the number of identical Gaussian components, and H_{1:t} is the empirical frequency of visited modes up to time t. Samples were assigned to the closest mode when determining these empirical frequencies. The faster the KL divergence for an algorithm reaches zero, the more effective the algorithm is at exploring the underlying environment; the DS algorithm serves as a benchmark for the other two algorithms. As shown in Figure 2C, MC3 catches up to DS, while RwM lags far behind in exploring this patchy environment.

We checked whether the negative results for RwM were due to the choice of proposal distribution by changing the Gaussian proposal distribution to a Lévy flight proposal distribution, which has a higher probability of larger steps. Using a Lévy flight proposal distribution will straightforwardly produce power-law flight distances if the posterior distribution is uniform over the entire space (i.e., every proposal will be accepted). However, in a patchy environment, a Lévy flight proposal distribution will not typically produce a Lévy flight distribution of distances between samples with estimated power-law exponents in the range of human data, as can also be seen in Figure 2D using different spatial sparsities. The reason for this is that the long jumps in the proposal distribution are unlikely to be successful: these long jumps often propose new states that lie in regions of nearly zero posterior probability.

3.2 Producing 1/f noise with sampling algorithms

A typical interval estimation task requires participants to repeatedly produce an estimate of the same target interval [17, 16].
For instance, participants were first given an example of a target interval (e.g., a 1 second time interval or a 1 inch spatial interval) and then repeatedly attempted to reproduce this target without feedback for up to 1000 trials. The time series produced by participants showed 1/f noise, with an exponent close to 1. However, the log-log plot of the human data is typically observed to flatten out for the highest frequencies [17]. This effect has been explained as the result of two processes: fractional Brownian motion combined with white noise due to motor errors at the highest frequencies [17].

We investigated how well our three sampling algorithms can explain the autocorrelations in this temporal estimation task (Figure 3A: [17]). Gaussian distributions were used as target distributions for all sampling algorithms because the distribution of responses produced by participants was indistinguishable from a Gaussian [17]. For temporal estimation, it is known that the Gaussian distributions of responses have a scalar property that resembles Weber's law: the ratio of the mean to the standard deviation is constant [34, 15]. For these simulations, we set this ratio between the mean and the standard deviation equal to 8 [34].

We then ran the sampling algorithms on the target durations tested by [17] (Figure 3B). Unlike in the simulations of distances between samples above, the time estimates produced by participants are themselves the quantities of interest, so we can directly compare them to the samples produced by the algorithms. RwM and MC3 were initiated at the mode of the Gaussian distribution, and there was no burn-in period in our simulations. As in [17], for all three algorithms we added Gaussian motor noise to each sample to fit the upswing in the plot at higher frequencies.
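As a sketch of this construction for DS on the 10 s target (mean/SD ratio of 8 and motor-noise SD of 0.1, as stated; the differencing of successive motor-noise terms is spelled out in the next paragraph):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024                                            # match the length used in the simulations
samples = 10 + (10 / 8) * rng.standard_normal(n)    # DS from the Gaussian target (Weber ratio 8)
motor = 0.1 * rng.standard_normal(n + 1)            # one motor-noise term per trial boundary
# each recorded estimate carries this trial's motor noise
# minus the motor noise of the previous trial
recorded = samples + motor[1:] - motor[:-1]
```

Differencing the motor noise boosts power at high frequencies (the spectrum of differenced white noise grows with frequency), producing the upswing seen in the empirical spectra.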
As each trial in the experiment started immediately after the previous trial ended, the recorded estimate equals the sample plus the motor noise of the current trial minus the motor noise of the previous trial, producing high-frequency autocorrelations. Our motor noise had a constant standard deviation of 0.1. Overall, the results show that only MC3 produces 1/f noise (ˆα ∈ [0.5, 1.5]), whereas DS tends to produce white noise (ˆα ∈ [0, 0.5]) and RwM is closest to brown noise (1/f^2: ˆα ∈ [1.5, 2]).

RwM tends to generate brown noise because, if every proposed sample is accepted, the algorithm reduces to a first-order autoregressive process (i.e., AR(1)) [53]. This can be seen numerically by running the sampling algorithms using different ratios of the target distribution and proposal distribution standard deviations (Figure 3C). To see this relationship more clearly, in Figure 3C we did not add any motor noise. When the Gaussian width (σ_target) becomes much greater than the width of the Gaussian proposal distribution (σ_proposal), RwM produces brown noise. In contrast, MC3 has a tendency to produce 1/f noise when the acceptance rate is high (Figure 3C, black line). It has been shown that the sum of as few as three AR(1) processes with widely distributed autoregressive coefficients produces an approximation to 1/f noise [51]. As the higher-temperature chains can be thought of as very roughly similar to AR(1) processes with lower autoregressive coefficients, this may explain why the asymptotic behavior of MC3 is 1/f noise.

Note that, from an effective-sample-size perspective, DS is clearly the best of the three sampling algorithms. The cognitive emission of 1/f noise is very suboptimal from a statistical standpoint, as it produces a smaller effective sample size than the independent samples drawn by DS or the mild autocorrelations found in RwM.
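The cited result [51] that a few AR(1) processes with widely spread coefficients approximate 1/f noise can be checked numerically; the coefficients, sequence length, and seed below are illustrative choices:

```python
import numpy as np

def ar1(phi, n, rng):
    """AR(1) process: x_t = phi * x_{t-1} + standard normal innovation."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x

rng = np.random.default_rng(0)
n = 2 ** 14
# sum of three AR(1) processes with widely distributed coefficients
mix = sum(ar1(phi, n, rng) for phi in (0.3, 0.9, 0.997))

# log-log periodogram slope: alpha should fall between
# white noise (alpha = 0) and brown noise (alpha = 2)
power = np.abs(np.fft.rfft(mix - mix.mean())) ** 2
freqs = np.fft.rfftfreq(n)
keep = freqs > 0
alpha = -np.polyfit(np.log(freqs[keep]), np.log(power[keep]), 1)[0]
```

Each AR(1) component contributes a flat spectrum below its own corner frequency and a 1/f^2 roll-off above it; spreading the corners across frequency stitches these together into an intermediate, 1/f-like slope.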
However, our sampling account provides a reason why the mind would produce 1/f noise: these long-range autocorrelations need to be tolerated in order to retain the possibility of generating samples from far-reaching modes.

We performed a similar robustness check for hyperparameter settings using the same 16 combinations as above. For search in the representation of temporal intervals, only the 10 s target interval was considered, as it shows the least influence of motor noise in the power spectra (see Figure 3A). 43.75% of the hyperparameter combinations reproduced 1/f noise. Combined, 18.75% of the combinations reproduced both Lévy flights in the animal naming task and 1/f noise in the time estimation task.

4 Discussion

Lévy flights are advantageous in a patchy world, and have been observed in many foraging tasks with humans and other animals. A random walk with Gaussian steps does not produce the occasional long-distance jump that a Lévy flight does. However, the swapping scheme between the parallel chains of MC3 enables it to produce Lévy-like scaling in the flight distance distribution. Additionally, MC3 produces the long-range, slowly decaying autocorrelations of 1/f scaling. This long-range dependence rules out any sampling algorithm that draws independent samples from the posterior distribution, such as DS, since the sample sequence would have no serial correlation (i.e., white noise). It also rules out RwM, because the current sample depends solely on the previous sample. Both of these results suggest that the algorithms people use to sample mental representations are more complex than DS or RwM, and, like MC3, are instead adapted to sampling from multimodal distributions.

However, if people are adapted to multimodal distributions, their behavior appears not to change even when they are actually sampling from a unimodal distribution.
In Gilden's experiments, the overall distribution of estimated intervals (i.e., ignoring serial order) was not multimodal; instead it was indistinguishable from a Gaussian distribution [17]. If the posterior distribution in the hypothesis space is also unimodal, then it is somewhat inefficient to use MC3 rather than simple MCMC. Potentially the brain is hardwired to use particular algorithms, or it is slow to adapt to unimodal representations because it is very difficult to know that a distribution is unimodal rather than just a single mode in a patchy space. Of course, it could be that, even if MC3 is always used, the number of chains or the temperature parameters are adapted to the task at hand. Additionally, a cognitive load manipulation might reduce the number of available chains and thus reduce exploration, which is an interesting prediction to test in future work.

Previous explanations of scale-free phenomena in human cognition, such as self-organized criticality, argue that 1/f noise is generated from the interactions of many simple processes that produce such hallmarks of complexity [46]. Other explanations assume that it is due to a mixture of scaled processes, like noise in attention or noise in our ability to perform cognitive tasks [50]. These approaches argue that 1/f noise is a general property of cognition, and do not tie it to other empirical effects. Our explanation of this scale-free behavior is more mechanistic, assuming that it reflects the cognitive need to gather vital informational resources from multimodal probability distributions. While autocorrelations make samplers less effective when sampling from simple distributions, they may need to be tolerated in multimodal distributions in order to sample other isolated modes.

An avenue for future work is to consider how MC3 might be implemented in the brain.
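The MC3 scheme under discussion (parallel Metropolis chains at different temperatures, occasional state swaps, with only the cold chain read out) can be made concrete in a few lines. The following is a minimal sketch; the bimodal Gaussian target, temperature ladder, and step size are illustrative assumptions, not the settings of the simulations reported above:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    """Log-density (up to a constant) of a toy bimodal target:
    two well-separated unit-variance Gaussian modes at -5 and +5."""
    return np.logaddexp(-0.5 * (x + 5) ** 2, -0.5 * (x - 5) ** 2)

def mc3(n_samples, temps=(1.0, 2.0, 4.0, 8.0), step=1.0):
    """Metropolis-coupled MCMC: one Metropolis chain per temperature T,
    each targeting pi(x)^(1/T), plus a swap proposal between a random
    adjacent pair of chains on every iteration. Only the cold (T = 1)
    chain is recorded as output."""
    x = np.zeros(len(temps))                 # one state per chain
    out = np.empty(n_samples)
    for i in range(n_samples):
        for k, T in enumerate(temps):        # within-chain local moves
            prop = x[k] + step * rng.standard_normal()
            if np.log(rng.random()) < (log_target(prop) - log_target(x[k])) / T:
                x[k] = prop
        k = rng.integers(len(temps) - 1)     # propose swapping chains k, k+1
        log_ratio = (1 / temps[k] - 1 / temps[k + 1]) * (
            log_target(x[k + 1]) - log_target(x[k]))
        if np.log(rng.random()) < log_ratio:
            x[k], x[k + 1] = x[k + 1], x[k]
        out[i] = x[0]                        # track the cold chain
    return out

samples = mc3(20000)
# The cold chain visits both modes; swaps supply the occasional long jump.
```

The swap acceptance ratio follows from detailed balance on the product chain: exchanging states x_k and x_{k+1} changes the joint density by pi(x_{k+1})^{1/T_k} pi(x_k)^{1/T_{k+1}} / (pi(x_k)^{1/T_k} pi(x_{k+1})^{1/T_{k+1}}), which is the log_ratio term above.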
Researchers have proposed a variety of mechanisms for how sampling algorithms could be implemented in the brain, and these mechanisms can account for many neural response properties [2, 5, 20, 23, 32, 41]. We are not aware of any implementations of MC3 in particular, but other work has proposed how multiple chains could be implemented in neural hardware [41]. Adapting this existing multiple-chain scheme to implement MC3 would require: 1) running the different chains at different temperatures, 2) tracking the cold chain for the output samples, and 3) implementing a mechanism for switching states (or, equivalently, switching temperatures) between chains.

While we have evaluated MC3 for internal sampling, it is interesting to consider whether it might describe some aspects of external search as well. Eye movements have been shown to produce both Lévy flights and 1/f noise, and the areas of interest in natural images are certainly multimodal [36]. Of course, we do not claim that MC3 is the only sampling algorithm that is able to produce both 1/f noise and Lévy flights. It is possible that other algorithms that deal better with multimodality than MCMC, such as running a single chain at different temperatures [31, 40] or Hamiltonian Monte Carlo [2, 11], could produce similar results. Future work will further explore which algorithms can match these key human data.

Acknowledgements

JQZ was supported by the China Scholarship Council. ANS was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1. NC was supported by ERC grant 295917-RATIONALITY, the ESRC Network for Integrated Behavioural Science [grant numbers ES/K002201/1 and ES/P008976/1], the Leverhulme Trust [grant number RP2012-V-022], and RCUK Grant EP/K039830/1.

References

[1] J. T. Abbott, J. L. Austerweil, and T. L. Griffiths. Human memory search as a random walk in a semantic network.
In Advances in Neural Information Processing Systems, pages 3050–3058, 2012.

[2] L. Aitchison and M. Lengyel. The Hamiltonian brain: efficient probabilistic inference with excitatory-inhibitory neural circuit dynamics. PLoS Computational Biology, 12(12):e1005186, 2016.

[3] J. R. Anderson. The adaptive nature of human categorization. Psychological Review, 98(3):409, 1991.

[4] P. W. Battaglia, J. B. Hamrick, and J. B. Tenenbaum. Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences, 110(45):18327–18332, 2013.

[5] P. Berkes, G. Orbán, M. Lengyel, and J. Fiser. Spontaneous cortical activity reveals hallmarks of an optimal internal model of the environment. Science, 331(6013):83–87, 2011.

[6] G. Berkolaiko, S. Havlin, H. Larralde, and G. Weiss. Expected number of distinct sites visited by N Lévy flights on a one-dimensional lattice. Physical Review E, 53(6):5774, 1996.

[7] W. A. Bousfield and C. H. W. Sedgewick. An analysis of sequences of restricted associative responses. The Journal of General Psychology, 30(2):149–165, 1944.

[8] N. Chater and C. D. Manning. Probabilistic models of language processing and acquisition. Trends in Cognitive Sciences, 10(7):335–344, 2006.

[9] N. Chater, J. B. Tenenbaum, and A. Yuille. Probabilistic models of cognition: Conceptual foundations. Trends in Cognitive Sciences, 10(7):287–291, 2006.

[10] I. Dasgupta, E. Schulz, and S. J. Gershman. Where do hypotheses come from? Cognitive Psychology, 96:1–25, 2017.

[11] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

[12] J. Gao, V. A. Billock, I. Merk, W. Tung, K. D. White, J. Harris, and V. P. Roychowdhury. Inertia and memory in ambiguous visual perception. Cognitive Processing, 7(2):105–112, 2006.

[13] S. J. Gershman, E. Vul, and J. B.
Tenenbaum. Multistability and perceptual inference. Neural Computation, 24(1):1–24, 2012.

[14] C. Geyer. Markov chain Monte Carlo maximum likelihood. In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pages 156–163. Interface Foundation, Fairfax Station, 1991.

[15] J. Gibbon. Scalar expectancy theory and Weber's law in animal timing. Psychological Review, 84(3):279, 1977.

[16] D. L. Gilden. Fluctuations in the time required for elementary decisions. Psychological Science, 8(4):296–301, 1997.

[17] D. L. Gilden, T. Thornton, and M. W. Mallon. 1/f noise in human cognition. Science, 267(5205):1837, 1995.

[18] M. C. Gonzalez, C. A. Hidalgo, and A.-L. Barabasi. Understanding individual human mobility patterns. Nature, 453(7196):779–782, 2008.

[19] T. L. Griffiths and J. B. Tenenbaum. Predicting the future as Bayesian inference: people combine prior knowledge with observations when estimating duration and extent. Journal of Experimental Psychology: General, 140(4):725, 2011.

[20] R. M. Haefner, P. Berkes, and J. Fiser. Perceptual decision-making as probabilistic inference by neural sampling. Neuron, 90(3):649–660, 2016.

[21] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

[22] T. T. Hills, M. N. Jones, and P. M. Todd. Optimal foraging in semantic memory. Psychological Review, 119(2):431, 2012.

[23] P. O. Hoyer and A. Hyvärinen. Interpreting neural response variability as Monte Carlo sampling of the posterior. In Advances in Neural Information Processing Systems, pages 293–300, 2003.

[24] C. T. Kello, G. D. Brown, R. Ferrer-i-Cancho, J. G. Holden, K. Linkenkaer-Hansen, T. Rhodes, and G. C. Van Orden. Scaling laws in cognitive sciences. Trends in Cognitive Sciences, 14(5):223–232, 2010.

[25] C. Kemp and J. B. Tenenbaum.
Structured statistical models of inductive reasoning. Psychological Review, 116(1):20, 2009.

[26] F. Lieder, T. Griffiths, and N. Goodman. Burn-in, bias, and the rationality of anchoring. In Advances in Neural Information Processing Systems, pages 2690–2798, 2012.

[27] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[28] D. J. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

[29] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

[30] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[31] R. M. Neal. Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6(4):353–366, 1996.

[32] G. Orbán, P. Berkes, J. Fiser, and M. Lengyel. Neural variability and sampling-based probabilistic representations in the visual cortex. Neuron, 92(2):530–543, 2016.

[33] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[34] B. C. Rakitin, J. Gibbon, T. B. Penney, C. Malapani, S. C. Hinton, and W. H. Meck. Scalar expectancy theory and peak-interval timing in humans. Journal of Experimental Psychology: Animal Behavior Processes, 24(1):15, 1998.

[35] G. Ramos-Fernández, J. L. Mateos, O. Miramontes, G. Cocho, H. Larralde, and B. Ayala-Orozco.
L\u00e9vy walk patterns in the foraging movements of spider monkeys (Ateles geoffroyi).\nBehavioral Ecology and Sociobiology, 55(3):223\u2013230, 2004.\n\n[36] T. Rhodes, C. T. Kello, and B. Kerster. Distributional and temporal properties of eye movement\ntrajectories in scene perception. In 33th Annual Meeting of the Cognitive Science Society, 2011.\n\n[37] T. Rhodes and M. T. Turvey. Human memory retrieval as L\u00e9vy foraging. Physica A: Statistical\n\nMechanics and its Applications, 385(1):255\u2013260, 2007.\n\n[38] A. N. Sanborn and N. Chater. Bayesian brains without probabilities. Trends in Cognitive\n\nSciences, 20(12):883\u2013893, 2016.\n\n[39] A. N. Sanborn, V. K. Mansinghka, and T. L. Grif\ufb01ths. Reconciling intuitive physics and\n\nnewtonian mechanics for colliding objects. Psychological Review, 120(2):411, 2013.\n\n[40] C. Savin, P. Dayan, and M. Lengyel. Optimal recall from bounded metaplastic synapses:\npredicting functional adaptations in hippocampal area ca3. PLoS Computational Biology,\n10(2):e1003489, 2014.\n\n[41] C. Savin and S. Deneve. Spatio-temporal representations of uncertainty in spiking neural\n\nnetworks. In Advances in Neural Information Processing Systems, pages 2024\u20132032, 2014.\n\n[42] M. F. Shlesinger, G. M. Zaslavsky, and U. Frisch. L\u00e9vy \ufb02ights and related topics in physics.\n\nLecture Notes in Physics, 450:52, 1995.\n\n[43] D. W. Sims, E. J. Southall, N. E. Humphries, G. C. Hays, C. J. Bradshaw, J. W. Pitchford,\nA. James, M. Z. Ahmed, A. S. Brierley, M. A. Hindell, et al. Scaling laws of marine predator\nsearch behaviour. Nature, 451(7182):1098\u20131102, 2008.\n\n[44] R. H. Swendsen and J.-S. Wang. Replica Monte Carlo simulation of spin-glasses. Physical\n\nReview Letters, 57(21):2607, 1986.\n\n[45] A. K. Troyer, M. Moscovitch, and G. Winocur. Clustering and switching as two components of\nverbal \ufb02uency: evidence from younger and older healthy adults. Neuropsychology, 11(1):138,\n1997.\n\n11\n\n\f[46] G. 
C. Van Orden, J. G. Holden, and M. T. Turvey. Self-organization of cognitive performance. Journal of Experimental Psychology: General, 132(3):331, 2003.

[47] G. M. Viswanathan, V. Afanasyev, S. Buldyrev, E. Murphy, et al. Lévy flight search patterns of wandering albatrosses. Nature, 381(6581):413, 1996.

[48] G. M. Viswanathan, S. V. Buldyrev, S. Havlin, M. Da Luz, E. Raposo, and H. E. Stanley. Optimizing the success of random searches. Nature, 401(6756):911–914, 1999.

[49] E. Vul, N. Goodman, T. L. Griffiths, and J. B. Tenenbaum. One and done? Optimal decisions from very few samples. Cognitive Science, 38(4):599–637, 2014.

[50] E.-J. Wagenmakers, S. Farrell, and R. Ratcliff. Estimation and interpretation of 1/f^α noise in human cognition. Psychonomic Bulletin & Review, 11(4):579–615, 2004.

[51] L. M. Ward. Dynamical Cognitive Science. MIT Press, 2002.

[52] D. M. Wolpert. Probabilistic models in human sensorimotor control. Human Movement Science, 26(4):511–524, 2007.

[53] J. Xu and T. L. Griffiths. How memory biases affect information transmission: A rational analysis of serial reproduction. In Advances in Neural Information Processing Systems, pages 1809–1816, 2009.

[54] A. Yuille and D. Kersten. Vision as Bayesian inference: analysis by synthesis? Trends in Cognitive Sciences, 10(7):301–308, 2006.

A Lévy flights do not generate 1/f noise

In a Lévy flight, the direction of each flight is selected at random, but the flight distance is distributed according to a power law [42, 47]. In a one-dimensional space, whether to move left or right is selected with equal probability, and the flight distance is l ∼ U^(-1/(µ-1)), where U is the uniform distribution on [0, 1].
This procedure guarantees that the distribution of flight distances follows a power law with exponent µ.

Figure 4: Autocorrelations produced by a Lévy flight. (Left) The traceplot of the first 1024 locations of the Lévy flight. (Right) The power spectra of the locations.

In Figure 4, we simulated a Lévy flight and applied the same power spectrum analysis to the traceplot as in the main text. Lévy flights have independent increments, so each location depends only on the previous location, and indeed the simulated Lévy flight produced 1/f^2 noise (α̂ = 2.02).

B Methods of the animal naming task

Ten native English speakers (6 female, 4 male, aged 19–25 years) were recruited from the SONA system of Warwick University (Coventry, UK). The experiment lasted about 60 minutes or until the participant had typed 1024 words. Participants sat in a soundproof cubicle for this task, and were paid £6 for the experiment.
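The Lévy flight generator described in Appendix A amounts to inverse-CDF sampling of a power law. A minimal sketch (the seed and µ = 2 are illustrative choices, not the simulation settings of the paper):

```python
import numpy as np

def levy_flight(n_steps, mu=2.0, seed=0):
    """1-D Levy flight as in Appendix A: each step picks a direction in
    {-1, +1} with equal probability and a flight distance
    l = U^(-1/(mu-1)) with U ~ Uniform(0, 1), so that P(l) follows a
    power law with exponent mu (and every flight distance is >= 1)."""
    rng = np.random.default_rng(seed)
    u = rng.random(n_steps)
    lengths = u ** (-1.0 / (mu - 1.0))
    signs = rng.choice([-1.0, 1.0], size=n_steps)
    return np.cumsum(signs * lengths)   # sequence of visited locations

path = levy_flight(1024)
distances = np.abs(np.diff(path))       # flight distances between successive locations
print(distances.min() >= 1.0)           # prints True: every flight covers at least unit distance
```

The power-law claim follows from the inverse CDF: P(l > x) = P(U < x^(-(µ-1))) = x^(-(µ-1)), so the flight distances have density proportional to l^(-µ), as stated in Appendix A.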