{"title": "Human Memory Search as Initial-Visit Emitting Random Walk", "book": "Advances in Neural Information Processing Systems", "page_first": 1072, "page_last": 1080, "abstract": "Imagine a random walk that outputs a state only when visiting it for the first time. The observed output is therefore a repeat-censored version of the underlying walk, and consists of a permutation of the states or a prefix of it. We call this model initial-visit emitting random walk (INVITE). Prior work has shown that the random walks with such a repeat-censoring mechanism explain well human behavior in memory search tasks, which is of great interest in both the study of human cognition and various clinical applications. However, parameter estimation in INVITE is challenging, because naive likelihood computation by marginalizing over infinitely many hidden random walk trajectories is intractable. In this paper, we propose the first efficient maximum likelihood estimate (MLE) for INVITE by decomposing the censored output into a series of absorbing random walks. We also prove theoretical properties of the MLE including identifiability and consistency. 
We show that INVITE outperforms several existing methods on real-world human response data from memory search tasks.", "full_text": "Human Memory Search as Initial-Visit Emitting Random Walk

Kwang-Sung Jun∗, Xiaojin Zhu†, Timothy Rogers‡
∗Wisconsin Institute for Discovery, †Department of Computer Sciences, ‡Department of Psychology
University of Wisconsin-Madison
kjun@discovery.wisc.edu, jerryzhu@cs.wisc.edu, ttrogers@wisc.edu

Zhuoran Yang
Department of Mathematical Sciences
Tsinghua University
yzr11@mails.tsinghua.edu.cn

Ming Yuan
Department of Statistics
University of Wisconsin-Madison
myuan@stat.wisc.edu

Abstract

Imagine a random walk that outputs a state only when visiting it for the first time. The observed output is therefore a repeat-censored version of the underlying walk, and consists of a permutation of the states or a prefix of it. We call this model initial-visit emitting random walk (INVITE). Prior work has shown that the random walks with such a repeat-censoring mechanism explain well human behavior in memory search tasks, which is of great interest in both the study of human cognition and various clinical applications. However, parameter estimation in INVITE is challenging, because naive likelihood computation by marginalizing over infinitely many hidden random walk trajectories is intractable. In this paper, we propose the first efficient maximum likelihood estimate (MLE) for INVITE by decomposing the censored output into a series of absorbing random walks.
We also\nprove theoretical properties of the MLE including identi\ufb01ability and consistency.\nWe show that INVITE outperforms several existing methods on real-world human\nresponse data from memory search tasks.\n\n1 Human Memory Search as a Random Walk\nA key goal for cognitive science has been to understand the mental structures and processes that\nunderlie human semantic memory search. Semantic \ufb02uency has provided the central paradigm for\nthis work: given a category label as a cue (e.g. animals, vehicles, etc.) participants must generate as\nmany example words as possible in 60 seconds without repetition. The task is useful because, while\nexceedingly easy to administer, it yields rich information about human semantic memory. Partici-\npants do not generate responses in random order but produce \u201cbursts\u201d of related items, beginning\nwith the highly frequent and prototypical, then moving to subclusters of related items. This ordinal\nstructure sheds light on associative structures in memory: retrieval of a given item promotes retrieval\nof a related item, and so on, so that the temporal proximity of items in generated lists re\ufb02ects the\ndegree to which the two items are related in memory [14, 5]. The task also places demands on other\nimportant cognitive contributors to memory search: for instance, participants must retain a men-\ntal trace of previously-generated items and use it to refrain from repetition, so that the task draws\nupon working memory and cognitive control in addition to semantic processes. For these reasons\nthe task is a central tool in all commonly-used metrics for diagnosing cognitive dysfunction (see\ne.g. [6]). Performance is generally sensitive to a variety of neurological disorders [19], but different\nsyndromes also give rise to different patterns of impairment, making it useful for diagnosis [17]. 
For these reasons the task has been widely employed both in basic science and applied health research. Nevertheless, the representations and processes that support category fluency remain poorly understood. Beyond the general observation that responses tend to be clustered by semantic relatedness, it is not clear what ordinal structure in produced responses reveals about the structure of human semantic memory, in either healthy or disordered populations. In the past few years researchers in cognitive science have begun to fill this gap by considering how search models from other domains of science might explain patterns of responses observed in fluency tasks [12, 13, 15]. We review related works in Section 4.

In the current work we build on these advances by considering, not how search might operate on a pre-specified semantic representation, but rather how the representation itself can be learned from data (i.e., human-produced semantic fluency lists) given a specified model of the list-generation process. Specifically, we model search as a random walk on a set of states (e.g. words) where the transition probability indicates the strength of association in memory, and with the further constraint that node labels are only generated when the node is first visited. Thus, repeated visits are censored in the output. We refer to this generative process as the initial-visit emitting (INVITE) random walk. The repeat-censoring mechanism of INVITE was first employed in Abbott et al. [1]. However, their work did not provide a tractable method to compute the likelihood nor to estimate the transition matrix from the fluency responses. The problem of estimating the underlying Markov chain from the lists so produced is nontrivial because once the first two items in a list have been produced there may exist infinitely many pathways that lead to production of the next item. 
For instance, consider the produced sequence "dog" → "cat" → "goat" where the underlying graph is fully connected. Suppose a random walk visits "dog" then "cat". The walk can then visit "dog" and "cat" arbitrarily many times before visiting "goat"; there exist infinitely many walks that output the given sequence. How can the transition probabilities of the underlying random walk be learned?

A solution to this problem would represent a significant advance over prior works that estimate parameters from a separate source such as a standard text corpus [13]. First, one reason for verbal fluency's enduring appeal has been that the task appears to reveal important semantic structure that may not be discoverable by other means. It is not clear that methods for estimating semantic structure based on another corpus do a very good job at modelling the structure of human semantic representations generally [10], or that they would reveal the same structures that govern behavior specifically in this widely-used fluency task. Second, the representational structures employed can vary depending upon the fluency category. For instance, the probability of producing "chicken" after "goat" will differ depending on whether the task involves listing "animals", "mammals", or "farm animals". Simply estimating a single structure from the same corpus will not capture these task-based effects. Third, special populations, including neurological patients and developing children, may generate lists from quite different underlying mental representations, which cannot be independently estimated from a standard corpus.

In this work, we make two important contributions on the INVITE random walk. First, we propose a tractable way to compute the INVITE likelihood. 
Our key insight in computing the likelihood is\nto turn INVITE into a series of absorbing random walks. This formulation allows us to leverage the\nfundamental matrix [7] and compute the likelihood in polynomial time. Second, we show that the\nMLE of INVITE is consistent, which is non-trivial given that the convergence of the log likelihood\nfunction is not uniform. We formally de\ufb01ne INVITE and present the two main contributions as\nwell as an ef\ufb01cient optimization method to estimate the parameters in Section 2. In Section 3, we\napply INVITE to both toy data and real-world \ufb02uency data. On toy data our experiments empirically\ncon\ufb01rm the consistency result. On actual human responses from verbal \ufb02uency INVITE outperforms\noff-the-shelf baselines. The results suggest that INVITE may provide a useful tool for investigating\nhuman cognitive functions.\n\n2 The INVITE Random Walk\n\nINVITE is a probabilistic model with the following generative story. Consider a random walk on\na set of n states S with an initial distribution \u03c0 > 0 (entry-wise) and an arbitrary transition matrix\nP where Pij is the probability of jumping from state i to j. A surfer starts from a random initial\nstate drawn from \u03c0. She outputs a state if it is the \ufb01rst time she visits that state. Upon arriving\nat an already visited state, however, she does not output the state. The random walk continues\ninde\ufb01nitely. Therefore, the output consists of states in the order of their \ufb01rst-visit; the underlying\nentire walk trajectory is hidden. We further assume that the time step of each output is unobserved.\nFor example, consider the random walk over four states in Figure 1(a). If the underlying random\nwalk takes the trajectory (1, 2, 1, 3, 1, 2, 1, 4, 1, . . 
.), the observation is (1, 2, 3, 4).

Figure 1: (a-c) Example Markov chains (d) Example nonconvexity of the INVITE log likelihood

We say that the observation produced by INVITE is a censored list, since non-initial visits are censored. It is easy to see that a censored list is a permutation of the n states or a prefix thereof (more on this later). We denote a censored list by a = (a_1, a_2, . . . , a_M) where M ≤ n. A censored list is not Markovian, since the probability of a transition in a censored list depends on the whole history rather than just the current state. It is worth noting that INVITE is distinct from Broder's algorithm for generating random spanning trees [4], the self-avoiding random walk [9], and cascade models of infection. We discuss the technical differences from related works in Section 4.

We characterize the type of output INVITE is capable of producing, given that the underlying uncensored random walk continues indefinitely. A state s is said to be transient if a random walk starting from s has nonzero probability of not returning to itself in finite time, and recurrent if such probability is zero. A set of states A is closed if a walk cannot exit A; i.e., if i ∈ A and j ∉ A, then a random walk from i cannot reach j. A set of states B is irreducible if there exists a path between every pair of states in B; i.e., if i, j ∈ B, then a random walk from i can reach j. Define [M] = {1, 2, . . . , M}. We use a_{1:M} as a shorthand for a_1, . . . , a_M. Theorem 1 states that a finite-state Markov chain can be uniquely decomposed into disjoint sets, and Theorem 2 states what a censored list should look like. All proofs are in the supplementary material.

Theorem 1. [8] If the state space S is finite, then S can be written as a disjoint union T ∪ W_1 ∪ . . . ∪ W_K, where T is a set of transient states that is possibly empty and each W_k, k ∈ [K], is a nonempty closed irreducible set of recurrent states.

Theorem 2. Consider a Markov chain P with the decomposition S = T ∪ W_1 ∪ . . . ∪ W_K as in Theorem 1. A censored list a = (a_{1:M}) generated by INVITE on P has zero or more transient states, followed by all states in one and only one closed irreducible set. That is, ∃ℓ ∈ [M] s.t. {a_{1:ℓ−1}} ⊆ T and {a_{ℓ:M}} = W_k for some k ∈ [K].

As an example, when the graph is fully connected INVITE is capable of producing all n! permutations of the n states as the censored lists. As another example, in Figure 1 (b) and (c), both chains have two transient states T = {1, 2} and two recurrent states W_1 = {3, 4}. In (b) there is no path that visits both 1 and 2, and thus every censored list must be a prefix of a permutation. However, (c) has a path that visits both 1 and 2, and thus can generate (1, 2, 3, 4), a full permutation.

In general, each INVITE run generates a permutation of the n states, or a prefix of a permutation. Let Sym(n) be the symmetric group on [n]. Then, the data space D of censored lists is D ≡ {(a_{1:k}) | a ∈ Sym(n), k ∈ [n]}.

2.1 Computing the INVITE likelihood

Learning and inference under the INVITE model is challenging due to its likelihood function. A naive method to compute the probability of a censored list a given π and P is to sum over all uncensored random walk trajectories x which produce a: P(a; π, P) = ∑_{x produces a} P(x; π, P). This naive computation is intractable, since the summation can be over an infinite number of trajectories x that might have produced the censored list a. For example, consider the censored list a = (1, 2, 3, 4) generated from Figure 1(a). There are infinitely many uncensored trajectories that produce a by visiting states 1 and 2 arbitrarily many times before visiting state 3, and later state 4.

The likelihood of π and P on a censored list a is

  P(a; π, P) = π_{a_1} ∏_{k=1}^{M−1} P(a_{k+1} | a_{1:k}; P)  if a cannot be extended;  0 otherwise.    (1)

Note we assign zero probability to a censored list that is not completed yet, since the underlying random walk must run forever. We say a censored list a is valid (invalid) under π and P if P(a; π, P) > 0 (= 0).

We first review the fundamental matrix of an absorbing random walk. A state that transits to itself with probability 1 is called an absorbing state. Given a Markov chain P with absorbing states, we can rearrange the states into

  P′ = ( Q  R
         0  I ),

where Q is the transition matrix between the nonabsorbing states, R is the transition matrix from the nonabsorbing states to the absorbing states, and the remaining blocks trivially represent the absorbing states. Theorem 3 presents the fundamental matrix, the essential tool for the tractable computation of the INVITE likelihood.

Theorem 3. [7] The fundamental matrix of the Markov chain P′ is N = (I − Q)^{−1}. N_{ij} is the expected number of times that a chain visits state j before absorption when starting from i. Furthermore, define B = (I − Q)^{−1}R. Then, B_{ik} is the probability of a chain starting from i being absorbed by k. In other words, B_{i·} is the absorption distribution of a chain starting from i.

As a tractable way to compute the likelihood, we propose a novel formulation that turns an INVITE random walk into a series of absorbing random walks. 
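To make the decomposition concrete, here is a minimal sketch in code (ours, not the authors' implementation): for each step k it forms the transient block over a_{1:k}, solves the linear system (I − Q^(k))x = r, where r is the column of R^(k) for the next output, and accumulates the log of the resulting absorption probability. The small Gaussian-elimination solver and the fully connected example chain are illustrative choices, not from the paper.

```python
import math

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def invite_loglik(a, pi, P):
    """Prefix log likelihood of a censored list a under INVITE(pi, P).

    Step k keeps a[0:k] nonabsorbing and makes every other state absorbing;
    the k-th factor is the probability that a walk restarted at a[k-1] is
    absorbed at a[k], i.e. an entry of (I - Q)^{-1} R."""
    ll = math.log(pi[a[0]])
    for k in range(1, len(a)):
        trans = a[:k]
        # transient block I - Q over the already-emitted states a[0:k]
        ImQ = [[(1.0 if i == j else 0.0) - P[s][t]
                for j, t in enumerate(trans)] for i, s in enumerate(trans)]
        # column of R for the absorbing state a[k]
        r = [P[s][a[k]] for s in trans]
        absorb = solve(ImQ, r)         # absorption probs from each transient state
        ll += math.log(absorb[k - 1])  # the walk resumes from the previous output
    return ll

# Fully connected 3-state chain without self-transitions, uniform start.
P = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
pi = [1 / 3] * 3
print(invite_loglik([0, 1, 2], pi, P))  # log(1/3) + log(1/2) + log(1)
```

Each step costs one k × k solve, in line with the O(k³) per-step cost noted later for the gradient; one can also check numerically that replacing P by diag(q) + (I − diag(q))P leaves the value unchanged.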
Although INVITE itself is not an absorbing random walk, each segment that produces the next item in the censored list can be modeled as one. That is, for each k = 1, . . . , M − 1, consider the segment of the uncensored random walk starting from the previous output a_k until the next output a_{k+1}. For this segment, we construct an absorbing random walk by keeping a_{1:k} nonabsorbing and turning the rest into absorbing states. A random walk starting from a_k is eventually absorbed by a state in S \ {a_{1:k}}. The probability of being absorbed by a_{k+1} is exactly the probability of outputting a_{k+1} after outputting a_{1:k} in INVITE. Formally, we construct an absorbing random walk

  P^(k) = ( Q^(k)  R^(k)
            0      I     ),    (2)

where the states are ordered as a_{1:M}. Corollary 1 summarizes our computation of the INVITE likelihood.

Corollary 1. The k-th step INVITE likelihood for k ∈ [M − 1] is

  P(a_{k+1} | a_{1:k}, P) = [(I − Q^(k))^{−1} R^(k)]_{k,1}  if (I − Q^(k))^{−1} exists;  0 otherwise.    (3)

Suppose we observe m independent realizations of INVITE: D_m = {(a^(1)_1, . . . , a^(1)_{M_1}), . . . , (a^(m)_1, . . . , a^(m)_{M_m})}, where M_i is the length of the i-th censored list. Then, the INVITE log likelihood is ℓ(π, P; D_m) = ∑_{i=1}^m log P(a^(i); π, P).

2.2 Consistency of the MLE

Identifiability is an essential property for a model to be consistent. Theorem 4 shows that allowing self-transitions in P causes INVITE to be unidentifiable. Then, Theorem 5 presents a remedy. The proofs of both theorems are presented in our supplementary material. Let diag(q) be a diagonal matrix whose i-th diagonal entry is q_i.

Theorem 4. Let P be an n × n transition matrix without any self-transitions (P_{ii} = 0, ∀i), and q ∈ [0, 1)^n. Define P′ = diag(q) + (I − diag(q))P, a scaled transition matrix with self-transition probabilities q. Then, P(a; π, P) = P(a; π, P′) for every censored list a.

For example, consider a censored list a = (1, j) where j ≠ 1. Using the fundamental matrix, P(a_2 | a_1; P) = (1 − P_{11})^{−1} P_{1j} = (∑_{j′≠1} P_{1j′})^{−1} P_{1j} = (∑_{j′≠1} c P_{1j′})^{−1} c P_{1j}, ∀c. This implies that multiplying a constant c into P_{1j} for all j ≠ 1 and renormalizing the first row P_{1·} to sum to 1 does not change the likelihood.

Theorem 5. Assume the initial distribution π > 0 elementwise. In the space of transition matrices P without self-transitions, INVITE is identifiable.

Let ∆^{n−1} = {p ∈ R^n | p_i ≥ 0, ∀i, ∑_i p_i = 1} be the probability simplex. For brevity, we pack the parameters of INVITE into one vector θ as follows: θ ∈ Θ = {(π^⊤, P_{1·}, . . . , P_{n·})^⊤ | π, P_{i·} ∈ ∆^{n−1}, P_{ii} = 0, ∀i}. Let θ* = (π*^⊤, P*_{1·}, . . . , P*_{n·})^⊤ ∈ Θ be the true model. Given a set of m censored lists D_m generated from θ*, the average log likelihood function and its pointwise limit are

  Q̂_m(θ) = (1/m) ∑_{i=1}^m log P(a^(i); θ)   and   Q*(θ) = ∑_{a∈D} P(a; θ*) log P(a; θ).    (4)

For brevity, we assume that the true model θ* is strongly connected; the analysis can easily be extended to remove this assumption. Under Assumption A1, Theorem 6 states the consistency result.

Assumption A1. Let θ* = (π*^⊤, P*_{1·}, . . . , P*_{n·})^⊤ ∈ Θ be the true model. π* has no zero entries. Furthermore, P* is strongly connected.

Theorem 6. Assume A1. The MLE of INVITE, θ̂_m ≡ argmax_{θ∈Θ} Q̂_m(θ), is consistent.

We provide a sketch here. The proof relies on Lemma 6 and Lemma 2, which are presented in our supplementary material. Since Θ is compact, the sequence {θ̂_m} has a convergent subsequence {θ̂_{m_j}}. Let θ′ = lim_{j→∞} θ̂_{m_j}. Since Q̂_{m_j}(θ*) ≤ Q̂_{m_j}(θ̂_{m_j}),

  Q*(θ*) = lim_{j→∞} Q̂_{m_j}(θ*) ≤ lim_{j→∞} Q̂_{m_j}(θ̂_{m_j}) = Q*(θ′),

where the last equality is due to Lemma 6. By Lemma 2, θ* is the unique maximizer of Q*, which implies θ′ = θ*. Note that the subsequence was chosen arbitrarily. Since every convergent subsequence converges to θ*, θ̂_m converges to θ*.

2.3 Parameter Estimation via Regularized Maximum Likelihood

We present a regularized MLE (RegMLE) of INVITE. We first extend the censored lists that we consider. We now allow the underlying walk to terminate after finitely many steps, because in real-world applications the observed censored lists are often truncated. That is, the underlying random walk can be stopped before exhausting every state the walk could visit. For example, in verbal fluency, participants have limited time to produce a list. Consequently, we use the prefix likelihood, and we find the RegMLE by maximizing the prefix log likelihood plus a regularization term on π and P. Note that π and P can be separately optimized. 
The prefix likelihood is

  L(a; π, P) = π_{a_1} ∏_{k=1}^{M−1} P(a_{k+1} | a_{1:k}; P).    (5)

For π, we place a Dirichlet prior and find the maximum a posteriori (MAP) estimator π̂ by π̂_j ∝ ∑_{i=1}^m 1{a^(i)_1 = j} + C_π, ∀j.

Directly computing the RegMLE of P requires solving a constrained optimization problem, because the transition matrix P must be row stochastic. We re-parametrize P, which leads to a more convenient unconstrained optimization problem. Let β ∈ R^{n×n}. We exponentiate β and row-normalize it to derive P: P_{ij} = e^{β_{ij}} / ∑_{j′=1}^n e^{β_{ij′}}, ∀i, j. We fix the diagonal entries of β to −∞ to disallow self-transitions. We place a squared ℓ2-norm regularizer on β to prevent overfitting. The unconstrained optimization problem is:

  min_β  −∑_{i=1}^m ∑_{k=1}^{M_i−1} log P(a^(i)_{k+1} | a^(i)_{1:k}; β) + (C_β/2) ∑_{i≠j} β_{ij}²,    (6)

where C_β > 0 is a regularization parameter. We provide the derivative of the prefix log likelihood w.r.t. β in our supplementary material. We point out that the objective function of (6) is not convex in β in general. Let n = 5 and suppose we observe two censored lists (5, 4, 3, 1, 2) and (3, 4, 5, 1, 2). With random starts we found two different local optima β^(1) and β^(2) of (6). We plot the prefix log likelihood of (1 − λ)β^(1) + λβ^(2), where λ ∈ [0, 1], in Figure 1(d). Nonconvexity of this 1D slice implies nonconvexity of the prefix log likelihood surface in general.

Efficient Optimization using Averaged Stochastic Gradient Descent  Given a censored list a of length M, computing the derivative of P(a_{k+1} | a_{1:k}) w.r.t. β takes O(k³) time for the matrix inversion. There are n² entries in β, so the time complexity per item is O(k³ + n²). This computation needs to be done for k = 1, . . . , M − 1 in a list and for m censored lists, which makes the overall time complexity O(mM(M³ + n²)). In the worst case, M is as large as n, which makes it O(mn⁴). Even a state-of-the-art batch optimization method such as L-BFGS takes a very long time to find the solution for a moderate problem size such as n ≈ 500. For a faster computation of the RegMLE (6), we turn to averaged stochastic gradient descent (ASGD) [20, 18]. ASGD processes the lists sequentially, updating the parameters after every list. The per-round objective function for β on the i-th list is

  f(a^(i); β) ≡ −∑_{k=1}^{M_i−1} log P(a^(i)_{k+1} | a^(i)_{1:k}; β) + (C_β/(2m)) ∑_{i≠j} β_{ij}².

We randomly initialize β_0. At round t, we update the solution β_t with β_t ← β_{t−1} − η_t ∇f(a^(i); β) and the average estimate β̄_t with β̄_t ← ((t−1)/t) β̄_{t−1} + (1/t) β_t. Let η_t = γ_0 (1 + γ_0 a t)^{−c}. We use a = C_β/m and c = 3/4 following [3], and pick γ_0 by running the algorithm on a small subsample of the train set. We run ASGD for a fixed number of epochs and take the final β̄_t as the solution.

Figure 2: Toy experiment results where the error is measured with the Frobenius norm.

3 Experiments

We compare INVITE against two popular estimators of P: naive random walk (RW) and First-Edge (FE). RW is the regularized MLE of the naive random walk, pretending the censored lists are the underlying uncensored walk trajectories: P̂^(RW)_{rc} ∝ ∑_{i=1}^m ∑_{j=1}^{M_i−1} 1{(a^(i)_j = r) ∧ (a^(i)_{j+1} = c)} + C_RW. Though simple and popular, RW is a biased estimator due to the model mismatch. FE was proposed in [2] for graph structure recovery in the cascade model. FE uses only the first two items in each censored list: P̂^(FE)_{rc} ∝ ∑_{i=1}^m 1{(a^(i)_1 = r) ∧ (a^(i)_2 = c)} + C_FE. Because the first transition in a censored list is always the same as the first transition in its underlying trajectory, FE is a consistent estimator of P (assuming π has no zero entries). In fact, FE is equivalent to the RegMLE of the length-two prefix likelihood of the INVITE model. However, we expect FE to waste information, since it discards the rest of the censored lists. Furthermore, FE cannot estimate the transition probabilities from an item that does not appear as the first item in the lists, which is common in real-world data.

3.1 Toy Experiments

Here we compare the three estimators INVITE, RW, and FE on toy datasets, where the observations are indeed generated by an initial-visit emitting random walk. We construct three undirected, unweighted graphs of n = 25 nodes each: (i) Ring, a ring graph; (ii) Star, n − 1 nodes each connected to a "hub" node; and (iii) Grid, a 2-dimensional √n × √n lattice. The initial distribution π* is uniform, and the transition matrix P* at each node has an equal transition probability to its neighbors. For each graph, we generate datasets with m ∈ {10, 20, 40, 80, 160, 320, 640} censored lists. Each censored list has length n. We note that, in the star graph a censored list contains many apparent transitions between leaf nodes, although such transitions are not allowed in its underlying uncensored random walk. This will mislead RW. This effect is less severe in the grid graph and the ring graph.

For each estimator, we perform 5-fold cross validation (CV) for finding the best smoothing parameters C_β, C_RW, C_FE on the grid 10^{−2}, 10^{−1.5}, . . .
, 10^1, respectively, with which we compute each estimator. Then, we evaluate the three estimators using the Frobenius norm between P̂ and the true transition matrix P*: error(P̂) = √(∑_{i,j} (P̂_{ij} − P*_{ij})²). Note the error must approach 0 as m increases for consistent estimators. We repeat the same experiment 20 times, where each time we draw a new set of censored lists.

Figure 2 shows how error(P̂) changes as the number of censored lists m increases. The error bars are 95% confidence bounds. We make three observations: (1) INVITE tends towards 0 error. This is expected given the consistency of INVITE in Theorem 6. (2) RW is biased. In all three plots, RW tends towards some positive number, unlike INVITE and FE. This is because RW has the wrong model of the censored lists. (3) INVITE outperforms FE. On the ring and grid graphs INVITE dominates FE for every training set size. On the star graph FE is better than INVITE with a small m, but INVITE eventually achieves lower error. This reflects the fact that, although FE is unbiased, it discards most of the censored lists and therefore has higher variance compared to INVITE.

Table 1: Statistics of the verbal fluency data.

             Animal   Food
  n          274      452
  m          4710     4622
  Length
    Min.     2        1
    Max.     36       47
    Mean     18.72    20.73
    Median   19       21

Table 2: Verbal fluency test set log likelihood.

          Model    Test set mean neg. loglik.
  Animal  INVITE   60.18 (±1.75)
          RW       69.16 (±2.00)
          FE       72.12 (±2.17)
  Food    INVITE   83.62 (±2.32)
          RW       94.54 (±2.75)
          FE       100.27 (±2.96)

3.2 Verbal Fluency

We now turn to the real-world fluency data, where we compare INVITE with the baseline models. Since we do not have the ground-truth parameters π and P, we compare the test set log likelihood of the various models. Confirming the empirical performance of INVITE sheds light on using it for practical applications such as the diagnosis and classification of brain-damaged patients.

Data  The data used to assess human memory search consists of two verbal fluency datasets from the Wisconsin Longitudinal Survey (WLS). The WLS is a longitudinal assessment of many sociodemographic and health factors that has been administered to a large cohort of Wisconsin residents every five years since the 1950s. Verbal fluency for two semantic categories, animals and foods, was administered in the last two testing rounds (2005 and 2010), yielding a total of 4714 lists for animals and 4624 lists for foods, collected from a total of 5674 participants ranging in age from their early 60s to mid 70s. The raw lists included in the WLS were preprocessed by expanding abbreviations ("lab" → "labrador"), removing inflections ("cats" → "cat"), correcting spelling errors, and removing response errors like unintelligible items. Though instructed to not repeat, some human participants did occasionally produce repeated words. We removed the repetitions from the data, which consist of 4% of the word token responses. Finally, the data exhibits a Zipfian behavior with many idiosyncratic, low-count words. We removed words appearing in fewer than 10 lists. 
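For concreteness, the two count-based baselines can be sketched as follows (our own illustrative code, not the authors'; add-constant smoothing stands in for the C_RW and C_FE regularizers). Run on a star graph as in Section 3.1, it reproduces the bias noted there: RW credits apparent leaf-to-leaf transitions that the true walk never makes, while FE stays consistent.

```python
import random

def simulate_censored_list(pi, P, rng):
    """Run INVITE once: emit each state on its first visit, stop when all are seen."""
    n = len(pi)
    s = rng.choices(range(n), weights=pi)[0]
    seen, out = {s}, [s]
    while len(seen) < n:
        s = rng.choices(range(n), weights=P[s])[0]
        if s not in seen:
            seen.add(s)
            out.append(s)
    return out

def rw_estimate(lists, n, C=0.01):
    """RW baseline: count adjacent pairs in censored lists as if they were real
    transitions, then smooth and row-normalize."""
    cnt = [[C] * n for _ in range(n)]
    for a in lists:
        for r, c in zip(a, a[1:]):
            cnt[r][c] += 1.0
    return [[v / sum(row) for v in row] for row in cnt]

def fe_estimate(lists, n, C=0.01):
    """FE baseline: use only the first transition of each censored list."""
    cnt = [[C] * n for _ in range(n)]
    for a in lists:
        cnt[a[0]][a[1]] += 1.0
    return [[v / sum(row) for v in row] for row in cnt]

# Star graph: hub 0, leaves 1..4; the true walk only moves hub <-> leaf.
n = 5
P_true = [[0.0] + [0.25] * 4] + [[1.0, 0.0, 0.0, 0.0, 0.0] for _ in range(4)]
pi = [1.0 / n] * n
rng = random.Random(0)
lists = [simulate_censored_list(pi, P_true, rng) for _ in range(500)]
P_rw = rw_estimate(lists, n)
P_fe = fe_estimate(lists, n)
# P_true[1][2] == 0, yet P_rw[1][2] is large: censored lists contain apparent
# leaf-to-leaf steps. FE keeps P_fe[1][0] close to 1, matching the true chain.
```

The design mirrors the experimental point rather than the exact experimental pipeline: no cross-validation is done here, and the smoothing constant is fixed.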
In total, the process resulted in removing 5% of the total number of word token responses. The statistics of the data after preprocessing are summarized in Table 1.

Procedure  We randomly subsample 10% of the lists as the test set, and use the rest as the training set. We perform 5-fold CV on the training set for each estimator to find the best smoothing parameter C_β, C_RW, C_FE ∈ {10^1, 10^{0.5}, 10^0, 10^{−0.5}, 10^{−1}, 10^{−1.5}, 10^{−2}}, respectively, where the validation measure is the prefix log likelihood for INVITE and the standard random walk likelihood for RW. For the validation measure of FE we use the INVITE prefix log likelihood, since FE is equivalent to the length-two prefix likelihood of INVITE. Then, we train the final estimator on the whole training set using the fitted regularization parameter.

Result  The experimental results are summarized in Table 2. For each estimator, we measure the average per-list negative prefix log likelihood on the test set for INVITE and FE, and the standard random walk per-list negative log likelihood for RW. The number in parentheses is the 95% confidence interval. Boldfaced numbers mean that the corresponding estimator is the best and that the difference from the others is statistically significant under a two-tailed paired t-test at the 95% significance level. In both the animal and food verbal fluency tasks, the results indicate that human-generated fluency lists are better explained by INVITE than by either RW or FE. Furthermore, RW outperforms FE. We believe that FE performs poorly despite being consistent because the number of lists is too small (compared to the number of states) for FE to reach a good estimate.

4 Related Work

Though behavior in semantic fluency tasks has been studied for many years, few computationally explicit models of the task have been advanced. 
Influential models in the psychological literature, such as the widely known "clustering and switching" model of Troyer et al. [21], have been articulated only verbally. Efforts to estimate the structure of semantic memory from fluency lists have mainly focused on decomposing the structure apparent in distance matrices that reflect the mean inter-item ordinal distances across many fluency lists [5], but without an account of the processes that generate list structure it is not clear how the results of such studies are best interpreted. More recently, researchers in cognitive science have begun to focus on explicit models of the processes by which fluency lists are generated. In these works, the structure of semantic memory is first modeled either as a graph or as a continuous multidimensional space estimated from word co-occurrence statistics in large corpora of natural language. Researchers then assess whether structure in fluency data can be understood as resulting from a particular search process operating over the specified semantic structure. Models explored in this vein include a simple random walk over a semantic network with repeated nodes omitted from the output sequence [12], the PageRank algorithm employed for network search by Google [13], and foraging algorithms designed to explain the behavior of animals searching for food [15]. Each study reports aspects of human behavior that are well explained by the respective search process, given accompanying assumptions about the nature of the underlying semantic structure. However, these works do not learn their models directly from the fluency lists, which is the key difference from our study.
Broder's algorithm Generate [4] for generating random spanning trees is similar to INVITE's generative process. Given an undirected graph, the algorithm runs a random walk and outputs each transition to an unvisited node.
Upon transiting to an already visited node, however, it does not output the transition. The random walk stops after visiting every node in the graph. In the end, we observe an ordered list of transitions. For example, in Figure 1(a), if the random walk trajectory is (2,1,2,1,3,1,4), then the output is (2→1, 1→3, 1→4). Note that if we take the starting node of the first transition and the arriving node of each transition, the output list reduces to the censored list generated from INVITE with the same underlying random walk. Despite the similarity, to the best of our knowledge, the censored list derived from the output of the algorithm Generate has not been studied, and no parameter estimation task has been discussed in prior work.
A self-avoiding random walk, or non-self-intersecting random walk, performs a random walk while avoiding already visited nodes [9]. For example, in Figure 1(a), if a self-avoiding random walk starts from state 2 and then visits 1, it can only visit state 3 or 4 next, since 2 has already been visited. In never visiting the same node twice, the self-avoiding walk is similar to INVITE. However, a key difference is that a self-avoiding walk cannot produce a transition i → j if Pij = 0. In contrast, INVITE can appear to have such "transitions" in the censored list. This behavior is a core property that allows INVITE to switch clusters in modeling human memory search.
INVITE resembles cascade models in many respects [16, 11]. In a cascade model, information or disease spreads from a seed node to the whole graph through infections that pass from an infected node to its neighbors. [11] formulates a graph learning problem where an observation is a list, or so-called trace, that contains the infected nodes along with their infection times. Although not discussed in the present paper, it is trivial for INVITE to produce time stamps for each item in its censored list as well.
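The reduction from Generate's output to an INVITE censored list described above can be checked mechanically. The following is a minimal Python sketch (the function names are ours, not from the paper), using the example trajectory from Figure 1(a):

```python
# Minimal sketch of the two censoring mechanisms discussed above.
# Function names are ours; this is not code from the paper.

def censor_invite(walk):
    """INVITE: emit each state only on its first visit."""
    seen, out = set(), []
    for s in walk:
        if s not in seen:
            seen.add(s)
            out.append(s)
    return out

def generate_output(walk):
    """Broder's Generate: emit each transition arriving at an unvisited node."""
    seen, out = {walk[0]}, []
    for u, v in zip(walk, walk[1:]):
        if v not in seen:
            seen.add(v)
            out.append((u, v))
    return out

walk = [2, 1, 2, 1, 3, 1, 4]   # trajectory from Figure 1(a)
print(generate_output(walk))   # [(2, 1), (1, 3), (1, 4)]
print(censor_invite(walk))     # [2, 1, 3, 4]
# First transition's source plus each arriving node = INVITE censored list:
assert [generate_output(walk)[0][0]] + [v for _, v in generate_output(walk)] \
       == censor_invite(walk)
```

The final assertion is exactly the reduction noted in the text: Generate's transition list carries the same first-visit information as INVITE's censored list.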
However, there is a fundamental difference in how infection occurs. A cascade model typically allows multiple infected nodes to infect their neighbors in parallel, so that infection can happen simultaneously in many parts of the graph. In contrast, INVITE contains a single surfer that is responsible for all of the infection via a random walk, so infection in INVITE is necessarily sequential. This results in INVITE exhibiting the clustering behavior in censored lists that is well known in human memory search tasks [21].
5 Discussion
There are numerous directions in which to extend INVITE. First, more theoretical investigation is needed. For example, although we know the MLE of INVITE is consistent, its convergence rate is unknown. Second, one can improve the INVITE estimate when data is sparse by assuming certain cluster structures in the transition matrix P, thereby reducing the degrees of freedom. For instance, it is known that verbal fluency tends to exhibit "runs" of semantically related words. One can assume a stochastic block model P with parameter sharing at the block level, where the blocks represent semantic clusters of words, and then estimate the block structure and the shared parameters at the same time. Third, INVITE can be extended to allow repetitions in a list. The basic idea is as follows. In the k-th segment we previously used an absorbing random walk to compute P(a_{k+1} | a_{1:k}), where a_{1:k} were the nonabsorbing states. For each nonabsorbing state a_i, add a "dongle twin" absorbing state a'_i attached only to a_i, and allow a small transition probability from a_i to a'_i. If the walk is absorbed by a'_i, we output a_i, which becomes a repeated item in the censored list. Note that the likelihood computation in this augmented model is still polynomial.
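The per-segment quantity P(a_{k+1} | a_{1:k}) referenced above can be computed with a standard absorbing-chain calculation. The following is a minimal NumPy sketch with our own naming and a hypothetical toy transition matrix; it is an illustration of the absorption-probability formula, not the authors' implementation:

```python
import numpy as np

def segment_prob(P, visited, target):
    """One INVITE segment: probability that a walk started at visited[-1],
    with the already-emitted states `visited` made transient and every
    other state absorbing, is first absorbed at `target`.
    Uses the absorption-probability formula B = (I - Q)^{-1} R."""
    n = P.shape[0]
    absorb = [s for s in range(n) if s not in visited]
    Q = P[np.ix_(visited, visited)]   # transient -> transient block
    R = P[np.ix_(visited, absorb)]    # transient -> absorbing block
    B = np.linalg.solve(np.eye(len(visited)) - Q, R)
    return B[len(visited) - 1, absorb.index(target)]

# Hypothetical 4-state chain with uniform off-diagonal transitions.
P = (np.ones((4, 4)) - np.eye(4)) / 3
# After emitting (0, 1), the walk may bounce inside {0, 1} arbitrarily long
# before first reaching an unvisited state; marginalizing over those hidden
# paths gives P(2 | 0, 1) = 1/2 by symmetry.
print(segment_prob(P, [0, 1], 2))  # ~0.5
```

Multiplying these segment probabilities (together with the initial distribution π for the first item) yields the INVITE prefix likelihood of a censored list.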
Such a model with "reluctant repetitions" would be an interesting interpolation between "no repetitions" and "repetitions as in a standard random walk."
Acknowledgments
The authors are thankful to the anonymous reviewers for their comments. This work is supported in part by NSF grants IIS-0953219 and DGE-1545481, NIH Big Data to Knowledge 1U54AI117924-01, NSF Grant DMS-1265202, and NIH Grant 1U54AI117924-01.

References
[1] J. T. Abbott, J. L. Austerweil, and T. L. Griffiths, "Human memory search as a random walk in a semantic network," in NIPS, 2012, pp. 3050–3058.
[2] B. D. Abrahao, F. Chierichetti, R. Kleinberg, and A. Panconesi, "Trace complexity of network inference," CoRR, vol. abs/1308.2954, 2013.
[3] L. Bottou, "Stochastic gradient tricks," in Neural Networks, Tricks of the Trade, Reloaded, ser. Lecture Notes in Computer Science (LNCS 7700), G. Montavon, G. B. Orr, and K.-R. Müller, Eds. Springer, 2012, pp. 430–445.
[4] A. Z. Broder, "Generating random spanning trees," in FOCS. IEEE Computer Society, 1989, pp. 442–447.
[5] A. S. Chan, N. Butters, J. S. Paulsen, D. P. Salmon, M. R. Swenson, and L. T. Maloney, "An assessment of the semantic network in patients with Alzheimer's disease," Journal of Cognitive Neuroscience, vol. 5, no. 2, pp. 254–261, 1993.
[6] J. R. Cockrell and M. F. Folstein, "Mini-mental state examination," Principles and Practice of Geriatric Psychiatry, pp. 140–141, 2002.
[7] P. G. Doyle and J. L. Snell, Random Walks and Electric Networks. Washington, DC: Mathematical Association of America, 1984.
[8] R. Durrett, Essentials of Stochastic Processes, 2nd ed., ser. Springer Texts in Statistics. New York: Springer, 2012.
[9] P. Flory, Principles of Polymer Chemistry. Cornell University Press, 1953.
[10] A. M. Glenberg and S.
Mehta, "Optimal foraging in semantic memory," Italian Journal of Linguistics, 2009.
[11] M. Gomez Rodriguez, J. Leskovec, and A. Krause, "Inferring networks of diffusion and influence," in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010, pp. 1019–1028.
[12] J. Goñi, G. Arrondo, J. Sepulcre, I. Martincorena, N. V. de Mendizábal, B. Corominas-Murtra, B. Bejarano, S. Ardanza-Trevijano, H. Peraita, D. P. Wall, and P. Villoslada, "The semantic organization of the animal category: evidence from semantic verbal fluency and network theory," Cognitive Processing, vol. 12, no. 2, pp. 183–196, 2011.
[13] T. L. Griffiths, M. Steyvers, and A. Firl, "Google and the mind: Predicting fluency with PageRank," Psychological Science, vol. 18, no. 12, pp. 1069–1076, 2007.
[14] N. M. Henley, "A psychological study of the semantics of animal terms," Journal of Verbal Learning and Verbal Behavior, vol. 8, no. 2, pp. 176–184, 1969.
[15] T. T. Hills, P. M. Todd, and M. N. Jones, "Optimal foraging in semantic memory," Psychological Review, pp. 431–440, 2012.
[16] D. Kempe, J. Kleinberg, and E. Tardos, "Maximizing the spread of influence through a social network," in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '03. New York, NY, USA: ACM, 2003, pp. 137–146.
[17] F. Pasquier, F. Lebert, L. Grymonprez, and H. Petit, "Verbal fluency in dementia of frontal lobe type and dementia of Alzheimer type," Journal of Neurology, vol. 58, no. 1, pp. 81–84, 1995.
[18] B. T. Polyak and A. B. Juditsky, "Acceleration of stochastic approximation by averaging," SIAM J. Control Optim., vol. 30, no. 4, pp. 838–855, July 1992.
[19] T. T. Rogers, A. Ivanoiu, K. Patterson, and J. R.
Hodges, "Semantic memory in Alzheimer's disease and the frontotemporal dementias: a longitudinal study of 236 patients," Neuropsychology, vol. 20, no. 3, pp. 319–335, 2006.
[20] D. Ruppert, "Efficient estimations from a slowly convergent Robbins-Monro process," Cornell University Operations Research and Industrial Engineering, Tech. Rep., 1988.
[21] A. Troyer, M. Moscovitch, G. Winocur, M. Alexander, and D. Stuss, "Clustering and switching on verbal fluency: The effects of focal frontal- and temporal-lobe lesions," Neuropsychologia, vol. 36, no. 6, 1998.