{"title": "Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter", "book": "Advances in Neural Information Processing Systems", "page_first": 1335, "page_last": 1342, "abstract": "", "full_text": "Estimating Internal Variables and Parameters of\n\na Learning Agent by a Particle Filter\n\nKazuyuki Samejima\nKenji Doya\nDepartment of Computational Neurobiology\n\nATR Computational Neuroscience laboratories;\n\n\u201cCreating the Brain\u201d, CREST, JST.\n\n\u201cKeihan-na Science City\u201d, Kyoto, 619-0288, Japan\n\nfsamejima, doyag@atr.jp\n\nYasumasa Ueda\n\nMinoru Kimura\n\nDepartment of Physiology, Kyoto Prefecture University of Medicine,\n\nKyoto, 602-8566, Japan\n\nfyasu, mkimurag@basic.kpu-m.ac.jp\n\nAbstract\n\nWhen we model a higher order functions, such as learning and memory,\nwe face a dif\ufb01culty of comparing neural activities with hidden variables\nthat depend on the history of sensory and motor signals and the dynam-\nics of the network. Here, we propose novel method for estimating hidden\nvariables of a learning agent, such as connection weights from sequences\nof observable variables. Bayesian estimation is a method to estimate the\nposterior probability of hidden variables from observable data sequence\nusing a dynamic model of hidden and observable variables. In this pa-\nper, we apply particle \ufb01lter for estimating internal parameters and meta-\nparameters of a reinforcement learning model. We veri\ufb01ed the effective-\nness of the method using both arti\ufb01cial data and real animal behavioral\ndata.\n\n1\n\nIntroduction\n\nIn neurophysiology, the traditional approach to discover unknown information processing\nmechanisms is to compare neuronal activities with external variables, such as sensory stim-\nuli or motor output. Recent advances in computational neuroscience allow us to make pre-\ndictions on neural mechanisms based on computational models. 
However, when we model higher order functions, such as attention, memory and learning, the model must inevitably include hidden variables which are difficult to infer directly from externally observable variables.\n\nAlthough the assessment of the plausibility of such models depends on the right estimate of the hidden variables, tracking their values in an experimental setting is a difficult problem. For example, in learning agents, hidden variables such as connection weights change in time. In addition, the course of learning is modulated by hidden meta-parameters such as the learning rate.\n\nThe goal of this study is two-fold: first, to establish a method to estimate hidden variables, including meta-parameters, from observable experimental data; second, to provide a method for objectively selecting the most plausible computational model out of multiple candidates. We introduce a numerical Bayesian estimation method, known as particle filtering, to estimate hidden variables. We validate this method with a reinforcement learning task.\n\n2 Reinforcement learning as a model of animal and human decision processes\n\nReinforcement learning can be a model of animal or human behaviors based on reward delivery. Notably, the responses of monkey midbrain dopamine neurons are successfully explained by the temporal difference (TD) error of reinforcement learning models [2]. The goal of reinforcement learning is to improve the policy so that the agent maximizes rewards in the long run. The basic strategy of reinforcement learning is to estimate the cumulative future reward under the current policy as the value function, and then to improve the policy based on the value function. A standard algorithm of reinforcement learning is to learn the action-value function,\n\nQ(s_t, a_t) = E[ \u03a3_{\u03c4=t}^{\u221e} \u03b3^{\u03c4\u2212t} r_\u03c4 | s_t, a_t ],   (1)\n\nwhich estimates the cumulative future reward when action a is taken at state s.
The discount factor 0 < \u03b3 < 1 is a meta-parameter that controls the time scale of prediction. The policy of the learner is then given by comparing action-values, e.g. according to the Boltzmann distribution\n\nP(a|s_t) = exp(\u03b2 Q(s_t, a)) / \u03a3_{\u00e3\u2208A} exp(\u03b2 Q(s_t, \u00e3)),   (2)\n\nwhere the inverse temperature \u03b2 > 0 is another meta-parameter that controls the randomness of action selection. From an experience of state s_t, action a_t, reward r_t, and next state s_{t+1}, the action-value function is updated by the Q-learning algorithm [1] as\n\n\u03b4_TD(t) = r_t + \u03b3 max_{a\u2208A} Q(s_{t+1}, a) \u2212 Q(s_t, a_t),\nQ(s_t, a_t) \u2190 Q(s_t, a_t) + \u03b1 \u03b4_TD(t),   (3)\n\nwhere \u03b1 > 0 is the meta-parameter that controls the learning rate. Thus this simple reinforcement learning model has three meta-parameters: \u03b1, \u03b2 and \u03b3. Such a reinforcement learning model not only predicts the subject\u2019s actions, but also predicts internal processes of the brain, which may be recorded as neural firing or brain imaging data. However, a big problem is that the predictions depend on the setting of the meta-parameters, such as the learning rate \u03b1, action randomness \u03b2 and discount factor \u03b3.\n\n3 Bayesian estimation of hidden variables of a reinforcement learning agent\n\nLet us consider the problem of estimating the time course of action-values {Q_t(s, a); s \u2208 S, a \u2208 A, 0 \u2264 t \u2264 T} and meta-parameters \u03b1, \u03b2, and \u03b3 of a reinforcement learner by only observing the sequence of states s_t, actions a_t and rewards r_t. We use a Bayesian method of estimating a dynamic hidden variable {x_t; t \u2208 N} from a sequence of observable variables {y_t; t \u2208 N}.
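As a concrete reference for the generative model, the Q-learning agent of equations (1)-(3) can be sketched in Python. This is an illustrative sketch, not the authors' code; the reward probabilities, block length and meta-parameter values below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(q, beta):
    """Boltzmann action-selection probabilities, as in eq. (2)."""
    z = beta * (q - q.max())  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def q_learning_step(q, a, r, alpha, gamma=0.0, q_next=None):
    """Q-learning update, as in eq. (3); gamma = 0 for a one-state bandit."""
    target = r if q_next is None else r + gamma * q_next.max()
    delta = target - q[a]          # TD error
    q[a] += alpha * delta
    return q, delta

# Hypothetical two-armed bandit block; reward probabilities are placeholders.
pr = np.array([0.9, 0.1])
q = np.zeros(2)
for t in range(100):
    a = rng.choice(2, p=softmax_policy(q, beta=1.0))
    r = 5.0 if rng.random() < pr[a] else 0.0
    q, _ = q_learning_step(q, a, r, alpha=0.05)
```

Only the pairs (a_t, r_t) would be observable to the experimenter; the action-values Q and the meta-parameters \u03b1 and \u03b2 are the hidden quantities to be recovered.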
We assume that the hidden variable follows a Markov process with an initial distribution p(x_0) and a transition probability p(x_{t+1}|x_t). The observations {y_t; t \u2208 N} are assumed to be conditionally independent given the process {x_t; t \u2208 N} and have the marginal distribution p(y_t|x_t).\n\nFigure 1: A Bayesian network representation of a Q-learning agent: the dynamics of the observable and unobservable variables depend on the decision, reward probability, state transition, and update rule for the value function. Circles: hidden variables. Double boxes: observable variables. Arrows: probabilistic dependency.\n\nThe problem is to estimate recursively in time the posterior distribution of the hidden variable p(x_{0:t}|y_{1:t}), where x_{0:t} = {x_0, ..., x_t} and y_{1:t} = {y_1, ..., y_t}. The marginal distribution is given by the recursive procedure of the following prediction and updating steps:\n\nPrediction:  p(x_t|y_{1:t\u22121}) = \u222b p(x_t|x_{t\u22121}) p(x_{t\u22121}|y_{1:t\u22121}) dx_{t\u22121},\n\nUpdating:  p(x_t|y_{1:t}) = p(y_t|x_t) p(x_t|y_{1:t\u22121}) / \u222b p(y_t|x_t) p(x_t|y_{1:t\u22121}) dx_t.\n\nWe use a numerical method called the particle filter [3] to approximate this process. In the particle filter, the distribution of the sequence of hidden variables p(x_{0:t}|y_{1:t}) is represented by a set of random samples, called \u201cparticles\u201d. Figure 1 is the dynamical Bayesian network representation of a Q-learning agent. The hidden variable x_t consists of the action-values Q(s, a) for each state-action pair, the learning rate \u03b1, inverse temperature \u03b2, and discount factor \u03b3.
The observable variable y_t consists of the state s_t, action a_t, and reward r_t. The marginal distribution p(y_t|x_t) of the observation process is given by the softmax action selection probability (2) combined with the state transition rule and the reward condition p(r_{t+1}|s_t, a_t) given by the environment. The transition probability p(x_{t+1}|x_t) of the hidden variable is given by the Q-learning rule (3) and an assumption about the meta-parameter dynamics.\n\nFigure 2: Simplified Bayesian network for the two-armed bandit problem.\n\nHere we assume that the meta-parameters are constant with small drifts. Because \u03b1, \u03b2 and \u03b3 should all be positive, we assume random-walk dynamics in logarithmic space,\n\nlog(x_{t+1}) = log(x_t) + \u03b5_x,   \u03b5_x \u223c N(0, \u03c3_x),   (4)\n\nwhere \u03c3_x is a meta-meta-parameter that defines the variability of the meta-parameter x \u2208 {\u03b1, \u03b2, \u03b3}.\n\n4 Simulations\n\n4.1 Two-armed bandit problem with block-wise reward change\n\nIn order to test the validity of the proposed method, we use a simple Q-learning agent that learns a two-armed bandit problem [1]. The task has only one state, two actions, and stochastic binary reward. The reward probability for each action is fixed in a block of 100 trials. The reward probabilities Pr1 for action a = 1 and Pr2 for action a = 2 are selected randomly from three settings, {Pr1, Pr2} = {0.1, 0.9}, {0.5, 0.5}, {0.9, 0.1}, at the beginning of each block.\n\nThe Q-learning agent tries to learn the reward expectation of each action and to maximize the reward acquired in each block. Because the task has only one state, the agent does not need to take into account the next state\u2019s value, and thus we set the discount factor to \u03b3 = 0. The Bayesian network for this example is simplified as in Figure 2.
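The prediction and updating recursion, with the log-space random walk (4) as the meta-parameter dynamics, can be sketched as a bootstrap particle filter for this one-state case (\u03b3 = 0). A minimal sketch, not the authors' implementation: the reward size r = 5, the drift sizes, and the initial particle distribution are taken from the simulation settings reported below, and multinomial resampling on every trial is an assumption of this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
N_PARTICLES = 1000
R = 5.0  # reward magnitude when delivered

# Each particle holds a hidden state x = (Q(a=1), Q(a=2), log alpha, log beta),
# drawn from the Gaussian initial distribution with variances (1, 1, 3, 1).
particles = np.column_stack([
    rng.normal(0.0, 1.0, N_PARTICLES),           # Q(a=1)
    rng.normal(0.0, 1.0, N_PARTICLES),           # Q(a=2)
    rng.normal(0.0, np.sqrt(3.0), N_PARTICLES),  # log alpha
    rng.normal(0.0, 1.0, N_PARTICLES),           # log beta
])

def pf_step(particles, a, r, sigma_alpha=0.05, sigma_beta=0.005):
    """One updating / resampling / prediction cycle given the observed (a, r)."""
    q = particles[:, :2]
    beta = np.exp(particles[:, 3])
    # Updating: weight each particle by the softmax likelihood (2) of the
    # observed action under that particle's Q-values and beta.
    z = beta[:, None] * q
    z = z - z.max(axis=1, keepdims=True)  # stabilize the exponentials
    p = np.exp(z)
    w = p[:, a] / p.sum(axis=1)
    w = w / w.sum()
    # Resample particles in proportion to their weights.
    idx = rng.choice(len(w), size=len(w), p=w)
    particles = particles[idx].copy()
    # Prediction: apply the Q-learning rule (3) with gamma = 0, then the
    # log-space random-walk drift (4) to the meta-parameters.
    alpha = np.exp(particles[:, 2])
    particles[:, a] += alpha * (r - particles[:, a])
    particles[:, 2] += rng.normal(0.0, sigma_alpha, len(particles))
    particles[:, 3] += rng.normal(0.0, sigma_beta, len(particles))
    return particles
```

Posterior point estimates such as those plotted below are then simple particle averages, e.g. `np.exp(particles[:, 2]).mean()` for the learning rate.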
Simulated actions are selected according to the Boltzmann distribution (2) using the action-values Q(a = 1) and Q(a = 2) and the inverse temperature \u03b2. The action values are updated by equation (3) with the action a_t, reward r_t, and learning rate \u03b1.\n\n4.2 Result\n\nWe used 1000 particles for approximating the distribution of the hidden variable x = (Q(a = 1), Q(a = 2), log(\u03b1), log(\u03b2)). We set the initial distribution of particles as a Gaussian distribution with mean {0, 0, 0, 0} and variance {1, 1, 3, 1} for {Q(a = 1), Q(a = 2), log(\u03b1), log(\u03b2)}, respectively. We set the meta-meta-parameter for the learning rate to \u03c3_\u03b1 = 0.05, and for the inverse temperature to \u03c3_\u03b2 = 0.005. The reward is r = 5 when delivered, and otherwise r = 0.\n\nFigure 3(a) shows the simulated actions and rewards of 1000 trials by a Q-learning agent with \u03b1 = 0.05 and \u03b2 = 1. From this observable sequence of y_t = (s_t, a_t, r_t), the particle filter estimated the time course of the action-values Q_t(a = 1) and Q_t(a = 2), the learning rate \u03b1_t and the inverse temperature \u03b2_t. The expected values of the marginal distributions of these hidden variables (Figure 3(b)-(e), solid lines) are in good agreement with the true values (Figure 3(b)-(e), dotted lines) recorded in the simulation. Although the initial estimates were inevitably inaccurate, the particle filter gave good estimates of each variable after about 200 observations.\n\nTo test the robustness of the particle filter approach, we generated behavioral sequences of Q-learners with different combinations of \u03b1 = {0.01, 0.15, 0.1, 0.5} and \u03b2 = {0.5, 1, 2, 4}, and estimated the meta-parameters \u03b1 and \u03b2. Even if we set a broad initial distribution of \u03b1 and \u03b2, the expected values of the estimates are in good agreement with the true values.
When the agent had the smallest learning rate, \u03b1 = 0.01, the particle filter tended to underestimate \u03b2 and overestimate \u03b1.\n\n5 Application to monkey behavioral data\n\nWe applied the particle filter approach to monkey behavioral data from the two-armed bandit problem [4]. In this task, the monkey faces a lever that can be turned to either the left or the right. After adjusting the lever to the center position and holding it for one second, the monkey turned the lever to the left or right based on the reward probabilities assigned to each direction of lever turn. The probabilities [PL, PR] of reward delivery on the left and right turns, respectively, were varied across three trial blocks as [PL, PR] = [0.5, 0.5], [0.1, 0.9], [0.9, 0.1]. In each block, the monkeys shifted their selection to the direction with the higher reward probability.\n\nWe used 1000 particles and a Gaussian initial distribution with mean (2, 2, 3, 0) and variance (2, 2, 1, 1) for x = (Q(R), Q(L), log(\u03b1), log(\u03b2)). We set the meta-meta-parameter for the learning rate to \u03c3_\u03b1 = 0.05, and for the inverse temperature to \u03c3_\u03b2 = 0.001. The reward was r = 5 when delivered, and otherwise r = 0.\n\nFigure 5(a) shows the sequence of selected actions and rewards in a day. Figure 5(b) shows the estimated action-values Q(a = L) and Q(a = R) for the left and right lever turns. The estimated action value Q(L) for the left action increased in the blocks of [PL, PR] = [0.9, 0.1], decreased in the blocks of [0.1, 0.9], and fluctuated in the blocks of [0.5, 0.5].\n\nWe tested whether the estimated action-values and meta-parameters could reproduce the action sequences.
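One-step-ahead prediction performance can be scored by the average log-likelihood of the observed actions under the model, as formalized in equation (5). A minimal sketch, assuming the per-trial predicted action probabilities have already been computed from the history up to each trial:

```python
import numpy as np

def mean_log_likelihood(actions, p_hat, T=1):
    """Average predictive log-likelihood of observed actions, as in eq. (5).

    actions: observed action indices a_t, for trials t = 1..N
    p_hat:   (N, n_actions) array; row t-1 is the model's predicted action
             distribution given the past experience {a_1, r_1, ..., a_{t-1}, r_{t-1}}
    T:       first trial included in the average (1-indexed)
    """
    actions = np.asarray(actions)
    p_hat = np.asarray(p_hat)
    n = len(actions)
    ll = np.log(p_hat[np.arange(T - 1, n), actions[T - 1:]])
    return ll.mean()  # the sum divided by (N - T + 1)
```

A model with a higher mean log-likelihood predicts the action sequence better; computing this per session allows the model comparison reported below.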
We quantified the prediction performance on action sequences by the likelihood of the action data given the estimated model,\n\nL = 1/(N \u2212 T + 1) \u03a3_{t=T}^{N} log p\u0302(a = a_t | {a_1, r_1, ..., a_{t\u22121}, r_{t\u22121}}, M, \u03b8_t),   (5)\n\nwhere p\u0302(a) is the estimated probability of action a at trial t by model M and estimated parameters \u03b8_t from the sequence of past experience {a_1, r_1, ..., a_{t\u22121}, r_{t\u22121}}.\n\nFigure 6(b) shows the distribution of the likelihood computed for the action data of 74 sessions. We compared the predictability of the proposed method, the Q-learning model with\n\nFigure 3: Estimation of hidden variables from simulated actions and rewards of a Q-learning agent. (a) Sequence of simulated actions and rewards by the Q-learning agent: circles are rewarded trials; dots are non-rewarded trials. (b)-(e) Time course of the hidden variables of the model (dotted line) and of the expectation value (solid line) of the particle filter estimate: (b)(c) Q-values for each action, (d) learning rate \u03b1, and (e) action randomness \u03b2. Shaded areas indicate the blocks of [0.9, 0.1] or [0.1, 0.9]. White areas indicate [0.5, 0.5].\n\nFigure 4: Expected values of the estimated meta-parameters from the 1000 trials generated with different settings.
The side boxes show the initial distributions of the particles.\n\nFigure 5: Expected values of the hidden variables estimated from animal behavioral data: (a) action and reward sequences; circles are rewarded trials; dots indicate non-rewarded trials. (b)-(d) Estimated values of (b) the action-value function, (c) the learning rate, and (d) the action randomness. Shaded areas indicate the blocks of [0.9, 0.1] or [0.1, 0.9]. White areas indicate [0.5, 0.5].\n\nFigure 6: Comparing models: (a) An example contour plot of the log likelihood of predicted actions for a fixed meta-parameter Q-learning model. The fixed meta-parameter method needs to find the optimal learning rate \u03b1 and inverse temperature \u03b2. (b) Distributions of the log likelihood of action prediction by the proposed particle filter method and by the fixed meta-parameter Q-learning model with the optimal meta-parameters: the top and bottom limits of each box show the lower and upper quartiles, and the center of the notch is the median. Crosses indicate outliers. Boxplot notches show the 95% confidence interval for the median.
The median of the log likelihood of action prediction by the proposed method is significantly larger than that by the fixed meta-parameter method (Wilcoxon signed rank test; p < 0.0001).\n\nestimating meta-parameters by particle filtering, to the fixed meta-parameter Q-learning model, which used the fixed optimal learning rate \u03b1 and inverse temperature \u03b2 in the sense of maximizing the likelihood of action prediction within a session (Figure 6(a)).\n\nThe particle filter could predict actions better than the fixed meta-parameter Q-learning model with the optimal meta-parameters (Wilcoxon signed rank test; p < 0.0001). This result indicates that the particle filtering method successfully tracks the changes of the meta-parameters, the learning rate \u03b1 and the inverse temperature \u03b2, through the sessions.\n\n6 Discussion\n\nAn advantage of the proposed particle filter method is that we do not have to hand-tune meta-parameters, such as the learning rate. Although we still have to set the meta-meta-parameters, which define the dynamics of the meta-parameters, the behavior of the estimates is less sensitive to their settings than to the settings of the meta-parameters themselves. The dependency on the initial distribution of the hidden variables decreases with an increasing number of data.\n\nAn extension of this study would be to perform model selection objectively using a hierarchical Bayesian approach. For example, several possible reinforcement learning models, e.g. Q-learning, the Sarsa algorithm or a policy gradient algorithm, could be compared in terms of the posterior probability of the models.\n\nRecently, computational models with heuristic meta-parameters have been successfully used to generate regressors for neuroimaging data [5]. The Bayesian method enables generating such regressors in a more objective, data-driven manner.
We are going to apply the current method to characterizing neural recording data from the monkey.\n\n7 Conclusion\n\nWe proposed a particle filter method to estimate the internal parameters and meta-parameters of a reinforcement learning agent from observable variables. Our method is a powerful tool for interpreting neurophysiological and neuroimaging data in light of computational models, and for building better models in light of experimental data.\n\nAcknowledgments\n\nThis research was conducted as part of \u2018Research on Human Communication\u2019, with funding from the Telecommunications Advancement Organization of Japan.\n\nReferences\n\n[1] Sutton RS & Barto AG (1998) Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA.\n\n[2] Schultz W, Dayan P & Montague PR (1997) A neural substrate of prediction and reward. Science 275(5306):1593-1599.\n\n[3] Doucet A, de Freitas N & Gordon N (2001) An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo Methods in Practice, Doucet A, de Freitas N & Gordon N, eds, Springer-Verlag, pp. 3-14.\n\n[4] Ueda Y, Samejima K, Doya K & Kimura M (2002) Reward value dependent striate neuron activity of monkey performing trial-and-error behavioral decision task. Abst. of Soc Neurosci, 765.13.\n\n[5] O\u2019Doherty J, Dayan P, Friston K, Critchley H & Dolan R (2003) Temporal difference models and reward-related learning in the human brain. Neuron 38, 329-337.\n", "award": [], "sourceid": 2418, "authors": [{"given_name": "Kazuyuki", "family_name": "Samejima", "institution": null}, {"given_name": "Kenji", "family_name": "Doya", "institution": null}, {"given_name": "Yasumasa", "family_name": "Ueda", "institution": null}, {"given_name": "Minoru", "family_name": "Kimura", "institution": null}]}