{"title": "Effects of Synaptic Weight Diffusion on Learning in Decision Making Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1081, "page_last": 1089, "abstract": "When animals repeatedly choose actions from multiple alternatives, they can allocate their choices stochastically depending on past actions and outcomes. It is commonly assumed that this ability is achieved by modifications in synaptic weights related to decision making. Choice behavior has been empirically found to follow Herrnstein\u2019s matching law. Loewenstein & Seung (2006) demonstrated that matching behavior is a steady state of learning in neural networks if the synaptic weights change proportionally to the covariance between reward and neural activities. However, their proof did not take into account the change in entire synaptic distributions. In this study, we show that matching behavior is not necessarily a steady state of the covariance-based learning rule when the synaptic strength is sufficiently strong so that the fluctuations in input from individual sensory neurons influence the net input to output neurons. This is caused by the increasing variance in the input potential due to the diffusion of synaptic weights. This effect causes an undermatching phenomenon, which has been observed in many behavioral experiments. 
We suggest that the synaptic diffusion effects provide a robust neural mechanism for stochastic choice behavior.", "full_text": "Effects of Synaptic Weight Diffusion on Learning in Decision Making Networks

Kentaro Katahira1,2,3, Kazuo Okanoya1,3 and Masato Okada1,2,3

1ERATO Okanoya Emotional Information Project, Japan Science and Technology Agency

2Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan

3RIKEN Brain Science Institute, Wako, Saitama 351-0198, Japan

katahira@mns.k.u-tokyo.ac.jp okanoya@brain.riken.jp okada@k.u-tokyo.ac.jp

Abstract

When animals repeatedly choose actions from multiple alternatives, they can allocate their choices stochastically depending on past actions and outcomes. It is commonly assumed that this ability is achieved by modifications in synaptic weights related to decision making. Choice behavior has been empirically found to follow Herrnstein's matching law. Loewenstein & Seung (2006) demonstrated that matching behavior is a steady state of learning in neural networks if the synaptic weights change proportionally to the covariance between reward and neural activities. However, their proof did not take into account the change in the entire synaptic distributions. In this study, we show that matching behavior is not necessarily a steady state of the covariance-based learning rule when the synaptic strength is sufficiently strong, so that fluctuations in the input from individual sensory neurons influence the net input to the output neurons. This is caused by the increasing variance in the input potential due to the diffusion of synaptic weights. This effect causes an undermatching phenomenon, which has been observed in many behavioral experiments. We suggest that the synaptic diffusion effects provide a robust neural mechanism for stochastic choice behavior.

1 Introduction

Decision making has often been studied in experiments in which a subject repeatedly chooses actions and rewards are given depending on the action. The choice behavior of subjects in such experiments is known to obey Herrnstein's matching law [1]. This law states that the proportional allocation of choices matches the relative reinforcement obtained from those choices. The neural correlates of matching behavior have been investigated [2] and computational models that explain them have been developed [3, 4, 5, 6, 7].

Previous studies have shown that a learning rule in which the weight update is made proportionally to the covariance between reward and neural activities leads to matching behavior (we simply refer to this learning rule as the covariance rule) [3, 7]. In this study, by means of a statistical mechanical approach [8, 9, 10, 11], we analyze the properties of the covariance rule in the limit where the number of plastic synapses is infinite. We demonstrate that matching behavior is not a steady state of the covariance rule under three conditions: (1) learning is achieved through the modification of the synaptic weights from sensory neurons to the value-encoding neurons; (2) individual fluctuations in sensory input neurons are so large that they can affect the potential of value-coding neurons (possibly via sufficiently strong synapses); (3) the number of plastic synapses that are involved in learning is large. This result is caused by the diffusion of synaptic weights. The term "diffusion" refers to a phenomenon where the distribution over the population of synaptic weights broadens.
This diffusion increases the variance in the potential of the output units, since the broader the synaptic weight distributions are, the more they amplify fluctuations in individual inputs. This makes the choice behavior of the network more random, moving the choice probabilities away from those predicted by the matching law and toward equal probabilities. This outcome corresponds to the undermatching phenomenon, which has been observed in behavioral experiments.

Our results suggest that when we discuss learning processes in a decision making network, it may be insufficient to only consider a steady state for individual weight updates; we should also consider the dynamics of the weight distribution and the network architecture. This proceeding is a short version of our original paper [12], with the model modified and new results included.

2 Matching Law

First, let us formulate the matching law. We consider a case with two alternatives (denoted as A and B), which has generally been studied in animal experiments. Here, we consider stochastic choice behavior, where at each time step a subject chooses alternative a with probability pa. We denote the reward as r. For the sake of simplicity, we restrict r to a binary variable: r = 0 represents the absence of a reward, and r = 1 means that a reward is given. The expected return, ⟨r|a⟩, refers to the average reward per choice of a, and the income, Ia, refers to the total amount of reward resulting from choice a, so that Ia / (Σ_a′ Ia′) is the fractional income from choice a. For a large number na of trials of choice a, the income per trial Ia/na approaches ⟨r|a⟩, and ⟨r⟩ = Σ_a′ ⟨r|a′⟩ pa′ is the average reward per trial over possible choice behavior.
The matching law states that Ia / (Σ_a′ Ia′) = pa for all a with pa ≠ 0. For a large number of trials, the fraction of income from an alternative a is expressed as

Ia / (Σ_a′ Ia′) = ⟨r|a⟩ pa / (Σ_a′ ⟨r|a′⟩ pa′) = ⟨r|a⟩ pa / ⟨r⟩.

The matching law states that this quantity equals pa for all a. For this to hold, it should satisfy

⟨r|A⟩ = ⟨r|B⟩ = ⟨r⟩,    (1)

if pA ≠ 0 and pB ≠ 0. Note that ⟨r|a⟩ is the average reward given the current choice, and this is a function of the past choices. Equation 1 is a condition for the matching law, and we will often use this identity.

3 Model

Decision Making Network: The decision making network we study consists of sensory-input neurons and output neurons that represent the subjective value of each alternative (we call the output neurons value-encoding neurons). The network is divided into two groups (A and B), which participate in choosing each alternative. Sensory cues from both targets are given simultaneously via the N-neuron populations, xA = (xA_1, ..., xA_N) and xB = (xB_1, ..., xB_N).^1 Each component of the input vectors xA and xB independently obeys a Gaussian distribution with mean X0 and variance one (these quantities can be spike counts during stimulus presentation).

The choice is made in such a way that alternative a is chosen if the potential of output unit ua, which will be specified below, is higher than that of the other alternative. Although we do not model this comparison process explicitly, it can be carried out via a winner-take-all competition mediated by feedback inhibition, as has been commonly assumed in decision making networks [3, 13]. In this competition, the "winner" group gains a high firing rate while the "loser" enters a low firing state [13].
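The matching condition formulated in Section 2 (Equation 1) can be illustrated with a small Monte-Carlo sketch. The expected returns below are illustrative values of ours, not taken from the paper; when ⟨r|A⟩ = ⟨r|B⟩, the fractional income equals the choice fraction for any pA.

```python
import random

# Monte-Carlo check of the matching condition (Eq. 1).
# Rewards are Bernoulli with fixed, hypothetical expected returns <r|a>.
def income_fractions(p_A, r_given_A, r_given_B, n_trials=200_000, seed=0):
    rng = random.Random(seed)
    income = {"A": 0, "B": 0}
    for _ in range(n_trials):
        a = "A" if rng.random() < p_A else "B"
        r_mean = r_given_A if a == "A" else r_given_B
        income[a] += 1 if rng.random() < r_mean else 0
    return income["A"] / (income["A"] + income["B"])

# When <r|A> = <r|B>, the income fraction equals the choice fraction
# for ANY p_A -- this is exactly <r|A> = <r|B> = <r>.
frac = income_fractions(p_A=0.7, r_given_A=0.3, r_given_B=0.3)
assert abs(frac - 0.7) < 0.01
```

When the expected returns differ, the income fraction no longer tracks the choice fraction, so a static choice probability generally violates matching.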
Let yA and yB denote the final outputs of the output neurons after competition; they are determined as

yA = 1, yB = 0, if uA ≥ uB,
yA = 0, yB = 1, if uA < uB.

With the synaptic efficacies (or weights) JA = (JA_1, ..., JA_N) and JB = (JB_1, ..., JB_N), the net inputs to the output units are given by

ha = Σ_{i=1}^{N} Ja_i xa_i,  a = A, B.    (2)

We assume that Ja_i is scaled as O(1/√N). This means that the mean of ha is O(√N), and thus diverges for large N, while the variance is kept of order unity. This is a key assumption of our model. If Ja_i were scaled as O(1/N) instead, the individual fluctuations in xa_i would be averaged out. It has been shown that the mean of the potential is kept of order unity, while fluctuations of order unity in the external sources (xa_i) still affect the potential of the output neuron, under the condition that recurrent inputs from inhibitory interneurons, excitatory recurrent inputs, and inputs from external sources (xa_i) are balanced [14]. We do not explicitly model this recurrent balancing mechanism, but phenomenologically incorporate it as follows.

Using the order parameters

la = ||Ja||,  J̄a = (1/√N) Σ_{i=1}^{N} Ja_i,    (3)

we find ha ∼ N(√N X0 J̄a, l²a), where N(µ, σ²) denotes the Gaussian distribution with mean µ and variance σ².

(Footnote 1: This assumption might be the case when the sensory input for each alternative is completely different, e.g., in position and in color, such as those in Sugrue et al.'s experiment [2]. The case in which the output neurons share the inputs from sensory neurons is analyzed in [12].)
We assume that ua obeys a Gaussian distribution with mean Ca E[ha]/√N and variance Ca Var[ha] + σ²p due to the recurrent balancing mechanism [14]. CA, CB and σ²p are constants that are determined according to the specific model architecture of the recurrent network, but we set CA = CB = 1 since they do not affect the qualitative properties of the model. Then, ua is computed as ua = ha − h̄a_rec + σp ε, with h̄a_rec = (1 − 1/√N) E[ha], where E[ha] = √N X0 J̄a and ε is a Gaussian random variable with zero mean and unit variance. The ua then obey independent Gaussian distributions whose means and variances are respectively given by X0 J̄a and l²a + σ²p. From this, the probability that the network will choose alternative A can be described as

pA = (1/2) erfc{ −X0 (J̄A − J̄B) / √(2(l²A + l²B + 2σ²p)) },    (4)

where erfc(·) is the complementary error function, erfc(x) = (2/√π) ∫_x^∞ e^{−t²} dt. This expression is in closed form in the order parameters. Thus, if we can describe the evolution of these order parameters, we can completely describe how the behavior of the model changes as a consequence of learning. In the following, we will often use an additional order parameter, the variance of the weights, σ²a. This parameter is more convenient for gaining insight into the evolution of the weights than the weight norm, la.
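Equation 4 is straightforward to evaluate numerically; a minimal sketch (the function and argument names are ours):

```python
from math import erfc, sqrt

# Choice probability of Equation 4, written in terms of the order
# parameters: X0, mean weights Jbar_a, squared norms l2_a, noise sigma_p^2.
def p_choose_A(X0, Jbar_A, Jbar_B, l2_A, l2_B, sigma_p2):
    z = -X0 * (Jbar_A - Jbar_B) / sqrt(2.0 * (l2_A + l2_B + 2.0 * sigma_p2))
    return 0.5 * erfc(z)

# Sanity checks: equal mean weights give an unbiased choice, and a larger
# Jbar_A biases the choice toward A.
assert abs(p_choose_A(2.0, 1.0, 1.0, 1.0, 1.0, 1.0) - 0.5) < 1e-12
assert p_choose_A(2.0, 1.5, 1.0, 1.0, 1.0, 1.0) > 0.5
```

Note that increasing l²A + l²B (or σ²p) flattens the dependence on J̄A − J̄B, which is the mechanism by which weight diffusion randomizes the choice.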
The diffusion of the weight distributions is reflected by increases in σ²a, i.e., the difference between the growth of the second-order moment of the weight distribution, l²a, and that of the square of its mean, J̄²a.

Learning Rules: We consider the following two learning rules, both of which belong to the class of covariance learning rules:

Reward-modulated (RM) Hebb rule:

Ja_i(t+1) = Ja_i(t) + (η/N) [r(t) − r̄(t)] ya(t) (xa_i(t) − cx),    (5)

Delta rule:

Ja_i(t+1) = Ja_i(t) + (η/N) [r(t) − r̄(t)] (xa_i(t) − cx),    (6)

where η is the learning rate, the bar denotes the expected value, and cx is a constant. The expectation of these updates is proportional to the covariance between the reward, r, and a measure of neural activity (ya(xa_i − cx) for the RM-Hebb rule, and xa_i − cx for the delta rule). Variants of the RM-Hebb rule have recently been studied intensively [4, 15, 16, 17, 18, 19, 20]. The delta rule has been used as an example of the covariance rule [3, 7] and has also been used as the learning rule in a model of perceptual learning [21]. The expected reward, r̄, can be estimated, e.g., with an exponential kernel such as r̄(t+1) = (1 − γ) r(t) + γ r̄(t) with a constant γ.
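The two updates (Equations 5 and 6) and the exponential-kernel reward estimate can be sketched as follows, vectorized over the N synapses (the function names are ours):

```python
import numpy as np

# One update step of the two covariance rules (Equations 5 and 6) for one
# group's weight vector, plus the exponential-kernel estimate of r_bar.
def rm_hebb_step(J, x, y, r, r_bar, eta, c_x):
    # Eq. 5: dJ_i = (eta/N) (r - r_bar) y_a (x_i - c_x)
    return J + (eta / J.size) * (r - r_bar) * y * (x - c_x)

def delta_step(J, x, r, r_bar, eta, c_x):
    # Eq. 6: same form, but without the postsynaptic factor y_a
    return J + (eta / J.size) * (r - r_bar) * (x - c_x)

def update_r_bar(r_bar, r, gamma):
    # r_bar(t+1) = (1 - gamma) r(t) + gamma r_bar(t)
    return (1.0 - gamma) * r + gamma * r_bar

# With y = 0 (the "loser" group), the RM-Hebb rule leaves J unchanged,
# while the delta rule updates both groups on every trial.
J = np.ones(4)
x = np.full(4, 2.0)
assert np.allclose(rm_hebb_step(J, x, y=0.0, r=1, r_bar=0.5, eta=0.1, c_x=1.9), J)
```

The only difference between the rules is the factor ya, which restricts the RM-Hebb update to the winning group's synapses.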
We assume that cx = (1 − 1/√N) X0 to simplify the following analysis.^2

(Footnote 2: With this assumption, the model can be transformed into a mathematically equivalent form in which the distribution of the input xa_i is replaced with N(X0/√N, 1) and the potential of the output is replaced with ua = Σ_{i=1}^{N} Ja_i xa_i + σp ξa, where ξa ∼ N(0, 1).)

4 Macroscopic Description of Learning Processes

Here, following the statistical mechanical analysis of on-line learning [8, 9, 10, 11], we derive equations that describe the evolution of the order parameters. To do this, we first rewrite the learning rule in vector form:

Ja(t+1) = Ja(t) + (1/N) Fa (xa − cx),    (7)

where for the RM-Hebb rule Fa = η(r_t − r̄_t) ya, and for the delta rule Fa = η(r_t − r̄_t). Taking the square norm of each side of Equation 7, we obtain la(t+1)² = la(t)² + (2/N) Fa h̃a + (1/N) Fa(t)² + O(1/N²), where we have defined h̃a = Σ_{i=1}^{N} Ja_i (xa_i − cx). Summing over all components on both sides of Equation 7, we obtain J̄a(t+1) = J̄a(t) + (1/N) Fa x̃a, where we have defined x̃a = (1/√N) Σ_{i=1}^{N} (xa_i − cx). In both of these equations, the magnitude of each update is of order 1/N. Hence, to change the order parameters by order one, O(N) updates are needed. Within a short period spanning the O(N) updates, the weight change of O(1/N) can be neglected and the self-averaging property holds. By using this property and introducing a continuous "time" scaled by N, i.e., α = t/N, the evolution of the order parameters obeys the ordinary differential equations:

dl²a/dα = 2⟨Fa h̃a⟩ + ⟨F²a⟩,  dJ̄a/dα = ⟨Fa x̃a⟩,    (8)

where ⟨·⟩ denotes the ensemble average over all possible inputs and arrivals of rewards.
The specific forms of the ensemble averages are obtained for reward-modulated Hebbian learning as

⟨Fa h̃a⟩ = η pa {⟨r|a⟩ − ⟨r⟩} ⟨h̃a|a⟩,
⟨F²a⟩ = η² pa {(1 − 2⟨r⟩)⟨r|a⟩ + ⟨r⟩²},
⟨Fa x̃a⟩ = η pa {⟨r|a⟩ − ⟨r⟩} ⟨x̃a|a⟩,

and for the delta rule,

⟨Fa h̃a⟩ = η { pa (⟨r|a⟩ − ⟨r|a′⟩) ⟨h̃a|a⟩ + (⟨r|a′⟩ − ⟨r⟩) J̄a X0 },
⟨F²a⟩ = η² {⟨r⟩(1 − ⟨r⟩)},
⟨Fa x̃a⟩ = η { pa (⟨r|a⟩ − ⟨r|a′⟩) ⟨x̃a|a⟩ + (⟨r|a′⟩ − ⟨r⟩) X0 },

where a′ denotes the alternative other than a. The conditional averages ⟨h̃a|a⟩ and ⟨x̃a|a⟩ in these equations are computed as

⟨h̃a|a⟩ = J̄a X0 + (l²a / (√(2πL²) pa)) exp(−X0² D²_J̄ / (2L²)),  ⟨x̃a|a⟩ = X0 + (J̄a / (√(2πL²) pa)) exp(−X0² D²_J̄ / (2L²)),    (9)

where we have defined L = √(l²A + l²B + 2σ²p) and D_J̄ = J̄B − J̄A. The details of the derivation are given in the supplementary material and [12].

Next, we consider weight normalization, in which the total length of the weight vector is kept constant. We adopted this weight normalization because of its analytical convenience rather than its biological realism. Other weight constraints would produce no clear differences in the following results. Specifically, we constrained the norm of the weights as ||J||² = 2, where J = (JA_1, ..., JA_N, JB_1, ..., JB_N). This is equivalent to keeping l²A + l²B = 2.
This is achieved by modifying the learning rule in the following way [22]:

Ja(t+1) = (Ja(t) + (1/N) Fa (xa − cx)) / √(1 + F/N),    (10)

with F ≡ FA h̃A + FB h̃B + (1/2)(F²A + F²B), provided that ||J||² = 2 holds at trial t. Expanding the right-hand side to first order in 1/N, we obtain differential equations analogous to Equation 8:

dl²a/dα = 2⟨Fa h̃a⟩ + ⟨F²a⟩ − ⟨F⟩ l²a,  dJ̄a/dα = ⟨Fa x̃a⟩ − (1/2) ⟨F⟩ J̄a.    (11)

With ⟨F⟩ = ⟨FA h̃A⟩ + ⟨FB h̃B⟩ + (1/2)(⟨F²A⟩ + ⟨F²B⟩), we find that d(l²A + l²B)/dα becomes zero when l²A + l²B = 2; thus, the length of the weight vector is kept constant.

Figure 1: Evolution of choice probability and order parameters for the RM-Hebb rule (A, B, E, F) and the delta rule (C, D, G, H), without weight normalization (A-D) and with normalization (E-H). Parameters were X0 = 2, η = 0.1 and σp = 1, and the reward schedule was a VI schedule (see main text) with λA = 0.2, λB = 0.1. Lines represent the results of the theory, and symbols plot the mean of ten trials of computer simulation. Simulations were done for N = 1,000. Error bars indicate standard deviation (s.d.); they are almost invisible for the choice probability since the s.d. is very small.

5 Results

To demonstrate the behavior of the model, we used a time-discrete version of a variable-interval (VI) reward schedule, which is commonly used for studying the matching law. In a VI schedule, a reward is assigned to the two alternatives stochastically and independently, with a constant probability λa for alternative a (a = A, B).
The reward remains available until it is harvested by choosing that alternative. Here, we use λA = 0.2, λB = 0.1. For this task setting, the choice probability that yields matching behavior (denoted as pmatch_A) is pmatch_A = 0.6923. Figure 1(A-D) plots the evolution of the choice probability and the order parameters for the two learning rules without a weight normalization constraint. The lines represent the results of the theory and the symbols plot the results of simulations. The theory agrees well with the computer simulations (N = 1,000), indicating the validity of our analysis. We can see that the choice probability approaches the value that yields matching behavior (pmatch_A), while the order parameters J̄a and σa continue to change without becoming saturated. The weight standard deviation, σa, always increases (the synaptic weight diffusion).

Figure 1(E-H) plots the results with weight normalization. Again, the theory agrees well with the computer simulations. For the RM-Hebb rule, the choice probability saturates at a value below pmatch_A. For the delta rule, the choice probability first approaches pmatch_A, but without reaching it; it then returns toward the uniform choice probability (pA = 0.5) due to its larger diffusion effect compared with the RM-Hebb rule.

5.1 Matching Behavior Is Not Necessarily a Steady State of Learning

From Figure 1, the choice probability seems to asymptotically approach matching behavior in the case without weight normalization. However, matching behavior is not necessarily a steady state of learning. In Figure 2, the order parameters are initialized so that pA(0) = pmatch_A, and then Equations 8 and 11 are numerically solved. We see that pA does not remain at pmatch_A but changes toward the uniform choice (pA = 0.5) for both learning rules. Then, for the RM-Hebb rule, pA evolves back toward pmatch_A, but the delta rule does not do so.
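The numerical solution of the macroscopic equations referred to above can be sketched with a simple Euler integration of Equation 8 (RM-Hebb rule without normalization) on the VI schedule. The stationary harvest probability ⟨r|a⟩ below assumes i.i.d. choices with probability pa, a simplifying assumption of ours:

```python
import math

# Stationary harvest probability on a discrete VI schedule: between two
# consecutive choices of alternative a, k ~ Geometric(p_a) steps elapse,
# and the bait (prob. lam per step) persists, so <r|a> = 1 - E[(1-lam)^k].
def r_given(lam, p):
    return 1.0 - p * (1.0 - lam) / (1.0 - (1.0 - p) * (1.0 - lam))

def simulate(X0=2.0, eta=0.1, sp2=1.0, lamA=0.2, lamB=0.1,
             steps=50000, dalpha=0.01):
    Jb = [0.0, 0.0]            # mean weights Jbar_A, Jbar_B
    l2 = [1.0, 1.0]            # squared norms l_A^2, l_B^2
    lam = [lamA, lamB]
    for _ in range(steps):
        L2 = l2[0] + l2[1] + 2.0 * sp2
        pA = 0.5 * math.erfc(-X0 * (Jb[0] - Jb[1]) / math.sqrt(2.0 * L2))
        p = [pA, 1.0 - pA]
        ra = [r_given(lam[a], p[a]) for a in (0, 1)]
        r_mean = p[0] * ra[0] + p[1] * ra[1]
        gauss = math.exp(-X0**2 * (Jb[1] - Jb[0])**2 / (2.0 * L2)) \
                / math.sqrt(2.0 * math.pi * L2)
        for a in (0, 1):
            h_cond = Jb[a] * X0 + l2[a] * gauss / p[a]   # <h~_a | a>, Eq. 9
            x_cond = X0 + Jb[a] * gauss / p[a]           # <x~_a | a>, Eq. 9
            Fh = eta * p[a] * (ra[a] - r_mean) * h_cond
            F2 = eta**2 * p[a] * ((1 - 2 * r_mean) * ra[a] + r_mean**2)
            Fx = eta * p[a] * (ra[a] - r_mean) * x_cond
            l2[a] += dalpha * (2.0 * Fh + F2)            # Eq. 8, Euler step
            Jb[a] += dalpha * Fx
    return pA

# The choice probability should rise from 0.5 toward (but stay below) the
# matching value 0.6923 for lambda_A = 0.2, lambda_B = 0.1 (cf. Figure 1).
pA = simulate()
assert 0.5 < pA < 0.75
```

This is a rough sketch, not the authors' solver; it nevertheless shows the approach toward matching from below, with the diffusion term ⟨F²a⟩ continually inflating l²a.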
Figure 2: Strict matching behavior is not an equilibrium point. We set the initial values of the order parameters to yield perfect matching for (A) the no-normalization condition and (B) the normalization condition. In both cases, the choice probability that yields perfect matching is repulsive. For the no-normalization condition, the initial conditions were first set at J̄B = 1.0, σA = σB = 1.0, and then J̄A was determined so that pA = pmatch_A. For the normalization condition, these values were rescaled so that the normalization condition was met.

To understand the mechanism underlying this repulsive property of matching behavior, let us substitute the condition of the matching law, ⟨r|A⟩ = ⟨r|B⟩ = ⟨r⟩, into Equation 8, for the no-normalization condition. We then find that ⟨Fa h̃a⟩ and ⟨Fa x̃a⟩ are zero, but ⟨F²a⟩ is non-zero and positive except in the uninteresting case where r always takes the same value. Therefore, when pA = pmatch_A, the variance of the weights increases, i.e., dσ²a/dα = d(l²a − J̄²a)/dα > 0. This moves the choice probabilities toward unbiased choice behavior, pA = 0.5 (see Equation 4). This is the reason that pmatch_A is repulsive. This result is in contrast with the N = 1 case [7], where the average changes stop when pA converges to pmatch_A.

With weight normalization, l²A + l²B in Equation 4 is always two; thus, the only factor that determines the choice probability is the difference between J̄A and J̄B. Substituting ⟨r|a⟩ = ⟨r⟩, ∀a, into Equation 11, only the terms ⟨F²a⟩ remain, and we obtain d(J̄B − J̄A)/dα = −(1/2)⟨F⟩ (J̄B − J̄A), with ⟨F⟩ = (1/2)(⟨F²A⟩ + ⟨F²B⟩). Except in the uninteresting cases where r is always 0 or 1, ⟨F²a⟩ > 0 holds; thus, the absolute difference, |J̄B − J̄A|, always decreases.
Hence, again, the choice probability at pmatch_A approaches unbiased choice behavior due to the diffusion effect.

Nevertheless, the choice probability of the RM-Hebb rule without weight normalization asymptotically converges to pmatch_A. The reason for this can be explained as follows. First, we rewrite the choice probability as

pA = (1/2) erfc{ −X0 (J̄A − J̄B) / √(2(J̄²A + J̄²B + σ²A + σ²B + 2σ²p)) }.    (12)

From this expression, we find that the larger the magnitude of J̄a is, the weaker the effect of increases in σa. The "diffusion term", ⟨F²a⟩, which moves pA away from pmatch_A, depends on pA but not on the magnitude of the J̄a's. Thus, within the set of order parameters satisfying pA = pmatch_A, the larger the magnitudes of the J̄a's are, the weaker the repulsive effect is. If |J̄B − J̄A| → ∞ while σA and σB remain finite, pA stays at pmatch_A. Because |J̄B − J̄A| can increase faster than σA and σB in the RM-Hebb rule without any weight constraints, the network approaches such situations. This is the reason that in Figure 2A the pA returned to pmatch_A after it was repulsed from pmatch_A. When weight normalization is imposed, the magnitudes of the J̄a's are limited, with |J̄B − J̄A| < 2. Thus, the diffusion effect prevents pA from approaching pmatch_A. In the delta rule, the magnitude of the J̄a's cannot increase independently of the σa's.
Thus, pA saturates before it reaches pmatch_A, at the point where the increase in |J̄B − J̄A| and those in the σa's are balanced.

Figure 3: Evolution of choice probability for various learning rates, η. Top rows are for the non-weight-normalization condition and bottom rows are for the normalization condition. Columns at left are for the RM-Hebb rule and those at right are for the delta rule. Parameters for the model and task schedules are the same as those in Figure 1. Initial conditions were set at σa = 0.0 (a = A, B), with J̄a = 5.0 for the non-normalization condition and J̄a = 1.0 for the normalization condition.

5.2 Learning Rate Dependence of Learning Behavior

Next, we investigate how the learning rate, η, affects the choice behavior. The "diffusion term", ⟨F²a⟩, is quadratic in the learning rate η, whereas only first-order terms of η appear in the other terms. Therefore, if η is small, the repulsive effect from matching behavior due to the diffusion effect is expected to weaken. Figure 3 plots the dependence of the evolution of pA on η. As a whole, as η is decreased, the asymptotic value of pA approaches matching behavior, but the relaxation slows down due to the diffusion of synaptic weights. As we previously discussed, the diffusion effect is more evident for the delta rule than for the RM-Hebb rule, and for the weight-normalization condition than for the non-normalization condition.
This tendency becomes evident as \u03b7 increases.\n\na\n\nFor the RM-Hebb rule without normalization, networks approach matching behavior even for a very\nlarge learning rate (\u03b7 = 1000). At the beginning of learning when (cid:22)Ja is of small magnitude, the dif-\nfusion term, \u27e8F 2\n\u27e9, has a large impact so that it greatly impedes learning for a large \u03b7 case. However,\nas the magnitude of the differences (cid:22)JA \u2212 (cid:22)JB increases, this effect weakens and the dependence of\npA on \u03b7 becomes quite small. Although there is still a deviation from perfect matching (see inset of\nFigure 3A), the asymptotic value is almost unaffected in the RM-Hebb rule. For the delta rule with-\nout normalization, the asymptotic values gradually depend on \u03b7. With normalization constraints, the\nRM-Hebb rule also demonstrate graded dependence of asymptotic probability on \u03b7. These results\nre\ufb02ect the fact that the greater learning rate \u03b7 is, the larger the diffusion effect.\n\n5.3 Deviation from Matching Law\n\nChoices by animals in many experiments deviate slightly from matching behavior toward unbi-\nased random choice, a phenomenon called undermatching [2, 23]. The synaptic diffusion effects\nreproduces this phenomenon. Figure 4A,B plots choice probability for option A as a function of\nthe fraction income from the option.\nIf this function lies at the diagonal line, it corresponds to\nmatching behavior. For the RM-rule with weight normalization, as the learning rate \u03b7 increases, the\nchoice probabilities deviate from matching behavior towards unbiased random choice, pA = 0.5\n(Figure 4A). Similar results are obtained for another weight constraint, the hard bound condition\n\u221a\n(Figure 4B). In this condition, if the updates makes J a\nis set to\nN (or 0). We see that the larger the \u03b7 is, the broader the weight distributions due the the\nJmax/\nsynaptic diffusion effects (Figure 4A). 
This result suggests that the weight diffusion effect causes un-\ndermathing regardless of the way of weight constraint, as long as the synaptic weights are con\ufb01ned\nto a \ufb01nite range, as predicted by our theory.\n\n\u221a\ni > Jmax/\n\nN (or J a\n\ni < 0), J a\n\ni\n\n7\n\n\u001f\u001e\u0002\"\u001f\u001e\u000f\u0002 \u001f\u001e\u001e\u001f\u001e \u001f\u001e\"\u001f\u001e$\u001e\u0002\"#\u001e\u0002#\u001e\u0002##\u001e\u0002$\u001e\u0002$#\u001e\u0002%\u001e\u0002%#\u001e\u0002&\u0014\u001f\u001e\u000f\u0002\"\u001f\u001e\u000f\u0002 \u001f\u001e\u001e\u001f\u001e \u001f\u001e\"\u001f\u001e$\u001e\u0002\"#\u001e\u0002#\u001e\u0002##\u001e\u0002$\u001e\u0002$#\u001e\u0002%\u001e\u0002%#\u001e\u0002&\u0014\u0014\u0014\u0003\u0014\u001e\u0002\u001e\u001f\u001e\u0014\u0003\u0014\u001e\u0002\u001f\u001e\u001e\u0014\u0003\u0014\u001f\u0002\u001e\u001e\u0014\u0003\u0014\u001f\u001e\u0002\u001e\u001e\u0014\u0003\u0014\u001f\u001e\u001e\u001e\u0002\u001e\u001e\u001f\u001e\u000f\u0002\"\u001f\u001e\u000f\u0002 \u001f\u001e\u001e\u001f\u001e \u001f\u001e\"\u001f\u001e$\u001e\u0002\"#\u001e\u0002#\u001e\u0002##\u001e\u0002$\u001e\u0002$#\u001e\u0002%\u001e\u0002%#\u001e\u0002&\u0014\u0014\u001f\u001e\u000f\u0002\"\u001f\u001e\u000f\u0002 \u001f\u001e\u001e\u001f\u001e \u001f\u001e\"\u001f\u001e$\u001e\u0002\"#\u001e\u0002#\u001e\u0002##\u001e\u0002$\u001e\u0002$#\u001e\u0002%\u001e\u0002%#\u001e\u0002&\u0014\u0014\u0014\u0003\u0014\u001e\u0002\u001e\u001f\u001e\u0014\u0003\u0014\u001e\u0002\u001f\u001e\u001e\u0014\u0003\u0014\u001f\u0002\u001e\u001e\u0014\u0003\u0014 
Figure 4: Constraints on synaptic weights lead to undermatching behavior through synaptic diffusion effects. (A) Choice probability for A as a function of the fractional income for A for the RM-Hebb rule with weight normalization. We used VI schedules with λA = 0.3a and λB = 0.3(1 − a), varying the constant a (0 ≤ a ≤ 1). The results were obtained using stationary points of the macroscopic equations. The diagonal line indicates perfect matching behavior. As the learning rate η increases, the choice probabilities deviate from matching behavior towards unbiased random choice, pA = 0.5. (B) The same plot as (A) for the RM-Hebb rule with the hard-bound condition (the synaptic weights are restricted to the interval [0, Jmax/√N], where Jmax = 5.0), obtained by numerical simulations. Simulations were done for N = 500. (C) The weight distribution after convergence for the simulations in (B), indicated by the gray arrows.

6 Discussion

In this study, we analyzed the reward-based learning procedure in simple, large-scale decision making networks. To achieve this, we employed techniques from statistical mechanics.
Although statistical mechanical analysis has been successfully applied to analyze the dynamics of learning in neural networks, to the best of our knowledge we are the first to apply it to reward-modulated learning in decision making networks. We have assumed that the activities of sensory neurons are independent. In realistic cases, there may be correlations among sensory neurons. The existence of correlation weakens the diffusion effects. However, as long as there are independent fluctuations, as observed in many physiological studies, the diffusion effects remain at play.

If only a single plastic synapse is taken into consideration, covariance learning rules seem to make matching behavior a steady state of learning. However, in situations where a large number of synapses simultaneously modify their efficacy, matching behavior cannot be a steady state. This is because the randomness in weight modifications affects the choice probability of the network, and this effect feeds back into the learning process. These results may offer insights into learning behavior in large-scale neural circuits.

Choice behavior in many experiments deviates slightly from matching behavior toward unbiased choice behavior, a phenomenon called undermatching [23, 2]. There are several possible explanations for this phenomenon. The learning rule employed by Soltani & Wang [4] is equivalent to state-less Q-learning in the reinforcement learning literature [15]. Sakai & Fukai [5, 6] proved that Q-learning does not lead to matching behavior. Thus, the Soltani-Wang model is intrinsically incapable of reproducing matching behavior. The authors interpreted this departure from matching behavior, due to limitations of the learning rule, as a possible mechanism for undermatching. Loewenstein [7] suggested that the mistuning of parameters in the covariance learning rule could cause undermatching.
However, we found that in some task settings the mistuning can cause overmatching rather than undermatching [12]. Our findings in this study add one possible mechanism for undermatching: it can be caused by the diffusion of synaptic efficacies. The diffusion effects provide a robust mechanism for undermatching, reproducing undermatching behavior regardless of the specific task settings.

Achieving stochastic choice behavior is thought to require fine-tuning of network parameters [16], yet random choice behavior is often observed in behavioral experiments. Our results suggest that the broad distributions of synaptic weights observed in experiments [24] can make stochastic choice behavior easier to realize than previously thought.

References

[1] R. J. Herrnstein, H. Rachlin, and D. I. Laibson. The Matching Law. Russell Sage Foundation, New York, 1997.

[2] L. P. Sugrue, G. S. Corrado, and W. T. Newsome. Matching behavior and the representation of value in the parietal cortex. Science, 304(5678):1782–1787, 2004.

[3] Y. Loewenstein and H. S. Seung. Operant matching is a generic outcome of synaptic plasticity based on the covariance between reward and neural activity. Proceedings of the National Academy of Sciences, 103(41):15224–15229, 2006.

[4] A. Soltani and X. J. Wang. A biophysically based neural model of matching law behavior: melioration by stochastic synapses. Journal of Neuroscience, 26(14):3731–3744, 2006.

[5] Y. Sakai and T. Fukai. The actor-critic learning is behind the matching law: Matching versus optimal behaviors. Neural Computation, 20(1):227–251, 2008.

[6] Y. Sakai and T. Fukai. When does reward maximization lead to matching law? PLoS ONE, 3(11):e3795, 2008.

[7] Y. Loewenstein. Robustness of learning that is based on covariance-driven synaptic plasticity. PLoS Computational Biology, 4(3):e1000007, 2008.

[8] W.
Kinzel and P. Rujan. Improving a network generalization ability by selecting examples. Europhysics Letters, 13(5):473–477, 1990.

[9] D. Saad. On-line learning in neural networks. Cambridge University Press, 1998.

[10] G. Reents and R. Urbanczik. Self-averaging and on-line learning. Physical Review Letters, 80(24):5445–5448, 1998.

[11] M. Biehl, N. Caticha, and P. Riegler. Statistical mechanics of on-line learning. Similarity-Based Clustering, pages 1–22, 2009.

[12] K. Katahira, K. Okanoya, and M. Okada. Statistical mechanics of reward-modulated learning in decision making networks. Under review.

[13] X. J. Wang. Probabilistic decision making by slow reverberation in cortical circuits. Neuron, 36(5):955–968, 2002.

[14] C. van Vreeswijk and H. Sompolinsky. Chaotic balanced state in a model of cortical circuits. Neural Computation, 10(6):1321–1371, 1998.

[15] A. Soltani, D. Lee, and X. J. Wang. Neural mechanism for stochastic behaviour during a competitive game. Neural Networks, 19(8):1075–1090, 2006.

[16] S. Fusi, W. F. Asaad, E. K. Miller, and X. J. Wang. A neural circuit model of flexible sensorimotor mapping: learning and forgetting on multiple timescales. Neuron, 54(2):319–333, 2007.

[17] E. M. Izhikevich. Solving the distal reward problem through linkage of STDP and dopamine signaling. Cerebral Cortex, 17:2443–2452, 2007.

[18] R. V. Florian. Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity. Neural Computation, 19(6):1468–1502, 2007.

[19] M. A. Farries and A. L. Fairhall. Reinforcement learning with modulated spike timing dependent synaptic plasticity. Journal of Neurophysiology, 98(6):3648–3665, 2007.

[20] R. Legenstein, D. Pecevski, and W. Maass. A learning theory for reward-modulated spike-timing-dependent plasticity with application to biofeedback.
PLoS Computational Biology, 4(10):e1000180, 2008.

[21] C. T. Law and J. I. Gold. Reinforcement learning can account for associative and perceptual learning on a visual-decision task. Nature Neuroscience, 12(5):655–663, 2009.

[22] M. Biehl. An exactly solvable model of unsupervised learning. Europhysics Letters, 25(5):391–396, 1994.

[23] W. M. Baum. On two types of deviation from the matching law: Bias and undermatching. Journal of the Experimental Analysis of Behavior, 22(1):231–242, 1974.

[24] B. Barbour, N. Brunel, V. Hakim, and J. P. Nadal. What can we learn from synaptic weight distributions? Trends in Neurosciences, 30(12):622–629, 2007.