{"title": "Functional network reorganization in motor cortex can be explained by reward-modulated Hebbian learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1105, "page_last": 1113, "abstract": "The control of neuroprosthetic devices from the activity of motor cortex neurons benefits from learning effects where the function of these neurons is adapted to the control task. It was recently shown that tuning properties of neurons in monkey motor cortex are adapted selectively in order to compensate for an erroneous interpretation of their activity. In particular, it was shown that the tuning curves of those neurons whose preferred directions had been misinterpreted changed more than those of other neurons. In this article, we show that the experimentally observed self-tuning properties of the system can be explained on the basis of a simple learning rule. This learning rule utilizes neuronal noise for exploration and performs Hebbian weight updates that are modulated by a global reward signal. In contrast to most previously proposed reward-modulated Hebbian learning rules, this rule does not require extraneous knowledge about what is noise and what is signal. The learning rule is able to optimize the performance of the model system within biologically realistic periods of time and under high noise levels. When the neuronal noise is fitted to experimental data, the model produces learning effects similar to those found in monkey experiments.", "full_text": "Functional network reorganization in motor cortex\n\ncan be explained by reward-modulated Hebbian\n\nlearning\n\nRobert Legenstein1\u2217, Steven M. Chase2,3,4, Andrew B. 
Schwartz2,3, Wolfgang Maass1\n\n1Institute for Theoretical Computer Science, Graz University of Technology, Austria\n2Department of Neurobiology, University of Pittsburgh\n3Center for the Neural Basis of Cognition\n4Department of Statistics, Carnegie Mellon University\n\nAbstract\n\nThe control of neuroprosthetic devices from the activity of motor cortex neurons benefits from learning effects where the function of these neurons is adapted to the control task. It was recently shown that tuning properties of neurons in monkey motor cortex are adapted selectively in order to compensate for an erroneous interpretation of their activity. In particular, it was shown that the tuning curves of those neurons whose preferred directions had been misinterpreted changed more than those of other neurons. In this article, we show that the experimentally observed self-tuning properties of the system can be explained on the basis of a simple learning rule. This learning rule utilizes neuronal noise for exploration and performs Hebbian weight updates that are modulated by a global reward signal. In contrast to most previously proposed reward-modulated Hebbian learning rules, this rule does not require extraneous knowledge about what is noise and what is signal. The learning rule is able to optimize the performance of the model system within biologically realistic periods of time and under high noise levels. When the neuronal noise is fitted to experimental data, the model produces learning effects similar to those found in monkey experiments.\n\n1 Introduction\n\nIt is a commonly accepted hypothesis that adaptation of behavior results from changes in synaptic efficacies in the nervous system. However, there exists little knowledge about how changes in synaptic efficacies change behavior and about the learning principles that underlie such changes. 
Recently, one important hint has been provided in the experimental study [1] of a monkey controlling a neuroprosthetic device. The monkey's intended movement velocity vector can be extracted from the firing rates of a group of recorded units by the population vector algorithm, i.e., by computing the weighted sum of their preferred directions (PDs), where each weight is the unit's normalized firing rate [2].1 In [1], this velocity vector was used to control a cursor in a 3D virtual reality environment. The task for the monkey was to move the cursor from the center of an imaginary cube to a target appearing at one of its corners. It is well known that performance increases with practice when monkeys are trained to move to targets in similar experimental setups, i.e., the function of recorded neurons is adapted such that control over the new artificial "limb" is improved [3]. In [1], it was systematically studied how such reorganization changes the tuning properties of recorded neurons. The authors manipulated the interpretation of recorded firing rates by the readout system (i.e., the system that converts firing rates of recorded neurons into cursor movements). When the interpretation was altered for a subset of neurons, the tuning properties of the neurons in this subset changed significantly more strongly than those of neurons for which the interpretation of the readout system was not changed. Hence, the experiment showed that motor cortical neurons can change their activity specifically and selectively to compensate for an altered interpretation of their activity within some task.\n\n∗To whom correspondence should be addressed: robert.legenstein@igi.tugraz.at\n1In general, a unit is not necessarily equal to a neuron in the experiments. Since the spikes of a unit are determined by a spike sorting algorithm, a unit may represent the mixed activity of several neurons. 
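The population vector computation can be sketched in a few lines; a minimal illustration with toy data, where normalizing the rates by baseline and modulation depth is one common convention and all names and numbers are ours:

```python
import numpy as np

def population_vector(rates, baselines, mod_depths, pds):
    """Population vector algorithm: sum of unit preferred directions (PDs),
    each weighted by the unit's normalized firing rate."""
    norm_rates = (rates - baselines) / mod_depths  # normalized rates
    return norm_rates @ pds                        # decoded 3D velocity

# toy example: 40 units with random unit-norm PDs and cosine-tuned rates
rng = np.random.default_rng(0)
pds = rng.normal(size=(40, 3))
pds /= np.linalg.norm(pds, axis=1, keepdims=True)
d_true = np.array([0.0, 0.0, 1.0])               # intended movement direction
rates = 30.0 + 25.0 * pds @ d_true               # toy baseline 30, depth 25
v = population_vector(rates, baselines=30.0, mod_depths=25.0, pds=pds)
```

With PDs spread over the sphere, the decoded vector `v` points roughly along the intended direction, which is what makes the algorithm usable for cursor control.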
Such an adjustment strategy is quite surprising, since it is not clear how the cortical adaptation mechanism is able to determine for which subset of neurons the interpretation was altered. We refer to this learning effect as the "credit assignment" effect.\n\nIn this article, we propose a simple synaptic learning rule and apply it to a model neural network. This learning rule is capable of optimizing performance in a 3D reaching task, and it can explain the learning effects described in [1]. It is biologically realistic since weight changes are based exclusively on local variables and a global scalar reward signal R(t). The learning rule is reward-modulated Hebbian in the following sense: weight changes at synapses are driven by the correlation between a global reward signal, the presynaptic activity, and the difference of the postsynaptic potential from its recent mean (see [4] for a similar approach). Several reward-modulated Hebbian learning rules have been studied for quite some time, both in the context of rate-based [5, 6, 7, 8, 4] and spiking models [9, 10, 11, 12, 13, 14, 15, 16]. They turn out to be viable learning mechanisms in many contexts and constitute a biologically plausible alternative [17, 18] to the backpropagation-based mechanisms preferentially used in artificial neural networks. One important feature of the learning rule proposed in this article is that noisy neuronal output is used for exploration to improve performance. It has often been hypothesized that neuronal variability can optimize motor performance. For example, in songbirds, syllable variability results in part from variations in the motor command, i.e., from the variability of neuronal activity [19]. Furthermore, there exists evidence for the songbird system that motor variability reflects meaningful motor exploration that can support continuous learning [20]. 
We show that relatively high amounts of noise are beneficial for the adaptation process but not problematic for the readout system. We find that under realistic noise conditions, the learning rule produces effects surprisingly similar to those found in the experiments of [1]. Furthermore, the version of the reward-modulated Hebbian learning rule that we propose does not require extraneous information about what is noise and what is signal. Thus, we show in this study that reward-modulated learning is a possible explanation for experimental results about neuronal tuning changes in monkey motor cortex. This suggests that reward-modulated learning is an important plasticity mechanism for the acquisition of goal-directed behavior.\n\n2 Learning effects in monkey motor cortex\n\nIn this section, we briefly describe the experimental results of [1] as well as the network that we used to model learning in motor cortex. Neurons in motor and premotor cortex of primates are broadly tuned to intended arm movement direction [21, 3].2 This sets the basis for the ability to extract intended arm movement from recorded neuronal activity in these areas. The tuning curve of a direction-tuned neuron is given by its firing rate as a function of movement direction. This curve can be fitted reasonably well by a cosine function. The preferred direction (PD) pi ∈ R3 of a neuron i is defined as the direction in which the cosine fit to its firing rate is maximal, and the modulation depth is defined as the difference in firing rate between the maximum of the cosine fit and the baseline (mean). The experiments in [1] consisted of a sequence of four brain control sessions: Calibration, Control, Perturbation, and Washout. The tuning functions of an average of 40 recorded neurons were obtained in the Calibration session, where the monkey moved its hand in a center-out reaching task. 
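Since the cosine tuning model can be written as β + α cos θ = β + α pᵀd for unit direction vectors d, the PD, modulation depth, and baseline can be recovered by ordinary least squares. A minimal sketch (function name and toy data are ours):

```python
import numpy as np

def fit_cosine_tuning(directions, rates):
    """Fit r = beta + alpha * (p . d) by least squares.
    directions: (T, 3) unit vectors; rates: (T,) firing rates.
    Returns the preferred direction p (unit vector), the modulation
    depth alpha, and the baseline beta."""
    # design matrix [1, dx, dy, dz]; solve for [beta, b] with b = alpha * p
    X = np.hstack([np.ones((len(directions), 1)), directions])
    coef, *_ = np.linalg.lstsq(X, rates, rcond=None)
    beta, b = coef[0], coef[1:]
    alpha = np.linalg.norm(b)
    return b / alpha, alpha, beta

# recover a known tuning from noiseless toy data
rng = np.random.default_rng(1)
d = rng.normal(size=(200, 3))
d /= np.linalg.norm(d, axis=1, keepdims=True)
p_true = np.array([1.0, 0.0, 0.0])
r = 30.0 + 25.0 * d @ p_true
p, alpha, beta = fit_cosine_tuning(d, r)
```

On noiseless data the fit is exact; with noisy rates the same code returns the least-squares estimates of PD, modulation depth, and baseline.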
Those PDs (or manipulated versions of them) were later used for decoding neural trajectories. We refer to PDs used for decoding as "decoding PDs" (dPDs) in order to distinguish them from measured PDs. In the Control, Perturbation, and Washout sessions the monkey had to perform a cursor control task in a 3D virtual reality environment (see Figure 1B). The cursor was initially positioned in the center of an imaginary cube, and a target position on one of the corners of the cube was randomly selected and made visible. When the monkey managed to hit the target position with the cursor or a 3s time period expired, the cursor position was reset to the origin and a new target position was randomly selected from the eight corners of the imaginary cube. In the Control session, the measured PDs were used as dPDs for cursor control. In the Perturbation session, the dPDs of a randomly selected subset of neurons (25% or 50% of the recorded neurons) were altered. This was achieved by rotating the measured PDs by 90 degrees around the x, y, or z axis (all PDs were rotated around a single common axis in each experiment). We term these neurons rotated neurons. Other dPDs remained the same as in the Control session (non-rotated neurons). The measured PDs were used for cursor control in the subsequent Washout session. In the Perturbation session, neurons adapted their firing behavior to compensate for the altered dPDs. The authors observed differential effects of learning for the two groups of non-rotated neurons and rotated neurons. Rotated neurons tended to shift their PDs in the direction of dPD rotation, thus compensating for the perturbation. For non-rotated neurons, the change of the preferred directions was weaker and significantly less strongly biased towards the rotation direction. We refer to this differential behavior of rotated and non-rotated neurons as the "credit assignment effect".\n\n2Arm movement refers to movement of the endpoint of the arm.\n\nFigure 1: Description of the 3D cursor control task and network model for cursor control. A) Schematic of the network model. A set of m neurons projects to ntotal noisy neurons in motor cortex. The monkey arm movement was modeled by a fixed linear mapping from the activities of the modeled motor cortex neurons to the 3D velocity vector of the monkey arm. A subset of n neurons in the simulated motor cortex was recorded for cursor control. The cursor velocity was given by the population vector. B) The task was to move the cursor from the center of an imaginary cube to one of its eight corners.\n\nNetwork and neuron model: Our aim in this article is to explain the described effects in the simplest possible model. The model consisted of two populations of neurons, see Figure 1A. The input population modeled those neurons which provide input to the neurons in motor cortex. It consisted of m = 100 neurons with activities x1(t), . . . , xm(t) ∈ R. Another population modeled neurons in motor cortex which receive inputs from the input population. It consisted of ntotal = 340 neurons with activities s1(t), . . . , sntotal(t).3 All modeled motor cortex neurons were used to determine the monkey arm movement in our model. A small number of them (n = 40) modeled recorded neurons used for cursor control. We denote the activities of this subset as s1(t), . . . 
, sn(t).\n\nThe total synaptic input ai(t) for neuron i at time t was modeled as a noisy weighted sum of its inputs:\n\nai(t) = Σ_{j=1}^{m} wij xj(t) + ξi(t),    ξi(t) drawn from distribution D(ν),    (1)\n\nwhere wij is the synaptic efficacy from input neuron j to neuron i. These weights were set randomly from a uniform distribution in the interval [−0.5, 0.5] at the beginning of each simulation. ξi(t) models some exploratory signal needed to explore possibly better network behaviors. In cortical neurons, this exploratory signal could for example result from neuronal or synaptic noise, or it could be spontaneous activity of the neuron. An independent sample from the zero-mean distribution D(ν) was drawn as the exploratory signal ξi(t) at each time step. The parameter ν (exploration level) determines the variance of the distribution and hence the amount of noise in the neuron. A nonlinear function was applied to the total synaptic input, si(t) = σ(ai(t)), to obtain the activity si(t) of neuron i at time t. We used the piecewise linear activation function σ : R → R, σ(x) = max{x, 0}, in order to guarantee non-negative firing rates.\n\n3The distinction between these two layers is purely functional. Input neurons may be situated in extracortical areas, in other cortical areas, or even in motor cortex itself. The functional feature of these two populations in our model is that learning takes place solely in the synapses of projections between these populations, since the aim of this article is to explain the learning effects in the simplest model. But in principle the same learning rule is applicable to multilayer networks.\n\nTask model: We modeled the cursor control task as shown in Figure 1B. Eight possible cursor target positions were located at the corners of a unit cube in 3D space which had its center at the origin of the coordinate system. At each time step t the desired direction of cursor movement y∗(t) was computed from the current cursor and target position. By convention, the desired direction y∗(t) had unit Euclidean norm. From the desired movement direction y∗(t), the activities x1(t), . . . , xm(t) of the neurons that provide input to the motor cortex neurons were computed, and the activities s1(t), . . . , sn(t) of the recorded neurons were used to determine the cursor velocity via their population activity vector (see below).\n\nIn order to model the cursor control experiment, we had to determine the PDs of recorded neurons. Obviously, to determine PDs, one needs a model for monkey arm movement. In monkeys, the transformation from motor cortical activity to arm movements involves a complicated system of several synaptic stages. In our model, we treated this transformation as a black box. Experimental findings suggest that monkey arm movements can be predicted quite well by a linear model based on the activities of a small number of motor cortex neurons [3]. We therefore assumed that the direction of the monkey arm movement yarm(t) at time t can be modeled in a linear way, using the activities of the total population of the ntotal cortical neurons s1(t), . . . , sntotal(t) in our simple model and a fixed randomly chosen 3 × ntotal linear mapping Q (see [23]). With the transformation from motor cortex neurons to monkey arm movements being defined, the input to the network for a given desired direction y∗ should be chosen such that motor cortex neurons produce a monkey arm movement close to the desired movement direction. 
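A minimal sketch of this network and neuron model in Python (the dimensions m = 100 and ntotal = 340 and the uniform weight initialization follow the text; the input values and the exploration level ν are arbitrary toy choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_total = 100, 340                   # input neurons, motor cortex neurons

# plastic weights w_ij, initialized uniformly in [-0.5, 0.5] as in the text
W = rng.uniform(-0.5, 0.5, size=(n_total, m))
# fixed random 3 x n_total linear mapping from cortical activity to arm velocity
Q = rng.normal(size=(3, n_total))

def motor_cortex_step(x, nu):
    """One time step: a_i = sum_j w_ij x_j + xi_i (equation (1)),
    s_i = sigma(a_i) with sigma(x) = max(x, 0)."""
    xi = rng.uniform(-nu, nu, size=n_total)   # zero-mean exploratory noise
    a = W @ x + xi
    s = np.maximum(a, 0.0)
    return s, a

x = rng.uniform(0.0, 1.0, size=m)             # toy input activities
s, a = motor_cortex_step(x, nu=0.5)
y_arm = Q @ s                                 # modeled monkey arm velocity
```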
We therefore calculated from the desired movement direction the input activities x(t) = crate (W total)† Q† y∗(t), where Q† denotes the pseudo-inverse of Q, W total denotes the matrix of weights wij before learning, and crate scales the input activity such that the activities of the neurons in the simulated motor cortex could directly be interpreted as rates in Hz [23]. This transformation from desired directions to input neuron activities was defined initially and held fixed during each simulation, because learning took place in a single synaptic stage from neurons of the input population to neurons in the motor cortex population in our model, and therefore the coding of desired directions did not change in the input population.\n\nAs described above, a subset of the motor cortex population was chosen to model recorded neurons that were used for cursor control. For each modeled recorded neuron i ∈ {1, . . . , n}, we determined the preferred direction pi ∈ R3 as well as the baseline activity βi and the modulation depth αi by fitting a cosine tuning on the basis of simulated monkey arm movements [1, 23]. In the simulation of a Perturbation session, the dPDs ˜pi of rotated neurons were rotated versions of the measured PDs pi (as in [1], one of the x, y, or z axes was chosen and the PDs were rotated by 90 degrees around this axis), whereas the dPDs of non-rotated neurons were identical to their measured PDs. The dPDs were then used to determine the movement velocity y(t) of the cursor by the population vector algorithm [1, 2, 23]. This decoding strategy is consistent with an interpretation of the neural activity as a code for the velocity of the movement.\n\n3 Adaptation with an online learning rule\n\nAdaptation of the synaptic efficacies wij from input neurons to neurons in motor cortex is necessary if the actual decoding PDs ˜pi do not produce optimal cursor trajectories. 
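The dPD perturbation used in the simulated Perturbation sessions (a 90-degree rotation of a subset of PDs around one common coordinate axis) can be sketched as follows; the function names and index convention are ours:

```python
import numpy as np

def rotation_matrix_90(axis):
    """90-degree rotation matrix about the x, y, or z coordinate axis."""
    i = {"x": 0, "y": 1, "z": 2}[axis]
    R = np.zeros((3, 3))
    R[i, i] = 1.0                 # the rotation axis is left unchanged
    j, k = (i + 1) % 3, (i + 2) % 3
    R[j, k] = -1.0                # cos(90) = 0, sin(90) = 1
    R[k, j] = 1.0
    return R

def perturb_dpds(pds, rotated_idx, axis="z"):
    """Return decoding PDs: rotated for the selected subset of units,
    identical to the measured PDs for all other units."""
    R = rotation_matrix_90(axis)
    dpds = pds.copy()
    dpds[rotated_idx] = dpds[rotated_idx] @ R.T
    return dpds
```

For example, selecting 25% or 50% of the unit indices at random as `rotated_idx` reproduces the two perturbation conditions of the experiment.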
Assume that suboptimal dPDs ˜p1, . . . , ˜pn are used for decoding. Then for some inputs x(t), the movement of the cursor is not in the desired direction y∗(t). The weights wij should therefore be adapted such that at every time step t the direction of movement y(t) is close to the desired direction y∗(t). We can quantify the angular match Rang(t) at time t by the cosine of the angle between the movement direction y(t) and the desired direction y∗(t):\n\nRang(t) = y(t)T y∗(t) / (||y(t)|| · ||y∗(t)||).\n\nThis measure has a value of 1 if the cursor moves exactly in the desired direction, it is 0 if the cursor moves perpendicular to the desired direction, and it is −1 if the cursor movement is in the opposite direction.\n\nWe assume in our model that all synapses receive information about a global reward R(t). The general idea that a neuromodulatory signal gates local synaptic plasticity was studied in [4]. In that study, the idea was implemented by learning rules where the weight changes are proportional to the covariance between the reward signal R and some measure of neuronal activity N at the synapse. Here, N could correspond to the presynaptic activity, the postsynaptic activity, or the product of both. The authors showed that such learning rules can explain a phenomenon called Herrnstein's matching law. Interestingly, for the analysis in [4] the specific implementation of this correlation-based adaptation mechanism is not important. We investigate in this article a learning rule of this type:\n\n∆wij(t) = η xj(t) [ai(t) − ¯ai(t)] [R(t) − ¯R(t)],    (2)\n\nwhere ¯ai(t) and ¯R(t) denote the low-pass filtered versions of ai(t) and R(t) with an exponential kernel4. We refer to this rule as the exploratory Hebb rule (EH rule) in this article. 
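A minimal sketch of the EH rule and of the angular-match reward, assuming exponential low-pass filters with coefficients 0.8/0.2 as in footnote 4; the class and function names, the learning rate, and the toy dimensions are ours:

```python
import numpy as np

class EHRule:
    """Exploratory Hebb rule (2):
    dw_ij = eta * x_j * (a_i - abar_i) * (R - Rbar),
    with abar(t) = 0.8*abar(t-1) + 0.2*a(t) and likewise for Rbar."""

    def __init__(self, n, eta=1e-5):
        self.eta = eta
        self.a_bar = np.zeros(n)   # running mean of activations
        self.R_bar = 0.0           # running mean of the reward

    def step(self, W, x, a, R):
        self.a_bar = 0.8 * self.a_bar + 0.2 * a
        self.R_bar = 0.8 * self.R_bar + 0.2 * R
        # one Hebbian term per synapse, gated by the reward deviation
        W += self.eta * (R - self.R_bar) * np.outer(a - self.a_bar, x)
        return W

def angular_match(y, y_star):
    """Reward R_ang(t): cosine of the angle between the actual and the
    desired movement direction."""
    y, y_star = np.asarray(y, float), np.asarray(y_star, float)
    return float(y @ y_star / (np.linalg.norm(y) * np.linalg.norm(y_star)))
```

In a simulation loop, `angular_match` would be evaluated at every time step and fed to `step` as the global scalar reward, exactly as described for Rang(t) below.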
The important feature of this learning rule is that apart from variables which are locally available for each neuron (xj(t), ai(t), ¯ai(t)), only a single scalar signal, R(t), is needed to evaluate performance.5 The reward signal R(t) is provided by some neural circuit which evaluates the performance of the system. In our simulations, we simply used the angular match Rang(t) as this reward signal. Weight updates of the rule are based on correlations between deviations of the reward signal R(t) and the activation ai(t) from their means. The rule adjusts weights such that behavior leading to above-average reward is reinforced. The EH rule (2) approximates gradient ascent on the reward signal by exploring alternatives to the actual behavior with the help of some exploratory signal ξ(t). The deviation of the activation from its recent mean, ai(t) − ¯ai(t), is an estimate of the exploratory term ξi(t) at time t if the mean ¯ai(t) is based on neuron activations Σj wij xj(t′) which are similar to the activation Σj wij xj(t) at time t. Here we make use of the facts that (i) weights are changing very slowly and (ii) the task is continuous (inputs x at successive time points are similar). Then, (2) can be seen as an approximation of\n\n∆wij(t) = η xj(t) ξi(t) [R(t) − ¯R(t)].    (3)\n\nThis rule is a typical node-perturbation learning rule [6, 7, 22, 10] which can be shown to approximate gradient ascent, see e.g. [10]. A simple derivation that shows the link between the EH rule (2) and gradient ascent is given in [23].\n\nThe EH learning rule differs from other node-perturbation rules in an important aspect. In many node-perturbation learning rules, the noise needs to be accessible to the learning mechanism separately from the output signal. For example, in [6] and [7] binary neurons were used. The weight updates there depend on the probability that the neuron outputs 1. 
In [10] the noise term is directly incorporated in the learning rule. The EH rule does not directly need the noise signal. Instead, a temporally filtered version of the neuron's activation is used to estimate the noise signal. Obviously, this estimate is only sufficiently accurate if the input to the neuron is temporally stable on small time scales.\n\n4 Comparison with experimentally observed learning effects\n\nIn this section, we explore the EH rule (2) in a cursor control task that was modeled to closely match the experimental setup in [1]. Each simulated session consisted of a sequence of movements from the center to a target position at one of the corners of the imaginary cube, with online weight updates during the movements. In monkey experiments, perturbation of decoding PDs led to retuning of PDs with the above described credit assignment effect [1]. In order to obtain biologically plausible values for the noise distribution in our neuron model, the noise in our model was fitted to data from experiments (see [23]). Analysis of the neuronal responses in the experiments showed that the variance of the response for a given desired direction scaled roughly linearly with the mean firing rate of that neuron for this direction. We obtained this behavior with our neuron model with noise that is a mixture of an activation-independent noise source and a noise source where the variance scales linearly with the activation of the neuron. In particular, the noise term ξi(t) of neuron i was drawn from the uniform distribution in [−νi(x(t)), νi(x(t))] with an exploration level νi given by\n\nνi(x(t)) = 10 + 2.8 √(σ(Σ_{j=1}^{m} wij xj(t))).\n\nThe constants were chosen to fit neuron behavior in the data. We note that in all simulations with the EH rule, the input activities xj(t) were scaled in such a way that the output of the neuron at time t could be interpreted directly as the firing rate of the neuron at time t. With such scaling, we obtained output values of the neurons without the exploratory signal in the range of 0 to 120Hz with a roughly exponential distribution. Having estimated the variability of neuronal response, the learning rate η remained the last free parameter of the model. To constrain this parameter, η was chosen such that the performance in the 25% perturbation task approximately matched the monkey performance.\n\n4We used ¯ai(t) = 0.8 ¯ai(t − 1) + 0.2 ai(t) and ¯R(t) = 0.8 ¯R(t − 1) + 0.2 R(t).\n5A rule where the activation ai is replaced by the output si obtained very similar results.\n\nFigure 2: One example simulation of the 50% perturbation experiment with the EH rule and data-derived network parameters. A) Angular match Rang as a function of learning time. Every 100th time point is plotted. B) PD shifts drawn on the unit sphere (arbitrary units) for non-rotated (black traces) and rotated (light cyan traces) neurons from their initial values (light) to their values after training (dark; these PDs are connected by the shortest path on the unit sphere). The straight line indicates the rotation axis. C) Same as B, but the view was altered such that the rotation axis is directed towards the reader. The PDs of rotated neurons are consistently rotated in order to compensate for the perturbation.\n\nFigure 3: PD shifts in simulated Perturbation sessions are in good agreement with experimental results (compare to Figure 3A,B in [1]). Shift in the PDs measured after simulated Perturbation sessions relative to initial PDs for all units in 20 simulated experiments where 25% (A) or 50% (B) of the units were rotated. Dots represent individual data points and black circled dots represent the means of rotated (light gray) and non-rotated (dark gray) units.\n\nWe simulated the two types of perturbation experiments reported in [1] in our model network with 40 recorded neurons. In the first set of simulations, a random set of 25% of recorded neurons were rotated neurons in Perturbation sessions. In the second set of simulations, we chose 50% of the recorded neurons to be rotated. In each simulation, 320 targets were presented to the model, which is similar to the number of target presentations in [1]. Results for one example run are shown in Figure 2. The shifts in PDs of recorded neurons induced by training in 20 independent trials were compiled and analyzed separately for rotated neurons and non-rotated neurons. The results are in good agreement with the experimental data, see Figure 3. In the simulated 25% perturbation experiment, the mean shift of the PD for rotated neurons was 8.2 ± 4.8 degrees, whereas for non-rotated neurons, it was 5.5 ± 1.6 degrees. This relatively small effect is similar to the effect observed in [1], where the PD shift of rotated (non-rotated) units was 9.9 (5.2) degrees. The effect is more pronounced in the 50% perturbation experiment (see below). 
We also compared the deviation of the movement trajectory from the ideal straight line in the rotation direction halfway to the target6 in early trials to the deviation in late trials, where we scaled the results to a cube of 11cm side length in order to be able to compare the results directly to the results in [1]. In early trials, the trajectory deviation was 9.2 ± 8.8mm, which was reduced by learning to 2.4 ± 4.9mm. In the simulated 50% perturbation experiment, the mean shift of the PD for rotated neurons was 18.1 ± 4.2 degrees, whereas for non-rotated neurons, it was 12.1 ± 2.6 degrees (in monkey experiments [1] this was 21.7 and 16.1 degrees respectively). The trajectory deviation was 23.1 ± 7.5mm in early trials, and 4.8 ± 5.1mm in late trials. Here, the early deviation was stronger than in the monkey experiment, while the late deviation was smaller.\n\nThe EH rule (2) falls into the general class of correlation-based learning rules described in [4]. In these rules the weight change is proportional to the covariance of the reward signal and some measure of neuronal activity. We performed the same experiment with slightly different correlation-based rules\n\n∆wij(t) = η xj(t) ai(t) [R(t) − ¯R(t)],    (4)\n∆wij(t) = η xj(t) [ai(t) − ¯ai(t)] R(t),    (5)\n\n(compare to (2)). The performance improvements were similar to those obtained with the EH rule. However, no credit assignment effect was observed with these rules. In the simulated 50% perturbation experiment, the mean shift of the PD of rotated neurons (non-rotated neurons) was 12.8 ± 3.6 (12.0 ± 2.4) degrees for rule (4) and 25.5 ± 4 (26.8 ± 2.8) degrees for rule (5).\n\nIn the monkey experiment, training in the Perturbation session also induced a decrease of the modulation depth of rotated neurons. This resulted in a decreased contribution of these neurons to the cursor movement. 
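The three update rules just compared can be written side by side; a schematic sketch in which η, x, a, ¯a, R, and ¯R are as in the text:

```python
import numpy as np

def eh_update(eta, x, a, a_bar, R, R_bar):
    # EH rule (2): both activation and reward are mean-subtracted
    return eta * (R - R_bar) * np.outer(a - a_bar, x)

def rule4_update(eta, x, a, R, R_bar):
    # rule (4): raw activation -- performance improves, no credit assignment
    return eta * (R - R_bar) * np.outer(a, x)

def rule5_update(eta, x, a, a_bar, R):
    # rule (5): raw reward -- performance improves, no credit assignment
    return eta * R * np.outer(a - a_bar, x)
```

Seen this way, the credit assignment effect reported above hinges on subtracting both running means, not on either subtraction alone.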
We observed a qualitatively similar resultin our simulations. In the 25%\nperturbation simulation, modulation depths decreased on average by 2.7\u00b14.3Hz for rotated neurons.\nModulation depths for non-rotated neurons increased on average by 2.2 \u00b1 3.9Hz (average over 20\nindependent simulations). In the 50% perturbation simulation, the changes in modulation depths\nwere \u22123, 6 \u00b1 5.5Hz for rotated neurons and 5.4 \u00b1 6Hz for non-rotated neurons.7 Thus, the relative\ncontribution of rotated neurons on cursor movement decreased.\n\nComparing the results obtained by our simulations to those of monkey experiments (compare Figure\n3 to Figure 3 in [1]), it is interesting that quantitatively similar effects were obtained when noise\nlevel and learning rate was constrained by the experimental data. One should note here that tuning\nchanges due to learning depend on the noise level. For small exploration levels, PDs changed only\nslightly and the difference in PD change between rotated and non-rotated neurons was small, while\nfor large noise levels, PD change differences can be quite drastic. Also the learning rate \u03b7 in\ufb02uences\nthe amount of PD shift differences with higher learning rates leading to stronger credit assignment\neffects, see [23] for details.\n\nThe performance of the system before and after learning is shown in Figure 4. The neurons in the\nnetwork after training are subject to the same amount of noise as the neurons in the network be-\nfore training, but the angular match after training shows much less \ufb02uctuation than before training.\nHence, the network automatically suppresses jitter on the trajectory in the presence of high explo-\nration levels \u03bd. 
We quantified this observation by computing the standard deviation of the angle between the cursor velocity vector and the desired movement direction for 100 randomly drawn noise samples.8 The mean standard deviation for 50 randomly drawn target directions was always decreased by learning. In the mean over the 20 simulations, the mean STD over 50 target directions was 7.9 degrees before learning and 6.3 degrees after learning. Hence, the network not only adapted its response to the input, it also found a way to optimize its sensitivity to the exploratory signal.

6These deviations were computed as described in [1].
7When comparing these results to experimental results, one has to take into account that the modulation depths in monkey experiments were around 10Hz, whereas in the simulations they were around 25Hz.
8This effect is not caused by a larger norm of the weight vectors. The comparison was done with weight vectors after training normalized to their L2 norm before training.

Figure 4: Network performance before and after learning for 50% perturbation. Angular match Rang(t) of the cursor movements in one reaching trial before (gray) and after (black) learning as a function of the time since the target was first made visible. The black curve ends prematurely because the target was reached faster. After learning, temporal jitter of the performance was reduced, indicating reduced sensitivity to noise.

5 Discussion

Jarosiewicz et al. [1] discussed three strategies that could potentially be used by the monkey to compensate for the errors caused by perturbations: re-aiming, re-weighting, and re-mapping. Using the re-aiming strategy, the monkey compensates for perturbations by aiming for a virtual target located in the direction that offsets the visuomotor rotation.
The authors identified a global change in the activity level of all neurons, which indicates a re-aiming strategy of the monkey. Re-weighting would suppress the use of rotated units, leading to a reduction of their modulation depths. A reduction of modulation depths of rotated neurons was also identified in the experiments. A re-mapping strategy would selectively change the directional tunings of rotated units. Rotated neurons shifted their PDs more than the non-rotated population in the experiments. Hence, the authors found elements of all three strategies in their data. These three elements of neuronal adaptation were also identified in our model: a global change in activity of neurons (all neurons changed their tuning properties; re-aiming), a reduction of modulation depths for rotated neurons (re-weighting), and a selective change of the directional tunings of rotated units (re-mapping). This modeling study therefore suggests that all three elements can be explained by a single synaptic adaptation strategy that relies on noisy neuronal activity and visual feedback that is made accessible to all synapses in the network by a global reward signal. It is noteworthy that the credit assignment phenomenon is an emergent feature of the learning rule rather than implemented in some direct way. Intuitively, this behavior can be explained in the following way. The output of non-rotated neurons is consistent with the interpretation of the readout system, so if this output is strongly altered, performance will likely drop. On the other hand, if the output of a rotated neuron is radically different, this will often improve performance. Hence, the relatively high noise levels measured in experiments are probably important for the credit assignment phenomenon. Under such realistic noise conditions, our model produced effects surprisingly similar to those found in the monkey experiments.
Thus, this study shows that reward-modulated learning can explain detailed experimental results about neuronal adaptation in motor cortex, and it therefore suggests that reward-modulated learning is an essential plasticity mechanism in cortex.

The results of this modeling paper also support the hypotheses introduced in [24]. The authors presented data which suggest that neural representations change randomly (background changes) even without obvious learning, while systematic task-correlated representational changes occur within a learning task.

Reward-modulated Hebbian learning rules are currently the most promising candidate for a learning mechanism that can support goal-directed behavior by local synaptic changes in combination with a global performance signal. The EH rule (2) is one particularly simple instance of such rules that exploits temporal continuity of inputs and an exploration signal, i.e., a signal which would show up as "noise" in neuronal recordings. We showed that large exploration levels are beneficial for learning, while they do not interfere with the performance of the system because of pooling effects of readout elements. This study therefore provides a hypothesis about the role of "noise" or ongoing activity in cortical circuits as a source for exploration utilized by local learning rules.

Acknowledgments

This work was supported by the Austrian Science Fund FWF [S9102-N13, to R.L. and W.M.]; the European Union [FP6-015879 (FACETS), FP7-216593 (SECO), FP7-506778 (PASCAL2), FP7-231267 (ORGANIC), to R.L. and W.M.]; and by the National Institutes of Health [R01-NS050256, EB005847, to A.B.S.].

References

[1] B. Jarosiewicz, S. M. Chase, G. W. Fraser, M. Velliste, R. E. Kass, and A. B. Schwartz. Functional network reorganization during learning in a brain-computer interface paradigm. Proc. Nat. Acad. Sci. USA, 105(49):19486–91, 2008.

[2] A. P. Georgopoulos, R. E. Kettner, and A. B.
Schwartz. Primate motor cortex and free arm movements to visual targets in three-dimensional space. II. Coding of the direction of movement by a neuronal population. J. Neurosci., 8:2928–2937, 1988.

[3] A. B. Schwartz. Useful signals from motor cortex. J. Physiology, 579:581–601, 2007.

[4] Y. Loewenstein and H. S. Seung. Operant matching is a generic outcome of synaptic plasticity based on the covariance between reward and neural activity. Proc. Nat. Acad. Sci. USA, 103(41):15224–15229, 2006.

[5] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern., SMC-13(5):834–846, 1983.

[6] P. Mazzoni, R. A. Andersen, and M. I. Jordan. A more biologically plausible learning rule for neural networks. Proc. Nat. Acad. Sci. USA, 88(10):4433–4437, 1991.

[7] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.

[8] J. Baxter and P. L. Bartlett. Direct gradient-based reinforcement learning: I. Gradient estimation algorithms. Technical report, Research School of Information Sciences and Engineering, Australian National University, 1999.

[9] X. Xie and H. S. Seung. Learning in neural networks by reinforcement of irregular spiking. Phys. Rev. E, 69(041909), 2004.

[10] I. R. Fiete and H. S. Seung. Gradient learning in spiking neural networks by dynamic perturbation of conductances. Phys. Rev. Lett., 97(4):048104, 2006.

[11] J.-P. Pfister, T. Toyoizumi, D. Barber, and W. Gerstner. Optimal spike-timing-dependent plasticity for precise action potential firing in supervised learning. Neural Computation, 18(6):1318–1348, 2006.

[12] E. M. Izhikevich.
Solving the distal reward problem through linkage of STDP and dopamine signaling. Cerebral Cortex, 17:2443–2452, 2007.

[13] D. Baras and R. Meir. Reinforcement learning, spike-time-dependent plasticity, and the BCM rule. Neural Computation, 19(8):2245–2279, 2007.

[14] R. V. Florian. Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity. Neural Computation, 19(6):1468–1502, 2007.

[15] M. A. Farries and A. L. Fairhall. Reinforcement learning with modulated spike timing-dependent synaptic plasticity. J. Neurophys., 98:3648–3665, 2007.

[16] R. Legenstein, D. Pecevski, and W. Maass. A learning theory for reward-modulated spike-timing-dependent plasticity with application to biofeedback. PLoS Computational Biology, 4(10):1–27, 2008.

[17] C. H. Bailey, M. Giustetto, Y.-Y. Huang, R. D. Hawkins, and E. R. Kandel. Is heterosynaptic modulation essential for stabilizing Hebbian plasticity and memory? Nat. Rev. Neurosci., 1:11–20, 2000.

[18] Q. Gu. Neuromodulatory transmitter systems in the cortex and their role in cortical plasticity. Neuroscience, 111(4):815–835, 2002.

[19] S. J. Sober, M. J. Wohlgemuth, and M. S. Brainard. Central contributions to acoustic variation in birdsong. J. Neurosci., 28(41):10370–9, 2008.

[20] E. C. Tumer and M. S. Brainard. Performance variability enables adaptive plasticity of ‘crystallized’ adult birdsong. Nature, 450(7173):1240–1244, 2007.

[21] A. P. Georgopoulos, A. B. Schwartz, and R. E. Kettner. Neuronal population coding of movement direction. Science, 233:1416–1419, 1986.

[22] J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. J. Artif. Intell. Res., 15:319–350, 2001.

[23] R. Legenstein, S. M. Chase, A. B. Schwartz, and W. Maass.
A reward-modulated Hebbian learning rule can explain experimentally observed network reorganization in a brain control task. Submitted for publication, 2009.

[24] U. Rokni, A. G. Richardson, E. Bizzi, and H. S. Seung. Motor learning with unstable neural representations. Neuron, 54:653–666, 2007.