{"title": "Reinforcement Learning with Long Short-Term Memory", "book": "Advances in Neural Information Processing Systems", "page_first": 1475, "page_last": 1482, "abstract": "", "full_text": "Reinforcement Learning with Long Short-Term Memory\n\nBram Bakker\n\nDept. of Psychology, Leiden University / IDSIA\nP.O. Box 9555; 2300 RB, Leiden; The Netherlands\n\nbbakker@fsw.leidenuniv.nl\n\nAbstract\n\nThis paper presents reinforcement learning with a Long Short-Term\nMemory recurrent neural network: RL-LSTM. Model-free\nRL-LSTM using Advantage($\\lambda$) learning and directed exploration\ncan solve non-Markovian tasks with long-term dependencies between\nrelevant events. This is demonstrated in a T-maze task, as\nwell as in a difficult variation of the pole balancing task.\n\n1 Introduction\n\nReinforcement learning (RL) is a way of learning how to behave based on delayed\nreward signals [12]. Among the more important challenges for RL are tasks where\npart of the state of the environment is hidden from the agent. Such tasks are called\nnon-Markovian tasks or Partially Observable Markov Decision Processes. Many real\nworld tasks have this problem of hidden state. For instance, in a navigation task\ndifferent positions in the environment may look the same, but one and the same\naction may lead to different next states or rewards. Thus, hidden state makes RL\nmore realistic. However, it also makes it more difficult, because now the agent not\nonly needs to learn the mapping from environmental states to actions, for optimal\nperformance it usually needs to determine which environmental state it is in as well.\n\nLong-term dependencies. Most approaches to solving non-Markovian RL tasks\nhave problems if there are long-term dependencies between relevant events. 
An\nexample of a long-term dependency problem is a maze navigation task where the\nonly way to distinguish between two T-junctions that look identical is to remember\nan observation or action a long time before either T-junction. Such a case presents\nobvious problems for fixed size history window approaches [6], which attempt to\nresolve the hidden state by making the chosen action depend not only on the current\nobservation, but also on a fixed number of the most recent observations and\nactions. If the relevant piece of information to be remembered falls outside the history\nwindow, the agent cannot use it. McCallum's variable history window [8] has,\nin principle, the capacity to represent long-term dependencies. However, the system\nstarts with zero history and increases the depth of the history window step by step.\nThis makes learning long-term dependencies difficult, especially when there are no\nshort-term dependencies to build on.\n\nOther approaches to non-Markovian tasks are based on learning Finite State Automata\n[2], recurrent neural networks (RNNs) [10, 11, 6], or on learning to set\nmemory bits [9]. Unlike history window approaches, they do not have to represent\n(possibly long) entire histories, but can in principle extract and represent just the\nrelevant information for an arbitrary amount of time. However, learning to do that\nhas proven difficult. The difficulty lies in discovering the correlation between a\npiece of information and the moment at which this information becomes relevant\nat a later time, given the distracting observations and actions between them. This\ndifficulty can be viewed as an instance of the general problem of learning long-term\ndependencies in timeseries data. This paper uses one particular solution to this\nproblem that has worked well in supervised timeseries learning tasks: Long Short-Term\nMemory (LSTM) [5, 3]. 
In this paper an LSTM recurrent neural network is\nused in conjunction with model-free RL, in the same spirit as the model-free RNN\napproaches of [10, 6]. The next section describes LSTM. Section 3 presents LSTM's\ncombination with reinforcement learning in a system called RL-LSTM. Section 4\ncontains simulation results on non-Markovian RL tasks with long-term dependencies.\nSection 5, finally, presents the general conclusions.\n\n2 LSTM\n\nLSTM is a recently proposed recurrent neural network architecture, originally designed\nfor supervised timeseries learning [5, 3]. It is based on an analysis of the\nproblems that conventional recurrent neural network learning algorithms, e.g. backpropagation\nthrough time (BPTT) and real-time recurrent learning (RTRL), have\nwhen learning timeseries with long-term dependencies. These problems boil down\nto the problem that errors propagated back in time tend to either vanish or blow\nup (see [5]).\n\nMemory cells. LSTM's solution to this problem is to enforce constant error flow\nin a number of specialized units, called Constant Error Carrousels (CECs). This\nactually corresponds to these CECs having linear activation functions which do\nnot decay over time. In order to prevent the CECs from filling up with useless\ninformation from the timeseries, access to them is regulated using other specialized,\nmultiplicative units, called input gates. Like the CECs, the input gates receive input\nfrom the timeseries and the other units in the network, and they learn to open and\nclose access to the CECs at appropriate moments. Access from the activations of\nthe CECs to the output units (and possibly other units) of the network is regulated\nusing multiplicative output gates. Similar to the input gates, the output gates learn\nwhen the time is right to send the information stored in the CECs to the output\nside of the network. 
A recent addition is forget gates [3], which learn to reset\nthe activation of the CECs when the information stored in the CECs is no longer\nuseful. The combination of a CEC with its associated input, output, and forget\ngate is called a memory cell. See figure 1b for a schematic of a memory cell. It is\nalso possible for multiple CECs to be combined with only one input, output, and\nforget gate, in a so-called memory block.\n\nFigure 1: a. The general LSTM architecture used in this paper. Arrows indicate\nunidirectional, fully connected weights. The network's output units directly code\nfor the Advantages of different actions. b. One memory cell.\n\nActivation updates. More formally, the network's activations at each timestep\n$t$ are computed as follows. A standard hidden unit's activation $y^h$, output unit\nactivation $y^k$, input gate activation $y^{in}$, output gate activation $y^{out}$, and forget\ngate activation $y^{\\varphi}$ are computed in the following standard way:\n\n$$y^i(t) = f_i\\Big(\\sum_m w_{im}\\, y^m(t-1)\\Big) \\qquad (1)$$\n\nwhere $w_{im}$ is the weight of the connection from unit $m$ to unit $i$. In this paper, $f_i$\nis the standard logistic sigmoid function for all units except output units, for which\nit is the identity function. The CEC activation $s_{c_j^v}$, or the \"state\" of memory cell $v$\nin memory block $j$, is computed as follows:\n\n$$s_{c_j^v}(t) = y^{\\varphi_j}(t)\\, s_{c_j^v}(t-1) + y^{in_j}(t)\\, g\\Big(\\sum_m w_{c_j^v m}\\, y^m(t-1)\\Big) \\qquad (2)$$\n\nwhere $g$ is a logistic sigmoid function scaled to the range $[-2, 2]$, and $s_{c_j^v}(0) = 0$.\nThe memory cell's output $y^{c_j^v}$ is calculated by\n\n$$y^{c_j^v}(t) = y^{out_j}(t)\\, h\\big(s_{c_j^v}(t)\\big) \\qquad (3)$$\n\nwhere $h$ is a logistic sigmoid function scaled to the range $[-1, 1]$.\n\nLearning. At some or all timesteps of the timeseries, the output units of the\nnetwork may make prediction errors. Errors are propagated just one step back in\ntime through all units other than the CECs, including the gates. However, errors\nare backpropagated through the CECs for an indefinite amount of time, using an\nefficient variation of RTRL [5, 3]. 
Weight updates are done at every timestep, which\nfits in nicely with the philosophy of online RL. The learning algorithm is adapted\nslightly for RL, as explained in the next section.\n\n3 RL-LSTM\n\nRNNs, such as LSTM, can be applied to RL tasks in various ways. One way is\nto let the RNN learn a model of the environment, which learns to predict observations\nand rewards, and in this way learns to infer the environmental state at\neach point [6, 11]. LSTM's architecture would allow the predictions to depend on\ninformation from long ago. The model-based system could then learn the mapping\nfrom (inferred) environmental states to actions as in the Markovian case, using\nstandard techniques such as Q-learning [6, 2], or by backpropagating through the\nfrozen model to the controller [11]. An alternative, model-free approach, and the\none used here, is to use the RNN to directly approximate the value function of a\nreinforcement learning algorithm [10, 6]. The state of the environment is approximated\nby the current observation, which is the input to the network, together with\nthe recurrent activations in the network, which represent the agent's history. One\npossible advantage of such a model-free approach over a model-based approach is\nthat the system may learn to only resolve hidden state insofar as that is useful for\nobtaining higher rewards, rather than waste time and resources in trying to predict\nfeatures of the environment that are irrelevant for obtaining rewards [6, 8].\n\nAdvantage learning. In this paper, the RL-LSTM network approximates the\nvalue function of Advantage learning [4], which was designed as an improvement on\nQ-learning for continuous-time RL. In continuous-time RL, values of adjacent states,\nand therefore optimal Q-values of different actions in a given state, typically differ by\nonly small amounts, which can easily get lost in noise. 
Advantage learning remedies\nthis problem by artificially decreasing the values of suboptimal actions in each state.\nHere Advantage learning is used for both continuous-time and discrete-time RL.\nNote that the same problem of small differences between values of adjacent states\napplies to any RL problem with long paths to rewards. And to demonstrate RL-LSTM's\npotential to bridge long time lags, we need to consider such RL problems.\nIn general, Advantage learning may be more suitable for non-Markovian tasks than\nQ-learning, because it seems less sensitive to getting the value estimations exactly\nright.\n\nThe LSTM network's output units directly code for the Advantage values of different\nactions. Figure 1a shows the general network architecture used in this paper. As\nin Q-learning with a function approximator, the temporal difference error $E^{TD}(t)$,\nderived from the equivalent of the Bellman equation for Advantage learning [4], is\ntaken as the function approximator's prediction error at timestep $t$:\n\n$$E^{TD}(t) = V(s(t)) + \\frac{r(t) + \\gamma V(s(t+1)) - V(s(t))}{\\kappa} - A(s(t), a(t)) \\qquad (4)$$\n\nwhere $A(s, a)$ is the Advantage value of action $a$ in state $s$, $r$ is the immediate\nreward, and $V(s) = \\max_a A(s, a)$ is the value of the state $s$. $\\gamma$ is a discount factor\nin the range [0,1], and $\\kappa$ is a constant scaling the difference between values of\noptimal and suboptimal actions. Output units associated with other actions than\nthe executed one do not receive error signals.\n\nEligibility traces. In this work, Advantage learning is extended with eligibility\ntraces, which have often been found to improve learning in RL, especially in\nnon-Markovian domains [7]. This yields Advantage($\\lambda$) learning, and the necessary\ncomputations turn out virtually the same as in Q($\\lambda$)-learning [1]. It requires the\nstorage of one eligibility trace $e_{im}$ per weight $w_{im}$. 
A weight update corresponds to\n\n$$w_{im}(t+1) = w_{im}(t) + \\alpha E^{TD}(t)\\, e_{im}(t), \\quad \\text{where} \\quad e_{im}(t) = \\gamma \\lambda\\, e_{im}(t-1) + \\frac{\\partial y^K(t)}{\\partial w_{im}} \\qquad (5)$$\n\n$K$ indicates the output unit associated with the executed action, $\\alpha$ is a learning rate\nparameter, and $\\lambda$ is a parameter determining how fast the eligibility trace decays.\n$e_{im}(0) = 0$, and $e_{im}(t-1)$ is set to 0 if an exploratory action is taken.\n\nExploration. Non-Markovian RL requires extra attention to the issue of exploration\n[2, 8]. Undirected exploration attempts to try out actions in the same way in\neach environmental state. However, in non-Markovian tasks, the agent initially does\nnot know which environmental state it is in. Part of the exploration must be aimed\nat discovering the environmental state structure. Furthermore, in many cases, the\nnon-Markovian environment will provide unambiguous observations indicating the\nstate in some parts, while providing ambiguous observations (hidden state) in other\nparts. In general, we want more exploration in the ambiguous parts.\n\nThis paper employs a directed exploration technique based on these ideas. A separate\nmultilayer feedforward neural network, with the same input as the LSTM\nnetwork (representing the current observation) and one output unit $y^v$, is trained\nconcurrently with the LSTM network. It is trained, using standard backpropagation,\nto predict the absolute value of the current temporal difference error $E^{TD}(t)$\nas defined by eq. 4, plus its own discounted prediction at the next timestep:\n\n$$y_d^v(t) = |E^{TD}(t)| + \\beta y^v(t+1) \\qquad (6)$$\n\nwhere $y_d^v(t)$ is the desired value for output $y^v(t)$, and $\\beta$ is a discount parameter\nin the range [0,1]. This amounts to attempting to identify which observations are\n\nFigure 2: Long-term dependency T-maze with length of corridor N = 10. 
At the\nstarting position S the agent's observation indicates where the goal position G is in\nthis episode.\n\n\"problematic\", in the sense that they are associated with large errors in the current\nvalue estimation (the first term), or precede situations with large such errors (the\nsecond term). $y^v(t)$ is linearly scaled and used as the temperature of a Boltzmann\naction selection rule [12]. The net result is much exploration when, for the current\nobservation, differences between estimated Advantage values are small (the standard\neffect of Boltzmann exploration), or when there is much \"uncertainty\" about current\nAdvantage values or Advantage values in the near future (the effect of this directed\nexploration scheme). This exploration technique has obvious similarities with the\nstatistically more rigorous technique of Interval Estimation (see [12]), as well as\nwith certain model-based approaches where exploration is greater when there is\nmore uncertainty in the predictions of a model [11].\n\n4 Test problems\n\nLong-term dependency T-maze. The first test problem is a non-Markovian\ngrid-based T-maze (see figure 2). It was designed to test RL-LSTM's capability to\nbridge long time lags, without confounding the results by making the control task\ndifficult in other ways. The agent has four possible actions: move North, East,\nSouth, or West. The agent must learn to move from the starting position at the\nbeginning of the corridor to the T-junction. There it must move either North or\nSouth to a changing goal position, which it cannot see. However, the location of\nthe goal depends on a \"road sign\" the agent has seen at the starting position. If\nthe agent takes the correct action at the T-junction, it receives a reward of 4. If it\ntakes the wrong action, it receives a reward of -.1. In both cases, the episode ends\nand a new episode starts, with the new goal position set randomly either North or\nSouth. 
During the episode, the agent receives a reward of -.1 when it stands still.\nAt the starting position, the observation is either 011 or 110, in the corridor the\nobservation is 101, and at the T-junction the observation is 010. The length of the\ncorridor N was systematically varied from 5 to 70. In each condition, 10 runs were\nperformed.\n\nIf the agent takes only optimal actions to the T-junction, it must remember the\nobservation from the starting position for N timesteps to determine the optimal\naction at the T-junction. Note that the agent is not aided by experiences in which\nthere are shorter time lag dependencies. In fact, the opposite is true. Initially, it\ntakes many more actions until even the T-junction is reached, and the experienced\nhistory is very variable from episode to episode. The agent must first learn to\nreliably move to the T-junction. Once this is accomplished, the agent will begin to\nexperience more or less consistent and shortest possible histories of observations and\nactions, from which it can learn to extract the relevant piece of information. The\ndirected exploration mechanism is crucial in this regard: it learns to set exploration\nlow in the corridor and high at the T-junction.\n\nThe LSTM network had 3 input units, 12 standard hidden units, 3 memory cells, and\n$\\alpha$ = .0002. The following parameter values were used in all conditions: $\\gamma$ = .98, $\\lambda$ =\n.8, $\\kappa$ = .1. An empirical comparison was made with two alternative systems that\nhave been used in non-Markovian tasks. The long-term dependency nature of the\n\n[Figure 3 plots. Legend: LSTM, Elman-BPTT, Memory bits. Horizontal axis: N, length of corridor.]\n\nFigure 3: Results in noise-free T-maze task. 
Left: Number of successful runs (out of\n10) as a function of N, length of the corridor. Right: Average number of timesteps\nuntil success as a function of N.\n\n[Figure 4 plots. Legend: LSTM, Elman-BPTT, Memory bits. Horizontal axis: N, length of corridor.]\n\nFigure 4: Results in noisy T-maze task. Left: Number of successful runs (out of\n10) as a function of N, length of the corridor. Right: Average number of timesteps\nuntil success as a function of N.\n\ntask virtually rules out history window approaches. Instead, two alternative systems\nwere used that, like LSTM, are capable in principle of representing information for\narbitrarily long time lags. In the first alternative, the LSTM network was replaced by\nan Elman-style Simple Recurrent Network, trained using BPTT [6]. Note that the\nunfolding of the RNN necessary for BPTT means that this is no longer truly online\nRL. The Elman network had 16 hidden units and 16 context units, and $\\alpha$ = .001.\nThe second alternative is a table-based system extended with memory bits that are\npart of the observation, and that the controller can switch on and off [9]. Because\nthe task requires the agent to remember just one bit of information, this system\nhad one memory bit, and $\\alpha$ = .01. In order to determine the specific contribution of\nLSTM to performance, in both alternatives all elements of the overall system except\nLSTM (i.e. Advantage($\\lambda$) learning, directed exploration) were left unchanged.\n\nA run was considered a success if the agent learned to take the correct action at the\nT-junction in over 80% of cases, using its stochastic action selection mechanism. 
In\npractice, this corresponds to 100% correct action choices at the T-junction using\ngreedy action selection, as well as optimal or near-optimal action choices leading\nto the T-junction. Figure 3 shows the number of successful runs (out of 10) as a\nfunction of the length of the corridor N, for each of the three methods. It also\nshows the average number of timesteps needed to reach success. It is apparent that\nRL-LSTM is able to deal with much longer time lags than the two alternatives. RL-LSTM\nhas perfect performance up to N = 50, after which performance gradually\ndecreases. In those cases where the alternatives also reach success, RL-LSTM also\nlearns faster. The reason why the memory bits system performs worst is probably\nthat, in contrast with the other two, it does not explicitly compute the gradient\nof performance with respect to past events. This should make credit assignment\nless directed and therefore less effective. The Elman-BPTT system does compute\nsuch a gradient, but in contrast to LSTM, the gradient information tends to vanish\nquickly with longer time lags (as explained in section 2).\n\nT-maze with noise. It is one thing to learn long-term dependencies in a noise-free\ntask; it is quite another thing to do so in the presence of severe noise. To investigate\nthis, a very noisy variation of the T-maze task described above was designed. Now\nthe observation in the corridor is a0b, where a and b are independent, uniformly\ndistributed random values in the range [0, 1], generated online. All other aspects of\nthe task remain the same as above. Both the LSTM and the Elman-BPTT system\nwere also left unchanged. To allow for a fair comparison, the table-based memory\nbit system's observation was computed using Michie and Chambers's BOXES state\naggregation mechanism (see [12]), partitioning each input dimension into three equal\nregions.\n\nFigure 4 shows the results. 
The memory bit system suffers most from the noise.\nThis is not very surprising because a table-based system, even if augmented with\nBOXES state aggregation, does not give very sophisticated generalization. The two\nRNN approaches are hardly affected by the severe noise in the observations. Most\nimportantly, RL-LSTM again significantly outperforms the others, both in terms of\nthe maximum time lag it can deal with, and in terms of the number of timesteps\nneeded to learn the task.\n\nMulti-mode pole balancing. The third test problem is less artificial than the\nT-mazes and has more complicated dynamics. It consists of a difficult variation of\nthe classical pole balancing task. In the pole balancing task, an agent must balance\nan inherently unstable pole, hinged to the top of a wheeled cart that travels along a\ntrack, by applying left and right forces to the cart. Even in the Markovian version,\nthe task requires fairly precise control to solve it.\n\nThe version used in this experiment is made more difficult by two sources of hidden\nstate. First, as in [6], the agent cannot observe the state information corresponding\nto the cart velocity and pole angular velocity. It has to learn to approximate this\n(continuous) information using its recurrent connections in order to solve the task.\nSecond, the agent must learn to operate in two different modes. In mode A, action\n1 is left push and action 2 is right push. In mode B, this is reversed: action 1 is right\npush and action 2 is left push. Modes are randomly set at the beginning of each\nepisode. The information which mode the agent is operating in is provided to the\nagent only for the first second of the episode. After that, the corresponding input\nunit is set to zero and the agent must remember which mode it is in. Obviously,\nfailing to remember the mode leads to very poor performance. 
The only reward\nsignal is -1 if the pole falls past \u00b112\u00b0 or if the cart hits either end of the track.\nNote that the agent must learn to remember the (discrete) mode information for an\ninfinite amount of time if it is to learn to balance the pole indefinitely. This rules\nout history window approaches altogether. However, in contrast with the T-mazes,\nthe system now has the benefit of starting with relatively short time lags.\n\nThe LSTM network had 2 output units, 14 standard hidden units, and 6 memory\ncells. It had 3 input units: one each for cart position and pole angle, and one for the\nmode of operation, set to zero after one second of simulated time (50 timesteps).\n$\\gamma$ = .95, $\\lambda$ = .6, $\\kappa$ = .2, $\\alpha$ = .002. In this problem, directed exploration was\nnot necessary, because in contrast to the T-mazes, imperfect policies lead to many\ndifferent experiences with reward signals, and there is hidden state everywhere in\nthe environment. For a continuous problem like this, a table-based memory bit\nsystem is not suited very well, so a comparison was only made with the Elman-BPTT\nsystem, which had 16 hidden and context units and $\\alpha$ = .002.\n\nThe Elman-BPTT system never reached satisfactory solutions in 10 runs. It only\nlearned to balance the pole for the first 50 timesteps, when the mode information\nis available, thus failing to learn the long-term dependency. However, RL-LSTM\nlearned optimal performance in 2 out of 10 runs (after an average of 6,250,000\ntimesteps of learning). After learning, these two agents were able to balance the pole\nindefinitely in both modes of operation. 
In the other 8 runs, the agents still learned\nto balance the pole in both modes for hundreds or even thousands of timesteps\n(after an average of 8,095,000 timesteps of learning), thus showing that the mode\ninformation was remembered for long time lags. In most cases, such an agent\nlearns optimal performance for one mode, while achieving good but suboptimal\nperformance in the other.\n\n5 Conclusions\n\nThe results presented in this paper suggest that reinforcement learning with Long\nShort-Term Memory (RL-LSTM) is a promising approach to solving non-Markovian\nRL tasks with long-term dependencies. This was demonstrated in a T-maze task\nwith minimal time lag dependencies of up to 70 timesteps, as well as in a non-Markovian\nversion of pole balancing where optimal performance requires remembering\ninformation indefinitely. RL-LSTM's main power is derived from LSTM's\nproperty of constant error flow, but for good performance in RL tasks, the combination\nwith Advantage($\\lambda$) learning and directed exploration was crucial.\n\nAcknowledgments\n\nThe author wishes to thank Edwin de Jong, Michiel de Jong, Gwendid van der Voort\nvan der Kleij, Patrick Hudson, Felix Gers, and J\u00fcrgen Schmidhuber for valuable\ncomments.\n\nReferences\n\n[1] B. Bakker. Reinforcement learning with LSTM in non-Markovian tasks with long-term\ndependencies. Technical report, Dept. of Psychology, Leiden University, 2001.\n[2] L. Chrisman. Reinforcement learning with perceptual aliasing: The perceptual distinctions\napproach. In Proc. of the 10th National Conf. on AI. AAAI Press, 1992.\n[3] F. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction\nwith LSTM. Neural Computation, 12(10):2451-2471, 2000.\n[4] M. E. Harmon and L. C. Baird. Multi-player residual advantage learning with general\nfunction approximation. Technical report, Wright-Patterson Air Force Base, 1996.\n[5] S. Hochreiter and J. 
Schmidhuber. Long short-term memory. Neural Computation,\n9(8):1735-1780, 1997.\n[6] L.-J. Lin and T. Mitchell. Reinforcement learning with hidden states. In Proc. of the\n2nd Int. Conf. on Simulation of Adaptive Behavior. MIT Press, 1993.\n[7] J. Loch and S. Singh. Using eligibility traces to find the best memoryless policy in\nPartially Observable Markov Decision Processes. In Proc. of ICML'98, 1998.\n[8] R. A. McCallum. Learning to use selective attention and short-term memory in\nsequential tasks. In Proc. 4th Int. Conf. on Simulation of Adaptive Behavior, 1996.\n[9] L. Peshkin, N. Meuleau, and L. P. Kaelbling. Learning policies with external memory.\nIn Proc. of the 16th Int. Conf. on Machine Learning, 1999.\n[10] J. Schmidhuber. Networks adjusting networks. In Proc. of Distributed Adaptive Neural\nInformation Processing, St. Augustin, 1990.\n[11] J. Schmidhuber. Curious model-building control systems. In Proc. of IJCNN'91,\nvolume 2, pages 1458-1463, Singapore, 1991.\n[12] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press,\nCambridge, MA, 1998.\n", "award": [], "sourceid": 1953, "authors": [{"given_name": "Bram", "family_name": "Bakker", "institution": null}]}