{"title": "Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System", "book": "Advances in Neural Information Processing Systems", "page_first": 926, "page_last": 936, "abstract": "", "full_text": "Learning a world model and planning with a\n\nself-organizing, dynamic neural system\n\nInstitut f\u00a8ur Neuroinformatik\n\nRuhr-Universit\u00a8at Bochum, ND 04\n\nMarc Toussaint\n\n44780 Bochum\u2014Germany\n\nmt@neuroinformatik.rub.de\n\nAbstract\n\nWe present a connectionist architecture that can learn a model of the\nrelations between perceptions and actions and use this model for be-\nhavior planning. State representations are learned with a growing self-\norganizing layer which is directly coupled to a perception and a motor\nlayer. Knowledge about possible state transitions is encoded in the lat-\neral connectivity. Motor signals modulate this lateral connectivity and\na dynamic \ufb01eld on the layer organizes a planning process. All mecha-\nnisms are local and adaptation is based on Hebbian ideas. The model is\ncontinuous in the action, perception, and time domain.\n\n1 Introduction\n\nPlanning of behavior requires some knowledge about the consequences of actions in a\ngiven environment. A world model captures such knowledge. There is clear evidence that\nnervous systems use such internal models to perform predictive motor control, imagery,\ninference, and planning in a way that involves a simulation of actions and their perceptual\nimplications [1, 2]. However, the level of abstraction, the representation, on which such\nsimulation occurs is hardly the level of physical coordinates. A tempting hypothesis is\nthat the representations the brain uses for reasoning and planning are particularly designed\n(by adaptation or evolution) for just this purpose. 
To address such ideas we first need a basic model for how a connectionist architecture can encode a world model and how self-organization of inherent representations is possible.\n\nIn the field of machine learning, world models are a standard approach to handling behavior organization problems (for a comparison of model-based approaches to classical, model-free Reinforcement Learning see, e.g., [3]). The basic idea of using neural networks to model the environment was given in [4, 5]. Our approach for a connectionist world model (CWM) is functionally similar to existing machine learning approaches with self-organizing state space models [6, 7]. It is able to grow neural representations for different world states and to learn the implications of actions in terms of state transitions. It differs though from classical approaches in some crucial points:\n\n• The model is continuous in the action, the perception, as well as the time domain.\n\n[Figure 1: Schema of the CWM architecture: a central layer with units xi and lateral weights wji is coupled to a perceptive layer s via kernels ks(sj, s) and to a motor layer a via kernels ka(aji, a).]\n\n• All mechanisms are based on local interactions. The adaptation mechanisms are largely derived from the idea of Hebbian plasticity. 
E.g., the lateral connectivity, which encodes knowledge about possible state transitions, is adapted by a variant of the temporal Hebb rule and allows local adaptation of the world model to local world changes.\n\n• The coupling to the motor system is fully integrated in the architecture via a mechanism incorporating modulating synapses (comparable to shunting mechanisms).\n\n• The two dynamic processes on the CWM, the “tracking” process estimating the current state and the planning process (similar to Dynamic Programming), will be realized by activation dynamics on the architecture, incorporating in particular lateral interactions, inspired by neural fields [8].\n\nThe outline of the paper is as follows: In the next section we describe our architecture, the dynamics of activation and the couplings to perception and motor layers. In section 3 we introduce a dynamic process that generates, as an attractor, a value field over the layer which is comparable to a state value function estimating the expected future return and allows for goal-oriented behavior organization. The self-organization process and adaptation mechanisms are described in section 4. We demonstrate the features of the model on a maze problem in section 5 and finally discuss the results and the model in general terms.\n\n2 The model\n\nThe core of the connectionist world model (CWM) is a neural layer which is coupled to a perceptual layer and a motor layer, see figure 1. Let us enumerate the units of the central layer by i = 1, .., N. Lateral connections within the layer may exist and we denote an existing connection from the i-th to the j-th unit by (ji). E.g., “Σ(ji)” means “summing over all existing connections (ji)”. 
To every unit we associate an activation xj ∈ R which is governed by the dynamics\n\nτx ẋj = −xj + ks(sj, s) + η Σ(ji) ka(aji, a) wji xi ,   (1)\n\nwhich we will explain in detail in the following. First of all, xi are the time-dependent activations and the dot-notation τx ẋ = F(x) means a time derivative which we algorithmically implement by an Euler integration step x(t) = x(t−1) + (1/τx) F(x(t−1)).\n\nThe first term in (1) induces an exponential relaxation while the second and third terms are the inputs. ks(sj, s) is the forward excitation that unit j receives from the perceptive layer. Here, sj is the codebook vector (receptive field) of unit j onto the perception layer which is compared to the current stimulus s via the kernel function ks. We will choose Gaussian kernels, as is the case, e.g., for typical Radial Basis Function networks.\n\nThe third term, Σ(ji) ka(aji, a) wji xi, describes the lateral interaction on the central layer. Namely, unit j receives lateral input from unit i iff there exists a connection (ji) from i to j. This lateral input is weighted by the connection’s synaptic strength wji. Additionally there is another term entering multiplicatively into this lateral interaction: Lateral inputs are modulated depending on the current motor activation. We chose a modulation of the following kind: To every existing connection (ji) we associate a codebook vector aji onto the motor layer which is compared to the current motor activity a via a Gaussian kernel function ka. Due to the multiplicative coupling, a connection contributes to lateral inputs only when the current motor activity “matches” the codebook vector of this connection. The modulation of information transmission by multiplicative or divisive interactions is a fundamental principle in biological neural systems [9]. 
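The Euler-integrated dynamics (1) can be condensed into a few lines of code. The following is a minimal sketch, not the paper's implementation: the toy two-unit network, the codebook vectors, and the parameter values (`tau_x`, `eta`, the kernel widths) are illustrative assumptions for the example.

```python
import numpy as np

def kernel(c, x, two_sigma2):
    # Gaussian kernel comparing a codebook vector c with the current input x,
    # normalized to equal 1 for a perfect match (c == x)
    return float(np.exp(-np.sum((np.asarray(c) - np.asarray(x)) ** 2) / two_sigma2))

def euler_step(x, s, a, S, conns, tau_x=2.0, eta=0.1, two_sig2_s=0.01, two_sig2_a=0.5):
    # One Euler step of eq. (1):
    #   tau_x * dx_j/dt = -x_j + k_s(s_j, s) + eta * sum_(ji) k_a(a_ji, a) w_ji x_i
    # S[j] is the perceptual codebook vector of unit j;
    # conns maps (j, i) -> (a_ji, w_ji) for each existing lateral connection.
    F = -x + np.array([kernel(S[j], s, two_sig2_s) for j in range(len(x))])
    for (j, i), (a_ji, w_ji) in conns.items():
        F[j] += eta * kernel(a_ji, a, two_sig2_a) * w_ji * x[i]
    return x + F / tau_x

# toy network: two units on a 1-D perception space, one connection 0 -> 1
S = [np.array([0.0]), np.array([1.0])]
conns = {(1, 0): (np.array([1.0, 0.0]), 1.0)}   # codebook a_ji and weight w_ji
x = np.zeros(2)
for _ in range(50):
    # stimulus at unit 0's receptive field, motor activity matching connection (1, 0)
    x = euler_step(x, s=np.array([0.0]), a=np.array([1.0, 0.0]), S=S, conns=conns)
# x[0] relaxes to its kernel response; x[1] receives the lateral, motor-gated input
```

With a mismatching motor activity, `kernel(a_ji, a, ...)` collapses toward zero and the lateral prediction is gated off, which is the shunting-like modulation described in the text.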
One example is shunting inhibition, where inhibitory synapses attach to regions of the dendritic tree near the soma and thereby modulate the transmission of the dendritic input [10]. In our architecture, a shunting synapse, receiving input from the motor layer, might attach to only one branch of a (lateral) dendritic tree and thereby multiplicatively modulate the lateral inputs summed up at this subtree.\n\nFor the following it is helpful if we briefly discuss a certain relation between equation (1) and a classical probabilistic approach. Let us assume normalized kernel functions\n\nks(sj, s) = 1/(√(2π) σs) exp(−(sj − s)² / 2σs²) ,   ka(aji, a) = 1/(√(2π) σa) exp(−(aji − a)² / 2σa²) .\n\nThese kernel functions can directly be interpreted as probabilities: ks(sj, s) represents the probability P(s|j) that the stimulus is s if j is active, and ka(aji, a) the probability P(a|j, i) that the action is a if a transition i → j occurred. As for typical hidden Markov models we may derive the prior probability distribution P(j|a), given the action:\n\nP(j|a, i) = P(a|j, i) P(j|i) / P(a|i) = ka(aji, a) P(j|i) / P(a|i) ,\n\nP(j|a) = Σi [ka(aji, a) P(j|i) / P(a|i)] P(i) .\n\nP(a|i) can be computed by normalizing P(a|j, i) P(j|i) over j such that Σj P(j|a, i) = 1. What we would like to point out here is that in equation (1), the lateral input Σ(ji) ka(aji, a) wji xi can be compared to the prior P(j|a) under the assumption that xi is proportional to P(i) and if we have an adaptation mechanism for wji which converges to a value proportional to P(j|i) and which also ensures normalization, i.e., Σj ka(aji, a) wji = 1 for all i and a. This insight will help to judge some details of the next two sections. 
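This normalization claim is easy to check numerically. The following sketch uses toy numbers; treating the kernel values `ka[j, i]` directly as P(a|j, i) is an assumption of the example. It verifies that weights w_ji = P(j|i)/P(a|i) satisfy Σj ka(aji, a) wji = 1 for all i, and that the lateral input with xi = P(i) sums to one over j, like a proper prior.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4                                          # toy number of states, one fixed action a
P_trans = rng.dirichlet(np.ones(N), size=N)    # P_trans[i, j] = P(j | i)
ka = rng.uniform(0.1, 1.0, size=(N, N))        # ka[j, i] plays the role of P(a | j, i)
p_prior = rng.dirichlet(np.ones(N))            # P(i)

# P(a | i) = sum_j P(a | j, i) P(j | i)
P_a_given_i = (ka * P_trans.T).sum(axis=0)

# weights proportional to P(j|i), normalized as in the text: w[j, i] = P(j|i) / P(a|i)
w = P_trans.T / P_a_given_i

col_sums = (ka * w).sum(axis=0)    # claimed: sum_j k_a(a_ji, a) w_ji = 1 for all i
lateral = (ka * w) @ p_prior       # lateral input with x_i = P(i)
```

Since `(ka * w)[j, i]` equals the posterior P(j|a, i), the lateral input is exactly the mixture Σi P(j|a, i) P(i), i.e. the prior P(j|a) used in the text.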
The probabilistic interpretation can be further exploited, e.g., comparing the input of a unit j (or, in the quasi-stationary case, xj itself) to the posterior and deriving theoretically grounded adaptation mechanisms. But this is not within the scope of this paper.\n\n3 The dynamics of planning\n\nTo organize goal-oriented behavior we assume that, in parallel to the activation dynamics (1), there exists a second dynamic process which can be motivated from classical approaches to Reinforcement Learning [11, 12]. Recall the Bellman equation\n\nV*(i) = Σa π(a|i) Σj P(j|i, a) [r(j) + γ V*(j)] ,   (2)\n\nyielded by the expectation V*(i) of the discounted future return R(t) = Στ=1..∞ γ^(τ−1) r(t+τ), which obeys R(t) = r(t+1) + γ R(t+1), when situated in state i. Here, γ is the discount factor and we presumed that the received rewards r(t) actually depend only on the state and thus enter equation (2) only in terms of the reward function r(i) (we neglect here that rewards may directly depend on the action). Behavior is described by a stochastic policy π(a|i), the probability of executing action a in state i. Knowing the property (2) of V* it is straightforward to define a recursion algorithm for an approximation V of V* such that V converges to V*. This recursion algorithm is called Value Iteration and reads\n\nτv ΔVπ(i) = −Vπ(i) + Σa π(a|i) Σj P(j|i, a) [r(j) + γ Vπ(j)] ,   (3)\n\nwith a “reciprocal learning rate” or time constant τv. 
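The recursion (3) can be sketched for a small discrete world; this is an illustrative example, not the paper's continuous value field. The maximization folds in the greedy policy (as the paper assumes later), and the toy chain world and its rewards are assumptions of the example.

```python
import numpy as np

def value_iteration(P, r, gamma=0.8, tau_v=2.0, iters=500):
    # Relaxation form of eq. (3) with the greedy policy as a max over actions:
    #   tau_v * dV(i) = -V(i) + max_a sum_j P[a, i, j] * (r[j] + gamma * V[j])
    V = np.zeros(len(r))
    for _ in range(iters):
        Q = np.einsum('aij,j->ai', P, r + gamma * V)   # action values per state
        V += (Q.max(axis=0) - V) / tau_v
    return V

# toy 3-state chain; action 0 moves left, action 1 moves right; reward at state 2
P = np.zeros((2, 3, 3))
P[0, 0, 0] = P[0, 1, 0] = P[0, 2, 1] = 1.0
P[1, 0, 1] = P[1, 1, 2] = P[1, 2, 2] = 1.0
r = np.array([0.0, 0.0, 1.0])
V = value_iteration(P, r)   # relaxes to the fixed point characterized by eq. (2)
```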
Note that (2) is the fixed point equation of (3).\n\nThe practical meaning of the state-value function V is that it quantifies how desirable and promising it is to reach a state i, also accounting for future rewards to be expected. In particular, if one knows the current state i it is a simple and efficient rule of behavior to choose that action a that will lead to the neighbor state j with maximal V(j) (the greedy policy). In that sense, V(i) provides a smooth gradient towards desirable goals. Note though that direct Value Iteration presumes that the state and action spaces are known and finite, and that the current state and the world model P(j|i, a) are known.\n\nHow can we transfer these classical ideas to our model? We suppose that the CWM is given a goal stimulus g from outside, i.e., it is given the command to reach a world state that corresponds to the stimulus g. This stimulus induces a reward excitation ri = ks(si, g) for each unit i. Now, besides the activations xi, we introduce another field over the CWM, the value field vi, which is in analogy to the state-value function V(i). The dynamics is\n\nτv v̇i = −vi + ri + γ max(ji) (wji vj) ,   (4)\n\nand well comparable to (3): One difference is that vi estimates the “current-plus-future” reward r(t) + γ R(t) rather than the future reward only—in the notation above this corresponds to the value iteration τv ΔVπ(i) = −Vπ(i) + r(i) + Σa π(a|i) Σj P(j|i, a) [γ Vπ(j)]. As it is commonly done for Value Iteration, we assumed π to be the greedy policy. More precisely, we considered only that action (i.e., that connection (ji)) that leads to the neighbor state j with maximal value wji vj. In effect, the summations over a as well as over j can be replaced by a maximization over (ji). 
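The value field relaxation (4) can be sketched on a toy graph; the chain topology, unit weights, and goal excitation below are illustrative assumptions, not the learned CWM.

```python
import numpy as np

def relax_value_field(conns, r, N, gamma=0.8, tau_v=2.0, iters=500):
    # Euler relaxation of eq. (4):
    #   tau_v * dv_i/dt = -v_i + r_i + gamma * max_(ji) w_ji v_j
    # conns maps (j, i) -> w_ji, i.e. a connection from unit i to unit j.
    v = np.zeros(N)
    for _ in range(iters):
        best = np.zeros(N)                 # max over outgoing connections of each i
        for (j, i), w_ji in conns.items():
            best[i] = max(best[i], w_ji * v[j])
        v += (-v + r + gamma * best) / tau_v
    return v

# toy chain 0 -> 1 -> 2 with unit weights; goal excitation r_i peaks at unit 2
conns = {(1, 0): 1.0, (2, 1): 1.0}
r = np.array([0.0, 0.0, 1.0])
v = relax_value_field(conns, r, N=3)
# v decays geometrically (factor gamma) with graph distance from the goal unit
```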
Finally we replaced the probability factor P(j|i, a) by wji—we will see in the next section how wji is learned and what it will converge to.\n\nIn practice, the value field will relax quickly to its fixed point v*i = ri + γ max(ji)(wji v*j) and stay there if the goal does not change and if the world model is not re-adapted (see the experiments). The quasi-stationary value field vi together with the current (typically non-stationary) activations xi allows the system to generate a motor signal that guides towards the goal. More precisely, the value field vi determines for every unit i the “best” neighbor unit ki = argmaxj wji vj. The output motor signal is then the activation average\n\na = Σi xi a_{ki,i}   (5)\n\nof the motor codebook vectors a_{ki,i} that have been learned for the corresponding connections. Hence, the information flow between the central layer and the motor system goes both ways: In the “tracking” process as given by equation (1) the information flows from the motor layer to the central layer: Motor signals activate the corresponding connections and cause lateral, predictive excitations. In the action selection process as given by equation (5) the signals flow from the central layer back to the motor layer to induce the motor activity that should turn predictions into reality.\n\nDepending on the specific problem and the representation of motor commands on the motor layer, a post-processing of the motor signal a, e.g. a competition between contradictory motor units, might be necessary. In our experiments we will have two motor units and will always normalize the 2D vector a to unit length.\n\n4 Self-organization and adaptation\n\nThe self-organization process of the central layer combines techniques from standard self-organizing maps [13, 14] and their extensions w.r.t. 
growing representations [15, 16] and the learning of temporal dependencies in lateral connections [17, 18]. The free variables of a CWM subject to adaptation are (1) the number of neurons and the lateral connectivity itself, (2) the codebook vectors si and aji to the perceptive and motor layers, respectively, and (3) the weights wji of the lateral connections. The adaptation mechanisms we propose are based on three general principles: (1) the addition of units for representation of novel states (novelty), (2) the fine tuning of the codebook vectors of units and connections (plasticity), and (3) the adaptation of lateral connections in favor of better prediction performance (prediction).\n\nNovelty. Mechanisms similar to those of Fuzzy ARTMAPs [15] or Growing Neural Gas [16] account for the insertion of new units when novelty is detected. We detect novelty in a straightforward manner, namely when the difference between the actual perception and the best matching unit becomes too large. To make this detection more robust, we use a low-pass filter (leaky integrator). At a given time, let z be the best matching unit, z = argmaxi xi. For this unit we integrate the error measure ez:\n\nτe ėz = −ez + (1 − ks(sz, s)) .\n\nWe normalize ks(sz, s) such that it equals 1 in the perfect matching case when sz = s. Whenever this error measure exceeds a threshold called vigilance, ez > ν, ν ∈ [0, 1], we generate a new unit j with the codebook vector equal to the current perception, sj = s, and a connection from the last best matching unit z† with the codebook vector equal to the current motor signal, a_{jz†} = a. The errors of both the new and the old unit are reset to zero: ez ← 0, ej ← 0.\n\nPlasticity. We use simple Hebbian plasticity to fine tune the representations of existing units and connections. 
Over time, the receptive fields of units and connections become more and more similar to the average stimuli that activated them. We use the update rules\n\nτs ṡz = −sz + s ,   τa ȧ_{zz†} = −a_{zz†} + a ,\n\nwith learning time constants τs and τa.\n\nPrediction and a temporal Hebb rule. Although perfect prediction is not the actual objective of the CWM, the predictive power is a measure of the correctness of the learned world model, and good predictive power goes hand in hand with good behavior planning. The first and simple mechanism to adapt the predictive power is to grow a new lateral connection between two successive best matching units z† and z if it does not yet exist. The new connection is initialized with w_{zz†} = 1 and a_{zz†} = a. The second, more interesting mechanism addresses the adaptation of wji based on new experiences and can be motivated as follows: The temporal Hebb rule strengthens a synapse if the pre- and post-synaptic neurons spike in sequence, depending on the inter-spike interval, and is supposed to roughly describe LTP and LTD (see, e.g., [19]). In a population code model, this corresponds to a measure of correlation between the pre-synaptic and the delayed post-synaptic activity. In our case we additionally have to account for the action-dependence of a lateral connection. We do so by considering the term ka(aji, a) xi instead of only the pre-synaptic activity. As a measure of temporal correlation we choose to relate this term to the derivative ẋj of the post-synaptic unit instead of its delayed activation—this saves us from specifying an ad-hoc “typical” delay and directly reflects that, in equation (1), lateral inputs relate to the derivative of xj. Hence, we consider the product ẋj ka(aji, a) xi as the measure of correlation. 
Our concrete implementation is a robust version of this idea:\n\nτw ẇji = κji [cji − wji κji] , where\n\nτκ ċji = −cji + ẋj ka(aji, a) xi ,   τκ κ̇ji = −κji + ka(aji, a) xi .\n\nHere, cji and κji are simply low-pass filters of ẋj ka(aji, a) xi and of ka(aji, a) xi. The term wji κji ensures convergence (assuming quasi-static cji and κji) of wji towards cji/κji. The time scale of adaptation is modulated by the recent activity κji of the connection.\n\n5 Experiments\n\nTo demonstrate the functionality of the CWM we consider a simple maze problem. The parameters we used are\n\nτx = 2, η = 0.1, 2σs² = 0.01, 2σa² = 0.5, τv = 2, γ = 0.8, τe = 10, τs = 20, τa = 5, τw = 10, τκ = 100.\n\nFigure 2a displays the geometry of the maze. The “agent” is allowed to move continuously in this maze. The motor signal is 2-dimensional and encodes the forces f in x- and y-directions; the agent has momentum and friction according to ẍ = 0.2 (f − ẋ). As a stimulus, the CWM is given the 2D position x.\n\nFigure 2a also displays the (lateral) topology of the central layer after 30 000 time steps of self-organization, after which the system becomes quasi-stationary. The model is learned from scratch, initialized with one random unit. During this first phase, behavior planning is switched off and the maze is explored with a random walk that changes its direction with probability 0.1 at each time step. In the illustration, the positions of the units correspond to the codebook vectors that have been learned. The directedness and the codebook vectors of the connections cannot be displayed.\n\nAfter the self-organization phase we switched on behavior planning. 
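The exploratory first phase above, an agent with momentum and friction, ẍ = 0.2(f − ẋ), driven by a unit-length force that is randomly redirected with probability 0.1 per step, can be sketched as follows; the square arena bounds stand in for the maze geometry and are an assumption of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
pos = np.zeros(2)              # agent position, also the 2-D stimulus for the CWM
vel = np.zeros(2)
f = np.array([1.0, 0.0])       # motor force, kept at unit length as in the text

for t in range(1000):
    if rng.random() < 0.1:                     # random-walk redirection
        phi = rng.uniform(0.0, 2.0 * np.pi)
        f = np.array([np.cos(phi), np.sin(phi)])
    vel += 0.2 * (f - vel)                     # momentum and friction
    pos = np.clip(pos + vel, -1.0, 1.0)        # crude wall handling in a toy arena
```

Because the velocity update is a convex combination of the old velocity and a unit-length force, the agent's speed stays bounded by 1, which keeps the exploration smooth rather than diffusive.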
A goal stimulus corresponding to a random position in the maze is given and changed every time the agent reaches the goal. Generally, the agent has no problem finding a path to the goal. Figure 2b already displays a more interesting example. The agent has reached goal A and now seeks goal B. However, we blocked trespass 1. Starting at A the agent moves normally until it reaches the blockade. It stays there and moves slowly up and down in front of the blockade for a while—this while is of the order of the low-pass filter time scale τκ. During this time, the lateral weights of the connections pointing to the left are depressed and after about 150 time steps, this change of weights has enough influence on the value field dynamics (4) to let the agent choose the way around the bottom to goal B. Figure 2c displays the next scene: Starting at B, the agent tries to reach goal C again via trespass 1 (the previous adaptation depressed only the connections from right to left). Again, it reaches the blockade, stays there for a while, and then takes the way around to goal C. Figures 2d and 2e repeat this experiment with blockade 2. Starting at D, the agent reaches blockade 2 and eventually chooses the way around to goal E. Then, seeking goal F, the agent reaches the blockade first from the left, thereafter from the bottom, then from the right, then it tries from the bottom again, and finally learns that none of these paths is valid anymore and chooses the way all around to goal F. 
Figure 2f shows that, once the world model has re-adapted to account for these blockades, the agent will not forget about them: Here, moving from G to H, it does not try to trespass block 2.\n\nFigure 2: The CWM on a maze problem: (a) the outcome of self-organization; (b-c) agent movements from goal A to B to C; here, trespass 1 was blocked and requires readaptation of the world model; (d-f) agent movements that demonstrate adaptation to a second blockade. Please see the text for more explanations.\n\nThe reader is encouraged to also refer to the movies of these experiments, deposited at www.marc-toussaint.net/03-cwm/, which visualize much better the dynamics of self-organization, the planning behavior, the dynamics of the value field, and the world model readaptation.\n\n6 Discussion\n\nThe goal of this research is an understanding of how neural systems may learn and represent a world model that allows for the generation of goal-directed behavioral sequences. In our approach for a connectionist world model a perceptual and a motor layer are coupled to self-organize a model of the perceptual implications of motor activity. A dynamical value field on the learned world model organizes behavior planning—a method in principle borrowed from classical Value Iteration. A major feature of our model is its adaptability. The state space model is developed in a self-organizing way and small world changes require only little re-adaptation of the CWM. The system is continuous in the action, perception, and time domain and all dynamics and adaptivity rely on local interactions only.\n\nFuture work will include the more rigorous probabilistic interpretations of CWMs which we already indicated in section 2. 
Another, rather straightforward extension will be to replace random-walk exploration by more directed, information-seeking exploration methods as they have already been developed for classical world models [20, 21].\n\nAcknowledgments\n\nI acknowledge support from the German Bundesministerium für Bildung und Forschung (BMBF).\n\nReferences\n\n[1] G. Hesslow. Conscious thought as simulation of behaviour and perception. Trends in Cognitive Sciences, 6:242–247, 2002.\n\n[2] R. Grush. The emulation theory of representation: motor control, imagery, and perception. Behavioral and Brain Sciences, 2003. To appear.\n\n[3] M.D. Majors and R.J. Richards. Comparing model-free and model-based reinforcement learning. Cambridge University Engineering Department Technical Report CUED/F-INENG/TR.286, 1997.\n\n[4] D.E. Rumelhart, P. Smolensky, J.L. McClelland, and G.E. Hinton. Schemata and sequential thought processes in PDP models. In D.E. Rumelhart and J.L. McClelland, editors, Parallel Distributed Processing, volume 2, pages 7–57. MIT Press, Cambridge, 1986.\n\n[5] M. Jordan and D. Rumelhart. Forward models: Supervised learning with a distal teacher. Cognitive Science, 16:307–354, 1992.\n\n[6] B. Kröse and M. Eecen. A self-organizing representation of sensor space for mobile robot navigation. In Proc. of Int. Conf. on Intelligent Robots and Systems (IROS 1994), 1994.\n\n[7] U. Zimmer. Robust world-modelling and navigation in a real world. NeuroComputing, 13:247–260, 1996.\n\n[8] S. Amari. Dynamics of pattern formation in lateral-inhibition type neural fields. Biological Cybernetics, 27:77–87, 1977.\n\n[9] W.A. Phillips and W. Singer. In search of common foundations for cortical computation. Behavioral and Brain Sciences, 20:657–722, 1997.\n\n[10] L.F. Abbott. Realistic synaptic inputs for network models. 
Network: Computation in Neural Systems, 2:245–258, 1991.\n\n[11] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.\n\n[12] R.S. Sutton and A.G. Barto. Reinforcement Learning. MIT Press, Cambridge, 1998.\n\n[13] C. von der Malsburg. Self-organization of orientation-sensitive cells in the striate cortex. Kybernetik, 15:85–100, 1973.\n\n[14] T. Kohonen. Self-organizing maps. Springer, Berlin, 1995.\n\n[15] G.A. Carpenter, S. Grossberg, N. Markuzon, J.H. Reynolds, and D.B. Rosen. Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Transactions on Neural Networks, 5:698–713, 1992.\n\n[16] B. Fritzke. A growing neural gas network learns topologies. In G. Tesauro, D.S. Touretzky, and T.K. Leen, editors, Advances in Neural Information Processing Systems 7, pages 625–632. MIT Press, Cambridge MA, 1995.\n\n[17] C.M. Bishop, G.E. Hinton, and I.G.D. Strachan. GTM through time. In Proc. of IEEE Fifth Int. Conf. on Artificial Neural Networks. Cambridge, 1997.\n\n[18] J.C. Wiemer. The time-organized map algorithm: Extending the self-organizing map to spatiotemporal signals. Neural Computation, 15:1143–1171, 2003.\n\n[19] P. Dayan and L.F. Abbott. Theoretical Neuroscience. MIT Press, 2001.\n\n[20] J. Schmidhuber. Adaptive confidence and adaptive curiosity. Technical Report FKI-149-91, Technical University Munich, 1991.\n\n[21] N. Meuleau and P. Bourgine. Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning, 35:117–154, 1998.", "award": [], "sourceid": 2452, "authors": [{"given_name": "Marc", "family_name": "Toussaint", "institution": null}]}