{"title": "Learning a Forward Model of a Reflex", "book": "Advances in Neural Information Processing Systems", "page_first": 1555, "page_last": 1562, "abstract": null, "full_text": "Learning a Forward Model of a Reflex

Bernd Porr and Florentin Wörgötter

Computational Neuroscience, Psychology
University of Stirling
FK9 4LR Stirling, UK
{bp1,faw1}@cn.stir.ac.uk

Abstract

We develop a systems-theoretical treatment of a behavioural system that interacts with its environment in a closed-loop situation, such that its motor actions influence its sensor inputs. The simplest form of feedback is a reflex. Reflexes always occur "too late", i.e. only after an (unpleasant, painful, dangerous) reflex-eliciting sensor event has occurred. This defines an objective problem, which can be solved if another sensor input exists that can predict the primary reflex and generate an earlier reaction. In contrast to previous approaches, our linear learning algorithm allows for an analytical proof that this system learns to apply feed-forward control, with the result that slow feedback loops are replaced by their equivalent feed-forward controller, creating a forward model. In other words, learning turns the reactive system into a pro-active system. By means of a robot implementation we demonstrate the applicability of the theoretical results, which can be used in a variety of different areas in physics and engineering.

1 Introduction

Feedback loops are prevalent in animal behaviour, where they are normally called a "reflex". However, a reflex has the disadvantage of always being too late. Thus, an objective goal is to avoid the reflex (feedback) reaction. 
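The "too late" property of a pure feedback loop can be made concrete with a minimal discrete-time sketch (our own illustration with made-up numbers, not part of the original paper): a reflex driven only by the late sensor signal cannot react before the disturbance has already reached that sensor.

```python
import numpy as np

T_STEPS, DELAY = 60, 10          # simulation length and sensor delay T (assumed values)
disturbance = np.zeros(T_STEPS)
disturbance[20] = 1.0            # disturbance hits the early (predictive) sensor at t = 20

x1 = disturbance                 # early sensor signal
x0 = np.roll(disturbance, DELAY) # late (reflex) sensor sees the same event T steps later

reflex_gain = 1.0
reflex_response = -reflex_gain * x0          # pure feedback: reacts only to x0

first_reaction = np.nonzero(reflex_response)[0][0]
print(first_reaction)            # the reflex cannot act before t = 20 + DELAY
```

An anticipatory controller driven by x1 could instead act at t = 20, before the reflex-eliciting event.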
This can be done by an anticipatory (feed-forward) action; for example, when retracting a limb in response to heat radiation without actually having to touch the hot surface, which would elicit a pain-induced reflex. While this has been interpreted as successful forward control [1], the question arises how such a behavioural system can be robustly generated.

In this article we introduce a linear algorithm for temporal sequence learning between two sensor events and provide an analytical proof that this process turns a pre-wired reflex loop into its equivalent feed-forward controller. After learning, the system will respond with an anticipatory action, thereby avoiding the reflex.

Figure 1: Diagram of the system in its environment (in Laplace notation). The input signal D ("disturbance") reaches the two sensor inputs x_0 and x_1 at different times, as indicated by the temporal delay T. The environmental transfer functions are denoted P_0 and P_1. The filters h are linear transfer functions; the filtered inputs u converge with weights ρ onto the output neuron v.

2 The learning rule and its environment

Fig. 1 shows the general situation which arises when temporal sequence learning takes place in a system which interacts with its environment [2]. We distinguish two loops: the inner loop represents the reflex, which has fixed, unchanging properties; the outer loop represents the to-be-learned anticipatory action. Sequence learning requires causally related input events at both sensors x_0 and x_1 (e.g. pain and heat radiation), where T denotes the time delay between both inputs. The outer loop receives the earlier (anticipatory) input.

The delayed and un-delayed signals x_0 and x_1 are processed by linear transforms h (e.g. a low- or band-pass filter); subsequently their sum is taken with weights ρ on a single neuron. Note that all input signals are filtered; the system is therefore completely isotropic. Line x_1 is fanned out in order to adjust to the a priori unknown delay T by the combination of different transforms h_k (see below). The output of the neuron is given in the Laplace domain by:

v(s) = ρ_0 u_0(s) + Σ_{k=1}^{N} ρ_k u_k(s),  with  u(s) = x(s) h(s),   (1)

where the ρ_k are the synaptic weights. In the following we will drop the function argument s for the sake of brevity wherever possible. The transfer functions P_0 and P_1 in Fig. 1 denote how the environment influences the different signals. The goal of sequence learning is that after learning the outer loop should functionally replace the inner loop, such that the reflex ceases to be triggered. In this case we obtain x_0 = 0, which we call the "desired state" of the system. This allows calculating the general requirements for the outer loop without having to specify the actual learning process. The reflex pathway is described by

x_0 = P_0 (D e^{-sT} + v)   (2)

in Laplace notation. 
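Eq. (1), the neuron output as a weighted sum of filtered inputs, can be sketched in discrete time as follows (a toy illustration: the resonator kernel h(t) = e^{at} sin(bt)/b and all numerical values are our own choices, not taken from the paper):

```python
import numpy as np

def bandpass_kernel(f, q, n=100, dt=0.01):
    """Impulse response of a damped resonator h(t) = exp(a t) sin(b t) / b
    with a = -pi f / q (illustrative parametrisation)."""
    t = np.arange(n) * dt
    a = -np.pi * f / q
    b = np.sqrt((2 * np.pi * f) ** 2 - a ** 2)
    return np.exp(a * t) * np.sin(b * t) / b

def neuron_output(x0, x1, rho0, rhos, kernels):
    """Eq. (1): v = rho_0 (h_0 * x_0) + sum_k rho_k (h_k * x_1)."""
    v = rho0 * np.convolve(x0, kernels[0])[: len(x0)]
    for rho_k, h_k in zip(rhos, kernels[1:]):
        v += rho_k * np.convolve(x1, h_k)[: len(x1)]
    return v

x1 = np.zeros(200); x1[50] = 1.0   # early (predictive) pulse
x0 = np.roll(x1, 10)               # reflex input arrives T = 10 steps later
kernels = [bandpass_kernel(f, 0.6) for f in (1.0, 1.0, 2.0, 3.0)]
v = neuron_output(x0, x1, 1.0, [0.5, 0.3, 0.2], kernels)
print(v.shape)
```

Because all pathways are filtered before summation, the output is causal: v is zero until the first input pulse has arrived.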
The signal on the anticipatory (outer) pathway has the representation

x_1 = P_1 (D + v),   (3)

where e^{-sT} in Eq. (2) represents the delay T. The learned transfer function F = Σ_{k=1}^{N} ρ_k h_k generates the anticipatory response triggered by the input x_1. We want to express F by the environmental transfer functions P_0 and P_1. Eq. (1) is solved for the condition x_0 = 0, where the reflex is no longer triggered: Eq. (2) then requires v = -D e^{-sT}, while Eq. (1) reduces to v = F x_1. Eliminating x_1 and v, we get:

F = -e^{-sT} / [P_1 (1 - e^{-sT})].   (4)

Eq. (4) can be further simplified. Following standard control theory [3], we neglect the denominator term (1 - e^{-sT}), because it does not add additional poles to the transfer function F: a pole appears only for e^{-sT} = 1. A transfer function e^{sT}, however, is meaningless because it violates temporal causality. Thus, the denominator can at most add phase shifts to the system's behaviour. As a consequence, we may set e^{-sT} = 0 in the denominator, and the behaviour of F is determined by:

F(s) = -P_1^{-1}(s) e^{-sT}.   (5)

The interpretation of the last equation is straightforward. The learning goal of x_0 = 0 requires compensating the disturbance D. The disturbance, however, enters the system only after having been filtered by the environmental transfer function P_1. Thus, compensation of D requires reversing this filtering by a term P_1^{-1}, which is the inverse environmental transfer function (hence "inverse controller"). The second term, e^{-sT}, in Eq. (5) compensates for the delay T between the two sensor signals originating from the disturbance D.

Having outlined the general setup in terms of our linear approach and system-theoretic notation, we devote the remaining three sections to the following topics: 2.1 the learning rule and convergence to a given solution F under this rule; 2.2 the construction of (approximate) solutions F; 3 the implementation of the system in a (real-world) robot experiment.

2.1 The learning rule and convergence.

Here, we assume that a set of functions h_k exists (as will be specified below) for which a solution can be approximated by F = Σ_k ρ_k h_k. We will now specify the learning rule by which the development of the weight values is controlled and show that any deviation from the given solution is eliminated by learning. In terms of the time-domain functions u_k and v corresponding to U_k and V, our learning rule is given by:

dρ_k/dt = μ u_k v'.   (6)

Thus, the weight change depends on the correlation between the filtered input u_k and the time derivative v' of the output. Since the structure of the system is completely isotropic (see Fig. 1) and learning can take place at any synapse, we call our learning algorithm isotropic sequence order learning ("ISO-learning"). The positive constant μ is taken small enough that all weight changes occur on a much longer time scale (i.e., very slowly) compared to the decay of the responses u_k. This rule is related to the one used in "temporal difference" learning [4]. The total weight change can be calculated by [5]:

Δρ_k = μ ∫ u_k(t) v'(t) dt,   (7)

where v' corresponds to sV in the Laplace domain. We assume that the reflex pathway is unchanging, with a fixed weight ρ_0 < 0 (negative feedback). Note that its open-loop transfer characteristic, given by ρ_0 h_0 P_0, must carry a low-pass component, otherwise the reflex loop would be unstable. We keep the reflex pathway fixed as before. Furthermore, we assume that for a given set of h_k we have found a set of weights ρ_k*, k = 1, ..., N, which solves Eq. (5). We will show that a perturbation of the weights ρ_k will be compensated by applying the learning procedure. Since we do not make any assumption as to the size of the perturbation, this is indicative of convergence in general. To this end, we substitute ρ_k = ρ_k* + δ_k. Stability of the solution is expected if the weight change Δδ_k opposes the perturbation δ_k. Here, however, we assume an "adiabatic" environment, in which the system internally relaxes on a time scale much shorter than the time scale on which the disturbances occur. To be specific, a disturbance/perturbation may occur near t = 0; in calculating the weight change (7) due to this disturbance signal, we disregard any subsequent disturbances as well as perturbations following the steady-state condition. 
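The ISO-learning rule of Eq. (6) can be written down directly in discrete time (an illustrative implementation with assumed traces and parameter values, not the authors' original code):

```python
import numpy as np

def iso_learning_step(rho, u, v_prev, v_now, mu=0.01, dt=1.0):
    """Eq. (6): d rho_k / dt = mu * u_k * v'.
    u: filtered inputs u_k(t) at this time step (array over k);
    v' is approximated by a backward difference."""
    v_dot = (v_now - v_prev) / dt
    return rho + mu * u * v_dot

# toy check: a filtered input that is active while the output rises gains weight
rho = np.zeros(3)
u_trace = np.array([[1.0, 0.0, 0.0]] * 5)      # only u_0 active (assumed trace)
v_trace = np.array([0.0, 0.2, 0.4, 0.6, 0.8])  # output rising while u_0 is active
for t in range(1, 5):
    rho = iso_learning_step(rho, u_trace[t], v_trace[t - 1], v_trace[t])
print(rho)   # only the first weight has changed
```

Weights of inactive inputs stay untouched, and with a small mu the change per event is small, matching the slow-learning assumption above.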
We use the relations for x_0, x_1 and v and insert them into Eq. (7). Writing the perturbation of the learned pathway as ΔF = Σ_j δ_j h_j and resolving the reflex loop, we have for x_0:

x_0 = P_0 [D e^{-sT} + (F* + ΔF) x_1] / (1 - ρ_0 h_0 P_0).   (8)

At the unperturbed solution F* the numerator vanishes (this is the condition behind Eq. 5), so to first order in the perturbation

x_0 = P_0 ΔF x_1 / (1 - ρ_0 h_0 P_0).   (9)

Inserting Eqs. (2) and (8) into Eq. (1) yields the corresponding output v and, with Eq. (7), the weight change Δδ_k. We realize that the first part of the resulting integral describes the unperturbed equilibrium state and can be dropped, together with x_0* = 0. Furthermore, we assume orthogonality (see also below), given by

∫ U_j(iω) [iω U_k(iω)]* dω = 0  for j ≠ k,   (11)

which holds approximately for the filter banks used here, so that only the term with j = k survives. We now apply Plancherel's theorem [5] in order to transfer the remaining integral into the time domain and prove that it is negative. This assures stability and, hence, convergence, because we know that μ is small, preventing oscillatory behaviour. We have:

Δδ_k = μ δ_k ∫ A_k(τ) g'(τ) dτ,   (15)

where A_k denotes the autocorrelation function of u_k, and g' is the temporal derivative of g, the impulse response of the inverse transform of the remaining second term in Eq. (15), the closed-loop fraction containing 1/(1 - ρ_0 h_0 P_0). Since we know that ρ_0 h_0 P_0 must carry a low-pass component, we can in general state that this fraction represents a (non-standard) high-pass. Its derivative g' takes a very large negative value for τ → 0 (ideally -∞) and vanishes soon thereafter, whereas the autocorrelation A_k is positive around τ = 0. Thus, the integral in question will remain negative for almost all realistic choices of u_k(t), and we obtain

Δδ_k δ_k < 0,   (16)

i.e., the weight change opposes the perturbation. As an important special case, we find that this especially holds if we assume a delta-pulse disturbance at t = 0, corresponding to u_k(t) = h_k(t).

2.2 Construction of solutions.

Here, we use a set of well-known functions h (band-pass filters) and show explicitly that a solution which approximates the inverse controller (Eq. 5) can be constructed for N = 1; we then discuss how the approximation is improved for higher values of N. The transfer functions of the band-pass filters h which we use are specified in the Laplace domain by

h(s) = 1 / [(s - p)(s - p*)],

where p* represents the complex conjugate of the pole p = a + ib. The real and imaginary parts of the pole are given by a = -π f / Q and b = [(2π f)² - a²]^{1/2}, where f is the frequency of the oscillation and the damping characteristic of the resonator is reflected by the quality Q. Concerning convergence, one finds that Eq. (16) also holds with such a set of functions: the autocorrelation of the resonator responses is positive around τ = 0 and converges fast to zero for τ → ∞. Band-pass functions are not orthogonal to each other, but numerically we found that they can approximately be treated as being orthogonal. In fact, only a small drift of the weights is observed, which could be compensated if required. 
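The resonators can be realised directly from the pole parametrisation given above; the following sketch (with illustrative f and Q values of our own choosing) builds such an impulse response and checks that it starts at zero and decays:

```python
import numpy as np

def resonator_impulse_response(f, q, n=2000, dt=0.001):
    """Damped resonator h(t) = exp(a t) sin(b t) / b with
    a = -pi f / q and b = sqrt((2 pi f)^2 - a^2), i.e. pole p = a + ib."""
    t = np.arange(n) * dt
    a = -np.pi * f / q
    b = np.sqrt((2 * np.pi * f) ** 2 - a ** 2)
    return np.exp(a * t) * np.sin(b * t) / b

h = resonator_impulse_response(f=5.0, q=0.6)
print(h[0], abs(h[-1]))
```

Note that the square root requires q > 0.5; for smaller q the pole pair would become purely real and the filter would no longer oscillate.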
In practise, however, this becomes unimportant, as discussed below. The use of resonators is also motivated by biology [6]: band-pass filtered response characteristics are prevalent in neuronal systems and have also been used in other neuro-theoretical approaches [7].

We return to Eq. (5). Let us first assume that the environment does not filter the disturbance, thus P_1 = 1. Then, for the case N = 1, an approximative solution of Eq. (5) can easily be constructed by developing F = -e^{-sT} into a Taylor series and obtaining the parameters through comparing coefficients in:

ρ_1 h_1(s) = ρ_1 / [(s - p)(s - p*)] = -e^{-sT} ≈ -[1 - sT + (sT)²/2],   (17)

where the left-hand side is likewise expanded to second order in s. For un-filtered throughput, P_1 = 1, this result shows that for every T there exists a resonator h_1 which, with a weight ρ_1, approximates F = -e^{-sT} to the second order. The approximation continues to improve for higher orders of N, which we pursued up to N = 2 (fourth-order Taylor), but the set of equations becomes rather cluttered. In general, P_1 represents an environmental transfer function which is passive and "well-behaved". Thus, in most cases it can be represented by just another passive low- or band-pass filter (sum of complex conjugated poles). 
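The idea that a small bank of resonators can stand in for a delayed inverse controller can also be checked numerically. Instead of solving the Taylor conditions in closed form, the sketch below (our own illustration; filter frequencies, quality, grid and delay are all assumed values) simply least-squares fits a resonator bank to the target impulse response of -e^{-sT}, a delayed negative pulse:

```python
import numpy as np

def resonator(f, q, t):
    """Damped resonator h(t) = exp(a t) sin(b t) / b (assumed parametrisation)."""
    a = -np.pi * f / q
    b = np.sqrt((2 * np.pi * f) ** 2 - a ** 2)
    return np.exp(a * t) * np.sin(b * t) / b

dt, n, T = 0.001, 1000, 0.1                 # time grid and delay T (illustrative)
t = np.arange(n) * dt
bank = np.stack([resonator(f, 0.6, t) for f in (1.0, 2.0, 4.0, 8.0)], axis=1)

target = np.zeros(n)                        # impulse response of F = -e^{-sT}:
target[int(T / dt)] = -1.0 / dt             # a negative pulse delayed by T

rho, *_ = np.linalg.lstsq(bank, target, rcond=None)
approx = bank @ rho
print(rho.shape)
```

The smooth resonators cannot reproduce the sharp pulse exactly; enlarging the bank improves the fit, mirroring the statement that the approximation improves with higher N.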
Under this assumption a solution can also be constructed for the complete term -P_1^{-1} e^{-sT} by a combination of N resonators.

As mentioned above, constructing solutions becomes impractical for N > 1, and it would require knowing T, P_0 and P_1 a priori. Note that if you knew P_1^{-1} and e^{-sT}, you would already have reached your goal of designing the inverse controller, and learning would be obsolete. Thus, normally a set of resonators h_k must be predefined in a somewhat arbitrary way, and their weights ρ_k shall be learned. The uniqueness of the solution assured by orthogonality becomes secondary in practise, because - without prior knowledge of T - one has to use an over-complete set of h_k in order to make sure that a solution can be found. In practise, this means that a large enough set of filters must be used, which normally leads to a manifold of solutions. Now the question obviously arises whether satisfactory solutions exist under these relaxed conditions and whether they remain stable.

Figure 2: Robot experiment: (a) The robot has two output neurons, one for speed and one for steering angle. The retraction mechanism is implemented by three resonators which connect the collision sensors (CS) to these neurons with fixed weights (reflex). Each range finder (RF) is fed into a filter bank of 10 resonators h_k, whose outputs converge with variable weights onto both the speed and the steering-angle neuron. A more detailed technical description, together with a set of movies, can be found at: http://www.cn.stir.ac.uk/predictor/real - movie 1. (b,d) Parts of the motion trajectory for one trial in an arena with three obstacles (shaded); circles denote collisions. (c) Development of the weights from the left range-finder sensor to the speed neuron.

3 Implementation in a robot experiment.

In this section we show a robot experiment in which we apply a conventional filter-bank approach, using rather few filters with constant Q and logarithmically spaced frequencies f_k, and demonstrate that the algorithm still produces the desired behaviour.

The task in this robot experiment is collision avoidance [8]. The built-in reflex behaviour is a retraction reaction after the robot has hit an obstacle, which represents the inner-loop feedback mechanism1. The robot has three collision sensors (x_0) and two range finders (x_1), which produce the predictive signals. When driving around, there is always a causal relation between the earlier-occurring range-finder signals and the later-occurring collision, and this relation drives the learning process. Fig. 2b shows that early during learning many collisions (circles) occur. After a collision, a fast reflex-like retraction-and-turning reaction is elicited. 
On the other hand, after successful learning of the temporal correlation between range-finder and collision signals, the robot movement trace is free of collisions (Fig. 2d) and the trajectory is maximally smooth. The robot always found a stable solution, but those were, as expected, not unique. This is partly due to the different initial conditions, but also due to the over-complete set of h_k. Possible solutions which we have observed are that the robot after learning simply stops in front of an obstacle, or that it slightly oscillates back and forth. The most common solution, however, is that the robot continuously drives around and mainly uses its steering to avoid obstacles. Note that this rather complex behaviour is established by only two neurons. Fig. 2c shows that the weight change slows down after the last collision has happened (dotted line in c). The still-existing smaller weight change is due to the fact that, after the functional silencing of x_0 (no more collisions), temporally correlated inputs still exist, namely between the left and right range finders. Thus, learning is now governed by these correlations instead and is driven by the earliest response of one of them, which finally leads to the desired stabilisation.

1 In fact, it is also possible to construct an attraction case if the reflex performs an initial attraction reaction.

4 Discussion

Replacing a feedback loop with its equivalent feed-forward controller is of central relevance for efficient control, particularly in slow feedback systems where long loop delays exist. So far, feed-forward control is in general model-based and thus often not robust [9]. 
On the other hand, it has been suggested earlier, by studies of limb movement control, that temporal sequence learning could be used to solve the inverse controller problem [1].

Figure 3: Differences between the Sutton and Barto models (a,c) and ISO-learning (b) for the case N = 1. (a) shows the drive-reinforcement model by Sutton and Barto [4] and (c) the temporal-difference (TD) learning by Sutton and Barto [10]. Note that the obsolete summation point in (a) allows adding the reward signal in (c). (b) shows ISO-learning as in Fig. 1 with N = 1; additionally, the circuit for the weight change (learning) is shown. The input filters h in the Sutton and Barto models (a,c) are first-order low-pass filters (eligibility trace). Σ and × represent addition and multiplication, respectively; d/dt is the derivative.

Widely used models of derivative-based temporal sequence learning are those by Sutton and Barto, which aim to model experiments of classical conditioning [4, 11, 10]. Fig. 3 shows their models in comparison to ISO-learning. All models strengthen the weight of the early input if x_1 precedes x_0 (or the reward, respectively). All models use filters at the inputs. However, in the Sutton and Barto models these filtered input signals are only used as input for the learning circuit (Fig. 3a,c), whereas the output is a superposition of the original input signals. Learning is therefore achieved by correlating the filtered input with the derivative of the (un-filtered) output signal; thus, filtered signals are correlated with un-filtered signals. In contrast to the Sutton and Barto models, our model is completely isotropic and uses the filtered signals for both the learning circuit and the output, since the filtered signals are also responsible for an appropriate behaviour of the organism. These different wirings reflect the different learning goals: in our model, the weight ρ_1 stabilises when the input x_0 has become silent (the reflex has been avoided). In the Sutton and Barto models, the weight stabilises if the output has reached a specific condition. In the drive-reinforcement model this is the case if the output signal caused by x_1 has a similar strength to the output triggered by x_0; this reflects the Rescorla/Wagner rule [12]. In the case of TD learning, learning stops if the prediction error between reward and output is zero, thus if the output optimally predicts the reward. In general, our model is closely related to any correlation-based sequence learning [4, 13] and is not related to any form of reinforcement learning [10, 14], as it does not need a special reward or punishment signal.

The current study demonstrates analytically the convergence of ISO-learning in a closed-loop paradigm, in conjunction with some rather general assumptions concerning the structure of such a system. Thus, this type of learning is able to generate a model-free inverse controller of a reflex, which improves the performance of conventional feedback control while the feedback still serves as a fall-back. Apart from the biological implications, this promises a broad field of applications in physics and engineering.

References

[1] Daniel M. Wolpert and Zoubin Ghahramani. Computational principles of movement neuroscience. Nature Neuroscience supplement, 3:1212-1217, 2000.

[2] P. Read Montague, Peter Dayan, and Terrence J. Sejnowski. Bee foraging in uncertain environments using predictive hebbian learning. Nature, 377:725-728, 1995.

[3] W. E. Sollecito and S. G. Reque. Stability. In Jerry Fitzgerald, editor, Fundamentals of System Analysis, chapter 21. Wiley, New York, 1981.

[4] R. S. Sutton and A. G. Barto. Toward a modern theory of adaptive networks: expectation and prediction. Psychol. Review, 88:135-170, 1981.

[5] John L. Stewart. Fundamentals of Signal Theory. McGraw-Hill, New York, 1960.

[6] Gordon M. Shepherd, editor. The Synaptic Organisation of the Brain. Oxford University Press, New York, 1990.

[7] Steven Grossberg. A spectral network model of pitch perception. J. Acoust. Soc. Am., 98(2):862-879, 1995.

[8] P. F. M. J. Verschure and T. Voegtlin. A bottom-up approach towards the acquisition, retention, and expression of sequential representations: Distributed adaptive control III. Neural Networks, 11:1531-1549, 1998.

[9] William J. Palm. Modeling, Analysis and Control of Dynamic Systems. Wiley, New York, 2000.

[10] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.

[11] R. S. Sutton and A. G. Barto. Simulation of anticipatory responses in classical conditioning by a neuron-like adaptive element. Behav. Brain Res., 4(3):221-235, 1982.

[12] R. A. Rescorla and A. R. Wagner. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black and W. F. Prokasy, editors, Classical Conditioning II: Current Theory and Research, pages 64-99. Appleton-Century-Crofts, New York, 1972.

[13] A. Harry Klopf. A drive-reinforcement model of single neuron function. In John S. Denker, editor, Neural Networks for Computing, volume 151 of AIP Conference Proceedings, New York, 1986. American Institute of Physics.

[14] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8:279-292, 1992.
", "award": [], "sourceid": 2245, "authors": [{"given_name": "Bernd", "family_name": "Porr", "institution": null}, {"given_name": "Florentin", "family_name": "W\u00f6rg\u00f6tter", "institution": null}]}