{"title": "Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 347, "page_last": 354, "abstract": null, "full_text": "Learning to Control an Octopus Arm with\n\nGaussian Process Temporal Difference Methods\n\nYaakov Engel\u2217\n\nAICML, Dept. of Computing Science\n\nUniversity of Alberta\nEdmonton, Canada\n\nPeter Szabo and Dmitry Volkinshtein\n\nDept. of Electrical Engineering\nTechnion Institute of Technology\n\nHaifa, Israel\n\nyaki@cs.ualberta.ca\n\npeter.z.szabo@gmail.com\n\ndmitryvolk@gmail.com\n\nAbstract\n\nThe Octopus arm is a highly versatile and complex limb. How the Octo-\npus controls such a hyper-redundant arm (not to mention eight of them!)\nis as yet unknown. Robotic arms based on the same mechanical prin-\nciples may render present day robotic arms obsolete. In this paper, we\ntackle this control problem using an online reinforcement learning al-\ngorithm, based on a Bayesian approach to policy evaluation known as\nGaussian process temporal difference (GPTD) learning. Our substitute\nfor the real arm is a computer simulation of a 2-dimensional model of\nan Octopus arm. Even with the simpli\ufb01cations inherent to this model,\nthe state space we face is a high-dimensional one. We apply a GPTD-\nbased algorithm to this domain, and demonstrate its operation on several\nlearning tasks of varying degrees of dif\ufb01culty.\n\n1\n\nIntroduction\n\nThe Octopus arm is one of the most sophisticated and fascinating appendages found in\nnature. It is an exceptionally \ufb02exible organ, with a remarkable repertoire of motion. In\ncontrast to skeleton-based vertebrate and present-day robotic limbs, the Octopus arm lacks\na rigid skeleton and has virtually in\ufb01nitely many degrees of freedom. 
As a result, this arm is highly hyper-redundant – it is capable of stretching, contracting, folding over itself several times, rotating along its axis at any point, and following the contours of almost any object. These properties allow the Octopus to exhibit feats requiring agility, precision and force. For instance, it is well documented that Octopuses are able to pry open a clam or remove the plug from a glass jar to gain access to its contents [1].\n\nThe basic mechanism underlying the flexibility of the Octopus arm (as well as of other organs, such as the elephant trunk and vertebrate tongues) is the muscular hydrostat [2]. Muscular hydrostats are organs capable of exerting force and producing motion with the sole use of muscles. The muscles serve in the dual roles of generating the forces and maintaining the structural rigidity of the appendage. This is possible due to a constant volume constraint, which arises from the fact that muscle tissue is incompressible. Proper use of this constraint allows muscle contractions in one direction to generate forces acting in perpendicular directions.\n\n∗To whom correspondence should be addressed. Web site: www.cs.ualberta.ca/∼yaki\n\nDue to their unique properties, understanding the principles governing the movement and control of the Octopus arm and other muscular hydrostats is of great interest to both physiologists and robotics engineers. Recent physiological and behavioral studies have produced some interesting insights into the way the Octopus plans and controls its movements. Gutfreund et al. [3] investigated the reaching movement of an Octopus arm and showed that the motion is performed by a stereotypical forward propagation of a bend point along the arm. Yekutieli et al. 
[4] propose that the complex behavioral movements of the Octopus are composed of a limited number of “motion primitives”, which are spatio-temporally combined to produce the arm’s motion.\n\nAlthough physical implementations of robotic arms based on the same principles are not yet available, recent progress in the technology of “artificial muscles” using electroactive polymers [5] may allow the construction of such arms in the near future. Needless to say, even a single such arm poses a formidable control challenge, which does not appear to be amenable to conventional control theoretic or robotics methodology. In this paper we propose a learning approach for tackling this problem. Specifically, we formulate the task of bringing some part of the arm into a goal region as a reinforcement learning (RL) problem. We then proceed to solve this problem using Gaussian process temporal difference learning (GPTD) algorithms [6, 7, 8].\n\n2 The Domain\n\nOur experimental test-bed is a finite-elements computer simulation of a planar variant of the Octopus arm, described in [9, 4]. This model is based on a decomposition of the arm into quadrilateral compartments, and the constant muscular volume constraint mentioned above is translated into a constant area constraint on each compartment. Muscles are modeled as dampened springs and the mass of each compartment is concentrated in point masses located at its corners1. Although this is a rather crude approximation of the real arm, even for a modest 10-segment model there are already 88 continuous state variables2, making this a rather high-dimensional learning problem. Figure 1 illustrates this model.\n\nSince our model is 2-dimensional, all force vectors lie in the x–y plane, and the arm’s motion is planar. This limitation is due mainly to the high computational cost of the full 3-dimensional calculations for any arm of reasonable size. 
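As a quick sanity check on the state-variable count quoted above (a sketch, not part of the simulator), the 88 dimensions follow directly from the compartment layout:

```python
def state_dim(n_compartments):
    # Each compartment contributes one pair of corner point masses, and one
    # extra pair closes the arm, giving 2(N + 1) masses in total; every mass
    # carries four state variables: x, y and their first time-derivatives.
    n_masses = 2 * (n_compartments + 1)
    return 4 * n_masses
```

For the 10-compartment arm used in this paper this gives 22 point masses and 88 state variables, matching the counts above.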
There are four types of forces acting on the arm: 1) The internal forces generated by the arm’s muscles, 2) the vertical forces caused by the influence of gravity and the arm’s buoyancy in the medium in which it is immersed (typically sea water), 3) drag forces produced by the arm’s motion through this medium, and 4) internal pressure-induced forces responsible for maintaining the constant volume of each compartment. The use of simulation allows us to easily investigate different operating scenarios, such as zero or low gravity, different media, such as water, air or vacuum, and different muscle models. In this study, we used a simple linear model for the muscles. The force applied by a muscle at any given time t is\n\nF(t) = (k0 + (kmax − k0)A(t)) (ℓ(t) − ℓrest) + c dℓ(t)/dt.\n\n1For the purpose of computing volumes, masses, friction and muscle strength, the arm is effectively defined in three dimensions. However, no forces or motion are allowed in the third dimension. We also ignore the suckers located along the ventral side of the arm, and treat the arm as if it were symmetric with respect to reflection along its long axis. 
Finally, we comment that this model is restricted to modeling the mechanics of the arm and does not attempt to model its nervous system.\n\n2Ten segments result in 22 point masses, each described by 4 state variables – the x and y coordinates and their respective first time-derivatives.\n\n[Figure 1 appears here. Its labels mark the dorsal and ventral sides, the arm base and tip, compartments C1 through CN, muscle pairs #1 through #N+1, and the longitudinal and transverse muscles.]\n\nFigure 1: An N compartment simulated Octopus arm. Each constant area compartment Ci is defined by its surrounding 2 longitudinal muscles (ventral and dorsal) and 2 transverse muscles. Circles mark the 2N + 2 point masses in which the arm’s mass is distributed. In the bottom right one compartment is magnified with additional detail.\n\nThis equation describes a dampened spring with a controllable spring constant. The spring’s length at time t is ℓ(t); its resting length, at which it does not apply any force, is ℓrest.3 The spring’s stiffness is controlled by the activation variable A(t) ∈ [0, 1]. Thus, when the activation is zero and the contraction is isometric (with zero velocity), the relaxed muscle exhibits a baseline passive stiffness k0. 
In a fully activated isometric contraction the spring constant becomes kmax. The second term is a damping, energy-dissipating term, which is proportional to the rate of change in the spring’s length, and (with c > 0) is directed to resist that change. This is a very simple muscle model, which has been chosen mainly due to its low computational cost, and the relative ease of computing the energy expended by the muscle (why this is useful will become apparent in the sequel). More complex muscle models can be easily incorporated into the simulator, but may result in higher computational overhead. For additional details on the modeling of the other forces and on the derivation of the equations of motion, refer to [4].\n\n3 The Learning Algorithms\n\nAs mentioned above, we formulate the problem of controlling our Octopus arm as an RL problem. We are therefore required to define a Markov decision process (MDP), consisting of state and action spaces, a reward function and state transition dynamics. The states in our model are the Cartesian coordinates of the point masses and their first time-derivatives. A finite (and relatively small) number of actions are defined by specifying, for each action, a set of activations for the arm’s muscles. The actions used in this study are depicted in Figure 2. Given the arm’s current state and the chosen action, we use the simulator to compute the arm’s state after a small fixed time interval. Throughout this interval the activations remain fixed, until a new action is chosen for the next interval. The reward is defined as −1 for non-goal states, and 10 for goal states. This encourages the controller to find policies that bring the arm to the goal as quickly as possible. In addition, in order to encourage smoothness and economy in the arm’s movements, we subtract an energy penalty term from these rewards. 
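Returning for a moment to the muscle model, the force law above translates almost directly into code; the parameter values below are illustrative placeholders, not the ones used in the simulator:

```python
def muscle_force(length, d_length_dt, activation, k0=1.0, k_max=10.0, c=0.1, l_rest=0.5):
    # Damped spring whose stiffness is interpolated between the passive
    # baseline k0 (activation 0) and k_max (full activation); the damping
    # term c * dl/dt resists changes in the spring length.
    k = k0 + (k_max - k0) * activation
    return k * (length - l_rest) + c * d_length_dt
```

At zero activation and zero velocity this reduces to the passive term k0(ℓ − ℓrest), matching the relaxed isometric case described above.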
This term is proportional to the total energy expended by all muscles during each action interval. Training is performed in an episodic manner: Upon reaching a goal, the current episode terminates and the arm is placed in a new initial position to begin a new episode. If a goal is not reached within some fixed amount of time, the episode terminates regardless.\n\n3It is assumed that at all times ℓ(t) ≥ ℓrest. This is meant to ensure that our muscles can only apply force by contracting, as real muscles do. This can be assured by endowing the compartments with sufficiently high volumes, or equivalently, by setting ℓrest sufficiently low.\n\n[Figure 2 appears here, showing the activation patterns of actions #1 through #6.]\n\nFigure 2: The actions used in the fixed-base experiments. Line thickness is proportional to activation intensity. For the rotating-base experiment, these actions were augmented with versions of actions 1, 2, 4 and 5 that include clockwise and anti-clockwise torques applied to the arm’s base.\n\nThe RL algorithms implemented in this study belong to the Policy Iteration family of algorithms [10]. Such algorithms require an algorithmic component for estimating the mean sum of (possibly discounted) future rewards collected along trajectories, as a function of the trajectory’s initial state, also known as the value function. The best known RL algorithms for performing this task are temporal difference algorithms. Since the state space of our problem is very large, some form of function approximation must be used to represent the value estimator. Temporal difference methods, such as TD(λ) and LSTD(λ), are provably convergent when used with linearly parametrized function approximation architectures [10]. Used this way, they require the user to define a fixed set of basis functions, which are then linearly combined to approximate the value function. 
These basis functions must be defined over the entire state space, or at least over the subset of states that might be reached during learning. When local basis functions are used (e.g., RBFs or tile codes [11]), this inevitably means an exponential explosion of the number of basis functions with the dimensionality of the state space. Nonparametric GPTD learning algorithms4 [8] offer an alternative to the conventional parametric approach. The idea is to define a nonparametric statistical generative model connecting the hidden values and the observed rewards, and a prior distribution over value functions. The GPTD modeling assumptions are that both the prior and the observation-noise distributions are Gaussian, and that the model equations relating values and rewards have a special linear form. During or following a learning session, in which a sequence of states and rewards is observed, Bayes’ rule may be used to compute the posterior distribution over value functions, conditioned on the observed reward sequence. Due to the GPTD model assumptions, this distribution is also Gaussian, and is derivable in closed form. The benefits of using (nonparametric) GPTD methods are that 1) the resulting value estimates are generally not constrained to lie in the span of any predetermined set of basis functions, 2) no resources are wasted on unvisited state and action space regions, and 3) rather than the point estimates provided by other methods, GPTD methods provide complete probability distributions over value functions.\n\nIn [6, 7, 8] it was shown how the computation of the posterior value GP moments can be performed sequentially and online. This is done by employing a forward selection mechanism, which is aimed at attaining a sparse approximation of the posterior moments, under a constraint on the resulting error. 
The input samples (states, or state-action pairs) used in this approximation are stored in a dictionary, the final size of which is often a good indicator of the problem’s complexity. Since nonparametric GPTD algorithms belong to the family of kernel machines, they require the user to define a kernel function, which encodes her prior knowledge and beliefs concerning similarities and correlations in the domain at hand. More specifically, the kernel function k(·, ·) defines the prior covariance of the value process. Namely, for two arbitrary states x and x′, Cov[V(x), V(x′)] = k(x, x′) (see [8] for details). In this study we experimented with several kernel functions; however, in this paper we will describe results obtained using a third degree polynomial kernel, defined by k(x, x′) = (x⊤x′ + 1)^3. It is well known that this kernel induces a feature space of monomials of degree 3 or less [12]. For our 88-dimensional input space, this feature space is spanned by a basis consisting of (91 choose 3) = 121,485 linearly independent monomials.\n\n4GPTD models can also be defined parametrically, see [8].\n\nWe experimented with two types of policy-iteration based algorithms. The first was optimistic policy iteration (OPI), in which, at any given time-step, the current GPTD value estimator is used to evaluate the successor states resulting from each one of the actions available at the current state. Since, given an action, the dynamics are deterministic, we used the simulation to determine the identity of successor states. An action is then chosen according to a semi-greedy selection rule (more on this below). A more disciplined approach is provided by a paired actor-critic algorithm. Here, two independent GPTD estimators are maintained. 
The first is used to determine the policy, again, by some semi-greedy action selection rule, while its parameters remain fixed. In the meantime, the second GPTD estimator is used to evaluate the stationary policy determined by the first. Once the second GPTD estimator is deemed sufficiently accurate, as indicated by the GPTD value variance estimate, the roles are reversed. This is repeated as many times as required, until no significant improvement in policies is observed.\n\nAlthough the latter algorithm, being an instance of approximate policy iteration, has a better theoretical grounding [10], in practice it was observed that the GPTD-based OPI worked significantly faster in this domain. In the experiments reported in the next section we therefore used OPI. For additional details and experiments refer to [13]. One final wrinkle concerns the selection of the initial state in a new episode. Since plausible arm configurations cannot be attained by randomly drawing 88 state variables from some simple distribution, a more involved mechanism for setting the initial state in each episode has to be defined. The method we chose is tightly connected to the GPTD mode of operation: At the end of each episode, 10 random states were drawn from the GPTD dictionary. From these, the state with the highest posterior value variance estimate was selected as the initial state of the next episode. This is a form of active learning, which is made possible by employing GPTD, and which is applicable to general episodic RL problems.\n\n4 Experiments\n\nThe experiments described in this section are aimed at demonstrating the applicability of GPTD-based algorithms to large-scale RL problems, such as our Octopus arm. In these experiments we used the simulated 10-compartment arm described in Section 2. 
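The active-learning restart heuristic from the end of Section 3 can be sketched as follows; value_variance stands in for the GPTD posterior value variance estimate and is a hypothetical interface, not the actual implementation:

```python
import random

def next_initial_state(dictionary_states, value_variance, n_candidates=10):
    # Sample a handful of states from the GPTD dictionary and restart the
    # next episode from the one whose posterior value variance is largest,
    # i.e. where the current value estimate is least certain.
    candidates = random.sample(dictionary_states, min(n_candidates, len(dictionary_states)))
    return max(candidates, key=value_variance)
```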
The set of goal states consisted of a circular region located somewhere within the potential reach of the arm (recall that the arm has no fixed length). The action set depends on the task, as described in Figure 2. Training episode duration was set to 4 seconds, and the time interval between action decisions was 0.4 seconds. This allowed a maximum of 10 learning steps per trial. The discount factor was set to 1.\n\nThe exploration policy used was the ubiquitous ε-greedy policy: The greedy action (i.e., the one for which the sum of the reward and the successor state’s estimated value is the highest) is chosen with probability 1 − ε, and with probability ε a random action is drawn from a uniform distribution over all other actions. The value of ε is reduced during learning, until the policy converges to the greedy one. In our implementation, in each episode, ε was dependent on the number of successful episodes experienced up to that point. The general form of this relation is ε = ε0N1/2/(N1/2 + Ngoals), where Ngoals is the number of successful episodes, ε0 is the initial value of ε, and N1/2 is the number of successful episodes required to reduce ε to ε0/2.\n\nFigure 3: Examples of initial states for the rotating-base experiments (left) and the fixed-base experiments (right). Starting states also include velocities, which are not shown.\n\nIn order to evaluate the quality of learned solutions, 100 initial arm configurations were created. This was done by starting a simulation from some fixed arm configuration, performing a long sequence of random actions, and sampling states randomly from the resulting trajectory. Some examples of such initial states are depicted in Figure 3. During learning, following each training episode, the GPTD-learned parameters were recorded on file. 
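The ε schedule described above is a one-liner; the values of ε0 and N1/2 below are illustrative, not the ones used in the experiments:

```python
def epsilon(n_goals, eps0=0.5, n_half=20):
    # Hyperbolic decay: epsilon equals eps0 before any success and drops to
    # eps0 / 2 once n_half successful episodes have been experienced.
    return eps0 * n_half / (n_half + n_goals)
```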
Each set of GPTD parameters defines a value estimator, and therefore also a greedy policy with respect to the posterior value mean. Each such policy was evaluated by using it, starting from each of the 100 initial test states. For each starting state, we recorded whether or not a goal state was reached within the episode’s time limit (4 seconds), and the duration of the episode (successful episodes terminate when a goal state is reached). These two measures of performance were averaged over the 100 starting states and plotted against the episode index, resulting in two corresponding learning curves for each experiment5.\nWe started with a simple task in which reaching the goal is easy. Any point of the arm entering the goal circle was considered a success. The arm’s base was fixed and the gravity constant was set to zero, corresponding to a scenario in which the arm moves on a horizontal frictionless plane. In the second experiment the task was made a little more difficult. The goal was moved further away from the base of the arm. Moreover, gravity was set to its natural level of 9.8 m/s², with the motion of the arm now restricted to a vertical plane. The learning curves corresponding to these two experiments are shown in Figure 4. A success rate of 100% was reached after 10 and 20 episodes, respectively. In both cases, even after a success rate of 100% is attained, the mean time-to-goal keeps improving. The final dictionaries contained about 200 and 350 states, respectively.\nIn our next two experiments, the arm had to reach a goal located so that it cannot be reached unless the base of the arm is allowed to rotate. We added base-rotating actions to the basic actions used in the previous experiments (see Figure 2 for an explanation). 
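Backing up to the evaluation protocol described above, it amounts to the following loop; run_episode is a hypothetical helper that rolls out the greedy policy from one start state and returns a (reached_goal, duration) pair:

```python
def evaluate(run_episode, start_states):
    # Average success and episode duration over the fixed set of test
    # starting configurations, as done after each training episode.
    results = [run_episode(s) for s in start_states]  # (reached_goal, duration) pairs
    success_rate = sum(1 for reached, _ in results if reached) / len(results)
    mean_duration = sum(duration for _, duration in results) / len(results)
    return success_rate, mean_duration
```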
Allowing a rotating base significantly increases the size of the action set, as well as the size of the reachable state space, making the learning task considerably more difficult. To make things even more difficult, we rewarded the arm only if it reached the goal with its tip, i.e., the two point-masses at the end of the arm. In the first experiment in this series, gravity was switched on. A 99% success rate was attained after 270 trials, with a final dictionary size of about 600 states.\n\n5It is worth noting that this evaluation procedure requires far more time than the actual learning, since each point in the graphs shown below requires us to perform 100 simulation runs. Whereas learning can be performed almost in real time (depending on dictionary size), computing the statistics for a single learning run may take a day or more.\n\nFigure 4: Success rate (solid) and mean time to goal (dashed) for a fixed-base arm in zero gravity (left), and with gravity (right). 100% success was reached after 10 and 20 trials, respectively. The insets illustrate one starting position and the location of the goal regions, in each case.\n\nIn the second experiment gravity was switched off, but a circular region of obstacle states was placed between the arm’s base and the goal circle. If any part of the arm touched the obstacle, the episode immediately terminated with a negative reward of −2. Here, the success rate peaked at 40% after around 1000 episodes, and remained roughly constant thereafter. It should be taken into consideration that at least some of the 100 test starting states are so close to the obstacle that, regardless of the action taken, the arm cannot avoid hitting the obstacle. 
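Collecting the reward components mentioned in Sections 3 and 4 (per-step penalty, goal reward, obstacle penalty, energy term), the reward structure can be sketched as below; the energy-penalty coefficient beta is a hypothetical placeholder, since the paper states only that the penalty is proportional to the energy expended:

```python
def reward(reached_goal, hit_obstacle, energy_expended, beta=0.01):
    # Terminal penalty of -2 for touching the obstacle, +10 on reaching the
    # goal, -1 per step otherwise, minus an energy-proportional smoothness
    # penalty (beta is illustrative).
    if hit_obstacle:
        return -2.0
    base = 10.0 if reached_goal else -1.0
    return base - beta * energy_expended
```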
The learning curves are presented in Figure 5.\n\nFigure 5: Success rate (solid) and mean time to goal (dashed) for a rotating-base arm with gravity switched on (left), and with gravity switched off but with an obstacle blocking the direct path to the goal (right). The arm has to rotate its base in order to reach the goal in either case (see insets). Positive reward was given only for arm-tip contact; any contact with the obstacle terminated the episode with a penalty. A 99% success rate was attained after 270 episodes for the first task, whereas for the second task the success rate reached 40%.\n\nVideo movies showing the arm in the various scenarios are available at www.cs.ualberta.ca/∼yaki/movies/.\n\n5 Discussion\n\nUp to now, GPTD-based RL algorithms have only been tested on low-dimensional problem domains. Although kernel methods have handled high-dimensional data, such as handwritten digits, remarkably well in supervised learning domains, the applicability of the kernel-based GPTD approach to high-dimensional RL problems has remained an open question. The results presented in this paper are, in our view, a clear indication that GPTD methods are indeed scalable, and should be considered seriously as a possible solution method by practitioners facing large-scale RL problems. Further work on the theory and practice of GPTD methods is called for. Standard techniques for model selection and tuning of hyper-parameters can be incorporated straightforwardly into GPTD algorithms. Value iteration-based variants, i.e. “GPQ-learning”, would provide yet another useful set of tools.\n\nThe Octopus arm domain is of independent interest, both to physiologists and robotics engineers. The fact that reasonable controllers for such a complex arm can be learned from trial and error, in a relatively short time, should not be understated. Further work in this direction should be aimed at extending the Octopus arm simulation to a full 3-dimensional model, as well as applying our RL algorithms to real robotic arms based on the muscular hydrostat principle, when these become available.\n\nAcknowledgments\n\nY. E. was partially supported by the AICML and the Alberta Ingenuity fund. We would also like to thank the Ollendorff Minerva Center for supporting this project.\n\nReferences\n\n[1] G. Fiorito, C. V. Planta, and P. Scotto. Problem solving ability of Octopus vulgaris Lamarck (Mollusca, Cephalopoda). Behavioral and Neural Biology, 53(2):217–230, 1990.\n\n[2] W. M. Kier and K. K. Smith. Tongues, tentacles and trunks: The biomechanics of movement in muscular-hydrostats. Zoological Journal of the Linnean Society, 83:307–324, 1985.\n\n[3] Y. Gutfreund, T. Flash, Y. Yarom, G. Fiorito, I. Segev, and B. Hochner. Organization of Octopus arm movements: A model system for studying the control of flexible arms. The Journal of Neuroscience, 16:7297–7307, 1996.\n\n[4] Y. Yekutieli, R. Sagiv-Zohar, R. Aharonov, Y. Engel, B. Hochner, and T. Flash. A dynamic model of the Octopus arm. I. Biomechanics of the Octopus reaching movement. Journal of Neurophysiology (in press), 2005.\n\n[5] Y. Bar-Cohen, editor. Electroactive Polymer (EAP) Actuators as Artificial Muscles - Reality, Potential and Challenges. SPIE Press, 2nd edition, 2004.\n\n[6] Y. Engel, S. Mannor, and R. Meir. Bayes meets Bellman: The Gaussian process approach to temporal difference learning. In Proc. of the 20th International Conference on Machine Learning, 2003.\n\n[7] Y. Engel, S. Mannor, and R. Meir. Reinforcement learning with Gaussian processes. In Proc. of the 22nd International Conference on Machine Learning, 2005.\n\n[8] Y. Engel. Algorithms and Representations for Reinforcement Learning. PhD thesis, The Hebrew University of Jerusalem, 2005. www.cs.ualberta.ca/∼yaki/papers/thesis.ps.\n\n[9] R. 
Aharonov, Y. Engel, B. Hochner, and T. Flash. A dynamical model of the octopus arm. In Neuroscience Letters, Suppl. 48: Proceedings of the 6th Annual Meeting of the Israeli Neuroscience Society, 1997.\n\n[10] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.\n\n[11] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.\n\n[12] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, England, 2004.\n\n[13] Y. Engel, P. Szabo, and D. Volkinshtein. Learning to control an Octopus arm with Gaussian process temporal difference methods. Technical report, Technion Institute of Technology, 2005. www.cs.ualberta.ca/∼yaki/reports/octopus.pdf.\n", "award": [], "sourceid": 2785, "authors": [{"given_name": "Yaakov", "family_name": "Engel", "institution": null}, {"given_name": "Peter", "family_name": "Szabo", "institution": null}, {"given_name": "Dmitry", "family_name": "Volkinshtein", "institution": null}]}