{"title": "Multi-task Gaussian Process Learning of Robot Inverse Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 265, "page_last": 272, "abstract": "The inverse dynamics problem for a robotic manipulator is to compute the torques needed at the joints to drive it along a given trajectory; it is beneficial to be able to learn this function for adaptive control. A given robot manipulator will often need to be controlled while holding different loads in its end effector, giving rise to a multi-task learning problem. We show how the structure of the inverse dynamics problem gives rise to a multi-task Gaussian process prior over functions, where the inter-task similarity depends on the underlying dynamic parameters. Experiments demonstrate that this multi-task formulation generally improves performance over either learning only on single tasks or pooling the data over all tasks.", "full_text": "Multi-task Gaussian Process Learning of Robot\n\nInverse Dynamics\n\nKian Ming A. Chai\nSchool of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK\n\nChristopher K. I. Williams\n\nStefan Klanke\n\nSethu Vijayakumar\n\n{k.m.a.chai, c.k.i.williams, s.klanke, sethu.vijayakumar}@ed.ac.uk\n\nAbstract\n\nThe inverse dynamics problem for a robotic manipulator is to compute the torques\nneeded at the joints to drive it along a given trajectory; it is bene\ufb01cial to be able\nto learn this function for adaptive control. A robotic manipulator will often need\nto be controlled while holding different loads in its end effector, giving rise to a\nmulti-task learning problem. By placing independent Gaussian process priors over\nthe latent functions of the inverse dynamics, we obtain a multi-task Gaussian pro-\ncess prior for handling multiple loads, where the inter-task similarity depends on\nthe underlying inertial parameters. 
Experiments demonstrate that this multi-task\nformulation is effective in sharing information among the various loads, and gen-\nerally improves performance over either learning only on single tasks or pooling\nthe data over all tasks.\n\n1 Introduction\n\nThe inverse dynamics problem for a robotic manipulator is to compute the torques \u03c4 needed at the\njoints to drive it along a given trajectory, i.e. the motion speci\ufb01ed by the joint angles q(t), velocities\n\u02d9q(t) and accelerations \u00a8q(t), through time t. Analytical models for the inverse dynamics \u03c4 (q, \u02d9q, \u00a8q)\nare often infeasible, for example due to uncertainty in the physical parameters of the robot, or the\ndif\ufb01culty of modelling friction. This leads to the need to learn the inverse dynamics.\n\nA given robotic manipulator will often need to be controlled while holding different loads in its end\neffector. We refer to different loadings as different contexts. The inverse dynamics functions depend\non the different contexts. A simple approach is to learn a different mapping for each context, but\nit is more attractive if one can exploit commonality in these related tasks to improve performance,\ni.e. to carry out multi-task learning (MTL) [1, 2]. The aim of this paper is to show how this can be\ncarried out for the inverse dynamics problem using a multi-task Gaussian process (GP) framework.\n\nIn \u00a72 we discuss the relevant theory for the problem. Details of how we optimize the hyperparam-\neters of the multi-task GP are given in \u00a73, and model selection is described in \u00a74. Relationships to\nother work are discussed in \u00a75, and the experimental setup and results are given in \u00a76.\n\n2 Theory\n\nWe \ufb01rst describe the relationship of inverse dynamics functions among contexts in \u00a72.1. 
In \u00a72.2 we review the multi-task GP regression model proposed in [3], and in \u00a72.3 we describe how to derive a multi-task GP model for the inverse-dynamics problem.\n\n2.1 Linear relationship of inverse dynamics between contexts\n\nSuppose we have a robotic manipulator consisting of J joints, and a set of M loads. Figure 1 illustrates a six-jointed manipulator, with joint j connecting links j\u22121 and j.\n\nFigure 1: Schematic of the PUMA 560 without the end-effector (to be connected to joint 6). The six joints are the waist (joint 1), shoulder (joint 2), elbow (joint 3), wrist rotation (joint 4), wrist bend (joint 5) and flange (joint 6).\n\nFigure 2: A schematic diagram of how the different functions are related: the functions h_j and y_{jJ,1}, ..., y_{jJ,10} (repeated over joints j = 1...J) combine with the parameters 1, \u03c0^m_{J,1}, ..., \u03c0^m_{J,10} (repeated over contexts m = 1...M) to produce the torques \u03c4^m_j. A plate repeats its contents over the specified range.\n\nWe wish to learn the inverse dynamics model of the manipulator for the mth context, i.e. when it handles the mth load in its end-effector connected to the last link. We denote this by \u03c4^m(x) \u2208 R^J, with x := (q^T, q\u0307^T, q\u0308^T)^T \u2208 R^{3J}. It can be shown that the required torque for the jth joint can be written as [4]\n\n\u03c4^m_j(x) = \u03a3_{j'=j}^{J} y_{jj'}(x)^T \u03c0^m_{j'},   y_{jj'} : R^{3J} \u21a6 R^{10},   (1)\n\nwhere the y_{jj'}s are vector-valued functions of x, and \u03c0^m_{j'} \u2208 R^{10} is the vector of inertial parameters\u00b9 of the j'th joint when manipulating the mth load. The inertial parameters for a joint depend on the physical characteristics of its corresponding link (e.g. mass) and are independent of x.\n\nWhen, as in our case, the loads are rigidly attached to the end effector, each load may be considered as part of the last link, and thus modifies the inertial parameters for the last link only [5]. The parameters for the other links remain unchanged, since the parameters are local to the links and their frames. Denoting the common inertial parameters of the j'th link by \u03c0\u2022_{j'}, we can write\n\n\u03c4^m_j(x) = h_j(x) + y_{jJ}(x)^T \u03c0^m_J,   where h_j(x) := \u03a3_{j'=j}^{J\u22121} y_{jj'}(x)^T \u03c0\u2022_{j'}.   (2)\n\nDefine \u1ef9_j(x) := (h_j(x), y_{jJ}(x)^T)^T and \u03c0\u0303^m := (1, (\u03c0^m_J)^T)^T; then \u03c4^m_j(x) = \u1ef9_j(x)^T \u03c0\u0303^m. Note that the \u1ef9_js are shared among the contexts, while the \u03c0\u0303^ms are shared among the J joints, as illustrated in Figure 2. This decomposition is not unique, since given a non-singular 11\u00d711 matrix A_j, setting z_j(x) := A_j^{\u2212T} \u1ef9_j(x) and \u03c1^m_j := A_j \u03c0\u0303^m, we also have\n\n\u03c4^m_j(x) = \u1ef9_j(x)^T A_j^{\u22121} A_j \u03c0\u0303^m = z_j(x)^T \u03c1^m_j.   (3)\n\nHence the vector of parameters \u03c0\u0303^m is identifiable only up to a linear combination. Note that in general the matrix A_j may vary across the joints.\n\n2.2 Multi-task GP regression model\n\nWe give a brief summary of the multi-task Gaussian process (GP) regression model described in [3]. This model learns M related functions {f^m}_{m=1}^{M} by placing a zero-mean GP prior which directly induces correlations between tasks. Let t^m be the observation of the mth function at x.
Then the model is given by\n\nt^m ~ N(f^m(x), \u03c3^2_m),   \u27e8f^m(x) f^{m'}(x')\u27e9 := K^f_{mm'} k^x(x, x'),   (4)\n\nwhere k^x is a covariance function over inputs, K^f is a positive semi-definite (p.s.d.) matrix of inter-task similarities, and \u03c3^2_m is the noise variance for the mth task.\n\n2.3 Multi-task GP model for multiple contexts\n\nWe now show that the multi-task GP model can be used for inferring inverse dynamics for multiple contexts. We begin by placing independent zero-mean GP priors on all the component functions of z_1(\u00b7), ..., z_J(\u00b7). Let \u03b1 be an index into the elements of the vector function z_j(\u00b7); then our prior is\n\n\u27e8z_{j\u03b1}(x) z_{j'\u03b1'}(x')\u27e9 = \u03b4_{jj'} \u03b4_{\u03b1\u03b1'} k^x_j(x, x').   (5)\n\n\u00b9We may also formulate our model using the more general vector of dynamic parameters, which also includes the friction parameters, motor inertia etc. However, these additional parameters are independent of the load, and so can be absorbed into the function h_j in eq. 2.\n\nIn addition to the independence specified by the Kronecker delta functions \u03b4, this model also imposes the constraint that all component functions for a given joint j share the same covariance function k^x_j(\u00b7, \u00b7). With this prior over the z_js, the Gaussian process prior for \u03c4^m_j(\u00b7) is given by\n\n\u27e8\u03c4^m_j(x) \u03c4^{m'}_{j'}(x')\u27e9 = \u03b4_{jj'} (K^\u03c1_j)_{mm'} k^x_j(x, x'),   (6)\n\nwhere we have set P_j := (\u03c1^1_j | \u00b7\u00b7\u00b7 | \u03c1^M_j) and K^\u03c1_j := P_j^T P_j, so that (K^\u03c1_j)_{mm'} = (\u03c1^m_j)^T \u03c1^{m'}_j. The (m, m')th entry of the positive semi-definite matrix K^\u03c1_j defines the similarity between different contexts. The rank of K^\u03c1_j is the rank of P_j, and is upper bounded by min(M, 11), reflecting the fact that there are at most 11 underlying latent functions (see Figure 2).\n\nLet t^m_j(x) be the observed value of \u03c4^m_j(x). The deviations from \u03c4^m_j(x) may be modelled with t^m_j(x) ~ N(\u03c4^m_j(x), (\u03c3^m_j)^2), though in practice we let \u03c3^1_j \u2261 ... \u2261 \u03c3^M_j \u2261 \u03c3_j, sharing the variance parameters among the contexts. This completes the correspondence with the multi-task GP model in eq. 4. Note, however, that in this case we have J multi-task GP models, one for each joint. This model is a simple and convenient one where the prior, likelihood and posterior factorize over joints. Hence inference and hyperparameter learning can be done separately for each joint.\n\nMaking predictions  As in [3], inference in our model can be done by using the standard GP formulae for the mean and variance of the predictive distribution with the covariance function given in eq. 6 together with the normal noise model. The observations over all contexts for a given joint j will be used to make the predictions. For the case of complete data (where there are observations at the same set of x-values for all contexts) one can exploit the Kronecker-product structure [3, eq. 2].\n\n2.3.1 The relationship among task similarity matrices\n\nLet \u03a0\u0303 := (\u03c0\u0303^1 | \u00b7\u00b7\u00b7 | \u03c0\u0303^M). Recall that \u03c0\u0303^m is an 11-dimensional vector. However, if the different loads in the end effector do not explore the full space (e.g. if some of the inertial parameters are constant over all loads), then it can happen that s := rank(\u03a0\u0303) \u2264 min(M, 11).\n\nIt is worthwhile to investigate the relationship between K^\u03c1_j and K^\u03c1_{j'}, j \u2260 j'. Recall from eq. 3 that \u03c1^m_j := A_j \u03c0\u0303^m, where A_j is a full-rank square matrix. This gives P_j = A_j \u03a0\u0303 and K^\u03c1_j = \u03a0\u0303^T A_j^T A_j \u03a0\u0303, so that rank(K^\u03c1_j) = rank(\u03a0\u0303). Therefore the K^\u03c1_js have the same rank for all joints, although their exact values may differ. This observation will be useful for model selection in \u00a74.\n\n3 Learning the hyperparameters \u2014 a staged optimization heuristic\n\nIn this section, we drop the joint index j for the sake of brevity and clarity. The following applies separately for each joint. Let t^m be the vector of n_m observed torques at the joint for context m, and X^m be the corresponding 3J\u00d7n_m design matrix. Further, let X be the 3J\u00d7N design matrix of distinct x-configurations observed over all M contexts. Given this data, we wish to optimize the marginal likelihood L(\u03b8^x, K^\u03c1, \u03c3^2) := p({t^m}_{m=1}^{M} | X, \u03b8^x, K^\u03c1, \u03c3^2), where \u03b8^x are the parameters of k^x. As pointed out in [3], one may approach this either using general gradient-based optimization, or using expectation-maximization. In this paper, the former is used.\n\nIn general, the objective function L(\u03b8^x, K^\u03c1, \u03c3^2) will have multiple modes, and it is a difficult problem to locate the best mode. We propose a staged strategy during optimization to help localize the search region.
This is outlined below, with details given in the subsections that follow.\n\nRequire: Starting positions \u03b8^x_0, K^\u03c1_0, \u03c3^2_0, and rank r.\n{All arg max operations are understood to find only a local maximum.}\n1: Starting from \u03b8^x_0 and \u03c3^2_0, find (\u03b8^x_1, \u03c3^2_1) = arg max_{\u03b8^x, \u03c3^2} L(\u03b8^x, K^\u03c1_0, \u03c3^2).\n2: Calculate K^\u03c1_1 based on details in \u00a73.2.\n3: Starting from \u03b8^x_1, K^\u03c1_1 and \u03c3^2_1, find (\u03b8^x_ans, K^\u03c1_ans, \u03c3^2_ans) = arg max_{\u03b8^x, K^\u03c1, \u03c3^2} L(\u03b8^x, K^\u03c1, \u03c3^2).\n\nThe optimization order reflects the relative importance of the different constituents of the model. The most important is k^x, hence the estimation of \u03b8^x begins in step 1; the least important is \u03c3^2, hence its estimation from the initial value \u03c3^2_0 is deferred to step 3. For our application, we find that this strategy works better than one which simultaneously optimizes all the parameters.\n\n3.1 The initial choice of K^\u03c1\n\nThe choice of K^\u03c1_0 is important, since it affects the search very early on. Reasonable values that admit ready interpretations are the matrix of ones 11^T and the identity matrix I. For K^\u03c1_0 = 11^T, we initially assume the contexts to be indistinguishable from each other; while for K^\u03c1_0 = I, we initially assume the contexts to be independent given the kernel parameters, which is a multi-task learning model that has been previously explored, e.g. [6]. These two are at the opposite extremes in the spectrum of inter-context/task correlation, and we believe the merit of each will be application dependent. Since these two models have the same number of free parameters, we select the one with the higher likelihood as the starting point for the search in step 2.
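This likelihood-based choice between K^\u03c1_0 = 11^T and K^\u03c1_0 = I can be illustrated numerically. Below is a minimal sketch on synthetic toy data (the sizes, kernel and noise level are ours, not the paper's): when the contexts share a common underlying function, the pooled initialization attains the higher marginal likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 50                        # contexts and training points (toy sizes)
X = rng.uniform(-1.0, 1.0, N)       # stand-in for the (q, qdot, qddot) inputs
f = np.sin(3.0 * X)                 # a torque signal shared by all contexts
T = np.stack([f + 0.05 * rng.standard_normal(N) for _ in range(M)])
y = T.reshape(-1)                   # targets stacked context-major

K_x = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)   # squared exponential

def log_marglik(K_rho, K_x, y, noise_var):
    """log N(y | 0, K_rho kron K_x + noise_var * I), for complete data."""
    Sigma = np.kron(K_rho, K_x) + noise_var * np.eye(y.size)
    L = np.linalg.cholesky(Sigma)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * y.size * np.log(2.0 * np.pi))

ll_pool = log_marglik(np.ones((M, M)), K_x, y, 0.05 ** 2)   # K_rho_0 = 11^T
ll_indep = log_marglik(np.eye(M), K_x, y, 0.05 ** 2)        # K_rho_0 = I
start = "ones" if ll_pool > ll_indep else "identity"
```

With near-identical contexts the pooled model explains the shared signal once and attributes the small deviations to noise, so `start` comes out as `"ones"`; data with genuinely unrelated contexts would favour the identity instead.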
However, we note that in some applications there may be reasons to prefer one over the other.\n\n3.2 Computation of K^\u03c1_1 in step 2\n\nGiven estimates \u03b8^x_1 and \u03c3^2_1, we wish to estimate a K^\u03c1_1 from which the likelihood can be optimized in step 3. Here we give the sequence of considerations that leads to a formula for computing K^\u03c1_1. Let K^x_1 be the covariance matrix for all pairs in X, using \u03b8^x_1 for k^x. Let T be an N\u00d7M matrix which corresponds to the true values of the torque function \u03c4^m(x_i) for m = 1, ..., M and i = 1, ..., N. Then as per the EM step discussed in [3, eq. 4], we have\n\nK^\u03c1_EM = N^{\u22121} \u27e8T^T (K^x_1)^{\u22121} T\u27e9_{\u03b8\u03030} \u2243 N^{\u22121} \u27e8T\u27e9^T_{\u03b8\u03030} (K^x_1)^{\u22121} \u27e8T\u27e9_{\u03b8\u03030},   (7)\n\nwhere the expectations are taken w.r.t. a GP with parameters \u03b8\u03030 = (\u03b8^x_1, K^\u03c1_0, \u03c3^2_1), and the (i, m)th entry of \u27e8T\u27e9_{\u03b8\u03030} is the mean of \u03c4^m(x_i) under this GP. The approximation neglects the GP's variance; this is justifiable since the current aim is to obtain a starting estimate of K^\u03c1 for a search procedure.\n\nThere are two weaknesses with eq. 7 that we shall address. The first is that the rank of \u27e8T\u27e9_{\u03b8\u03030} is upper bounded by that of K^\u03c1_0, so that the rank of K^\u03c1_EM is similarly upper bounded.\u00b2 This property is undesirable, particularly when K^\u03c1_0 = 11^T. We ameliorate this by replacing \u27e8\u03c4^m(x_i)\u27e9_{\u03b8\u03030} with the corresponding observed value t^m(x_i) wherever it is available, and call the resultant matrix T_aug. The second weakness is that with the commonly used covariance functions, K^x_1 will typically have rapidly decaying eigenvalues [7, \u00a74.3.1]. To overcome this, we regularize its inversion by adding \u03b7^2 I to the diagonal of K^x_1, to give K^\u03c1_aug = N^{\u22121} T_aug^T (K^x_1 + \u03b7^2 I)^{\u22121} T_aug. We set \u03b7^2 to tr(T_aug^T T_aug)/(MN), so that tr(K^\u03c1_aug) = M if K^x_1 were the zero matrix.\n\nFinally, the required K^\u03c1_1 is obtained from K^\u03c1_aug by constraining it to have rank r. This is currently achieved by computing the eigen-decomposition of K^\u03c1_aug and keeping only the top r eigenvectors/values; it could also be implemented using an incomplete Cholesky decomposition.\n\n3.3 Incorporating a novel task\n\nAbove we have assumed that data from all contexts is available at training time. However, we may encounter a new context for which we have not seen much data. In this case we fix \u03b8^x and \u03c3^2 while extending K^\u03c1 by an extra row and column for the new context, and it is only this new border which needs to be learned by maximising the marginal likelihood. Note that as K^\u03c1 is p.s.d., this means learning at most M new parameters, or fewer if we exploit the rank-constraint property of K^\u03c1.\n\n4 Model selection\n\nThe choice of the rank r of K^\u03c1_j in the model is important, since it reflects the rank s of \u03a0\u0303. In our model, r is not a hyperparameter to be optimized. Thus to infer its value we rely on an information criterion to select the most parsimonious correct model.
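The rank-r constraint referred to here is imposed on K^\u03c1_aug as described in \u00a73.2. The following numpy sketch illustrates that computation and truncation; the torque matrix T_aug is a random low-rank stand-in and all sizes are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, r = 40, 15, 4                 # points, contexts, target rank (toy values)
xs = np.sort(rng.uniform(-1.0, 1.0, N))
K_x1 = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2)   # K^x_1 on all of X

# stand-in for T_aug (posterior means overwritten by observations where known)
T_aug = rng.standard_normal((N, 6)) @ rng.standard_normal((6, M))

# eta^2 = tr(T_aug^T T_aug) / (M N), so that tr(K_aug) = M if K^x_1 were zero
eta2 = np.trace(T_aug.T @ T_aug) / (M * N)
K_aug = T_aug.T @ np.linalg.solve(K_x1 + eta2 * np.eye(N), T_aug) / N

# keep only the top-r eigenpairs to obtain a p.s.d. K^rho_1 of rank r
w, V = np.linalg.eigh(K_aug)        # eigenvalues in ascending order
K_rho_1 = (V[:, -r:] * w[-r:]) @ V[:, -r:].T
```

An incomplete Cholesky factorization of K_aug would serve equally well for the truncation, as the text notes.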
Here, we use the Bayesian Information Criterion (BIC), but the use of the Akaike or Hannan-Quinn criteria is similar.\n\nLet L_jr be the likelihood for each joint at the optimized hyperparameters \u03b8^x_j, K^\u03c1_j and \u03c3^2_j, when K^\u03c1_j is constrained to have rank r; let n^m_j be the number of observations for the jth joint in the mth context, and n := \u03a3_{j,m} n^m_j be the total number of observations; and let d_j be the dimensionality of \u03b8^x_j. Since the likelihood of the model factorizes over joints, we have\n\nBIC(r) = \u22122 \u03a3_{j=1}^{J} log L_jr + (\u03a3_{j=1}^{J} d_j + (J/2) r(2M + 1 \u2212 r) + J) log n,   (8)\n\nwhere r(2M + 1 \u2212 r)/2 is the number of parameters needed to define an incomplete Cholesky decomposition of rank r for an M\u00d7M matrix. For selecting the appropriate rank of the K^\u03c1_js, we compute and compare BIC(r) for different values of r.\n\n\u00b2This is not due to our approximation; indeed, it can be shown that the rank of K^\u03c1_EM is upper bounded by that of K^\u03c1_0 even if the exact EM update in eq. 7 has been used.\n\n5 Relationships to other work\n\nWe consider related work first with regard to the inverse dynamics problem, and then to multi-task learning with Gaussian processes.\n\nLearning methods for the single-context inverse dynamics problem can be found in e.g. [8], where the locally weighted projection regression (LWPR) method is used. Gaussian process methods for the same problem have also been shown to be effective [7, \u00a72.5; 9]. The LWPR method has been extended to the multi-context situation by Petkos and Vijayakumar [5]. If the inertial parameters \u03c0^m_J are known for at least 11 contexts, then the estimated torque functions can be used to estimate the underlying y_{jj'}s using linear regression, and prediction in a novel context (with limited training data) will depend on estimating the inertial parameters for that context.
Assuming the original\nestimated torque functions are imperfect, having more than 11 models for distinct known inertial\nparameters will improve load estimation. If the inertial parameters are unknown, the novel torque\nfunction can still be represented as a linear combination of a set of 11 linearly independent torque\nfunctions, and so one can estimate the inverse dynamics in a novel context by linear regression on\nthose estimated functions. In contrast to the known case, however, no more than 11 models can be\nused [5, \u00a7V]. Another difference between known and unknown parameters is that in the former case\nthe resulting \u03c0m\n\nJ s are interpretable, while in the latter there is ambiguity due to the Ajs in eq. 3.\n\nComparing our approach with [5], we note that: (a) their approach does not exploit the knowledge\nthat the torque functions for the different contexts are known to share latent functions as in eq. 2,\nand thus it may be useful to learn the M inverse dynamics models jointly. This is expected to be\nparticularly advantageous when the data for each task explores rather different portions of x-space;\n(b) rather than relying on least-squares methods (which assume equal error variances everywhere),\nour fully probabilistic model will propagate uncertainties (co-variances for jointly Gaussian models)\nautomatically; and (c) eq. 6 shows that we do not need to be limited to exactly 11 reference contexts,\neither fewer or more than 11 can be used. On the other hand, using the LWPR methods will generally\ngive rise to better computational scaling for large data-sets (although see approximate GP methods\nin [7, ch. 8]), and are perhaps less complex than the method in this paper.\n\nEarlier work on multiple model learning such as Multiple Model Switching and Tuning (MMST)\n[10] uses an inverse dynamics model and a controller for each context, switching among the models\nto the one producing the most accurate predictions. 
The models are linear-in-the-parameters with known non-linear regressor functions of x, and the number of models is assumed known. MMST involves very little dynamics learning, estimating only the linear parameters of the models. A closely related approach is Modular Selection and Identification for Control (MOSAIC) [11], which uses inverse dynamics models for control and forward dynamics models for context identification. However, MOSAIC was developed and tested on linear dynamics models, without the insights into how eq. 1 may be used across contexts for more efficient and robust learning and control.\n\nEarly references to general multi-task learning are [1] and [2]. There has been much work in recent years on MTL with e.g. neural networks, Dirichlet processes, Gaussian processes and support vector machines. Some previous models using GPs are summarized in [3]. An important related work is the semiparametric latent factor model [12], which has a number of latent processes which are linearly combined to produce observable functions as in eq. 3. However, in our model all the latent functions share a common covariance function, which reduces the number of free parameters and should thus help to reduce over-fitting. Also we note that the regression experiments by Teh et al. [12, \u00a74] used a forward dynamics problem on a four-jointed robot arm for a single context, with an artificial linear mixing of the four target joint accelerations to produce six response variables. In contrast, we have shown how linear mixing arises naturally in a multi-context inverse dynamics situation. In relation to the work by Bonilla et al. [3] described in section 2.2, we note that the factorization between the inter-task similarity K^f and a common covariance function k^x is an assumption there, while we have shown that such a decomposition is inherent in our application.\n\nFigure 3: The four paths p1, p2, p3, p4. The robot base is located at (0, 0, 0).\n\nTable 1: The trajectories at which the training samples for each load are acquired. All loads have training samples from the common trajectory (p2, s3). For the multiple-contexts setting, c15, and hence (p4, s4), is not used for training.\n\n      s1     s2     s3            s4\np1    c1     c7     c13           c14\np2    c6     c12    c1 \u00b7\u00b7\u00b7 c15    c5\np3    c11    c3     c4            c10\np4    c2     c8     c9            c15*\n\nTable 2: The average nMSEs of the predictions by LR and sGP, for joint 3 and for both kinds of test sets. Training set sizes are given in the second row. The nMSEs are averaged over loads c1 . . . c15.\n\n       average nMSE, interp_m sets        average nMSE, extrap_m sets\n       20      170     1004    4000       20      170     1004    4000\nLR     1\u00d710\u22121  7\u00d710\u22124  6\u00d710\u22124  6\u00d710\u22124     5\u00d710\u22121  2\u00d710\u22121  2\u00d710\u22121  2\u00d710\u22121\nsGP    1\u00d710\u22122  2\u00d710\u22127  2\u00d710\u22128  3\u00d710\u22129     1\u00d710\u22121  3\u00d710\u22122  4\u00d710\u22123  3\u00d710\u22123\n\n6 Experiments\n\nData  We investigate the effectiveness of our model with the Puma 560 (Figure 1), which has J = 6 degrees of freedom. We learn the inverse dynamics models of this robot manipulating M = 15 different loads c1, . . . , c15 through four different figure-of-eight paths at four different speeds. The data for our experiments is obtained using a realistic simulation package [13], which models both Coulomb and viscous frictional forces. Figure 3 shows the paths p1, . . . , p4, which are placed at 0.35m, 0.45m, 0.55m and 0.65m along the x-axis, at 0.36m, 0.40m, 0.44m and 0.48m along the z-axis, and rotated about the z-axis by \u221210\u00b0, 0\u00b0, 10\u00b0 and 20\u00b0. There are four speeds s1, . . . 
, s4,\n\ufb01nishing a path in 20s, 15s, 10s and 5s respectively.\nIn general, loads can have very different\nphysical characteristics; in our case, this is done by representing each load as a cuboid with differing\ndimensions and mass, and attaching each load rigidly to a random point at the end-effector. The\nmasses range evenly from 0.2kg for c1 to 3.0kg for c15; details of the other parameters are omitted\ndue to space constraints.\nFor each load cm, 4000 data points are sampled at regular intervals along the path for each path-speed\n(trajectory) combination (p\u00b7, s\u00b7). Each sample is the pair (t, x), where t \u2208 RJ are the observed\ntorques at the joints, and x \u2208 R3J are the joint angles, velocities and accelerations. This set of data\nis partitioned into train and test sets in the manner described below.\n\nAcquiring training data combinatorially by sampling for every possible load-trajectory pair may be\nprohibitively expensive. One may imagine, however, that training data for the handling of a load can\nbe obtained along a \ufb01xed reference trajectory Tr for calibration purposes, and also along a trajectory\ntypical for that load, say Tm for the mth load. Thus, for each load, 2000 random training samples\nare acquired at a common reference trajectory Tr = (p2, s3), and an additional 2000 random training\nsamples are acquired at a trajectory unique to each load; Table 1 gives the combinations. Therefore\neach load has a training set of 4000 samples, but acquired only on two different trajectories.\nFollowing [14], two kinds of test sets are used to assess our models for (a) control along a repeated\ntrajectory (which is of practical interest in industry), and (b) control along arbitrary trajectories\n(which is of general interest to roboticists). The test for (a) assesses the accuracy of torque predic-\ntions for staying within the trajectories that were used for training. 
In this case, the test set for load\ncm, denoted by interpm for interpolation, consists of the rest of the samples from Tr and Tm that are\nnot used for training. The test for (b) assesses the accuracy also for extrapolation to trajectories not\n\n\fsampled for training. The test set for this, denoted by extrapm, consists of all the samples that are\nnot training samples for cm.\nIn addition, we consider a data-poor scenario, and investigate the quality of the models using ran-\ndomly selected subsets of the training data. The sizes of these subsets range from 20 to 4000.\n\nResults comparing GP with linear regression We \ufb01rst compare learning the inverse dynamics\nwith Bayesian linear regression (LR) to learning with single-task Gaussian processes (sGP). For each\ncontext and each joint, we train a LR model and a sGP model with the corresponding training data\nseparately. For LR, the covariates are (x, sgn( \u02d9q), 1), where sgn(\u00b7) is the component-wise signum\nof its arguments; regression coef\ufb01cients \u03b2 and noise variance \u03c32 are given a broad normal-inverse-\ngamma prior p(\u03b2, \u03c32) \u2261 N (\u03b2|0, \u03c32 \u00b7 108I)IG(\u03c32|1, 1), though note that the mean predictions do\nnot depend on the parameters of the inverse-gamma prior on \u03c32. The covariance function of each\nsGP model is a sum of an inhomogeneous linear kernel on (x, sgn( \u02d9q)), a squared exponential kernel\non x, and an independent noise component [7, \u00a74.2], with the \ufb01rst two using the automatic relevance\ndetermination parameterization [7, \u00a75.1]. The hyperparameters of sGP are initialized by giving equal\nweightings among the covariates and among the components of the covariance function, and then\nlearnt by optimizing the marginal likelihood independently for each context and each joint.\nThe trained LR and sGP models are used to predict torques for the interpm and extrapm data sets. 
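The composite sGP covariance described above (an inhomogeneous linear kernel on (x, sgn(q\u0307)), an ARD squared-exponential kernel on x, and independent noise) can be sketched as follows; the parameter names and toy values are ours, not the paper's.

```python
import numpy as np

def sgp_cov(X, Qdot, w_lin, ell, sf2, sn2):
    """Sum of an inhomogeneous linear kernel on (x, sgn(qdot)), an ARD
    squared-exponential kernel on x, and independent noise. A sketch only:
    the parameterization (w_lin, ell, sf2, sn2) is an assumption."""
    A = np.hstack([X, np.sign(Qdot), np.ones((len(X), 1))])  # inhomogeneous term
    K_lin = (A * w_lin) @ A.T                 # A diag(w_lin) A^T, p.s.d.
    D2 = (((X[:, None, :] - X[None, :, :]) / ell) ** 2).sum(-1)
    K_se = sf2 * np.exp(-0.5 * D2)            # ARD squared exponential on x
    return K_lin + K_se + sn2 * np.eye(len(X))

rng = np.random.default_rng(2)
n, J = 8, 6                                   # toy sizes; J = 6 for the PUMA 560
X = rng.standard_normal((n, 3 * J))           # rows are (q, qdot, qddot)
Qdot = X[:, J:2 * J]                          # the velocity block of x
K = sgp_cov(X, Qdot, np.ones(3 * J + J + 1), np.ones(3 * J), 1.0, 0.01)
```

Initializing all weights and length-scales to ones mirrors the equal-weighting initialization described above; the noise term keeps the matrix strictly positive definite.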
For\neach test set, the normalized mean square error (nMSE) of the predictions are computed, by dividing\nthe MSE by the variance of the test data. The nMSEs are then averaged over the 15 contexts for\nthe interpm and extrapm tests. Table 2 shows how the averages for joint 3 vary with the number\nof training samples. Similar relative results are obtained for the other joints. The results show that\nsGP outperforms LR for both the test cases. As one would expect, the errors of LR level-off early\nat around 200 training samples, while the quality of predictions by sGP continues to improve with\ntraining sample size, especially so for the interpm sets. Both sGP and LR do reasonably well on the\ninterpm sets, but not so well on the extrapm sets. This suggests that learning from multiple contexts\nwhich have training data from different parts of the trajectory space will be advantageous.\n\nResults for multi-task GP We now investigate the merit of using MTL, using the training data\ntabulated in Table 1 for loads c1, . . . , c14. We use n to denote the number of observed torques for\neach joint totalled across the 14 contexts. Note that trajectory (p4, s4) is entirely unobserved during\nlearning, but is included in the extrapm sets. We learn the hyperparameters of a multi-task GP model\n(mGP) for each joint by optimizing the marginal likelihood for all training data (accumulated across\ncontexts) for that joint, as discussed in \u00a73, using the same kernel and parameterization as for the\nsGP. This is done for ranks 2, 4, 5, 6, 8 and 10. Finally, a common rank r for all the joints is chosen\nusing the selection criterion given in \u00a74. 
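The rank selection just described amounts to evaluating eq. 8 for each candidate r and keeping the minimizer. A small sketch, transcribing eq. 8 directly; the per-joint likelihood values and parameter counts below are placeholders, not the paper's results:

```python
import numpy as np

def bic(logliks, d, J, M, n, r):
    """Eq. (8): BIC(r) = -2 sum_j log L_jr
       + (sum_j d_j + J * r * (2M + 1 - r) / 2 + J) * log n."""
    n_params = sum(d) + J * r * (2 * M + 1 - r) / 2.0 + J
    return -2.0 * sum(logliks) + n_params * np.log(n)

J, M, n = 6, 14, 3724                 # joints, contexts, total observations (toy n)
d = [20] * J                          # hypothetical kernel-parameter counts d_j
# fabricated per-joint log likelihoods for two candidate ranks (illustrative)
scores = {2: bic([100.0] * J, d, J, M, n, 2),
          4: bic([160.0] * J, d, J, M, n, 4)}
best_r = min(scores, key=scores.get)  # smaller BIC is better
```

Here the modest likelihood gain from r = 4 does not pay for its extra r(2M+1-r)/2 parameters per joint, so the criterion selects the smaller rank.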
We denote the selected set of mGP models by mGP-BIC.\n\nIn addition to comparing with sGP, we also compare mGP-BIC with two other na\u00a8\u0131ve schemes: (a)\ndenoted by iGP, a collection of independent GPs for the contexts, but sharing kernel parameters of\nj among the contexts; and (b) denoted by pGP, a single GP for each joint that learns by pooling\nkx\nall training data from all the contexts. The iGP and pGP models can be seen as restrictions of the\nmulti-task GP model, restricting K \u03c1\nj to the identity matrix I and the matrix of ones 11T respectively.\nAs discussed in \u00a73, the hyperparameters for the mGPs are initialized to either those of pGP or those\nof iGP during optimization, choosing the one with the higher marginal likelihood. For our data,\nwe \ufb01nd that the choice is mostly iGP; pGP is only chosen for the case of joint 1 and n < 532. In\naddition, the chosen ranks based on the BIC are r = 4 for all cases of n, except for n = 476 and\nn = 1820 when r = 5 is selected instead.\nFigure 4 gives results of sGP, iGP, pGP and mGP-BIC for both the interpm and extrapm test sets,\nand for joints 1 and 4. Plots for the other joints are omitted due to space constraints, but they are\nqualitatively similar to the plots for joint 4. The plots are the average nMSEs over the 14 contexts\nagainst n. The vertical scales of the plots indicate that extrapolation is at least an order of magnitude\nharder than interpolation. Since the training data are subsets selected independently for the different\nvalues of n, the plots re\ufb02ect the underlying variability in sampling. Nevertheless, we can see that\nmGP-BIC performs favorably in almost all the cases, and especially so for the extrapolation task.\nFor joint 1, we see a close match between the predictive performances of mGP-BIC and pGP, with\nmGP-BIC slightly better than pGP for the interpolation task. 
This is due to the limited variation among observed torques for this joint across the different contexts for the range of end-effector movements investigated here. Therefore it is not surprising that pGP produces good predictions for joint 1. For the other joints, iGP is usually the next best after mGP-BIC. In particular, iGP is better than sGP, showing that (in this case) combining all the data to estimate the parameters of a single common covariance function is better than separating the data to estimate the parameters of 14 covariance functions.

[Figure 4: Average nMSEs of sGP, iGP, pGP and mGP-BIC against n (on log2 scale): (a) joint 1, interpm tests; (b) joint 1, extrapm tests; (c) joint 4, interpm tests; (d) joint 4, extrapm tests. Ticks on the x-axes represent the specified values of n (280, 532, 896, 1820). The vertical scales of the plots vary; a value above the upper limit of its vertical range is plotted at a nominal value near the top instead.]

7 Summary

We have shown how the structure of the multiple-context inverse dynamics problem maps onto a multi-task GP prior as given in eq. 6, how the corresponding marginal likelihood can be optimized effectively, and how the rank of the K^ρ_j matrices can be chosen. We have demonstrated experimentally that the results of the multi-task GP method (mGP) are generally superior to sGP, iGP and pGP. Therefore it is advantageous to learn inverse dynamics models jointly using mGP-BIC, especially when each context/task explores different portions of the data space, a common case in dynamics learning.
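The rank selection can be sketched as a BIC comparison across candidate ranks. This is a generic sketch with our own function names; the exact parameter count for a rank-r K^ρ_j and the precise form of the criterion follow the paper's §4, which we do not reproduce here:

```python
import numpy as np

def bic(log_marginal_likelihood, n_params, n_data):
    """Standard Bayesian Information Criterion (lower is better)."""
    return -2.0 * log_marginal_likelihood + n_params * np.log(n_data)

def choose_common_rank(log_liks, n_params, n_data):
    """log_liks[r]: log marginal likelihood summed over joints at rank r;
    n_params[r]: total hyperparameter count at that rank.
    Returns the rank minimizing the BIC score."""
    scores = {r: bic(log_liks[r], n_params[r], n_data) for r in log_liks}
    return min(scores, key=scores.get)
```

Higher ranks always fit at least as well, so the log n penalty on the extra parameters is what lets a common, moderate rank (r = 4 or 5 in the experiments above) be selected.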
In future work we would like to investigate if coupling learning over joints is beneficial.

Acknowledgments

We thank Sam Roweis for suggesting pGP as a baseline. This work is supported in part by the EU PASCAL2 ICT Programme, and in part by the EU FP6 SENSOPAC project grant to SV and SK. KMAC would also like to thank DSO NL for financial support.

References
[1] R. Caruana. Multitask Learning. Machine Learning, 28(1), July 1997.
[2] S. Thrun and L. Pratt, editors. Learning to Learn. Kluwer Academic Publishers, 1998.
[3] E. Bonilla, K. M. A. Chai, and C. K. I. Williams. Multi-task Gaussian Process Prediction. NIPS 20, 2008.
[4] L. Sciavicco and B. Siciliano. Modelling and Control of Robot Manipulators. Springer, 2000.
[5] G. Petkos and S. Vijayakumar. Load estimation and control using learned dynamics models. IROS, 2007.
[6] T. P. Minka and R. W. Picard. Learning How to Learn is Learning with Point Sets, 1997. URL http://research.microsoft.com/~minka/papers/point-sets.html. Revised 1999.
[7] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[8] S. Vijayakumar and S. Schaal. LWPR: An O(n) Algorithm for Incremental Real Time Learning in High Dimensional Space. ICML 2000, 2000.
[9] D. Nguyen-Tuong, J. Peters, and M. Seeger. Computed torque control with nonparametric regression models. ACC 2008, 2008.
[10] M. Kemal Cılız and K. S. Narendra. Adaptive control of robotic manipulators using multiple models and switching. Int. J. Rob. Res., 15(6):592–610, 1996.
[11] M. Haruno, D. M. Wolpert, and M. Kawato. MOSAIC Model for Sensorimotor Learning and Control. Neural Comp., 13(10):2201–2220, 2001.
[12] Y. W. Teh, M. Seeger, and M. I. Jordan. Semiparametric latent factor models. 10th AISTATS, 2005.
[13] P. I. Corke. A robotics toolbox for MATLAB. IEEE Rob. and Auto. Magazine, 3(1):24–32, 1996.
[14] E. Burdet and A. Codourey. Evaluation of parametric and nonparametric nonlinear adaptive controllers. Robotica, 16(1):59–73, 1998.