{"title": "Incremental Local Gaussian Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 972, "page_last": 980, "abstract": "Locally weighted regression (LWR) was created as a nonparametric method that can approximate a wide range of functions, is computationally efficient, and can learn continually from very large amounts of incrementally collected data. As an interesting feature, LWR can regress on non-stationary functions, a beneficial property, for instance, in control problems. However, it does not provide a proper generative model for function values, and existing algorithms have a variety of manual tuning parameters that strongly influence bias, variance and learning speed of the results. Gaussian (process) regression, on the other hand, does provide a generative model with rather black-box automatic parameter tuning, but it has higher computational cost, especially for big data sets and if a non-stationary model is required. In this paper, we suggest a path from Gaussian (process) regression to locally weighted regression, where we retain the best of both approaches. Using a localizing function basis and approximate inference techniques, we build a Gaussian (process) regression algorithm of increasingly local nature and similar computational complexity to LWR. 
Empirical evaluations are performed on several synthetic and real robot datasets of increasing complexity and (big) data scale, and demonstrate that we consistently achieve on par or superior performance compared to current state-of-the-art methods while retaining a principled approach to fast incremental regression with minimal manual tuning parameters.", "full_text": "Incremental Local Gaussian Regression\n\nFranziska Meier1 (fmeier@usc.edu), Philipp Hennig2 (phennig@tue.mpg.de), Stefan Schaal1,2 (sschaal@usc.edu)\n1University of Southern California, Los Angeles, CA 90089, USA\n2Max Planck Institute for Intelligent Systems, Spemannstraße 38, Tübingen, Germany\n\nAbstract\n\nLocally weighted regression (LWR) was created as a nonparametric method that can approximate a wide range of functions, is computationally efficient, and can learn continually from very large amounts of incrementally collected data. As an interesting feature, LWR can regress on non-stationary functions, a beneficial property, for instance, in control problems. However, it does not provide a proper generative model for function values, and existing algorithms have a variety of manual tuning parameters that strongly influence bias, variance and learning speed of the results. Gaussian (process) regression, on the other hand, does provide a generative model with rather black-box automatic parameter tuning, but it has higher computational cost, especially for big data sets and if a non-stationary model is required. In this paper, we suggest a path from Gaussian (process) regression to locally weighted regression, where we retain the best of both approaches. Using a localizing function basis and approximate inference techniques, we build a Gaussian (process) regression algorithm of increasingly local nature and similar computational complexity to LWR. 
Empirical evaluations are performed on several synthetic and real robot datasets of increasing complexity and (big) data scale, and demonstrate that we consistently achieve on par or superior performance compared to current state-of-the-art methods while retaining a principled approach to fast incremental regression with minimal manual tuning parameters.\n\n1 Introduction\n\nBesides accuracy and sample efficiency, computational cost is a crucial design criterion for machine learning algorithms in real-time settings, such as control problems. An example is the modeling of robot dynamics: The sensors in a robot can produce thousands of data points per second, quickly amassing a coverage of the task-related workspace, but what really matters is that the learning algorithm incorporates this data in real time, as a physical system cannot necessarily stop and wait in its control – e.g., a biped would simply fall over. Thus, a learning method in such settings should produce a good local model in fractions of a second, and be able to extend this model as the robot explores new areas of a very high dimensional workspace that can often not be anticipated by collecting "representative" training data. Ideally, it should rapidly produce a good (local) model from a large number N of data points by adjusting a small number M of parameters. In robotics, local learning approaches such as locally weighted regression [1] have thus been favored over global approaches such as Gaussian process regression [2] in the past.\n\nLocal regression models approximate the function in the neighborhood of a query point x*. Each local model's region of validity is defined by a kernel. Learning the shape of that kernel [3] is the key component of locally weighted learning. Schaal & Atkeson [4] introduced a non-memory-based version of LWR to compress large amounts of data into a small number of parameters. 
Instead of keeping data in memory and constructing local models around query points on demand, their algorithm incrementally compresses data into M local models, where M grows automatically to cover the experienced input space of the data. Each local model can have its own distance metric, allowing local adaptation to local characteristics like curvature or noise. Furthermore, each local model is trained independently, yielding a highly efficient parallelizable algorithm. Both its local adaptiveness and its low computation cost (linear, O(NM)) have made LWR feasible and successful in control learning. The downside is that LWR requires several tuning parameters, whose optimal values can be highly data dependent. This is at least partly a result of the strongly localized training, which does not allow models to 'coordinate', or to benefit from other local models in their vicinity.\n\nGaussian process regression (GPR) [2], on the other hand, offers principled inference for hyperparameters, but at high computational cost. Recent progress in sparsifying Gaussian processes [5, 6] has resulted in computationally efficient variants of GPR. Sparsification is achieved either through a subset selection of support points [7, 8] or through sparsification of the spectrum of the GP [9, 10]. Online versions of such sparse GPs [11, 12, 13] have produced a viable alternative for real-time model learning problems [14]. However, these sparse approaches typically learn one global distance metric, making it difficult to fit the non-stationary data encountered in robotics. 
Moreover, restricting the resources in a GP also restricts the function space that can be covered, such that with the need to cover a growing workspace, the accuracy of learning will naturally diminish.\n\nHere we develop a probabilistic alternative to LWR that, like GPR, has a global generative model, but is locally adaptive and retains LWR's fast incremental training. We start in the batch setting, where rethinking LWR's localization strategy results in a loss function coupling local models that can be modeled within the Gaussian regression framework (Section 2). Modifying and approximating the global model, we arrive at a localized batch learning procedure (Section 3), which we term Local Gaussian Regression (LGR). Finally, we develop an incremental version of LGR that processes streaming data (Section 4). Previous probabilistic formulations of local regression [15, 16, 17] are bottom-up constructions – generative models for one local model at a time. Ours is a top-down approach, approximating a global model to give a localized regression algorithm similar to LWR.\n\n2 Background\n\nLocally weighted regression (LWR) with a fixed set of M local models minimizes the loss function\n\nL(w) = Σ_{n=1}^N Σ_{m=1}^M η_m(x_n) (y_n − ξ_m(x_n)^T w_m)^2 = Σ_{m=1}^M L(w_m).   (1)\n\nThe right hand side decomposes L(w) into independent losses for the M models. We assume each model has K local feature functions ξ_mk(x), so that the m-th model's prediction at x is\n\nf_m(x) = Σ_{k=1}^K ξ_mk(x) w_mk = ξ_m(x)^T w_m.   (2)\n\nK = 2, ξ_m1(x) = 1, ξ_m2(x) = (x − c_m) gives a linear model around c_m. Higher polynomials can be used, too, but linear models have a favorable bias-variance trade-off [18]. The models are localized by a non-negative, symmetric and integrable weighting η_m(x), typically the radial basis function\n\nη_m(x) = exp(−(x − c_m)^2 / (2λ_m^2)),   or   η_m(x) = exp(−(1/2)(x − c_m) Λ_m^{−1} (x − c_m)^T)   (3)\n\nfor x ∈ R^D, with center c_m and length scale λ_m or positive definite metric Λ_m. 
The prediction y* at a test point x* is a normalized weighted average of the local predictions y*_m:\n\ny* = Σ_{m=1}^M η_m(x*) f_m(x*) / Σ_{m=1}^M η_m(x*).   (4)\n\nη_m(x_n) localizes the effect of errors on the least-squares estimate of w_m – data points far away from c_m have little effect. LWR effectively trains M linear models on M separate datasets y_m(x_n) = √(η_m(x_n)) y_n. These models differ from the one of Eq. (4), used at test time. This smoothes discontinuous transitions between models, but also means that LWR cannot be cast probabilistically as one generative model for training and test data simultaneously. (This holds for any bottom-up construction that learns local models independently and combines them as above, e.g., [15, 16].) The independence of local models is key to LWR's training: changing one local model does not affect the others. \n\nFigure 1: Left: Bayesian linear regression with M feature functions φ_m^n = φ_m(x_n) = η_m^n ξ_m^n, where η_m^n can be a function localizing the effect of the m-th input function ξ_m^n towards the prediction of y_n. Right: Latent variables f_m^n placed between the features and y_n decouple the M regression parameters w_m and effectively create M local models connected only through the latent f_m^n.
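To make the LWR equations (1)-(4) concrete, here is a minimal Python sketch. This is our own illustration, not the authors' code: scalar inputs, K = 2 features (1, x − c_m), a shared length scale, and all function and variable names are ours.

```python
import math

def eta(x, c, lam):
    # RBF localizer of Eq. (3), scalar input with length scale lam
    return math.exp(-((x - c) ** 2) / (2.0 * lam ** 2))

def fit_lwr(X, Y, centers, lam):
    # Minimize Eq. (1): one independent weighted least-squares fit per
    # local model, with features (1, x - c_m) as in Eq. (2), K = 2.
    models = []
    for c in centers:
        s00 = s01 = s11 = t0 = t1 = 0.0
        for x, y in zip(X, Y):
            w = eta(x, c, lam)
            d = x - c
            s00 += w; s01 += w * d; s11 += w * d * d
            t0 += w * y; t1 += w * d * y
        det = s00 * s11 - s01 * s01 + 1e-12  # jitter guards the 2x2 solve
        w0 = (s11 * t0 - s01 * t1) / det
        w1 = (s00 * t1 - s01 * t0) / det
        models.append((c, w0, w1))
    return models

def predict_lwr(models, lam, x):
    # Eq. (4): normalized weighted average of the local predictions
    num = den = 0.0
    for c, w0, w1 in models:
        w = eta(x, c, lam)
        num += w * (w0 + w1 * (x - c))
        den += w
    return num / den
```

Note that each local fit only sees data reweighted by its own η_m, which is exactly the independence property discussed above.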
While this lowers cost, we believe it is also partially responsible for LWR's sensitivity to manually tuned parameters. Here, we investigate a different strategy to achieve localization, aiming to retain the computational complexity of LWR, while adding a sense of globality. Instead of using η_m to localize the training error of data points, we localize a model's contribution ŷ_m = ξ_m(x)^T w_m towards the global fit of a training point y, similar to how LWR operates during test time (Eq. 4). Thus, already during training, local models must collaborate to fit a data point ŷ = Σ_{m=1}^M η_m(x) ξ_m(x)^T w_m. Our loss function is\n\nL(w) = Σ_{n=1}^N (y_n − Σ_{m=1}^M η_m(x_n) ξ_m(x_n)^T w_m)^2 = Σ_{n=1}^N (y_n − Σ_{m=1}^M φ_m(x_n)^T w_m)^2,   (5)\n\ncombining the localizer η_m(x_n) and the m-th input function ξ_m(x_n) to form the feature φ_m(x_n) = η_m(x_n) ξ_m(x_n). This form of localization couples all local models, as in classical radial basis function networks [19]. At test time, all local predictions form a joint prediction\n\ny* = Σ_{m=1}^M y*_m = Σ_{m=1}^M φ_m(x*)^T w_m.   (6)\n\nThis loss can be minimized through a regularized least-squares estimator for w (the concatenation of all w_m). We follow the probabilistic interpretation of least-squares estimation as inference on the weights w, from a Gaussian prior p(w) = N(w; μ_0, Σ_0) and likelihood p(y | Φ, w) = N(y; Φw, β_y^{−1} I). The probabilistic formulation has additional value as a generative model for all (training and test) data points y, which can be used to learn hyperparameters (Figure 1, left). The posterior is\n\np(w | y, Φ) = N(w; μ_N, Σ_N)   with   (7)\n\nμ_N = (Σ_0^{−1} + β_y Φ^T Φ)^{−1} (β_y Φ^T y + Σ_0^{−1} μ_0)   and   Σ_N = (Σ_0^{−1} + β_y Φ^T Φ)^{−1}.   (8)\n\n(Heteroscedastic data will be addressed below.) 
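As a concrete (and purely illustrative, not the paper's) sketch of Eqs. (7)-(8), the following fits M = 2 locally constant models (ξ_m = 1), with μ_0 = 0 and Σ_0 = α^{−1} I, so that F = 2 and the 2×2 solve can be written out by hand; the names and default values are our own assumptions.

```python
import math

def phi(x, centers, lam):
    # localized constant features: phi_m(x) = eta_m(x) * 1 (xi_m = 1)
    return [math.exp(-((x - c) ** 2) / (2.0 * lam ** 2)) for c in centers]

def posterior(X, Y, centers, lam, beta_y=25.0, alpha=1e-2):
    # Eqs. (7)-(8) with mu_0 = 0, Sigma_0 = alpha^{-1} I, F = 2 features.
    # Accumulate G = Sigma_0^{-1} + beta_y Phi^T Phi and b = beta_y Phi^T y.
    G = [[alpha, 0.0], [0.0, alpha]]
    b = [0.0, 0.0]
    for x, y in zip(X, Y):
        f = phi(x, centers, lam)
        for i in range(2):
            b[i] += beta_y * f[i] * y
            for j in range(2):
                G[i][j] += beta_y * f[i] * f[j]
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    Sigma_N = [[G[1][1] / det, -G[0][1] / det],
               [-G[1][0] / det, G[0][0] / det]]
    mu_N = [sum(Sigma_N[i][j] * b[j] for j in range(2)) for i in range(2)]
    return mu_N, Sigma_N
```

With two well-separated centers, each posterior mean essentially fits the data in its own neighborhood, which previews the near-independence exploited in Section 3.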
The prediction for f(x*) with features φ(x*) =: φ* is also Gaussian, with p(f(x*) | y, Φ) = N(f(x*); φ* μ_N, φ* Σ_N φ*^T). As is widely known, this framework can be extended nonparametrically by a limit that replaces all inner products φ(x_i) Σ_0 φ(x_j)^T with a Mercer (positive semi-definite) kernel k(x_i, x_j), corresponding to a Gaussian process prior. The direct connection between Gaussian regression and the elegant theory of Gaussian processes is a conceptual strength. The main downside, relative to LWR, is computational cost: Calculating the posterior (7) requires solving the least-squares problem for all F parameters w jointly, by inverting the Gram matrix (Σ_0^{−1} + β_y Φ^T Φ). In general, this requires O(F^3) operations. Below we propose approximations to lower the computational cost of this operation to a level comparable to LWR, while retaining the probabilistic interpretation, and the modeling robustness of the full Gaussian model.\n\n3 Local Parametric Gaussian Regression\n\nThe above shows that Gaussian regression with features φ_m(x) = η_m(x) ξ_m(x) can be interpreted as global regression with M models, where η_m(x_n) localizes the contribution of the model ξ_m(x) towards the joint prediction of y_n. The choice of local parametric model ξ_m is essentially free. Locally constant models ξ_m(x) = 1 correspond to Gaussian regression with RBF features. Local linear regression in a K-dimensional input space takes the form ξ_m(x_n) = x_n − c_m, and can be viewed as the analog of locally weighted linear regression. 
Generalizing to M local models with K parameters each, the feature function φ_mk^n combines the k-th component of the local model ξ_km(x_n), localized by the m-th weighting function η_m(x_n):\n\nφ_mk^n := φ_mk(x_n) = η_m(x_n) ξ_km(x_n).   (9)\n\nTreating mk as indices of a vector φ^n ∈ R^{MK}, Equation (7) gives localized linear Gaussian regression. Since it will become necessary to prune the model, we adopt the classic idea of automatic relevance determination [20, 21] using a factorizing prior\n\np(w | A) = Π_{m=1}^M N(w_m; 0, A_m^{−1})   with   A_m = diag(α_m1, . . . , α_mK).   (10)\n\nThus every component k of local model m has its own precision, and can be pruned out by setting α_mk → ∞. Section 3.1 assumes a fixed number M of local models with fixed centers c_m. The parameters are θ = {β_y, {α_mk}, {λ_md}}, where K is the dimension of local model ξ(x) and D is the dimension of input x. We propose an approximation for estimating θ. Section 4 then describes an incremental algorithm allocating local models as needed, adapting M and c_m.\n\n3.1 Learning in Local Gaussian Regression\n\nExact Gaussian regression with localized features still has cubic cost. However, because of the localization, correlation between distant local models approximately vanishes, and inference is approximately independent between local models. To use this near-independence for cheap local approximate inference, similar to LWR, we introduce a latent variable f_m^n for each local model m and datum x_n, as in probabilistic backfitting [22]. Intuitively, the f form approximate local targets, against which the local parameters fit (Figure 1, right). Moreover, as formalized below, each f_m^n has its own variance parameter, which re-introduces the ability to model heteroscedastic data. This modified model motivates a factorizing variational bound (Section 3.1.1). 
Rendering the local models computationally independent, it allows for fast approximate inference in the local Gaussian model. Hyperparameters can be learned by approximate maximum likelihood (Section 3.1.2), i.e. iterating between constructing a bound q(z | θ) on the posterior over hidden variables z (defined below) given current parameter estimates θ and optimizing q with respect to θ.\n\n3.1.1 Variational Bound\n\nThe complete data likelihood of the modified model (Figure 1, right) is\n\np(y, f, w | Φ, θ) = Π_{n=1}^N N(y_n; 1^T f^n, β_y^{−1}) Π_{n=1}^N Π_{m=1}^M N(f_m^n; φ_m^n{}^T w_m, β_{fm}^{−1}) Π_{m=1}^M N(w_m; 0, A_m^{−1}).   (11)\n\nOur Gaussian model involves the latent variables w and f, the precisions β = {β_y, β_f1, . . . , β_fM} and the model parameters λ_m, c_m. We treat w and f as probabilistic variables and estimate θ = {β, λ, c}. On w, f, we construct a variational bound q(w, f) imposing the factorization q(w, f) = q(w)q(f). The variational free energy is a lower bound on the log evidence for the observations y:\n\nlog p(y | θ) ≥ ∫ q(w, f) log [ p(y, w, f | θ) / q(w, f) ] dw df.   (12)\n\nThis bound is maximized by the q(w, f) minimizing the relative entropy D_KL[q(w, f) || p(w, f | y, θ)], the distribution for which log q(w) = E_f[log p(y | f, w) p(w, f)] and log q(f) = E_w[log p(y | f, w) p(w, f)]. It is relatively easy to show (e.g. 
[23]) that these distributions are Gaussian in both w and f. The approximation on w is\n\nlog q(w) = E_f[ Σ_{n=1}^N log p(f^n | φ^n, w) + log p(w | A) ] = log Π_{m=1}^M N(w_m; μ_wm, Σ_wm),   (13)\n\nwhere\n\nΣ_wm = ( β_{fm} Σ_{n=1}^N φ_m^n φ_m^n{}^T + A_m )^{−1} ∈ R^{K×K}   and   μ_wm = β_{fm} Σ_wm Σ_{n=1}^N φ_m^n E[f_m^n] ∈ R^{K×1}.   (14)\n\nThe posterior update equations for the weights are local: each of the local models updates its parameters independently. This comes at the cost of having to update the belief over the variables f_m^n, which achieves a coupling between the local models. The Gaussian variational bound on f is\n\nlog q(f^n) = E_w[ log p(y_n | f^n, β_y) + log p(f^n | φ_m^n, w) ] = log N(f^n; μ_{f^n}, Σ_f),   (15)\n\nwhere\n\nΣ_f = B^{−1} − B^{−1} 1 (β_y^{−1} + 1^T B^{−1} 1)^{−1} 1^T B^{−1} = B^{−1} − (B^{−1} 1 1^T B^{−1}) / (β_y^{−1} + 1^T B^{−1} 1),   (16)\n\nμ_{f_m^n} = E_w[w_m]^T φ_m^n + ( β_{fm}^{−1} / (β_y^{−1} + Σ_{m=1}^M β_{fm}^{−1}) ) ( y_n − Σ_{m=1}^M E_w[w_m]^T φ_m^n ),   (17)\n\nand B = diag(β_f1, . . . , β_fM). μ_{f_m^n} is the posterior mean of the m-th model's virtual target for data point n. These updates can be performed in O(MK). 
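One pass of these coordinate-ascent updates can be sketched as follows. This is our own minimal illustration, not the paper's code: K = 1 (locally constant models), scalar inputs, fixed precisions, and names of our choosing.

```python
import math

def vb_iteration(X, Y, mu_w, centers, lam, beta_y, beta_f, alpha=1e-2):
    M = len(centers)
    feats = [[math.exp(-((x - c) ** 2) / (2.0 * lam ** 2)) for c in centers]
             for x in X]
    # q(f): means of the virtual targets, Eq. (17); the shared residual
    # is what couples the otherwise independent local models
    s = 1.0 / beta_y + sum(1.0 / bf for bf in beta_f)
    mu_f = []
    for ph, y in zip(feats, Y):
        r = y - sum(mu_w[m] * ph[m] for m in range(M))
        mu_f.append([mu_w[m] * ph[m] + (1.0 / beta_f[m]) / s * r
                     for m in range(M)])
    # q(w): each local model is updated independently, Eqs. (13)-(14), K = 1
    new_mu, new_var = [], []
    for m in range(M):
        prec = alpha + beta_f[m] * sum(ph[m] ** 2 for ph in feats)
        mean = beta_f[m] * sum(ph[m] * mf[m]
                               for ph, mf in zip(feats, mu_f)) / prec
        new_mu.append(mean)
        new_var.append(1.0 / prec)
    return new_mu, new_var, mu_f
```

Iterating this to convergence shows the message passing described above: the weight posteriors are computed per model, while the virtual targets distribute the global residual among all models.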
Note how the posterior over hidden variables f couples the local models, allowing for a form of message passing between local models.\n\n3.1.2 Optimizing Hyperparameters\n\nTo set the parameters θ = {β_y, {β_{fm}, λ_m}_{m=1}^M, {α_mk}}, we maximize the expected complete log likelihood under the variational bound\n\nE_{f,w}[log p(y, f, w | Φ, θ)] = E_{f,w}[ Σ_{n=1}^N log N(y_n; 1^T f^n, β_y^{−1}) + Σ_{n=1}^N Σ_{m=1}^M log N(f_m^n; w_m^T φ_m^n, β_{fm}^{−1}) + Σ_{m=1}^M log N(w_m; 0, A_m^{−1}) ].   (18)\n\nSetting the gradient of this expression to zero leads to the following update equations for the variances:\n\nβ_y^{−1} = (1/N) Σ_{n=1}^N [ (y_n − 1^T μ_{f^n})^2 + 1^T Σ_f 1 ],   (19)\n\nβ_{fm}^{−1} = (1/N) Σ_{n=1}^N [ (μ_{f_m^n} − μ_wm^T φ_m^n)^2 + φ_m^n{}^T Σ_wm φ_m^n + σ²_{f_m^n} ],   (20)\n\nα_mk^{−1} = μ²_{wmk} + Σ_{wm,kk}.   (21)\n\nThe gradient with respect to the scales of each local model is completely localized:\n\n∂E_{f,w}[log p(y, f, w | Φ, θ)] / ∂λ_md = ∂E_{f,w}[ Σ_{n=1}^N log N(f_m^n; w_m^T φ_m^n, β_{fm}^{−1}) ] / ∂λ_md.   (22)\n\nWe use gradient ascent to optimize the length scales λ_md. All necessary equations are of low cost and, with the exception of the variance 1/β_y, all hyper-parameter updates are solved independently for each local model, similar to LWR. In contrast to LWR, however, these local updates do not cause a potential catastrophic shrinking of the length scales: In LWR, both inputs and outputs are weighted by the localizing function, thus reducing the length scale improves the fit. The localization in Equation (22) only affects the influence of regression model m, but the targets still need to be fit accordingly. 
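The ARD update (21), together with the pruning rule α_mk → ∞, amounts to very little code. The following two-line sketch (our illustration; one weight per model, names and threshold default ours) makes the mechanism explicit:

```python
def ard_update(mu_w, var_w, prune_at=1e3):
    # Eq. (21): alpha_mk^{-1} = mu_wmk^2 + Sigma_wm,kk.
    # A weight whose posterior mean and variance both shrink toward zero
    # drives alpha toward infinity; in practice the component is pruned
    # once alpha exceeds a large threshold.
    alphas = [1.0 / (m * m + v) for m, v in zip(mu_w, var_w)]
    keep = [a < prune_at for a in alphas]
    return alphas, keep
```

A weight with substantial posterior mass away from zero keeps a small α and survives; a weight collapsed onto zero is flagged for removal.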
Shrinking of local models only happens if it actually improves the fit against the unweighted targets f_m^n, such that no complex cross validation procedures are required.\n\n3.1.3 Prediction\n\nPredictions at a test point x* arise from marginalizing over both f and w, using\n\np(y* | x*) = ∫ [ ∫ N(y*; 1^T f*, β_y^{−1}) N(f*; W^T φ(x*), B^{−1}) df* ] N(w; μ_w, Σ_w) dw = N( y*; Σ_{m=1}^M μ_wm^T φ*_m, σ²(x*) ),   (23)\n\nwhere σ²(x*) = β_y^{−1} + Σ_{m=1}^M β_{fm}^{−1} + Σ_{m=1}^M φ*_m{}^T Σ_wm φ*_m, which is linear in M and K.\n\n4 Incremental Local Gaussian Regression\n\nThe above approximate posterior updates apply in the batch setting, assuming the number M and locations c of local models are fixed. This section constructs an online algorithm for incrementally arriving data, creating new local models when needed. There has been recent interest in variational online algorithms for efficient learning on large data sets [24, 25]. Stochastic variational inference [24] operates under the assumption that the data set has a fixed size N and optimizes the variational lower bound for N data points via stochastic gradient descent. Here, we follow algorithms for streaming datasets of unknown size. Probabilistic methods in this setting typically follow a Bayesian filtering approach [26, 25, 27] in which the posterior after n − 1 data points becomes the prior for the n-th incoming data point. Following this principle we extend the model presented in Section 3 and treat the precision variables {β_{fm}, α_mk} as random variables, assuming Gamma priors p(β_{fm}) = G(β_{fm}; a^β_0, b^β_0) and p(α_m) = Π_{k=1}^K G(α_mk; a^α_0, b^α_0). 
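At test time, Eq. (23) reduces to a few lines of code. The sketch below is our own illustration (K = 1, scalar inputs, our names), not the paper's implementation:

```python
import math

def predict_lgr(x, centers, lam, mu_w, var_w, beta_y, beta_f):
    # Eq. (23) for K = 1: the mean sums the local contributions, and the
    # variance adds observation noise, per-model noise and weight uncertainty.
    ph = [math.exp(-((x - c) ** 2) / (2.0 * lam ** 2)) for c in centers]
    mean = sum(m * p for m, p in zip(mu_w, ph))
    var = 1.0 / beta_y + sum(1.0 / bf for bf in beta_f) \
        + sum(v * p * p for v, p in zip(var_w, ph))
    return mean, var
```

Both quantities cost O(MK) per query, in contrast to the O(N) kernel expansions of a full GP.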
Thus, the factorized approximation on the posterior q(z) over all random variables z = {f, w, α, β_f} is changed to\n\nq(z) = q(f, w, β_f, α) = q(f) q(w) q(β_f) q(α).   (24)\n\nA batch version of this was introduced in [28]. Given that, the recursive application of Bayes' theorem results in the approximate posterior\n\np(z | x_1, . . . , x_n) ≈ p(x_n | z) q(z | x_1, . . . , x_{n−1})   (25)\n\nafter n data points. In essence, this formulates the (approximate) posterior updates in terms of sufficient statistics, which are updated with each new incoming data point. The batch updates (listed in [28]) can be rewritten such that they depend on the following sufficient statistics: Σ_{n=1}^N φ_m^n φ_m^n{}^T, Σ_{n=1}^N φ_m^n μ_{f_m^n}, and Σ_{n=1}^N (μ_{fm}^n)². Although the length-scales λ_m could be treated as random variables too, here we update them using the noisy (stochastic) gradients produced by each incoming data point. Due to space limitations, we only summarize these update equations in the algorithm below, where we have replaced the expectation operator by ⟨·⟩. Finally, we use an extension analogous to incremental training of the relevance vector machine [29] to iteratively add local models at new, greedily selected locations c_{M+1}. Starting with one local model, each iteration adds one local model in the variational step, and prunes out existing local models for which all components α_mk → ∞. 
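The sufficient-statistics view of the streaming update can be sketched as follows. This is our own simplification (K = 1, one model, our names and defaults), not Algorithm 1 itself; only the role of the forgetting rate κ is taken from the text.

```python
class LocalModelStats:
    # Per-model sufficient statistics for streaming updates (K = 1):
    #   s_phiphi = sum_n phi_m(x_n)^2
    #   s_phif   = sum_n phi_m(x_n) * mu_f[n][m]
    # Both are decayed by the forgetting rate kappa before each new point.
    def __init__(self, kappa=0.999, alpha=1e-2):
        self.kappa, self.alpha = kappa, alpha
        self.s_phiphi = 0.0
        self.s_phif = 0.0

    def update(self, phi, mu_f, beta_f):
        # fold in one data point, then recompute the local posterior (Eq. 14)
        self.s_phiphi = self.kappa * self.s_phiphi + phi * phi
        self.s_phif = self.kappa * self.s_phif + phi * mu_f
        prec = beta_f * self.s_phiphi + self.alpha
        return beta_f * self.s_phif / prec, 1.0 / prec  # mu_wm, Sigma_wm
```

With κ < 1 older data is gradually forgotten, which is what lets the local models track non-stationary targets.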
This works well in practice, with the caveat that the model number M can grow fast initially, before the pruning becomes effective. Thus, we check for each selected location c_{M+1} whether any of the existing local models c_{1:M} produces a localizing weight η_m(c_{M+1}) ≥ w_gen, where w_gen is a parameter between 0 and 1 and regulates how many local models are added. Algorithm 1 gives an overview of the entire incremental algorithm.\n\nAlgorithm 1 Incremental LGR\n1: M = 0; C = {}; priors a^α_0, b^α_0, a^β_0, b^β_0; forgetting rate κ; learning rate ν\n2: for all (x_n, y_n) do   // for each data point\n3:   if η_m(x_n) < w_gen, ∀m = 1, . . . , M then c_{M+1} ← x_n; C ← C ∪ {c_{M+1}}; M ← M + 1 end if\n4:   Σ_f = B^{−1} − (B^{−1} 1 1^T B^{−1}) / (⟨β⟩_y^{−1} + Σ_{m=1}^M ⟨β⟩_{fm}^{−1})\n5:   μ_{f_m^n} = μ_wm^T φ_m^n + (⟨β⟩_{fm}^{−1} / (⟨β⟩_y^{−1} + Σ_{m=1}^M ⟨β⟩_{fm}^{−1})) (y_n − Σ_{m=1}^M μ_wm^T φ_m^n)\n6:   for m = 1 to M do\n7:     if η_m(x_n) < 0.01 then continue end if\n8:     S^{φφ}_m ← κ S^{φφ}_m + φ_m^n φ_m^n{}^T;  S^{φμ}_m ← κ S^{φμ}_m + φ_m^n μ_{f_m^n};  S^{μμ}_m ← κ S^{μμ}_m + (μ_{f_m^n})²\n9:     Σ_wm = (⟨β⟩_{fm} S^{φφ}_m + ⟨A⟩_m)^{−1};  μ_wm = ⟨β⟩_{fm} Σ_wm S^{φμ}_m\n10:    N_m ← κ N_m + 1;  a^β_{Nm} = a^β_0 + N_m;  a^α_{Nm} = a^α_0 + 0.5\n11:    b^β_{Nm} = b^β_0 + S^{μμ}_m − 2 μ_wm^T S^{φμ}_m + tr(S^{φφ}_m (Σ_wm + μ_wm μ_wm^T)) + N_m σ²_{f_m}\n12:    b^α_{Nmk} = b^α_0 + 0.5 (μ²_{wm,k} + Σ_{wm,kk})\n13:    ⟨β⟩_{fm} = a^β_{Nm} / b^β_{Nm};  ⟨A⟩_m = diag(a^α_{Nmk} / b^α_{Nmk})\n14:    λ_m ← λ_m + ν (∂/∂λ_m) N(⟨f^n⟩_m; ⟨w⟩_m^T φ_m^n, ⟨β⟩_{fm}^{−1})\n15:    if ⟨α⟩_mk > 1e3, ∀k = 1, . . . , K then prune local model m; M ← M − 1 end if\n16:   end for\n17: end for\n\nTable 1: Datasets for inverse dynamics tasks: KUKA1, KUKA2 are different splits of the same data. The rightmost column indicates the overlap in input space coverage between the offline (ISoffline) and online (ISonline) training sets.\nDataset | Motion | freq (Hz) | Noffline train | Nonline train | Ntest | ISoffline ∪ ISonline\nSarcos [2] | rhythmic | 100 | 4449 | 44484 | − | large overlap\nKUKA1 | rhythmic at various speeds | 500 | 17560 | 180360 | − | small overlap\nKUKA2 | rhythmic at various speeds | 500 | 17560 | 180360 | − | no overlap\nKUKAsim | rhythmic + discrete | 500 | − | 1984950 | 20050 | −\n\n5 Experiments\n\nWe evaluate our LGR on inverse dynamics learning tasks, using data from two robotic platforms: a SARCOS anthropomorphic arm and a KUKA lightweight arm. For both robots, learning the inverse dynamics involves learning a map from the joint positions q (rad), velocities q̇ (rad/s) and accelerations q̈ (rad/s²) to torques τ (Nm) for each of 7 joints (degrees of freedom). We compare to two methods previously used for inverse dynamics learning: LWPR¹ – an extension of LWR for high dimensional spaces [31] – and I-SSGPR² [13] – an incremental version of Sparse Spectrum GPR. I-SSGPR differs from LGR and LWPR in that it is a global method and does not learn the distance metric online. Instead, I-SSGPR needs offline training of hyperparameters before it can be used online. 
We mimic the procedure used in [13]: An offline training set is used to learn an initial model and hyperparameters, then an online training set is used to evaluate incremental learning. Where indicated we use initial offline training for all three methods. I-SSGPR uses typical GPR optimization procedures for offline training, and is thus only available in batch mode. For LGR, we use the batch version for pre-training/hyperparameter learning. For all experiments we initialized the length scales to λ = 0.3, and used w_gen = 0.3 for both LWPR and LGR.\n\nWe evaluate on four different data sets, listed in Table 1. These sets vary in scale, types of motion and how well the offline training set represents the data encountered during online learning. All results were averaged over 5 randomly seeded runs; mean-squared error (MSE) and normalized mean-squared error (nMSE) are reported on the online training dataset. The nMSE is the mean-squared error normalized by the variance of the outputs.\n\nTable 2: Predictive performance on online training data of Sarcos after one sweep. I-SSGPR has been trained with 200 (400) features; the MSE for 400 features is reported in brackets.\nJoint | I-SSGPR200(400) MSE | nMSE | LWPR MSE | nMSE | # of LM | LGR MSE | nMSE | # of LM\nJ1 | 13.699 (10.832) | 0.033 | 19.180 | 0.046 | 461.4 | 11.434 | 0.027 | 321.4\nJ2 | 6.158 (4.788) | 0.027 | 9.783 | 0.044 | 495.0 | 8.342 | 0.037 | 287.4\nJ3 | 1.803 (1.415) | 0.018 | 3.595 | 0.036 | 464.6 | 2.237 | 0.023 | 298.0\nJ4 | 1.198 (0.857) | 0.006 | 4.807 | 0.025 | 382.8 | 5.079 | 0.027 | 303.2\nJ5 | 0.034 (0.027) | 0.036 | 0.071 | 0.075 | 431.2 | 0.031 | 0.033 | 344.2\nJ6 | 0.129 (0.096) | 0.044 | 0.248 | 0.085 | 510.2 | 0.101 | 0.034 | 344.2\nJ7 | 0.093 (0.063) | 0.014 | 0.231 | 0.034 | 378.8 | 0.170 | 0.025 | 348.8\n\nSarcos: Table 2 summarizes results on the popular Sarcos benchmark for inverse dynamics learning tasks [2]. 
The traditional test set is used as the offline training data to pre-train all three models. I-SSGPR is trained with 200 and 400 sparse spectrum features, indicated as I-SSGPR200(400), where 200 features is the optimal design choice according to [13]. We report the (normalized) mean-squared error on the online training data after one sweep through it has been performed – i.e. each data point has been used once. All three methods perform well on this data, with I-SSGPR and LGR having a slight edge over LWPR in terms of accuracy; and LGR uses fewer local models than LWPR. The Sarcos data offline training set represents the data encountered during online training very well. Thus, here online distance metric learning is not necessary to achieve good performance.\n\n1 We use the LWPR implementation found in the SL simulation software package [30].\n2 We use code from the learningMachine library in the RobotCub framework, from http://eris.liralab.it/iCub\n\nTable 3: Predictive performance on online training data of KUKA1 and KUKA2 after one sweep. KUKA2 results are averages across joints. 
I-SSGPR was trained on 200 and 400 features (results for I-SSGPR400 shown in brackets).\ndata | Joint | I-SSGPR200(400) MSE | nMSE | LWPR MSE | nMSE | # of LM | LGR MSE | nMSE | # of LM\nKUKA1 | J1 | 7.021 (7.680) | 0.233 | 2.238 | 0.074 | 3476.8 | 2.362 | 0.078 | 3188.6\nKUKA1 | J2 | 16.385 (18.492) | 0.265 | 2.738 | 0.044 | 3508.6 | 2.359 | 0.038 | 3363.8\nKUKA1 | J3 | 1.872 (1.824) | 0.289 | 0.528 | 0.082 | 3477.2 | 0.457 | 0.071 | 3246.6\nKUKA1 | J4 | 3.124 (3.460) | 0.256 | 0.571 | 0.047 | 3494.6 | 0.503 | 0.041 | 3333.6\nKUKA1 | J5 | 0.095 (0.143) | 0.196 | 0.017 | 0.036 | 3512.4 | 0.019 | 0.039 | 3184.4\nKUKA1 | J6 | 0.142 (0.296) | 0.139 | 0.029 | 0.029 | 3561.0 | 0.043 | 0.042 | 3372.4\nKUKA1 | J7 | 0.129 (0.198) | 0.174 | 0.033 | 0.044 | 3625.6 | 0.023 | 0.031 | 3232.6\nKUKA2 | − | 9.740 (9.985) | 0.507 | 1.064 | 0.056 | 3617.7 | 1.012 | 0.054 | 3290.2\n\nFigure 2: Right: nMSE on the first joint of the simulated KUKA arm. Left: average number of local models used.\n\nKUKA1 and KUKA2: The two KUKA datasets consist of rhythmic motions at various speeds, and represent a more realistic setting in robotics: While one can collect some data for offline training, it is not feasible to cover the whole state-space. Offline data of KUKA1 has been chosen to give partial coverage of the range of available speeds, while KUKA2 consists of motion at only one speed. In this setting, both LWPR and LGR excel (Table 3). As they can learn local distance metrics on the fly, they adapt to incoming data in previously unexplored input areas. Performance of I-SSGPR200 degrades as the offline training data is less representative, while LGR and LWPR perform almost equally well on KUKA1 and KUKA2. 
While there is little difference in accuracy between LGR and LWPR, LGR consistently uses fewer local models and does not require careful manual meta-parameter tuning. Since both LGR and LWPR use more local models on this data (compared to the Sarcos data), we also tried increasing the feature space of I-SSGPR to 400 features. This did not improve I-SSGPR's performance on the online data (see Table 3). Finally, it is noteworthy that LGR processes both of these data sets at ∼500 Hz (C++ code, on a 3.4 GHz Intel Core i7), making it a realistic alternative for real-time inverse dynamics learning tasks.

KUKAsim: Finally, we evaluate LGR's ability to learn from scratch on KUKAsim, a large data set of 2 million simulated data points, collected using [30]. We randomly drew 1% of the points as a test set, on which we evaluate convergence during online training. Figure 2 shows convergence (left) and the number of local models used (right), averaged over 5 randomly seeded runs for joint 1. After the first 1e5 data points, both LWPR and LGR achieve a normalized mean squared error below 0.07, and eventually converge to an nMSE of ∼0.01. LGR converges slightly faster, while using fewer local models (Figure 2, right).

6 Conclusion

We proposed a top-down approach to probabilistic localized regression. Local Gaussian Regression decouples inference over M local models, resulting in efficient and principled updates for all parameters, including local distance metrics. These localized updates can be used in batch as well as incrementally, yielding computationally efficient learning in either case and applicability to big data sets. Evaluated on a variety of simulated and real robotic inverse dynamics tasks, and compared to I-SSGPR and LWPR, incremental LGR shows an ability to add resources (local models) and to update its distance metrics online. This is essential to consistently achieve high accuracy.
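Schematically, the localized structure summarized above, M local models, each with its own center and learnable distance metric, can be illustrated as a Gaussian-weighted combination of local linear predictions. This sketch shows the structure only; it is not the paper's actual inference or update equations, and all names and interfaces are illustrative:

```python
import numpy as np

def local_gaussian_predict(x, centers, metrics, coeffs):
    """Schematic prediction with M local linear models.
    Each model m has a Gaussian locality weight
        w_m(x) = exp(-0.5 * (x - c_m)^T A_m (x - c_m)),
    where A_m is that model's local distance metric; learning the A_m
    online is what lets such models adapt to unexplored input regions.
    Here the prediction is the weight-normalized sum of local predictions."""
    weights, preds = [], []
    for c, A, b in zip(centers, metrics, coeffs):
        d = x - c
        weights.append(np.exp(-0.5 * d @ A @ d))
        preds.append(b @ np.append(d, 1.0))  # local linear model plus offset
    w = np.asarray(weights)
    return float(np.dot(w, preds) / w.sum())
```

A narrower metric A_m shrinks a model's region of influence, which is how the number of local models trades off against local fit.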
Compared to LWPR, LGR matches or improves precision, while consistently using fewer resources (local models) and having significantly fewer manual tuning parameters.

References
[1] Christopher G Atkeson, Andrew W Moore, and Stefan Schaal. Locally weighted learning for control. Artificial Intelligence Review, (1-5):75–113, 1997.
[2] Carl Edward Rasmussen and Christopher KI Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[3] Jianqing Fan and Irene Gijbels. Data-driven bandwidth selection in local polynomial fitting: variable bandwidth and spatial adaptation. Journal of the Royal Statistical Society, pages 371–394, 1995.
[4] Stefan Schaal and Christopher G Atkeson. Constructive incremental learning from only local information. Neural Computation, 10(8):2047–2084, 1998.
[5] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. JMLR, 6:1939–1959, 2005.
[6] Krzysztof Chalupka, Christopher KI Williams, and Iain Murray. A framework for evaluating approximation methods for Gaussian process regression. JMLR, 14(1):333–350, 2013.
[7] Michalis K Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009.
[8] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. Advances in Neural Information Processing Systems, 18:1257, 2006.
[9] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NIPS, 2007.
[10] Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, and Aníbal R Figueiras-Vidal. Sparse spectrum Gaussian process regression. JMLR, 11:1865–1881, 2010.
[11] Marco F Huber. Recursive Gaussian process: On-line regression and learning. Pattern Recognition Letters, 45:85–91, 2014.
[12] Lehel Csató and Manfred Opper. Sparse on-line Gaussian processes. Neural Computation, 2002.
[13] Arjan Gijsberts and Giorgio Metta. Real-time model learning using incremental sparse spectrum Gaussian process regression. Neural Networks, 41:59–69, 2013.
[14] James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. In UAI, 2013.
[15] Jo-Anne Ting, Mrinal Kalakrishnan, Sethu Vijayakumar, and Stefan Schaal. Bayesian kernel shaping for learning control. In Advances in Neural Information Processing Systems, 2008.
[16] Duy Nguyen-Tuong, Jan R Peters, and Matthias Seeger. Local Gaussian process regression for real time online model learning. In Advances in Neural Information Processing Systems, pages 1193–1200, 2008.
[17] Edward Snelson and Zoubin Ghahramani. Local and global sparse Gaussian process approximations. In International Conference on Artificial Intelligence and Statistics, pages 524–531, 2007.
[18] Trevor Hastie and Clive Loader. Local regression: Automatic kernel carpentry. Statistical Science, 1993.
[19] J. Moody and C. Darken. Learning with localized receptive fields. In Proceedings of the 1988 Connectionist Models Summer School, pages 133–143, San Mateo, CA, 1988.
[20] Radford M Neal. Bayesian Learning for Neural Networks, volume 118. Springer, 1996.
[21] Michael E Tipping. Sparse Bayesian learning and the relevance vector machine. JMLR, 1:211–244, 2001.
[22] Aaron D'Souza, Sethu Vijayakumar, and Stefan Schaal. The Bayesian backfitting relevance vector machine. In ICML, 2004.
[23] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 2008.
[24] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. JMLR, 14(1):1303–1347, 2013.
[25] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and Michael Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems, pages 1727–1735, 2013.
[26] Jan Luts, Tamara Broderick, and Matt Wand. Real-time semiparametric regression. arXiv, 2013.
[27] Antti Honkela and Harri Valpola. On-line variational Bayesian learning. In 4th International Symposium on Independent Component Analysis and Blind Signal Separation, pages 803–808, 2003.
[28] Franziska Meier, Philipp Hennig, and Stefan Schaal. Efficient Bayesian local model learning for control. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS), 2014.
[29] Joaquin Quiñonero-Candela and Ole Winther. Incremental Gaussian processes. In NIPS, 2002.
[30] Stefan Schaal. The SL simulation and real-time control software package. Technical report, 2009.
[31] Sethu Vijayakumar and Stefan Schaal. Locally weighted projection regression: Incremental real time learning in high dimensional space. In ICML, pages 1079–1086, 2000.