{"title": "Local Dimensionality Reduction", "book": "Advances in Neural Information Processing Systems", "page_first": 633, "page_last": 639, "abstract": "", "full_text": "Local Dimensionality Reduction \n\nSethu Vijayakumar 3, I \n\nChristopher G. Atkeson 4 \n\nsethu@cs.titech.ac.jp \n\ncga@cc.gatech.edu \n\nhttp://www.cc.gatech.edul \n\nStefan Schaal  1,2,4 \n\nsschaal@usc.edu \n\nhttp://www-slab.usc.edulsschaal \n\nhttp://ogawa(cid:173)\n\nwww.cs.titech.ac.jp/-sethu \n\nfac/Chris.Atkeson \n\nIERATO Kawato Dynamic Brain Project (IST), 2-2 Hikaridai, Seika-cho, Soraku-gun, 619-02 Kyoto \n\n2Dept.  of Comp. Science &  Neuroscience, Univ. of South. California HNB-I 03, Los Angeles CA 90089-2520 \n\n3Department of Computer Science, Tokyo Institute of Technology, Meguro-ku, Tokyo-I 52 \n\n4College of Computing, Georgia Institute of Technology, 801  Atlantic Drive, Atlanta, GA 30332-0280 \n\nAbstract \n\nIf globally high dimensional data has locally only low dimensional distribu(cid:173)\ntions,  it is  advantageous to perform a local dimensionality reduction before \nfurther processing the data.  In this paper we examine several techniques for \nlocal  dimensionality reduction  in the  context of locally weighted linear re(cid:173)\ngression.  As possible candidates, we derive local versions of factor analysis \nregression, principle component regression, principle component regression \non joint distributions, and partial least squares regression. After outlining the \nstatistical bases of these  methods,  we perform Monte Carlo  simulations to \nevaluate  their  robustness  with  respect  to  violations  of their  statistical  as(cid:173)\nsumptions.  One  surprising  outcome  is  that  locally  weighted  partial  least \nsquares  regression offers the best average results,  thus outperforming even \nfactor analysis, the theoretically most appealing of our candidate techniques. \n\n1  INTRODUCTION \nRegression tasks involve mapping a n-dimensional continuous input vector  x  E  ~n onto \na  m-dimensional output vector y  E ~m \u2022  They form a ubiquitous class of problems found \nin fields including process control, sensorimotor control, coordinate transformations, and \nvarious stages of information processing in biological nervous  systems.  This paper will \nfocus  on  spatially  localized  learning  techniques,  for  example,  kernel  regression  with \nGaussian weighting functions.  Local learning offer advantages for real-time incremental \nlearning problems due to fast convergence, considerable robustness towards problems of \nnegative interference, and large tolerance in model selection (Atkeson, Moore, &  Schaal, \n1997; Schaal &  Atkeson, in press). Local learning  is usually based on interpolating data \nfrom a local neighborhood around the  query point.  For high dimensional learning prob(cid:173)\nlems, however,  it suffers  from  a bias/variance dilemma,  caused by the  nonintuitive fact \nthat \" ...  [in high dimensions]  if neighborhoods  are  local,  then they  are  almost  surely \nempty, whereas if a neighborhood is not empty, then it is not local.\" (Scott,  1992, p.198). \nGlobal  learning  methods,  such  as  sigmoidal  feedforward  networks,  do  not  face  this \n\n\f634 \n\nS.  School, S.  Vijayakumar and C.  G.  Atkeson \n\nproblem  as  they  do  not  employ  neighborhood  relations,  although  they  require  strong \nprior knowledge about the problem at hand in order to be successful. 
Assuming that local learning in high dimensions is hopeless, however, is not necessarily warranted: being globally high dimensional does not imply that data remains high dimensional if viewed locally. For example, in the control of robot arms and biological arms we have shown that for estimating the inverse dynamics of an arm, a globally 21-dimensional space reduces on average to 4-6 dimensions locally (Vijayakumar & Schaal, 1997). A local learning system that can robustly exploit such locally low dimensional distributions should be able to avoid the curse of dimensionality.

In pursuit of the question of what, in the context of local regression, is the \"right\" method to perform local dimensionality reduction, this paper will derive and compare several candidate techniques under i) perfectly fulfilled statistical prerequisites (e.g., Gaussian noise, Gaussian input distributions, perfectly linear data), and ii) less perfect conditions (e.g., non-Gaussian distributions, slightly quadratic data, incorrect guess of the dimensionality of the true data distribution). We will focus on nonlinear function approximation with locally weighted linear regression (LWR), as it allows us to adapt a variety of global linear dimensionality reduction techniques, and as LWR has found widespread application in several local learning systems (Atkeson, Moore, & Schaal, 1997; Jordan & Jacobs, 1994; Xu, Jordan, & Hinton, 1996). In particular, we will derive and investigate locally weighted principal component regression (LWPCR), locally weighted joint data principal component analysis (LWPCA), locally weighted factor analysis (LWFA), and locally weighted partial least squares (LWPLS). Section 2 will briefly outline these methods and their theoretical foundations, while Section 3 will empirically evaluate the robustness of these methods using synthetic data sets that increasingly violate some of the statistical assumptions of the techniques.

2 METHODS OF DIMENSIONALITY REDUCTION

We assume that our regression data originate from a generating process with two sets of observables, the \"inputs\" x̃ and the \"outputs\" ỹ. The characteristics of the process ensure a functional relation ỹ = f(x̃). Both x̃ and ỹ are obtained through some measurement device that adds independent mean zero noise of different magnitude in each observable, such that x = x̃ + ε_x and y = ỹ + ε_y. For the sake of simplicity, we will only focus on one-dimensional output data (m=1) and functions f that are either linear or slightly quadratic, as these cases are the most common in nonlinear function approximation with locally linear models. Locality of the regression is ensured by weighting the error of each data point with a weight from a Gaussian kernel:

w_i = exp(-0.5 (x_i - x_q)^T D (x_i - x_q))    (1)

where x_q denotes the query point, and D a positive semi-definite distance metric which determines the size and shape of the neighborhood contributing to the regression (Atkeson et al., 1997).
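Before turning to the individual techniques, a small sketch may clarify the common machinery: the Gaussian weighting of Equation (1), the locally weighted mean-centering described in the following paragraphs, and the full-dimensional weighted least-squares fit that the reduction techniques replace. This is only a minimal sketch assuming NumPy; all names (gaussian_weights, lwr_baseline, ...) are ours and not from the paper.

```python
import numpy as np

def gaussian_weights(X, x_q, D):
    # w_i = exp(-0.5 (x_i - x_q)^T D (x_i - x_q)), cf. Equation (1)
    diff = X - x_q
    return np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, D, diff))

def weighted_center(X, y, w):
    # translate inputs and outputs so that their locally weighted means are zero
    x_mean = (w @ X) / w.sum()
    y_mean = (w @ y) / w.sum()
    return X - x_mean, y - y_mean

def lwr_baseline(X, y, w):
    # full-dimensional locally weighted linear regression, no dimensionality reduction
    XtW = (X * w[:, None]).T
    return np.linalg.solve(XtW @ X, XtW @ y)   # beta such that y ~ X beta
```

The full-dimensional solve can become ill-conditioned when the inputs are locally redundant, which is exactly the situation the dimensionality reduction techniques of this section are meant to address.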
The parameters x_q and D can be determined in the framework of nonparametric statistics (Schaal & Atkeson, in press) or parametric maximum likelihood estimations (Xu et al., 1995); for the present study they are determined manually since their origin is secondary to the results of this paper. Without loss of generality, all our data sets will set x_q to the zero vector, compute the weights, and then translate the input data such that the locally weighted mean, x̄ = Σ w_i x_i / Σ w_i, is zero. The output data is equally translated to be mean zero. Mean zero data is necessary for most of the techniques considered below. The (translated) input data is summarized in the rows of the matrix X, the corresponding (translated) outputs are the elements of the vector y, and the corresponding weights are in the diagonal matrix W. In some cases, we need the joint input and output data, denoted as Z = [X y].

2.1 FACTOR ANALYSIS (LWFA)

Factor analysis (Everitt, 1984) is a technique of dimensionality reduction which is the most appropriate given the generating process of our regression data. It assumes the observed data z was produced by a mean zero independently distributed k-dimensional vector of factors v, transformed by the matrix U, and contaminated by mean zero independent noise ε with diagonal covariance matrix Q:

z = U v + ε,  where  z = [x^T, y]^T  and  ε = [ε_x^T, ε_y]^T    (2)

If both v and ε are normally distributed, the parameters Q and U can be obtained iteratively by the Expectation-Maximization algorithm (EM) (Rubin & Thayer, 1982). For a linear regression problem, one assumes that z was generated with U = [I, β]^T and v = x̃, where β denotes the vector of regression coefficients of the linear model y = β^T x, and I the identity matrix. After calculating Q and U by EM in joint data space as formulated in (2), an estimate of β can be derived from the conditional probability p(y | x). As all distributions are assumed to be normal, the expected value of y is the mean of this conditional distribution. The locally weighted version (LWFA) of β can be obtained together with an estimate of the factors v from the joint weighted covariance matrix Ψ of z and v:

E{[y; v] | x} = [β^T; B] x = Ψ_21 Ψ_11^{-1} x,  where
Ψ = Σ w_i [z_i; v_i][z_i^T, v_i^T] / Σ w_i = [[Q + U U^T, U], [U^T, I]]
  = [[Ψ_11 (= n × n), Ψ_12 (= n × (m+k))], [Ψ_21 (= (m+k) × n), Ψ_22 (= (m+k) × (m+k))]]    (3)

where E{.} denotes the expectation operator and B a matrix of coefficients involved in estimating the factors v. Note that unless the noise ε is zero, the estimated β is different from the true β as it tries to average out the noise in the data.
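To make the procedure concrete, the following is a minimal sketch of LWFA regression, assuming NumPy and weighted-mean-centered data; the particular weighted EM updates, iteration count, initialization, and all function names are our own choices and are not specified in the paper. It fits the factor model of Equation (2) to the joint data and then reads off β from the conditional mean of y given x, as in Equation (3).

```python
import numpy as np

def lwfa_regression(X, y, w, k, n_iter=200):
    # joint data z_i = [x_i, y_i]; X, y assumed weighted-mean-centered, w are kernel weights
    Z = np.hstack([X, y[:, None]])
    d = Z.shape[1]
    sw = w / w.sum()
    Psi_z = (Z * sw[:, None]).T @ Z              # weighted second moment of z
    U = 0.1 * np.random.randn(d, k)              # factor loadings (initial guess)
    Q = np.ones(d)                               # diagonal noise variances
    for _ in range(n_iter):
        # E-step: posterior of the factors v given z under the current model (Eq. 2)
        C = U @ U.T + np.diag(Q)
        G = np.linalg.solve(C, U).T              # G = U^T C^{-1}
        Ev = Z @ G.T                             # posterior means E[v_i | z_i]
        Svv = np.eye(k) - G @ U + (Ev * sw[:, None]).T @ Ev
        Szv = (Z * sw[:, None]).T @ Ev
        # M-step: weighted updates of loadings and noise variances
        U = Szv @ np.linalg.inv(Svv)
        Q = np.maximum(np.diag(Psi_z - U @ Szv.T), 1e-8)
    # regression coefficients from the conditional mean E[y | x] (cf. Eq. 3)
    C = U @ U.T + np.diag(Q)
    return np.linalg.solve(C[:-1, :-1], C[:-1, -1])
```

The EM loop is the only iterative part; the other candidate techniques below can be evaluated in closed form (see the footnote in Section 3).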
2.2 JOINT-SPACE PRINCIPAL COMPONENT ANALYSIS (LWPCA)

An alternative way of determining the parameters β in a reduced space employs locally weighted principal component analysis (LWPCA) in the joint data space. By defining the largest k+1 principal components of the weighted covariance matrix of Z as U:

U = [eigenvectors( Σ w_i (z_i - z̄)(z_i - z̄)^T / Σ w_i )]_{max(1:k+1)}    (4)

and noting that the eigenvectors in U are unit length, the matrix inversion theorem (Horn & Johnson, 1994) provides a means to derive an efficient estimate of β:

β = U_x (U_y^T - U_y^T (U_y U_y^T - I)^{-1} U_y U_y^T),  where  U = [U_x (= n × k); U_y (= m × k)]    (5)

In our one-dimensional output case, U_y is just a (1 × k)-dimensional row vector and the evaluation of (5) does not require a matrix inversion anymore but rather a division.

If one assumes normal distributions in all variables as in LWFA, LWPCA is the special case of LWFA where the noise covariance Q is spherical, i.e., the same magnitude of noise in all observables. Under these circumstances, the subspaces spanned by U in both methods will be the same. However, the regression coefficients of LWPCA will be different from those of LWFA unless the noise level is zero, as LWFA optimizes the coefficients according to the noise in the data (Equation (3)). Thus, for normal distributions and a correct guess of k, LWPCA is always expected to perform worse than LWFA.

2.3 PARTIAL LEAST SQUARES (LWPLS, LWPLS_1)

Partial least squares (Wold, 1975; Frank & Friedman, 1993) recursively computes orthogonal projections of the input data and performs single variable regressions along these projections on the residuals of the previous iteration step. A locally weighted version of partial least squares (LWPLS) proceeds as shown in Equation (6) below.

For Training:
  Initialize: D_0 = X, e_0 = y
  For i = 1 to k:
    u_i = D_{i-1}^T W e_{i-1}
    s_i = D_{i-1} u_i
    β_i = s_i^T W e_{i-1} / (s_i^T W s_i)
    p_i = D_{i-1}^T W s_i / (s_i^T W s_i)
    e_i = e_{i-1} - s_i β_i
    D_i = D_{i-1} - s_i p_i^T

For Lookup:
  Initialize: d_0 = x, ŷ = 0
  For i = 1 to k:
    s_i = d_{i-1}^T u_i
    ŷ = ŷ + s_i β_i
    d_i = d_{i-1} - s_i p_i    (6)

As all single variable regressions are ordinary univariate least-squares minimizations, LWPLS makes the same statistical assumption as ordinary linear regressions, i.e., that only output variables have additive noise, but input variables are noiseless. The choice of the projections u, however, introduces an element in LWPLS that remains statistically still debated (Frank & Friedman, 1993), although, interestingly, there exists a strong similarity with the way projections are chosen in Cascade Correlation (Fahlman & Lebiere, 1990). A peculiarity of LWPLS is that it also regresses the inputs of the previous step against the projected inputs s in order to ensure the orthogonality of all the projections u. Since LWPLS chooses projections in a very powerful way, it can accomplish optimal function fits with only one single projection (i.e., k=1) for certain input distributions. We will address this issue in our empirical evaluations by comparing k-step LWPLS with 1-step LWPLS, abbreviated LWPLS_1.
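A compact implementation of the recursion in (6) may help; the following sketch assumes NumPy, weighted-mean-centered data, and function names of our own choosing, and is not taken from the paper.

```python
import numpy as np

def lwpls_fit(X, y, w, k):
    # training recursion of Equation (6); w are the kernel weights, data are mean zero
    D, e = X.copy(), y.copy()
    U, B, P = [], [], []
    for _ in range(k):
        u = D.T @ (w * e)                    # projection direction u_i
        s = D @ u                            # projected inputs (scores) s_i
        sWs = s @ (w * s)
        beta = (s @ (w * e)) / sWs           # univariate weighted regression on the residuals
        p = D.T @ (w * s) / sWs              # regress current inputs on the scores
        e = e - s * beta                     # deflate the output residuals
        D = D - np.outer(s, p)               # deflate the inputs to keep projections orthogonal
        U.append(u); B.append(beta); P.append(p)
    return U, B, P

def lwpls_predict(x_query, U, B, P):
    # lookup recursion of Equation (6) for a single mean-centered query input
    d, y_hat = x_query.copy(), 0.0
    for u, beta, p in zip(U, B, P):
        s = d @ u
        y_hat += s * beta
        d = d - s * p
    return y_hat
```

One-step LWPLS (LWPLS_1) simply corresponds to calling the fit with k=1.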
2.4 PRINCIPAL COMPONENT REGRESSION (LWPCR)

Although not optimal, a computationally efficient technique of dimensionality reduction for linear regression is principal component regression (LWPCR) (Massy, 1965). The inputs are projected onto the largest k principal components of the weighted covariance matrix of the input data by the matrix U:

U = [eigenvectors( Σ w_i (x_i - x̄)(x_i - x̄)^T / Σ w_i )]_{max(1:k)}    (7)

The regression coefficients β are thus calculated as:

β = (U^T X^T W X U)^{-1} U^T X^T W y    (8)

Equation (8) is inexpensive to evaluate since after projecting X with U, U^T X^T W X U becomes a diagonal matrix that is easy to invert. LWPCR assumes that the inputs have additive spherical noise, which includes the zero noise case. As LWPCR does not take the output data into account during dimensionality reduction, it is endangered by clipping input dimensions with low variance which nevertheless have an important contribution to the regression output. However, from a statistical point of view, it is less likely that low variance inputs have a significant contribution in a linear regression, as the confidence bands of the regression coefficients increase inversely proportionally with the variance of the associated input. If the input data has non-spherical noise, LWPCR is prone to focus the regression on irrelevant projections.
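For reference, here is a minimal sketch of LWPCR following Equations (7) and (8), again assuming NumPy, mean-centered data, and hypothetical helper names of our own. The joint-space variant LWPCA of Section 2.2 is analogous, except that the eigenvectors are computed on the joint data Z = [X y] and β is recovered via Equation (5).

```python
import numpy as np

def lwpcr_fit(X, y, w, k):
    # project onto the k largest weighted principal components, then regress (Eqs. 7-8)
    cov = (X * w[:, None]).T @ X / w.sum()        # weighted input covariance (data mean zero)
    eigval, eigvec = np.linalg.eigh(cov)
    U = eigvec[:, np.argsort(eigval)[::-1][:k]]   # k largest principal directions
    S = X @ U                                     # projected inputs
    beta = np.linalg.solve((S * w[:, None]).T @ S, (S * w[:, None]).T @ y)   # Eq. (8)
    return U, beta

def lwpcr_predict(x_query, U, beta):
    return (U.T @ x_query) @ beta
```

Since the k x k matrix inverted in Equation (8) is diagonal in theory, the solve could be replaced by k scalar divisions, which is what makes the method computationally attractive.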
3 MONTE CARLO EVALUATIONS

In order to evaluate the candidate methods, data sets with 5 inputs and 1 output were randomly generated. Each data set consisted of 2,000 training points and 10,000 test points, distributed either uniformly or nonuniformly in the unit hypercube. The outputs were generated by either a linear or a quadratic function. Afterwards, the 5-dimensional input space was projected into a 10-dimensional space by a randomly chosen distance preserving linear transformation. Finally, Gaussian noise of various magnitudes was added to both the 10-dimensional inputs and the one-dimensional output. For the test sets, the additive noise in the outputs was omitted. Each regression technique was localized by a Gaussian kernel (Equation (1)) with a 10-dimensional distance metric D = 10*I (D was manually chosen to ensure that the Gaussian kernel had sufficiently many data points and no \"data holes\" in the fringe areas of the kernel). The precise experimental conditions followed closely those suggested by Frank and Friedman (1993):

• 2 kinds of linear functions y = β_lin^T x for: i) β_lin = [1, 1, 1, 1, 1]^T, and ii) β_lin = [1, 2, 3, 4, 5]^T

• 2 kinds of quadratic functions y = β_lin^T x + β_quad^T [x_1^2, x_2^2, x_3^2, x_4^2, x_5^2]^T for: i) β_lin = [1, 1, 1, 1, 1]^T and β_quad = 0.1 [1, 1, 1, 1, 1]^T, and ii) β_lin = [1, 2, 3, 4, 5]^T and β_quad = 0.1 [1, 4, 9, 16, 25]^T

• 3 kinds of noise conditions, each with 2 sub-conditions:
  i) only output noise: a) low noise: local signal/noise ratio lsnr = 20, and b) high noise: lsnr = 2;
  ii) equal noise in all inputs and outputs: a) low noise: ε_{x,n} = ε_y = N(0, 0.01^2), n ∈ [1, 2, ..., 10], and b) high noise: ε_{x,n} = ε_y = N(0, 0.1^2), n ∈ [1, 2, ..., 10];
  iii) unequal noise in all inputs and outputs: a) low noise: ε_{x,n} = N(0, (0.01 n)^2), n ∈ [1, 2, ..., 10] and lsnr = 20, and b) high noise: ε_{x,n} = N(0, (0.01 n)^2), n ∈ [1, 2, ..., 10] and lsnr = 2;

• 2 kinds of input distributions: i) uniform in the unit hypercube, ii) uniform in the unit hypercube excluding data points which activate a Gaussian weighting function (1) at c = [0.5, 0, 0, 0, 0]^T with D = 10*I more than w = 0.2 (this forms a \"hyper kidney\" shaped distribution)

Every algorithm was run* 30 times on each of the 48 combinations of the conditions. Additionally, the complete test was repeated for three further conditions varying the dimensionality (called factors in accordance with LWFA) that the algorithms assumed to be the true dimensionality of the 10-dimensional data, from k=4 to 6, i.e., too few, correct, and too many factors. The average results are summarized in Figure 1.

* Except for LWFA, all methods can evaluate a data set in non-iterative calculations. LWFA was trained with EM for maximally 1000 iterations or until the log-likelihood increased by less than 1e-10 in one iteration.
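For concreteness, a rough sketch of the data-generating protocol just described is given below, assuming NumPy; the helper name and the noise magnitudes shown are placeholders rather than the exact lsnr-matched values used in the experiments.

```python
import numpy as np

def make_dataset(n_points=2000, noise_in=0.01, noise_out=0.01, quadratic=False, rng=None):
    rng = rng or np.random.default_rng()
    X5 = rng.uniform(0.0, 1.0, size=(n_points, 5))            # 5-D inputs in the unit hypercube
    beta_lin = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = X5 @ beta_lin
    if quadratic:
        y = y + 0.1 * (X5 ** 2) @ np.array([1.0, 4.0, 9.0, 16.0, 25.0])
    # distance preserving embedding into 10-D: a random matrix with orthonormal columns
    A = np.linalg.qr(rng.standard_normal((10, 5)))[0]
    X10 = X5 @ A.T
    X10 = X10 + noise_in * rng.standard_normal(X10.shape)      # additive input noise
    y = y + noise_out * rng.standard_normal(n_points)          # additive output noise
    return X10, y
```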
Figure 1a,b,c show the summary results of the three factor conditions. Besides averaging over the 30 trials per condition, each mean of these charts also averages over the two input distribution conditions and the linear and quadratic function conditions, as these four cases are frequently observed violations of the statistical assumptions in nonlinear function approximation with locally linear models. In Figure 1b the number of factors equals the underlying dimensionality of the problem, and all algorithms are essentially performing equally well. For perfectly Gaussian distributions in all random variables (not shown separately), LWFA's assumptions are perfectly fulfilled and it achieves the best results, however, almost indistinguishably closely followed by LWPLS. For the \"unequal noise\" condition, the two PCA based techniques, LWPCA and LWPCR, perform the worst since, as expected, they choose suboptimal projections. However, when violating the statistical assumptions, LWFA loses parts of its advantages, such that the summary results become fairly balanced in Figure 1b.

The quality of function fitting changes significantly when violating the correct number of factors, as illustrated in Figure 1a,c. For too few factors (Figure 1a), LWPCR performs worst because it randomly omits one of the principal components in the input data, without respect to how important it is for the regression. The second worst is LWFA: according to its assumptions it believes that the signal it cannot model must be noise, leading to a degraded estimate of the data's subspace and, consequently, degraded regression results. LWPLS has a clear lead in this test, closely followed by LWPCA and LWPLS_1.

With more factors than necessary (Figure 1c), it is now LWPCA which degrades. This effect is due to its extracting one very noise contaminated projection which strongly influences the recovery of the regression parameters in Equation (4). All other algorithms perform almost equally well, with LWFA and LWPLS taking a small lead.

[Figure 1 appears here: panels show regression results with 4, 5, and 6 factors and a summary panel, comparing LWFA, LWPCA, LWPCR, LWPLS, and LWPLS_1 on a logarithmic error scale under the only-output-noise, equal-noise, and unequal-noise conditions.]

Figure 1: Average summary results of Monte Carlo experiments. Each chart is primarily divided into the three major noise conditions, cf. headers in chart (a). In each noise condition, there are four further subdivisions: i) coefficients of the linear or quadratic model are equal with low added noise; ii) like i) with high added noise; iii) coefficients of the linear or quadratic model are different with low added noise; iv) like iii) with high added noise. Refer to the text and descriptions of the Monte Carlo studies for further explanations.

4 SUMMARY AND CONCLUSIONS

Figure 1d summarizes all the Monte Carlo experiments in a final average plot. Except for LWPLS, every other technique showed at least one clear weakness in one of our \"robustness\" tests. It was particularly an incorrect number of factors which made these weaknesses apparent. For high-dimensional regression problems, the local dimensionality, i.e., the number of factors, is not a clearly defined number but rather a varying quantity, depending on the way the generating process operates. Usually, this process does not need to generate locally low dimensional distributions, however, it often \"chooses\" to do so, for instance, as human arm movements follow stereotypic patterns even though they could generate arbitrary ones. Thus, local dimensionality reduction needs to find the appropriate number of local factors autonomously. Locally weighted partial least squares turned out to be a surprisingly robust technique for this purpose, even outperforming the statistically appealing probabilistic factor analysis. As in principal component analysis, LWPLS's number of factors can easily be controlled just based on a variance-cutoff threshold in input space (Frank & Friedman, 1993), while factor analysis usually requires expensive cross-validation techniques. Simple, variance-based control over the number of factors can actually improve the results of LWPCA and LWPCR in practice, since, as shown in Figure 1a, LWPCR is more robust towards overestimating the number of factors, while LWPCA is more robust towards an underestimation. If one is interested in dynamically growing the number of factors while obtaining already good regression results with too few factors, LWPCA and, especially, LWPLS seem to be appropriate; it should be noted how well one-factor LWPLS (LWPLS_1) already performed in Figure 1!
In conclusion, since locally weighted partial least squares was equally robust as locally weighted factor analysis towards additive noise in both input and output data, and, moreover, superior when mis-guessing the number of factors, it seems to be a most favorable technique for local dimensionality reduction for high dimensional regressions.

Acknowledgments

The authors are grateful to Geoffrey Hinton for reminding them of partial least squares. This work was supported by the ATR Human Information Processing Research Laboratories. S. Schaal's support includes the German Research Association, the Alexander von Humboldt Foundation, and the German Scholarship Foundation. S. Vijayakumar was supported by the Japanese Ministry of Education, Science, and Culture (Monbusho). C. G. Atkeson acknowledges the Air Force Office of Scientific Research grant F49-6209410362 and a National Science Foundation Presidential Young Investigators Award.

References

Atkeson, C. G., Moore, A. W., & Schaal, S. (1997a). \"Locally weighted learning.\" Artificial Intelligence Review, 11, 1-5, pp. 11-73.

Atkeson, C. G., Moore, A. W., & Schaal, S. (1997c). \"Locally weighted learning for control.\" Artificial Intelligence Review, 11, 1-5, pp. 75-113.

Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.

Everitt, B. S. (1984). An introduction to latent variable models. London: Chapman and Hall.

Fahlman, S. E., & Lebiere, C. (1990). \"The cascade-correlation learning architecture.\" In: Touretzky, D. S. (Ed.), Advances in Neural Information Processing Systems II, pp. 524-532. Morgan Kaufmann.

Frank, I. E., & Friedman, J. H. (1993). \"A statistical view of some chemometric regression tools.\" Technometrics, 35, 2, pp. 109-135.

Geman, S., Bienenstock, E., & Doursat, R. (1992). \"Neural networks and the bias/variance dilemma.\" Neural Computation, 4, pp. 1-58.

Horn, R. A., & Johnson, C. R. (1994). Matrix analysis. Press Syndicate of the University of Cambridge.

Jordan, M. I., & Jacobs, R. (1994). \"Hierarchical mixtures of experts and the EM algorithm.\" Neural Computation, 6, 2, pp. 181-214.

Massy, W. F. (1965). \"Principal component regression in exploratory statistical research.\" Journal of the American Statistical Association, 60, pp. 234-246.

Rubin, D. B., & Thayer, D. T. (1982). \"EM algorithms for ML factor analysis.\" Psychometrika, 47, 1, pp. 69-76.

Schaal, S., & Atkeson, C. G. (in press). \"Constructive incremental learning from only local information.\" Neural Computation.

Scott, D. W. (1992). Multivariate density estimation. New York: Wiley.

Vijayakumar, S., & Schaal, S. (1997). \"Local dimensionality reduction for locally weighted learning.\" In: International Conference on Computational Intelligence in Robotics and Automation, pp. 220-225, Monterey, CA, July 10-11, 1997.

Wold, H. (1975). \"Soft modeling by latent variables: the nonlinear iterative partial least squares approach.\" In: Gani, J. (Ed.), Perspectives in Probability and Statistics, Papers in Honour of M. S. Bartlett. Academic Press.

Xu, L., Jordan, M. I., & Hinton, G. E. (1995). \"An alternative model for mixture of experts.\" In: Tesauro, G., Touretzky, D. S., & Leen, T. K. (Eds.), Advances in Neural Information Processing Systems 7, pp. 633-640. Cambridge, MA: MIT Press.
", "award": [], "sourceid": 1387, "authors": [{"given_name": "Stefan", "family_name": "Schaal", "institution": null}, {"given_name": "Sethu", "family_name": "Vijayakumar", "institution": null}, {"given_name": "Christopher", "family_name": "Atkeson", "institution": null}]}