{"title": "Local Dimensionality Reduction", "book": "Advances in Neural Information Processing Systems", "page_first": 633, "page_last": 639, "abstract": "", "full_text": "Local Dimensionality Reduction \n\nSethu Vijayakumar 3, I \n\nChristopher G. Atkeson 4 \n\nsethu@cs.titech.ac.jp \n\ncga@cc.gatech.edu \n\nhttp://www.cc.gatech.edul \n\nStefan Schaal 1,2,4 \n\nsschaal@usc.edu \n\nhttp://www-slab.usc.edulsschaal \n\nhttp://ogawa(cid:173)\n\nwww.cs.titech.ac.jp/-sethu \n\nfac/Chris.Atkeson \n\nIERATO Kawato Dynamic Brain Project (IST), 2-2 Hikaridai, Seika-cho, Soraku-gun, 619-02 Kyoto \n\n2Dept. of Comp. Science & Neuroscience, Univ. of South. California HNB-I 03, Los Angeles CA 90089-2520 \n\n3Department of Computer Science, Tokyo Institute of Technology, Meguro-ku, Tokyo-I 52 \n\n4College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, GA 30332-0280 \n\nAbstract \n\nIf globally high dimensional data has locally only low dimensional distribu(cid:173)\ntions, it is advantageous to perform a local dimensionality reduction before \nfurther processing the data. In this paper we examine several techniques for \nlocal dimensionality reduction in the context of locally weighted linear re(cid:173)\ngression. As possible candidates, we derive local versions of factor analysis \nregression, principle component regression, principle component regression \non joint distributions, and partial least squares regression. After outlining the \nstatistical bases of these methods, we perform Monte Carlo simulations to \nevaluate their robustness with respect to violations of their statistical as(cid:173)\nsumptions. One surprising outcome is that locally weighted partial least \nsquares regression offers the best average results, thus outperforming even \nfactor analysis, the theoretically most appealing of our candidate techniques. 
\n\n1 INTRODUCTION \nRegression tasks involve mapping a n-dimensional continuous input vector x E ~n onto \na m-dimensional output vector y E ~m \u2022 They form a ubiquitous class of problems found \nin fields including process control, sensorimotor control, coordinate transformations, and \nvarious stages of information processing in biological nervous systems. This paper will \nfocus on spatially localized learning techniques, for example, kernel regression with \nGaussian weighting functions. Local learning offer advantages for real-time incremental \nlearning problems due to fast convergence, considerable robustness towards problems of \nnegative interference, and large tolerance in model selection (Atkeson, Moore, & Schaal, \n1997; Schaal & Atkeson, in press). Local learning is usually based on interpolating data \nfrom a local neighborhood around the query point. For high dimensional learning prob(cid:173)\nlems, however, it suffers from a bias/variance dilemma, caused by the nonintuitive fact \nthat \" ... [in high dimensions] if neighborhoods are local, then they are almost surely \nempty, whereas if a neighborhood is not empty, then it is not local.\" (Scott, 1992, p.198). \nGlobal learning methods, such as sigmoidal feedforward networks, do not face this \n\n\f634 \n\nS. School, S. Vijayakumar and C. G. Atkeson \n\nproblem as they do not employ neighborhood relations, although they require strong \nprior knowledge about the problem at hand in order to be successful. \nAssuming that local learning in high dimensions is a hopeless, however, is not necessar(cid:173)\nily warranted: being globally high dimensional does not imply that data remains high di(cid:173)\nmensional if viewed locally. For example, in the control of robot anns and biological \nanns we have shown that for estimating the inverse dynamics of an ann, a globally 21-\ndimensional space reduces on average to 4-6 dimensions locally (Vijayakumar & Schaal, \n1997). 
A local learning system that can robustly exploit such locally low dimensional distributions should be able to avoid the curse of dimensionality.

In pursuit of the question of what, in the context of local regression, is the "right" method to perform local dimensionality reduction, this paper will derive and compare several candidate techniques under i) perfectly fulfilled statistical prerequisites (e.g., Gaussian noise, Gaussian input distributions, perfectly linear data), and ii) less perfect conditions (e.g., non-Gaussian distributions, slightly quadratic data, incorrect guess of the dimensionality of the true data distribution). We will focus on nonlinear function approximation with locally weighted linear regression (LWR), as it allows us to adapt a variety of global linear dimensionality reduction techniques, and as LWR has found widespread application in several local learning systems (Atkeson, Moore, & Schaal, 1997; Jordan & Jacobs, 1994; Xu, Jordan, & Hinton, 1996). In particular, we will derive and investigate locally weighted principal component regression (LWPCR), locally weighted joint data principal component analysis (LWPCA), locally weighted factor analysis (LWFA), and locally weighted partial least squares (LWPLS). Section 2 will briefly outline these methods and their theoretical foundations, while Section 3 will empirically evaluate the robustness of these methods using synthetic data sets that increasingly violate some of the statistical assumptions of the techniques.

2 METHODS OF DIMENSIONALITY REDUCTION

We assume that our regression data originate from a generating process with two sets of observables, the "inputs" x̄ and the "outputs" ȳ. The characteristics of the process ensure a functional relation ȳ = f(x̄).
Both i and yare obtained through some measure(cid:173)\nment device that adds independent mean zero noise of different magnitude in each ob(cid:173)\nservable, such that x == i + Ex and y = y + Ey \u2022 For the sake of simplicity, we will only fo(cid:173)\ncus on one-dimensional output data (m=l) and functions / that are either linear or \nslightly quadratic, as these cases are the most common in nonlinear function approxima(cid:173)\ntion with locally linear models. Locality of the regression is ensured by weighting the er(cid:173)\nror of each data point with a weight from a Gaussian kernel: \nWi = exp(-O.5(Xi - Xqf D(Xi - Xq)) \n\n(1) \n\nXtt denotes the query point, and D a positive semi-definite distance metric which deter(cid:173)\nmmes the size and shape of the neighborhood contributing to the regression (Atkeson et \naI., 1997). The parameters Xq and D can be determined in the framework of nonparamet(cid:173)\nric statistics (Schaal & Atkeson, in press) or parametric maximum likelihood estimations \n(Xu et aI, 1995}- for the present study they are determined manually since their origin is \nsecondary to the results of this paper. Without loss of generality, all our data sets will set \n!,q to the zero vector, compute the weights, and then translate the input data such that the \nlocally weighted mean, i = L WI Xi / L Wi , is zero. The output data is equally translated to \nbe mean zero. Mean zero data is necessary for most of techniques considered below. The \n(translated) input data is summarized in the rows of the matrix X, the corresponding \n(translated) outputs are the elements of the vector y, and the corresponding weights are in \nthe diagonal matrix W. In some cases, we need the joint input and output data, denoted \nas Z=[X y). 
\n\n\fLocal Dimensionality Reduction \n\n2.1 FACTORANALYSIS(LWFA) \n\n635 \n\nFactor analysis (Everitt, 1984) is a technique of dimensionality reduction which is the \nmost appropriate given the generating process of our regression data. It assumes the ob(cid:173)\nserved data z was produced. by a mean zero independently distributed k -dimensional \nvector of factors v, transformed by the matrix U, and contaminated by mean zero inde(cid:173)\npendent noise f: with diagonal covariance matrix Q: \n\nz=Uv+f:, where z=[xT,yt and f:=[f:~,t:yr \n\n(2) \n\nIf both v and f: are normally distributed, the parameters Q and U can be obtained itera(cid:173)\ntively by the Expectation-Maximization algorithm (EM) (Rubin & Thayer, 1982). For a \nlinear regression problem, one assumes that z was generated with U=[I, f3 Y and v = i, \nwhere f3 denotes the vector of regression coefficients of the linear model y = f31 x, and I \nthe identity matrix. After calculating Q and U by EM in joint data space as formulated in \n(2), an estimate of f3 can be derived from the conditional probability p(y I x). As all \ndistributions are assumed to be normal, the expected value ofy is the mean of this condi(cid:173)\ntional distribution. The locally weighted version (L WF A) of f3 can be obtained together \nwith an estimate of the factors v from the joint weighted covariance matrix 'I' of z and v: \n\nE{[: ] + [ ~ } ~ ~,,~,;'x, where ~ ~ [ZT, VT~~Jft: w; ~ \n\n(3) \n\n[ Q+UU T U] ['I'II(=n x n) \n= \n\nI = '\u00a521(= (m + k) x n) \n\nUT \n\n'I'12(=nX(m+k\u00bb)] \n'1'22(= (m + k) x (m + k\u00bb) \n\nwhere E { .} denotes the expectation operator and B a matrix of coefficients involved in \nestimating the factors v. Note that unless the noise f: is zero, the estimated f3 is different \nfrom the true f3 as it tries to average out the noise in the data. 
\n\n2.2 JOINT-SPACE PRINCIPAL COMPONENT ANALYSIS (LWPCA) \nAn alternative way of determining the parameters f3 in a reduced space employs locally \nweighted principal component analysis (LWPCA) in the joint data space. By defining the . \nlargest k+ 1 principal components of the weighted covariance matrix ofZ as U: \n\nU = [eigenvectors(I Wi (Zi - ZXZi - Z)T II Wi)] \n\nmax(l:k+1l \n\n(4) \n\nand noting that the eigenvectors in U are unit length, the matrix inversion theorem (Hom \n& Johnson, 1994) provides a means to derive an efficient estimate of f3 \n\n[Ux(=nXk)] \nf3=U x Uy -Uy UyUy -I UyUyt where U= Uy(=mxk) \n\n( T \n\nT( \n\nT \n\n)-1 \n\nT\\ \n\n(5) \n\nIn our one dimensional output case, U y is just a (1 x k) -dimensional row vector and the \nevaluation of (5) does not require a matrix inversion anymore but rather a division. \nIf one assumes normal distributions in all variables as in L WF A, L WPCA is the special \ncase of L WF A where the noise covariance Q is spherical, i.e., the same magnitude of \nnoise in all observables. Under these circumstances, the subspaces spanned by U in both \nmethods will be the same. However, the regression coefficients of L WPCA will be dif(cid:173)\nferent from those of L WF A unless the noise level is zero, as L WF A optimizes the coeffi(cid:173)\ncients according to the noise in the data (Equation (3\u00bb . Thus, for normal distributions \nand a correct guess of k, L WPCA is always expected to perform worse than L WF A. \n\n\f636 \n\nS. Schaal, S. Vijayakumar and C. G. Atkeson \n\nFor Training: \nInitialize: \nDo = X, \nFor i = 1 to k: \n\n2.3 PARTIAL LEAST SQUARES (LWPLS, LWPLS_I) \nPartial least squares (Wold, 1975; Frank & Friedman, 1993) recursively computes or(cid:173)\nthogonal projections of the input data and performs single variable regressions along \nthese projections on the residuals of the previous iteration step. 
A locally weighted version of partial least squares (LWPLS) proceeds as shown in Equation (6) below:

    For Training:
        Initialize: D_0 = X, e_0 = y
        For i = 1 to k:
            u_i = D_{i-1}^T W e_{i-1}
            s_i = D_{i-1} u_i
            β_i = s_i^T W e_{i-1} / (s_i^T W s_i)
            e_i = e_{i-1} - s_i β_i
            p_i = D_{i-1}^T W s_i / (s_i^T W s_i)
            D_i = D_{i-1} - s_i p_i^T

    For Lookup:
        Initialize: d_0 = x, ŷ = 0
        For i = 1 to k:
            s_i = d_{i-1}^T u_i
            ŷ = ŷ + s_i β_i
            d_i = d_{i-1} - s_i p_i    (6)

As all single variable regressions are ordinary univariate least-squares minimizations, LWPLS makes the same statistical assumption as ordinary linear regressions, i.e., that only output variables have additive noise, but input variables are noiseless. The choice of the projections u, however, introduces an element in LWPLS that remains statistically still debated (Frank & Friedman, 1993), although, interestingly, there exists a strong similarity with the way projections are chosen in Cascade Correlation (Fahlman & Lebiere, 1990). A peculiarity of LWPLS is that it also regresses the inputs of the previous step against the projected inputs s in order to ensure the orthogonality of all the projections u. Since LWPLS chooses projections in a very powerful way, it can accomplish optimal function fits with only one single projection (i.e., k = 1) for certain input distributions. We will address this issue in our empirical evaluations by comparing k-step LWPLS with 1-step LWPLS, abbreviated LWPLS_1.

2.4 PRINCIPAL COMPONENT REGRESSION (LWPCR)

Although not optimal, a computationally efficient technique of dimensionality reduction for linear regression is principal component regression (LWPCR) (Massy, 1965).
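Before turning to the details of LWPCR, note that the LWPLS recursion of Equation (6) translates almost line by line into code. The following is our own NumPy sketch (names mirror the pseudocode; the diagonal weight matrix W is built explicitly for clarity):

```python
import numpy as np

def lwpls_train(X, y, w, k):
    """Training pass of Equation (6); returns projections and slopes."""
    W = np.diag(w)
    D, e = X.copy(), y.copy()
    U, P, B = [], [], []
    for _ in range(k):
        u = D.T @ W @ e                      # projection direction u_i
        s = D @ u                            # projected inputs s_i
        beta = (s @ W @ e) / (s @ W @ s)     # univariate regression slope
        p = (D.T @ W @ s) / (s @ W @ s)      # input regression (orthogonality)
        e = e - s * beta                     # deflate output residual
        D = D - np.outer(s, p)               # deflate inputs
        U.append(u); P.append(p); B.append(beta)
    return U, P, B

def lwpls_lookup(x, U, P, B):
    """Lookup pass of Equation (6) for a single query input x."""
    d, y_hat = x.copy(), 0.0
    for u, p, beta in zip(U, P, B):
        s = d @ u
        y_hat += s * beta
        d = d - s * p
    return y_hat
```

With k equal to the input dimensionality and noiseless linear data, the recursion reproduces the weighted least squares fit exactly, which is a useful sanity check of the deflation steps.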
The inputs are projected onto the largest k principal components of the weighted covariance matrix of the input data by the matrix U:

    U = [eigenvectors(Σ_i w_i (x_i - x̄)(x_i - x̄)^T / Σ_i w_i)]_{max(1:k)}    (7)

The regression coefficients β are thus calculated as:

    β = (U^T X^T W X U)^{-1} U^T X^T W y    (8)

Equation (8) is inexpensive to evaluate since after projecting X with U, U^T X^T W X U becomes a diagonal matrix that is easy to invert. LWPCR assumes that the inputs have additive spherical noise, which includes the zero noise case. As LWPCR does not take the output data into account during dimensionality reduction, it is endangered by clipping input dimensions with low variance which nevertheless have important contributions to the regression output. However, from a statistical point of view, it is less likely that low variance inputs have significant contributions in a linear regression, as the confidence bands of the regression coefficients increase inversely proportionally with the variance of the associated input. If the input data has non-spherical noise, LWPCR is prone to focus the regression on irrelevant projections.

3 MONTE CARLO EVALUATIONS

In order to evaluate the candidate methods, data sets with 5 inputs and 1 output were randomly generated. Each data set consisted of 2,000 training points and 10,000 test points, distributed either uniformly or nonuniformly in the unit hypercube. The outputs were generated by either a linear or quadratic function. Afterwards, the 5-dimensional input space was projected into a 10-dimensional space by a randomly chosen distance preserving linear transformation. Finally, Gaussian noise of various magnitudes was added to both the 10-dimensional inputs and the one-dimensional output. For the test sets, the additive noise in the outputs was omitted.
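The data-generating procedure above can be sketched as follows (our own reconstruction; the distance-preserving embedding is taken to be a matrix with orthonormal columns obtained via QR, which is one way to satisfy the description):

```python
import numpy as np

def make_dataset(n_points, beta_lin, beta_quad, noise_in, noise_out, rng):
    """5-D inputs in the unit hypercube, embedded in 10-D, plus noise."""
    X5 = rng.uniform(0.0, 1.0, (n_points, 5))
    y = X5 @ beta_lin + (X5 ** 2) @ beta_quad       # linear + quadratic part
    # Distance-preserving linear map R^5 -> R^10: orthonormal columns
    A, _ = np.linalg.qr(rng.normal(size=(10, 5)))
    X10 = X5 @ A.T
    X10 = X10 + noise_in * rng.normal(size=X10.shape)
    y = y + noise_out * rng.normal(size=n_points)
    return X10, y

rng = np.random.default_rng(5)
X10, y = make_dataset(2000, np.ones(5), 0.1 * np.ones(5), 0.01, 0.01, rng)
```

Because A has orthonormal columns, ||A(x1 - x2)|| = ||x1 - x2||, so pairwise distances are preserved by the embedding before noise is added.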
Each regression technique was localized by a Gaussian kernel (Equation (1)) with a 10-dimensional distance metric D = 10*I (D was manually chosen to ensure that the Gaussian kernel had sufficiently many data points and no "data holes" in the fringe areas of the kernel). The precise experimental conditions followed closely those suggested by Frank and Friedman (1993):

• 2 kinds of linear functions y = β_lin^T x for: i) β_lin = [1, 1, 1, 1, 1]^T, and ii) β_lin = [1, 2, 3, 4, 5]^T

• 2 kinds of quadratic functions y = β_lin^T x + β_quad^T [x_1^2, x_2^2, x_3^2, x_4^2, x_5^2]^T for: i) β_lin = [1, 1, 1, 1, 1]^T and β_quad = 0.1 [1, 1, 1, 1, 1]^T, and ii) β_lin = [1, 2, 3, 4, 5]^T and β_quad = 0.1 [1, 4, 9, 16, 25]^T

• 3 kinds of noise conditions, each with 2 sub-conditions:
  i) only output noise: a) low noise: local signal/noise ratio lsnr = 20, and b) high noise: lsnr = 2;
  ii) equal noise in inputs and outputs: a) low noise: ε_{x,n} = ε_y = N(0, 0.01^2), n ∈ {1, 2, ..., 10}, and b) high noise: ε_{x,n} = ε_y = N(0, 0.1^2), n ∈ {1, 2, ..., 10};
  iii) unequal noise in inputs and outputs: a) low noise: ε_{x,n} = N(0, (0.01 n)^2), n ∈ {1, 2, ..., 10} and lsnr = 20, and b) high noise: ε_{x,n} = N(0, (0.01 n)^2), n ∈ {1, 2, ..., 10} and lsnr = 2;

• 2 kinds of input distributions: i) uniform in the unit hypercube, ii) uniform in the unit hypercube excluding data points which activate a Gaussian weighting function (1) at c = [0.5, 0, 0, 0, 0]^T with D = 10*I more than w = 0.2 (this forms a "hyper kidney" shaped distribution)

Every algorithm was run 30 times* on each of the 48 combinations of the conditions. Additionally, the complete test was repeated for three further conditions varying the dimensionality (called factors, in accordance with LWFA) that the algorithms assumed to be the true dimensionality of the 10-dimensional data, from k = 4 to 6, i.e., too few, correct, and too many factors.
The average results are summarized in Figure 1.

Figures 1a,b,c show the summary results of the three factor conditions. Besides averaging over the 30 trials per condition, each mean of these charts also averages over the two input distribution conditions and the linear and quadratic function condition, as these four cases are frequently observed violations of the statistical assumptions in nonlinear function approximation with locally linear models. In Figure 1b the number of factors equals the underlying dimensionality of the problem, and all algorithms are essentially performing equally well. For perfectly Gaussian distributions in all random variables (not shown separately), LWFA's assumptions are perfectly fulfilled and it achieves the best results, however, almost indistinguishably closely followed by LWPLS. For the "unequal noise condition", the two PCA based techniques, LWPCA and LWPCR, perform the worst since, as expected, they choose suboptimal projections. However, when violating the statistical assumptions, LWFA loses part of its advantages, such that the summary results become fairly balanced in Figure 1b.

The quality of function fitting changes significantly when violating the correct number of factors, as illustrated in Figures 1a,c. For too few factors (Figure 1a), LWPCR performs worst because it randomly omits one of the principal components in the input data, without respect to how important it is for the regression. The second worst is LWFA: according to its assumptions it believes that the signal it cannot model must be noise, leading to a degraded estimate of the data's subspace and, consequently, degraded regression results. LWPLS has a clear lead in this test, closely followed by LWPCA and LWPLS_1.

For more factors than necessary (Figure 1c), it is now LWPCA which degrades. This effect is due to its extracting one very noise contaminated projection which strongly influences the recovery of the regression parameters in Equation (4). All other algorithms perform almost equally well, with LWFA and LWPLS taking a small lead.

* Except for LWFA, all methods can evaluate a data set in non-iterative calculations. LWFA was trained with EM for maximally 1000 iterations or until the log-likelihood increased less than 1e-10 in one iteration.

[Figure 1 appears here: four panels of average regression error on a logarithmic scale: a) regression results with 4 factors, b) regression results with 5 factors, c) regression results with 6 factors, d) summary results; each panel compares LWFA, LWPCA, LWPCR, LWPLS, and LWPLS_1 under the conditions "only output noise", "equal noise in all inputs and outputs", and "unequal noise in all inputs and outputs".]

Figure 1: Average summary results of Monte Carlo experiments. Each chart is primarily divided into the three major noise conditions, cf. headers in chart (a). In each noise condition, there are four further subdivisions: i) coefficients of linear or quadratic model are equal with low added noise; ii) like i) with high added noise; iii) coefficients of linear or quadratic model are different with low noise added; iv) like iii) with high added noise. Refer to text and descriptions of Monte Carlo studies for further explanations.
\n\n\fLocal Dimensionality Reduction \n\n639 \n\n4 SUMMARY AND CONCLUSIONS \nFigure 1 d summarizes all the Monte Carlo experiments in a final average plot. Except for \nL WPLS, every other technique showed at least one clear weakness in one of our \"robust(cid:173)\nness\" tests. It was particularly an incorrect number of factors which made these weak(cid:173)\nnesses apparent. For high-dimensional regression problems, the local dimensionality, i.e., \nthe number of factors, is not a clearly defined number but rather a varying quantity, de(cid:173)\npending on the way the generating process operates. Usually, this process does not need \nto generate locally low dimensional distributions, however, it often \"chooses\" to do so, \nfor instance, as human ann movements follow stereotypic patterns despite they could \ngenerate arbitrary ones. Thus, local dimensionality reduction needs to find autonomously \nthe appropriate number of local factor. Locally weighted partial least squares turned out \nto be a surprisingly robust technique for this purpose, even outperforming the statistically \nappealing probabilistic factor analysis. As in principal component analysis, LWPLS's \nnumber of factors can easily be controlled just based on a variance-cutoff threshold in in(cid:173)\nput space (Frank & Friedman, 1993), while factor analysis usually requires expensive \ncross-validation techniques. Simple, variance-based control over the number of factors \ncan actually improve the results of L WPCA and L WPCR in practice, since, as shown in \nFigure I a, L WPCR is more robust towards overestimating the number of factors, while \nL WPCA is more robust towards an underestimation. If one is interested in dynamically \ngrowing the number of factors while obtaining already good regression results with too \nfew factors, L WPCA and, especially, L WPLS seem to be appropriate-it should be \nnoted how well one factor L WPLS (L WPLS_l) already performed in Figure I! 
\nIn conclusion, since locally weighted partial least squares was equally robust as local \nweighted factor analysis towards additive noise in. both input and output data, and, \nmoreover, superior when mis-guessing the number of factors, it seems to be a most fa(cid:173)\nvorable technique for local dimensionality reduction for high dimensional regressions. \n\nAcknowledgments \nThe authors are grateful to Geoffrey Hinton for reminding them of partial least squares. This work was sup(cid:173)\nported by the ATR Human Information Processing Research Laboratories. S. Schaal's support includes the \nGerman Research Association, the Alexander von Humboldt Foundation, and the German Scholarship Founda(cid:173)\ntion. S. Vijayakumar was supported by the Japanese Ministry of Education, Science, and Culture (Monbusho). \nC. G. Atkeson acknowledges the Air Force Office of Scientific Research grant F49-6209410362 and a National \nScience Foundation Presidential Young Investigators Award. \nReferences \n\ntures of experts and the EM algorithm.\" Neural Com-\nputation, 6, 2, pp.181-214. \n\nAtkeson, C. G., Moore, A. W., & Schaal, S, (1997a). Massy, W. F, (1965). \"Principle component regression \n\"Locally weighted learning.\" ArtifiCial Intelligence Re- in exploratory statistical research.\" Journal of the \nview, 11, 1-5, pp.II-73. \nAmerican Statistical Association, 60, pp.234-246. \nAtkeson, C. G., Moore, A. W., & Schaal, S, (1997c). Rubin, D. B., & Thayer, D. T, (l982). \"EM algorithms \n\"Locally weighted learning for control.\" ArtifiCial In-\nfor ML factor analysis.\" Psychometrika, 47, I, 69-76. \nSchaal, S., & Atkeson, C. G, (in press). \"Constructive \ntelligence Review, 11, 1-5, pp.75-113. \nBelsley, D. A., Kuh, E., & Welsch, R. E, (1980). Re-\nincremental learning from only local information.\" \ngression diagnostics: Identifying influential data and Neural Computation. \nsources of collinearity. New York: Wiley. \nEveritt, B. S, (1984). 
An introduction to latent variable New York: Wiley. \nmodels. London: Chapman and Hall. \nFahlman, S. E. ,Lebiere, C, (1990). \"The cascade-\ncorrelation learning architecture.\" In: Touretzky, D. S. International Conference on Computational Intelli-\n(Ed.), Advances in Neural Information Processing \nSystems II, pp.524-532. Morgan Kaufmann. \nFrank, I. E., & Friedman, 1. H, (1993). \"A statistical Wold, H. (1975). \"Soft modeling by latent variables: \nview of some chemometric regression tools.\" Tech-\nthe nonlinear iterative partial least squares approach.\" \nIn: Gani, J. (Ed.), Perspectives in Probability and Sta-\nnometrics, 35, 2, pp.l09-135. \nGeman, S., Bienenstock, E., & Doursat, R. (1992). \ntistics, Papers in Honour ofM S. Bartlett. Aca