{"title": "Serial Order in Reading Aloud: Connectionist Models and Neighborhood Structure", "book": "Advances in Neural Information Processing Systems", "page_first": 59, "page_last": 65, "abstract": null, "full_text": "Local Dimensionality Reduction \n\nSethu Vijayakumar 3, I \n\nChristopher G. Atkeson 4 \n\nsethu@cs.titech.ac.jp \n\ncga@cc.gatech.edu \n\nhttp://www.cc.gatech.edul \n\nStefan Schaal 1,2,4 \n\nsschaal@usc.edu \n\nhttp://www-slab.usc.edulsschaal \n\nhttp://ogawa(cid:173)\n\nwww.cs.titech.ac.jp/-sethu \n\nfac/Chris.Atkeson \n\nIERATO Kawato Dynamic Brain Project (IST), 2-2 Hikaridai, Seika-cho, Soraku-gun, 619-02 Kyoto \n\n2Dept. of Comp. Science & Neuroscience, Univ. of South. California HNB-I 03, Los Angeles CA 90089-2520 \n\n3Department of Computer Science, Tokyo Institute of Technology, Meguro-ku, Tokyo-I 52 \n\n4College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, GA 30332-0280 \n\nAbstract \n\nIf globally high dimensional data has locally only low dimensional distribu(cid:173)\ntions, it is advantageous to perform a local dimensionality reduction before \nfurther processing the data. In this paper we examine several techniques for \nlocal dimensionality reduction in the context of locally weighted linear re(cid:173)\ngression. As possible candidates, we derive local versions of factor analysis \nregression, principle component regression, principle component regression \non joint distributions, and partial least squares regression. After outlining the \nstatistical bases of these methods, we perform Monte Carlo simulations to \nevaluate their robustness with respect to violations of their statistical as(cid:173)\nsumptions. One surprising outcome is that locally weighted partial least \nsquares regression offers the best average results, thus outperforming even \nfactor analysis, the theoretically most appealing of our candidate techniques. 
1 INTRODUCTION

Regression tasks involve mapping an n-dimensional continuous input vector x ∈ R^n onto an m-dimensional output vector y ∈ R^m. They form a ubiquitous class of problems found in fields including process control, sensorimotor control, coordinate transformations, and various stages of information processing in biological nervous systems. This paper will focus on spatially localized learning techniques, for example, kernel regression with Gaussian weighting functions. Local learning offers advantages for real-time incremental learning problems due to fast convergence, considerable robustness towards problems of negative interference, and large tolerance in model selection (Atkeson, Moore, & Schaal, 1997; Schaal & Atkeson, in press). Local learning is usually based on interpolating data from a local neighborhood around the query point. For high dimensional learning problems, however, it suffers from a bias/variance dilemma, caused by the nonintuitive fact that "... [in high dimensions] if neighborhoods are local, then they are almost surely empty, whereas if a neighborhood is not empty, then it is not local." (Scott, 1992, p. 198). Global learning methods, such as sigmoidal feedforward networks, do not face this problem as they do not employ neighborhood relations, although they require strong prior knowledge about the problem at hand in order to be successful.

634    S. Schaal, S. Vijayakumar and C. G. Atkeson

Assuming that local learning in high dimensions is hopeless, however, is not necessarily warranted: being globally high dimensional does not imply that data remains high dimensional if viewed locally. For example, in the control of robot arms and biological arms we have shown that for estimating the inverse dynamics of an arm, a globally 21-dimensional space reduces on average to 4-6 dimensions locally (Vijayakumar & Schaal, 1997).
A local learning system that can robustly exploit such locally low dimensional distributions should be able to avoid the curse of dimensionality.

In pursuit of the question of what, in the context of local regression, is the "right" method to perform local dimensionality reduction, this paper will derive and compare several candidate techniques under i) perfectly fulfilled statistical prerequisites (e.g., Gaussian noise, Gaussian input distributions, perfectly linear data), and ii) less perfect conditions (e.g., non-Gaussian distributions, slightly quadratic data, incorrect guess of the dimensionality of the true data distribution). We will focus on nonlinear function approximation with locally weighted linear regression (LWR), as it allows us to adapt a variety of global linear dimensionality reduction techniques, and as LWR has found widespread application in several local learning systems (Atkeson, Moore, & Schaal, 1997; Jordan & Jacobs, 1994; Xu, Jordan, & Hinton, 1996). In particular, we will derive and investigate locally weighted principal component regression (LWPCR), locally weighted joint data principal component analysis (LWPCA), locally weighted factor analysis (LWFA), and locally weighted partial least squares (LWPLS). Section 2 will briefly outline these methods and their theoretical foundations, while Section 3 will empirically evaluate the robustness of these methods using synthetic data sets that increasingly violate some of the statistical assumptions of the techniques.

2 METHODS OF DIMENSIONALITY REDUCTION

We assume that our regression data originate from a generating process with two sets of observables, the "inputs" x̃ and the "outputs" ỹ. The characteristics of the process ensure a functional relation ỹ = f(x̃).
Both x̃ and ỹ are obtained through some measurement device that adds independent mean zero noise of different magnitude to each observable, such that x = x̃ + ε_x and y = ỹ + ε_y. For the sake of simplicity, we will only focus on one-dimensional output data (m=1) and functions f that are either linear or slightly quadratic, as these cases are the most common in nonlinear function approximation with locally linear models. Locality of the regression is ensured by weighting the error of each data point with a weight from a Gaussian kernel:

w_i = exp(-0.5 (x_i - x_q)^T D (x_i - x_q))     (1)

x_q denotes the query point, and D a positive semi-definite distance metric which determines the size and shape of the neighborhood contributing to the regression (Atkeson et al., 1997). The parameters x_q and D can be determined in the framework of nonparametric statistics (Schaal & Atkeson, in press) or parametric maximum likelihood estimations (Xu et al., 1995); for the present study they are determined manually since their origin is secondary to the results of this paper. Without loss of generality, all our data sets will set x_q to the zero vector, compute the weights, and then translate the input data such that the locally weighted mean, x̄ = Σ w_i x_i / Σ w_i, is zero. The output data is equally translated to be mean zero. Mean zero data is necessary for most of the techniques considered below. The (translated) input data is summarized in the rows of the matrix X, the corresponding (translated) outputs are the elements of the vector y, and the corresponding weights are in the diagonal matrix W. In some cases, we need the joint input and output data, denoted as Z = [X y].
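As a concrete sketch of this setup (the function and variable names are ours; no code accompanies the paper), the kernel weights of Equation (1) and the weighted mean-centering can be computed as:

```python
import numpy as np

def lwr_setup(X_raw, y_raw, x_q, D):
    """Compute Gaussian kernel weights (Equation (1)) and mean-center the
    data under those weights, as described in Section 2. A sketch; the
    names X_raw, y_raw, etc. are our own conventions."""
    diff = X_raw - x_q                           # x_i - x_q for every data point
    w = np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, D, diff))
    x_bar = w @ X_raw / w.sum()                  # locally weighted mean of inputs
    y_bar = w @ y_raw / w.sum()                  # locally weighted mean of outputs
    X = X_raw - x_bar                            # translated inputs (rows of X)
    y = y_raw - y_bar                            # translated outputs
    W = np.diag(w)                               # diagonal weight matrix
    Z = np.column_stack([X, y])                  # joint data Z = [X y]
    return X, y, W, Z
```

After this step, the weighted means of both X and y are zero, which the techniques below assume.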
2.1 FACTOR ANALYSIS (LWFA)

Factor analysis (Everitt, 1984) is a technique of dimensionality reduction which is the most appropriate given the generating process of our regression data. It assumes the observed data z was produced by a mean zero independently distributed k-dimensional vector of factors v, transformed by the matrix U, and contaminated by mean zero independent noise ε with diagonal covariance matrix Q:

z = U v + ε,  where z = [x^T, y]^T and ε = [ε_x^T, ε_y]^T     (2)

If both v and ε are normally distributed, the parameters Q and U can be obtained iteratively by the Expectation-Maximization algorithm (EM) (Rubin & Thayer, 1982). For a linear regression problem, one assumes that z was generated with U = [I, β]^T and v = x̃, where β denotes the vector of regression coefficients of the linear model y = β^T x, and I the identity matrix. After calculating Q and U by EM in joint data space as formulated in (2), an estimate of β can be derived from the conditional probability p(y | x). As all distributions are assumed to be normal, the expected value of y is the mean of this conditional distribution. The locally weighted version (LWFA) of β can be obtained together with an estimate of the factors v from the joint weighted covariance matrix Ψ of z and v:

E{[y; v] | x} = [β^T; B] x = Ψ_21 Ψ_11^{-1} x,  where  Ψ = Σ w_i [z_i; v_i][z_i^T, v_i^T] / Σ w_i     (3)

  = [ Q + U U^T   U ]   =   [ Ψ_11 (= n × n)       Ψ_12 (= n × (m+k))     ]
    [ U^T         I ]       [ Ψ_21 (= (m+k) × n)   Ψ_22 (= (m+k) × (m+k)) ]

where E{·} denotes the expectation operator and B a matrix of coefficients involved in estimating the factors v. Note that unless the noise ε is zero, the estimated β is different from the true β as it tries to average out the noise in the data.
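Given U and Q from a weighted EM fit, the conditional-mean step of Equation (3) that yields the regression coefficients can be sketched as follows. This computes only the β part (not the factor-estimation coefficients B), using the joint covariance of z implied by the model; the function and variable names are our own, not the authors' code:

```python
import numpy as np

def lwfa_beta(U, Q, n):
    """Regression coefficients implied by a fitted factor analysis model
    (Section 2.1). U and Q are assumed to come from a weighted EM fit in
    joint space z = [x; y]; n is the input dimensionality. With
    Cov(z) = U U^T + Q, the conditional mean E{y | x} is linear in x with
    coefficients Psi_xx^{-1} psi_xy (a sketch of the beta part of
    Equation (3))."""
    Psi = U @ U.T + Q          # model covariance of the joint data z
    Psi_xx = Psi[:n, :n]       # input-input block (n x n)
    psi_xy = Psi[:n, n:]       # input-output block (n x m)
    return np.linalg.solve(Psi_xx, psi_xy)
```

With nonzero noise on the inputs, the returned β shrinks relative to the true β, which is exactly the averaging-out behavior noted above.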
2.2 JOINT-SPACE PRINCIPAL COMPONENT ANALYSIS (LWPCA)

An alternative way of determining the parameters β in a reduced space employs locally weighted principal component analysis (LWPCA) in the joint data space. By defining the largest k+1 principal components of the weighted covariance matrix of Z as U:

U = [eigenvectors(Σ w_i (z_i - z̄)(z_i - z̄)^T / Σ w_i)]_max(1:k+1)     (4)

and noting that the eigenvectors in U are unit length, the matrix inversion theorem (Horn & Johnson, 1994) provides a means to derive an efficient estimate of β:

β = U_x (U_y^T - U_y^T (U_y U_y^T - I)^{-1} U_y U_y^T),  where  U = [ U_x (= n × k) ]
                                                                    [ U_y (= m × k) ]     (5)

In our one dimensional output case, U_y is just a (1 × k)-dimensional row vector and the evaluation of (5) does not require a matrix inversion anymore but rather a division.

If one assumes normal distributions in all variables as in LWFA, LWPCA is the special case of LWFA where the noise covariance Q is spherical, i.e., the same magnitude of noise in all observables. Under these circumstances, the subspaces spanned by U in both methods will be the same. However, the regression coefficients of LWPCA will be different from those of LWFA unless the noise level is zero, as LWFA optimizes the coefficients according to the noise in the data (Equation (3)). Thus, for normal distributions and a correct guess of k, LWPCA is always expected to perform worse than LWFA.

636    S. Schaal, S. Vijayakumar and C. G. Atkeson

2.3 PARTIAL LEAST SQUARES (LWPLS, LWPLS_1)

Partial least squares (Wold, 1975; Frank & Friedman, 1993) recursively computes orthogonal projections of the input data and performs single variable regressions along these projections on the residuals of the previous iteration step.
A locally weighted version of partial least squares (LWPLS) proceeds as shown in Equation (6) below.

For Training:                                (6)
  Initialize: D_0 = X, e_0 = y
  For i = 1 to k:
    u_i = D_{i-1}^T W e_{i-1}
    s_i = D_{i-1} u_i
    β_i = s_i^T W e_{i-1} / (s_i^T W s_i)
    e_i = e_{i-1} - s_i β_i
    p_i = D_{i-1}^T W s_i / (s_i^T W s_i)
    D_i = D_{i-1} - s_i p_i^T

For Lookup:
  Initialize: d_0 = x, ŷ = 0
  For i = 1 to k:
    s_i = d_{i-1}^T u_i
    ŷ = ŷ + s_i β_i
    d_i = d_{i-1} - s_i p_i

As all single variable regressions are ordinary univariate least-squares minimizations, LWPLS makes the same statistical assumption as ordinary linear regressions, i.e., that only output variables have additive noise, but input variables are noiseless. The choice of the projections u, however, introduces an element in LWPLS that remains statistically still debated (Frank & Friedman, 1993), although, interestingly, there exists a strong similarity with the way projections are chosen in Cascade Correlation (Fahlman & Lebiere, 1990). A peculiarity of LWPLS is that it also regresses the inputs of the previous step against the projected inputs s in order to ensure the orthogonality of all the projections u. Since LWPLS chooses projections in a very powerful way, it can accomplish optimal function fits with only one single projection (i.e., k=1) for certain input distributions. We will address this issue in our empirical evaluations by comparing k-step LWPLS with 1-step LWPLS, abbreviated LWPLS_1.

2.4 PRINCIPAL COMPONENT REGRESSION (LWPCR)

Although not optimal, a computationally efficient technique of dimensionality reduction for linear regression is principal component regression (LWPCR) (Massy, 1965).
The inputs are projected onto the largest k principal components of the weighted covariance matrix of the input data by the matrix U:

U = [eigenvectors(Σ w_i (x_i - x̄)(x_i - x̄)^T / Σ w_i)]_max(1:k)     (7)

The regression coefficients β are thus calculated as:

β = (U^T X^T W X U)^{-1} U^T X^T W y     (8)

Equation (8) is inexpensive to evaluate since after projecting X with U, U^T X^T W X U becomes a diagonal matrix that is easy to invert. LWPCR assumes that the inputs have additive spherical noise, which includes the zero noise case. As LWPCR does not take the output data into account during dimensionality reduction, it is endangered by clipping input dimensions with low variance which nevertheless make an important contribution to the regression output. However, from a statistical point of view, it is less likely that low variance inputs have a significant contribution in a linear regression, as the confidence bands of the regression coefficients increase inversely proportionally with the variance of the associated input. If the input data has non-spherical noise, LWPCR is prone to focus the regression on irrelevant projections.

3 MONTE CARLO EVALUATIONS

In order to evaluate the candidate methods, data sets with 5 inputs and 1 output were randomly generated. Each data set consisted of 2,000 training points and 10,000 test points, distributed either uniformly or nonuniformly in the unit hypercube. The outputs were generated by either a linear or quadratic function. Afterwards, the 5-dimensional input space was projected into a 10-dimensional space by a randomly chosen distance preserving linear transformation. Finally, Gaussian noise of various magnitudes was added to both the 10-dimensional inputs and the one dimensional output. For the test sets, the additive noise in the outputs was omitted.
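The LWPCR estimate of Equations (7) and (8) in Section 2.4 amounts to a few lines of linear algebra; a sketch under our own naming conventions (the paper itself publishes no code):

```python
import numpy as np

def lwpcr_beta(X, y, W, k):
    """Locally weighted principal component regression (Equations (7)-(8)).
    X and y are the weighted-mean-centered data and W the diagonal weight
    matrix of Section 2; names and structure are our own sketch."""
    w = np.diag(W)
    C = (X.T * w) @ X / w.sum()                   # weighted input covariance
    eigval, eigvec = np.linalg.eigh(C)
    U = eigvec[:, np.argsort(eigval)[::-1][:k]]   # largest k principal components
    S = X @ U                                     # inputs projected onto U
    # U^T X^T W X U is (near) diagonal, so this solve is essentially elementwise
    return U @ np.linalg.solve(S.T @ W @ S, S.T @ W @ y)
```

With k equal to the full input dimensionality this reduces to ordinary weighted least squares, which is a convenient sanity check.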
Each regression technique was localized by a Gaussian kernel (Equation (1)) with a 10-dimensional distance metric D = 10*I (D was manually chosen to ensure that the Gaussian kernel had sufficiently many data points and no "data holes" in the fringe areas of the kernel). The precise experimental conditions followed closely those suggested by Frank and Friedman (1993):

• 2 kinds of linear functions y = β_lin^T x for: i) β_lin = [1, 1, 1, 1, 1]^T, and ii) β_lin = [1, 2, 3, 4, 5]^T

• 2 kinds of quadratic functions y = β_lin^T x + β_quad^T [x_1^2, x_2^2, x_3^2, x_4^2, x_5^2]^T for: i) β_lin = [1, 1, 1, 1, 1]^T and β_quad = 0.1 [1, 1, 1, 1, 1]^T, and ii) β_lin = [1, 2, 3, 4, 5]^T and β_quad = 0.1 [1, 4, 9, 16, 25]^T

• 3 kinds of noise conditions, each with 2 sub-conditions:
  i) only output noise: a) low noise: local signal/noise ratio lsnr = 20, and b) high noise: lsnr = 2;
  ii) equal noise in inputs and outputs: a) low noise: ε_{x,n} = ε_y = N(0, 0.01^2), n ∈ [1, 2, ..., 10], and b) high noise: ε_{x,n} = ε_y = N(0, 0.1^2), n ∈ [1, 2, ..., 10];
  iii) unequal noise in inputs and outputs: a) low noise: ε_{x,n} = N(0, (0.01 n)^2), n ∈ [1, 2, ..., 10] and lsnr = 20, and b) high noise: ε_{x,n} = N(0, (0.01 n)^2), n ∈ [1, 2, ..., 10] and lsnr = 2;

• 2 kinds of input distributions: i) uniform in the unit hypercube, and ii) uniform in the unit hypercube excluding data points which activate a Gaussian weighting function (1) at c = [0.5, 0, 0, 0, 0]^T with D = 10*I more than w = 0.2 (this forms a "hyper kidney" shaped distribution).

Every algorithm was run* 30 times on each of the 48 combinations of the conditions. Additionally, the complete test was repeated for three further conditions varying the dimensionality (called factors in accordance with LWFA) that the algorithms assumed to be the true dimensionality of the 10-dimensional data, from k=4 to 6, i.e., too few, correct, and too many factors.
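A data set along these lines can be generated as follows; this is our own sketch of the protocol (uniform inputs, optional quadratic term, a random orthonormal embedding into 10 dimensions, additive Gaussian noise), and the parameter names and defaults are our assumptions rather than the paper's exact code:

```python
import numpy as np

def make_dataset(rng, n_points=2000, quadratic=False,
                 beta_lin=None, beta_quad=None,
                 noise_x=0.01, noise_y=0.01):
    """Generate one synthetic data set in the style of Section 3:
    5 inputs in the unit hypercube, a linear or quadratic output,
    a random distance-preserving embedding into R^10, and additive
    Gaussian noise on inputs and output."""
    beta_lin = np.ones(5) if beta_lin is None else beta_lin
    beta_quad = 0.1 * np.ones(5) if beta_quad is None else beta_quad
    x5 = rng.uniform(size=(n_points, 5))
    y = x5 @ beta_lin
    if quadratic:
        y = y + (x5 ** 2) @ beta_quad
    # random orthonormal (hence distance preserving) embedding R^5 -> R^10
    T, _ = np.linalg.qr(rng.normal(size=(10, 5)))
    x10 = x5 @ T.T
    x10 = x10 + rng.normal(scale=noise_x, size=x10.shape)
    y = y + rng.normal(scale=noise_y, size=y.shape)
    return x10, y
```

Without input noise the 10-dimensional inputs have rank 5, reproducing the globally high but locally low dimensional structure the experiments probe.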
The average results are summarized in Figure 1.

Figure 1a,b,c show the summary results of the three factor conditions. Besides averaging over the 30 trials per condition, each mean of these charts also averages over the two input distribution conditions and the linear and quadratic function condition, as these four cases are frequently observed violations of the statistical assumptions in nonlinear function approximation with locally linear models. In Figure 1b the number of factors equals the underlying dimensionality of the problem, and all algorithms are essentially performing equally well. For perfectly Gaussian distributions in all random variables (not shown separately), LWFA's assumptions are perfectly fulfilled and it achieves the best results, however, followed almost indistinguishably closely by LWPLS. For the "unequal noise condition", the two PCA based techniques, LWPCA and LWPCR, perform the worst since, as expected, they choose suboptimal projections. However, when violating the statistical assumptions, LWFA loses part of its advantage, such that the summary results become fairly balanced in Figure 1b.

The quality of function fitting changes significantly when violating the correct number of factors, as illustrated in Figure 1a,c. For too few factors (Figure 1a), LWPCR performs worst because it randomly omits one of the principal components in the input data, without respect to how important it is for the regression. The second worst is LWFA: according to its assumptions it believes that the signal it cannot model must be noise, leading to a degraded estimate of the data's subspace and, consequently, degraded regression results. LWPLS has a clear lead in this test, closely followed by LWPCA and LWPLS_1.

For more factors than necessary (Figure 1c), it is now LWPCA which degrades. This effect is due to its extracting one very noise contaminated projection which strongly influences the recovery of the regression parameters in Equation (4). All other algorithms perform almost equally well, with LWFA and LWPLS taking a small lead.

Figure 1: Average summary results of Monte Carlo experiments. Each chart is primarily divided into the three major noise conditions ("only output noise", "equal noise in all inputs and outputs", "unequal noise in all inputs and outputs"), cf. headers in chart (a). In each noise condition, there are four further subdivisions: i) coefficients of linear or quadratic model are equal with low added noise; ii) like i) with high added noise; iii) coefficients of linear or quadratic model are different with low added noise; iv) like iii) with high added noise. Refer to text and descriptions of Monte Carlo studies for further explanations. [Four panels, average MSE on a logarithmic scale for LWFA, LWPCA, LWPCR, LWPLS, and LWPLS_1: (a) regression results with 4 factors, (b) regression results with the correct number of factors, (c) regression results with 6 factors, (d) summary results.]

* Except for LWFA, all methods can evaluate a data set in non-iterative calculations. LWFA was trained with EM for maximally 1000 iterations or until the log-likelihood increased less than 1e-10 in one iteration.
4 SUMMARY AND CONCLUSIONS

Figure 1d summarizes all the Monte Carlo experiments in a final average plot. Except for LWPLS, every other technique showed at least one clear weakness in one of our "robustness" tests. It was particularly an incorrect number of factors which made these weaknesses apparent. For high-dimensional regression problems, the local dimensionality, i.e., the number of factors, is not a clearly defined number but rather a varying quantity, depending on the way the generating process operates. Usually, this process does not need to generate locally low dimensional distributions; however, it often "chooses" to do so, for instance, as human arm movements follow stereotypic patterns despite the fact that they could generate arbitrary ones. Thus, local dimensionality reduction needs to find the appropriate number of local factors autonomously. Locally weighted partial least squares turned out to be a surprisingly robust technique for this purpose, even outperforming the statistically appealing probabilistic factor analysis. As in principal component analysis, LWPLS's number of factors can easily be controlled just based on a variance-cutoff threshold in input space (Frank & Friedman, 1993), while factor analysis usually requires expensive cross-validation techniques. Simple, variance-based control over the number of factors can actually improve the results of LWPCA and LWPCR in practice, since, as shown in Figure 1a, LWPCR is more robust towards overestimating the number of factors, while LWPCA is more robust towards an underestimation. If one is interested in dynamically growing the number of factors while obtaining already good regression results with too few factors, LWPCA and, especially, LWPLS seem to be appropriate; it should be noted how well one-factor LWPLS (LWPLS_1) already performed in Figure 1!
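The variance-cutoff control over the number of factors mentioned above can be sketched as follows; the cutoff value and all names are our assumptions, not a prescription from the paper:

```python
import numpy as np

def n_factors_by_variance(X, w, cutoff=0.99):
    """Pick the number of factors from a variance-cutoff threshold in
    input space (Section 4): the smallest k whose principal components
    explain at least a fraction `cutoff` of the weighted input variance."""
    Xc = X - w @ X / w.sum()                              # weighted mean-centering
    C = (Xc.T * w) @ Xc / w.sum()                         # weighted covariance
    eigval = np.linalg.eigvalsh(C)[::-1]                  # descending eigenvalues
    explained = np.cumsum(eigval) / eigval.sum()          # cumulative variance ratio
    return int(np.searchsorted(explained, cutoff) + 1)
```

On data with two dominant input directions, this returns k=2 for any reasonable cutoff, which is the behavior one wants when growing the number of factors dynamically.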
In conclusion, since locally weighted partial least squares was equally robust as locally weighted factor analysis towards additive noise in both input and output data, and, moreover, superior when mis-guessing the number of factors, it seems to be a most favorable technique for local dimensionality reduction for high dimensional regressions.

Acknowledgments

The authors are grateful to Geoffrey Hinton for reminding them of partial least squares. This work was supported by the ATR Human Information Processing Research Laboratories. S. Schaal's support includes the German Research Association, the Alexander von Humboldt Foundation, and the German Scholarship Foundation. S. Vijayakumar was supported by the Japanese Ministry of Education, Science, and Culture (Monbusho). C. G. Atkeson acknowledges the Air Force Office of Scientific Research grant F49-6209410362 and a National Science Foundation Presidential Young Investigators Award.

References

Atkeson, C. G., Moore, A. W., & Schaal, S. (1997a). "Locally weighted learning." Artificial Intelligence Review, 11, 1-5, pp. 11-73.
Atkeson, C. G., Moore, A. W., & Schaal, S. (1997c). "Locally weighted learning for control." Artificial Intelligence Review, 11, 1-5, pp. 75-113.
Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.
Everitt, B. S. (1984). An introduction to latent variable models. London: Chapman and Hall.
Fahlman, S. E., & Lebiere, C. (1990). "The cascade-correlation learning architecture." In: Touretzky, D. S. (Ed.), Advances in Neural Information Processing Systems 2, pp. 524-532. Morgan Kaufmann.
Frank, I. E., & Friedman, J. H. (1993). "A statistical view of some chemometric regression tools." Technometrics, 35, 2, pp. 109-135.
Geman, S., Bienenstock, E., & Doursat, R. (1992). "Neural networks and the bias/variance dilemma." Neural Computation, 4, 1, pp. 1-58.
Jordan, M. I., & Jacobs, R. A. (1994). "Hierarchical mixtures of experts and the EM algorithm." Neural Computation, 6, 2, pp. 181-214.
Massy, W. F. (1965). "Principal component regression in exploratory statistical research." Journal of the American Statistical Association, 60, pp. 234-246.
Rubin, D. B., & Thayer, D. T. (1982). "EM algorithms for ML factor analysis." Psychometrika, 47, 1, pp. 69-76.
Schaal, S., & Atkeson, C. G. (in press). "Constructive incremental learning from only local information." Neural Computation.
Scott, D. W. (1992). Multivariate density estimation. New York: Wiley.
Wold, H. (1975). "Soft modeling by latent variables: the nonlinear iterative partial least squares approach." In: Gani, J. (Ed.), Perspectives in Probability and Statistics, Papers in Honour of M. S. Bartlett. Academic Press.
Xu, L., Jordan, M. I., & Hinton, G. E. (1995). International Conference on Computational Intelligence.

Figure 1: 1-syllable network latency differences & neighborhood statistics

3.2 Methods

For the single syllable words, we used an identical network to the feed-forward network used by PMSP, i.e., a 105-100-61 network, and for the two syllable words, we simply used the same architecture with each layer size doubled. We trained each network for 300 epochs, using batch training with a cross entropy objective function, an initial learning rate of 0.001, momentum of 0.9 after the first 10 epochs, weight decay of 0.0001, and delta-bar-delta learning rate adjustment. Training exemplars were weighted by the log of frequency as found in the Kucera-Francis corpus. After this training, the single syllable feed-forward networks averaged 98.6% correct outputs, using the same evaluation technique outlined in PMSP. Two syllable networks were trained for 1700 epochs using online training, a learning rate of 0.05, momentum of 0.9 after the first 10 epochs, and raw frequency weighting. The two syllable network achieved 85% correct.
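No code accompanies the paper, but the two frequency-weighting schemes just described, and the MSE-based latency measure used in the results below, are simple to state. The function names, and the assumption that Kucera-Francis frequencies are at least 1, are ours:

```python
import numpy as np

def exemplar_weight(freq, syllables=1):
    """Training-exemplar weighting from Section 3.2: log Kucera-Francis
    frequency for the single-syllable networks, raw frequency for the
    two-syllable networks. Assumes freq >= 1."""
    return float(np.log(freq)) if syllables == 1 else float(freq)

def naming_latency(output, target):
    """Naming latency equated with the network's output MSE against the
    correct target (Section 3.3)."""
    output, target = np.asarray(output, float), np.asarray(target, float)
    return float(np.mean((output - target) ** 2))
```

The latency differences plotted in Figures 1 and 2 are then differences of this MSE between each exception word and its matched control.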
Naming latency was equated with network output MSE; for successful results, the error difference between the irregular words and associated control words should decrease with irregularity position.

3.3 Results

Single Syllable Words First, Coltheart's challenge that a single-route model cannot produce the latency effects was explored. The single-syllable network described above was tested on the collection of single-syllable words identified as irregular by (Taraban and McClelland, 1987). In (Coltheart and Rastle, 1994), control words are selected based on equal number of letters, same beginning phoneme, and Kucera-Francis frequency between 1 and 20 (controls were not frequency matched). For the single syllable words used here, the control condition was modified to allow frequency from 1 to 70, which is the range of the "low frequency" exception words in the Taraban & McClelland set. Controls were chosen by drawing randomly from the words meeting the control criteria.

Each test and control word input vector was presented to the network, and the MSE at the output layer (compared to the expected correct target) was calculated. From these values, the differences in MSE for target and matched control words were calculated and are shown in Figure 1. Note that words with an irregularity in the first phoneme position have the largest difference from their control words, with this (exception - regular control) difference decreasing as phoneme position increases. Contrary to the claims of the Dual-Route model, this network does show the desired rank-ordering of MSE/latency.

Serial Order in Reading Aloud    63
Figure 2: 2-syllable network latency differences & neighborhood statistics

Two Syllable Words Testing of the two-syllable network is identical to that of the one-syllable network. The difference in MSE for each test word and its corresponding control is calculated, averaging across all test pairs in the position set. Both test words and their controls are those found in (Coltheart and Rastle, 1994). The 2-syllable network appears to produce approximately the correct linear trend in the naming MSE/latency (Figure 2), although the results displayed are not monotonically decreasing with position. Note, however, that the results presented by Coltheart, when taken separately, also fail to exhibit this trend (Table 1). For correct analysis, several "subject" networks should be trained, with formal linear trend analysis then performed on the resulting data. These further simulations are currently being undertaken.

4 Why the network works: Neighborhood effects

A possible explanation for these results relies on the fact that connectionist networks tend to extract statistical regularities in the data, and are affected by regularity by frequency interactions. In this case, we decided to explore the hypothesis that the results could be explained by a neighborhood effect: perhaps the number of "friends" and "enemies" in the neighborhood (in a sense to be defined below) of the exception word varies in English in a position-dependent way. If there are more enemies (different pronunciations) than friends (identical pronunciations) when the exception occurs at the beginning of a word than at the end, then one would expect a network to reflect this statistical regularity in its output errors.
In particular, one would expect higher errors (and therefore longer latencies in naming) if the word has a higher proportion of enemies in its neighborhood.

To test this hypothesis, we created some data search engines to collect word neighborhoods based on various criteria. There is no consensus on the exact definition of the "neighborhood" of a word. There are some common measures, however, so we explored several of these. Taraban & McClelland (1987) neighborhoods (T&M) are defined as words containing the same vowel grouping and final consonant cluster. These neighborhoods therefore tend to consist of words that rhyme (MUST, DUST, TRUST). There is independent evidence that these word-body neighbors are psychologically relevant for word naming tasks (i.e., pronunciation) (Treiman and Chafetz, 1987). The neighborhood measure given by Coltheart (Coltheart and Rastle, 1994), N, counts same-length words which differ by only one letter, taking string position into account. Finally, edit-distance-1 (ED1) neighborhoods are those words which can be generated from the target word by making one change (Peereman, 1995): either a letter substitution, insertion or deletion. This differs from the Coltheart N definition in that "TRUST" is in the ED1 neighborhood (but not the N neighborhood) of "RUST", and provides a neighborhood measure which considers both pronunciation and spelling similarity. However, the N and the ED-1 measures have not been shown to be psychologically real in terms of affecting naming latency (Treiman and Chafetz, 1987).

64    J. C. Milostan and G. W. Cottrell

We therefore extended T&M neighborhoods to multi-syllable words. Each vowel group is considered within the context of its rime, with each syllable considered separately. Consonant neighborhoods consist of orthographic clusters which correspond to the same location in the word.
This results in 4 consonant cluster locations: first syllable onset, first syllable coda, second syllable onset, and second syllable coda. Consonant cluster neighborhoods include the preceding vowel for coda consonants, and the following vowel for onset consonants.

The notion of exception words is also not universally agreed upon. Precisely which words are exceptions is a function of the working definition of pronunciation and regularity for the experiment at hand. Given a definition of neighborhood, then, exception words can be defined as those words which do not agree with the phonological mapping favored by the majority of items in that particular neighborhood. Alternatively, in cases assuming a set of rules for grapheme-phoneme correspondence, exception words are those which violate the rules which define the majority of pronunciations. For this investigation, single-syllable exception words are those defined as exceptions by the T&M neighborhood definition. For instance, PINT would be considered an exception word compared to its neighbors MINT, TINT, HINT, etc. Coltheart, on the other hand, defines exception words to be those for which his GPC rules produce an incorrect pronunciation. Since we are concerned with addressing Coltheart's claims, these 2-syllable exception words will also be used here.

4.1 Results

Single syllable words  For each phoneme position, we compare each word with irregularity at that position with its neighbors, counting the number of enemies (words with an alternate pronunciation at the supposed irregularity) and friends (words with a pronunciation in agreement) that it has. The T&M neighborhood numbers (words containing the same vowel grouping and final consonant cluster) used in Figure 1 are found in (Taraban and McClelland, 1987).
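The neighborhood-based definition of an exception word amounts to a majority vote over the neighborhood's spelling-to-sound mappings. A minimal sketch, with hypothetical vowel codes standing in for real phonology:

```python
from collections import Counter

def is_exception(word_mapping, neighbor_mappings):
    """A word is an exception if its orthography-to-phonology mapping
    disagrees with the mapping favored by the majority of its neighbors."""
    if not neighbor_mappings:
        return False  # a "loner" has no neighborhood to disagree with
    majority, _ = Counter(neighbor_mappings).most_common(1)[0]
    return word_mapping != majority

# PINT against its T&M body neighbors MINT, TINT, HINT: the neighbors
# all give -INT a short vowel, while PINT uses a long one (codes hypothetical).
neighbors = ["short-i", "short-i", "short-i"]
print(is_exception("long-i", neighbors))   # True: PINT is an exception
print(is_exception("short-i", neighbors))  # False: MINT is regular
```

The same vote, run with GPC rule outputs instead of neighbor mappings, would recover Coltheart's alternative rule-based definition.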
For each word, we calculate its (enemy)/(friend+enemy) ratio; these ratios are then averaged over all the words in the position set. The results using neighborhoods as defined in Taraban & McClelland clearly show the desired rank ordering of effect. First-position-irregularity words have more "enemies" and fewer "friends" than third-position-irregularity words, with the second-position words falling in the middle, as desired. We suggest that this statistical regularity in the data is what the above networks capture. However convincing these results may be, they do not fully address Coltheart's data, which is for two-syllable words of five phonemes or phoneme clusters, with irregularities at each of five possible positions. Also, due to the size of the T&M data set, there are only 2 members in the position 1 set, and the single-syllable data only goes up to phoneme position 3. The neighborhoods for the two-syllable data set were thus examined.

Serial Order in Reading Aloud  65

Two syllable results  Recall that the two-syllable test words are those used in the (Coltheart and Rastle, 1994) subject study, for which naming latency differences are shown in Table 1. Coltheart's 1-letter-different neighborhood definition is not very informative in this case, since by this criterion most of the target words provided in (Coltheart and Rastle, 1994) are loners (i.e., have no neighbors at all). However, using a neighborhood based on T&M-2 recreates the desired ranking (Figure 2), as indicated by the ratio of hindering pronunciations to the total of the helping and hindering pronunciations. As with the single-syllable words, each test word is compared with its neighbor words and the (enemy)/(friend+enemy) ratio is calculated.
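The per-position statistic is just the average of each word's enemy ratio over its position set. A sketch with made-up mapping codes and toy position sets (the real counts come from the T&M and Coltheart word lists):

```python
def enemy_ratio(word_pron, neighbor_prons):
    """(enemy)/(friend+enemy): fraction of neighbors whose pronunciation
    at the critical position disagrees with the word's own."""
    if not neighbor_prons:
        return 0.0
    enemies = sum(p != word_pron for p in neighbor_prons)
    return enemies / len(neighbor_prons)

def mean_ratio_by_position(position_sets):
    """Average the enemy ratio over all words in each irregularity-position set."""
    return {pos: sum(enemy_ratio(wp, nps) for wp, nps in words) / len(words)
            for pos, words in position_sets.items()}

# Toy data: each entry is (word's mapping, neighbors' mappings); codes hypothetical.
position_sets = {
    1: [("a", ["b", "b", "a"]), ("a", ["b", "b", "b"])],  # early irregularity: mostly enemies
    3: [("a", ["a", "a", "b"]), ("a", ["a", "a", "a"])],  # late irregularity: mostly friends
}
ratios = mean_ratio_by_position(position_sets)
print(ratios[1] > ratios[3])  # True: early-irregularity words get less neighbor support
```

A higher average ratio for early-position sets is precisely the "support disadvantage" the text argues the networks pick up.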
Averaging over the words in each position set, we again see that words with early irregularities are at a support disadvantage compared to words with late irregularities.

5 Summary

Proponents of dual-route models claim that the irregularity position effect can only be accounted for by two-route models with left-to-right activation of phonemes and interaction between GPC rules and the lexicon. The work presented in this paper refutes this claim by presenting results from feed-forward connectionist networks which show the same rank ordering of latency. Further, an analysis of orthographic neighborhoods shows why the networks can do this: the effect is based on a statistical interaction between friend/enemy support and position. Words with an irregular orthographic-phonemic correspondence at the word beginning have less support from their neighbors than words with later irregularities; it is this difference which explains the latency results. The resulting statistical regularity is then easily captured by connectionist networks exposed to representative data sets.

References

Coltheart, M., Curtis, B., Atkins, P., and Haller, M. (1993). Models of reading aloud: Dual-route and parallel-distributed-processing approaches. Psychological Review, 100(4):589-608.

Coltheart, M. and Rastle, K. (1994). Serial processing in reading aloud: Evidence for dual route models of reading. Journal of Experimental Psychology: Human Perception and Performance, 20(6):1197-1211.

Kucera, H. and Francis, W. (1967). Computational Analysis of Present-Day American English. Brown University Press, Providence, RI.

McClelland, J. and Rumelhart, D. (1981). An interactive activation model of context effects in letter perception: Part 1. An account of basic findings. Psychological Review, 88:375-407.

Peereman, R. (1995).
Naming regular and exception words: Further examination of the effect of phonological dissension among lexical neighbours. European Journal of Cognitive Psychology, 7(3):307-330.

Plaut, D., McClelland, J., Seidenberg, M., and Patterson, K. (1996). Understanding normal and impaired word reading: Computational principles in quasi-regular domains. Psychological Review, 103(1):56-115.

Seidenberg, M. and McClelland, J. (1989). A distributed, developmental model of word recognition and naming. Psychological Review, 96:523-568.

Taraban, R. and McClelland, J. (1987). Conspiracy effects in word pronunciation. Journal of Memory and Language, 26:608-631.

Treiman, R. and Chafetz, J. (1987). Are there onset- and rime-like units in printed words? In Coltheart, M., editor, Attention and Performance XII: The Psychology of Reading. Erlbaum, Hillsdale, NJ.