{"title": "Fast Learning in Multi-Resolution Hierarchies", "book": "Advances in Neural Information Processing Systems", "page_first": 29, "page_last": 39, "abstract": null, "full_text": "29 \n\n\"FAST LEARNING IN \n\nMULTI-RESOLUTION HIERARCHIES\" \n\nYale Computer Science, P.O. Box 2158, New Haven, CT 06520 \n\nJohn Moody \n\nAbstract \n\nA class of fast, supervised learning algorithms is presented. They use lo(cid:173)\n\ncal representations, hashing, atld multiple scales of resolution to approximate \nfunctions which are piece-wise continuous. Inspired by Albus's CMAC model, \nthe algorithms learn orders of magnitude more rapidly than typical imple(cid:173)\nmentations of back propagation, while often achieving comparable qualities of \ngeneralization. Furthermore, unlike most traditional function approximation \nmethods, the algorithms are well suited for use in real time adaptive signal \nprocessing. Unlike simpler adaptive systems, such as linear predictive cod(cid:173)\ning, the adaptive linear combiner, and the Kalman filter, the new algorithms \nare capable of efficiently capturing the structure of complicated non-linear \nsystems. As an illustration, the algorithm is applied to the prediction of a \nchaotic timeseries. \n\n1 \n\nIntroduction \n\nA variety of approaches to adaptive information processing have been developed \nby workers in disparate disciplines. These include the large body of literature on \napproximation and interpolation techniques (curve and surface fitting), the linear, \nreal-time adaptive signal processing systems (such as the adaptive linear combiner \nand the Kalman filter), and most recently, the reincarnation of non-linear neural \nnetwork models such as the multilayer perceptron. \n\nEach of these methods has its strengths and weaknesses. The curve and surface \n\nfitting techniques are excellent for off-line data analysis, but are typically not formu(cid:173)\nlated with real-time applications in mind. 
The linear techniques of adaptive signal processing and adaptive control are well-characterized, but are limited to applications for which linear descriptions are appropriate. Finally, neural network learning models such as back propagation have proven extremely versatile at learning a wide variety of non-linear mappings, but tend to be very slow computationally and are not yet well characterized.

The purpose of this paper is to present a general description of a class of supervised learning algorithms which combine the ability of the conventional curve fitting and multilayer perceptron methods to precisely learn non-linear mappings with the speed and flexibility required for real-time adaptive application domains. The algorithms are inspired by a simple, but often overlooked, neural network model, Albus's Cerebellar Model Articulation Controller (CMAC) [2,1], and have a great deal in common with the standard techniques of interpolation and approximation. The algorithms \"learn from examples\", generalize well, and can perform efficiently in real time. Furthermore, they overcome the problems of precision and generalization which limit the standard CMAC model, while retaining the CMAC's speed.

2 System Description

The systems are designed to rapidly approximate mappings g: x -> y from multidimensional input spaces x in S_input to multidimensional output spaces y in S_output. The algorithms can be applied to any problem domain for which a metric can be defined on the input space (typically the Euclidean, Hamming, or Manhattan metric) and for which the desired learned mapping is (to a close approximation) piece-wise continuous. (Discontinuities in the desired mapping, such as those at classification boundaries, are approximated continuously.) 
Important general classes of such problems include approximation of real-valued functions R^n -> R^m (such as those found in signal processing), classification problems R^n -> B^m (such as phoneme classification), and boolean mapping problems B^n -> B^m (such as the NETtalk problem [20]). Here, R denotes the reals and B is {0,1}. This paper focuses on real-valued mappings; the formulation and application of the algorithms to boolean problem domains will be presented elsewhere.

In order to specify the complete learning system in detail, it is easiest to start with simple special cases and build the description from the bottom up:

2.1 A Simple Adaptive Module

The simplest special case of the general class under consideration is described as follows. The input space is overlayed with a lattice of points x_β, and a local function value or \"weight\" V_β is assigned to every possible lattice point. The output of the system for a given input is:

z(x) = Σ_β V_β N_β(x),    (1)

where N_β(x) is a neighborhood function for the βth lattice point such that N_β = 1 if x_β is the lattice point closest to the input vector x and N_β = 0 otherwise.

More generally, the neighborhood functions N can overlap and the sum in equation (1) can be replaced by an average. This results in a greater ability to generalize when training data is sparse, but at the cost of losing fine detail.

Learning is accomplished by varying the V_β to minimize the squared error of the system output on a set of training data:

E = (1/2) Σ_i (z_i^desired - z(x_i))^2,    (2)

where the sum is over all exemplars {x_i, z_i^desired} in the training set. The determination of V_β is easily formulated as a real time adaptive algorithm by using gradient descent to minimize an instantaneous estimate E(t) of the error:

dV/dt = -η dE(t)/dV.    (3)

2.2 Saving Memory with Hashing: The CMAC

The approach of the previous section encounters serious difficulty when the dimension of the input space becomes large and the distribution of data in the input space becomes highly non-uniform. In such cases, allocating a separate function value for each possible lattice point is extremely wasteful, because the majority of lattice points will have no training data within a local neighborhood.

As an example, suppose that the input space is four dimensional, but that all input data lies on a fuzzy two dimensional subspace. (Such a situation [projected onto 3-dimensions] is shown in figure [2A].) Furthermore, suppose that the input space is overlayed with a rectangular lattice with K nodes per dimension. The complete lattice will contain K^4 nodes, but only O(K^2) of those nodes will have training data in their local neighborhoods. Thus, only O(K^2) of the weights V_β will have any meaning. The remaining O(K^4) weights will be wasted. (This assumes that the lattice is not too fine. If K is too large, then only O(P) of the lattice points will have training data nearby, where P is the number of training data.)

An alternative approach is to have only a small number of weights and to allocate them to only those regions of the input space which are populated with training data. This allocation can be accomplished by a dimensionality-reducing mapping from a virtual lattice in the input space onto a lookup table of weights or function values. In the absence of any a priori information about the distribution of data in the input space, the optimal mapping is a random mapping, for example a universal hashing function [8]. The random nature of such a function insures that neighborhood relationships in the virtual lattice are not preserved. 
The average behavior of an ensemble of universal hashing functions is thus to access all elements of the lookup table with equal probability, regardless of the correlations in the input data. The many-to-one hash function can be represented here as a matrix H_Tβ of 0's and 1's with one 1 per column, but many 1's per row. With this notation, the system response function is:

z(x) = Σ_T Σ_β V_T H_Tβ N_β(x),    (4)

where T runs over the lookup table entries and β over the virtual lattice points.

Figure 1: (A) A simple CMAC module. (B) The computation of errors for a multi-resolution hierarchy.

The CMAC model of Albus is obtained when a distributed representation of the input space is used and the neighborhood functions N_β(x) are overlapping. In this case, the sum over β is replaced by an average. Note that, as specified by equation (4), hash table collisions are not resolved. This introduces \"collision noise\", but the effect of this noise is reduced by 1/√B, where B is the number of neighborhood functions which respond to a given input. Collision noise can be completely eliminated if standard collision resolution techniques are used.

A few comments should be made about efficiency. In spite of the costly formal sums in equation (4), actual implementations of the algorithm are extremely fast. The set of non-zero N_β(x) on the virtual lattice, the hash function value for each vertex, and the set of corresponding lookup table values V_T given by the hash function are easily determined on the fly. The entire hash function H_Tβ is never pre-computed, the sum over the index β is limited to a few lattice points neighboring the input x, and since each lattice point is associated with only one lookup table value, the formal sum over T disappears.

The CMAC model is shown schematically in figure [1A]. 
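As a concrete sketch of the scheme in equations (1)-(4), the following minimal Python module stores weights in a fixed-size table indexed by hashing the nearest virtual lattice point. It is an illustration only, not the original implementation: the class and parameter names are invented here, Python's built-in hash stands in for a universal hashing function, only the single nearest lattice point responds (B = 1), and collisions are left unresolved, as in the text.

```python
class HashedLattice:
    """Nearest-lattice-point lookup with a hashed weight table (a sketch)."""

    def __init__(self, spacing, table_size, learning_rate=0.1):
        self.spacing = spacing            # virtual lattice spacing, all dimensions
        self.table = [0.0] * table_size   # lookup table of weights V
        self.table_size = table_size
        self.lr = learning_rate           # step size for gradient descent

    def _index(self, x):
        # Nearest virtual lattice point, mapped into the table by hashing.
        # Integer tuples hash deterministically; collisions are not resolved.
        lattice_point = tuple(int(round(xi / self.spacing)) for xi in x)
        return hash(lattice_point) % self.table_size

    def predict(self, x):
        # Equation (1) with a single responding neighborhood function.
        return self.table[self._index(x)]

    def learn(self, x, z_desired):
        # Gradient descent on the instantaneous squared error, as in equation (3).
        i = self._index(x)
        error = z_desired - self.table[i]
        self.table[i] += self.lr * error
        return error
```

Because only one table entry is touched per exemplar, both predict and learn run in constant time regardless of the training set size, which is the property that makes the real-time operation described later feasible.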
2.3 Interpolation: Neighborhood Functions with Graded Response

One serious problem with the formulations discussed so far is that the neighborhood functions are constant in their regions of support. Thus, the system response is discontinuous over neighborhood boundaries. This problem can be easily remedied by using neighborhood functions with graded response in order to perform continuous interpolation between lattice points.

The normalized system response function is then:

z(x) = [Σ_T Σ_β V_T H_Tβ R_β(x)] / [Σ_β R_β(x)].    (5)

The functions R_β(x) are the graded neighborhood response functions associated with each lattice point x_β. They are intended to have local support on the input space S_input, thus being non-zero only in a local neighborhood of their associated lattice point x_β. Each function R_β(x) attains its maximum value at lattice point x_β and drops off monotonically to zero as the distance ||x_β - x|| increases. Note that R is not necessarily isotropic or symmetric.

Certain classes of localized response functions R defined on certain lattices are self-normalized, meaning that:

Σ_β R_β(x) = 1, for any x.    (6)

In this case, equation (5) simplifies to:

z(x) = Σ_T Σ_β V_T H_Tβ R_β(x).    (7)

One particularly important and useful class of response functions are the B-splines. However, it is not easy to formulate B-splines on arbitrary lattices in high dimensional spaces.

2.4 Multi-Resolution Interpolation

The final limitation of the methods described so far is that they use a lattice at only one scale of resolution. Without detailed a priori knowledge of the distribution of data in the input space, it is difficult to choose an optimal lattice spacing. Furthermore, there is almost always a trade-off between the ability to generalize and the ability to capture fine detail. 
When a single coarse resolution is used, generalization is good, but fine details are lost. When a single fine resolution is used, fine details are captured in those regions which contain dense data, but no general picture emerges for those regions in which data is sparse.

Good generalization and fine detail can both be captured by using a multi-resolution hierarchy.

A hierarchical system with L levels represents functions g: x -> y in the following way:

y(x) = y_L(x) = Σ_{λ=1}^{L} z_λ(x),    (8)

where z_λ is a mapping as described in equation (5) for the λ-th level in the hierarchy. The coarsest scale is λ = 1 and the finest is λ = L.

The multi-resolution system is trained such that the finer scales learn corrections to the total output of the coarser scales. This is accomplished by using a hierarchy of error functions. For each level λ in the hierarchy, the output for that level, y_λ, is defined to be the partial sum

y_λ = Σ_{κ=1}^{λ} z_κ.

(Note that y_{λ+1} = y_λ + z_{λ+1}.) The error for level λ is defined to be

E_λ = Σ_i E_λ(i),

where the error associated with the ith exemplar is:

E_λ(i) = (1/2) (y_i^desired - y_λ(x_i))^2.

The learning or training procedure for level λ involves varying the lookup table values V^λ for that level to minimize E_λ. Note that the lookup table values V^κ for previous or subsequent levels (κ ≠ λ) are held fixed during the minimization of E_λ. Thus, the lookup table values for each level are varied to minimize only the error defined for that level. This hierarchical learning procedure guarantees that the first level mapping z_1 is the best possible at that level, the second level mapping z_2 constitutes the best possible corrections to the first level, and the λ-th level mapping z_λ constitutes the best possible corrections to the total contributions of all previous levels. 
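To make the training scheme concrete, here is a minimal one-dimensional Python sketch of such a hierarchy, with triangular (linear B-spline) responses so that each level interpolates between its two nearest lattice points. Hashing is omitted for clarity, the class and function names are invented for this illustration, and plain stochastic gradient descent stands in for the paper's per-level convergence criterion.

```python
class Level:
    """One resolution level: a 1-D lattice on [0, 1] with graded responses."""

    def __init__(self, nodes):
        self.nodes = nodes          # lattice points at this resolution
        self.v = [0.0] * nodes      # lookup values V_beta

    def respond(self, x):
        # Linear interpolation between the two nearest lattice points: a
        # self-normalized graded response in the sense of equation (6).
        t = x * (self.nodes - 1)
        i = min(int(t), self.nodes - 2)
        frac = t - i
        z = (1.0 - frac) * self.v[i] + frac * self.v[i + 1]
        return z, i, frac

    def learn(self, x, residual, lr=0.5):
        # Delta-rule step toward the residual left by the coarser levels.
        z, i, frac = self.respond(x)
        err = residual - z
        self.v[i] += lr * err * (1.0 - frac)
        self.v[i + 1] += lr * err * frac


def train_hierarchy(levels, data, epochs=50):
    # Levels are trained sequentially, coarsest first; level lambda sees
    # only the residual y - y_(lambda-1), and all other levels stay fixed.
    for lam, level in enumerate(levels):
        for _ in range(epochs):
            for x, y in data:
                coarse = sum(l.respond(x)[0] for l in levels[:lam])
                level.learn(x, y - coarse)


def predict(levels, x):
    # Equation (8): the output is the sum of all level outputs.
    return sum(l.respond(x)[0] for l in levels)
```

With, say, three levels of 5, 9, and 17 nodes trained on samples of a smooth curve, the coarse level captures the overall shape and each finer level successively shrinks the remaining residual.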
The computation of error signals is shown schematically in figure [1B].

It should be noted that multi-resolution approaches have been successfully used in other contexts. Examples are the well-known multigrid methods for solving differential equations and the pyramid architectures used in machine vision [6,7].

3 Application to Timeseries Prediction

The multi-resolution hierarchy can be applied to a wide variety of problem domains as mentioned earlier. Due to space limitations, we consider only one test problem here, the prediction of a chaotic timeseries.

As it is usually formulated, the prediction is accomplished by finding a real-valued mapping f: R^n -> R which takes a sequence of n recent samples of the timeseries and predicts the value at a future moment. Typically, the state space imbedding in R^n is x[t] = (x[t], x[t-Δ], x[t-2Δ], x[t-3Δ]), where Δ is the sampling parameter, and the correct prediction for prediction time T is x[t+T]. For the purposes of testing various non-parametric prediction methods, it is assumed that the underlying process which generates the timeseries is unknown.

The particular timeseries studied here results from integrating the Mackey-Glass differential-delay equation [14]:

dx[t]/dt = -b x[t] + a x[t-τ] / (1 + x[t-τ]^10).    (9)

Figure 2: (A) Imbedding in three dimensions of 1000 successive points of the Mackey-Glass chaotic timeseries with delay parameter τ = 17 and sampling parameter Δ = 6. (B) Normalized Prediction Error vs. Number of Training Data. 
Squares are runs with the multi-resolution hierarchy. The circle is the back propagation benchmark. The horizontal line is included for visual reference only and is not intended to imply a scaling law for back propagation.

The solid lines in figure [3] show the resulting timeseries for τ = 17, a = 0.2, and b = 0.1; note that it is cyclic, but not periodic. The characteristic time of the series, given by the inverse of the mean of the power spectrum, is t_char ≈ 50. Classical techniques like linear predictive coding and Gabor-Volterra-Wiener polynomial expansions typically do no better than chance when predicting beyond t_char [10].

For purposes of comparison, the sampling parameter and prediction time are chosen to be Δ = 6 and T = 85 > t_char respectively. Figure [2A] shows a projection of the four dimensional state space imbedding onto three dimensions. The orbits of the series lie on a fuzzy two dimensional subspace which is a strange attractor of fractal dimension 2.1.

This problem has been studied by both conventional data analysis techniques and by neural network methods.

It was first studied by Farmer and Sidorowich, who locally fitted linear and quadratic surfaces directly to the data [11,10]. The exemplars in the imbedding space were stored in a k-d tree structure in order to allow rapid determination of proximity relationships [3,4,19]. The local surface fitting method is extremely efficient computationally. This kind of approach has found wide application in the statistics community [5]. Casdagli has applied the method of radial basis functions, which is an exact interpolation method and also depends on explicit storage of the data [9]. The radial basis functions method is a global method and becomes computationally expensive when the number of exemplars is large, growing as O(P^3). 
Both approaches yield excellent results when used as off-line algorithms, but do not seem to be well suited to real-time application domains.

For real-time applications, little a priori knowledge about the data can be assumed, large amounts of past data can't be stored, the function being learned may vary with time, and computing speed is essential.

Three different neural network techniques have been applied to the timeseries prediction problem: back propagation [13], self-organized, locally-tuned processing units [18,17], and an approach based on the GMDH method and simulated annealing [21]. The first two approaches can in principle be applied in real time, because they don't require explicit storage of past data and can adapt continuously. Back propagation yields better predictions since it is completely supervised, but the locally-tuned processing units learn substantially faster. The GMDH approach yields excellent results, but is computationally intensive and is probably limited to off-line use.

The multi-resolution hierarchy is intended to offer speed, precision, and the ability to adapt continuously in real time. Its application to the Mackey-Glass prediction problem is demonstrated in two different modes of operation: off-line learning and real-time learning.

3.1 Off-Line Learning

In off-line mode, a five level hierarchy was trained to predict the future values. At each level, a regular rectangular lattice was used, with each lattice having A intervals and therefore A + 1 nodes per dimension. The lattice resolutions were chosen to be (A_1 = 4, A_2 = 8, A_3 = 16, A_4 = 32, A_5 = 64). The corresponding number of vertices in each of the virtual 4-dimensional lattices was therefore (M_1 = 625, M_2 = 6,561, M_3 = 83,521, M_4 = 1,185,921, M_5 = 17,850,625). The corresponding lookup table sizes were (T_1 = 625, T_2 = 4096, T_3 = 4096, T_4 = 4096, T_5 = 4096). 
Note that T_1 = M_1, so hashing was not required for the first layer. For all other layers, T_λ < M_λ, so hashing was used. For layers 3, 4, and 5, T_λ << M_λ, so hashing resulted in a dramatic reduction in the memory required. The neighborhood response function R_β(x) was a B-spline with support in the 16 cells adjacent to each lattice point x_β. Hash table collisions were not resolved.

The learning method used was simple gradient descent. The lookup table values were updated after the presentation of each exemplar. At each level, the training set was presented repeatedly until a convergence criterion was satisfied. The levels were trained sequentially: level 1 was trained until it converged, followed by level 2, and so on.

The performance of the system as a function of training set size is shown in figure [2B]. The normalized error is defined as [rms error]/σ, where σ is the standard deviation of the timeseries. For each run, a different segment of the timeseries was used. In all cases, the performance was measured on an independent test sequence consisting of the 500 exemplars immediately following the training sequence. The prediction error initially drops rapidly as the number of training data are increased, but then begins to level out. This leveling out is most likely caused by collision noise in the hash tables. Collision resolution techniques should improve the results, but have not yet been implemented.

For training sets with 500 exemplars, the multi-resolution hierarchy achieved prediction accuracy equivalent to that of a back propagation network trained by Lapedes and Farber [13]. Their network had four linear inputs, one linear output, and two internal layers, each containing 20 sigmoidal units. The layers were fully connected, yielding 541 adjustable parameters (weights and thresholds) total. 
They trained their network in off-line mode using conjugate gradient, which they found to be significantly faster than gradient descent.

The multi-resolution hierarchy converged in about 3.5 minutes on a Sun 3/60 for the 500 exemplar runs. Lapedes estimates that the back propagation network required probably 5 to 10 minutes of Cray X/MP time running at about 90 Mflops [12]. This would correspond to about 4,000 to 8,000 minutes of Sun 3/60 time. Hence, the multi-resolution hierarchy converged about three orders of magnitude faster than the back propagation network. This comparison should not be taken to be universal, since many implementations of both back propagation and the multi-resolution hierarchy are possible. Other comparisons could easily vary by factors of ten or more.

It is interesting to note that the training time for the multi-resolution hierarchy increased sub-linearly with training set size. This is because the lookup table values were varied after the presentation of each exemplar, not after presentation of the whole set. A similar effect should be observable in back propagation nets. In fact, training after the presentation of each exemplar could very likely increase the overall rate of convergence for a back propagation net.

3.2 Real-Time Learning

Unlike most standard curve and surface fitting methods, the multi-resolution hierarchy is extremely well-suited for real-time applications. Indeed, the standard CMAC model has been applied to the real-time control of robots with encouraging success [16,15].

Figure [3] illustrates a two level hierarchy (with 5 and 9 nodes per dimension) learning to predict the timeseries for T = 50 from an initial tabula rasa configuration (all lookup table values set to zero). The solid line is the actual timeseries data, while the dashed line shows the predicted values. The predicted values lead the actual values in the graphs. 
Notice that the system discovers the intrinsically cyclic nature of the series almost immediately. At the end of a single pass through 9,900 exemplars, the normalized prediction error is below 5% and the fit looks very good to the eye.

On a Sun 3/50, the algorithm required 1.4 msec per level to respond to and learn from each exemplar. At this rate, the two level system was able to process 360 exemplars (over 7 cycles of the timeseries) per second. This rate would be considered phenomenal for a typical back propagation network running on a Sun 3/50.

Figure 3: An example of learning to predict the Mackey-Glass chaotic timeseries in real time with a two-stage multi-resolution hierarchy.

4 Discussion

There are two reasons that the multi-resolution hierarchy learns much more quickly than back propagation. The first is that the hierarchy uses local representations of the input space and thus requires evaluation and modification of only a few lookup table values for each exemplar. In contrast, the complete back propagation net must be evaluated and modified for each exemplar. Second, the learning in the multi-resolution hierarchy is cast as a purely quadratic optimization procedure. 
In contrast, the back propagation procedure is non-linear and is plagued with a multitude of local minima and plateaus which can significantly retard the learning process.

In these respects, the multi-resolution hierarchy is very similar to the local surface fitting techniques exploited by Farmer and Sidorowich. The primary difference, however, is that the hierarchy, with its multi-resolution architecture and hash table data structures, offers the flexibility needed for real time problem domains and does not require the explicit storage of past data or the creation of data structures which depend on the distribution of data.

Acknowledgements

I gratefully acknowledge helpful comments from Chris Darken, Doyne Farmer, Alan Lapedes, Tom Miller, Terry Sejnowski, and John Sidorowich. I am especially grateful for support from ONR grant N00014-86-K-0310, AFOSR grant F49620-88-C0025, and a Purdue Army subcontract.

References

[1] J.S. Albus. Brain, Behavior and Robotics. Byte Books, 1981.
[2] J.S. Albus. A new approach to manipulator control: the cerebellar model articulation controller (CMAC). J. Dyn. Sys. Meas., Contr., 97:220, 1975.
[3] Jon L. Bentley. Multidimensional binary search trees in database applications. IEEE Trans. on Software Engineering, SE-5:333, 1979.
[4] Jon L. Bentley. Multidimensional divide and conquer. Communications of the ACM, 23:214, 1980.
[5] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Wadsworth, Monterey, CA, 1984.
[6] Peter J. Burt and Edward H. Adelson. The Laplacian pyramid as a compact image code. IEEE Trans. Communications, COM-31:532, 1983.
[7] Peter J. Burt and Edward H. Adelson. A multiresolution spline with application to image mosaics. ACM Trans. on Graphics, 2:217, 1983.
[8] J.L. Carter and M.N. Wegman. 
Universal classes of hash functions. In Proceedings of the Ninth Annual SIGACT Conference, 1977.
[9] M. Casdagli. Nonlinear Prediction of Chaotic Time Series. Technical Report, Queen Mary College, London, 1988.
[10] J.D. Farmer and J.J. Sidorowich. Exploiting Chaos to Predict the Future and Reduce Noise. Technical Report, Los Alamos National Laboratory, Los Alamos, New Mexico, 1988.
[11] J.D. Farmer and J.J. Sidorowich. Predicting chaotic time series. Physical Review Letters, 59:845, 1987.
[12] A. Lapedes. 1988. Personal communication.
[13] A.S. Lapedes and R. Farber. Nonlinear Signal Processing Using Neural Networks: Prediction and System Modeling. Technical Report, Los Alamos National Laboratory, Los Alamos, New Mexico, 1987.
[14] M.C. Mackey and L. Glass. Oscillation and chaos in physiological control systems. Science, 197:287, 1977.
[15] W.T. Miller, F.H. Glanz, and L.G. Kraft. Application of a general learning algorithm to the control of robotic manipulators. International Journal of Robotics Research, 6(2):84, 1987.
[16] W. Thomas Miller. Sensor-based control of robotic manipulators using a general learning algorithm. IEEE Journal of Robotics and Automation, RA-3(2):157, 1987.
[17] J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1989. To appear.
[18] J. Moody and C. Darken. Learning with localized receptive fields. In Touretzky, Hinton, and Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufmann, Publishers, 1988.
[19] S. Omohundro. Efficient algorithms with neural network behavior. Complex Systems, 1:273, 1987.
[20] T. Sejnowski and C. Rosenberg. Parallel networks that learn to pronounce English text. Complex Systems, 1:145, 1987.
[21] M.F. Tenorio and W.T. Lee. Self-organized neural networks for the identification problem. 
\n\nPoster paper presented at the Neural Infonnation Processing Systems Conference, 1988. \n\n\f", "award": [], "sourceid": 175, "authors": [{"given_name": "John", "family_name": "Moody", "institution": null}]}