{"title": "Interpolating Earth-science Data using RBF Networks and Mixtures of Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 988, "page_last": 994, "abstract": null, "full_text": "Interpolating Earth-science Data using RBF \n\nNetworks and Mixtures of Experts \n\nE.VVan \n\nD.Bone \n\nDivision of Infonnation Technology \n\nCanberra Laboratory, CSIRO \n\nGPO Box 664, Canberra, ACT, 2601, Australia \n\n{ernest, don} @cbr.dit.csiro.au \n\nAbstract \n\nWe present a mixture of experts (ME) approach to interpolate sparse, \nspatially correlated earth-science data. Kriging is an interpolation \nmethod which uses a global covariation model estimated from the data \nto take account of the spatial dependence in the data. Based on the \nclose relationship between kriging and the radial basis function (RBF) \nnetwork (Wan & Bone, 1996), we use a mixture of generalized RBF \nnetworks to partition the input space into statistically correlated \nregions and learn the local covariation model of the data in each \nregion. Applying the ME approach to simulated and real-world data, \nwe show that it is able to achieve good partitioning of the input space, \nlearn the local covariation models and improve generalization. \n\n1. INTRODUCTION \n\nKriging is an interpolation method widely used in the earth sciences, which models the \nsurface to be interpolated as a stationary random field (RF) and employs a linear model. \nThe value at an unsampled location is evaluated as a weighted sum of the sparse, \nspatially correlated data points. The weights take account of the spatial correlation \nbetween the available data points and between the unknown points and the available data \npoints. The spatial dependence is specified in the form of a global covariation model. 
Assuming global stationarity, the kriging predictor is the best unbiased linear predictor of the unsampled value when the true covariation model is used, in the sense that it minimizes the squared error variance under the unbiasedness constraint. In practice, however, the covariation of the data is unknown and has to be estimated from the data by an initial spatial data analysis. The analysis fits a covariation model to a covariation measure of the data, such as the sample variogram or the sample covariogram, either graphically or by means of various least squares (LS) and maximum likelihood (ML) approaches. Valid covariation models are all radial basis functions.

Optimal prediction is achieved when the true covariation model of the data is used. In general, prediction (or generalization) improves as the covariation model used more closely matches the true covariation of the data. Nevertheless, estimating the covariation model from earth-science data has proved difficult in practice due to the sparseness of data samples. Furthermore, for many data sets the global stationarity assumption is not valid. To address this, data sets are commonly partitioned manually into smaller regions within which the stationarity assumption is valid, or approximately so.

In a previous paper, we showed that there is a close, formal relationship between kriging and RBF networks (Wan & Bone, 1996). In the equivalent RBF network formulation of kriging, the input vector is a coordinate and the output is a scalar physical quantity of interest. We pointed out that, under the stationarity assumption, the radial basis function used in an RBF network can be viewed as a covariation model of the data.
We showed that an RBF network whose RBF units share an adaptive norm weighting matrix can be used to estimate the parameters of the postulated covariation model, outperforming more conventional methods. In the rest of this paper we refer to such a generalization of the RBF network as a generalized RBF (GRBF) network.

In this paper, we discuss how a mixture of GRBF networks can be used to partition the input space into statistically correlated regions and learn the local covariation model of each region. We demonstrate the effectiveness of the ME approach on a simulated data set and an aero-magnetic data set. We also compare the prediction accuracy of a single GRBF network and other, more traditional RBF networks.

2 MIXTURE OF GRBF EXPERTS

Mixture of experts (Jacobs et al., 1991) is a modular neural network architecture in which a number of expert networks, augmented by a gating network, compete to learn the data. The gating network learns to assign probability to the experts according to their performance over various parts of the input space, and combines the outputs of the experts accordingly. During training, each expert is made to focus on modelling the local mapping it performs best, improving its performance further. Competition among the experts achieves a soft partitioning of the input space into regions, with each expert network learning a separate local mapping. A hierarchical generalization of ME, the hierarchical mixture of experts (HME), in which each expert is allowed to expand into a gating network and a set of sub-experts, has also been proposed (Jordan & Jacobs, 1994).

Under the global stationarity assumption, training a GRBF network by minimizing the mean squared prediction error involves adjusting its norm weighting matrix. This can be interpreted as an attempt to match the RBF to the covariation of the data.
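As a minimal sketch of this idea, the following shows RBF units whose response depends on an adaptive norm weighting matrix M shared by all units; the Gaussian basis, the names and the toy values are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def weighted_norm(x, c, M):
    """|| x - c ||_M = sqrt((x - c)^T M^T M (x - c))."""
    v = M @ (x - c)
    return np.sqrt(v @ v)

def grbf_output(x, centres, weights, bias, M):
    """y(x) = sum_j w_j * phi(|| x - c_j ||_M) + w_0, with a Gaussian phi."""
    phi = np.array([np.exp(-weighted_norm(x, c, M) ** 2) for c in centres])
    return weights @ phi + bias

centres = np.array([[0.0, 0.0], [1.0, 1.0]])
weights = np.array([0.5, -0.25])
M = np.diag([2.0, 0.5])  # unequal diagonal: an anisotropic norm
print(grbf_output(np.array([0.0, 0.0]), centres, weights, 0.1, M))
```

Adapting M stretches or shrinks the basis along different directions, which is how a GRBF network can match anisotropic covariation.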
It then seems natural to use a mixture of GRBF networks when only local stationarity can be assumed. After training, the gating network soft-partitions the input space into statistically correlated regions and each GRBF network provides a model of the covariation of the data for a local region. Instead of an ME architecture, an HME architecture can be used; however, to simplify the discussion we restrict ourselves to the ME architecture.

Each expert in the mixture is a GRBF network. The output of expert i is given by:

    y_i(x; θ_i) = Σ_{j=1}^{n_i} w_ij φ(x; c_ij, M_i) + w_i0    (2.1)

where n_i is the number of RBF units, θ_i = {{w_ij}_{j=0}^{n_i}, {c_ij}_{j=1}^{n_i}, M_i} are the parameters of the expert, and φ(x; c, M) = φ(|| x - c ||_M). Assuming zero-mean Gaussian error with common variance σ_i², the conditional probability of y given x and θ_i is given by:

    p(y | x, θ_i) = (2πσ_i²)^{-1/2} exp(-(y - y_i(x; θ_i))² / (2σ_i²))    (2.3)

Since the radial basis functions we used have compact support and each expert only learns a local covariation model, small GRBF networks spanning overlapping regions can be used to reduce computation at the expense of some resolution in locating the boundaries of the regions. Also, only the subset of data within and around the region spanned by a GRBF network is needed to train it, further reducing computational effort.

With m experts, the ith output of the gating network gives the probability of selecting expert i and is given by the normalized function:

    g_i(x; υ) = P(i | x, υ) = a_i exp(q(x; υ_i)) / Σ_{j=1}^{m} a_j exp(q(x; υ_j))    (2.4)

where υ = {{a_j}_{j=1}^{m}, {υ_j}_{j=1}^{m}}.
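A sketch of such a gating network with linear scores, blending stubbed expert outputs with the resulting probabilities, follows; the function names, the toy gating parameters and the constant stub experts are illustrative assumptions:

```python
import numpy as np

def gate(x, V):
    """Softmax over linear gating scores; row i of V holds [v_i, bias_i]."""
    q = V @ np.append(x, 1.0)
    e = np.exp(q - q.max())  # subtract the max for numerical stability
    return e / e.sum()

def mixture(x, V, experts):
    """Overall output: y(x) = sum_i g_i(x) * y_i(x)."""
    g = gate(x, V)
    return sum(gi * f(x) for gi, f in zip(g, experts))

V = np.array([[1.0, 0.0, 0.0],    # expert 0 favoured where x[0] > 0
              [-1.0, 0.0, 0.0]])  # expert 1 favoured where x[0] < 0
experts = [lambda x: 1.0, lambda x: -1.0]  # constant stub experts
print(mixture(np.array([5.0, 0.0]), V, experts))
```

Far from the gating boundary one expert dominates, so the mixture output approaches that expert's output.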
Using q(x; υ_i) = υ_i^T [x^T 1]^T and setting all a_i's to 1, the gating network implements the softmax function and partitions the input space into a smoothed planar tessellation. Alternatively, with q(x; υ_i) = -|| T_i(x - u_i) ||² (where υ_i = {u_i, T_i} consists of a location vector and an affine transformation matrix) and restricting the a_i's to be non-negative, the gating network divides the input space into packed anisotropic ellipsoids. These two partitionings are quite convenient and adequate for most earth-science applications, where x is a 2D or 3D coordinate.

The outputs of the experts are combined to give the overall output of the mixture:

    y(x; Θ) = Σ_{i=1}^{m} P(i | x, υ) y_i(x; θ_i) = Σ_{i=1}^{m} g_i(x; υ) y_i(x; θ_i)    (2.5)

where Θ = {υ, {θ_i}_{i=1}^{m}}, and the conditional probability of observing y given x and Θ is:

    p(y | x, Θ) = Σ_{i=1}^{m} P(i | x, υ) p(y | x, θ_i).    (2.6)

3 THE TRAINING ALGORITHM

The Expectation-Maximization (EM) algorithm of Jordan and Jacobs is used to train the mixture of GRBF networks. Instead of computing the ML estimates, we extend the algorithm by including priors on the parameters of the experts and compute the maximum a posteriori (MAP) estimates. Since an expert may be focusing on a small subset of the data, the priors help to prevent over-fitting and improve generalization.

Jordan & Jacobs introduced a set of indicator random variables Z = {Z...

Spherical does NOT mean isotropic.

An HME with 4 GRBF network experts, each with 36 spherical units, is used to learn the local covariation models and the mapping. Softmax gating networks are used and each expert is somewhat 'localized' in each quadrant of the input space. The units of the experts are located at the same locations as the units of the 64-unit GRBF network, with 24 overlapping units between any two of the experts.
The design ensures that the HME does not have an advantage over the 64-unit GRBF network if the data is indeed globally stationary. Figure 2 shows the local covariation models learned by the HME with the smoothness priors, and Figure 3c shows the interpolant generated and the partitioning.

Figure 1: The profile of the true local covariation models of the simulated data set. Exponential and spherical models are used: (a) NW (exponential), (b) NE (spherical), (c) SW (spherical), (d) SE (spherical).

Figure 2: The profile of the local covariation models learned by the HME: (a) NW (spherical), (b) NE (spherical), (c) SW (spherical), (d) SE (spherical).

Figure 3: (a) Simulated data set and true partitions. (b) Interpolant generated by the 144 spherical unit GRBFN. (c) The HME interpolant and the soft partitioning learned (0.5 and 0.9 probability contours of the 4 experts shown in solid and dotted lines respectively).

Table 1: Normalized mean squared prediction error for the simulated data set.
    Network                                                       RBF units        NMSE
    RBFN (isotropic units, width = distance to nearest neighbor)  64, Gaussian     0.761
                                                                  144, Gaussian    0.616
                                                                  400, Gaussian    0.543
    RBFN (identical isotropic units with adaptive width)          64, Gaussian     0.477
                                                                  144, Gaussian    0.475
    GRBFN (identical units with adaptive norm weighting matrix)   64, spherical    0.506
                                                                  144, spherical   0.431
    HME (2 levels, 4 GRBFN experts) without priors                4x36, spherical  0.938
    HME (2 levels, 4 GRBFN experts) with priors                   4x36, spherical  0.433
    kriging predictor (using true local models)                   -                0.372

For comparison, a number of ordinary RBF networks are also used to learn the mapping. In all cases, the RBF units of networks of the same size share the same locations, which are preset by a Kohonen map. Table 1 summarizes the normalized mean squared prediction error (NMSE) - the squared prediction error divided by the variance of the validation set - for each network. With the exception of the HME, all results listed are obtained with a smoothness prior and a regularization parameter of 0.1. Ordinary weight decay is used for RBF networks with units of varying widths, and the smoothness prior discussed in section 3 is used for the remaining networks. The NMSE of the kriging predictor that uses the true local models is also listed as a reference.

Similar experiments are also conducted on a real aero-magnetic data set. The flight paths along which the data is collected are divided into a training set of 740 data points and a validation set of 1690 points. The NMSE for each network is summarized in Table 2, the local covariation models learned by the HME are shown in Figure 4, and the interpolant generated by the HME and the partitioning is shown in Figure 5b.
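The NMSE reported in Tables 1 and 2 - the mean squared prediction error divided by the variance of the validation targets - can be sketched as follows; the function name and toy values are illustrative:

```python
import numpy as np

def nmse(y_true, y_pred):
    """Mean squared prediction error normalized by the target variance.
    NMSE = 1 corresponds to always predicting the validation-set mean."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

y_true = [1.0, 2.0, 3.0, 4.0]
print(nmse(y_true, [1.1, 1.9, 3.2, 3.8]))  # small: a good predictor
print(nmse(y_true, [2.5, 2.5, 2.5, 2.5]))  # 1.0: predicting the mean
```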
\n\n(a) NW (spherical) \n\n(b) NE (spherical) \n\n(a) \n\n(b) \n\n120 \n\n80 \n\n40 \n\n40 80 120 \n\n40 80 120 \n\n40 \n\n120 \n\n-100 \n\n0 \n\n100 -100 \n\n-::.::~ 80 \n'~[I]'~~ \nl00~ \n\n(c) SW (spherical) \n\n(d) SE (spherical) \n\n-100 \n\n0 \n\n100 \n\n-100 \n\n0 \n\n100 -100 \n\n0 \n\n100 \n\nFigure 5: (a) Thin-plate interpolant of the \nentire aero-magnetic data set. (b) The HME \ninterpolant and the soft partitioning (0.5, 0.9 \nprobability contours of the 4 experts shown \n\nin solid and dotted lines respectively). \n\nFigure 4: The profile of the local covariation models \nof the aero-magnetic data set learned by the HME. \n\nTable 2: Normalized mean squared prediction error for the aero-magnetic data set. \n\nNetwork \n\nRBFN (isotropic RBF units with width set to the \ndistance to the nearest neighbor) \nRBFN (isotropic RBF units with width set to the \nmean distance to the 8 nearest neighbors) \nRBFN (identical isotropic RBF units with adaptive \nwidth) \nGRBFN (identical RBF units with adaptive norm \nweighting matrix) \nHME (2 levels, 4 GRBFN experts) without priors \nHME (2 levels, 4 GRBFN expertsl with priors \n\nRBF units \n49, Gaussian \n100, Gaussian \n49, Gaussian \n100, Gaussian \n49, Gaussian \n100, Gaussian \n49, spherical \n100 spherical \n4x25, spherical \n4x25, spherical \n\nNMSE \n1.158 \n1.256 \n0.723 \n0.699 \n0.692 \n0.614 \n0.684 \n0.612 \n0.389 \n0.315 \n\n5 DISCUSSION \n\nThe ordinary RBF networks perform worst with both the simulated data and the aero(cid:173)\nmagnetic data. As neither data set is globally stationary, the GRBF networks do not \nimprove prediction accuracy over the corresponding RBF networks that use isotropic \nGaussian units. In both cases, the hierarchical mixture of GRBF networks improves the \nprediction accuracy when the smoothness priors are used. Without the priors, the ML \nestimates of the HME parameters lead to improbably high and low predictions. \n\n\f994 \n\nE. Wan and D. 
Bone \n\nThe improvement in prediction accuracy is more significant for the aero-magnetic data \nset than for the simulated data set due to some apparent global covariation of the \nsimulated data which only becomes evident when the directional variograms of the data \nare plotted. However, despite the similar NMSE, Figure 3 shows that the interpolant \ngenerated by the 144-unit GRBF network does not contain the structural information \nthat is captured by the HME interpolant and is most evident in the north-east region. \nIn the case of the simulated data set, the HME learns the local covariation models \naccurately despite the fact that the bottom level gating networks fail to partition the input \nspace precisely along the north-south direction. The availability of more data and the \nstraight east-west discontinuity allows the upper gating network to partition the input \nspace precisely along the east-west direction. In the north-west region, although the class \nof function the expert used is different from that of the true model, the model learned \nstill resembles the true model especially in the inner region where it matters most. \nIn the case of the aero-magnetic data set, the RBF and GRBF networks perform poorly \ndue to the considerable extrapolation that is required in the prediction and the absence of \nglobal stationarity. However, the HME whose units capture the local covariation of the \ndata interpolates and extrapolates significantly better. The partitioning as well as the \nlocal covariation model learned by the HME seems to be reasonably accurate and leads \nto the construction of prominent ridge-like structures in the north-west and south-east \nwhich are only apparent in the thin-plate interpolant of the entire data set of Figure Sa. 
6 CONCLUSIONS

We have shown that a mixture of GRBF networks can be used to learn the local covariation of spatial data and improve prediction (or generalization) when the data is approximately locally stationary - a viable assumption in many earth-science applications. We believe that the improvement will be even more significant for data sets with larger spatial extent, especially if the local regions are more statistically distinct. The estimation of the local covariation models of the data, and the use of these models in producing the interpolant, helps to capture the structural information in the data which, apart from the accuracy of the prediction, is of critical importance to many earth-science applications. The ME approach allows the objective and automatic partitioning of the input space into statistically correlated regions. It also allows the use of a number of small local GRBF networks, each trained on a subset of the data, making it scalable to large data sets. The mixture of GRBF networks approach is motivated by the statistical interpolation method of kriging. The approach therefore has a very sound physical interpretation, and all the parameters of the network have clear statistical and/or physical meanings.

References

Cressie, N. A. (1993). Statistics for Spatial Data. Wiley, New York.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J. & Hinton, G. E. (1991). Adaptive Mixtures of Local Experts. Neural Computation 3, pp. 79-87.

Jordan, M. I. & Jacobs, R. A. (1994). Hierarchical Mixtures of Experts and the EM Algorithm. Neural Computation 6, pp. 181-214.

MacKay, D. J. (1992). Bayesian Interpolation. Neural Computation 4, pp. 415-447.

Orr, M. J. (1995). Regularization in the Selection of Radial Basis Function Centers. Neural Computation 7, pp. 606-623.

Poggio, T. & Girosi, F. (1990). Networks for Approximation and Learning. In Proceedings of the IEEE 78, pp. 1481-1497.

Wan, E. & Bone, D. (1996).
A Neural Network Approach to Covariation Model Fitting and the Interpolation of Sparse Earth-science Data. In Proceedings of the Seventh Australian Conference on Neural Networks, pp. 121-126.", "award": [], "sourceid": 1313, "authors": [{"given_name": "Ernest", "family_name": "Wan", "institution": null}, {"given_name": "Don", "family_name": "Bone", "institution": null}]}