{"title": "Estimating Conditional Probability Densities for Periodic Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 641, "page_last": 648, "abstract": null, "full_text": "Estimating Conditional Probability Densities for Periodic Variables

Chris M Bishop and Claire Legleye
Neural Computing Research Group
Department of Computer Science and Applied Mathematics
Aston University
Birmingham, B4 7ET, U.K.
c.m.bishop@aston.ac.uk

Abstract

Most of the common techniques for estimating conditional probability densities are inappropriate for applications involving periodic variables. In this paper we introduce three novel techniques for tackling such problems, and investigate their performance using synthetic data. We then apply these techniques to the problem of extracting the distribution of wind vector directions from radar scatterometer data gathered by a remote-sensing satellite.

1 INTRODUCTION

Many applications of neural networks can be formulated in terms of a multi-variate non-linear mapping from an input vector x to a target vector t. A conventional neural network approach, based on least squares for example, leads to a network mapping which approximates the regression of t on x. A more complete description of the data can be obtained by estimating the conditional probability density of t, conditioned on x, which we write as p(t|x). Various techniques exist for modelling such densities when the target variables live in a Euclidean space. However, a number of potential applications involve angle-like output variables which are periodic on some finite interval (usually chosen to be (0, 2π)). For example, in Section 3 we consider the problem of determining the wind direction (a periodic quantity) from radar scatterometer data obtained from remote sensing measurements.
Most of the existing techniques for conditional density estimation cannot be applied in such cases.

A common technique for unconditional density estimation is based on mixture models of the form

p(t) = Σ_{i=1}^{m} α_i φ_i(t)    (1)

where the α_i are called mixing coefficients, and the kernel functions φ_i(t) are frequently chosen to be Gaussians. Such models can be used as the basis of techniques for conditional density estimation by allowing the mixing coefficients, and any parameters governing the kernel functions, to be general functions of the input vector x. This can be achieved by relating these quantities to the outputs of a neural network which takes x as input, as shown in Figure 1. Such an approach forms the basis of the 'mixture of experts' model (Jacobs et al., 1991) and has also been considered by a number of other authors (White, 1992; Bishop, 1994; Lui, 1994). In this paper we introduce three techniques for estimating conditional densities of periodic variables, based on extensions of the above formalism for Euclidean variables.

Figure 1: A general framework for conditional density estimation is obtained by using a feed-forward neural network whose outputs determine the parameters in a mixture density model. The mixture model then represents the conditional probability density of the target variables, conditioned on the input vector to the network.

2 DENSITY ESTIMATION FOR PERIODIC VARIABLES

In this section we consider three alternative approaches to estimating the conditional density p(θ|x) of a periodic variable θ, conditioned on an input vector x.
They are based respectively on a transformation to an extended domain representation, the use of adaptive circular normal kernel functions, and the use of fixed circular normal kernels.

2.1 TRANSFORMATION TO AN EXTENDED VARIABLE DOMAIN

The first technique which we consider involves finding a transformation from the periodic variable θ ∈ (0, 2π) to a Euclidean variable χ ∈ (−∞, ∞), such that standard techniques for conditional density estimation can be applied in χ-space. In particular, we seek a conditional density function p(χ|x) which is to be modelled using a conventional Gaussian mixture approach as described in Section 1. Consider the transformation

p(θ|x) = Σ_{L=−∞}^{∞} p(θ + 2πL|x)    (2)

Then it is clear by construction that the density model on the left hand side satisfies the periodicity requirement p(θ + 2π|x) = p(θ|x). Furthermore, if the density function p(χ|x) is normalized, then we have

∫₀^{2π} p(θ|x) dθ = 1    (3)

and so the corresponding periodic density p(θ|x) will also be normalized. We now model the density function p(χ|x) using a mixture of Gaussians of the form

p(χ|x) = Σ_{i=1}^{m} α_i(x) φ_i(χ|x)    (4)

where the kernel functions are given by

φ_i(χ|x) = (1 / ((2π)^{1/2} σ_i(x))) exp( −{χ − χ_i(x)}² / (2σ_i²(x)) )    (5)

and the parameters α_i(x), σ_i(x) and χ_i(x) are determined by the outputs of a feed-forward network. In particular, the mixing coefficients α_i(x) are governed by a 'softmax' activation function to ensure that they lie in the range (0, 1) and sum to unity; the width parameters σ_i(x) are given by the exponentials of the corresponding network outputs to ensure their positivity; and the basis function centres χ_i(x) are given directly by network output variables.
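As an illustrative sketch (not part of the original implementation), the mapping from raw network outputs to the mixture parameters, and the resulting density of equations (4) and (5), might be written in NumPy as follows; the function and variable names are our own:

```python
import numpy as np

def mixture_params(z_alpha, z_sigma, z_centre):
    # z_alpha, z_sigma, z_centre: raw network outputs of shape (m,),
    # one entry per kernel in the mixture.
    e = np.exp(z_alpha - np.max(z_alpha))
    alpha = e / e.sum()          # softmax: in (0, 1) and summing to unity
    sigma = np.exp(z_sigma)      # exponential: strictly positive widths
    chi_c = z_centre             # centres taken directly from the outputs
    return alpha, sigma, chi_c

def p_chi(chi, alpha, sigma, chi_c):
    # Gaussian mixture density p(chi|x) of equations (4) and (5).
    phi = np.exp(-(chi - chi_c) ** 2 / (2.0 * sigma ** 2)) \
          / (np.sqrt(2.0 * np.pi) * sigma)
    return float(np.sum(alpha * phi))
```

Because the mixing coefficients pass through a softmax and the widths through an exponential, the resulting density is non-negative and normalized over χ by construction.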
The network is trained by maximizing the likelihood function, evaluated for the set of training data, with respect to the weights and biases in the network. For a training set consisting of N input vectors x^n and corresponding targets θ^n, the likelihood is given by

ℒ = Π_{n=1}^{N} p(θ^n|x^n) p(x^n)    (6)

where p(x) is the unconditional density of the input data. Rather than work with ℒ directly, it is convenient instead to minimize an error function given by the negative log of the likelihood. Making use of (2) we can write this in the form

E = −ln ℒ ≃ −Σ_n ln Σ_L p(θ^n + 2πL|x^n)    (7)

where we have dropped the term arising from p(x) since it is independent of the network weights. This expression is very similar to the one which arises if we perform density estimation on the real axis, except for the extra summation over L, which means that the data point θ^n recurs at intervals of 2π along the χ-axis. This is not equivalent simply to replicating the data, however, since the summation over L occurs inside the logarithm, rather than outside as with the summation over data points n.

In a practical implementation, it is necessary to restrict the summation over L. For the results presented in the next section, this summation was taken over 7 complete periods of 2π spanning the range (−7π, 7π). Since the Gaussians have exponentially decaying tails, this represents an extremely good approximation in almost all cases, provided we take care in initializing the network weights so that the Gaussian kernels lie in the central few periods. Derivatives of E with respect to the network weights can be computed using the rules of calculus, to give a modified form of back-propagation. These derivatives can then be used with standard optimization techniques to find a minimum of the error function. (The results presented in the next section were obtained using the BFGS quasi-Newton algorithm.)
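A minimal sketch of the per-pattern error of equation (7) with the truncated sum over L might read as follows (illustrative code, assuming the Gaussian mixture parametrization of Section 2.1; the text sums over 7 complete periods, which we approximate here by L = −3, …, 3):

```python
import numpy as np

def periodic_nll(theta, alpha, sigma, chi_c, half_periods=3):
    # Per-pattern error -ln p(theta|x) from equation (7), with the
    # infinite sum over L truncated; half_periods=3 gives L = -3..3,
    # an illustrative stand-in for the paper's 7-period truncation.
    total = 0.0
    for L in range(-half_periods, half_periods + 1):
        chi = theta + 2.0 * np.pi * L
        phi = np.exp(-(chi - chi_c) ** 2 / (2.0 * sigma ** 2)) \
              / (np.sqrt(2.0 * np.pi) * sigma)
        total += np.sum(alpha * phi)
    return -np.log(total)
```

The truncation is harmless provided the Gaussian kernels stay in the central few periods, since their tails decay exponentially.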
2.2 MIXTURES OF CIRCULAR NORMAL DENSITIES

The second approach which we introduce is also based on a mixture of kernel functions of the form (1), but in this case the kernel functions themselves are periodic, thereby ensuring that the overall density function will be periodic. To motivate this approach, consider the problem of modelling the distribution of a velocity vector v in two dimensions (this arises, for example, in the application considered in Section 3). Since v lives in a Euclidean plane, we can model the density function p(v) using a mixture of conventional spherical Gaussian kernels, where each kernel has the form

φ(v_x, v_y) = (1 / (2πσ²)) exp( −[{v_x − μ_x}² + {v_y − μ_y}²] / (2σ²) )    (8)

where (v_x, v_y) are the Cartesian components of v, and (μ_x, μ_y) are the components of the centre μ of the kernel. From this we can extract the conditional distribution of the polar angle θ of the vector v, given a value for v = ‖v‖. This is easily done with the transformation v_x = v cos θ, v_y = v sin θ, and defining θ₀ to be the polar angle of μ, so that μ_x = μ cos θ₀ and μ_y = μ sin θ₀, where μ = ‖μ‖. This leads to a distribution which can be written in the form

φ(θ) = (1 / (2π I₀(A))) exp{ A cos(θ − θ₀) }    (9)

where the normalization coefficient has been expressed in terms of the zeroth order modified Bessel function of the first kind, I₀(A). The distribution (9) is known as a circular normal or von Mises distribution (Mardia, 1972). The parameter A (which depends on v in our derivation) is analogous to the (inverse) variance parameter in a conventional normal distribution.
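The circular normal density (9) is straightforward to evaluate; a minimal sketch, using NumPy's built-in Bessel function, might be:

```python
import numpy as np

def circular_normal(theta, theta0, A):
    # Circular normal (von Mises) density of equation (9);
    # np.i0 is the zeroth-order modified Bessel function I_0.
    return np.exp(A * np.cos(theta - theta0)) / (2.0 * np.pi * np.i0(A))
```

By construction this density is periodic in θ with period 2π and integrates to unity over any interval of length 2π.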
Since (9) is periodic, we can construct a general representation for the conditional density of a periodic variable by considering a mixture of circular normal kernels, with parameters given by the outputs of a neural network. The weights of the network can again be determined by maximizing the likelihood function defined over a set of training data.

2.3 FIXED KERNELS

The third approach introduced here is again based on a mixture model in which the kernel functions are periodic, but where the kernel parameters (specifying their width and location) are fixed. The only adaptive parameters are the mixing coefficients, which are again determined by the outputs of a feed-forward network having a softmax final-layer activation function. Here we consider a set of equally-spaced circular normal kernels in which the width parameters are chosen to give a moderate degree of overlap between the kernels so that the resulting representation for the density function will be reasonably smooth. Again, a maximum likelihood formalism is employed to train the network. Clearly a major drawback of fixed-kernel methods is that the number of kernels must grow exponentially with the dimensionality of the output space. For a single output variable, however, they can be regarded as practical techniques.

3 RESULTS

In order to test and compare the methods introduced above, we first consider a simple problem involving synthetic data, for which the true underlying distribution function is known. This data set is intended to mimic the central properties of the real data to be discussed in the next section. It has a single input variable x and an output variable θ which lies in the range (0, 2π). The distribution of θ is governed by a mixture of two triangular functions whose parameters (locations and widths) are functions of x.
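The fixed-kernel representation of Section 2.3 can be sketched as follows (illustrative code; the kernel count and the common width parameter A are our own choices, not values from the paper):

```python
import numpy as np

def fixed_kernel_density(theta, alpha, n_kernels=8, A=4.0):
    # Mixture of equally-spaced circular normal kernels with fixed
    # centres and a common fixed width parameter A; only the mixing
    # coefficients alpha (softmax outputs of the network) adapt.
    # n_kernels and A are illustrative, chosen so neighbouring
    # kernels overlap moderately and the density is smooth.
    centres = 2.0 * np.pi * np.arange(n_kernels) / n_kernels
    phi = np.exp(A * np.cos(theta - centres)) / (2.0 * np.pi * np.i0(A))
    return float(np.sum(alpha * phi))
```

Since each kernel is normalized and the mixing coefficients sum to unity, the mixture is itself a normalized periodic density for any setting of alpha.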
Here we present preliminary results from the application of the method introduced in Section 2.1 (involving the transformation to Euclidean space) to this data. Figure 2 shows a plot of the reconstructed conditional density in both the extended χ variable, and in the reconstructed polar variable θ, for a particular value of the input variable x.

Figure 2: The left hand plot shows the predicted density (solid curve) together with the true density (dashed curve) in the extended χ space. The right hand plot shows the corresponding densities in the periodic θ space. In both cases the input variable is fixed at x = 0.5.

One of the original motivations for developing the techniques described in this paper was to provide an effective, principled approach to the analysis of radar scatterometer data from satellites such as the European Remote Sensing Satellite ERS-1. This satellite is equipped with three C-band radar antennae which measure the total backscattered power (called σ₀) along three directions relative to the satellite track, as shown in Figure 3. When the satellite passes over the ocean, the strengths of the backscattered signals are related to the surface ripples of the water (on length-scales of a few cm) which in turn are determined by the low level winds. Extraction of the wind speed and direction from the radar signals represents an inverse problem which is typically multi-valued. For example, a wind direction of θ₁ will give rise to similar radar signals to a wind direction of θ₁ + π. Often, there are additional such 'aliases' at other angles.
A conventional neural network approach to this problem, based on least-squares, would predict wind directions which were given by conditional averages of the target data. Since the average of several valid wind directions is typically not itself a valid direction, such an approach would clearly fail. Here we aim to extract the complete distribution of wind directions (as a function of the three σ₀ values and of the angle of incidence of the radar beam) and hence avoid such difficulties. This approach also provides the most complete information for the next stage of processing (not considered here) which is to 'de-alias' the wind directions to extract the most probable overall wind field.

Figure 3: Schematic illustration of the ERS-1 satellite showing the footprints of the three radar scatterometers.

A large data set of ERS-1 measurements, spanning a wide range of meteorological conditions, has been assembled by the European Space Agency in collaboration with the UK Meteorological Office. Labelling of the data set was performed using wind vectors from the Meteorological Office Numerical Weather Prediction code. An example of the results from the fixed-kernel method of Section 2.3 is presented in Figure 4. This clearly shows the existence of a primary alias at an angle of π relative to the principal direction, as well as secondary aliases at ±π/2.

Acknowledgements

We are grateful to the European Space Agency and the UK Meteorological Office for making available the ERS-1 data. We would also like to thank Iain Strachan and Ian Kirk of AEA Technology for a number of useful discussions relating to the interpretation of this data.

References

Bishop C M (1994). Mixture density networks.
Neural Computing Research Group Report, NCRG/4288, Department of Computer Science, Aston University, Birmingham, U.K.

Jacobs R A, Jordan M I, Nowlan S J and Hinton G E (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79-87.

Figure 4: An example of the results obtained with the fixed-kernel method applied to data from the ERS-1 satellite. As well as the primary wind direction, there are aliases at π and ±π/2.

Lui Y (1994). Robust parameter estimation and model selection for neural network regression. Advances in Neural Information Processing Systems 6, Morgan Kaufmann, 192-199.

Mardia K V (1972). Statistics of Directional Data. Academic Press, London.

White H (1992). Parametric statistical estimation with artificial neural networks. University of California, San Diego, Technical Report.
", "award": [], "sourceid": 938, "authors": [{"given_name": "Chris", "family_name": "Bishop", "institution": null}, {"given_name": "Claire", "family_name": "Legleye", "institution": null}]}