Part of Advances in Neural Information Processing Systems 11 (NIPS 1998)

*Malik Magdon-Ismail, Amir Atiya*

We introduce two new techniques for density estimation. Our ap(cid:173) proach poses the problem as a supervised learning task which can be performed using Neural Networks. We introduce a stochas(cid:173) tic method for learning the cumulative distribution and an analo(cid:173) gous deterministic technique. We demonstrate convergence of our methods both theoretically and experimentally, and provide com(cid:173) parisons with the Parzen estimate. Our theoretical results demon(cid:173) strate better convergence properties than the Parzen estimate.

1

Introduction and Background

A majority of problems in science and engineering have to be modeled in a prob(cid:173) abilistic manner. Even if the underlying phenomena are inherently deterministic, the complexity of these phenomena often makes a probabilistic formulation the only feasible approach from the computational point of view. Although quantities such as the mean, the variance, and possibly higher order moments of a random variable have often been sufficient to characterize a particular problem, the quest for higher modeling accuracy, and for more realistic assumptions drives us towards modeling the available random variables using their probability density. This of course leads us to the problem of density estimation (see [6]).

The most common approach for density estimation is the nonparametric approach, where the density is determined according to a formula involving the data points available. The most common non parametric methods are the kernel density esti(cid:173) mator, also known as the Parzen window estimator [4] and the k-nearest neighbor technique [1]. Non parametric density estimation belongs to the class of ill-posed problems in the sense that small changes in the data can lead to large changes in

"To whom correspondence should be addressed.

Neural Networks for Density Estimation

523

the estimated density. Therefore it is important to have methods that are robust to slight changes in the data. For this reason some amount of regularization is needed [7]. This regularization is embedded in the choice of the smoothing parameter (ker(cid:173) nel width or k). The problem with these non-parametric techniques is their extreme sensitivity to the choice of the smoothing parameter. A wrong choice can lead to either undersmoothing or oversmoothing.

In spite of the importance of the density estimation problem, proposed methods using neural networks have been very sporadic. We propose two new methods for density estimation which can be implemented using multilayer networks. In addition to being able to approximate any function to any given precision, multilayer networks give us the flexibility to choose an error function to suit our application. The methods developed here are based on approximating the distribution function, in contrast to most previous works which focus on approximating the density itself. Straightforward differentiation gives us the estimate of the density function. The distribution function is often useful in its own right - one can directly evaluate quantiles or the probability that the random variable occurs in a particular interval.

One of the techniques is a stochastic algorithm (SLC), and the second is a determin(cid:173) istic technique based on learning the cumulative (SIC). The stochastic technique will generally be smoother on smaller numbers of data points, however, the de(cid:173) terministic technique is faster and applies to more that one dimension. We will present a result on the consistency and the convergence rate of the estimation error for our methods in the univariate case. When the unknown density is bounded and has bounded derivatives up to order K, we find that the estimation error is O((loglog(N)/N)-(l-t»), where N is the number of data points. As a comparison, for the kernel density estimator (with non-negative kernels), the estimation error is O(N-4 / 5 }, under the assumptions that the unknown density has a square integrable second derivative (see [6]), and that the optimal kernel width is used, which is not possible in practice because computing the optimal kernel width requires knowledge of the true density. One can see that for smooth density functions with bounded derivatives, our methods achieve an error rate that approaches O(N- 1 ).

2 New Density Estimation Techniques

To illustrate our methods, we will use neural networks, but stress that any suf(cid:173) ficiently general learning model will do just as well. The network's output will represent an estimate of the distribution function, and its derivative will be an estimate of the density. We will now proceed to a description of the two methods.

2.1 SLC (Stochastic Learning of the Cumulative)

Let Xn E R, n = 1, ... , N be the data points. Let the underlying density be g(x) and its distribution function G(x) = J~oog(t)dt. Let the neural network output be H (x, w), where w represents the set of weights of the network. Ideally, after training the neural network, we would like to have H (x, w) = G (x). It can easily be shown that the density of the random variable G(x) (x being generated according to g(x)) is uniform in [0,1]. Thus, if H(x,w) is to be as close as possible to G(x), then the network output should have a density that is close to uniform in [0,1]. This is what our goal will be. We will attempt to train the network such that its output density is uniform, then the network mapping should represent the distribution function G(x). The basic idea behind the proposed algorithm is to use the N data points as inputs to the network. For every training cycle, we generate a different set of N network targets randomly from a uniform distribution in [0, 1], and adjust

524

M Magdon-Ismail and A. Atiya

the weights to map the data points (sorted in ascending order) to these generated targets (also sorted in ascending order). Thus we are training the network to map the data to a uniform distribution.

Before describing the steps of the algorithm, we note that the resulting network has to represent a monotonically non decreasing mapping, otherwise it will not represent a legitimate distribution function. In our simulations, we used a hint penalty to enforce monotonicity [5]. The algorithm is as follows.

- Let Xl S X2 S ... S XN be the data points. Set t = 1, where t is the

training cycle number. Initialize the weights (usually randomly) to w(l).

- Generate randomly from a uniform distribution in [0,1] N points (and sort

them): UI S U2 S '" S UN· The point Un is the target output for X n· 3. Adjust the network weights according to the backpropagation scheme:

a£ w(t + 1) = w(t) - 17(t) aw

(1)

where £ is the objective function that includes the error term and the monotonicity hint penalty term [5]:

Do not remove: This comment is monitored to verify that the site is working properly