{"title": "Neural Networks for Density Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 522, "page_last": 528, "abstract": null, "full_text": "Neural Networks for Density Estimation\n\nMalik Magdon-Ismail*\nmagdon@cco.caltech.edu\n\nAmir Atiya\namir@deep.caltech.edu\n\nCaltech Learning Systems Group\nDepartment of Electrical Engineering\nCalifornia Institute of Technology\n136-93 Pasadena, CA, 91125\n\nAbstract\n\nWe introduce two new techniques for density estimation. Our approach poses the problem as a supervised learning task which can be performed using Neural Networks. We introduce a stochastic method for learning the cumulative distribution and an analogous deterministic technique. We demonstrate convergence of our methods both theoretically and experimentally, and provide comparisons with the Parzen estimate. Our theoretical results demonstrate better convergence properties than the Parzen estimate.\n\n1 Introduction and Background\n\nA majority of problems in science and engineering have to be modeled in a probabilistic manner. Even if the underlying phenomena are inherently deterministic, the complexity of these phenomena often makes a probabilistic formulation the only feasible approach from the computational point of view. Although quantities such as the mean, the variance, and possibly higher order moments of a random variable have often been sufficient to characterize a particular problem, the quest for higher modeling accuracy, and for more realistic assumptions drives us towards modeling the available random variables using their probability density. This of course leads us to the problem of density estimation (see [6]). 
\n\nThe most common approach for density estimation is the nonparametric approach, where the density is determined according to a formula involving the available data points. The most common nonparametric methods are the kernel density estimator, also known as the Parzen window estimator [4], and the k-nearest neighbor technique [1]. Nonparametric density estimation belongs to the class of ill-posed problems in the sense that small changes in the data can lead to large changes in the estimated density. Therefore it is important to have methods that are robust to slight changes in the data. For this reason some amount of regularization is needed [7]. This regularization is embedded in the choice of the smoothing parameter (kernel width or k). The problem with these nonparametric techniques is their extreme sensitivity to the choice of the smoothing parameter. A wrong choice can lead to either undersmoothing or oversmoothing.\n\n*To whom correspondence should be addressed.\n\nIn spite of the importance of the density estimation problem, proposed methods using neural networks have been very sporadic. We propose two new methods for density estimation which can be implemented using multilayer networks. In addition to being able to approximate any function to any given precision, multilayer networks give us the flexibility to choose an error function to suit our application. The methods developed here are based on approximating the distribution function, in contrast to most previous works, which focus on approximating the density itself. Straightforward differentiation gives us the estimate of the density function. The distribution function is often useful in its own right: one can directly evaluate quantiles or the probability that the random variable falls in a particular interval. 
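\n\nAs a brief illustration of these uses (a sketch only, in the notation introduced in Section 2, where $G$ is the distribution function, $g$ its density and $H(x, w)$ the network output; the quantile symbol $q_p$ is introduced here purely for illustration), in the univariate case one can write\n\n$$g(x) = \\frac{dG(x)}{dx}, \\qquad P(a < X \\le b) = G(b) - G(a), \\qquad q_p = \\inf\\{x : G(x) \\ge p\\},$$\n\nso that, assuming a trained network satisfies $H(x, w) \\approx G(x)$, the same quantities could be estimated by substituting $H$ for $G$, for instance $P(a < X \\le b) \\approx H(b, w) - H(a, w)$. 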
\n\nOne of the techniques is a stochastic algorithm (SLC), and the second is a deterministic technique based on learning the cumulative (SIC). The stochastic technique will generally be smoother on smaller numbers of data points; however, the deterministic technique is faster and applies to more than one dimension. We will present a result on the consistency and the convergence rate of the estimation error for our methods in the univariate case. When the unknown density is bounded and has bounded derivatives up to order $K$, we find that the estimation error is $O((\\log\\log(N)/N)^{1-1/K})$, where $N$ is the number of data points. As a comparison, for the kernel density estimator (with non-negative kernels), the estimation error is $O(N^{-4/5})$, under the assumptions that the unknown density has a square integrable second derivative (see [6]) and that the optimal kernel width is used, which is not possible in practice because computing the optimal kernel width requires knowledge of the true density. One can see that for smooth density functions with bounded derivatives, our methods achieve an error rate that approaches $O(N^{-1})$.\n\n2 New Density Estimation Techniques\n\nTo illustrate our methods, we will use neural networks, but stress that any sufficiently general learning model will do just as well. The network's output will represent an estimate of the distribution function, and its derivative will be an estimate of the density. We will now proceed to a description of the two methods.\n\n2.1 SLC (Stochastic Learning of the Cumulative)\n\nLet $x_n \\in \\mathbb{R}$, $n = 1, \\ldots, N$, be the data points. Let the underlying density be $g(x)$ and its distribution function $G(x) = \\int_{-\\infty}^{x} g(t)\\,dt$. Let the neural network output be $H(x, w)$, where $w$ represents the set of weights of the network. Ideally, after training the neural network, we would like to have $H(x, w) = G(x)$. 
It can easily be shown that the random variable $G(x)$ ($x$ being generated according to $g(x)$) is uniformly distributed in [0,1]. Thus, if $H(x, w)$ is to be as close as possible to $G(x)$, then the network output should have a density that is close to uniform in [0,1]. This is our goal: if we train the network such that its output density is uniform, then the network mapping should represent the distribution function $G(x)$. The basic idea behind the proposed algorithm is to use the $N$ data points as inputs to the network. For every training cycle, we generate a different set of $N$ network targets randomly from a uniform distribution in [0,1], and adjust the weights to map the data points (sorted in ascending order) to these generated targets (also sorted in ascending order). Thus we are training the network to map the data to a uniform distribution.\n\nBefore describing the steps of the algorithm, we note that the resulting network has to represent a monotonically nondecreasing mapping, otherwise it will not represent a legitimate distribution function. In our simulations, we used a hint penalty to enforce monotonicity [5]. The algorithm is as follows.\n\n1. Let $x_1 \\le x_2 \\le \\cdots \\le x_N$ be the data points. Set $t = 1$, where $t$ is the training cycle number. Initialize the weights (usually randomly) to $w(1)$.\n\n2. Generate randomly from a uniform distribution in [0,1] $N$ points (and sort them): $u_1 \\le u_2 \\le \\cdots \\le u_N$. The point $u_n$ is the target output for $x_n$.\n\n3. Adjust the network weights according to the backpropagation scheme:\n\n$$w(t+1) = w(t) - \\eta(t) \\frac{\\partial \\mathcal{E}}{\\partial w} \\qquad (1)$$\n\nwhere $\\mathcal{E}$ is the objective function that includes the error term and the monotonicity hint penalty term [5]:\n\n$$\\mathcal{E} = \\sum_{n=1}^{N} [H(x_n) - u_n]^2 + \\lambda \\sum_{k=1}^{N_h} \\Theta(H(y_k) - H(y_k + \\Delta)) [H(y_k) - H(y_k + \\Delta)]^2 \\qquad (2)$$\n\nwhere we have suppressed the $w$ dependence. 
The second term is the monotonicity penalty term, $\\lambda$ is a positive weighting constant, $\\Delta$ is a small positive number, $\\Theta(x)$ is the familiar unit step function, and the $y_k$'s are any set of points where we wish to enforce the monotonicity.\n\n4. Set $t = t + 1$, and go to step 2. Repeat until the error is small enough.\n\nUpon convergence, the density estimate is the derivative of $H$.\n\nNote that, as presented, the randomly generated targets are different for every cycle, which will have a smoothing effect that will allow convergence to a truly uniform distribution. One other version, which we have implemented in our simulation studies, is to generate new targets after every fixed number $L$ of cycles, rather than every cycle. This will generally improve the speed of convergence as there is more \"continuity\" in the learning process. Also note that it is preferable to choose the activation function for the output node to be in the range of 0 to 1, to ensure that the estimate of the distribution function is in this range.\n\nSLC is only applicable to estimating univariate densities because, for the multivariate case, the nonlinear mapping $y = G(x)$ will not necessarily result in a uniformly distributed output $y$. Fortunately, many, if not the majority, of problems encountered in practice are univariate. This is because multivariate problems, with even a modest number of dimensions, need a huge amount of data to obtain statistically accurate results. The next method is applicable to the multivariate case as well.\n\n2.2 SIC (Smooth Interpolation of the Cumulative)\n\nAgain, we have a multilayer network, to which we input the point $x$, and the network outputs the estimate of the distribution function. Let $g(x)$ be the true density function, and let $G(x)$ be the corresponding distribution function. Let $x = (x^1, \\ldots, x^d)^T$. 
The distribution function is given by\n\n$$G(x) = \\int_{-\\infty}^{x^1} \\cdots \\int_{-\\infty}^{x^d} g(x)\\,dx^1 \\cdots dx^d. \\qquad (3)$$\n\nA straightforward estimate of $G(x)$ could be the fraction of data points falling in the area of integration:\n\n$$\\hat{G}(x) = \\frac{1}{N} \\sum_{n=1}^{N} \\Theta(x - x_n), \\qquad (4)$$\n\nwhere $\\Theta$ is defined as\n\n$$\\Theta(x) = \\begin{cases} 1 & \\text{if } x^i \\ge 0 \\text{ for all } i = 1, \\ldots, d, \\\\ 0 & \\text{otherwise.} \\end{cases}$$\n\nThe method we propose uses such an estimate for the target outputs of the neural network. The estimate given by (4) is discontinuous. The neural network method developed here provides a smooth, and hence more realistic, estimate of the distribution function. The density can be obtained by differentiating the output of the network with respect to its inputs.\n\nFor the low-dimensional case, we can uniformly sample (4) using a grid to obtain the examples for the network. Beyond two or three dimensions, this becomes computationally intensive. Alternatively, one could sample the input space randomly (using, say, a uniform distribution over the approximate range of the $x_n$'s), and for every point determine the network target according to (4). Another option is to use the data points themselves as examples. The target for a point $x_m$ would then be\n\n$$\\hat{G}(x_m) = \\frac{1}{N-1} \\sum_{n=1, n \\ne m}^{N} \\Theta(x_m - x_n). \\qquad (5)$$\n\nWe also use monotonicity as a hint to guide the training. Once training is performed, and $H(x, w)$ approximates $G(x)$, the density estimate can be obtained as\n\n$$\\hat{g}(x) = \\frac{\\partial^d H(x, w)}{\\partial x^1 \\cdots \\partial x^d}. \\qquad (6)$$\n\n3 Simulation Results\n\n[Figure 1 plots, panels (a) and (b); the legend includes the true density and the optimal Parzen window.]\n\nFigure 1: Comparison of optimal Parzen windows with neural network estimators. Plotted are the true density and the estimates (SLC, SIC, Parzen window with optimal kernel width [6, pg 40]). 
Notice that even the optimal Parzen window is bumpy compared to the neural network.\n\nWe tested our techniques for density estimation on data drawn from a mixture of two Gaussians:\n\n(7)\n\nData points were randomly generated and the density estimates using SLC or SIC (for 100 and 200 data points) were compared to the Parzen technique. Learning was performed with a standard one-hidden-layer neural network with 3 hidden units. The hidden unit activation function used was tanh and the output unit was an erf function. A set of typical density estimates is shown in figure 1.\n\n4 Convergence of the Density Estimation Techniques\n\n[Figure 2 plot: density estimation error versus $N$ on a log-log scale.]\n\nFigure 2: Convergence of the density estimation error for SIC. A five hidden unit two layer neural network was used to perform the mapping $x_i \\to i/(N+1)$, trained according to SIC. For various $N$, the resulting density estimation error was computed for over 100 runs. Plotted are the results on a log-log scale. For comparison, also shown is the best $1/N$ fit.\n\nUsing techniques from stochastic approximation theory, it can be shown that SLC converges to a similar solution to SIC [3], so we focus our attention on the convergence of SIC. Figure 2 shows an empirical study of the convergence behavior. The optimal linear fit between $\\log(E)$ and $\\log(N)$ has a slope of -0.97. This indicates that the convergence rate is about $1/N$. The theoretically derived convergence rate is $\\log\\log(N)/N$, as we will shortly discuss.\n\nTo analyze SIC, we introduce so-called approximate generalized distribution functions. We will assume that the true distribution function has bounded derivatives. Therefore the cumulative will be \"approximately\" implementable by generalized distributions with bounded derivatives (in the asymptotic limit, with probability 1). 
We will then obtain the convergence to the true density.\n\nLet $\\mathcal{G}$ be the space of distribution functions on the real line that possess continuous densities, i.e., $X \\in \\mathcal{G}$ if $X : \\mathbb{R} \\to [0,1]$; $X'(t)$ exists everywhere, is continuous and $X'(t) \\ge 0$; $X(-\\infty) = 0$ and $X(\\infty) = 1$. This is the class of functions that we will be interested in. We define a metric with respect to $\\mathcal{G}$ as follows\n\n$$\\|f\\|_X^2 = \\int_{-\\infty}^{\\infty} f(t)^2 X'(t)\\,dt \\qquad (8)$$\n\n$\\|f\\|_X^2$ is the expectation of the squared value of $f$ with respect to the distribution $X \\in \\mathcal{G}$. Let us name this the $L_2$ $X$-norm of $f$. Let the data set ($D$) be $\\{x_1 \\le x_2 \\le \\cdots \\le x_N\\}$, and corresponding to each $x_i$, let the target be $y_i = i/(N+1)$. We will assume that the true distribution function has bounded derivatives up to order $K$. We define the set of approximate sample distribution functions $\\mathcal{H}_D^{\\nu}$ as follows\n\nDefinition 4.1 Fix $\\nu > 0$. A $\\nu$-approximate sample distribution function, $H$, satisfies the following two conditions\n\nWe will denote the set of all $\\nu$-approximate sample distribution functions for a data set, $D$, and a given $\\nu$ by $\\mathcal{H}_D^{\\nu}$.\n\nLet $A_i = \\sup_x |G^{(i)}|$, $i = 1 \\ldots K$, where we use the notation $f^{(i)}$ to denote the $i$th derivative. Define $B_i^{\\nu}(D)$ by\n\n$$B_i^{\\nu}(D) = \\inf_{Q \\in \\mathcal{H}_D^{\\nu}} \\sup_x |Q^{(i)}| \\qquad (9)$$\n\nfor fixed $\\nu > 0$. Note that by definition, for all $\\epsilon > 0$, $\\exists H \\in \\mathcal{H}_D^{\\nu}$ such that $\\sup_x |H^{(i)}(x)| \\le B_i^{\\nu} + \\epsilon$. $B_i^{\\nu}(D)$ is the lowest possible bound on the $i$th derivative for the $\\nu$-approximate sample distribution functions given a particular data set. In a sense, the \"smoothest\" approximating sample distribution function with respect to the $i$th derivative has an $i$th derivative bounded by $B_i^{\\nu}(D)$. One expects that $B_i^{\\nu} \\le A_i$, at least in the limit $N \\to \\infty$. 
\n\nIn the next theorem, we present the main theoretical result of the paper, namely a bound on the estimation error for the density estimator obtained by using the approximate sample distribution functions. It is embedded in a large amount of technical machinery, but its essential content is that if the true distribution function has bounded derivatives to order $K$, then, picking the approximate distribution function obeying certain bounds, we obtain a convergence rate for the estimation error of $O((\\log\\log(N)/N)^{1-1/K})$.\n\nTheorem 4.2 ($L_2$ convergence to the true density) Let $N$ data points, $x_i$, be drawn i.i.d. from the distribution $G \\in \\mathcal{G}$. Let $\\sup_x |G^{(i)}| = A_i$ for $i = 0 \\ldots K$, where $K \\ge 2$. Fix $\\nu > 2$ and $\\epsilon > 0$. Let $B_K^{\\nu}(D) = \\inf_{Q \\in \\mathcal{H}_D^{\\nu}} \\sup_x |Q^{(K)}|$. Let $H \\in \\mathcal{H}_D^{\\nu}$ be a $\\nu$-approximate distribution function with $B_K = \\sup_x |H^{(K)}| \\le B_K^{\\nu} + \\epsilon$ (by the definition of $B_K^{\\nu}$, such a $\\nu$-approximate sample distribution function must exist). Then, for any $F \\in \\mathcal{G}$, the inequality\n\n$$\\|H' - G'\\|_F^2 \\le 2^{2(K-1)} (2A_K + \\epsilon)^{2/K} \\mathcal{F}(N) \\qquad (10)$$\n\nwhere\n\n$$\\mathcal{F}(N) = \\left[ (1+\\nu) \\left( \\frac{\\log\\log(N)}{2N} \\right)^{1/2} + \\frac{1}{N+1} \\right]^{2(1-1/K)} \\qquad (11)$$\n\nholds with probability 1, as $N \\to \\infty$.\n\nWe present the proof elsewhere [3].\n\nNote 1: The theorem applies uniformly to any interpolator $H \\in \\mathcal{H}_D^{\\nu}$. In particular, a large enough neural network will be one such monotonic interpolator, provided that the network can be trained to small enough error. This is possible by the universal approximation results for multilayer networks [2].\n\nNote 2: This theorem holds for any $\\epsilon > 0$ and $\\nu > 1$. For smooth density functions, with bounded higher derivatives, the convergence rate approaches $O(\\log\\log(N)/N)$, which is faster convergence than the kernel density estimator (for which the optimal rate is $O(N^{-4/5})$).\n\nNote 3: No smoothing parameter needs to be determined. 
\n\nNote 4: One should try to find an approximate distribution function with the smallest possible derivatives. Specifically, of all the sample distribution functions, pick the one that \"minimizes\" $B_K$, the bound on the $K$th derivative. This could be done by introducing penalty terms, penalizing the magnitudes of the derivatives (for example Tikhonov type regularizers [7]).\n\n5 Comments\n\nWe developed two techniques for density estimation based on the idea of learning the cumulative by mapping the data points to a uniform density. The first is a stochastic technique (SLC), which is expected to inherit the characteristics of most stochastic iterative algorithms, and the second is a deterministic technique (SIC). SLC tends to be slow in practice; however, because each set of targets is drawn from the uniform distribution, it is anticipated to have a smoothing/regularizing effect, as can be seen by comparing SLC and SIC in figure 1 (a). We presented an experimental comparison of our techniques with the Parzen technique.\n\nWe presented a theoretical result that demonstrated the consistency of our techniques as well as giving a convergence rate of $O(\\log\\log(N)/N)$, which is better than the optimal Parzen technique. No smoothing parameter needs to be chosen: smoothing occurs naturally by picking the interpolator with the lowest bound for a certain derivative. For our methods, the majority of time is spent in the learning phase, but once learning is done, evaluating the density is fast.\n\n6 Acknowledgments\n\nWe would like to acknowledge Yaser Abu-Mostafa and the Caltech Learning Systems Group for their useful input.\n\nReferences\n\n[1] K. Fukunaga and L. D. Hostetler. Optimization of k-nearest neighbor density estimates. IEEE Transactions on Information Theory, 19(3):320-326, 1973.\n\n[2] K. Hornik, M. Stinchcombe, and H. White. 
Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks, 3:551-560, 1990.\n\n[3] M. Magdon-Ismail and A. Atiya. Consistent density estimation from the sample distribution function. Manuscript in preparation for submission, 1998.\n\n[4] E. Parzen. On the estimation of a probability density function and mode. Annals of Mathematical Statistics, 33:1065-1076, 1962.\n\n[5] J. Sill and Y. S. Abu-Mostafa. Monotonicity hints. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems (NIPS), volume 9, pages 634-640. Morgan Kaufmann, 1997.\n\n[6] B. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London, UK, 1993.\n\n[7] A. N. Tikhonov and V. I. Arsenin. Solutions of Ill-Posed Problems. Scripta Series in Mathematics. Distributed solely by Halsted Press, Winston; New York, 1977. Translation Editor: Fritz, John.", "award": [], "sourceid": 1624, "authors": [{"given_name": "Malik", "family_name": "Magdon-Ismail", "institution": null}, {"given_name": "Amir", "family_name": "Atiya", "institution": null}]}