{"title": "Learning Informative Statistics: A Nonparametnic Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 900, "page_last": 906, "abstract": null, "full_text": "Learning Informative Statistics: A \n\nNonparametric Approach \n\nJohn W. Fisher III, Alexander T. IhIer, and Paul A. Viola \n\nMassachusetts Institute of Technology \n\n77 Massachusetts Ave., 35-421 \n\nCambridge, MA 02139 \n\n{jisher,ihler,viola}@ai.mit.edu \n\nAbstract \n\nWe discuss an information theoretic approach for categorizing and mod(cid:173)\neling dynamic processes. The approach can learn a compact and informa(cid:173)\ntive statistic which summarizes past states to predict future observations. \nFurthermore, the uncertainty of the prediction is characterized nonpara(cid:173)\nmetrically by a joint density over the learned statistic and present obser(cid:173)\nvation. We discuss the application of the technique to both noise driven \ndynamical systems and random processes sampled from a density which \nis conditioned on the past. In the first case we show results in which both \nthe dynamics of random walk and the statistics of the driving noise are \ncaptured. In the second case we present results in which a summarizing \nstatistic is learned on noisy random telegraph waves with differing de(cid:173)\npendencies on past states. In both cases the algorithm yields a principled \napproach for discriminating processes with differing dynamics and/or de(cid:173)\npendencies. The method is grounded in ideas from information theory \nand nonparametric statistics. \n\n1 Introduction \n\nNoisy dynamical processes abound in the world - human speech, the frequency of sun \nspots, and the stock market are common examples. These processes can be difficult to \nmodel and categorize because current observations are dependent on the past in complex \nways. Classical models come in two sorts: those that assume that the dynamics are linear \nand the noise is Gaussian (e.g. 
Wiener etc.); and those that assume that the dynamics are discrete (e.g. HMMs). These approaches are wildly popular because they are tractable and well understood. Unfortunately there are many processes where the underlying theoretical assumptions of these models are false. For example we may wish to analyze a system with linear dynamics and non-Gaussian noise, or we may wish to model a system with an unknown number of discrete states. \n\nWe present an information-theoretic approach for analyzing stochastic dynamic processes which can model simple processes like those mentioned above, while retaining the flexibility to model a wider range of more complex processes. The key insight is that we can often learn a simplifying informative statistic of the past from samples using nonparametric estimates of both entropy and mutual information. Within this framework we can predict future states and, of equal importance, characterize the uncertainty accompanying those predictions. This nonparametric model is flexible enough to describe uncertainty which is more complex than second-order statistics. In contrast, techniques which use squared prediction error to drive learning are focused on the mode of the distribution. \n\nTaking an example from financial forecasting: while the most likely sequence of pricing events is of interest, one would also like to know the accompanying distribution of price values (i.e. even if the most likely outcome is appreciation in the price of an asset, knowledge of a lower, but not insignificant, probability of depreciation is also valuable). Towards that end we describe an approach that allows us to simultaneously learn the dependencies of the process on the past as well as the uncertainty of future states. Our approach is novel in that we fold in concepts from information theory, nonparametric statistics, and learning. 
\n\nIn the two types of stochastic processes we will consider, the challenge is to summarize the past in an efficient way. In the absence of a known dynamical or probabilistic model, can we learn an informative statistic (ideally a sufficient statistic) of the past which minimizes our uncertainty about future states? In the classical linear state-space approach, uncertainty is characterized by mean squared error (MSE), which implicitly assumes Gaussian statistics. There are, however, linear systems with interesting behavior due to non-Gaussian statistics which violate the assumption underlying MSE. There are also nonlinear systems and purely probabilistic processes which exhibit complex behavior and are poorly characterized by mean squared error and/or the assumption of Gaussian noise. \n\nOur approach is applicable to both types of processes. Because it is based on nonparametric statistics we characterize the uncertainty of predictions in a very general way: by a density of possible future states. Consequently the resulting system captures both the dynamics of the system (through a parameterization) and the statistics of the driving noise (through a nonparametric model). The model can then be used to classify new signals and make predictions about the future. \n\n2 Learning from Stationary Processes \n\nIn this paper we will consider two related types of stochastic processes, depicted in figure 1. These processes differ in how current observations are related to the past. The first type of process, described by the following equation, is a discrete time dynamical (possibly nonlinear) system: \n\nx_k = G({x_{k-1}}_N; w_g) + η_k    (1) \n\nwhere x_k, the state of the process at time k, is a function of the N previous states and the present value of η_k. In general the sequence {x_k} is not stationary (in the strict sense); however, under fairly mild conditions on {η_k}, namely that {η_k} is a sequence of i.i.d. 
\nrandom variables (which we will always assume to be true), the sequence \n\nε_k = x_k − G({x_{k-1}}_N; w_g)    (2) \n\nis stationary, where {x_k}_N denotes {x_k, ..., x_{k−(N−1)}}. Often termed an innovation sequence, for our purpose the stationarity of (2) will suffice. This leads to a prediction framework for estimating the dynamical parameters, w_g, of the system, to which we will adjoin a nonparametric characterization of uncertainty. \n\nThe second type of process we consider is described by a conditional probability density: \n\nx_k ∼ p(x_k | {x_{k-1}}_N).    (3) \n\nIn this case it is only the conditional statistics of {x_k} that we are concerned with, and they are, by definition, constant. \n\nFigure 1: Two related systems: (a) dynamical system driven by stationary noise and (b) probabilistic system dependent on the finite past. Dotted box indicates the source of the stochastic process, while the solid box indicates the learning algorithm. \n\n3 Learning Informative Statistics with Nonparametric Estimators \n\nWe propose to determine the system parameters by minimizing the entropy of the error residuals for systems of type (a). Parametric entropy optimization approaches have been proposed (e.g. [4]); the novelty of our approach, however, is that we estimate entropy nonparametrically. That is, \n\nŵ_g = arg min_{w_g} Ĥ(x_k − G({x_{k-1}}_N; w_g)) = arg min_{w_g} Ĥ(ε_k),    (4) \n\nwhere the differential entropy integral is approximated using a function of the Parzen kernel density estimator [5] (in all experiments we use the Gaussian kernel). It can be shown that minimizing the entropy of the error residuals is equivalent to maximizing their likelihood [1]. In this light, the proposed criterion is seeking the maximum likelihood estimate of the system parameters using a nonparametric description of the noise density. 
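As an illustration of the kind of estimator involved — a minimal sketch in Python, not the authors' implementation; the kernel width h, sample size, and test distribution are assumptions made here — a leave-one-out Parzen-window estimate of differential entropy can be written as:

```python
import numpy as np

def parzen_entropy(samples, h=0.25):
    """Leave-one-out Parzen (Gaussian-kernel) estimate of differential entropy.

    H(p) = -E[log p(x)] is approximated by -mean(log p_hat(x_i)), where
    p_hat(x_i) is the kernel density estimate built from the other samples.
    """
    x = np.asarray(samples, dtype=float)
    n = len(x)
    diff = x[:, None] - x[None, :]                       # pairwise differences
    k = np.exp(-0.5 * (diff / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    np.fill_diagonal(k, 0.0)                             # exclude the self term
    p_hat = k.sum(axis=1) / (n - 1)                      # LOO density at each x_i
    return -np.mean(np.log(p_hat))

# For unit-variance Gaussian samples the true differential entropy is
# 0.5 * log(2 * pi * e) ~ 1.42 nats; the estimate should land nearby.
rng = np.random.default_rng(0)
h_est = parzen_entropy(rng.standard_normal(2000))
```

Minimizing such an estimate of the residual entropy with respect to the system parameters (e.g. by gradient descent) is the spirit of the criterion above; how to choose the bandwidth h is a separate, nontrivial question.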
Consequently, we solve for the system parameters and the noise density jointly. \n\nWhile there is no explicit dynamical system in the second system type, we do assume that the conditional statistics of the observed sequence are constant (or at worst slowly changing, for an on-line learning algorithm). In this case we desire to minimize the uncertainty of predictions of future samples by summarizing information from the past. The challenge is to do so efficiently via a function of recent samples. Ideally we would like to find a sufficient statistic of the past; however, without an explicit description of the density we opt instead for an informative statistic. By informative statistic we simply mean one which reduces the conditional entropy of future samples. If the statistic were sufficient then the mutual information would have reached a maximum [1]. As in the previous case, we propose to find such a statistic by maximizing the nonparametric mutual information as defined by \n\nŵ_f = arg max_{w_f} Î(x_k, F({x_{k-1}}_N; w_f))    (5) \n    = arg max_{w_f} Ĥ(x_k) + Ĥ(F({x_{k-1}}_N; w_f)) − Ĥ(x_k, F({x_{k-1}}_N; w_f))    (6) \n    = arg min_{w_f} Ĥ(x_k | F({x_{k-1}}_N; w_f)).    (7) \n\nBy equation 6 this is equivalent to optimizing the joint and marginal entropies (which we do in practice) or, by equation 7, minimizing the conditional entropy. \n\nWe have previously presented two related methods for incorporating kernel based density estimators into an information theoretic learning framework [2, 3]. We chose the method of [3] because it provides an exact gradient of an approximation to entropy, but more importantly can be converted into an implicit error function, thereby reducing computational cost. \n\n4 Distinguishing Random Walks: An Example \n\nIn random walk the feedback function is G({x_{k-1}}_1) = x_{k-1}. The noise is assumed to be independent and identically distributed (i.i.d.). 
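The two random-walk processes used in this section are easy to simulate. The following sketch (illustrative Python; the seeds and sample counts are choices made here, not taken from the paper) draws both the Gaussian-driven and bimodal-driven walks:

```python
import numpy as np

def random_walk(n, noise, rng):
    """x_k = x_{k-1} + eta_k, starting from zero, with i.i.d. increments."""
    return np.cumsum(noise(rng, n))

def gaussian_noise(rng, n):
    return rng.standard_normal(n)                  # eta_k ~ N(0, 1)

def bimodal_noise(rng, n):
    # (1/2) N(+0.95, 0.3^2) + (1/2) N(-0.95, 0.3^2): zero mean, ~unit variance
    signs = rng.choice([-1.0, 1.0], size=n)
    return 0.95 * signs + 0.3 * rng.standard_normal(n)

rng = np.random.default_rng(1)
xg = random_walk(100, gaussian_noise, rng)         # Gaussian-driven walk
xb = random_walk(100, bimodal_noise, rng)          # bimodal-driven walk

# Both increment distributions agree to second order, which is why an
# MMSE/Gaussian model cannot tell the two walks apart.
var_b = bimodal_noise(rng, 50_000).var()
```

Because the increments of both walks are zero-mean with (near) unit variance, any discrimination must come from higher-order structure of the noise density — exactly what the nonparametric estimate captures.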
Although the sequence x_k is non-stationary, the increments (x_k − x_{k-1}) are stationary. In this context, estimating the statistics of the residuals allows for discrimination between two random walk processes with differing noise densities. Furthermore, as we will demonstrate empirically, even when one of the processes is driven by Gaussian noise (an implicit assumption of the MMSE criterion), such knowledge may not be sufficient to distinguish one process from another. \n\nFigure 2 shows two random walk realizations and their associated noise densities (solid lines). One is driven by Gaussian noise (η_k ∼ N(0, 1)), while the other is driven by a bi-modal mixture of Gaussians (η_k ∼ ½N(0.95, 0.3) + ½N(−0.95, 0.3); note that both densities are zero-mean and unit variance). During learning, the process was modeled as fifth-order auto-regressive (AR5). One hundred samples were drawn from a realization of each type and the AR parameters were estimated using the standard MMSE approach and the approach described above. With regard to parameter estimation, both methods (as expected) yield essentially the same parameters, with the first coefficient being near unity and the remaining coefficients being near zero. \n\nWe are interested in the ability to distinguish one process from another. As mentioned, the current approach jointly estimates the parameters of the system as well as the density of the noise. The nonparametric estimates are shown in figure 2 (dotted lines). These estimates are then used to compute the accumulated average log-likelihood L(ε_k) = (1/k) Σ_{i=1..k} log p(ε_i) of the residual sequence (ε_k ≈ η_k) under the known and learned densities (figure 3). It is striking (but not surprising) that L(ε_k) of the bi-modal mixture under the Gaussian model (dashed lines, top) does not differ significantly from the Gaussian driven increments process (solid lines, top). 
The explanation follows from the fact that \n\nE[L(ε)] = −H(p_f(ε)) − D(p_f(ε) || p(ε)),    (8) \n\nwhere p_f(ε) is the true density of ε (bi-modal), p(ε) is the assumed density of the likelihood test (unit-variance Gaussian), and D(·||·) is the Kullback-Leibler divergence [1]. In this case, D(p_f(ε) || p(ε)) is relatively small (not true for D(p(ε) || p_f(ε))) and H(p_f(ε)) is less than the entropy of the unit-variance Gaussian (for fixed variance, the Gaussian density has maximum entropy). The consequence is that the likelihood test under the Gaussian assumption does not reliably distinguish the two processes. The likelihood test under the bi-modal density or its nonparametric estimate (figure 3, bottom) does distinguish the two. \n\nThe method described is not limited to linear dynamic models. It can certainly be used for nonlinear models, so long as the dynamics can be well approximated by differentiable functions. Examples for multi-layer perceptrons are described in [3]. \n\n5 Learning the Structure of a Noisy Random Telegraph Wave \n\nA noisy random telegraph wave (RTW) can be described by figure 1(b). Our goal is not to demonstrate that we can analyze random telegraph waves, but rather that we can robustly learn an informative statistic of the past for such a process. We define a noisy random telegraph wave as a sequence x_k ∼ N(μ_k, σ), where μ_k ∈ {±μ} is binomially distributed with \n\nP{μ_k = −μ_{k-1}} = a · (1/N) Σ_{i=1..N} |x_{k−i}|,    (9) \n\nwhere N(μ_k, σ) is Gaussian and a < 1. This process is interesting because the parameters are random functions of a nonlinear combination of the set {x_k}_N. Depending on the value of N, we observe different switching dynamics. Figure 4 shows examples of such signals for 
\nFigure 2: Random walk examples (left), comparison of known to learned densities (right). \n\nFigure 3: L(ε_k) under known models (left) as compared to learned models (right). \n\nN = 20 (left) and N = 4 (right). Rapid switching dynamics are possible for both signals, while N = 20 has periods with longer duration than N = 4. \n\nFigure 4: Noisy random telegraph wave: N = 20 (left), N = 4 (right). \n\nIn our experiments we learn an informative statistic which has the form \n\nF({x_k}_past) = g(Σ_{i=1..M} w_{f,i} x_{k−i}),    (10) \n\nwhere g(·) is the hyperbolic tangent function (i.e. F(·) is a one-layer perceptron). Note that a multi-layer perceptron could also be used [3]. \n\nIn our experiments we train on 100 samples of noisy RTW(N=20) and RTW(N=4). We then learn statistics for each type of process using M = {4, 5, 15, 20, 25}. This tests for situations in which the depth is both under-specified and over-specified (as well as perfectly specified). We will denote F_N({x_k}_M) as the statistic which was trained on an RTW(N) process with a memory depth of M. \n\nFigure 5: Comparison of Wiener filter (top) and nonparametric approach (bottom) for synthesis. \n\nFigure 6: Informative statistics for noisy random telegraph waves. M = 25, trained on N equal 4 (left) and 20 (right). \n\nSince we implicitly learn a joint density over (x_k, F_N({x_k}_M)), synthesis is possible by sampling from that density. Figure 5 compares synthesis using the described method (bottom) to a Wiener filter (top) estimated over the same data. The results using the information theoretic approach (bottom) preserve the structure of the RTW while the Wiener filter results do not. 
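The one-layer-perceptron statistic used here is just a weighted sum of the last M samples squashed by a hyperbolic tangent. A minimal sketch (illustrative Python; the weights below are random stand-ins, whereas the paper learns them by maximizing the nonparametric mutual information):

```python
import numpy as np

def informative_statistic(x_past, w):
    """One-layer perceptron statistic: F({x}past) = tanh(sum_i w_i * x_{k-i}).

    Collapses the last M observations into a single bounded scalar summary.
    """
    return np.tanh(np.dot(w, x_past))

rng = np.random.default_rng(4)
M = 25                                  # memory depth of the statistic
w = 0.1 * rng.standard_normal(M)        # illustrative weights (learned in the paper)
x_past = rng.standard_normal(M)         # the last M observations, most recent first
f = informative_statistic(x_past, w)    # scalar in (-1, 1) summarizing the past
```

Because the summary is one-dimensional, the joint density over (x_k, F) remains a two-dimensional estimation problem regardless of M.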
This was achieved by collapsing the information of past samples into a single statistic (avoiding high-dimensional density estimation). Figure 6 shows the joint density over (x_k, F_N({x_k}_M)) for N = {4, 20} and M = 25. We see that the estimated densities are not separable, and by virtue of this fact the learned statistic conveys information about the future. Figure 7 shows results from 100 Monte Carlo trials. In this case the depth of the statistic is matched to the process. Each plot shows the accumulated conditional log-likelihood L(x_k) = (1/k) Σ_{i=1..k} log p(x_i | F_N({x_{i-1}}_M)) under the learned statistic, with error bars. Figure 8 shows similar results after varying the memory depth M = {4, 5, 15, 20, 25} of the statistic. The figures illustrate robustness to the choice of memory depth M. This is not to say that memory depth doesn't matter; that is, there must be some information to exploit, but the empirical results indicate that useful information was extracted. \n\n6 Conclusions \n\nWe have described a nonparametric approach for finding informative statistics. The approach is novel in that learning is derived from nonparametric estimators of entropy and mutual information. This allows for a means by which to 1) efficiently summarize the past, 2) predict the future, and 3) characterize the uncertainty of those predictions beyond second-order statistics. Furthermore, this was accomplished without the strong assumptions accompanying parametric approaches. \n\nFigure 7: Conditional L(x_k). Solid line indicates RTW(N=20) while dashed line indicates RTW(N=4). Thick lines indicate the average over all Monte Carlo runs while the thin lines indicate ±1 standard deviation. The left plot uses a statistic trained on RTW(N=20) while the right plot uses a statistic trained on RTW(N=4). 
\nFigure 8: Repeat of figure 7 for cases with M = {4, 5, 15, 20, 25}. Obvious breaks indicate a new set of trials. \n\nWe also presented empirical results which illustrated the utility of our approach. The example of random walk served as a simple illustration of learning a dynamic system in spite of the over-specification of the AR model. More importantly, we demonstrated the ability to learn both the dynamics and the statistics of the underlying noise process. This information was later used to distinguish realizations by their nonparametric densities, something not possible using MMSE error prediction. \n\nEven more compelling were the experiments with noisy random telegraph waves. We demonstrated the algorithm's ability to learn a compact statistic which efficiently summarized the past for process identification. The method exhibited robustness to the number of parameters of the learned statistic. For example, despite over-specifying the dependence of the memory-4 statistic in three of the cases, a useful statistic was still found. Conversely, despite the memory-20 statistic being under-specified in three of the experiments, useful information from the available past was extracted. \n\nIt is our opinion that this method provides an alternative to some of the traditional and connectionist approaches to time-series analysis. The use of nonparametric estimators adds flexibility to the class of densities which can be modeled and places less of a constraint on the exact form of the summarizing statistic. \n\nReferences \n\n[1] T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991. \n\n[2] P. Viola et al. Empirical entropy manipulation for real world problems. In Mozer, Touretzky, and Hasselmo, editors, Advances in Neural Information Processing Systems, pages ?-?, 1996. \n\n[3] J. W. Fisher and J. C. Principe. 
A methodology for information theoretic feature extraction. In A. Stuberud, editor, Proc. of the IEEE Int. Joint Conf. on Neural Networks, pages ?-?, 1998. \n\n[4] J. Kapur and H. Kesavan. Entropy Optimization Principles with Applications. Academic Press, New York, 1992. \n\n[5] E. Parzen. On estimation of a probability density function and mode. Ann. of Math. Stats., 33:1065-1076, 1962. \n", "award": [], "sourceid": 1765, "authors": [{"given_name": "John", "family_name": "Fisher III", "institution": null}, {"given_name": "Alexander", "family_name": "Ihler", "institution": null}, {"given_name": "Paul", "family_name": "Viola", "institution": null}]}