{"title": "Reconstruction of Sequential Data with Probabilistic Models and Continuity Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 414, "page_last": 420, "abstract": null, "full_text": "Reconstruction of Sequential Data with Probabilistic Models and Continuity Constraints \n\nMiguel A. Carreira-Perpiñán \n\nDept. of Computer Science, University of Sheffield, UK \n\nmiguel@dcs.shef.ac.uk \n\nAbstract \n\nWe consider the problem of reconstructing a temporal discrete sequence of multidimensional real vectors when part of the data is missing, under the assumption that the sequence was generated by a continuous process. A particular case of this problem is multivariate regression, which is very difficult when the underlying mapping is one-to-many. We propose an algorithm based on a joint probability model of the variables of interest, implemented using a nonlinear latent variable model. Each point in the sequence is potentially reconstructed as any of the modes of the conditional distribution of the missing variables given the present variables (computed using an exhaustive mode search in a Gaussian mixture). Mode selection is determined by a dynamic programming search that minimises a geometric measure of the reconstructed sequence, derived from continuity constraints. We illustrate the algorithm with a toy example and apply it to a real-world inverse problem, the acoustic-to-articulatory mapping. The results show that the algorithm outperforms conditional mean imputation and multilayer perceptrons. \n\n1 Definition of the problem \n\nConsider a mobile point following a continuous trajectory in a subset of R^D. Imagine that it is possible to obtain a finite number of measurements of the position of the point. Suppose that these measurements are corrupted by noise and that sometimes part of, or all of, the variables are missing. 
The problem considered here is to reconstruct the sequence from the part of it which is observed. In the particular case where the present variables and the missing ones are the same for every point, the problem is one of multivariate regression. If the pattern of missing variables is more general, the problem is one of missing data reconstruction. \n\nConsider the problem of regression. If the present variables uniquely identify the missing ones at every point of the data set, the problem can be adequately solved by a universal function approximator, such as a multilayer perceptron. In a probabilistic framework, the conditional mean of the missing variables given the present ones will minimise the average squared reconstruction error [3]. However, if the underlying mapping is one-to-many, there will be regions in the space for which the present variables do not uniquely identify the missing ones. In this case, the conditional mean mapping will fail, since it will give a compromise value: an average of the correct ones. Inverse problems, where the inverse of a mapping is one-to-many, are of this type. They include the acoustic-to-articulatory mapping in speech [15], where different vocal tract shapes may produce the same acoustic signal, or the robot arm problem [2], where different configurations of the joint angles may place the hand in the same position. \n\nIn some situations, data reconstruction is a means to some other objective, such as classification or inference. Here, we deal solely with data reconstruction of temporally continuous sequences according to the squared error. Our algorithm does not apply to data sets that either lack continuity (e.g. discrete variables) or have lost it (e.g. due to undersampling or shuffling). 
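To make the failure of the conditional mean concrete, here is a minimal numeric sketch (not from the paper; the data and names are ours) of a one-to-many inverse problem where the conditional mean averages the two branches:

```python
import numpy as np

# One-to-many inverse: the present variable p = x**2 has two preimages
# x = +sqrt(p) and x = -sqrt(p).
rng = np.random.default_rng(0)
x = np.concatenate([rng.uniform(0.5, 1.5, 500), rng.uniform(-1.5, -0.5, 500)])
p = x ** 2  # present variable; x is the missing one

# Conditional mean E[x | p in a narrow bin] averages the two branches.
sel = (p > 0.9) & (p < 1.1)
cond_mean = x[sel].mean()   # near 0, a compromise far from either preimage (about +-1)
```

Either mode of the (bimodal) conditional distribution would be a plausible reconstruction here; their average is not.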
\n\nWe follow a statistical learning approach: we attempt to reconstruct the sequence by learning the mapping from a training set drawn from the probability distribution of the data, rather than by solving a physical model of the system. Our algorithm can be described briefly as follows. First, a joint density model of the data is learned in an unsupervised way from a sample of the data^1. Then, pointwise reconstruction is achieved by computing all the modes of the conditional distribution of the missing variables given the present ones at the current point. In principle, any of these modes is potentially a plausible reconstruction. When reconstructing a sequence, we repeat this mode search for every point in the sequence, and then find the combination of modes that minimises a geometric sequence measure, using dynamic programming. The sequence measure is derived from local continuity constraints, e.g. the curve length. \n\nThe algorithm is detailed in §2 to §4. We illustrate it with a 2D toy problem in §5 and apply it to an acoustic-to-articulatory-like problem in §6. §7 discusses the results and compares the approach with previous work. \n\nOur notation is as follows. We represent the observed variables in vector form as t = (t_1, ..., t_D) in R^D. A data set (possibly a temporal sequence) is represented as {t_n}_{n=1}^N. Groups of variables are represented by sets of indices I, J ⊆ {1, ..., D}, so that if I = {1, 7, 3}, then t_I = (t_1, t_7, t_3). \n\n2 Joint generative modelling using latent variables \n\nOur starting point is a joint probability model of the observed variables p(t). From it, we can compute conditional distributions of the form p(t_J | t_I) and, by picking representative points, derive a (multivalued) mapping t_I -> t_J. Thus, contrary to other approaches, e.g. [6], we adopt multiple pointwise imputation. 
In §4 we show how to obtain a single reconstructed sequence of points. \n\nAlthough density estimation requires more parameters than mapping approximation, it has a fundamental advantage [6]: the density model represents the relation between any variables, which allows us to choose any missing/present variable combination. A mapping approximator treats some variables asymmetrically as inputs (present) and the rest as outputs (missing), and cannot easily deal with other relations. \n\nThe existence of functional relationships (even one-to-many) between the observed variables indicates that the data must span a low-dimensional manifold in the data space. This suggests the use of latent variable models for modelling the joint density. However, it is possible to use other kinds of density models. \n\nIn latent variable modelling the assumption is that the observed high-dimensional data t is generated from an underlying low-dimensional process defined by a small number L of latent variables x = (x_1, ..., x_L) [1]. The latent variables are mapped by a fixed transformation into a D-dimensional data space and noise is added there. A particular model is specified by three parametric elements: a prior distribution in latent space p(x), a smooth mapping f from latent space to data space and a noise model in data space p(t|x). Marginalising the joint probability density function p(t, x) over the latent space gives the distribution in data space, p(t). Given an observed sample in data space {t_n}_{n=1}^N, a parameter estimate can be found by maximising the log-likelihood, typically using an EM algorithm. \n\n^1 In our examples we only use complete training data (i.e., with no missing data), but it is perfectly possible to estimate a probability model with incomplete training data by using an EM algorithm [6]. \n\n
We consider the following latent variable models, both of which allow easy computation of conditional distributions of the form p(t_J | t_I): \n\nFactor analysis [1], in which the mapping is linear, the prior in latent space is unit Gaussian and the noise model is diagonal Gaussian. The density in data space is then Gaussian with a constrained covariance matrix. We use it as a baseline for comparison with more sophisticated models. \n\nThe generative topographic mapping (GTM) [4] is a nonlinear latent variable model, where the mapping is a generalised linear model, the prior in latent space is discrete uniform and the noise model is isotropic Gaussian. The density in data space is then a constrained mixture of isotropic Gaussians. \n\nIn latent variable models that sample the latent space prior distribution (like GTM), the mixture centroids in data space (associated with the latent space samples) are not trainable parameters. We can then improve the density model at a higher computational cost with no generalisation loss by increasing the number of mixture components. Note that the number of components required will depend exponentially on the intrinsic dimensionality of the data (ideally coincident with that of the latent space, L) and not on the observed one, D. \n\n3 Exhaustive mode finding \n\nGiven a conditional distribution p(t_J | t_I), we consider all its modes as plausible predictions for t_J. This requires an exhaustive mode search in the space of t_J. For Gaussian mixtures, we do this by using a maximisation algorithm starting from each centroid^2, such as a fixed-point iteration or gradient ascent combined with quadratic optimisation [5]. In the particular case where all variables are missing, rather than performing a mode search, we return as predictions all the component centroids. 
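The fixed-point mode search can be sketched for an isotropic Gaussian mixture as follows (a minimal sketch; the function name and tolerances are our choices, and the full algorithm [5] also covers gradient ascent with quadratic optimisation):

```python
import numpy as np

def gm_modes(weights, centroids, sigma, iters=200, tol=1e-8):
    # Fixed-point mode search for an isotropic Gaussian mixture: start an
    # ascent from every centroid, iterate x <- sum_m r_m(x) mu_m where r_m
    # are the posterior responsibilities, and keep the distinct fixed points.
    modes = []
    for x in centroids.copy():
        for _ in range(iters):
            d2 = ((centroids - x) ** 2).sum(axis=1)
            r = weights * np.exp(-0.5 * d2 / sigma ** 2)
            r = r / r.sum()          # posterior responsibilities at x
            x_new = r @ centroids    # fixed-point update
            if np.linalg.norm(x_new - x) < tol:
                break
            x = x_new
        if not any(np.linalg.norm(x - m) < 1e-3 for m in modes):
            modes.append(x)
    return np.array(modes)

# two well-separated bumps give two modes, close to the centroids
w = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [5.0, 5.0]])
modes = gm_modes(w, mu, sigma=0.5)
```

With broad, overlapping components the same search merges starts into fewer modes, which is the desired behaviour.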
It is also possible to obtain error bars at each mode by locally approximating the density function by a normal distribution. However, if the dimensionality of t_J is high, the error bars become very wide due to the curse of dimensionality. \n\nAn advantage of multiple pointwise imputation is the easy incorporation of extra constraints on the missing variables. Such constraints might include keeping only those modes that lie in an interval dependent on the present variables [8] or discarding low-probability (spurious) modes, which speeds up the reconstruction algorithm and may make it more robust. \n\nA faster way to generate representative points of p(t_J | t_I) is simply to draw a fixed number of samples from it, which may also give robustness to poor density models. However, in practice this resulted in a higher reconstruction error. \n\n4 Continuity constraints and dynamic programming (DP) search \n\nApplication of the exhaustive mode search to the conditional distribution at every point of the sequence produces one or more candidate reconstructions per point. To select a single reconstructed sequence, we define a local continuity constraint: consecutive points in time should also lie nearby in data space. That is, if delta is some suitable distance in R^D, delta(t_n, t_{n+1}) should be small. Then we define a global geometric measure F for a sequence {t_n}_{n=1}^N as F({t_n}_{n=1}^N) := sum_{n=1}^{N-1} delta(t_n, t_{n+1}). We take delta as the Euclidean distance, so F becomes simply the length of the sequence (considered as a polygonal line). Finding the sequence of modes with minimal F is efficiently achieved by dynamic programming. \n\n^2 Actually, given a value of t_I, most centroids have negligible posterior probability and can be removed from the mixture with practically no loss of accuracy. Thus, a large number of mixture components may be used without deteriorating excessively the computational efficiency. \n\n[Figure: reconstructed 2D trajectory in the (t_1, t_2) plane; legend: trajectory, factor analysis, mean, dpmode.] \n\nMissing pattern | Factor analysis | MLP^a | GTM mean | GTM dpmode | GTM cmode \nt_2 | 3.8902 | 0.2046 | 0.2044 | 0.2168 | 0.2168 \nt_1 | 4.3226 | 2.5126 | 2.4224 | 0.0522 | 0.0522 \nt_1 or t_2 | 4.2020 | - | 1.2963 | 0.1305 | 0.1305 \n10% | 1.0983 | - | 0.3970 | 0.0253 | 0.0251 \n50% | 6.2914 | - | 4.6530 | 0.1176 | 0.0771 \n90% | 21.4942 | - | 20.7877 | 2.2261 | 0.0643 \n\n^a The MLP cannot be applied to varying patterns of missing data. \n\nTable 1: Trajectory reconstruction for a 2D problem. The table gives the average squared reconstruction error when t_2 is missing (row 1), t_1 is missing (row 2), exactly one variable per point is missing at random (row 3) or a percentage of the values are missing at random (rows 4-6). The graph shows the reconstructed trajectory when t_1 is missing: factor analysis (straight, dotted line), mean (thick, dashed), dpmode (superimposed on the trajectory). \n\n5 Results with a toy problem \n\nTo illustrate the algorithm, we generated a 2D data set from the curve (t_1, t_2) = (x, x + 3 sin(x)) for x in [-2π, 2π], with normal isotropic noise (standard deviation 0.2) added. Thus, the mapping t_1 -> t_2 is one-to-one but the inverse one, t_2 -> t_1, is multivalued. One-dimensional factor analysis (6 parameters) and GTM models (21 parameters) were estimated from a 1000-point sample, as well as two 48-hidden-unit multilayer perceptrons (98 parameters), one for each mapping. For GTM we tried several strategies to select points from the conditional distribution: mean (the conditional mean), dpmode (the mode selected by dynamic programming) and cmode (the closest mode to the actual value of the missing variable). 
The cmode, unknown in practice, is used here to compute a lower bound on the performance of any mode-based strategy. Other strategies, such as picking the global mode, a random mode or using a local (greedy) search instead of dynamic programming, gave worse results than the dpmode. \n\nTable 1 shows the results for reconstructing a 100-point trajectory. The nonlinear nature of the problem causes factor analysis to break down in all cases. For the one-to-one mapping case (t_2 missing) all the other methods perform well and recover the original trajectory, with mean attaining the lowest error, as predicted by the theory^3. For the one-to-many case (t_1 missing, see figure), both the MLP and the mean are unable to track more than one branch of the mapping, but the dpmode still recovers the original mapping. For random missing patterns^4, the dpmode is able to cope well with high amounts of missing data. \n\n^3 A combined strategy could retain the optimality of the mean in the one-to-one case and the advantage of the modes in the one-to-many case, by choosing the conditional mean (rather than the mode) when the conditional distribution is unimodal, and all the modes otherwise. \n\nMissing pattern | Factor analysis | GTM mean | GTM dpmode | GTM cmode \nPLP | 0.9165 | 0.6217 | 0.6250 | 0.4587 \nEPG | 3.7177 | 2.3729 | 2.0613 | 1.0538 \n10% | 0.2046 | 0.0947 | 0.0903 | 0.0841 \n50% | 0.7540 | 1.1285 | 0.6527 | 0.6023 \nblocks | 0.1950 | 0.1669 | 0.1005 | 0.0925 \n\nTable 2: Average squared reconstruction error for an utterance. The last row corresponds to a missing pattern of square blocks totalling 10% of the utterance. \n\nThe consistently low error of the cmode shows that the modes contain important information about the possible options to predict the missing values. 
The performance of the dpmode, close to that of the cmode even for large amounts of missing data, shows that application of the continuity constraint allows us to recover that information. \n\n6 Results with real speech data \n\nWe report a preliminary experiment using acoustic and electropalatographic (EPG) data^5 for the utterance \"Put your hat on the hatrack and your coat in the cupboard\" (speaker FG) from the ACCOR database [10]. 12th-order perceptual linear prediction (PLP) coefficients [7] plus the log-energy were computed at 200 Hz from its acoustic waveform. The EPG data consisted of 62-bit frames sampled at 200 Hz, which we consider as 62-dimensional vectors of real numbers. No further preprocessing of the data was carried out. Thus, the resulting sequence consisted of over 600 75-dimensional real vectors. We constructed a training set by picking, in random order, 80% of these vectors. The whole utterance was used for the reconstruction test. \n\nWe trained two density models: a 9-dimensional factor analysis (825 parameters) and a two-dimensional^6 GTM (3676 parameters) with a 20 x 20 grid (resulting in a mixture of 400 isotropic Gaussians in the 75-dimensional data space). Table 2 confirms again that the linear method (factor analysis) fares worst (despite its use of a latent space of dimension L = 9). The dpmode almost always attains a lower error than the conditional mean, with up to a 40% improvement (larger for higher amounts of missing data). When a shuffled version of the utterance (thus having lost its continuity) was reconstructed, the error of the dpmode was consistently higher than that of the mean, indicating that the application of the continuity constraint was responsible for the error decrease. 
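For the isotropic Gaussian mixtures produced by GTM, the conditional distributions p(t_J | t_I) used above have a simple closed form: a mixture over the missing coordinates whose weights are the posterior responsibilities of the components given the present coordinates. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def conditional_mixture(pi, mu, sigma, present_idx, t_present):
    # For p(t) = sum_m pi_m N(t; mu_m, sigma^2 I), the conditional
    # p(t_J | t_I = t_present) is again a mixture over the missing
    # coordinates, with components reweighted by how well each centroid
    # explains the present coordinates.
    d2 = ((mu[:, present_idx] - t_present) ** 2).sum(axis=1)
    w = pi * np.exp(-0.5 * d2 / sigma ** 2)
    w = w / w.sum()
    missing_idx = [j for j in range(mu.shape[1]) if j not in present_idx]
    return w, mu[:, missing_idx]   # weights and centroids of p(t_J | t_I)

# toy 2D mixture: observing t_1 = 0 leaves two plausible values for t_2,
# so the conditional is bimodal and the mode search returns both branches
pi = np.array([1 / 3, 1 / 3, 1 / 3])
mu = np.array([[0.0, -2.0], [0.0, 2.0], [10.0, 0.0]])
w, mu_J = conditional_mixture(pi, mu, sigma=1.0,
                              present_idx=[0], t_present=np.array([0.0]))
```

Components whose weight is negligible can be dropped before the mode search, as noted in footnote 2.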
\n\n7 Discussion \n\nUsing a joint probability model allows flexible construction of predictive distributions for the missing data: varying patterns of missing data and multiple pointwise imputations are possible, as opposed to standard function approximators. We have shown that the modes of the conditional distribution of the missing variables given the present ones are potentially plausible reconstructions of the missing values, and that the application of local continuity constraints, when they hold, can help to recover the actually plausible ones. \n\n^4 Note that the nature of the missing pattern (missing at random, missing completely at random, etc. [9]) does not matter for reconstruction, although it does for estimation. \n\n^5 An EPG datum is the (binary) contact pattern between the tongue and the palate at selected locations in the latter. Note that it is an incomplete articulatory representation of speech. \n\n^6 A latent space of 2 dimensions is clearly too low for this data, but the computational complexity of GTM prevents the use of a higher one. Still, its nonlinear character compensates partly for this. \n\nPrevious work. The key aspects of our approach are the use of a joint density model (learnt in an unsupervised way), the exhaustive mode search, the definition of a geometric trajectory measure derived from continuity constraints and its implementation by dynamic programming. Several of these ideas have been applied earlier in the literature, which we review briefly. \n\nThe use of the joint density model for prediction is the basis of the statistical technique of multiple imputation [9]. Here, several versions of the complete data set are generated from the appropriate conditional distributions, analysed by standard complete-data methods and the results combined to produce inferences that incorporate missing-data uncertainty. 
\nGhahramani and Jordan [6] also proposed the use of the joint density model to generate a single estimate of the missing variables and applied it to a classification problem. \n\nConditional distributions have been approximated by MLPs rather than by density estimation [16], but this lacks flexibility to varying patterns of missing data and requires an extra model of the input variables distribution (unless assumed uniform). \n\nRohwer and van der Rest [12] introduce a cost function with a description length interpretation whose minimum is approximated by the densest mode of a distribution. A neural network trained with this cost function can learn one branch of a multivalued mapping, but is unable to select other branches which may be correct at a given time. \n\nContinuity constraints implemented via dynamic programming have been used for the acoustic-to-articulatory mapping problem [15]. Reasonable results (better than using an MLP to approximate the mapping) can be obtained using a large codebook of acoustic and articulatory vectors. Rahim et al. [11] achieve similar quality with much lower computational requirements using an assembly of MLPs, each one trained in a different area of the acoustic-articulatory space, to locally approximate the mapping. However, clustering the space is heuristic (with no guarantee that the mapping is one-to-one in each region) and training the assembly is difficult. It also lacks flexibility to varying missingness patterns. \n\nA number of trajectory measures have been used in the robot arm problem literature [2] and minimised by dynamic programming, such as the energy, torque, acceleration, jerk, etc. \n\nTemporal modelling. It is important to remark that our approach does not attempt to model the temporal evolution of the system. The joint probability model is estimated statically. 
The temporal aspect of the data appears indirectly and a posteriori through the application of the continuity constraints to select a trajectory^7. In this respect, our approach differs from that of dynamical systems or from models based on Markovian assumptions, such as hidden Markov models or other trajectory models [13, 14]. However, the fact that the duration or speed of the trajectory plays no role in the algorithm may make it invariant to time warping (e.g. robust to fast/slow speech styles). \n\n^7 However, the method may be derived by assuming a distribution over the whole sequence with a normal, Markovian dependence between adjacent frames. \n\nChoice of density model. The fact that the modes are a key aspect of our approach makes it sensitive to the density model. With finite mixtures, spurious modes can appear as ripple superimposed on the density function in regions where the mixture components are sparsely distributed and have little interaction. Such modes can lead the DP search to a wrong trajectory. Possible solutions are to improve the density model (perhaps by increasing the number of components, see §2, or by regularisation), to smooth the conditional distribution or to look for bumps (regions of high probability mass) instead of modes. \n\nComputational cost. The DP search has complexity O(NM^2), where M is an average of the number of modes per sequence point and N is the number of points in the sequence. In our experiments M is usually small and the DP search is fast even for long sequences. The bottleneck of the reconstruction part of the algorithm is obtaining the modes of the conditional distribution for every point in the sequence when there are many missing variables. 
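The O(NM^2) dynamic programming search of §4 can be sketched as a Viterbi-style recursion over the candidate modes of each frame (a minimal sketch under our naming; the paper does not prescribe a particular implementation):

```python
import numpy as np

def dp_select(candidates):
    # candidates[n] is an array of candidate reconstructions (modes) for
    # frame n; return one candidate per frame so that the total Euclidean
    # path length is minimal. Cost is O(N M^2) for M modes per frame.
    N = len(candidates)
    cost = [np.zeros(len(candidates[0]))]
    back = []
    for n in range(1, N):
        # pairwise distances between frame n-1 and frame n candidates
        d = np.linalg.norm(candidates[n - 1][:, None, :]
                           - candidates[n][None, :, :], axis=2)
        total = cost[-1][:, None] + d
        back.append(total.argmin(axis=0))   # best predecessor per candidate
        cost.append(total.min(axis=0))
    # backtrack the optimal path
    path = [int(cost[-1].argmin())]
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    path.reverse()
    return [candidates[n][path[n]] for n in range(N)]

# three frames; the middle frame has two candidate modes, only one of
# which keeps the trajectory short
frames = [np.array([[0.0, 0.0]]),
          np.array([[0.0, 1.0], [5.0, 5.0]]),
          np.array([[0.0, 2.0]])]
rec = dp_select(frames)
```

Each entry of `back` stores, for every candidate at a frame, its best predecessor at the previous frame, so backtracking recovers the shortest polygonal path.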
\n\nFurther work. We envisage more thorough experiments using data from the Wisconsin X-ray microbeam database and comparing with recurrent MLPs or an MLP committee, which may be more suitable for multivalued mappings. Extensions of our algorithm include different geometric measures (e.g. curvature-based rather than length-based), different strategies for multiple pointwise imputation (e.g. bump searching) or multidimensional constraints (e.g. temporal and spatial). Other practical applications include audiovisual mappings for speech, hippocampal place cell reconstruction and wind vector retrieval from scatterometer data. \n\nAcknowledgments \n\nWe thank Steve Renals for useful conversations and for comments about this paper. \n\nReferences \n\n[1] D. J. Bartholomew. Latent Variable Models and Factor Analysis. Charles Griffin & Company Ltd., London, 1987. \n\n[2] N. Bernstein. The Coordination and Regulation of Movements. Pergamon, Oxford, 1967. \n\n[3] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995. \n\n[4] C. M. Bishop, M. Svensen, and C. K. I. Williams. GTM: The generative topographic mapping. Neural Computation, 10(1):215-234, Jan. 1998. \n\n[5] M. A. Carreira-Perpiñán. Mode-finding in Gaussian mixtures. Technical Report CS-99-03, Dept. of Computer Science, University of Sheffield, UK, Mar. 1999. Available online at http://www.dcs.shef.ac.uk/~miguel/papers/cs-99-03.html. \n\n[6] Z. Ghahramani and M. I. Jordan. Supervised learning from incomplete data via an EM approach. In NIPS 6, pages 120-127, 1994. \n\n[7] H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Amer., 87(4):1738-1752, Apr. 1990. \n\n[8] L. Josifovski, M. Cooke, P. Green, and A. Vizinho. State based imputation of missing data for robust speech recognition and speech enhancement. In Proc. Eurospeech 99, pages 2837-2840, 1999. \n\n[9] R. J. A. Little and D. B. Rubin. 
Statistical Analysis with Missing Data. John Wiley & Sons, New York, London, Sydney, 1987. \n\n[10] A. Marchal and W. J. Hardcastle. ACCOR: Instrumentation and database for the cross-language study of coarticulation. Language and Speech, 36(2-3):137-153, 1993. \n\n[11] M. G. Rahim, C. C. Goodyear, W. B. Kleijn, J. Schroeter, and M. M. Sondhi. On the use of neural networks in articulatory speech synthesis. J. Acoust. Soc. Amer., 93(2):1109-1121, Feb. 1993. \n\n[12] R. Rohwer and J. C. van der Rest. Minimum description length, regularization, and multimodal data. Neural Computation, 8(3):595-609, Apr. 1996. \n\n[13] S. Roweis. Constrained hidden Markov models. In NIPS 12 (this volume), 2000. \n\n[14] L. K. Saul and M. G. Rahim. Markov processes on curves for automatic speech recognition. In NIPS 11, pages 751-757, 1999. \n\n[15] J. Schroeter and M. M. Sondhi. Techniques for estimating vocal-tract shapes from the speech signal. IEEE Trans. Speech and Audio Process., 2(1):133-150, Jan. 1994. \n\n[16] V. Tresp, R. Neuneier, and S. Ahmad. Efficient methods for dealing with missing data in supervised learning. In NIPS 7, pages 689-696, 1995.", "award": [], "sourceid": 1660, "authors": [{"given_name": "Miguel", "family_name": "Carreira-Perpi\u00f1\u00e1n", "institution": null}]}