Part of Advances in Neural Information Processing Systems 12 (NIPS 1999)

*Miguel Carreira-Perpiñán*

We consider the problem of reconstructing a temporal discrete sequence of multidimensional real vectors when part of the data is missing, under the assumption that the sequence was generated by a continuous pro(cid:173) cess. A particular case of this problem is multivariate regression, which is very difficult when the underlying mapping is one-to-many. We pro(cid:173) pose an algorithm based on a joint probability model of the variables of interest, implemented using a nonlinear latent variable model. Each point in the sequence is potentially reconstructed as any of the modes of the conditional distribution of the missing variables given the present variables (computed using an exhaustive mode search in a Gaussian mix(cid:173) ture). Mode selection is determined by a dynamic programming search that minimises a geometric measure of the reconstructed sequence, de(cid:173) rived from continuity constraints. We illustrate the algorithm with a toy example and apply it to a real-world inverse problem, the acoustic-to(cid:173) articulatory mapping. The results show that the algorithm outperforms conditional mean imputation and multilayer perceptrons.

1 Definition of the problem

Consider a mobile point following a continuous trajectory in a subset of ]RD. Imagine that it is possible to obtain a finite number of measurements of the position of the point. Suppose that these measurements are corrupted by noise and that sometimes part of, or all, the variables are missing. The problem considered here is to reconstruct the sequence from the part of it which is observed. In the particular case where the present variables and the missing ones are the same for every point, the problem is one of multivariate regression. If the pattern of missing variables is more general, the problem is one of missing data reconstruction.

Consider the problem of regression. If the present variables uniquely identify the missing ones at every point of the data set, the problem can be adequately solved by a universal function approximator, such as a multilayer perceptron. In a probabilistic framework, the conditional mean of the missing variables given the present ones will minimise the average squared reconstruction error [3]. However, if the underlying mapping is one-to-many, there will be regions in the space for which the present variables do not identify uniquely the missing ones. In this case, the conditional mean mapping will fail, since it will give a compromise value-an average of the correct ones. Inverse problems, where the inverse

Probabilistic Sequential Data Reconstruction

415

of a mapping is one-to-many, are of this type. They include the acoustic-to-articulatory mapping in speech [15], where different vocal tract shapes may produce the same acoustic signal, or the robot arm problem [2], where different configurations of the joint angles may place the hand in the same position.

In some situations, data reconstruction is a means to some other objective, such as classifi(cid:173) cation or inference. Here, we deal solely with data reconstruction of temporally continuous sequences according to the squared error. Our algorithm does not apply for data sets that either lack continuity (e.g. discrete variables) or have lost it (e.g. due to undersampling or shuffling).

We follow a statistical learning approach: we attempt to reconstruct the sequence by learn(cid:173) ing the mapping from a training set drawn from the probability distribution of the data, rather than by solving a physical model of the system. Our algorithm can be described briefly as follows. First, a joint density model of the data is learned in an unsupervised way from a sample of the datal . Then, pointwise reconstruction is achieved by computing all the modes of the conditional distribution of the missing variables given the present ones at the current point. In principle, any of these modes is potentially a plausible reconstruc(cid:173) tion. When reconstructing a sequence, we repeat this mode search for every point in the sequence, and then find the combination of modes that minimises a geometric sequence measure, using dynamic programming. The sequence measure is derived from local conti(cid:173) nuity constraints, e.g. the curve length.

The algorithm is detailed in §2 to §4. We illustrate it with a 2D toy problem in §5 and apply it to an acoustic-to-articulatory-like problem in §6. §7 discusses the results and compares the approach with previous work. Our notation is as follows. We represent the observed variables in vector form as t = (tl' ... , t D) E ~D. A data set (possibly a temporal sequence) is represented as {t n } ~=l . Groups of variables are represented by sets of indices I, J E {I, ... , D}, so that if I = {I, 7, 3}, then tI = (tlt7t3).

2 Joint generative modelling using latent variables

Our starting point is a joint probability model of the observed variables p( t). From it, we can compute conditional distributions of the form p( t..71 tI) and, by picking representative points, derive a (multivalued) mapping tI ~ t..7. Thus, contrarily to other approaches, e.g. [6], we adopt multiple pointwise imputation. In §4 we show how to obtain a single reconstructed sequence of points.

Although density estimation requires more parameters than mapping approximation, it has a fundamental advantage [6]: the density model represents the relation between any vari(cid:173) ables, which allows to choose any missing/present variable combination. A mapping ap(cid:173) proximator treats asymmetrically some variables as inputs (present) and the rest as outputs (missing) and can't easily deal with other relations.

The existence of functional relationships (even one-to-many) between the observed vari(cid:173) ables indicates that the data must span a low-dimensional manifold in the data space. This suggests the use of latent variable models for modelling the joint density. However, it is possible to use other kinds of density models.

In latent variable modelling the assumption is that the observed high-dimensional data t is generated from an underlying low-dimensional process defined by a small number L of latent variables x = (Xl, ... , xL) [1] . The latent variables are mapped by a fixed

I In our examples we only use complete training data (i.e., with no missing data), but it is perfectly possible to estimate a probability model with incomplete training data by using an EM algorithm [6].

416

M A. Carreira-Perpiful.n

transformation into a D-dimensional data space and noise is added there. A particular model is specified by three parametric elements: a prior distribution in latent space p(x), a smooth mapping f from latent space to data space and a noise model in data space p(tlx). Marginalising the joint probability density function p(t, x) over the latent space gives the distribution in data space, p(t). Given an observed sample in data space {t n };;=l' a pa(cid:173) rameter estimate can be found by maximising the log-likelihood, typically using an EM algorithm. We consider the following latent variable models, both of which allow easy computation of conditional distributions of the form p( tJ It I ):

Factor analysis [1], in which the mapping is linear, the prior in latent space is unit Gaus(cid:173) sian and the noise model is diagonal Gaussian. The density in data space is then Gaussian with a constrained covariance matrix. We use it as a baseline for com(cid:173) parison with more sophisticated models.

The generative topographic mapping (GTM) [4] is a nonlinear latent variable model,

where the mapping is a generalised linear model, the prior in latent space is dis(cid:173) crete uniform and the noise model is isotropic Gaussian. The density in data space is then a constrained mixture of isotropic Gaussians.

In latent variable models that sample the latent space prior distribution (like GTM), the mixture centroids in data space (associated to the latent space samples) are not trainable parameters. We can then improve the density model at a higher computational cost with no generalisation loss by increasing the number of mixture components. Note that the number of components required will depend exponentially on the intrinsic dimensionality of the data (ideally coincident with that of the latent space, L) and not on the observed one, D.

3 Exhaustive mode finding

Given a conditional distribution p(tJltI), we consider all its modes as plausible predic(cid:173) tions for tJ. This requires an exhaustive mode search in the space of t J . For Gaussian mixtures, we do this by using a maximisation algorithm starting from each centroid2 , such as a fixed-point iteration or gradient ascent combined with quadratic optimisation [5]. In the particular case where all variables are missing, rather than performing a mode search, we return as predictions all the component centroids. It is also possible to obtain error bars at each mode by locally approximating the density function by a normal distribution. However, if the dimensionality of tJ is high, the error bars become very wide due to the curse of the dimensionality.

An advantage of multiple pointwise imputation is the easy incorporation of extra constraints on the missing variables. Such constraints might include keeping only those modes that lie in an interval dependent on the present variables [8] or discarding low-probability (spuri(cid:173) ous) modes-which speeds up the reconstruction algorithm and may make it more robust.

A faster way to generate representative points of p(tJltI) is simply to draw a fixed number of samples from it-which may also give robustness to poor density models. However, in practice this resulted in a higher reconstruction error.

4 Continuity constraints and dynamic programming (D.P) search

Application of the exhaustive mode search to the conditional distribution at every point of the sequence produces one or more candidate reconstructions per point. To select a

2 Actually, given a value of tz, most centroids have negligible posterior probability and can be removed from the mixture with practically no loss of accuracy. Thus, a large number of mixture components may be used without deteriorating excessively the computational efficiency.

Probabilistic Sequential Data Reconstrnction

417

trajectory factor an. mean dpmode

Do not remove: This comment is monitored to verify that the site is working properly