{"title": "Learning Shared Latent Structure for Image Synthesis and Robotic Imitation", "book": "Advances in Neural Information Processing Systems", "page_first": 1233, "page_last": 1240, "abstract": null, "full_text": "Learning Shared Latent Structure for Image Synthesis and Robotic Imitation\n\nAaron P. Shon \n\nKeith Grochow Aaron Hertzmann Rajesh P. N. Rao Department of Computer Science and Engineering University of Washington Seattle, WA 98195 USA Department of Computer Science University of Toronto Toronto, ON M5S 3G4 Canada {aaron,keithg,rao}@cs.washington.edu, hertzman@dgp.toronto.edu\n\nAbstract\nWe propose an algorithm that uses Gaussian process regression to learn common hidden structure shared between corresponding sets of heterogenous observations. The observation spaces are linked via a single, reduced-dimensionality latent variable space. We present results from two datasets demonstrating the algorithms's ability to synthesize novel data from learned correspondences. We first show that the method can learn the nonlinear mapping between corresponding views of objects, filling in missing data as needed to synthesize novel views. We then show that the method can learn a mapping between human degrees of freedom and robotic degrees of freedom for a humanoid robot, allowing robotic imitation of human poses from motion capture data.\n\n1\n\nIntroduction\n\nFinding common structure between two or more concepts lies at the heart of analogical reasoning. Structural commonalities can often be used to interpolate novel data in one space given observations in another space. For example, predicting a 3D object's appearance given corresponding poses of another, related object relies on learning a parameterization common to both objects. Another domain where finding common structure is crucial is imitation learning, also called \"learning by watching\" [11, 12, 6]. 
In imitation learning, one agent, such as a robot, learns to perform a task by observing another agent, for example a human instructor. In this paper, we propose an efficient framework for discovering parameterizations shared between multiple observation spaces using Gaussian processes. Gaussian processes (GPs) are powerful models for classification and regression that subsume numerous classes of function approximators, such as single hidden-layer neural networks and RBF networks [8, 15, 9]. Recently, Lawrence proposed the Gaussian process latent variable model (GPLVM) [4] as a new technique for nonlinear dimensionality reduction and data visualization [13, 10]. An extension of this model, the scaled GPLVM (SGPLVM), has been used successfully for dimensionality reduction on human motion capture data for motion synthesis and visualization [1]. Here, we propose a generalization of the GPLVM model that can handle multiple observation spaces, where each set of observations is parameterized by a different set of kernel parameters. Observations are linked via a single, reduced-dimensionality latent variable space. Our framework can be viewed as a nonlinear extension to canonical correlation analysis (CCA), a framework for learning correspondences between sets of observations. Our goal is to find correspondences on testing data, given a limited set of corresponding training data from two observation spaces. Such an algorithm can be used in a variety of applications, such as inferring a novel view of an object given a corresponding view of a different object, or estimating the kinematic parameters of a humanoid robot given a human pose. Several properties motivate our use of GPs. First, finding latent representations for correlated, high-dimensional sets of observations requires non-linear mappings, so linear CCA is not viable. 
Second, GPs reduce the number of free parameters in the regression model, such as the number of basis units needed, relative to alternative regression models such as neural networks. Third, the probabilistic nature of GPs facilitates learning from multiple sources with potentially different variances. Fourth, probabilistic models provide an estimate of uncertainty when classifying or when interpolating between data points; this is especially useful in applications such as robotic imitation, where estimates of uncertainty can be used to decide whether or not a robot should attempt a particular pose. GPs can also generate samples of novel data, unlike many nonlinear dimensionality reduction methods [10, 13]. Fig. 1(a) shows the graphical model for learning shared structure using Gaussian processes. A latent space X maps to two (or more) observation spaces Y, Z using nonlinear kernels, and \"inverse\" Gaussian processes map back from observations to latent coordinates. Synthesis employs a map from latent coordinates to observations, while recognition employs an inverse mapping. We demonstrate our approach on two datasets. The first is an image dataset containing corresponding views of two different objects. The challenge is to predict corresponding views of the second object given novel views of the first, based on a limited training set of corresponding object views. The second dataset consists of human poses derived from motion capture data and corresponding kinematic poses from a humanoid robot. The challenge is to estimate the kinematic parameters for a robot pose, given a potentially novel pose from human motion capture, thereby allowing robotic imitation of human poses. 
Our results indicate that the model generalizes well when only limited training correspondences are available, and that the model remains robust when testing data are noisy.\n\n2 Latent Structure Model\n\nThe goal of our model is to find a shared latent variable parameterization in a space X that relates corresponding pairs of observations from two (or more) different spaces Y, Z. The observation spaces might be very dissimilar, despite the observations sharing a common structure or parameterization. For example, a robot's joint space may have very different degrees of freedom than a human's joint space, although both may be made to assume similar poses. The latent variable space then characterizes the common pose space. Let Y, Z be matrices of observations (training data) drawn from spaces of dimensionality DY, DZ respectively. Each row represents one data point. These observations are drawn so that the first observation y1 corresponds to the observation z1, observation y2 corresponds to observation z2, etc., up to the number of observations N. Let X be a \"latent space\" of dimensionality DX ≪ DY, DZ. We initialize a matrix of latent points X by averaging the top DX principal components of Y, Z. As with the original GPLVM, we optimize over a limited subset of training points (the active set) to accelerate training, determined by the informative vector machine (IVM) [5]. The SGPLVM assumes that a diagonal \"scaling matrix\" W scales the variances of each dimension k of the Y matrix (a similar matrix V scales each dimension m of Z). The scaling matrix helps in domains where different output dimensions (such as the degrees of freedom of a robot) can have vastly different variances. We assume that each latent point xi generates a pair of observations yi, zi via a nonlinear function parameterized by a kernel matrix. GPs parameterize the functions fY : X → Y and fZ : X → Z. 
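As an illustration of the initialization step just described (averaging the top DX principal-component projections of Y and Z), a minimal NumPy sketch might look as follows. The function names and toy data are ours, not the authors'; this is a sketch of the stated initialization, not their implementation:

```python
import numpy as np

def init_latent(Y, Z, d_x):
    """Initialize shared latent points by averaging the top d_x
    principal-component projections of the two observation matrices
    (rows are corresponding data points)."""
    def pca_project(A, d):
        A = A - A.mean(axis=0)                # center each dimension
        # Right singular vectors give the principal directions.
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        return A @ Vt[:d].T                   # project onto top-d components
    return 0.5 * (pca_project(Y, d_x) + pca_project(Z, d_x))

# Corresponding rows of Y and Z share a single latent point.
rng = np.random.RandomState(0)
Y = rng.randn(40, 12)                         # e.g., one observation space
Z = rng.randn(40, 30)                         # e.g., the companion space
X0 = init_latent(Y, Z, 2)                     # one 2D latent point per pair
```

Each latent point is then refined jointly with the kernel parameters during optimization.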
The SGPLVM model uses an exponential (RBF) kernel, defining the similarity between two data points x, x′ as:\n\nk(x, x′) = α_Y exp(−(γ_Y/2) ||x − x′||²) + δ_{x,x′} β_Y⁻¹    (1)\n\ngiven hyperparameters θ_Y = {α_Y, β_Y, γ_Y} for the Y space; δ represents the delta function. Following standard notation for GPs [8, 15, 9], the priors P(θ_Y), P(θ_Z), P(X), the likelihoods P(Y), P(Z) for the Y, Z observation spaces, and the joint likelihood P_GP(X, Y, Z, θ_Y, θ_Z) are given by:\n\nP(Y | θ_Y, X) = (|W|^N / sqrt((2π)^{N·DY} |K_Y|^{DY})) exp(−(1/2) Σ_{k=1..DY} w_k² Y_k^T K_Y⁻¹ Y_k)    (2)\n\nP(Z | θ_Z, X) = (|V|^N / sqrt((2π)^{N·DZ} |K_Z|^{DZ})) exp(−(1/2) Σ_{m=1..DZ} v_m² Z_m^T K_Z⁻¹ Z_m)    (3)\n\nP(θ_Y) ∝ 1/(α_Y β_Y γ_Y),  P(θ_Z) ∝ 1/(α_Z β_Z γ_Z)    (4)\n\nP(X) ∝ exp(−(1/2) Σ_i ||x_i||²)    (5)\n\nP_GP(X, Y, Z, θ_Y, θ_Z) = P(Y | θ_Y, X) P(Z | θ_Z, X) P(θ_Y) P(θ_Z) P(X)    (6)\n\nwhere θ_Z = {α_Z, β_Z, γ_Z} are hyperparameters for the Z space, and w_k, v_m respectively denote the diagonal entries of the matrices W, V. Let Y, K_Y respectively denote the Y observations from the active set (with the mean μ_Y subtracted out) and the kernel matrix for the active set. The joint negative log likelihood of a latent point x and observations y, z is:\n\nL_{y|x}(x, y) = ||W(y − f_Y(x))||² / (2σ_Y²(x)) + (DY/2) ln σ_Y²(x)    (7)\nf_Y(x) = μ_Y + Y^T K_Y⁻¹ k(x)    (8)\nσ_Y²(x) = k(x, x) − k(x)^T K_Y⁻¹ k(x)    (9)\nL_{z|x}(x, z) = ||V(z − f_Z(x))||² / (2σ_Z²(x)) + (DZ/2) ln σ_Z²(x)    (10)\nf_Z(x) = μ_Z + Z^T K_Z⁻¹ k(x)    (11)\nσ_Z²(x) = k(x, x) − k(x)^T K_Z⁻¹ k(x)    (12)\nL_{x,y,z} = L_{y|x} + L_{z|x} + (1/2) ||x||²    (13)\n\nThe model learns a separate kernel for each observation space, but a single set of common latent points. A conjugate gradient solver adjusts model parameters and latent coordinates to maximize Eq. 6. Given a trained SGPLVM, we would like to infer the parameters in one observation space given parameters in the other (e.g., infer robot pose z given human pose y). We solve this problem in two steps. First, we determine the most likely latent coordinate x given the observation y by minimizing L_{y|x}(x, y) over x. 
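To make Eqs. 1 and 7–9 concrete, here is a small NumPy sketch of the RBF kernel and the resulting per-point negative log likelihood for the Y space. It is an illustrative reimplementation under our own (hypothetical) variable names, not the authors' code, and it omits the scaling-matrix learning and IVM active-set machinery:

```python
import numpy as np

def rbf_kernel(X1, X2, alpha, beta_inv, gamma, same=False):
    """Eq. 1: alpha * exp(-gamma/2 * ||x - x'||^2), plus the
    delta-function noise term beta^{-1} when X1 and X2 coincide."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    K = alpha * np.exp(-0.5 * gamma * d2)
    return K + beta_inv * np.eye(len(X1)) if same else K

def neg_log_lik_y(x, y, X, Y, W, theta):
    """L_{y|x} of Eq. 7: scaled squared prediction error plus a
    log-variance penalty, using the GP mean (Eq. 8) and variance (Eq. 9)."""
    alpha, beta_inv, gamma = theta
    K = rbf_kernel(X, X, alpha, beta_inv, gamma, same=True)
    k_x = rbf_kernel(x[None, :], X, alpha, beta_inv, gamma)[0]
    K_inv = np.linalg.inv(K)
    mu = Y.mean(axis=0)
    f = mu + (Y - mu).T @ K_inv @ k_x              # Eq. 8: predictive mean
    var = alpha + beta_inv - k_x @ K_inv @ k_x     # Eq. 9: predictive variance
    r = W @ (y - f)
    return (r @ r) / (2.0 * var) + 0.5 * len(y) * np.log(var)
```

Summing this quantity over the training pairs, adding the L_{z|x} analogue and the (1/2)||x||² prior term, gives the per-point objective of Eq. 13 that the conjugate gradient solver optimizes.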
In principle, one could find x using gradient descent initialized at x = 0. However, to speed up recognition, we instead learn a separate \"inverse\" Gaussian process fY⁻¹ : Y → X that maps back from the space Y to the space X. Once the correct latent coordinate x has been inferred for a given y, the model uses the trained SGPLVM to predict the corresponding observation z.\n\n3 Results\n\nWe first demonstrate how our model can be used to synthesize new views of an object, character, or scene from known views of another object, character, or scene, given a common latent variable model. For ease of visualization, we used 2D latent spaces for all results shown here. The model was applied to image pairs depicting corresponding views of 3D objects. Different views show the objects¹ rotated at varying degrees out of the camera plane. We downsampled the images to 32 × 32 grayscale pixels. For fitting images, the scaling matrices W, V are of minimal importance (since we expect all pixels a priori to have the same variance). We also found empirically that using fY(x) = Y^T K_Y⁻¹ k(x) instead of Eqn. 8 produced better renderings. We rescaled each fY to use the full range of pixel values [0, 255], creating the images shown in the figures. Fig. 1(b) shows how the model extrapolates to novel datasets given a limited set of training correspondences. We trained the model using 72 corresponding views of two different objects, a coffee cup and a toy truck. Fixing the latent coordinates learned during training, we then selected 8 views of a third object (a toy car). We selected latent points corresponding to those views, and learned kernel parameters for the 8 images. Empirically, priors on kernel parameters are critical for acceptable performance, particularly when only limited data are available, such as the 8 different poses for the toy car. 
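The two-step inference just described (map y to a latent point with the "inverse" GP, then map that point forward to the companion space) can be sketched with plain GP regression means. The helper below uses fixed, hypothetical hyperparameters and is a simplified stand-in for the learned SGPLVM mappings, not the authors' implementation:

```python
import numpy as np

def gp_mean(A_train, B_train, A_test, gamma=5.0, noise=1e-4):
    """Posterior mean of a GP regressor from inputs A to outputs B,
    with an RBF kernel and a small noise ridge (assumed values)."""
    d2 = ((A_train[:, None, :] - A_train[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * gamma * d2) + noise * np.eye(len(A_train))
    d2s = ((A_test[:, None, :] - A_train[None, :, :]) ** 2).sum(-1)
    k = np.exp(-0.5 * gamma * d2s)
    mu = B_train.mean(axis=0)
    return mu + k @ np.linalg.solve(K, B_train - mu)

def infer_z_from_y(y_new, X, Y, Z):
    """Step 1: the inverse GP maps observation y to a latent point x.
    Step 2: the forward GP maps x into the companion space Z."""
    x_hat = gp_mean(Y, X, y_new)      # y -> x
    return gp_mean(X, Z, x_hat)       # x -> z
```

For an observation near the training set, the round trip lands close to the paired observation; for novel y it interpolates in latent space.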
In this case, we used the kernel parameters learned for the cup and toy truck (based on 72 different poses) to impose a Gaussian prior on the kernel parameters for the car (replacing P(θ) in Eqn. 4):\n\n−log P(θ_car) = −log P_GP + (θ_car − μ_θ)^T Σ⁻¹ (θ_car − μ_θ)    (14)\n\nwhere θ_car, μ_θ, Σ⁻¹ are respectively the kernel parameters for the car, the mean kernel parameters for previously learned kernels (for the cup and truck), and the inverse covariance matrix for learned kernel parameters. μ_θ, Σ⁻¹ in this case are derived from only two samples, but nonetheless successfully constrain the kernel parameters for the car so that the model functions on the limited set of 8 example poses. To test the model's robustness to noise and missing data, we randomly selected 10 latent coordinates corresponding to a subset of learned cup and truck image pairs. We then added varying displacements to the latent coordinates and synthesized the corresponding novel views for all 3 observation spaces. Displacements varied from 0 to 0.45 (all 72 latent coordinates lie within the box bounded by [-0.70, -0.87] and [0.72, 0.56]). The synthesized views are shown in Fig. 1(b), with images for the cup and truck in the first two rows. Latent coordinates in regions of low model likelihood generate images that appear blurry or noisy. More interestingly, despite the small number of images used for the car, the model correctly matches the orientation of the car to the synthesized images of the cup and truck. Thus, the model can synthesize reasonable correspondences (given a latent point) even if the number of training examples used to learn kernel parameters is small. Fig. 2 illustrates the recognition performance of the \"inverse\" Gaussian process model as a function of the amount of noise added to the inputs. Using the latent space and kernel parameters learned for Fig. 
1, we present 72 views of the coffee cup with varying amounts of additive, zero-mean white noise, and determine the fraction of the 72 poses correctly classified by the model. The model estimates the pose using 1-nearest-neighbor classification over the latent coordinates x′ learned during training:\n\nx* = argmax_{x′} k(x, x′)    (15)\n\nThe recognition performance degrades gracefully with increasing noise power. Fig. 2 also plots sample images from one pose of the cup at several different noise levels. For two of the noise levels, we show the \"denoised\" cup image selected using the nearest-neighbor classification, and the corresponding reconstructed truck. This illustrates how even noisy observations in one space can predict corresponding observations in the companion space.\n\n¹ http://www1.cs.columbia.edu/CAVE/research/softlib/coil-100.html\n\nFigure 1: Pose synthesis for multiple objects using shared structure: (a) Graphical model for our shared structure latent variable model. The latent space X maps to two (or more) observation spaces Y, Z using a nonlinear kernel. \"Inverse\" Gaussian process kernels map back from observations to latent coordinates. (b) The model learns pose correspondences for images of the coffee cup and toy truck (Y and Z) by fitting kernel parameters and a 2-dimensional latent variable space. After learning the latent coordinates for the cup and truck, we fit kernel parameters for a novel object (the toy car). Unlike the cup and truck, where 72 pairs of views were used to fit kernel parameters and latent coordinates, only 8 views were used to fit kernel parameters for the car. The model is robust to noise in the latent coordinates; numbers above each column represent the amount of noise added to the latent coordinates used to synthesize the images. Even at points where the model is uncertain (indicated by the rightmost results in the Y and Z rows), the learned kernel extrapolates the correct view of the toy car (the \"novel\" row).\n\nFig. 3 illustrates the ability of the model to synthesize novel views of one object given a novel view of a different object. A limited set of corresponding poses (24 of 72 total) of a cat figurine and a mug were used to train the GP model. The remaining 48 poses of the mug were then used as testing data. For each snapshot of the mug, we inferred a latent point using the \"inverse\" Gaussian process model and used the learned model to synthesize what the cat figurine should look like in the same pose. A subset of these results is presented in the rows on the left in Fig. 3: the \"Test\" rows show novel images of the mug, the \"Inferred\" rows show the model's best estimate for the cat figurine, and the \"Actual\" rows show the ground truth. Although the images for some poses are blurry and the model fails to synthesize the correct image for pose 44, the model nevertheless manages to capture fine detail in most of the images. The grayscale plot at upper right in Fig. 3 shows model certainty 1/(σ_Y²(x) + σ_Z²(x)), with white where the model is highly certain and black where the model is highly uncertain. Arrows indicate the path in latent space formed by the training images. The dashed line indicates latent points inferred from testing images of the mug. Numbered latent coordinates correspond to the synthesized images at left. The latent space shows structure: latent points for similar poses are grouped together, and tend to move along a smooth curve in latent space, with coordinates for the final pose lying close to coordinates for the first pose (as desired for a cyclic image sequence). 
The bar graph at lower right compares model certainty for the numbered latent coordinates; higher bars indicate greater model certainty. The model appears particularly uncertain for blurry inferred images, such as 8, 14, and 26. Fig. 4 shows an application of our framework to the problem of robotic imitation of human actions. We trained our model on a dataset containing human poses (acquired with a Vicon motion capture system) and corresponding poses of a Fujitsu HOAP-2 humanoid robot. Note that the robot has 25 degrees-of-freedom, which differ significantly from the degrees-of-freedom of the human skeleton used in motion capture.\n\nFigure 2: Recognition using a Learned Latent Variable Space: After learning from 72 paired correspondences between poses of a coffee cup and of a toy truck, the model is able to recognize different poses of the coffee cup in the presence of additive white noise. The fraction of images recognized is plotted on the Y axis and the standard deviation of the white noise is plotted on the X axis. One pose of the cup (of 72 total) is plotted for various noise levels (see text for details). \"Denoised\" images obtained from nearest-neighbor classification and the corresponding images for the Z space (the toy truck) are also shown.\n\nAfter training on 43 roughly matching poses (only linear time scaling was applied to align training poses), we tested the model by presenting a set of 123 human motion capture poses (which includes the original training set). Because the recognition model fY⁻¹ : Y → X is not trained on samples from the prior distribution of the data, P(x, y), we found it necessary to approximate k(x) for the recognition model by rescaling the k(x) values for the testing points to lie on the same interval as the k(x) values of the training points. We suspect that providing proper samples from the prior will improve recognition performance. 
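One simple way to implement the rescaling just described is a min-max mapping of the test-time values onto the interval spanned by the training values. The exact scheme used is not specified in the text, so this version is an assumption; the function name is ours:

```python
import numpy as np

def rescale_to_train_interval(v_test, v_train):
    """Min-max rescale test-time values so they span the same interval
    as the corresponding training values (one plausible reading of the
    rescaling described in the text, not the authors' code)."""
    lo, hi = v_train.min(), v_train.max()
    t = (v_test - v_test.min()) / (v_test.max() - v_test.min())
    return lo + t * (hi - lo)
```

Applied to the k(x) values of the testing points, this forces them onto the range observed during training before they are used by the recognition model.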
As illustrated in Fig. 4 (inset panels, human and robot skeletons), the model was able to correctly infer appropriate robot kinematic parameters given a range of novel human poses. These inferred parameters were used in conjunction with a simple controller to instantiate the pose in the humanoid robot (see photos in the inset panels).\n\n4 Discussion\n\nOur Gaussian process model provides a novel method for learning nonlinear relationships between corresponding sets of data. Our results demonstrate the model's utility for diverse tasks such as image synthesis and robotic programming by demonstration. The GP model is closely related to other kernel methods for solving CCA [3] and similar problems [2]. The problems addressed by our model can also be framed as a type of nonlinear CCA. Our method differs from the latent variable method proposed in [14] by using Gaussian process regression. Disadvantages of our method with respect to [14] include lack of global optimality for the latent embedding; advantages include fewer independent parameters and the ability to easily impose priors on the latent variable space (since GPLVM regression uses conjugate gradient optimization instead of eigendecomposition). Empirically, we found the flexibility of the GPLVM approach desirable for modeling a diversity of data sources. Our framework learns mappings between each observation space and a latent space, rather than mapping directly between the observation spaces. This makes visualization and interaction much easier. 
Figure 3: Synthesis of novel views using a shared latent variable model: After training on 24 paired images of a mug with a cat figurine (out of 72 total paired images), we ask the model to infer what the remaining 48 poses of the cat would look like given 48 novel views of the mug. The system uses an inverse Gaussian process model to infer a 2D latent point for each of the 48 novel mug views, then synthesizes a corresponding view of the cat figurine. At left we plot the novel testing mug images given to the system (\"test\"), the synthesized cat images (\"inferred\"), and the actual views of the cat figurine from the database (\"actual\"). At upper right we plot the model uncertainty in the latent space. The 24 latent coordinates from the training data are plotted as arrows, while the 48 novel latent points are plotted as crosses on a dashed line. At lower right we show model certainty for the cat figurine data (1/σ_Z²(x)) for each testing latent point x. Note the low certainty for the blurry inferred images labeled 8, 14, and 26.\n\nAn intermediate mapping to a latent space is also more economical in the limit of many correlated observation spaces. Rather than learning all pairwise relations between observation spaces (requiring a number of parameters quadratic in the number of observation spaces), our method learns one generative and one inverse mapping between each observation space and the latent space (so the number of parameters grows linearly). From a cognitive science perspective, such an approach is similar to the Active Intermodal Mapping (AIM) hypothesis of imitation [6]. 
In AIM, an imitating agent maps its own actions and its perceptions of others' actions into a single, modality-independent space. This modality-independent space is analogous to the latent variable space in our model. Our model does not directly address the \"correspondence problem\" in imitation [7], where correspondences between an agent and a teacher are established through some form of unsupervised feature matching. However, it is reasonable to assume that imitation by a robot of human activity could involve some initial, explicit correspondence matching based on simultaneity. Turn-taking behavior is an integral part of human-human interaction. Thus, to bootstrap its database of corresponding data points, a robot could invite a human to take turns playing out motor sequences. Initially, the human would imitate the robot's actions and the robot could use this data to learn correspondences using our GP model; later, the robot could check and, if necessary, refine its learned model by attempting to imitate the human's actions.\n\nFigure 4: Learning shared latent structure for robotic imitation of human actions: The plot in the center shows the latent training points (red circles) and model precision 1/σ_Z² for the robot model (grayscale plot), with examples of recovered latent points for testing data (blue diamonds). Model precision is qualitatively similar for the human model. Inset panels show the pose of the human motion capture skeleton, the simulated robot skeleton, and the humanoid robot for each example latent point. The model correctly infers robot poses from the human walking data (inset panels).\n\nAcknowledgements: This work was supported by NSF AICS grant no. 130705 and an ONR YIP award/NSF Career award to RPNR. We thank the anonymous reviewers for their comments.\n\nReferences\n[1] K. Grochow, S. L. Martin, A. Hertzmann, and Z. Popovic. Style-based inverse kinematics. In Proc. SIGGRAPH, 2004.\n[2] J. Ham, D. Lee, and L. Saul. Semisupervised alignment of manifolds. In AISTATS, 2004.\n[3] P. L. Lai and C. Fyfe. Kernel and nonlinear canonical correlation analysis. Int. J. Neural Sys., 10(5):365–377, 2000.\n[4] N. D. Lawrence. Gaussian process models for visualization of high dimensional data. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in NIPS 16, 2004.\n[5] N. D. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: the informative vector machine. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in NIPS 15, 2003.\n[6] A. N. Meltzoff. Elements of a developmental theory of imitation. In A. N. Meltzoff and W. Prinz, editors, The imitative mind: Development, evolution, and brain bases, pages 19–41. Cambridge: Cambridge University Press, 2002.\n[7] C. Nehaniv and K. Dautenhahn. The correspondence problem. In Imitation in Animals and Artifacts. MIT Press, 2002.\n[8] A. O'Hagan. On curve fitting and optimal design for regression. Journal of the Royal Statistical Society B, 40:1–42, 1978.\n[9] C. E. Rasmussen. Evaluation of Gaussian Processes and other Methods for Non-Linear Regression. PhD thesis, University of Toronto, 1996.\n[10] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.\n[11] S. Schaal, A. Ijspeert, and A. Billard. Computational approaches to motor learning by imitation. Phil. Trans. Royal Soc. London: Series B, 358:537–547, 2003.\n[12] A. P. Shon, D. B. Grimes, C. L. Baker, and R. P. N. Rao. A probabilistic framework for model-based imitation learning. In Proc. 26th Ann. Mtg. Cog. Sci. Soc., 2004.\n[13] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.\n[14] J. J. Verbeek, S. T. Roweis, and N. Vlassis. Non-linear CCA and PCA by alignment of local models. In Advances in NIPS 16, pages 297–304, 2003.\n[15] C. K. I. Williams. 
Computing with infinite networks. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in NIPS 9. Cambridge, MA: MIT Press, 1996.\n", "award": [], "sourceid": 2751, "authors": [{"given_name": "Aaron", "family_name": "Shon", "institution": null}, {"given_name": "Keith", "family_name": "Grochow", "institution": null}, {"given_name": "Aaron", "family_name": "Hertzmann", "institution": null}, {"given_name": "Rajesh", "family_name": "Rao", "institution": null}]}