{"title": "Bayesian Manifold Learning: The Locally Linear Latent Variable Model (LL-LVM)", "book": "Advances in Neural Information Processing Systems", "page_first": 154, "page_last": 162, "abstract": "We introduce the Locally Linear Latent Variable Model (LL-LVM), a probabilistic model for non-linear manifold discovery that describes a joint distribution over observations, their manifold coordinates and locally linear maps conditioned on a set of neighbourhood relationships. The model allows straightforward variational optimisation of the posterior distribution on coordinates and locally linear maps from the latent space to the observation space given the data. Thus, the LL-LVM encapsulates the local-geometry preserving intuitions that underlie non-probabilistic methods such as locally linear embedding (LLE). Its probabilistic semantics make it easy to evaluate the quality of hypothesised neighbourhood relationships, select the intrinsic dimensionality of the manifold, construct out-of-sample extensions and to combine the manifold model with additional probabilistic models that capture the structure of coordinates within the manifold.", "full_text": "Bayesian Manifold Learning:

The Locally Linear Latent Variable Model

Mijung Park, Wittawat Jitkrittum, Ahmad Qamar*, Zoltán Szabó, Lars Buesing†, Maneesh Sahani

Gatsby Computational Neuroscience Unit, University College London

{mijung, wittawat, zoltan.szabo}@gatsby.ucl.ac.uk

atqamar@gmail.com, lbuesing@google.com, maneesh@gatsby.ucl.ac.uk

Abstract

We introduce the Locally Linear Latent Variable Model (LL-LVM), a probabilistic model for non-linear manifold discovery that describes a joint distribution over observations, their manifold coordinates and locally linear maps conditioned on a set of neighbourhood relationships. 
The model allows straightforward variational optimisation of the posterior distribution on coordinates and locally linear maps from the latent space to the observation space given the data. Thus, the LL-LVM encapsulates the local-geometry preserving intuitions that underlie non-probabilistic methods such as locally linear embedding (LLE). Its probabilistic semantics make it easy to evaluate the quality of hypothesised neighbourhood relationships, select the intrinsic dimensionality of the manifold, construct out-of-sample extensions and to combine the manifold model with additional probabilistic models that capture the structure of coordinates within the manifold.

1 Introduction

Many high-dimensional datasets comprise points derived from a smooth, lower-dimensional manifold embedded within the high-dimensional space of measurements and possibly corrupted by noise. For instance, biological or medical imaging data might reflect the interplay of a small number of latent processes that all affect measurements non-linearly. Linear multivariate analyses such as principal component analysis (PCA) or multidimensional scaling (MDS) have long been used to estimate such underlying processes, but cannot always reveal low-dimensional structure when the mapping is non-linear (or, equivalently, the manifold is curved). Thus, there has been substantial recent interest in algorithms to identify non-linear manifolds in data.

Many more-or-less heuristic methods for non-linear manifold discovery are based on the idea of preserving the geometric properties of local neighbourhoods within the data, while embedding, unfolding or otherwise transforming the data to occupy fewer dimensions. Thus, algorithms such as locally linear embedding (LLE) and Laplacian eigenmaps attempt to preserve local linear relationships or to minimise the distortion of local derivatives [1, 2]. 
Others, like isometric feature mapping (Isomap) or maximum variance unfolding (MVU), preserve local distances, estimating global manifold properties by continuation across neighbourhoods before embedding to lower dimensions by classical methods such as PCA or MDS [3]. While generally hewing to this same intuitive path, the range of available algorithms has grown very substantially in recent years [4, 5].

*Current affiliation: Thread Genius
†Current affiliation: Google DeepMind

However, these approaches do not define distributions over the data or over the manifold properties. Thus, they provide no measures of uncertainty on manifold structure or on the low-dimensional locations of the embedded points; they cannot be combined with a structured probabilistic model within the manifold to define a full likelihood relative to the high-dimensional observations; and they provide only heuristic methods to evaluate the manifold dimensionality. As others have pointed out, they also make it difficult to extend the manifold definition to out-of-sample points in a principled way [6].

An established alternative is to construct an explicit probabilistic model of the functional relationship between low-dimensional manifold coordinates and each measured dimension of the data, assuming that the functions instantiate draws from Gaussian-process priors. The original Gaussian process latent variable model (GP-LVM) required optimisation of the low-dimensional coordinates, and thus still did not provide uncertainties on these locations or allow evaluation of the likelihood of a model over them [7]; however, a recent extension exploits an auxiliary variable approach to optimise a more general variational bound, thus retaining approximate probabilistic semantics within the latent space [8]. 
The stochastic process model for the mapping functions also makes it straightforward to estimate the function at previously unobserved points, thus generalising out-of-sample with ease. However, the GP-LVM gives up on the intuitive preservation of local neighbourhood properties that underpins the non-probabilistic methods reviewed above. Instead, the expected smoothness or other structure of the manifold must be defined by the Gaussian process covariance function, chosen a priori.

Here, we introduce a new probabilistic model over high-dimensional observations, low-dimensional embedded locations and the locally linear maps between the two within each neighbourhood, such that each group of variables is Gaussian distributed given the other two. This locally linear latent variable model (LL-LVM) thus respects the same intuitions as the common non-probabilistic manifold discovery algorithms, while still defining a full-fledged probabilistic model. Indeed, variational inference in this model follows more directly and with fewer separate bounding operations than the sparse auxiliary-variable approach used with the GP-LVM. Thus, uncertainty in the low-dimensional coordinates and in the manifold shape (defined by the local maps) is captured naturally. A lower bound on the marginal likelihood of the model makes it possible to select between different latent dimensionalities and, perhaps most crucially, between different definitions of neighbourhood, thus addressing an important unsolved issue with neighbourhood-defined algorithms. Unlike existing probabilistic frameworks with locally linear models, such as mixtures of factor analysers (MFA)-based and local tangent space analysis (LTSA)-based methods [9, 10, 11], LL-LVM does not require an additional step to obtain a globally consistent alignment of low-dimensional local coordinates.¹

This paper is organised as follows. 
In section 2, we introduce our generative model, LL-LVM, for which we derive the variational inference method in section 3. We briefly describe the out-of-sample extension for LL-LVM and contrast it mathematically with GP-LVM at the end of section 3. In section 4, we demonstrate the approach on several real-world problems.

Notation: In the following, a diagonal matrix with entries taken from the vector v is written diag(v). The vector of n ones is 1_n and the n × n identity matrix is I_n. The Euclidean norm of a vector is ‖v‖, and the Frobenius norm of a matrix is ‖M‖_F. The Kronecker delta is denoted by δ_ij (= 1 if i = j, and 0 otherwise). The Kronecker product of matrices M and N is M ⊗ N. For a random vector w, we denote the normalisation constant in its probability density function by Z_w. The expectation of a random vector w with respect to a density q is ⟨w⟩_q.

2 The model: LL-LVM

Suppose we have n data points {y_1, ..., y_n} ⊂ R^{d_y}, and a graph G on nodes {1, ..., n} with edge set E_G = {(i, j) | y_i and y_j are neighbours}. We assume that there is a low-dimensional (latent) representation of the high-dimensional data, with coordinates {x_1, ..., x_n} ⊂ R^{d_x}, d_x < d_y. It will be helpful to concatenate the vectors to form y = [y_1^T, ..., y_n^T]^T and x = [x_1^T, ..., x_n^T]^T.

¹This is also true of one previous MFA-based method [12], which finds model parameters and global coordinates by variational methods similar to our own.

Figure 1: Locally linear mapping C_i for the ith data point transforms the tangent space T_{x_i}M_x at x_i in the low-dimensional space to the tangent space T_{y_i}M_y at the corresponding data point y_i in the high-dimensional space. 
A neighbouring data point is denoted by y_j and the corresponding latent variable by x_j.

Our key assumption is that the mapping between the high-dimensional data and the low-dimensional coordinates is locally linear (Fig. 1). The tangent spaces are approximated by {y_j − y_i}_{(i,j)∈E_G} and {x_j − x_i}_{(i,j)∈E_G}, the pairwise differences between the ith point and its neighbouring points j. The matrix C_i ∈ R^{d_y × d_x} at the ith point linearly maps those tangent spaces as

y_j − y_i ≈ C_i (x_j − x_i).   (1)

Under this assumption, we aim to find the distribution over the linear maps C = [C_1, ..., C_n] ∈ R^{d_y × n d_x} and the latent variables x that best describe the data likelihood given the graph G:

log p(y|G) = log ∫∫ p(y, C, x|G) dx dC.   (2)

The joint distribution can be written in terms of the priors on C and x and the likelihood of y as

p(y, C, x|G) = p(y|C, x, G) p(C|G) p(x|G).   (3)

In the following, we highlight the essential components of the Locally Linear Latent Variable Model (LL-LVM). Detailed derivations are given in the Appendix.

Adjacency matrix and Laplacian matrix   The edge set of G for n data points specifies an n × n symmetric adjacency matrix G. We write η_ij for the (i, j)th element of G, which is 1 if y_j and y_i are neighbours and 0 otherwise (including on the diagonal). The graph Laplacian matrix is then L = diag(G 1_n) − G.

Prior on x   We assume that the latent variables are zero-centred with a bounded expected scale, and that latent variables corresponding to neighbouring high-dimensional points are close (in Euclidean distance). Formally, the log prior on the coordinates is then

log p({x_1, ..., x_n}|G, α) = −(1/2) Σ_{i=1}^n ( α ‖x_i‖² + Σ_{j=1}^n η_ij ‖x_i − x_j‖² ) − log Z_x,

where the parameter α > 0 controls the expected scale. 
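As a concrete illustration (a numpy sketch on made-up data, not the authors' released code), one can build a symmetric k-NN adjacency matrix, form the Laplacian L = diag(G 1_n) − G, and verify numerically that the quadratic form in this log prior is exactly x^T (α I_{n d_x} + 2L ⊗ I_{d_x}) x, the Gaussian-precision form stated next:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dx, k, alpha = 30, 2, 4, 0.5

# made-up 3D observations, used only to define a neighbourhood graph
y = rng.standard_normal((n, 3))

# symmetric k-nearest-neighbour adjacency matrix G (eta_ij in the text)
d2 = ((y[:, None, :] - y[None, :, :]) ** 2).sum(-1)
np.fill_diagonal(d2, np.inf)                  # no self-edges
G = np.zeros((n, n))
for i in range(n):
    G[i, np.argsort(d2[i])[:k]] = 1.0
G = np.maximum(G, G.T)                        # symmetrise
L = np.diag(G.sum(1)) - G                     # Laplacian: diag(G 1_n) - G

# prior precision on the concatenated x: alpha*I + 2 L (kron) I_dx
Prec = alpha * np.eye(n * dx) + 2.0 * np.kron(L, np.eye(dx))

# the quadratic form in the log prior matches x^T Prec x
x = rng.standard_normal((n, dx))
quad = alpha * (x ** 2).sum() + sum(
    G[i, j] * ((x[i] - x[j]) ** 2).sum()
    for i in range(n) for j in range(n))
assert np.allclose(quad, x.ravel() @ Prec @ x.ravel())
```

Since L is positive semi-definite, the α term makes the prior precision strictly positive definite, so the Gaussian prior is proper.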
This prior can be written as a multivariate normal distribution on the concatenated x:

p(x|G, α) = N(x|0, Π), where Ω^{-1} = 2L ⊗ I_{d_x} and Π^{-1} = α I_{n d_x} + Ω^{-1}.

Prior on C   We assume that the linear maps corresponding to neighbouring points are similar in terms of Frobenius norm (thus favouring a smooth manifold of low curvature). This gives

log p({C_1, ..., C_n}|G) = −(ε/2) ‖ Σ_{i=1}^n C_i ‖²_F − (1/2) Σ_{i=1}^n Σ_{j=1}^n η_ij ‖C_i − C_j‖²_F − log Z_C
                         = −(1/2) Tr[ (ε J J^T + Ω^{-1}) C^T C ] − log Z_C,   (4)

where J := 1_n ⊗ I_{d_x}. The second line corresponds to the matrix normal density, giving p(C|G) = MN(C|0, I_{d_y}, (ε J J^T + Ω^{-1})^{-1}) as the prior on C. In our implementation, we fix ε to a small value², since the magnitude of the product C_i(x_i − x_j) is determined by optimising the hyperparameter α above.

²ε sets the scale of the average linear map, ensuring the prior precision matrix is invertible.

Figure 2: Graphical representation of the generative process in LL-LVM. Given a dataset, we construct a neighbourhood graph G. The distribution over the latent variable x is controlled by the graph G as well as the parameter α. The distribution over the linear map C is also governed by the graph G. The latent variable x and the linear map C together determine the data likelihood.

Likelihood   Under the local-linearity assumption, we penalise the approximation error of Eq. 
(1), which yields the log likelihood

log p(y|C, x, V, G) = −(ε/2) Σ_{i=1}^n ‖y_i‖² − (1/2) Σ_{i=1}^n Σ_{j=1}^n η_ij (Δy_{j,i} − C_i Δx_{j,i})^T V^{-1} (Δy_{j,i} − C_i Δx_{j,i}) − log Z_y,   (5)

where Δy_{j,i} = y_j − y_i and Δx_{j,i} = x_j − x_i.³ Thus, y is drawn from a multivariate normal distribution given by

p(y|C, x, V, G) = N(y|μ_y, Σ_y),

with Σ_y^{-1} = (ε 1_n 1_n^T) ⊗ I_{d_y} + 2L ⊗ V^{-1}, μ_y = Σ_y e, and e = [e_1^T, ..., e_n^T]^T ∈ R^{n d_y}, where e_i = −Σ_{j=1}^n η_ji V^{-1} (C_j + C_i) Δx_{j,i}. For computational simplicity, we assume V^{-1} = γ I_{d_y}. The graphical representation of the generative process underlying the LL-LVM is given in Fig. 2.

3 Variational inference

Our goal is to infer the latent variables (x, C) as well as the parameters θ = {α, γ} in LL-LVM. We infer them by maximising the lower bound L of the marginal likelihood of the observations:

log p(y|G, θ) ≥ ∫∫ q(C, x) log [ p(y, C, x|G, θ) / q(C, x) ] dx dC := L(q(C, x), θ).   (6)

Following the common treatment for computational tractability, we assume that the posterior over (C, x) factorises as q(C, x) = q(x) q(C) [13]. We maximise the lower bound w.r.t. q(C, x) and θ by the variational expectation-maximisation algorithm [14], which consists of (1) the variational expectation step for computing q(C, x) by

q(x) ∝ exp[ ∫ q(C) log p(y, C, x|G, θ) dC ],   (7)

q(C) ∝ exp[ ∫ q(x) log p(y, C, x|G, θ) dx ],   (8)

then (2) the maximisation step for estimating θ by θ̂ = arg max_θ L(q(C, x), θ).

Variational-E step   Computing q(x) from Eq. 
(5) as a quadratic function in x:

p(y|C, x, θ, G) = (1/Z̃_x) exp[ −(1/2) (x^T A x − 2 x^T b) ],

where the normaliser Z̃_x collects all the terms from Eq. (5) that do not depend on x. Let L̃ := (ε 1_n 1_n^T + 2γL)^{-1}. The matrix A := A_E^T Σ_y A_E = [A_ij]_{i,j=1}^n ∈ R^{n d_x × n d_x}, where the (i, j)th d_x × d_x block is A_ij = Σ_{p=1}^n Σ_{q=1}^n L̃(p, q) A_E(p, i)^T A_E(q, j), and the (i, j)th (d_y × d_x) block of A_E ∈ R^{n d_y × n d_x} is given by A_E(i, j) = −η_ij V^{-1} (C_j + C_i) + δ_ij [ Σ_k η_ik V^{-1} (C_k + C_i) ]. The vector b = [b_1^T, ..., b_n^T]^T ∈ R^{n d_x} has d_x-dimensional components b_i = Σ_{j=1}^n η_ij ( C_j^T V^{-1} (y_i − y_j) − C_i^T V^{-1} (y_j − y_i) ). The likelihood combined with the prior on x gives us the Gaussian posterior over x (i.e., solving Eq. (7)):

q(x) = N(x|μ_x, Σ_x), where Σ_x^{-1} = ⟨A⟩_{q(C)} + Π^{-1} and μ_x = Σ_x ⟨b⟩_{q(C)}.   (9)

³The ε term centres the data and ensures that the distribution can be normalised. It applies in a subspace orthogonal to that modelled by x and C, and so its value does not affect the resulting manifold model.

Figure 3: A simulated example. A: 400 data points drawn from a Swiss Roll. B: true latent points (x) in 2D used for generating the data. C: posterior mean of C and D: posterior mean of x after 50 EM iterations given k = 9, which was chosen by maximising the lower bound across different k's. E: average lower bounds as a function of k. Each point is an average across 10 random seeds.

Similarly, computing q(C) from Eq. (8) requires rewriting the likelihood in Eq. (5) as a quadratic function in C:

p(y|C, x, G, θ) = (1/Z̃_C) exp[ −(1/2) Tr( Γ C^T C − 2 C^T V^{-1} H ) ],   (10)

where the normaliser Z̃_C collects all the terms from Eq. (5) that do not depend on C, and Γ := Q L̃ Q^T. The matrix Q = [q_1, q_2, ..., q_n] ∈ R^{n d_x × n}, where the jth subvector of the ith column is q_i(j) = η_ij V^{-1} (x_i − x_j) + δ_ij [ Σ_k η_ik V^{-1} (x_i − x_k) ] ∈ R^{d_x}. We define H = [H_1, ..., H_n] ∈ R^{d_y × n d_x}, whose ith block is H_i = Σ_{j=1}^n η_ij (y_j − y_i)(x_j − x_i)^T. The likelihood combined with the prior on C gives us the Gaussian posterior over C (i.e., solving Eq. (8)):

q(C) = MN(C|μ_C, I, Σ_C), where Σ_C^{-1} := ⟨Γ⟩_{q(x)} + ε J J^T + Ω^{-1} and μ_C = V^{-1} ⟨H⟩_{q(x)} Σ_C.   (11)

The expected values of A, b, Γ and H are given in the Appendix.

Variational-M step   We set the parameters by maximising L(q(C, x), θ) w.r.t. θ, which splits into two terms according to the dependence on each parameter: (1) the expected log-likelihood for updating V by arg max_V E_{q(x)q(C)}[ log p(y|C, x, V, G) ]; and (2) the negative KL divergence between the prior and the posterior on x for updating α by arg max_α E_{q(x)q(C)}[ log p(x|G, α) − log q(x) ]. The update rules for each hyperparameter are given in the Appendix.

The full EM algorithm⁴ starts with an initial value of θ. In the E-step, given q(C), compute q(x) as in Eq. (9). Likewise, given q(x), compute q(C) as in Eq. (11). The parameters θ are updated in the M-step by maximising Eq. (6). The two steps are repeated until the variational lower bound in Eq. (6) saturates. To give a sense of how the algorithm works, we visualise fitting results for a simulated example in Fig. 3. 
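To make one piece of the variational updates above concrete, the statistic H = [H_1, ..., H_n] with H_i = Σ_j η_ij (y_j − y_i)(x_j − x_i)^T, which enters the posterior mean μ_C, can be assembled directly. A minimal numpy sketch on made-up data with a ring-graph adjacency (hypothetical; not the authors' released implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
n, dy, dx = 12, 5, 2
y = rng.standard_normal((n, dy))   # made-up observations
x = rng.standard_normal((n, dx))   # made-up latent coordinates

# hypothetical symmetric adjacency (eta): a ring graph for illustration
G = np.zeros((n, n))
for i in range(n):
    G[i, (i - 1) % n] = G[i, (i + 1) % n] = 1.0

# H = [H_1, ..., H_n] with H_i = sum_j eta_ij (y_j - y_i)(x_j - x_i)^T
H = np.zeros((dy, n * dx))
for i in range(n):
    Hi = sum(G[i, j] * np.outer(y[j] - y[i], x[j] - x[i]) for j in range(n))
    H[:, i * dx:(i + 1) * dx] = Hi
```

Each block H_i only touches the neighbours of point i, so in practice the sum runs over the sparse edge set rather than all n points.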
Using the graph constructed from the 3D observations for different values of k, we run our EM algorithm. The posterior means of x and C given the optimal k, chosen by the maximum lower bound, resemble the true manifolds in the 2D and 3D spaces, respectively.

Out-of-sample extension   In the LL-LVM model one can formulate a computationally efficient out-of-sample extension technique as follows. Given n data points denoted by D = {y_1, ..., y_n}, the variational EM algorithm derived in the previous section converts D into the posterior q(x)q(C): D ↦ q(x)q(C). Now, given a new high-dimensional data point y*, one can first find the neighbourhood of y* without changing the current neighbourhood graph. Then, it is possible to compute the distributions over the corresponding locally linear map and latent variable q(C*, x*) by simply performing the E-step given q(x)q(C) (keeping all other quantities fixed): D ∪ {y*} ↦ q(x)q(C)q(x*)q(C*).

⁴An implementation is available from http://www.gatsby.ucl.ac.uk/resources/lllvm.

Figure 4: Resolving short-circuiting problems using the variational lower bound. A: visualisation of 400 samples drawn from a Swiss Roll in 3D space. Points 28 (red) and 29 (blue) are close to each other (dotted grey) in 3D. B: visualisation of the 400 samples on the latent 2D manifold. The distance between points 28 and 29 is seen to be large. C: posterior mean of x with/without short-circuiting the 28th and the 29th data points in the graph construction. LL-LVM achieves a higher lower bound when the shortcut is absent. The red and blue parts are mixed in the resulting estimate in 2D space (right) when there is a shortcut. The lower bound is obtained after 50 EM iterations.

Comparison to GP-LVM   A closely related probabilistic dimensionality reduction algorithm to LL-LVM is GP-LVM [7]. GP-LVM defines the mapping from the latent space to the data space using Gaussian processes. The likelihood of the observations Y = [y_1, ..., y_{d_y}] ∈ R^{n × d_y} (y_k is the vector formed by the kth element of all n high-dimensional vectors) given the latent variables X = [x_1, ..., x_{d_x}] ∈ R^{n × d_x} is defined by p(Y|X) = Π_{k=1}^{d_y} N(y_k|0, K_nn + β^{-1} I_n), where the (i, j)th element of the covariance matrix is of the exponentiated quadratic form k(x_i, x_j) = σ_f² exp[ −(1/2) Σ_{q=1}^{d_x} α_q (x_{i,q} − x_{j,q})² ] with smoothness-scale parameters {α_q} [8]. In LL-LVM, once we integrate out C from Eq. (5), we also obtain a Gaussian likelihood given x:

p(y|x, G, θ) = ∫ p(y|C, x, G, θ) p(C|G, θ) dC = (1/Z_y) exp[ −(1/2) y^T K_LL^{-1} y ].

In contrast to GP-LVM, the precision matrix K_LL^{-1} = (2L ⊗ V^{-1}) − (W ⊗ V^{-1}) Λ (W^T ⊗ V^{-1}) depends on the graph Laplacian matrix through W and Λ. Therefore, in LL-LVM, the graph structure directly determines the functional form of the conditional precision.

4 Experiments

4.1 Mitigating the short-circuit problem

Like other neighbour-based methods, LL-LVM is sensitive to misspecified neighbourhoods; the prior, likelihood, and posterior all depend on the assumed graph. Unlike other methods, LL-LVM provides a natural way to evaluate possible short-circuits using the variational lower bound of Eq. (6). Fig. 4 shows 400 samples drawn from a Swiss Roll in 3D space (Fig. 4A). Two points, labelled 28 and 29, happen to fall close to each other in 3D, but are actually far apart on the latent (2D) surface (Fig. 4B). A k-nearest-neighbour graph might link these, distorting the recovered coordinates. 
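The out-of-sample extension described earlier in section 3 rests on averaging locally linear predictions y_i + C_i(x* − x_i) over neighbours. A small numpy sanity check of that reconstruction rule (entirely hypothetical data; the recovery is exact only because this toy manifold is globally linear, so every estimated map C_i equals the true map M):

```python
import numpy as np

rng = np.random.default_rng(3)
dy, dx, k = 5, 2, 4
M = rng.standard_normal((dy, dx))      # a single global linear map (idealised case)

x_nb = rng.standard_normal((k, dx))    # latent coordinates of the k neighbours
y_nb = x_nb @ M.T                      # their observations, exactly linear here
C = np.repeat(M[None], k, axis=0)      # posterior-mean maps; all equal to M here

x_star = rng.standard_normal(dx)       # latent coordinate of a query point
y_hat = np.mean([y_nb[i] + C[i] @ (x_star - x_nb[i]) for i in range(k)], axis=0)

# in this idealised linear setting the reconstruction is exact
assert np.allclose(y_hat, M @ x_star)
```

On a curved manifold the same rule gives only a first-order (tangent-space) approximation, with error growing with the curvature and the neighbourhood size.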
However, evaluating the model without this edge (the correct graph) yields a higher variational bound (Fig. 4C): 1151.5 without the shortcut versus 1119.4 with it. Although it is prohibitive to evaluate every possible graph in this way, the availability of a principled criterion to test specific hypotheses is of obvious value.

In the following, we demonstrate LL-LVM on two real datasets: handwritten digits and climate data.

4.2 Modelling USPS handwritten digits

As a first real-data example, we test our method on a subset of 80 samples each of the digits 0, 1, 2, 3, 4 from the USPS digit dataset, where each digit image is of size 16 × 16 (i.e., n = 400, d_y = 256). We follow [7], and represent the low-dimensional latent variables in 2D.

Figure 5: USPS handwritten digit dataset described in section 4.2. A: mean (solid) and variance (shading of one standard deviation) of the variational lower bound across 10 different random starts of the EM algorithm with different k's. The highest lower bound is achieved when k = n/80. B: the posterior mean of x in 2D. Each digit is colour coded. On the right side are reconstructions of y* for randomly chosen query points x*. Using neighbouring y and posterior means of C we can recover y* successfully (see text). C: fitting results by GP-LVM using the same data. D: ISOMAP (k = 30) and E: LLE (k = 40). Using the extracted features (in 2D), we evaluated a 1-NN classifier for digit identity with 10-fold cross-validation (the same data divided into 10 training and test sets). The classification error is shown in F. LL-LVM features yield comparably low error to GP-LVM and ISOMAP.

Fig. 5A shows variational lower bounds for different values of k, using 9 different EM initialisations. The posterior mean of x obtained from LL-LVM using the best k is illustrated in Fig. 5B. Fig. 
5B also shows reconstructions of one randomly selected example of each digit, using its 2D coordinates x*, as well as the posterior mean coordinates x̂_i, tangent spaces Ĉ_i and actual images y_i of its k = n/80 closest neighbours. The reconstruction is based on the assumed tangent-space structure of the generative model (Eq. (5)), that is: ŷ* = (1/k) Σ_{i=1}^k [ y_i + Ĉ_i (x* − x̂_i) ]. A similar process could be used to reconstruct digits at out-of-sample locations. Finally, we quantify the relevance of the recovered subspace by computing the error incurred when a simple classifier reports digit identity using the 2D features obtained by LL-LVM and various competing methods (Fig. 5C-F). Classification with LL-LVM coordinates performs similarly to GP-LVM and ISOMAP (k = 30), and outperforms LLE (k = 40).

4.3 Mapping climate data

In this experiment, we attempted to recover the 2D geographical relationships between weather stations from recorded monthly precipitation patterns. Data were obtained by averaging month-by-month annual precipitation records from 2005-2014 at 400 weather stations scattered across the US (see Fig. 6)⁵. Thus, the data set comprised 400 12-dimensional vectors. The goal of the experiment is to recover the two-dimensional topology of the weather stations (as given by their latitude and longitude) using only these 12-dimensional climatic measurements. As before, we compare the projected points obtained by LL-LVM with several widely used dimensionality reduction techniques. For the graph-based methods LL-LVM, LTSA, ISOMAP, and LLE, we used 12-NN with Euclidean distance to construct the neighbourhood graph.

⁵The dataset is made available by the National Climatic Data Center at http://www.ncdc.noaa.gov/oa/climate/research/ushcn/. We use version 2.5 monthly data [15].

Figure 6: Climate modelling problem as described in section 4.3. Each example corresponding to a weather station is a 12-dimensional vector of monthly precipitation measurements. Using only the measurements, the projection obtained from the proposed LL-LVM recovers the topological arrangement of the stations to a large degree. Panels: (a) 400 weather stations, (b) LLE, (c) LTSA, (d) ISOMAP, (e) GP-LVM, (f) LL-LVM.

The results are presented in Fig. 6. LL-LVM identified a more geographically accurate arrangement for the weather stations than the other algorithms. The fully probabilistic nature of LL-LVM and GP-LVM allowed these algorithms to handle the noise present in the measurements in a principled way. This contrasts with ISOMAP, which can be topologically unstable [16], i.e. vulnerable to short-circuit errors if the neighbourhood is too large. Perhaps coincidentally, LL-LVM also seems to respect local geography more fully in places than does GP-LVM.

5 Conclusion

We have demonstrated a new probabilistic approach to non-linear manifold discovery that embodies the central notion that local geometries are mapped linearly between manifold coordinates and high-dimensional observations. 
The approach offers a natural variational algorithm for learning, quantifies local uncertainty in the manifold, and permits evaluation of hypothetical neighbourhood relationships.

In the present study, we have described the LL-LVM model conditioned on a neighbourhood graph. In principle, it is also possible to extend LL-LVM so as to construct a distance matrix as in [17], by maximising the data likelihood. We leave this as a direction for future work.

Acknowledgments

The authors were funded by the Gatsby Charitable Foundation.

References

[1] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, pages 585-591, 2002.
[3] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.
[4] L. J. P. van der Maaten, E. O. Postma, and H. J. van den Herik. Dimensionality reduction: A comparative review, 2008. http://www.iai.uni-bonn.de/~jz/dimensionality_reduction_a_comparative_review.pdf.
[5] L. Cayton. Algorithms for manifold learning. Univ. of California at San Diego Tech. Rep., pages 1-17, 2005. http://www.lcayton.com/resexam.pdf.
[6] J. Platt. FastMap, MetricMap, and Landmark MDS are all Nyström algorithms. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, pages 261-268, 2005.
[7] N. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In NIPS, pages 329-336, 2003.
[8] M. K. Titsias and N. D. Lawrence. Bayesian Gaussian process latent variable model. In AISTATS, pages 844-851, 2010.
[9] S. Roweis, L. Saul, and G. Hinton. 
Global coordination of local linear models. In NIPS, pages 889-896, 2002.
[10] M. Brand. Charting a manifold. In NIPS, pages 961-968, 2003.
[11] Y. Zhan and J. Yin. Robust local tangent space alignment. In NIPS, pages 293-301, 2009.
[12] J. Verbeek. Learning nonlinear image manifolds by global alignment of local linear models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(8):1236-1250, 2006.
[13] C. Bishop. Pattern Recognition and Machine Learning. Springer New York, 2006.
[14] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Unit, University College London, 2003.
[15] M. Menne, C. Williams, and R. Vose. The U.S. historical climatology network monthly temperature data, version 2.5. Bulletin of the American Meteorological Society, 90(7):993-1007, July 2009.
[16] M. Balasubramanian and E. L. Schwartz. The Isomap algorithm and topological stability. Science, 295(5552):7, January 2002.
[17] N. Lawrence. Spectral dimensionality reduction via maximum entropy. In AISTATS, pages 51-59, 2011.
", "award": [], "sourceid": 79, "authors": [{"given_name": "Mijung", "family_name": "Park", "institution": "UCL"}, {"given_name": "Wittawat", "family_name": "Jitkrittum", "institution": "Gatsby Unit, UCL"}, {"given_name": "Ahmad", "family_name": "Qamar", "institution": null}, {"given_name": "Zoltan", "family_name": "Szabo", "institution": "Gatsby Unit, UCL"}, {"given_name": "Lars", "family_name": "Buesing", "institution": null}, {"given_name": "Maneesh", "family_name": "Sahani", "institution": null}]}