{"title": "Automatic Alignment of Local Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 865, "page_last": 872, "abstract": null, "full_text": "Automatic Alignment of Local Representations\n\nYee Whye Teh and Sam Roweis\n\nDepartment of Computer Science, University of Toronto\n\n ywteh,roweis\n\n@cs.toronto.edu\n\nAbstract\n\nWe present an automatic alignment procedure which maps the disparate\ninternal representations learned by several local dimensionality reduction\nexperts into a single, coherent global coordinate system for the original\ndata space. Our algorithm can be applied to any set of experts, each\nof which produces a low-dimensional local representation of a high-\ndimensional input. Unlike recent efforts to coordinate such models by\nmodifying their objective functions [1, 2], our algorithm is invoked after\ntraining and applies an ef\ufb01cient eigensolver to post-process the trained\nmodels. The post-processing has no local optima and the size of the sys-\ntem it must solve scales with the number of local models rather than the\nnumber of original data points, making it more ef\ufb01cient than model-free\nalgorithms such as Isomap [3] or LLE [4].\n\n1 Introduction: Local vs. Global Dimensionality Reduction\nBeyond density modelling, an important goal of unsupervised learning is to discover com-\npact, informative representations of high-dimensional data. If the data lie on a smooth low\ndimensional manifold, then an excellent encoding is the coordinates internal to that man-\nifold. The process of determining such coordinates is dimensionality reduction. Linear\ndimensionality reduction methods such as principal component analysis and factor analy-\nsis are easy to train but cannot capture the structure of curved manifolds. 
Mixtures of these simple unsupervised models [5, 6, 7, 8] have been used to perform local dimensionality reduction, and can provide good density models for curved manifolds, but unfortunately such mixtures cannot do dimensionality reduction. They do not describe a single, coherent low-dimensional coordinate system for the data since there is no pressure for the local coordinates of each component to agree.

Roweis et al. [1] recently proposed a model which performs global coordination of local coordinate systems in a mixture of factor analyzers (MFA). Their model is trained by maximizing the likelihood of the data, with an additional variational penalty term to encourage the internal coordinates of the factor analyzers to agree. While their model can trade off modelling the data and having consistent local coordinate systems, it requires a user-specified trade-off parameter, training is quite inefficient (although [2] describes an improved training algorithm for a more constrained model), and it has quite serious local minima problems (methods like LLE [4] or Isomap [3] have to be used for initialization).

In this paper we describe a novel, automatic way to align the hidden representations used by each component of a mixture of dimensionality reducers into a single global representation of the data throughout space. Given an already trained mixture, the alignment is achieved by applying an eigensolver to a matrix constructed from the internal representations of the mixture components.
Our method is efficient, simple to implement, and has no local optima in its optimization nor any learning rates or annealing schedules.

2 The Locally Linear Coordination Algorithm

Suppose we have a set of data points given by the rows of X = [x_1, x_2, ..., x_N]^T from a D-dimensional space, which we assume are sampled from a d << D dimensional manifold. We approximate the manifold coordinates using images Y = [y_1, y_2, ..., y_N]^T in a d-dimensional embedding space. Suppose also that we have already trained, or have been given, a mixture of K local dimensionality reducers. The k-th reducer produces a d_k-dimensional internal representation z_nk for data point x_n as well as a "responsibility" r_nk >= 0 describing how reliable the k-th reducer's representation of x_n is. These satisfy sum_k r_nk = 1 and can be obtained, for example, using a gating network in a mixture of experts, or the posterior probabilities in a probabilistic network. Notice that the manifold coordinates and internal representations need not have the same number of dimensions.

Given the data, internal representations, and responsibilities, our algorithm automatically aligns the various hidden representations into a single global coordinate system. Two key ideas motivate the method. First, to use a convex cost function whose unique minimum is attained at the desired global coordinates. Second, to restrict the global coordinates y_n to depend on the data x_n only through the local representations z_nk and responsibilities r_nk, thereby leveraging the structure of the mixture model to regularize and reduce the effective size of the optimization problem.
In effect, rather than working with individual data points, we work with large groups of points belonging to particular submodels.

We first parameterize the global coordinates y_n in terms of r_nk and z_nk. Given an input x_n, each local model infers its internal coordinates z_nk and then applies a linear projection L_k and offset l_k^0 to these to obtain its guess at the global coordinates. The final global coordinates y_n are obtained by averaging the guesses using the responsibilities as weights:

    y_n = sum_k r_nk (L_k z_nk + l_k^0) = sum_j u_nj l_j,    (1)

where, under a vectorized index j = (k, i), u_nj = r_nk z_nk^i with z_nk^i the i-th entry of z_nk, and l_j = l_k^i is the i-th column of L_k (with u_nj = r_nk and l_j = l_k^0 for the bias index i = 0). This process is described in figure 1. For compactness, j is an invertible mapping from the domain of (k, i) to {1, 2, ..., J}, with J = sum_k (d_k + 1). Now define the N x J matrix U with entries u_nj and the J x d matrix L whose j-th row is l_j^T. Then (1) becomes a system of linear equations (2) with fixed U and unknown parameters L:

    Y = U L.    (2)

Figure 1: Obtaining global coordinates from data via responsibility-weighted local coordinates. (The high-dimensional data x_n are passed through the local dimensionality reduction models to give local coordinates z_nk and responsibilities r_nk; the responsibility-weighted local representations u_nj are combined with the alignment parameters l_j to give the global coordinates y_n.)

The key assumption, which we have emphasized by re-expressing y_n above, is that the mapping between the local representations and the global coordinates is linear in each of z_nk, r_nk and the unknown parameters l_j. Crucially, however, the mapping between the original data x_n and the images y_n is highly non-linear since it depends on the multiplication of responsibilities and internal coordinates which are in turn non-linearly related to the data x_n through the inference procedure of the mixture model.

We now consider determining L according to some given cost function phi(Y). For this we advocate using a convex phi(Y). Notice that since Y is linear in L, phi is convex in L as well, and there is a unique optimum that can be computed efficiently using a variety of methods. This is still true if we also have feasible convex constraints on Y. The case where the cost and constraints are both quadratic is particularly appealing since we can use an eigensolver to find the optimal L. In particular suppose phi(Y) = tr(Y^T S Y) and the constraints are Y^T C Y = I, where S and C are matrices defining the cost and constraints.
This gives:

    phi(Y) = tr(Y^T S Y) = tr(L^T U^T S U L),    Y^T C Y = L^T U^T C U L = I,    (3)

where tr(.) is the trace operator. The matrices S and C are typically obtained from the original data and summarize the essential geometries among them. The solution to the constrained minimization above is given by the d smallest generalized eigenvectors v of U^T S U v = lambda U^T C U v. In particular, the columns of L are given by these generalized eigenvectors.

Below, we investigate a cost function based on the Locally Linear Embedding (LLE) algorithm of Roweis and Saul [4]. We call the resulting algorithm Locally Linear Coordination (LLC). The idea of LLE is to preserve the same locally linear relationships between the original data points x_n and their counterparts y_n. We identify for each point x_n its nearest neighbours x_m, m in N_n, and then minimize

    E(W) = sum_n || x_n - sum_{m in N_n} w_nm x_m ||^2    (4)

with respect to W, subject to the constraints sum_m w_nm = 1. The weights are unique(1) and can be solved for efficiently using constrained least squares (since solving for W is decoupled across n). The weights summarize the local geometries relating the data points to their neighbours; hence, to preserve these relationships among the coordinates y_n, we arrange to minimize the same cost but with respect to Y instead:

    Phi(Y) = sum_n || y_n - sum_{m in N_n} w_nm y_m ||^2.    (5)

Phi(Y) is invariant to translations and rotations of Y, and scales as we scale Y. In order to break these degeneracies we enforce the following constraints:

    (1/N) sum_n y_n = (1/N) 1^T Y = 0,    (6)
    (1/N) sum_n y_n y_n^T = (1/N) Y^T Y = I,    (7)

where 1 is a vector of 1's.
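To make the weight-solving step concrete, here is a minimal numpy sketch of the constrained least-squares solve for the w_nm (illustrative function and variable names of our own choosing, not the implementation used for the experiments):

```python
import numpy as np

def lle_weights(X, n_neighbors=5, reg=1e-3):
    """Reconstruction weights w_nm minimizing eq. (4), each row summing to one."""
    N = X.shape[0]
    W = np.zeros((N, N))
    for n in range(N):
        # nearest neighbours of x_n (excluding x_n itself)
        dists = np.linalg.norm(X - X[n], axis=1)
        nbrs = np.argsort(dists)[1:n_neighbors + 1]
        # Gram matrix of the neighbours centred on x_n
        G = (X[nbrs] - X[n]) @ (X[nbrs] - X[n]).T
        # a small ridge keeps the solve well posed when the neighbours
        # outnumber the input dimensions (cf. the footnote on uniqueness)
        G += (reg * np.trace(G) + 1e-12) * np.eye(n_neighbors)
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[n, nbrs] = w / w.sum()  # enforce sum_m w_nm = 1
    return W
```

Solving for W decouples across points, so the loop can be vectorized or parallelized; only the nearest-neighbour search touches all N points.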
For this choice, the cost function and constraints above become:

    Phi(Y) = tr( Y^T (I - W)^T (I - W) Y ) = tr( L^T A L ),    (8)
    (1/N) Y^T Y = L^T B L = I,    (9)

with cost and constraint matrices

    A = U^T (I - W)^T (I - W) U,        B = (1/N) U^T U.

(1) In the unusual case where the number of neighbours is larger than the dimensionality of the data D, simple regularization of the norm of the weights once again makes them unique.

As shown previously, the solution to this problem is given by the smallest generalized eigenvectors of A v = lambda B v. To satisfy the centering constraint (6), we need to find eigenvectors whose embedding is orthogonal to the vector 1. Fortunately, the vector v_0 satisfying U v_0 = 1 (it selects the bias entries of each u_n, which sum to sum_k r_nk = 1) is itself the smallest generalized eigenvector, corresponding to an eigenvalue of 0, and every other generalized eigenvector is B-orthogonal to it, so discarding v_0 enforces (6) automatically.
Hence the solution to the problem is given by the 2nd to (d+1)st smallest generalized eigenvectors instead.

LLC Alignment Algorithm:

- Using data X, compute local linear reconstruction weights w_nm using (4).
- Train or receive a pre-trained mixture of local dimensionality reducers. Apply this mixture to X, obtaining a local representation z_nk and responsibility r_nk for each submodel k and each data point x_n.
- Form the matrix U with entries u_nj = r_nk z_nk^i and calculate A and B from (8) and (9).
- Find the eigenvectors corresponding to the smallest d+1 eigenvalues of the generalized eigenvalue system A v = lambda B v.
- Let L be the matrix with columns formed by the 2nd to (d+1)st smallest eigenvectors. Return the j-th row of L as alignment weight l_j. Return the global manifold coordinates as Y = U L.

Note that the edge size of the matrices A and B is J = sum_k (d_k + 1), which scales with the number of components and dimensions of the local representations but not with the number of data points N. As a result, solving for the alignment weights is much more efficient than the original LLE computation (or those of Isomap), which requires solving an eigenvalue system of edge size N. In effect, we have leveraged the mixture of local models to collapse large groups of points together and worked only with those groups rather than the original data points.
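As a sanity-check sketch of the algorithm box above, the alignment step reduces to a few lines of dense linear algebra once the reconstruction weights and the mixture's outputs are in hand (a numpy/scipy sketch; the helper name and the `R`, `Z`, `W` conventions are ours):

```python
import numpy as np
from scipy.linalg import eigh

def llc_align(R, Z, W, d):
    """Global coordinates Y from responsibilities R (N x K), local
    coordinates Z (a list of K arrays, Z[k] of shape (N, d_k)), and
    LLE reconstruction weights W (N x N)."""
    N, K = R.shape
    # Responsibility-weighted local coordinates plus one bias column per
    # submodel, so that Y = U @ L reproduces eq. (1); J = sum_k (d_k + 1).
    U = np.hstack([np.hstack([R[:, [k]] * Z[k], R[:, [k]]]) for k in range(K)])
    M = np.eye(N) - W
    A = U.T @ M.T @ M @ U   # cost matrix (J x J)
    B = U.T @ U / N         # constraint matrix (J x J)
    # Smallest generalized eigenvectors of A v = lambda B v; the very
    # smallest (eigenvalue 0, the constant solution) is discarded.
    lam, V = eigh(A, B)
    L = V[:, 1:d + 1]
    return U @ L
```

The eigenvalue system has edge size J rather than N, which is where the computational savings over plain LLE come from.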
Notice however that the computation of the weights w_nm still requires determining the neighbours of the original data points, which scales as O(N^2) in the worst case.

Coordination with LLC also yields a mixture of noiseless factor analyzers over the global coordinate space y, with the k-th factor analyzer having mean l_k^0 and factor loading L_k. Given any global coordinates y, we can infer the responsibilities r_k and the posterior means z_k over the latent space of each factor analyzer. If our original local dimensionality reducers also support computing x from r_k and z_k, we can now infer the high-dimensional mean data point x which corresponds to the global coordinates y. This allows us to perform operations like visualization and interpolation using the global coordinate system. This is the method we used to infer the images in figures 4 and 5 in the next section.

3 Experimental Results using Mixtures of Factor Analyzers

The alignment computation we have described is applicable to any mixture of local dimensionality reducers. In our experiments, we have used the most basic such model: a mixture of factor analyzers (MFA) [8]. The k-th factor analyzer in the mixture describes a probabilistic linear mapping from a latent variable z_k to the data x with additive Gaussian noise. The model assumes that the data manifold is locally linear and it is this local structure that is captured by each factor analyzer.
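Both quantities that LLC consumes from such a mixture — the responsibility r_k and the posterior mean of z_k given x — follow from standard Gaussian identities. A minimal numpy sketch (illustrative naming; it assumes a shared spherical noise covariance and is not the implementation used for the experiments):

```python
import numpy as np

def mfa_posteriors(X, pi, mu, Lam, sigma2):
    """Responsibilities r_nk and posterior latent means E[z_k | x_n] for a
    mixture of factor analyzers with shared spherical noise variance sigma2.
    pi: (K,) mixing weights; mu: (K, D) means; Lam: list of (D, d_k) loadings."""
    N, D = X.shape
    K = len(pi)
    log_r = np.zeros((N, K))
    Zmean = []
    for k in range(K):
        C = Lam[k] @ Lam[k].T + sigma2 * np.eye(D)  # marginal covariance of x
        Cinv = np.linalg.inv(C)
        _, logdet = np.linalg.slogdet(C)
        diff = X - mu[k]
        log_r[:, k] = (np.log(pi[k])
                       - 0.5 * (logdet + D * np.log(2 * np.pi))
                       - 0.5 * np.sum(diff @ Cinv * diff, axis=1))
        # E[z | x, k] = Lam^T (Lam Lam^T + sigma2 I)^{-1} (x - mu)
        Zmean.append(diff @ Cinv @ Lam[k])
    log_r -= log_r.max(axis=1, keepdims=True)  # stabilized normalization
    R = np.exp(log_r)
    R /= R.sum(axis=1, keepdims=True)
    return R, Zmean
```

The returned R and Zmean play the roles of the responsibilities r_nk and local representations z_nk in the alignment step.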
The non-linearity in the data manifold is handled by patching multiple factor analyzers together, each handling a locally linear region.

MFAs are trained in an unsupervised way by maximizing the marginal log likelihood of the observed data, and parameter estimation is typically done using the EM algorithm.(2)

(2) In our experiments, we initialized the parameters by drawing the means from the global covariance of the data and setting the factors to small random values. We also simplified the factor analyzers to share the same spherical noise covariance, although this is not essential to the process.

Figure 2: LLC on the S curve (A). There are 14 factor analyzers in the mixture (B), each with 2 latent dimensions. Each disk represents one of them with the two black lines being the factor loadings. After alignment by LLC (C), the curve is successfully unrolled; it is also possible to retroactively align the original data space models (D).

Figure 3: Unknotting the trefoil curve. We generated 6000 noisy points from the curve. Then we fit an MFA with 30 components with 1 latent dimension each (A), but aligned them in a 2D space (B). We used 10 neighbours to reconstruct each data point.

Since there is no constraint relating the various hidden variables z_k, an MFA trained only to maximize likelihood cannot learn a global coordinate system for the manifold that is consistent across every factor analyzer.
Hence this is a perfect model on which to apply automatic alignment. Naturally, we use the mean of z_k conditioned on the data x (assuming the k-th factor analyzer generated x) as the k-th local representation of x, while we use the posterior probability that the k-th factor analyzer generated x as the responsibility.

We illustrate LLC on two synthetic toy problems to give some intuition about how it works. The first problem is the S curve given in figure 2(A). An MFA trained on 1200 points sampled uniformly from the manifold with added noise (B) is able to model the linear structure of the curve locally; however, the internal coordinates of the factor analyzers are not aligned properly. We applied LLC to the local representations and aligned them in a 2D space (C). When solving for local weights, we used 12 neighbours to reconstruct each data point. We see that LLC has successfully unrolled the S curve onto the 2D space. Further, given the coordinate transforms produced by LLC, we can retroactively align the latent spaces of the MFAs (D). This is done by determining directions in the various latent spaces which get transformed to the same direction in the global space.

To emphasize the topological advantages of aligning representations into a space of higher dimensionality than the local coordinates used by each submodel, we also trained an MFA on data sampled from a trefoil curve, as shown in figure 3(A). The trefoil is a circle with a knot in 3D. As figure 3(B) shows, LLC connects these models into a ring of local topology faithful to the original data.

We applied LLC to MFAs trained on sets of real images believed to come from a complex manifold with few degrees of freedom. We studied face images of a single person under varying pose and expression changes and handwritten digits from the MNIST database. After training the MFAs, we applied LLC to align the models. The face models were aligned into a 2D space as shown in figure 4.

Figure 4: A map of reconstructions of the faces when the global coordinates are specified. Contours describe the likelihood of the coordinates. Note that some reconstructions around the edge of the map are not good because the model is extrapolating from the training images to regions of low likelihood. An MFA with 20 components and 8 latent dimensions each is used. It is trained on 1965 images. The weights w_nm are calculated using 36 neighbours.

The first dimension appears to describe changes in pose, while the second describes changes in expression. The digit models were aligned into a 3D space. Figure 5 (top) shows maps of reconstructions of the digits. The first dimension appears to describe the slant of each digit, the second the fatness of each digit, and the third the relative sizes of the upper to lower loops. Figure 5 (bottom) shows how LLC can smoothly interpolate between any two digits. In particular, the first row interpolates between left and right slanting digits, the second between fat and thin digits, the third between thick and thin line strokes, and the fourth between having a larger bottom loop and larger top loop.

4 Discussion and Conclusions

Previous work on nonlinear dimensionality reduction has usually emphasized either a parametric approach, which explicitly constructs a (sometimes probabilistic) mapping between the high-dimensional and low-dimensional spaces, or a nonparametric approach which merely finds low-dimensional images corresponding to high-dimensional data points but without probabilistic models or hidden variables. Compared to the global coordination model [1], the closest parametric approach to ours, our algorithm can be understood as post coordination, in which the latent spaces are coordinated after they have been fit to data. By decoupling the data fitting and coordination problems we gain efficiency and avoid local optima in the coordination phase.
Further, since we are just maximizing likelihood when fitting the original mixture to data, we can use a whole range of known techniques to escape local minima, and improve efficiency in the first phase as well.

Figure 5: Top: maps of reconstructions of digits when two global coordinates are specified, and the third integrated out. Left: 1st and 2nd coordinates specified; right: 2nd and 3rd. Bottom: Interpolating between two digits using LLC. In each row, we interpolate between the upper leftmost and rightmost digits. The LLC interpolants are spread out evenly along a line connecting the global coordinates of the two digits. For comparison, we show the 20 training images whose coordinates are closest to the line segment connecting those of the two digits at each side. An MFA with 50 components, each with 6 latent dimensions, is used. It is trained on 6000 randomly chosen digits from the combined training and test sets of 8's in MNIST. The weights w_nm were calculated using 36 neighbours.

On the nonparametric side, our approach can be compared to two recent algorithms, LLE [4] and Isomap [3]. The cost functions of LLE and Isomap are convex, so they do not suffer from the local minima problems of earlier methods [9, 10], but these methods must solve eigenvalue systems of size equal to the number of data points. (Although in LLE the systems are highly sparse.) Another problem is that neither LLE nor Isomap yields a probabilistic model or even a mapping between the data and embedding spaces. Compared to these models (which are run on individual data points), LLC uses as its primitives descriptions of the data provided by the individual local models. This makes the eigenvalue system to be solved much smaller and as a result the computational cost of the coordination phase of LLC is much less than that for LLE or Isomap.
(Note that the construction of the eigenvalue system still requires finding nearest neighbours for each point, which is costly.) Furthermore, if each local model describes a complete (probabilistic) mapping from data space to its latent space, the final coordinated model will also describe a (probabilistic) mapping from the whole data space to the coordinated embedding space.

Our alignment algorithm improves upon local embedding or density models by elevating their status to full global dimensionality reduction algorithms without requiring any modifications to their training procedures or cost functions. For example, using mixtures of factor analyzers (MFAs) as a test case, we show how our alignment method can allow a model previously suited only for density estimation to do complex operations on high dimensional data such as visualization and interpolation.

Brand [11] has recently proposed an approach, similar to ours, that coordinates local parametric models to obtain a globally valid nonlinear embedding function. Like LLC, his "charting" method defines a quadratic cost function and finds the optimal coordination directly and efficiently. However, charting is based on a cost function much closer in spirit to the original global coordination model and it instantiates one local model centred on each training point, so its scaling is the same as that of LLE and Isomap. In principle, Brand's method can be extended to work with fewer local models and our alignment procedure can be applied using the charting cost rather than the LLE cost we have pursued here.

Automatic alignment procedures emphasize a powerful but often overlooked interpretation of local mixture models.
Rather than considering the output of such systems to be a single quantity, such as a density estimate or an expert-weighted regression, it is possible to view them as networks which convert high-dimensional inputs into a vector of internal coordinates from each submodel, accompanied by responsibilities. As we have shown, this view can lead to efficient and powerful algorithms which allow separate local models to learn consistent global representations.

Acknowledgments

We thank Geoffrey Hinton for inspiration and interesting discussions, Brendan Frey and Yann LeCun for sharing their data sets, and the reviewers for helpful comments.

References

[1] S. Roweis, L. Saul, and G. E. Hinton. Global coordination of local linear models. In Advances in Neural Information Processing Systems, volume 14, 2002.

[2] J. J. Verbeek, N. Vlassis, and B. Kröse. Coordinating principal component analysers. In Proceedings of the International Conference on Artificial Neural Networks, 2002.

[3] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, December 2000.

[4] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, December 2000.

[5] K. Fukunaga and D. R. Olsen. An algorithm for finding intrinsic dimensionality of data. IEEE Transactions on Computers, 20(2):176-193, 1971.

[6] N. Kambhatla and T. K. Leen. Dimension reduction by local principal component analysis. Neural Computation, 9:1493-1516, 1997.

[7] M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analysers. Neural Computation, 11(2):443-482, 1999.

[8] Z. Ghahramani and G. E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, Department of Computer Science, 1996.

[9] T. Kohonen.
Self-organization and Associative Memory. Springer-Verlag, Berlin, 1988.

[10] C. Bishop, M. Svensén, and C. Williams. GTM: The generative topographic mapping. Neural Computation, 10:215-234, 1998.

[11] M. Brand. Charting a manifold. This volume, 2003.
", "award": [], "sourceid": 2180, "authors": [{"given_name": "Yee", "family_name": "Teh", "institution": null}, {"given_name": "Sam", "family_name": "Roweis", "institution": null}]}