{"title": "Minimax Embeddings", "book": "Advances in Neural Information Processing Systems", "page_first": 505, "page_last": 512, "abstract": "", "full_text": "Minimax embeddings\n\nMatthew Brand\n\nMitsubishi Electric Research Labs\n\nCambridge MA 02139 USA\n\nAbstract\n\nSpectral methods for nonlinear dimensionality reduction (NLDR) impose\na neighborhood graph on point data and compute eigenfunctions of a\nquadratic form generated from the graph. We introduce a more general\nand more robust formulation of NLDR based on the singular value de-\ncomposition (SVD). In this framework, most spectral NLDR principles\ncan be recovered by taking a subset of the constraints in a quadratic form\nbuilt from local nullspaces on the manifold. The minimax formulation\nalso opens up an interesting class of methods in which the graph is \u201cdec-\norated\u201d with information at the vertices, offering discrete or continuous\nmaps, reduced computational complexity, and immunity to some solu-\ntion instabilities of eigenfunction approaches. Apropos, we show almost\nall NLDR methods based on eigenvalue decompositions (EVD) have a so-\nlution instability that increases faster than problem size. This pathology\ncan be observed (and corrected via the minimax formulation) in problems\nas small as N < 100 points.\n\n1 Nonlinear dimensionality reduction (NLDR)\nSpectral NLDR methods are graph embedding problems where a set of N points X .=\n[x1,\u00b7\u00b7\u00b7 ,xN] \u2208 RD\u00d7N sampled from a low-dimensional manifold in a ambient space RD is\nreparameterized by imposing a neighborhood graph G on X and embedding the graph with\nminimal distortion in a \u201cparameterization\u201d space Rd, d < D. Typically the graph is sparse\nand local, with edges connecting points to their immediate neighbors. 
The embedding must keep these edges short or preserve their length (for isometry) or angles (for conformality). The graph-embedding problem was first introduced as a least-squares problem by Tutte [1], and as an eigenvalue problem by Fiedler [2]. The use of sparse graphs to generate metrics for least-squares problems has been studied intensely in the following three decades (see [3]). Modern NLDR methods use graph constraints to generate a metric in a space of embeddings R^N. Eigenvalue decomposition (EVD) gives the directions of least or greatest variance under this metric. Typically a subset of d extremal eigenvectors gives the embedding of N points in R^d parameterization space. This includes the IsoMap family [4], the locally linear embedding (LLE) family [5,6], and Laplacian methods [7,8]. Using similar methods, the Automatic Alignment [6] and Charting [9] algorithms embed local subspaces instead of points, and by combining subspace projections thus obtain continuous maps between R^D and R^d.\nThis paper introduces a general algebraic framework for computing optimal embeddings directly from graph constraints. The aforementioned methods can be recovered as special cases. The framework also suggests some new methods with very attractive properties, including continuous maps, reduced computational complexity, and control over the degree of conformality/isometry in the desired map. It also eliminates a solution instability that is intrinsic to EVD-based approaches. A perturbational analysis quantifies the instability.\n\n2 Minimax theorem for graph embeddings\n\nWe begin with a neighborhood graph specified by a nondiagonal weighted adjacency matrix M ∈ R^{N×N} that has the data-reproducing property XM = X (this can be relaxed to XM ≈ X in practice). 
The graph-embedding and NLDR literatures offer various constructions of M, each appropriate to different sets of assumptions about the original embedding and its sampling X (e.g., isometry, local linearity, noiseless samples, regular sampling, etc.). Typically M_{ij} ≠ 0 if points i, j are nearby on the intrinsic manifold and |M_{ij}| is small or zero otherwise. Each point is taken to be a linear or convex combination of its neighbors, and thus M specifies manifold connectivity in the sense that any nondegenerate embedding Y that satisfies YM ≈ Y with small residual ‖YM − Y‖_F will preserve this connectivity and the structure of local neighborhoods. For example, in barycentric embeddings, each point is the average of its neighbors and thus M_{ij} = 1/k if vertex i is connected to vertex j (of degree k). We will also consider three optional constraints on the embedding:\n\n1. A null-space restriction, where the solution must lie outside the column-space of C ∈ R^{N×M}, M < N. For example, it is common to stipulate that the solution Y be centered, i.e., YC = 0 for C = 1, the constant vector.\n\n2. A basis restriction, where the solution must be a linear combination of the rows of a basis Z ∈ R^{K×N}, K ≤ N. This can be thought of as information placed at the vertices of the graph that serves as example inputs for a target NLDR function. We will use this to construct dimension-reducing radial basis function networks.\n\n3. A metric S ∈ R^{N×N} that determines how error is distributed over the points. For example, it might be important that boundary points have less error. We assume that S is symmetric positive definite and has factorization S = AA^T (e.g., A could be a Cholesky factor of S).\n\nIn most settings, the optional matrices will default to the identity matrix. 
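For concreteness, the barycentric construction of M described above can be sketched in a few lines of numpy. This is my own illustration, not the paper's experimental setup: the k-nearest-neighbor rule, the toy arc data, and the function name are assumptions.

```python
import numpy as np

def barycentric_M(X, k=4):
    """Barycentric adjacency: column i of M holds weight 1/k on the k nearest
    neighbors of point i, so X @ M reconstructs each point as the average of
    its neighbors and X M ~= X for densely sampled smooth manifolds."""
    D, N = X.shape
    M = np.zeros((N, N))
    for i in range(N):
        dists = np.linalg.norm(X - X[:, [i]], axis=0)
        nbrs = np.argsort(dists)[1:k + 1]   # skip the point itself
        M[nbrs, i] = 1.0 / k
    return M

# Points sampled from a 1-D manifold (a noisy arc) embedded in R^2.
rng = np.random.default_rng(0)
t = np.linspace(0, np.pi, 60)
X = np.vstack([np.cos(t), np.sin(t)]) + 0.001 * rng.standard_normal((2, 60))
M = barycentric_M(X, k=4)
residual = np.linalg.norm(X @ M - X, "fro") / np.linalg.norm(X, "fro")
```

The relative residual stays small in the interior of the arc; the boundary points reconstruct worst, which is exactly the situation the optional metric S is meant to address.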
In this context, we define the per-dimension embedding error of row-vector y_i ∈ rows(Y) to be\n\nE_M(y_i) := max_{D ∈ R^{M×N}} ‖(y_i(M + CD) − y_i)A‖ / ‖y_i A‖,   y_i ∈ range(Z),   (1)\n\nwhere D is a matrix constructed by an adversary to maximize the error. The optimizing y_i is a vector inside the subspace spanned by the rows of Z and outside the subspace spanned by the columns of C, for which the reconstruction residual y_i M − y_i has smallest norm w.r.t. the metric S. The following theorem identifies the optimal embedding Y for any choice of M, Z, C, S:\n\nMinimax solution: Let Q be a column-orthonormal basis of the null-space of the rows of ZC, with P = K − rank(C). Let B ∈ R^{P×P} be a square factor satisfying B^T B = Q^T Z S Z^T Q, e.g., a Cholesky factor (or the \u201cR\u201d factor in a QR-decomposition of (Q^T Z A)^T). Compute the left singular vectors U of U diag(s) V^T = B^{-T} Q^T Z(I − M)A, with singular values s := [s_1, ..., s_P] ordered s_1 ≤ s_2 ≤ ··· ≤ s_P. Using the leading columns U_{1:d} of U, set Y = U_{1:d}^T B^{-T} Q^T Z.\n\nTheorem 1. Y is the optimal (minimax) embedding in R^d with error ‖[s_1, ..., s_d]‖_2:\n\nY := U_{1:d}^T B^{-T} Q^T Z = argmin_{Y ∈ R^{d×N}} Σ_{y_i ∈ rows(Y)} E_M(y_i)², with E_M(y_i) = s_i.   (2)\n\nAppendix A develops the proof and other error measures that are minimized.\nLocal NLDR techniques are easily expressed in this framework. When Z = A = I, C = [], and M reproduces X through linear combinations with M^T 1 = 1, we recover LLE [5]. When Z = I, C = [], I − M is the normalized graph Laplacian, and A is a diagonal matrix of vertex degrees, we recover Laplacian eigenmaps [7]. 
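The Theorem 1 recipe can be transcribed almost line-for-line into numpy. The sketch below is my own rendering (helper name and ring-graph test case are assumptions), using the default settings Z = A = I and C = 1:

```python
import numpy as np

def minimax_embedding(M, d, Z=None, C=None, A=None):
    """Sketch of the Theorem 1 recipe: build the constraints in the space
    orthogonal to ZC, factor the metric Q^T Z S Z^T Q, decompose, and keep
    the d smallest singular directions."""
    N = M.shape[0]
    Z = np.eye(N) if Z is None else Z        # basis restriction (default: none)
    C = np.ones((N, 1)) if C is None else C  # null-space restriction (centering)
    A = np.eye(N) if A is None else A        # S = A A^T (default: identity)
    # Q: column-orthonormal basis with Q^T (ZC) = 0
    u, s, _ = np.linalg.svd(Z @ C, full_matrices=True)
    Q = u[:, (s > 1e-10).sum():]
    # Square factor B with B^T B = Q^T Z S Z^T Q (upper-triangular Cholesky)
    B = np.linalg.cholesky(Q.T @ Z @ A @ A.T @ Z.T @ Q).T
    # SVD of B^{-T} Q^T Z (I - M) A, singular values taken in ascending order
    T = np.linalg.solve(B.T, Q.T @ Z @ (np.eye(N) - M) @ A)
    U, svals, _ = np.linalg.svd(T, full_matrices=False)
    order = np.argsort(svals)[:d]
    # Y = U_{1:d}^T B^{-T} Q^T Z, with per-dimension errors s_1 <= ... <= s_d
    Y = np.linalg.solve(B, U[:, order]).T @ Q.T @ Z
    return Y, svals[order]

# Ring-graph barycentric M: each point is the average of its two neighbors.
N = 20
M = np.zeros((N, N))
i = np.arange(N)
M[(i - 1) % N, i] = M[(i + 1) % N, i] = 0.5
Y, errs = minimax_embedding(M, d=2)
```

With this toy M the two smallest singular directions give a centered, near-circular 2-D embedding; the centering YC = 0 holds by construction because the solution is built inside the space orthogonal to ZC, rather than being discarded after an EVD.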
When further Z = X we recover locality preserving projections [8].\n\n3 Analysis and generalization of charting\n\nThe minimax construction of charting [9] takes some development, but offers an interesting insight into the above-mentioned methods. Recall that charting first solves for a set of local affine subspace axes S_1 ∈ R^{D×d}, S_2, ... at offsets µ_1 ∈ R^D, µ_2, ... that best cover the data and vary smoothly over the manifold. Each subspace offers a chart, a local parameterization of the data by projection onto the local axes. Charting then constructs a weighted mixture of affine projections that merges the charts into a global parameterization. If the data manifold is curved, each projection will assign a point a slightly different embedding, so the error is measured as the variance of these proposed embeddings about their mean. This maximizes consistency and tends to produce isometric embeddings; [9] discusses ways to explicitly optimize the isometry of the embedding.\nUnder the assumption of isometry, the charting error is equivalent to the sum-squared displacements of an embedded point relative to its immediate neighbors (summed over all neighborhoods). To construct the same error criterion in the minimax setting, let x_{i-k}, ..., x_i, ..., x_{i+k} denote points in the ith neighborhood and let the columns of V_i ∈ R^{(2k+1)×d} be an orthonormal basis of the rows of the local parameterization S_i^T [x_{i-k}, ..., x_i, ..., x_{i+k}]. Then a nonzero reparameterization will satisfy [y_{i-k}, ..., y_i, ..., y_{i+k}] V_i V_i^T = [y_{i-k}, ..., y_i, ..., y_{i+k}] if and only if it preserves the relative position of the points in the local parameterization. 
Conversely, any relative displacements of the points are isolated by the formula [y_{i-k}, ..., y_i, ..., y_{i+k}](I − V_i V_i^T). Minimizing the Frobenius norm of this expression is thus equivalent to minimizing the local error in charting. We sum these constraints over all neighborhoods to obtain the constraint matrix M = I − Σ_i F_i(I − V_i V_i^T)F_i^T, where (F_i)_{kj} = 1 iff the jth point of the ith neighborhood is the kth point of the dataset. Because V_i V_i^T and (I − V_i V_i^T) are complementary, it follows that the error criterion of any local NLDR method (e.g., LLE, Laplacian eigenmaps, etc.) must measure the projection of the embedding onto some subspace of (I − V_i V_i^T).\nTo construct a continuous map, charting uses an overcomplete radial basis function (RBF) representation Z = [z(x_1), z(x_2), ..., z(x_N)], where z(x) is a vector that stacks z_1(x), z_2(x), etc., and\n\nz_m(x) := [K_m^T (x − µ_m); 1] · p_m(x) / Σ_m p_m(x),   (3)\n\np_m(x) := N(x | µ_m, Σ_m) ∝ exp(−(x − µ_m)^T Σ_m^{-1} (x − µ_m)/2),   (4)\n\nand K_m is any local linear dimensionality reducer, typically S_m itself. Each column of Z contains many \u201cviews\u201d of the same point that are combined to give its low-dimensional embedding.\nFinally, we set C = 1, which forces the embedding of the full data to be centered. Applying the minimax solution to these constraints yields the RBF network mixing matrix, f(x) := U_{1:d}^T B^{-T} Q^T z(x). Theorem 1 guarantees that the resulting embedding is least-squares optimal w.r.t. Z, M, C, A at the datapoints f(x_i), and because f(·) is an affine transform of z(·) it smoothly interpolates the embedding between points.\nThere are some interesting variants:\n\nFig. 1. Minimax and generalized EVD solution for kernel eigenmap of a non-developable swiss roll. 
Points are connected into a grid which ideally should be regular. The EVD solution shows substantial degradation. Insets detail corners where the EVD solution crosses itself repeatedly. The border compression is characteristic of Laplacian constraints.\n\nOne-shot charting: If we set the local dimensionality reducers to the identity matrix (all K_m = I), then the minimax method jointly optimizes the local dimensionality reduction to charts and the global coordination of the charts (under any choice of M). This requires that rows(Z) ≤ N for a fully determined solution.\nDiscrete isometric charting: If Z = I then we directly obtain a discrete isometric embedding of the data, rather than a continuous map, making this a local equivalent of IsoMap.\nReduced basis charting: Let Z be constructed using just a small number of kernels randomly placed on the data manifold, such that rows(Z) ≪ N. Then the size of the SVD problem is substantially reduced.\n\n4 Numerical advantage of minimax method\n\nNote that the minimax method projects the constraint matrix M into a subspace derived from C and Z and decomposes it there. This suppresses unwanted degrees of freedom (DOFs) admitted by the problem constraints, for example the trivial R^0 embedding where all points are mapped to a single point y_i = N^{-1/2}. The R^0 embedding serves as a translational DOF in the solution. LLE- and eigenmap-based methods construct M to have a constant null-space so that the translational DOF will be isolated in the EVD as a null eigenvalue paired with a constant eigenvector, which is then discarded. However, section 4.1 shows that this construction makes the EVD increasingly unstable as problem size grows and/or the data becomes increasingly amenable to low-residual embeddings, ultimately causing solution collapse. As the next paragraph demonstrates, the problem is exacerbated when embedding w.r.t. 
a basis Z (via the equivalent generalized eigenproblem), partly because the eigenvector associated with the unwanted DOF can have arbitrary structure. In all cases the problem can be averted by using the minimax formulation with C = 1 to suppress the DOF.\nA 2D plane was embedded in 3D with a curl, a twist, and 2.5% Gaussian noise, then regularly sampled at 900 points. We computed a kernelized Laplacian eigenmap using 70 random points as RBF centers, i.e., a continuous map using M derived from the graph Laplacian and Z constructed as above. The map was computed both via the minimax (SVD) method and via the equivalent generalized eigenproblem, where the translational degree of freedom must be removed by discarding an eigenvector from the solution. The two solutions are algebraically equivalent in every other regard. A variety of eigensolvers were tried; we took the best result.\n\n[Figure 1 panels: kernel embeddings of the twisted swiss roll; generalized EVD; minimax SVD; LL corner detail; UR corner detail.]\n\nFig. 2. Excess energy in the eigenspectrum indicates that the translational DOF has contaminated many eigenvectors. If the EVD had successfully isolated the unwanted DOF, then its remaining eigenvalues should be identical to those derived from the minimax solution. The graph at left shows the difference in the eigenspectra. The graph at right shows the EVD solution's deviation from the translational vector y_0 = 1 · N^{-1/2} ≈ .03333. If the numerics were perfect the line would be flat, but in practice the deviation is significant enough (roughly 1% of the diameter of the embedding) to noticeably perturb points in figure 1.\n\nFigure 1 shows that the EVD solution exhibits many defects, particularly a folding-over of the manifold at the top and bottom edges and at the corners. 
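The diagnostic described in the Fig. 2 caption can be checked on a toy problem. The ring-graph M below is my own stand-in, not the paper's swiss-roll experiment: when the translational DOF is projected out before decomposition (the minimax route), the remaining spectrum should exactly match the EVD's nonzero eigenvalues.

```python
import numpy as np

# Symmetric ring-graph constraint matrix: I - M is a Laplacian-like operator
# whose null vector is the constant (translational) DOF.
N = 30
M = np.zeros((N, N))
i = np.arange(N)
M[(i - 1) % N, i] = M[(i + 1) % N, i] = 0.5
L = np.eye(N) - M

# Plain EVD: the unwanted DOF appears as a zero eigenvalue / constant vector
# that must be discarded after the fact.
evals, evecs = np.linalg.eigh(L)

# Minimax route: suppress the DOF *before* decomposition by projecting onto
# the orthogonal complement of the constant vector (C = 1).
u, s, _ = np.linalg.svd(np.ones((N, 1)), full_matrices=True)
Q = u[:, 1:]                                    # basis orthogonal to 1
svals = np.linalg.svd(Q.T @ L, compute_uv=False)
```

On this clean, small example the two spectra agree to machine precision; the paper's point is that at larger N, with a basis Z, the EVD's isolated-then-discarded DOF leaks into the retained eigenvectors while the projected decomposition cannot be contaminated.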
Figure 2 shows that the noisiness of the EVD solution is due largely to mutual contamination of numerically unstable eigenvectors.\n\n4.1 Numerical instability of eigen-methods\n\nThe following theorem uses tools of matrix perturbation theory to show that as the problem size increases, the desired and unwanted eigenvectors become increasingly wobbly and gradually contaminate each other, leading to degraded solutions. More precisely, the low-order eigenvalues are ill-conditioned and exhibit multiplicities that may be true (due to noiseless samples from low-curvature manifolds) or false (due to numerical noise). Although in many cases some post-hoc algebra can \u201cfilter\u201d the unwanted components out of the contaminated eigensolution, it is not hard to construct cases where the eigenvectors cannot be cleanly separated. The minimax formulation is immune to this problem because it explicitly suppresses the gratuitous component(s) before matrix decomposition.\n\nTheorem 2. For any finite numerical precision, as the number of points N increases, the Frobenius norm of numerical noise in the null eigenvector v_0 can grow as O(N^{3/2}), and the eigenvalue problem can approach a false multiplicity at a rate as fast as O(N^{3/2}), at which point the eigenvectors of interest (embedding and translational) are mutually contaminated and/or have an indeterminate eigenvalue ordering.\n\nPlease see appendix B for the proof. This theorem essentially lower-bounds an upper-bound on error; examples can be constructed in which the problem is worse. For example, it can be shown analytically that when embedding points drawn from the simple curve x_i = [a, cos(πa)]^T, a ∈ [0, 1] with K = 2 neighbors, instabilities cannot be bounded better than O(N^{5/2}); empirically we see eigenvector mixing with N < 100 points, and we see it grow at the rate ≈ O(N^4) in many different eigensolvers. At very large scales, more pernicious instabilities set in. 
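The claimed growth rates can be illustrated numerically. The sketch below is my own toy stand-in for the K = 2 curve (a ring graph, with machine-precision noise simulated rather than measured): the eigengap of I − M shrinks here as O(N^{-2}) while ‖E‖_F grows as O(√N), so the noise-to-gap ratio grows roughly as O(N^{5/2}), consistent with the curve example above.

```python
import numpy as np

def gap_ratio(N):
    # Ring-graph barycentric M with K = 2 neighbors; the eigengap of I - M
    # is 1 - cos(2*pi/N) ~ O(N^-2), while eps-scale noise on the 2N nonzero
    # entries has simulated Frobenius norm O(sqrt(N)).
    M = np.zeros((N, N))
    i = np.arange(N)
    M[(i - 1) % N, i] = M[(i + 1) % N, i] = 0.5
    gap = np.sort(np.linalg.eigvalsh(np.eye(N) - M))[1]  # smallest nonzero eigenvalue
    E_norm = np.finfo(float).eps * np.sqrt(2 * N)        # simulated machine noise
    return E_norm / gap

ratios = [gap_ratio(N) for N in (100, 400, 800)]
```

Quadrupling N here multiplies the noise-to-gap ratio by roughly 4^{5/2} = 32, so even modest problem sizes push the perturbation past the eigengap that the EVD relies on to isolate the translational eigenvector.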
E.g., by N = 20000 points, the solution begins to fold over. Although algebraic multiplicity and instability of the eigenproblem are conceptually a minor oversight in the algorithmic realizations of eigenfunction embeddings, as theorem 2 shows, the consequences are eventually fatal.\n\n5 Summary\n\nOne of the most appealing aspects of the spectral NLDR literature is that algorithms are usually motivated from analyses of linear operators on smooth differentiable manifolds, e.g., [7]. Understandably, these analyses rely on assumptions (e.g., smoothness or isometry or noiseless sampling) that make it difficult to predict what algorithmic realizations will do when real, noisy data violates these assumptions. The minimax embedding theorem provides a complete algebraic characterization of this discrete NLDR problem, and provides a solution that recovers numerically robustified versions of almost all known algorithms. It offers a principled way of constructing new algorithms with clear optimality properties and good numerical conditioning, notably the construction of a continuous NLDR map (an RBF network) in a one-shot optimization (SVD). We have also shown how to cast several local NLDR principles in this framework, and upgrade these methods to give continuous maps. Working in the opposite direction, we sketched the minimax formulation of isometric charting and showed that its constraint matrix contains a superset of all the algebraic constraints used in local NLDR techniques.\n\nReferences\n\n1. W.T. Tutte. How to draw a graph. Proc. 
London Mathematical Society, 13:743–768, 1963.\n2. Miroslav Fiedler. A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory. Czech. Math. Journal, 25:619–633, 1975.\n3. Fan R.K. Chung. Spectral graph theory, volume 92 of CBMS Regional Conference Series in Mathematics. American Mathematical Society, 1997.\n4. Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, December 22 2000.\n5. Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 22 2000.\n6. Yee Whye Teh and Sam T. Roweis. Automatic alignment of hidden representations. In Proc. NIPS-15, 2003.\n7. Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. In Advances in Neural Information Processing Systems, volume 14, 2002.\n8. Xiaofei He and Partha Niyogi. Locality preserving projections. Technical Report TR-2002-09, University of Chicago Computer Science, October 2002.\n9. Matthew Brand. Charting a manifold. In Advances in Neural Information Processing Systems, volume 15, 2003.\n10. G.W. Stewart and Ji-Guang Sun. Matrix perturbation theory. Academic Press, 1990.\n\nA Proof of minimax embedding theorem (1)\n\nThe burden of this proof is carried by supporting lemmas, below. To emphasize the proof strategy, we give the proof first; supporting lemmas follow.\n\nProof. Setting y_i = l_i^T Z, we will solve for l_i ∈ columns(L). 
Writing the error in terms of l_i,\n\nE_M(l_i) = max_{K ∈ R^{M×N}} ‖l_i^T Z(I − M)A − l_i^T ZCKA‖ / ‖l_i^T ZA‖ = max_{K ∈ R^{M×N}} ‖l_i^T Z(I − M − CK)A‖ / ‖l_i^T ZA‖.   (5)\n\nThe term l_i^T ZCKA produces infinite error unless l_i^T ZC = 0, so we accept this as a constraint and seek\n\nmin_{l_i^T ZC = 0} ‖l_i^T Z(I − M)A‖ / ‖l_i^T ZA‖.   (6)\n\nBy lemma 1, that orthogonality is satisfied by solving the problem in the space orthogonal to ZC; the basis for this space is given by the columns of Q := null((ZC)^T). By lemma 2, the denominator of the error specifies the metric in solution space to be ZAA^T Z^T; when the problem is projected into the space orthogonal to ZC it becomes Q^T (ZAA^T Z^T) Q. Nesting the \u201corthogonally-constrained-SVD\u201d construction of lemma 1 inside the \u201cSVD-under-a-metric\u201d lemma 2, we obtain a solution that uses the correct metric in the orthogonal space:\n\nB^T B = Q^T ZAA^T Z^T Q,   (7)\nU diag(s) V^T = B^{-T} {Q^T (Z(I − M)A)},   (8)\nL = QB^{-1}U,   (9)\n\nwhere braces indicate the nesting of lemmas. By the \u201cbest-projection\u201d lemma (#3), if we order the singular values by ascending magnitude,\n\nL_{1:d} = argmin_{J ∈ R^{N×d}} Σ_{j_i ∈ cols(J)} (‖j_i^T Z(I − M)A‖ / ‖j_i‖_{ZSZ^T})².   (10)\n\nThe proof is completed by making the substitutions L^T Z → Y^T and ‖x^T A‖ → ‖x‖_S (for S = AA^T), and leaving off the final square root operation to obtain\n\n(Y^T)_{1:d} = argmin_{J ∈ R^{N×d}} Σ_{j_i ∈ cols(J)} (‖j_i^T (I − M)‖_S / ‖j_i‖_S)².   (11)\n\nLemma 1. Orthogonally constrained SVD: The left singular vectors L of matrix M under the constraint L^T C = 0 are calculated as Q := null(C^T), U diag(s) V^T = SVD(Q^T M), L = QU.\n\nProof. First observe that L is orthogonal to C: by definition, the null-space basis satisfies Q^T C = 0, thus L^T C = U^T Q^T C = 0. Let J be an orthonormal basis for C, with J^T J = I and Q^T J = 0. 
Then L diag(s) V^T = QQ^T M = (I − JJ^T)M, the orthogonal projector of C applied to M, proving that the SVD captures the component of M that is orthogonal to C.\n\nLemma 2. SVD with respect to a metric: The vectors l_i ∈ L, v_i ∈ V that diagonalize matrix M with respect to positive definite column-space metric S are calculated as B^T B ← S, U diag(s) V^T = SVD(B^{-T} M), L := B^{-1}U; they satisfy ‖l_i^T M‖ / ‖l_i‖_S = s_i and extremize this form for the extremal singular values s_min, s_max.\n\nProof. By construction, L and V diagonalize M:\n\nL^T M V = (B^{-1}U)^T M V = U^T (B^{-T} M) V = diag(s)   (12)\n\nand diag(s) V^T = B^{-T} M. Forming the gram matrices of both sides of the last line, we obtain the identity V diag(s)² V^T = M^T B^{-1} B^{-T} M = M^T S^{-1} M, which demonstrates that the s_i ∈ s are the singular values of M w.r.t. column-space metric S. Finally, L is orthonormal w.r.t. the metric S, because ‖L‖²_S = L^T S L = U^T B^{-T} B^T B B^{-1} U = I. Consequently,\n\n‖l_i^T M‖ / ‖l_i‖_S = ‖l_i^T M‖ / 1 = ‖s_i v_i^T‖ = s_i,   (13)\n\nand by the Courant-Hilbert theorem,\n\ns_max = max_l ‖l^T M‖ / ‖l‖_S;   s_min = min_l ‖l^T M‖ / ‖l‖_S.   (14)\n\nLemma 3. Best projection: Taking L and s from lemma 2, let the columns of L and elements of s be sorted so that s_1 ≥ s_2 ≥ ··· ≥ s_N. Then for any dimensionality 1 ≤ d ≤ N,\n\nL_{1:d} := [l_1, ..., l_d] = argmax_{J ∈ R^{N×d}} ‖J^T M‖_{(J^T S J)^{-1}}   (15)\n= argmax_{J ∈ R^{N×d}, J^T S J = I} ‖J^T M‖_F   (16)\n= argmax_{J ∈ R^{N×d}} Σ_{j_i ∈ cols(J)} (‖j_i^T M‖ / ‖j_i‖_S)²   (17)\n\nwith the optimum value of all right hand sides being (Σ_{i=1}^d s_i²)^{1/2}. If the sort order is reversed, the minimum of this form is obtained.\n\nProof. 
By the Eckart-Young-Mirsky theorem, if U^T M V = diag(s) with singular values sorted in descending order, then U_{1:d} := [u_1, ..., u_d] = argmax_U ‖U^T M‖_F over column-orthonormal U ∈ R^{N×d}. We first extend this to a non-orthogonal basis J under a Mahalanobis norm:\n\nmax_{J ∈ R^{N×d}} ‖J^T M‖_{(J^T J)^{-1}} = max_{U ∈ R^{N×d}, U^T U = I} ‖U^T M‖_F   (18)\n\nbecause ‖J^T M‖²_{(J^T J)^{-1}} = trace(M^T J (J^T J)^{-1} J^T M) = trace(M^T JJ⁺ (JJ⁺)^T M) = ‖(JJ⁺)M‖²_F = ‖UU^T M‖²_F, since JJ⁺ is a (symmetric) orthogonal projector having binary eigenvalues λ ∈ {0, 1} and therefore it is the gram of a thin orthogonal matrix. We then impose a metric S on the column-space of J to obtain the first criterion (equation 15), which asks what maximizes variance in J^T M while minimizing the norm of J w.r.t. metric S. Here it suffices to substitute in the leading (resp., trailing) columns of L and verify that the norm is maximized (resp., minimized). Expanding, ‖L_{1:d}^T M‖²_{(L_{1:d}^T S L_{1:d})^{-1}} = trace((L_{1:d}^T M)^T I (L_{1:d}^T M)) = trace((diag(s_{1:d}) V_{1:d}^T)^T (diag(s_{1:d}) V_{1:d}^T)) = ‖s_{1:d}‖². Again, by the Eckart-Young-Mirsky theorem, these are the maximal variance-preserving projections, so the first criterion is indeed maximized by setting J to the columns in L corresponding to the largest values in s.\nCriterion #2 restates the first criterion with the set of candidates for J restricted to (the hyperelliptical manifold of) matrices that reduce the metric on the norm to the identity matrix (thereby recovering the Frobenius norm). Criterion #3 merely expands the above trace by individual singular values. Note that the numerator and denominator can have different metrics because they are norms in different spaces, possibly of different dimension. Finally, that the trailing d eigenvectors minimize these criteria follows directly from the fact that the leading N − d singular values account for the maximal part of the variance.\n\nB Proof of instability theorem (2)\n\nProof. 
When generated from a sparse graph with average degree K, the weighted connectivity matrix W is sparse and has O(NK) entries. Since the graph vertices represent samples from a smooth manifold, increasing the sampling density N does not change the distribution of magnitudes in W. Consider a perturbation of the nonzero values in W, e.g., W → W + E, due to numerical noise E created by finite machine precision. By the weak law of large numbers, the Frobenius norm of the sparse perturbation grows as ‖E‖_F ∼ O(√N). However the tth-smallest nonzero eigenvalue scales as λ_t(W) = v_t^T W v_t ∼ O(N^{-1}), because elements of the corresponding eigenvector v_t grow as O(N^{-1/2}) and only K of those elements are multiplied by nonzero values to form each element of W v_t. In sum, the perturbation ‖E‖_F grows while the eigenvalue λ_t(W) shrinks. In linear embedding algorithms, the eigengap of interest is λ_gap := λ_1 − λ_0. The tail eigenvalue λ_0 = 0 by construction, but it is possible that λ_0 > 0 with numerical error, thus λ_gap ≤ λ_1. Combining these facts, the ratio between the perturbation and the eigengap grows as ‖E‖_F / λ_gap ∼ O(N^{3/2}) or faster. Now consider the shifted eigenproblem I − W with leading (maximal) eigenvalues 1 − λ_0 ≥ 1 − λ_1 ≥ ··· and unchanged eigenvectors. From matrix perturbation theory [10, thm. V.2.8], when W is perturbed to W' := W + E, the change in the leading eigenvalue from 1 − λ_0 to 1 − λ'_0 is bounded as |λ'_0 − λ_0| ≤ √2 ‖E‖_F, and similarly 1 − λ'_1 ≤ 1 − λ_1 + √2 ‖E‖_F. Thus λ'_gap ≥ λ_gap − 2√2 ‖E‖_F. Since ‖E‖_F / λ_gap ∼ O(N^{3/2}), the right hand side of the gap bound goes negative at a supralinear rate, implying that the eigenvalue ordering eventually becomes unstable with the possibility of the first and second eigenvalue/vector pairs being swapped. 
Mutual contamination of the eigenvectors happens well before: under general (dense) conditions, the change in the eigenvector v_0 is bounded as ‖v'_0 − v_0‖ ≤ 4‖E‖_F / (|λ_0 − λ_1| − √2 ‖E‖_F) [10, thm. V.2.8]. (This bound is often tight enough to serve as a good approximation.) Specializing this to the sparse embedding matrix, we find that the bound weakens to ‖v'_0 − 1 · N^{-1/2}‖ ∼ O(√N) / (O(N^{-1}) − O(√N)) > O(√N) / O(N^{-1}) = O(N^{3/2}).", "award": [], "sourceid": 2373, "authors": [{"given_name": "Matthew", "family_name": "Brand", "institution": null}]}