{"title": "Heavy-Tailed Symmetric Stochastic Neighbor Embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 2169, "page_last": 2177, "abstract": "Stochastic Neighbor Embedding (SNE) has shown to be quite promising for data visualization. Currently, the most popular implementation, t-SNE, is restricted to a particular Student t-distribution as its embedding distribution. Moreover, it uses a gradient descent algorithm that may require users to tune parameters such as the learning step size, momentum, etc., in finding its optimum. In this paper, we propose the Heavy-tailed Symmetric Stochastic Neighbor Embedding (HSSNE) method, which is a generalization of the t-SNE to accommodate various heavy-tailed embedding similarity functions. With this generalization, we are presented with two difficulties. The first is how to select the best embedding similarity among all heavy-tailed functions and the second is how to optimize the objective function once the heave-tailed function has been selected. Our contributions then are: (1) we point out that various heavy-tailed embedding similarities can be characterized by their negative score functions. 
Based on this finding, we present a parameterized subset of similarity functions for choosing the best tail-heaviness for HSSNE; (2) we present a fixed-point optimization algorithm that can be applied to all heavy-tailed functions and does not require the user to set any parameters; and (3) we present two empirical studies, one for unsupervised visualization showing that our optimization algorithm runs as fast as, and performs as well as, the best known t-SNE implementation, and the other for semi-supervised visualization showing quantitative superiority under the homogeneity measure as well as a qualitative advantage in cluster separation over t-SNE.", "full_text": "Heavy-Tailed Symmetric Stochastic Neighbor Embedding

Zhirong Yang
The Chinese University of Hong Kong
Helsinki University of Technology
zhirong.yang@tkk.fi

Zenglin Xu
The Chinese University of Hong Kong
Saarland University & MPI for Informatics
zlxu@cse.cuhk.edu.hk

Irwin King
The Chinese University of Hong Kong
king@cse.cuhk.edu.hk

Erkki Oja
Helsinki University of Technology
erkki.oja@tkk.fi

Abstract

Stochastic Neighbor Embedding (SNE) has proven quite promising for data visualization. Currently, the most popular implementation, t-SNE, is restricted to a particular Student t-distribution as its embedding distribution. Moreover, it uses a gradient descent algorithm that may require users to tune parameters such as the learning step size, momentum, etc., in finding its optimum. In this paper, we propose the Heavy-tailed Symmetric Stochastic Neighbor Embedding (HSSNE) method, which is a generalization of t-SNE to accommodate various heavy-tailed embedding similarity functions. With this generalization, we are presented with two difficulties. The first is how to select the best embedding similarity among all heavy-tailed functions, and the second is how to optimize the objective function once the heavy-tailed function has been selected.
Our contributions then are: (1) we point out that various heavy-tailed embedding similarities can be characterized by their negative score functions. Based on this finding, we present a parameterized subset of similarity functions for choosing the best tail-heaviness for HSSNE; (2) we present a fixed-point optimization algorithm that can be applied to all heavy-tailed functions and does not require the user to set any parameters; and (3) we present two empirical studies, one for unsupervised visualization showing that our optimization algorithm runs as fast as, and performs as well as, the best known t-SNE implementation, and the other for semi-supervised visualization showing quantitative superiority under the homogeneity measure as well as a qualitative advantage in cluster separation over t-SNE.

1 Introduction

Visualization, as an important tool for exploratory data analysis, has attracted much research effort in recent years. A multitude of visualization approaches, especially nonlinear dimensionality reduction techniques such as Isomap [9], Laplacian Eigenmaps [1], Stochastic Neighbor Embedding (SNE) [6], manifold sculpting [5], and kernel maps with a reference point [8], have been proposed. Although they are reported to perform well on tasks such as unfolding an artificial manifold, they are often not successful at visualizing real-world data of high dimensionality.

A common problem of the above methods is that most mapped data points are crowded together in the center, without distinct gaps that isolate data clusters. It was recently pointed out by van der Maaten and Hinton [10] that this “crowding problem” can be alleviated by using a heavy-tailed distribution in the low-dimensional space.
Their method, called t-Distributed Stochastic Neighbor Embedding (t-SNE), is adapted from SNE with two major changes: (1) it uses a symmetrized cost function; and (2) it employs a Student t-distribution with a single degree of freedom (T1). In this way, t-SNE achieves remarkable superiority in discovering the clustering structure of high-dimensional data.

The t-SNE development procedure in [10] is restricted to the T1 distribution as its embedding similarity. However, different data sets or other purposes of dimensionality reduction may require generalizing t-SNE to other heavy-tailed functions. The original t-SNE derivation provides little information for users on how to select the best embedding similarity among all heavy-tailed functions. Furthermore, the original t-SNE optimization algorithm is not convenient when the symmetric SNE is generalized to various heavy-tailed embedding similarity functions, since it builds on the gradient descent approach with momenta. As a result, several optimization parameters need to be specified manually. The performance of the t-SNE algorithm depends on a laborious selection of these optimization parameters. For instance, a large learning step size might cause the algorithm to diverge, while a conservative one might lead to slow convergence or poorly annealed results. Although comprehensive strategies have been used to improve the optimization performance, they might still be problematic when extended to other applications or embedding similarity functions.

In this paper we generalize t-SNE to accommodate various heavy-tailed functions, with two major contributions: (1) we propose to characterize heavy-tailed embedding similarities in symmetric SNE by their negative score functions.
This further leads to a parameterized subset facilitating the choice of the best tail-heaviness; and (2) we present a general algorithm for optimizing the symmetric SNE objective with any heavy-tailed embedding similarity.

The paper is organized as follows. First we briefly review the related work on SSNE and t-SNE in Section 2. In Section 3, we present the generalization of t-SNE to our Heavy-tailed Symmetric SNE (HSSNE) method. Next, a fixed-point optimization algorithm for HSSNE is provided and its convergence is discussed in Section 4. In Section 5, we relate the EM-like behavior of the fixed-point algorithm to a pairwise local mixture model for an in-depth analysis of HSSNE. Section 6 presents two sets of experiments, one for unsupervised and the other for semi-supervised visualization. Finally, conclusions are drawn in Section 7.

2 Symmetric Stochastic Neighbor Embedding

Suppose the pairwise similarities of a set of m-dimensional data points X = {x_i}, i = 1..n, are encoded in a symmetric matrix P ∈ R_+^{n×n}, where P_ii = 0 and Σ_{ij} P_ij = 1. Symmetric Stochastic Neighbor Embedding (SSNE) [4, 10] seeks r-dimensional (r ≪ m) representations of X, denoted by Y = {y_i}, i = 1..n, such that

J(Y) = D_KL(P || Q) = Σ_{i≠j} P_ij log(P_ij / Q_ij)    (1)

is minimized, where Q_ij = q_ij / Σ_{a≠b} q_ab are the normalized similarities in the low-dimensional embedding and

q_ij = exp(−||y_i − y_j||^2), q_ii = 0.    (2)

The optimization of SSNE uses the gradient descent method with

∂J/∂y_i = 4 Σ_j (P_ij − Q_ij)(y_i − y_j).    (3)

A momentum term is added to the gradient in order to speed up the optimization:

Y^(t+1) = Y^(t) + η (∂J/∂Y)|_{Y=Y^(t)} + β(t) (Y^(t) − Y^(t−1)),    (4)

where Y^(t) = [y_1^(t) ... y_n^(t)] ∈ R^{r×n} is the solution in matrix form at iteration t; η is the learning rate; and β(t) is the momentum amount at iteration t. Compared with the earlier Stochastic Neighbor Embedding (SNE) [6], SSNE uses a symmetrized cost function with simpler gradients.

Most mapped points in SSNE visualizations are often compressed near the center of the map, without clear gaps that separate clusters of the data. The t-Distributed Stochastic Neighbor Embedding (t-SNE) [10] addresses this crowding problem by using the Student t-distribution with a single degree of freedom,

q_ij = (1 + ||y_i − y_j||^2)^{−1}, q_ii = 0,    (5)

as the embedding similarity distribution, which has a heavier tail than the Gaussian used in SNE and SSNE. For brevity we denote this distribution by T1. Using it yields the gradient of t-SNE:

∂J/∂y_i = 4 Σ_j (P_ij − Q_ij)(y_i − y_j)(1 + ||y_i − y_j||^2)^{−1}.    (6)

In addition, t-SNE employs a number of strategies to overcome the difficulties of gradient-based optimization.

3 Heavy-tailed SNE characterized by negative score functions

As the gradient derivation in [10] is restricted to the T1 distribution, we derive the gradient for a general function that converts squared distances to similarities, with T1 as a special case. In addition, the direct chain rule used in [10] may cause notational clutter and conceal the working components in the gradients. We instead employ the Lagrangian technique to simplify the derivation.
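As a concrete companion to Equations (1)-(2), the SSNE objective can be evaluated in a few lines of numpy. This is an illustrative sketch of ours (the function name and array layout are not from the paper), with Y holding one embedded point per row:

```python
import numpy as np

def ssne_objective(P, Y):
    # Pairwise squared distances between embedded points y_i (rows of Y).
    D2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    q = np.exp(-D2)
    np.fill_diagonal(q, 0.0)   # q_ii = 0, Eq. (2)
    Q = q / q.sum()            # normalize over all pairs
    mask = P > 0
    # D_KL(P || Q), Eq. (1); zero-probability pairs contribute nothing.
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))
```

For two points with P_12 = P_21 = 0.5, any embedding gives Q = P after normalization, so the objective is zero, which is a quick sanity check on the implementation.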
Our approach provides more insight into the working factor introduced by the heavy-tailed functions.

Minimizing J(Y) in Equation (1) with respect to Y is equivalent to the optimization problem

maximize_{q,Y} L(q, Y) = Σ_{ij} P_ij log( q_ij / Σ_{a≠b} q_ab )    (7)
subject to q_ij = H(||y_i − y_j||^2),    (8)

where the embedding similarity function H(τ) ≥ 0 can be any function that is monotonically decreasing in τ for τ > 0. Note that H is not required to be defined as a probability function, because the symmetric SNE objective already involves normalization over all data pairs. The extended objective using the Lagrangian technique is given by

L~(q, Y) = Σ_{ij} P_ij log( q_ij / Σ_{a≠b} q_ab ) + Σ_{ij} λ_ij [ q_ij − H(||y_i − y_j||^2) ].    (9)

Setting ∂L~(q, Y)/∂q_ij = 0 yields λ_ij = 1/Σ_{a≠b} q_ab − P_ij/q_ij. Inserting these Lagrangian multipliers into the gradient with respect to y_i, we have

∂J(Y)/∂y_i = −∂L~(q, Y)/∂y_i = 4 Σ_j ( 1/Σ_{a≠b} q_ab − P_ij/q_ij ) · q_ij · ( h(||y_i − y_j||^2) / q_ij ) · (y_i − y_j)    (10)
= 4 Σ_j (P_ij − Q_ij) S(||y_i − y_j||^2) (y_i − y_j),    (11)

where h(τ) = dH(τ)/dτ and

S(τ) = −d log H(τ)/dτ    (12)

is the negative score function of H. For notational simplicity, we also write S_ij = S(||y_i − y_j||^2). We propose to characterize the tail heaviness of the similarity function H, relative to the one that leads to the Gaussian, by its negative score function S, also called the tail-heaviness function in this paper.

In this characterization, there is a functional operator S that maps every similarity function to a tail-heaviness function. For the baseline Gaussian similarity, H(τ) = exp(−τ), we have S(H) = 1, i.e. S(H)(τ) = 1 for all τ. For the Student t-distribution with a single degree of freedom, H(τ) = (1 + τ)^{−1} and thus S(H) = H.

The above observation inspires us to parameterize a family of tail-heaviness functions by powers of H: S(H, α) = H^α for α ≥ 0, where a larger α corresponds to a heavier-tailed embedding similarity function. Such a function H can be determined by solving the first-order differential equation −d log H(τ)/dτ = [H(τ)]^α, which gives

H(τ) = (ατ + c)^{−1/α}    (13)

with c a constant. Here we set c = 1 for a consistent generalization of SNE and t-SNE. The Gaussian embedding similarity, H(τ) = exp(−τ), is recovered as α → 0. Figure 1 shows a number of functions in the power family.

Figure 1: Several functions in the power family.

4 A fixed-point optimization algorithm

Unlike many other dimensionality reduction approaches that can be solved by eigendecomposition in a single step, SNE and its variants require iterative optimization methods. Substantial efforts have been devoted to improving the efficiency and robustness of t-SNE optimization. However, it remains unknown whether such a comprehensive implementation also works for other types of embedding similarity functions. Manually adjusting the involved parameters, such as the learning rate and the momentum, for every function is rather time-consuming and infeasible in practice.

Here we propose to optimize symmetric SNE by a fixed-point algorithm.
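The power family in Equation (13) (with c = 1) and its tail-heaviness function are easy to check numerically. The snippet below is our own illustration, not code from the paper; it verifies the Gaussian limit as α → 0 and the T1 case at α = 1:

```python
import numpy as np

def H(tau, alpha):
    # Power-family similarity, Eq. (13) with c = 1.
    if alpha == 0.0:
        return np.exp(-tau)   # Gaussian limit, alpha -> 0
    return (1.0 + alpha * tau) ** (-1.0 / alpha)

def S(tau, alpha):
    # Tail-heaviness (negative score) function: S(H, alpha) = H^alpha.
    return H(tau, alpha) ** alpha

tau = np.linspace(0.0, 5.0, 6)
```

At α = 1 this reproduces the T1 similarity (1 + τ)^{−1}, and at α = 0 the tail-heaviness function is identically 1, matching the Gaussian baseline described above.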
After rearranging the terms in ∂J/∂y_i = 0 (see Equation (11)), we obtain the following update rule:

Y_ki^(t+1) = ( Y_ki^(t) Σ_j B_ij + Σ_j (A_ij − B_ij) Y_kj^(t) ) / Σ_j A_ij,    (14)

where A_ij = P_ij S(||y_i^(t) − y_j^(t)||^2) and B_ij = Q_ij S(||y_i^(t) − y_j^(t)||^2). Our optimization algorithm for HSSNE simply applies Equation (14) iteratively. Compared with the original t-SNE optimization algorithm, our method requires no user-provided parameters such as the learning step size and momentum, which is more convenient in applications. The fixed-point algorithm usually converges, with the result satisfying the stationary condition ∂J/∂Y = 0. However, the update rule (14) can diverge in some cases, for example when the Y_ki are large. Therefore, a proof without extra conditions cannot be constructed. Here we provide two approximate theoretical justifications for the algorithm.

Denote by Δ = Y − Y^(t) the update step and by ∇ the gradient of J with respect to Y. Let us first approximate the HSSNE objective by the first-order Taylor expansion at the current estimate Y^(t):

J(Y) ≈ J_lin(Y) = J(Y^(t)) + Σ_{ki} Δ_ki ∇_ki^(t).    (15)

Then we can construct an upper bound of J_lin(Y):

G(Y, Y^(t)) = J_lin(Y) + (1/2) Σ_{ki} Δ_ki^2 Σ_a A_ia,    (16)

since the P_ia and S_ia are all nonnegative. The bound is tight at Y = Y^(t), i.e. G(Y^(t), Y^(t)) = J_lin(Y^(t)). Equating ∂G(Y, Y^(t))/∂Y = 0 implements the minimization of G(Y, Y^(t)) and yields the update rule (14). Iteratively applying the update rule (14) thus results in a monotonically decreasing sequence of the linear approximation of the HSSNE objective: J_lin(Y^(t)) ≥ G(Y^(t+1), Y^(t)) ≥ J_lin(Y^(t+1)).

Even if the second-order terms in the Taylor expansion of J(Y) are also considered, the update rule (14) is still justified when the differences Y_ki^(t+1) − Y_ki^(t) are small. Let D^A and D^B be diagonal matrices with D^A_ii = Σ_j A_ij and D^B_ii = Σ_j B_ij. We can write J(Y) = J_quad(Y) + O(Δ^3), where

J_quad(Y) = J_lin(Y) + (1/2) Σ_{ijkl} Δ_ki Δ_lj H_ijkl.    (17)

With the approximated Hessian H_ijkl = δ_kl [(D^A − A) − (D^B − B)]_ij, the updating term U_ki in Newton's method, Y_ki^(t+1) = Y_ki^(t) − U_ki, can be determined from Σ_{lj} H_ijkl U_lj = ∇_ki^(t). Solving this equation by directly inverting the huge tensor H is infeasible in practice, and it is thus usually implemented by iterative methods such as

U_ki^(v+1) = [ (A + D^B − B) U^(v) + ∇^(t) ]_ki / D^A_ii.    (18)

Such iterations, however, form a costly inner loop over v. To overcome this, we initialize U^(0) = 0 and employ only the first iteration of each inner loop. One can then verify that the resulting approximated Newton update, Y_ki^(t+1) = Y_ki^(t) − ∇_ki^(t) / D^A_ii, is identical to Equation (14).
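A minimal numpy sketch of the fixed-point iteration (14), specialized to the power-family similarity, might look as follows. This is our illustration (the names and the n × r orientation of Y are ours), not the authors' implementation:

```python
import numpy as np

def hssne_fixed_point(P, Y, alpha=1.0, iters=50):
    # Iterate the fixed-point rule, Eq. (14).
    # Y is n x r (one embedded point per row); P is symmetric and sums to 1.
    for _ in range(iters):
        D2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
        Hm = (1.0 + alpha * D2) ** (-1.0 / alpha)   # power-family similarity
        np.fill_diagonal(Hm, 0.0)
        Q = Hm / Hm.sum()
        Sij = Hm ** alpha                           # tail-heaviness S = H^alpha
        A = P * Sij                                 # A_ij = P_ij S_ij
        B = Q * Sij                                 # B_ij = Q_ij S_ij
        # y_i <- (y_i sum_j B_ij + sum_j (A_ij - B_ij) y_j) / sum_j A_ij
        Y = (Y * B.sum(1, keepdims=True) + (A - B) @ Y) / A.sum(1, keepdims=True)
    return Y
```

With only two points, Q equals P after normalization regardless of the embedding, so A = B and the update leaves Y unchanged, which matches the stationarity condition the rule is derived from.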
Such a first-step approximation technique has also been used in the Mean Shift algorithm as a generalized Expectation-Maximization solution [2].

5 A local mixture interpretation

Further rearranging the update rule gives more insight into the properties of SSNE solutions:

Y_ki^(t+1) = Σ_j A_ij [ Y_kj^(t) + (Q_ij/P_ij)(Y_ki^(t) − Y_kj^(t)) ] / Σ_j A_ij.    (19)

One can see that the above update rule mimics the maximization step in the EM algorithm for the classical Gaussian mixture model (e.g. [7]) or, more particularly, the Mean Shift method [3, 2]. This resemblance inspires us to find an alternative interpretation of the SNE behavior in terms of a particular mixture model.

Given the current estimate Y^(t), the fixed-point update rule actually performs minimization of

Σ_{ij} P_ij S_ij ||y_i − μ_ij^(t)||^2,    (20)

where μ_ij^(t) = y_j^(t) + (Q_ij/P_ij)(y_i^(t) − y_j^(t)). This problem is equivalent to maximizing the Jensen lower bound of

log Σ_{ij} P_ij S_ij exp( −||y_i − μ_ij^(t)||^2 ).    (21)

In this form, μ_ij^(t) can be regarded as the mean of the j-th mixture component for the i-th embedded data point, while the product P_ij S_ij can be thought of as the mixing coefficients.^1 Note that each data sample has its own mixing coefficients because of locality sensitivity.

^1 The data samples in such a symmetric mixture model do not follow the independent and identically distributed (i.i.d.) assumption, because the rows of mixing coefficients do not sum to the same number. Nevertheless, this does not affect our subsequent pairwise analysis.

For the converged estimate, i.e., Y^(t+1) = Y^(t) = Y*, we can rewrite the mixture without the logarithm as

Σ_{ij} P_ij S_ij exp( −(1 − Q_ij/P_ij)^2 ||y_i* − y_j*||^2 ).    (22)

Maximizing this quantity clearly explains the ingredients of symmetric SNE: (1) P_ij reflects that symmetric SNE favors close pairs in the input space, which is also adopted by most other locality-preserving methods. (2) As discussed in Section 3, S_ij characterizes the tail heaviness of the embedding similarity function. For the baseline Gaussian similarity it reduces to one and thus has no effect. For heavy-tailed similarities, S_ij can compensate for mismatched dimensionalities between the input space and its embedding. (3) The first factor in the exponential emphasizes the matching of the distance graphs, which underlies the success of SNE and its variants in capturing global data structure, compared with many other approaches that rely only on variance constraints [10]. A pair whose Q_ij approximates P_ij well increases the exponential, while a pair with a poor mismatch contributes little to the mixture.
(4) Finally, as in many other continuity-preserving methods, the second factor in the exponential forces close pairs in the input space to also lie nearby in the embedding space.

6 Experiments

6.1 t-SNE for unsupervised visualization

In this section we present experiments on unsupervised visualization with the T1 distribution, where our Fixed-Point t-SNE is compared with the original Gradient t-SNE optimization method as well as another dimensionality reduction approach, Laplacian Eigenmap [1]. Due to space limitations, we focus on three data sets, iris, wine, and segmentation (training subset), from the UCI repository^2.

We followed the instructions in [10] for calculating P_ij and for choosing the learning rate η and momentum amount β(t) of Gradient t-SNE. We excluded two tricks described in [10], “early compression” and “early exaggeration”, from the comparison of long-run optimization because they belong to the initialization stage. Both Fixed-Point and Gradient t-SNE execute with the same initialization, which uses the “early compression” trick and pre-runs Gradient t-SNE for 50 iterations as suggested in [10].

The visualization quality can be quantified using the ground-truth class information. We adopt the homogeneity of nearest neighbors:

homogeneity = γ/n,    (23)

where γ is the number of mapped points that belong to the same class as their nearest neighbor and n is again the total number of points. A larger homogeneity generally indicates better separability of the classes.

The experimental results are shown in Figure 2. Even though it has a globally optimal solution, the Laplacian Eigenmap yields poor visualizations, since none of the classes can be isolated.
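The homogeneity measure in Equation (23) can be computed directly from the embedding and the class labels; the following is a small sketch of ours, not the paper's code:

```python
import numpy as np

def homogeneity(Y, labels):
    # Fraction of points whose nearest neighbor in the embedding shares
    # their class label: homogeneity = gamma / n, Eq. (23).
    D2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    np.fill_diagonal(D2, np.inf)      # exclude each point as its own neighbor
    nn = D2.argmin(axis=1)            # index of nearest neighbor per point
    labels = np.asarray(labels)
    return np.mean(labels[nn] == labels)
```

Two tight, well-separated pairs with matching labels give homogeneity 1.0; swapping the labels within each pair drives it to 0.0.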
By contrast, both t-SNE methods achieve much higher homogeneities, and most clusters are well separated in the visualization plots. Comparing the two t-SNE implementations, one can see that our simple fixed-point algorithm converges even slightly faster than the comprehensive and carefully tuned Gradient t-SNE. Besides efficiency, our approach performs as well as Gradient t-SNE in terms of both the t-SNE objective and the homogeneity of nearest neighbors on these data sets.

6.2 Semi-supervised visualization

Unsupervised symmetric SNE or t-SNE may perform poorly on some data sets in terms of identifying classes. In such cases it is better to include some supervised information and apply semi-supervised learning to enhance the visualization.

Let us consider another data set, vehicle, from the LIBSVM repository^3. The top-left plot in Figure 3 demonstrates a poor visualization using unsupervised Gradient t-SNE. Next, suppose 10% of the intra-class relationships are known. We can construct a supervision matrix u, where u_ij = 1 if x_i and x_j are known to belong to the same class and 0 otherwise. After normalizing U_ij = u_ij / Σ_{a≠b} u_ab, we calculate the semi-supervised similarity matrix P~ = (1 − ρ)P + ρU, where the trade-off parameter ρ is set to 0.5 in our experiments. All SNE learning algorithms remain unchanged except that P is replaced with P~.

^2 http://archive.ics.uci.edu/ml/
^3 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Figure 2: Unsupervised visualization on three data sets. Columns 1 to 3 are results for iris, wine, and segmentation, respectively. The first row comprises the learning times of Gradient and Fixed-Point t-SNEs.
The second to fourth rows are visualizations using Laplacian Eigenmap, Gradient t-SNE, and Fixed-Point t-SNE, respectively.

[Figure 2 panel annotations: Laplacian Eigenmap, homogeneity 0.47 / DKL 1.52 (iris), 0.38 / 1.67 (wine), 0.42 / 1.86 (segmentation); Gradient t-SNE, 0.96 / 0.15, 0.96 / 0.36, 0.86 / 0.24; Fixed-Point t-SNE, 0.95 / 0.16, 0.97 / 0.37, 0.83 / 0.24.]

Figure 3: Semi-supervised visualization for the vehicle data set.
The plots titled with α values are produced using the fixed-point algorithm for the power family of HSSNE.

The top-middle plot in Figure 3 shows that including some supervised information improves the homogeneity (0.92) and the visualization, where Classes 3 and 4 become identifiable; the classes are, however, still very close to each other, with Classes 1 and 2 heavily mixed. We then tried the power family of HSSNE with α ranging from 0 to 1.5, using our fixed-point algorithm. It can be seen that as α increases, the cyan and magenta clusters become more separate, and Classes 1 and 2 can also be identified. With α = 1 and α = 1.5, the HSSNEs implemented by our fixed-point algorithm achieve even higher homogeneities (0.94 and 0.96, respectively) than the Gradient t-SNE. On the other hand, too large an α may increase the number of outliers and the Kullback-Leibler divergence.

7 Conclusions

The working mechanism of Heavy-tailed Symmetric Stochastic Neighbor Embedding (HSSNE) has been investigated rigorously. Our findings are: (1) we propose to use the negative score function to characterize and parameterize heavy-tailed embedding similarity functions; (2) this characterization provides a power family of functions that convert distances to embedding similarities; and (3) we have developed a fixed-point algorithm for optimizing SSNE, which greatly reduces the effort of tuning program parameters and facilitates extensions and applications of heavy-tailed SSNE. We have compared HSSNE against t-SNE and Laplacian Eigenmap using data sets from the UCI and LIBSVM repositories.
Two sets of experimental results, from unsupervised and semi-supervised visualization, indicate that our method is efficient, accurate, and versatile compared with the other two approaches.

Our future work may include further empirical studies on the learning speed and robustness of HSSNE through more extensive, especially large-scale, experiments. It also remains important to investigate acceleration techniques for both the initialization and the long-run stages of the learning.

8 Acknowledgements

The authors appreciate the reviewers' extensive and informative comments for the improvement of this paper. This work is supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK 4128/08E).

[Figure 3 panel annotations: unsupervised Gradient t-SNE, homogeneity 0.69 / DKL 3.24; semi-supervised Gradient t-SNE, 0.92 / 2.58; α = 0, 0.79 / 2.78; α = 0.5, 0.87 / 2.71; α = 1, 0.94 / 2.60; α = 1.5, 0.96 / 2.61.]

References

[1] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems, 14:585-591, 2002.

[2] M. A. Carreira-Perpiñán. Gaussian mean-shift is an EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):767-776, 2007.

[3] D. Comaniciu and P. Meer.
Mean Shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603-619, 2002.

[4] J. A. Cook, I. Sutskever, A. Mnih, and G. E. Hinton. Visualizing similarity data with a mixture of maps. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, volume 2, pages 67-74, 2007.

[5] M. Gashler, D. Ventura, and T. Martinez. Iterative non-linear dimensionality reduction with manifold sculpting. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 513-520. MIT Press, Cambridge, MA, 2008.

[6] G. Hinton and S. Roweis. Stochastic neighbor embedding. Advances in Neural Information Processing Systems, 15:833-840, 2003.

[7] G. J. McLachlan and D. Peel. Finite Mixture Models. Wiley, 2000.

[8] J. A. K. Suykens. Data visualization and dimensionality reduction using kernel maps with a reference point. IEEE Transactions on Neural Networks, 19(9):1501-1517, 2008.

[9] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, Dec. 2000.

[10] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.
", "award": [], "sourceid": 664, "authors": [{"given_name": "Zhirong", "family_name": "Yang", "institution": null}, {"given_name": "Irwin", "family_name": "King", "institution": null}, {"given_name": "Zenglin", "family_name": "Xu", "institution": null}, {"given_name": "Erkki", "family_name": "Oja", "institution": null}]}