{"title": "Colored Maximum Variance Unfolding", "book": "Advances in Neural Information Processing Systems", "page_first": 1385, "page_last": 1392, "abstract": "", "full_text": "Colored Maximum Variance Unfolding\n\nLe Song\u2020, Alex Smola\u2020, Karsten Borgwardt\u2021 and Arthur Gretton\u2217\n\n\u2020National ICT Australia, Canberra, Australia\n\u2021University of Cambridge, Cambridge, United Kingdom\n\u2217MPI for Biological Cybernetics, T\u00fcbingen, Germany\n\n{le.song,alex.smola}@nicta.com.au\nkmb51@eng.cam.ac.uk, arthur.gretton@tuebingen.mpg.de\n\nAbstract\n\nMaximum variance unfolding (MVU) is an effective heuristic for dimensionality reduction. It produces a low-dimensional representation of the data by maximizing the variance of their embeddings while preserving the local distances of the original data. We show that MVU also optimizes a statistical dependence measure which aims to retain the identity of individual observations under the distance-preserving constraints. This general view allows us to design \u201ccolored\u201d variants of MVU, which produce low-dimensional representations for a given task, e.g. subject to class labels or other side information.\n\n1 Introduction\n\nIn recent years maximum variance unfolding (MVU), introduced by Weinberger et al. [1], has gained popularity as a method for dimensionality reduction. This method is based on a simple heuristic: maximizing the overall variance of the embedding while preserving the local distances between neighboring observations. Sun et al. [2] show that there is a dual connection between MVU and the goal of finding a fast mixing Markov chain. This connection is intriguing. However, it offers limited insight as to why MVU can be used for data representation.\n\nThis paper provides a statistical interpretation of MVU. 
We show that the algorithm attempts to extract features from the data which simultaneously preserve the identity of individual observations and their local distance structure. Our reasoning relies on a dependence measure between sets of observations, the Hilbert-Schmidt Independence Criterion (HSIC) [3].\n\nRelaxing the requirement of retaining maximal information about individual observations, we are able to obtain \u201ccolored\u201d MVU. Unlike traditional MVU, which takes only one source of information into account, \u201ccolored\u201d MVU allows us to integrate two sources of information into a single embedding. That is, we are able to find an embedding that trades off between two goals:\n\n\u2022 preserve the local distance structure according to the first source of information (the data);\n\u2022 and maximally align with the second source of information (side information).\n\nNote that not all features inherent in the data are interesting for an ulterior objective. For instance, if we want to retain a reduced representation of the data for later classification, then only the discriminative features will be relevant. \u201cColored\u201d MVU achieves the goal of elucidating primarily relevant features by aligning the embedding to the objective provided in the side information. 
Some examples illustrate this situation in more detail:\n\n\u2022 Given a bag-of-pixels representation of images (the data), such as USPS digits, find an embedding which reflects the categories of the images (side information).\n\u2022 Given a vector space representation of texts on the web (the data), such as newsgroups, find an embedding which reflects a hierarchy of the topics (side information).\n\u2022 Given a TF/IDF representation of documents (the data), such as NIPS papers, find an embedding which reflects co-authorship relations between the documents (side information).\n\nThere is a strong motivation for not simply merging the two sources of information into a single distance metric: Firstly, the data and the side information may be heterogeneous, and it is unclear how to combine them into a single distance metric. Secondly, the side information may appear in the form of a similarity rather than a distance. For instance, co-authorship is a similarity between documents (if two papers share more authors, they tend to be more similar), but it does not induce a distance between the documents (if two papers share no authors, we cannot assert that they are far apart). Thirdly, at test time (i.e. when inserting a new observation into an existing embedding) only one source of information might be available, i.e. the side information may be missing.\n\n2 Maximum Variance Unfolding\n\nWe begin by giving a brief overview of MVU and its projection variants, as proposed in [1]. Given a set of m observations Z = {z1, . . . , zm} \u2286 Z and a distance metric d : Z \u00d7 Z \u2192 [0, \u221e), find an inner product matrix (kernel matrix) K \u2208 Rm\u00d7m with K \u2ab0 0 such that\n\n1. The distances are preserved, i.e. Kii + Kjj \u2212 2Kij = d_ij^2 for all (i, j) pairs which are sufficiently close to each other, such as the n nearest neighbors of each observation. We denote this set by N . 
We will also use N to denote the graph formed by having these (i, j) pairs as edges.\n\n2. The embedded data is centered, i.e. K1 = 0 (where 1 = (1, . . . , 1)\u22a4 and 0 = (0, . . . , 0)\u22a4).\n\n3. The trace of K is maximized (the maximum variance unfolding part).\n\nSeveral variants of this algorithm, including a large scale variant [4], have been proposed. By and large the optimization problem looks as follows:\n\nmaximize_{K \u2ab0 0} tr K subject to K1 = 0 and Kii + Kjj \u2212 2Kij = d_ij^2 for all (i, j) \u2208 N . (1)\n\nNumerous variants of (1) exist, e.g. where the distances are only allowed to shrink, where slack variables are added to the objective function to allow approximate distance preservation, or where one uses low-rank expansions of K to cope with the computational complexity of semidefinite programming.\n\nA major drawback of MVU is that its results necessarily come as somewhat of a surprise. That is, it is never clear before invoking MVU what specific interesting results it might produce. While in hindsight it is easy to find an insightful interpretation of the outcome, it is not a priori clear which aspect of the data the representation might emphasize. A second drawback is that, while MVU generally produces brilliant results, its statistical origins are somewhat obscure. We aim to address these problems by means of the Hilbert-Schmidt Independence Criterion.\n\n3 Hilbert-Schmidt Independence Criterion\n\nLet sets of observations X and Y be drawn jointly from some distribution Prxy. The Hilbert-Schmidt Independence Criterion (HSIC) [3] measures the dependence between two random variables, x and y, by computing the square of the norm of the cross-covariance operator over the domain X \u00d7 Y in Hilbert space. It can be shown, provided the Hilbert space is universal, that this norm vanishes if and only if x and y are independent. 
A large value suggests strong dependence with respect to the choice of kernels.\n\nLet F and G be the reproducing kernel Hilbert spaces (RKHS) on X and Y with associated kernels k : X \u00d7 X \u2192 R and l : Y \u00d7 Y \u2192 R respectively. The cross-covariance operator Cxy : G \u2192 F is defined as [5]\n\nCxy = Exy[(k(x, \u00b7) \u2212 \u00b5x)(l(y, \u00b7) \u2212 \u00b5y)], (2)\n\nwhere \u00b5x = E[k(x, \u00b7)] and \u00b5y = E[l(y, \u00b7)]. HSIC is then defined as the square of the Hilbert-Schmidt norm of Cxy, that is HSIC(F, G, Prxy) := ||Cxy||_HS^2. In terms of kernels HSIC is\n\nE_{xx'yy'}[k(x, x')l(y, y')] + E_{xx'}[k(x, x')] E_{yy'}[l(y, y')] \u2212 2 E_{xy}[E_{x'}[k(x, x')] E_{y'}[l(y, y')]]. (3)\n\nGiven the sample (X, Y ) = {(x1, y1), . . . , (xm, ym)} of size m drawn from the joint distribution Prxy, an empirical estimate of HSIC is [3]\n\nHSIC(F, G, Z) = (m \u2212 1)^{\u22122} tr HKHL, (4)\n\nwhere K, L \u2208 Rm\u00d7m are the kernel matrices for the data and the labels respectively, and Hij = \u03b4ij \u2212 m^{\u22121} centers the data and the labels in the feature space. (For convenience, we will drop the normalization and use tr HKHL as HSIC.)\n\nHSIC has been used to measure independence between random variables [3], to select features and to cluster data (see the Appendix for further details). Here we use it in a different way:\n\nWe try to construct a kernel matrix K for the dimension-reduced data X which preserves the local distance structure of the original data Z, such that X is maximally dependent on the side information Y as seen from its kernel matrix L.\n\nHSIC has several advantages as a dependence criterion. First, it satisfies concentration of measure conditions [3]: for random draws of observations from Prxy, HSIC provides values which are very similar. This is desirable, as we want our metric embedding to be robust to small changes. 
Second, HSIC is easy to compute, since only the kernel matrices are required and no density estimation is needed. Moreover, the freedom of choosing a kernel for L allows us to incorporate prior knowledge into the dependence estimation process. The consequence is that we are able to incorporate various kinds of side information by simply choosing an appropriate kernel for Y .\n\n4 Colored Maximum Variance Unfolding\n\nWe state the algorithmic modification first and subsequently explain why it is reasonable: the key idea is to replace tr K in (1) by tr KL, where L is the covariance matrix of the domain (side information) with respect to which we would like to extract features. For instance, in the case of NIPS papers which happen to have author information, L would be the kernel matrix arising from coauthorship and d(z, z') would be the Euclidean distance between the vector space representations of the documents. Key to our reasoning is the following lemma:\n\nLemma 1 Denote by L a positive semidefinite matrix in Rm\u00d7m and let H \u2208 Rm\u00d7m be defined as Hij = \u03b4ij \u2212 m^{\u22121}. Then the following two optimization problems are equivalent:\n\nmaximize_K tr HKHL subject to K \u2ab0 0 and constraints on Kii + Kjj \u2212 2Kij. (5a)\nmaximize_K tr KL subject to K \u2ab0 0 and constraints on Kii + Kjj \u2212 2Kij and K1 = 0. (5b)\n\nAny solution of (5b) solves (5a) and any solution of (5a) solves (5b) after centering K \u2190 HKH.\n\nProof Denote by Ka and Kb the solutions of (5a) and (5b) respectively. Kb is feasible for (5a) and tr KbL = tr HKbHL. This implies that tr HKaHL \u2265 tr HKbHL. Vice versa, HKaH is feasible for (5b). Moreover tr HKaHL \u2264 tr KbL by the requirement on the optimality of Kb. 
Combining both inequalities shows that tr HKaHL = tr KbL, hence both solutions are equivalent.\n\nThis means that the centering imposed in MVU via constraints is equivalent to the centering in HSIC by means of the dependence measure tr HKHL itself. In other words, MVU equivalently maximizes tr HKHI, i.e. the dependence between K and the identity matrix I, which corresponds to retaining maximal diversity between observations via Lij = \u03b4ij. This suggests the following colored version of MVU:\n\nmaximize_K tr HKHL subject to K \u2ab0 0 and Kii + Kjj \u2212 2Kij = d_ij^2 for all (i, j) \u2208 N . (6)\n\nUsing (6) we see that we are now extracting a Euclidean embedding which maximally depends on the coloring matrix L (for the side information) while preserving the local distance structure. A second advantage of (6) is that whenever we restrict K further, e.g. by only allowing K to be part of a linear subspace formed by the principal vectors in some space, (6) remains feasible, whereas the (constrained) MVU formulation may become infeasible (i.e. K1 = 0 may not be satisfied).\n\n5 Dual Problem\n\nTo gain further insight into the structure of the solution of (6) we derive its dual problem. Our approach uses the results from [2]. First we define matrices E^ij \u2208 Rm\u00d7m for each edge (i, j) \u2208 N , such that E^ij has only four nonzero entries: E^ij_ii = E^ij_jj = 1 and E^ij_ij = E^ij_ji = \u22121. Then the distance preserving constraint can be written as tr K E^ij = d_ij^2. 
Thus we have the following Lagrangian:\n\nL = tr KHLH + tr KZ \u2212 \u2211_{(i,j)\u2208N} w_ij (tr K E^ij \u2212 d_ij^2)\n  = tr K(HLH + Z \u2212 \u2211_{(i,j)\u2208N} w_ij E^ij) + \u2211_{(i,j)\u2208N} w_ij d_ij^2, where Z \u2ab0 0 and w_ij \u2265 0. (7)\n\nSetting the derivative of L with respect to K to zero yields HLH + Z \u2212 \u2211_{(i,j)\u2208N} w_ij E^ij = 0. Plugging this condition into (7) gives us the dual problem:\n\nminimize_w \u2211_{(i,j)\u2208N} w_ij d_ij^2 subject to G(w) \u2ab0 HLH, where G(w) = \u2211_{(i,j)\u2208N} w_ij E^ij. (8)\n\nNote that G(w) amounts to the graph Laplacian of a weighted graph with adjacency matrix given by w. The dual constraint G(w) \u2ab0 HLH effectively requires that the eigenspectrum of the graph Laplacian is bounded from below by that of HLH.\n\nWe are interested in the properties of the solution K of the primal problem, in particular the number of nonzero eigenvalues. Recall that at optimality the Karush-Kuhn-Tucker conditions imply tr KZ = 0, i.e. the row space of K lies in the null space of Z. Thus the rank of K is upper bounded by the dimension of the null space of Z.\n\nRecall that Z = G(w) \u2212 HLH \u2ab0 0, and by design G(w) \u2ab0 0 since it is the graph Laplacian of a weighted graph with edge weights w_ij. If G(w) corresponds to a connected graph, only one eigenvalue of G(w) vanishes. Hence, the eigenvectors of Z with zero eigenvalues would correspond to those lying in the image of HLH. If L arises from a label kernel matrix, e.g. for an n-class classification problem, then we will only have up to n vanishing eigenvalues in Z. This translates into only up to n nonvanishing eigenvalues in K.\n\nContrast this observation with plain MVU. In this case L = I, that is, only one eigenvalue of HLH vanishes. Hence it is likely that G(w) \u2212 HLH will have many vanishing eigenvalues, which translates into many nonzero eigenvalues of K. 
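The identification of G(w) with a graph Laplacian can be checked numerically. The following is a small sketch (not code from the paper; the graph and its edge weights are made up for illustration) verifying that the sum of weighted E^ij matrices equals the usual degree-minus-adjacency Laplacian, and that a connected graph yields exactly one vanishing eigenvalue:

```python
import numpy as np

# Toy neighborhood graph on m = 4 points with edge set N and
# arbitrary nonnegative weights w_ij (hypothetical values).
m = 4
edges = {(0, 1): 0.5, (1, 2): 1.0, (2, 3): 0.25, (0, 2): 2.0}

# G(w) = sum_{(i,j) in N} w_ij E^ij, where E^ij has entries
# E^ij_ii = E^ij_jj = 1 and E^ij_ij = E^ij_ji = -1.
G = np.zeros((m, m))
for (i, j), w in edges.items():
    E = np.zeros((m, m))
    E[i, i] = E[j, j] = 1.0
    E[i, j] = E[j, i] = -1.0
    G += w * E

# The same matrix via degree-minus-adjacency (the weighted graph Laplacian).
W = np.zeros((m, m))
for (i, j), w in edges.items():
    W[i, j] = W[j, i] = w
L_graph = np.diag(W.sum(axis=1)) - W

assert np.allclose(G, L_graph)
# A connected graph has exactly one zero eigenvalue (the constant vector).
assert int(np.sum(np.linalg.eigvalsh(G) < 1e-9)) == 1
```

The second assertion is the fact used above: for a connected neighborhood graph only one eigenvalue of G(w) vanishes.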
This is corroborated by experiments (Section 7).\n\n6 Implementation Details\n\nIn practice, instead of requiring the distances to remain unchanged in the embedding we only require them to be preserved approximately [4]. We do so by penalizing the slackness between the original distances and the embedding distances, i.e.\n\nmaximize_K tr HKHL \u2212 \u03bd \u2211_{(i,j)\u2208N} (Kii + Kjj \u2212 2Kij \u2212 d_ij^2)^2 subject to K \u2ab0 0. (9)\n\nHere \u03bd controls the tradeoff between dependence maximization and distance preservation. The semidefinite program usually has a time complexity up to O(m^6). This renders direct implementation of the above problem infeasible for anything but toy problems. To reduce the computation, we approximate K using an orthonormal set of vectors V (of size m \u00d7 n) and a smaller positive definite matrix A (of size n \u00d7 n), i.e. K = VAV\u22a4. Conveniently we choose the number of dimensions n to be much smaller than m (n \u226a m) such that the resulting semidefinite program with respect to A becomes tractable (clearly this is an approximation).\n\nTo obtain the matrix V we employ a regularization scheme as proposed in [4]. First, we construct a nearest neighbor graph according to N (we will also refer to this graph and its adjacency matrix as N ). Then we form V by stacking together the bottom n eigenvectors of the graph Laplacian of the neighborhood graph N . The key idea is that neighbors in the original space remain neighbors in the embedding space. As we require them to have similar locations, the bottom eigenvectors of the graph Laplacian provide a set of good bases for functions varying smoothly across the graph.\n\nSubsequent to the semidefinite program we perform local refinement of the embedding via gradient descent. Here the objective is reformulated using an m \u00d7 n dimensional matrix X, i.e. 
K = XX\u22a4. The initial value X0 is obtained using the n leading eigenvectors of the solution of (9).\n\n7 Experiments\n\nUltimately the justification for an algorithm is its practical applicability. We demonstrate this based on three datasets: embeddings of digits from the USPS database, the Newsgroups 20 dataset containing Usenet articles in text form, and a collection of NIPS papers from 1987 to 1999.1 We compare \u201ccolored\u201d MVU (also called MUHSIC, maximum unfolding via HSIC) to MVU [1] and PCA, highlighting places where MUHSIC produces more meaningful results by incorporating side information. Further details, such as the effects of the adjacency matrices and a comparison to Neighborhood Component Analysis [6], are relegated to the appendix due to space limitations.\n\nFor images we use the Euclidean distance between pixel values as the base metric. For text documents, we perform four standard preprocessing steps: (i) the words are stemmed using the Porter stemmer; (ii) we filter out common but meaningless stopwords; (iii) we delete words that appear in fewer than 3 documents; (iv) we represent each document as a vector using the usual TF/IDF (term frequency / inverse document frequency) weighting scheme. As before, the Euclidean distance on those vectors is used to find the nearest neighbors.\n\nAs in [4] we construct the nearest neighbor graph by considering the 1% nearest neighbors of each point. Subsequently the adjacency matrix of this graph is symmetrized. The regularization parameter \u03bd as given in (9) is set to 1 as a default. Moreover, as in [4] we choose 10 dimensions (n = 10) to decompose the embedding matrix K. Final visualization is carried out using 2 dimensions. 
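The basis construction of Section 6 (the bottom eigenvectors of the neighborhood-graph Laplacian forming V, with K approximated as VAV\u22a4) can be sketched in a few lines. This is a minimal illustration on synthetic data, not the authors' implementation; the data and parameter values below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_dims, k_nn = 200, 10, 5          # observations, basis size n, nearest neighbors
Z = rng.normal(size=(m, 3))           # synthetic data points (placeholder)

# Symmetrized k-nearest-neighbor adjacency matrix N.
D2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
N = np.zeros((m, m))
for i in range(m):
    for j in np.argsort(D2[i])[1:k_nn + 1]:   # skip index 0 (the point itself)
        N[i, j] = N[j, i] = 1.0

# Graph Laplacian of the neighborhood graph; its bottom n eigenvectors
# vary smoothly over the graph and serve as the basis V in K = V A V^T.
L_graph = np.diag(N.sum(axis=1)) - N
eigvals, eigvecs = np.linalg.eigh(L_graph)    # eigenvalues in ascending order
V = eigvecs[:, :n_dims]                       # columns are orthonormal

assert np.allclose(V.T @ V, np.eye(n_dims), atol=1e-8)
```

The semidefinite program over the small n \u00d7 n matrix A (rather than the full m \u00d7 m matrix K) is what makes the approach tractable.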
This makes our results directly comparable to previous work.\n\nUSPS Digits This dataset consists of images of handwritten digits at a resolution of 16\u00d716 pixels. We normalized the data to the range [\u22121, 1] and used the test set containing 2007 observations. Since it is a digit recognition task, we have Y \u2208 {0, . . . , 9}. Y is used to construct the matrix L by applying the kernel k(y, y') = \u03b4_{y,y'}. This kernel promotes embeddings where images from the same class are grouped more tightly. Figure 1 shows the results produced by MUHSIC, MVU and PCA.\n\nThe overall properties of the embeddings are similar across the three methods (\u20182\u2019 on the left, \u20181\u2019 on the right, \u20187\u2019 on top, and \u20188\u2019 at the bottom). Arguably MUHSIC produces a clearer visualization. For instance, images of \u20185\u2019 are clustered more tightly than with the other two methods. Furthermore, MUHSIC also results in a much better separation between images from different classes. For instance, the overlap between \u20184\u2019 and \u20186\u2019 produced by MVU and PCA is largely reduced by MUHSIC. Similar results also hold for \u20180\u2019 and \u20185\u2019.\n\nFigure 1 also shows the eigenspectrum of K produced by the different methods. The eigenvalues are sorted in descending order and normalized by the trace of K. Each patch in the color bar represents an eigenvalue. We see that MUHSIC results in 3 significant eigenvalues, MVU results in 10, while PCA produces a grading of many eigenvalues (as can be seen by an almost continuously changing spectrum in the spectral diagram). This confirms our reasoning of Section 5 that the spectrum generated by MUHSIC is likely to be considerably sparser than that of MVU.\n\nNewsgroups This dataset consists of Usenet articles collected from 20 different newsgroups. We use a subset of 2000 documents for our experiments (100 articles from each newsgroup). 
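The experiments above combine a delta kernel on labels with the score tr HKHL. As a concrete illustration, the following self-contained sketch (synthetic data and placeholder values, not the paper's code) builds the label kernel L, the centering matrix H, and the unnormalized HSIC:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 60
y = rng.integers(0, 3, size=m)              # class labels as side information
X = rng.normal(size=(m, 5)) + y[:, None]    # features correlated with the labels

# Delta kernel on the labels: L_ij = 1 iff y_i == y_j.
L = (y[:, None] == y[None, :]).astype(float)
# Linear kernel on the data (any kernel matrix K could be used here).
K = X @ X.T

# Centering matrix H_ij = delta_ij - 1/m; unnormalized HSIC = tr(HKHL).
H = np.eye(m) - np.full((m, m), 1.0 / m)
hsic = np.trace(H @ K @ H @ L)

# Label-independent features give a much smaller score.
X_indep = rng.normal(size=(m, 5))
hsic_indep = np.trace(H @ (X_indep @ X_indep.T) @ H @ L)
assert hsic > hsic_indep
```

Since HKH and L are both positive semidefinite, tr(HKHL) is always nonnegative; larger values indicate stronger dependence between the embedding and the side information.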
We remove the headers from the articles before preprocessing while keeping the subject line. There is a clear hierarchy in the newsgroups. For instance, 5 topics are related to computer science, 3 are related to religion, and 4 are related to recreation. We use these topics as side information and apply a delta kernel k(y, y') = \u03b4_{y,y'} on them. As with the USPS digits, we want to preserve the identity of individual newsgroups. While we did not encode the hierarchical information explicitly, we recover a meaningful hierarchy among the topics, as can be seen in Figure 2.\n\n1 Preprocessed data are available at http://www.it.usyd.edu.au/~lesong/muhsic datasets.html.\n\nFigure 1: Embedding of 2007 USPS digits produced by MUHSIC, MVU and PCA respectively. Colors of the dots are used to denote digits from different classes. The color bar below each figure shows the eigenspectrum of the learned kernel matrix K.\n\nFigure 2: Embedding of 2000 newsgroup articles produced by MUHSIC, MVU and PCA respectively. Colors and shapes of the dots are used to denote articles from different newsgroups. The color bar below each figure shows the eigenspectrum of the learned kernel matrix K.\n\nA distinctive feature of the visualizations is that MUHSIC groups articles from individual topics more tightly than MVU and PCA. Furthermore, the semantic information is also well preserved by MUHSIC. For instance, on the left side of the embedding all computer science topics are placed adjacent to each other; comp.sys.ibm.pc.hardware and comp.os.ms-windows.misc are adjacent and well separated from comp.sys.mac.hardware, comp.windows.x and comp.graphics. The latter is meaningful since Apple computers are more popular in graphics (as are X windows based systems for scientific visualization). 
Likewise we see that on the top we find all recreational topics (with rec.sport.baseball and rec.sport.hockey clearly distinguished from the rec.autos and rec.motorcycles groups). A similar adjacency between talk.politics.mideast and soc.religion.christian is quite interesting. The layout suggests that the content of talk.politics.guns and of sci.crypt is quite different from other Usenet discussions.\n\nNIPS Papers We used the 1735 regular NIPS papers from 1987 to 1999. They are scanned from the proceedings and transformed into text files via OCR. The table of contents (TOC) is also available. We parse the TOC and construct a coauthor network from it. Our goal is to embed the papers while taking the coauthor information into account. As the kernel k(y, y') we simply use the number of authors shared by two papers. To illustrate this we highlighted some well-known researchers. Furthermore, we also annotated some papers to show the semantics revealed by the embedding. Figure 3 shows the results produced by MUHSIC, MVU and PCA.\n\nAll three methods correctly represent the two major topics of NIPS papers: artificial systems, i.e. machine learning (positioned on the left side of the visualization), and natural 
systems, i.e. computational neuroscience (which lie on the right). This is confirmed by examining the highlighted researchers. For instance, the papers by Smola, Sch\u00f6lkopf and Jordan are embedded on the left, whereas the many papers by Sejnowski, Dayan and Bialek can be found on the right.\n\nFigure 3: Embedding of 1735 NIPS papers produced by MUHSIC, MVU and PCA. Papers by some representative (combinations of) researchers are highlighted as indicated by the legend. The color bar below each figure shows the eigenspectrum of the learned kernel matrix K. The yellow diamond in the graph denotes the current paper as submitted to NIPS. This paper is placed in the location of its nearest neighbor; more details are in the appendix.\n\nUnique to the visualization of MUHSIC is that there is a clear grouping of the papers by researchers. For instance, papers on reinforcement learning (Barto, Singh and Sutton) are in the upper left corner; papers by Hinton (computational cognitive science) are near the lower left corner; and papers by Sejnowski and Dayan (computational neuroscientists) are clustered on the right side and adjacent to each other. 
Interestingly, papers by Jordan (at that time best known for his work on graphical models) are grouped close to the papers on reinforcement learning. This is because Singh used to be a postdoc of Jordan. Another interesting trend is that papers on new fields of research are embedded at the edges. For instance, papers on reinforcement learning (Barto, Singh and Sutton) lie along the left edge. This is consistent with the fact that they presented some interesting new results during this period (recall that the time span of the dataset is 1987 to 1999).\n\nNote that while MUHSIC groups papers according to authors, thereby preserving the macroscopic structure of the data, it also reveals the microscopic semantics between the papers. For instance, the 4 papers (numbered from 6 to 9 in Figure 3) by Smola, Sch\u00f6lkopf, Hinton and Dayan are very close to each other. Although their titles do not convey strong similarity information, these papers all used handwritten digits for their experiments. A second example is given by the papers of Dayan. Although most of his papers are on the neuroscience side, two of his papers (numbered 14 and 15) on reinforcement learning can be found on the machine learning side. A third example is given by the papers of Bialek and Hinton on spiking neurons (numbered 20, 21 and 23). Although Hinton\u2019s papers are mainly on the left, his paper on spiking Boltzmann machines is closer to Bialek\u2019s two papers on spiking neurons.\n\n8 Discussion\n\nIn summary, MUHSIC provides an embedding of the data which preserves side information possibly available at training time. This way we have a means of controlling which representation of the data we obtain, rather than having to rely on luck that the representation found by MVU just happens to match what we want to obtain. It makes feature extraction robust to spurious interactions between observations and noise (see the appendix for an example of adjacency matrices and further discussion). 
A fortuitous side effect is that if the matrix containing the side information is of low rank, the reduced representation learned by MUHSIC can be of lower rank than that obtained by MVU, too. Finally, we showed that MVU and MUHSIC can be formulated as feature extraction for obtaining maximally dependent features. This provides an information theoretic footing for the (brilliant) heuristic of maximizing the trace of a covariance matrix [1].\n\nThe notion of extracting features of the data which are maximally dependent on the original data is far more general than what we described in this paper. In particular, one may show that feature selection [7] and clustering [8] can also be seen as special cases of this framework.\n\nAcknowledgments NICTA is funded through the Australian Government\u2019s Backing Australia\u2019s Ability initiative, in part through the ARC. This research was supported by the Pascal Network.\n\nReferences\n\n[1] K. Q. Weinberger, F. Sha, and L. K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.\n\n[2] J. Sun, S. Boyd, L. Xiao, and P. Diaconis. The fastest mixing Markov process on a graph and a connection to a maximum variance unfolding problem. SIAM Review, 48(4):681\u2013699, 2006.\n\n[3] A. Gretton, O. Bousquet, A. J. Smola, and B. Sch\u00f6lkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, and E. Tomita, editors, Proceedings Algorithmic Learning Theory, pages 63\u201377, Berlin, Germany, 2005. Springer-Verlag.\n\n[4] K. Weinberger, F. Sha, Q. Zhu, and L. Saul. Graph Laplacian regularization for large-scale semidefinite programming. In Neural Information Processing Systems, 2006.\n\n[5] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res., 5:73\u201399, 2004.\n\n[6] J. 
Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems 17, 2004.\n\n[7] L. Song, A. Smola, A. Gretton, K. Borgwardt, and J. Bedo. Supervised feature selection via dependence estimation. In Proc. Intl. Conf. Machine Learning, 2007.\n\n[8] L. Song, A. Smola, A. Gretton, and K. Borgwardt. A dependence maximization view of clustering. In Proc. Intl. Conf. Machine Learning, 2007.\n", "award": [], "sourceid": 492, "authors": [{"given_name": "Le", "family_name": "Song", "institution": null}, {"given_name": "Arthur", "family_name": "Gretton", "institution": null}, {"given_name": "Karsten", "family_name": "Borgwardt", "institution": ""}, {"given_name": "Alex", "family_name": "Smola", "institution": null}]}