{"title": "Ranking on Data Manifolds", "book": "Advances in Neural Information Processing Systems", "page_first": 169, "page_last": 176, "abstract": "", "full_text": "Ranking on Data Manifolds\n\nDengyong Zhou, Jason Weston, Arthur Gretton,\n\nOlivier Bousquet, and Bernhard Sch\u00a4olkopf\n\nMax Planck Institute for Biological Cybernetics, 72076 Tuebingen, Germany\n\nf(cid:2)rstname.secondname g@tuebingen.mpg.de\n\nAbstract\n\nThe Google search engine has enjoyed huge success with its web page\nranking algorithm, which exploits global, rather than local, hyperlink\nstructure of the web using random walks. Here we propose a simple\nuniversal ranking algorithm for data lying in the Euclidean space, such\nas text or image data. The core idea of our method is to rank the data\nwith respect to the intrinsic manifold structure collectively revealed by a\ngreat amount of data. Encouraging experimental results from synthetic,\nimage, and text data illustrate the validity of our method.\n\n1\n\nIntroduction\n\nThe Google search engine [2] accomplishes web page ranking using PageRank algorithm,\nwhich exploits the global, rather than local, hyperlink structure of the web [1]. Intuitively,\nit can be thought of as modelling the behavior of a random surfer on the graph of the web,\nwho simply keeps clicking on successive links at random and also periodically jumps to\na random page. The web pages are ranked according to the stationary distribution of the\nrandom walk. Empirical results show PageRank is superior to the naive ranking method, in\nwhich the web pages are simply ranked according to the sum of inbound hyperlinks, and\naccordingly only the local structure of the web is exploited.\nOur interest here is in the situation where the objects to be ranked are represented as vectors\nin Euclidean space, such as text or image data. Our goal is to rank the data with respect\nto the intrinsic global manifold structure [6, 7] collectively revealed by a huge amount of\ndata. 
We believe that for many real-world data types this approach should be superior to a local method, which ranks data simply by pairwise Euclidean distances or inner products.

Let us consider a toy problem to explain our motivation. We are given a set of points constructed in a two-moons pattern (Figure 1(a)). A query is given in the upper moon, and the task is to rank the remaining points according to their relevance to the query. Intuitively, the relevance of points in the upper moon to the query should decrease along the moon shape. The same should happen for the points in the lower moon. Furthermore, all of the points in the upper moon should be more relevant to the query than the points in the lower moon. If we rank the points with respect to the query simply by Euclidean distance, then the left-most points in the lower moon will be more relevant to the query than the right-most points in the upper moon (Figure 1(b)). Clearly this result is not consistent with our intuition (Figure 1(c)).

Figure 1: Ranking on the two moons pattern. The marker sizes are proportional to the ranking in the last two figures. (a) toy data set with a single query; (b) ranking by the Euclidean distances; (c) ideal ranking result we hope to obtain.

We propose a simple universal ranking algorithm which can exploit the intrinsic manifold structure of data. This method is derived from our recent research on semi-supervised learning [8]. In fact, the ranking problem can be viewed as an extreme case of semi-supervised learning, in which only positively labeled points are available. An intuitive description of our method is as follows. We first form a weighted network on the data, and assign a positive ranking score to each query and a zero score to the remaining points, which are then ranked with respect to the queries. 
All points then spread their ranking scores to their nearby neighbors via the weighted network. The spreading process is repeated until a global stable state is achieved, and all points except the queries are ranked according to their final ranking scores.

The rest of the paper is organized as follows. Section 2 describes the ranking algorithm in detail. Section 3 discusses the connections with PageRank. Section 4 further introduces a variant of PageRank which can rank the data with respect to specific queries. Finally, Section 5 presents experimental results on toy data, on digit images, and on text documents, and Section 6 concludes the paper.

2 Algorithm

Given a set of points X = {x1, ..., xq, xq+1, ..., xn} ⊂ R^m, the first q points are the queries and the rest are the points that we want to rank according to their relevance to the queries. Let d : X × X → R denote a metric on X, such as the Euclidean distance, which assigns each pair of points xi and xj a distance d(xi, xj). Let f : X → R denote a ranking function which assigns to each point xi a ranking value fi. We can view f as a vector f = [f1, ..., fn]^T. We also define a vector y = [y1, ..., yn]^T, in which yi = 1 if xi is a query, and yi = 0 otherwise. If we have prior knowledge about the confidences of the queries, then we can assign different ranking scores to the queries, proportional to their respective confidences.

The algorithm is as follows:

1. Sort the pairwise distances among the points in ascending order. Repeatedly connect pairs of points with an edge according to this order until a connected graph is obtained.

2. Form the affinity matrix W defined by Wij = exp[-d^2(xi, xj)/(2σ^2)] if there is an edge linking xi and xj. Note that Wii = 0 because there are no loops in the graph.

3. 
Symmetrically normalize W by S = D^{-1/2} W D^{-1/2}, in which D is the diagonal matrix whose (i, i)-element equals the sum of the i-th row of W.

4. Iterate f(t + 1) = αSf(t) + (1 - α)y until convergence, where α is a parameter in [0, 1).

5. Let fi* denote the limit of the sequence {fi(t)}. Rank each point xi according to its ranking score fi* (largest ranked first).

This iterative algorithm can be understood intuitively. A connected network is formed in the first step. The network is weighted in the second step, and the weights are symmetrically normalized in the third step. The normalization in the third step is necessary to prove the algorithm's convergence. In the fourth step, all points spread their ranking scores to their neighbors via the weighted network. The spreading process is repeated until a global stable state is achieved, and in the fifth step the points are ranked according to their final ranking scores. The parameter α specifies the relative contributions to the ranking scores from the neighbors and from the initial ranking scores. It is worth mentioning that self-reinforcement is avoided, since the diagonal elements of the affinity matrix are set to zero in the second step. In addition, information is spread symmetrically, since S is a symmetric matrix.

Regarding the convergence of this algorithm, we have the following theorem:

Theorem 1 The sequence {f(t)} converges to f* = β(I - αS)^{-1} y, where β = 1 - α.

See [8] for the rigorous proof. Here we only demonstrate how to obtain this closed form expression. 
Suppose f(t) converges to f*. Substituting f* for both f(t + 1) and f(t) in the iteration equation f(t + 1) = αSf(t) + (1 - α)y, we have

f* = αSf* + (1 - α)y, (1)

which can be transformed into

(I - αS)f* = (1 - α)y.

Since (I - αS) is invertible, we have

f* = (1 - α)(I - αS)^{-1} y.

Clearly, the scaling factor β makes no contribution to our ranking task. Hence the closed form is equivalent to

f* = (I - αS)^{-1} y. (2)

We can use this closed form to compute the ranking scores of the points directly. In large-scale real-world problems, however, we prefer the iterative algorithm. Our experiments show that a few iterations are enough to yield high-quality ranking results.

3 Connections with Google

Let G = (V, E) denote a directed graph with n vertices. Let W denote the n × n adjacency matrix, in which Wij = 1 if there is a link in E from vertex xi to vertex xj, and Wij = 0 otherwise. Note that W is possibly asymmetric. Define a random walk on G determined by the following transition probability matrix

P = (1 - ε)U + εD^{-1}W, (3)

where U is the matrix with all entries equal to 1/n. This can be interpreted as a probability ε of transition to an adjacent vertex, and a probability 1 - ε of jumping uniformly at random to any vertex of the graph. The ranking scores over V computed by PageRank are then given by the stationary distribution π of the random walk.

In our case we only consider graphs which are undirected and connected. Clearly, W is symmetric in this situation. 
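For concreteness, before turning to the comparison with PageRank, the iteration of Section 2 and the closed form (2) can be checked against each other in a few lines. This is a toy sketch, not the authors' code: it replaces the incremental graph construction of step 1 with a fully connected RBF-weighted graph, and the parameter values are illustrative.

```python
import numpy as np

def manifold_rank(X, y, sigma=1.0, alpha=0.9, n_iter=1000):
    """Rank points in X given a query indicator vector y (Section 2, steps 2-5).
    Sketch only: uses a fully connected graph instead of the incremental
    edge-adding construction of step 1."""
    # Step 2: RBF affinity with zero diagonal (no self-loops, no self-reinforcement).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 3: symmetric normalization S = D^{-1/2} W D^{-1/2}.
    dinv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * dinv_sqrt[:, None] * dinv_sqrt[None, :]
    # Steps 4-5: iterate f(t+1) = alpha*S*f(t) + (1-alpha)*y to convergence.
    f = y.astype(float).copy()
    for _ in range(n_iter):
        f = alpha * S @ f + (1 - alpha) * y
    return f
```

Per Theorem 1, the iteration limit equals (1 - α)(I - αS)^{-1} y, so the returned scores induce the same ranking as the closed form (2).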
If we also rank all points without queries using our method, as is done by Google, then we have the following theorem:

Theorem 2 For the task of ranking data represented by a connected and undirected graph without queries, f* and PageRank yield the same ranking list.

Proof. We first show that the stationary distribution π of the random walk used in Google is proportional to the vertex degrees if the graph G is undirected and connected. Let 1 denote the 1 × n vector with all entries equal to 1. We have

1DP = 1D[(1 - ε)U + εD^{-1}W] = (1 - ε)1DU + ε1DD^{-1}W = (1 - ε)1D + ε1W = (1 - ε)1D + ε1D = 1D.

Let vol G denote the volume of G, which is given by the sum of the vertex degrees. The stationary distribution is then

π = 1D / vol G. (4)

Note that π does not depend on ε. Hence π is also the stationary distribution of the random walk determined by the transition probability matrix D^{-1}W.

Now consider the ranking result given by our method in the situation without queries. The iteration equation in the fourth step of our method becomes

f(t + 1) = Sf(t). (5)

A standard result [4] of linear algebra states that if f(0) is a vector not orthogonal to the principal eigenvector, then the sequence {f(t)} converges to the principal eigenvector of S. Now let 1 denote the n × 1 vector with all entries equal to 1. Then

S D^{1/2} 1 = D^{-1/2} W D^{-1/2} D^{1/2} 1 = D^{-1/2} W 1 = D^{-1/2} D 1 = D^{1/2} 1.

Further, noticing that the maximal eigenvalue of S is 1 [8], we know that the principal eigenvector of S is D^{1/2} 1. Hence

f* = D^{1/2} 1. (6)

Comparing (4) with (6), it is clear that f* and π give the same ranking list. 
This completes the proof.

4 Personalized Google

Although PageRank is designed to rank all points without respect to any query, it is easy to modify it for query-based ranking problems. Let P = D^{-1}W. The ranking scores given by PageRank are the elements of the convergent solution π* of the iteration equation

π(t + 1) = αP^T π(t). (7)

By analogy with the algorithm in Section 2, we can add a query term on the right-hand side of (7) for query-based ranking:

π(t + 1) = αP^T π(t) + (1 - α)y. (8)

This can be viewed as a personalized version of PageRank. We can show that the sequence {π(t)} converges to π* = (1 - α)(I - αP^T)^{-1} y as before, which is equivalent to

π* = (I - αP^T)^{-1} y. (9)

Now let us analyze the connection between (2) and (9). Note that (9) can be transformed into

π* = [(D - αW)D^{-1}]^{-1} y = D(D - αW)^{-1} y.

In addition, f* can be represented as

f* = [D^{-1/2}(D - αW)D^{-1/2}]^{-1} y = D^{1/2}(D - αW)^{-1} D^{1/2} y. (10)

Hence the main difference between π* and f* is that in the latter the initial ranking score yi of each query xi is weighted with respect to its degree.

The above observation motivates us to propose a more general personalized PageRank algorithm in which we assign different importance to queries with respect to their degrees. 
The iteration and its closed form are

π(t + 1) = αP^T π(t) + (1 - α)D^k y, (11)

π* = (I - αP^T)^{-1} D^k y. (12)

If k = 0, (12) reduces to (9); and if k = 1, we have

π* = (I - αP^T)^{-1} D y = D(D - αW)^{-1} D y,

which is almost the same as (10).

We can also use (12) for classification problems without any modification, apart from setting the elements of y to 1 or -1 according to the positive or negative classes of the labeled points, and to 0 for the unlabeled data. This shows that the ranking and classification problems are closely related.

We can carry out a similar analysis of the relations to Kleinberg's HITS [5], which is another popular web page ranking algorithm. The basic idea of this method is also to iteratively spread the ranking scores via the existing web graph. We omit further discussion of this method due to lack of space.

5 Experiments

We validate our method on a toy problem and in two real-world domains: image and text. In the following experiments we use the closed form expression, with α fixed at 0.99. As a true labeling is known in these problems, i.e. the image and document categories (which is not true in real-world ranking problems), we can compute the ranking error using the Receiver Operating Characteristic (ROC) score [3] to evaluate the ranking algorithms. The returned score is between 0 and 1, a score of 1 indicating a perfect ranking.

5.1 Toy Problem

In this experiment we considered the toy ranking problem mentioned in the introduction. The connected graph described in the first step of our algorithm is shown in Figure 2(a). The ranking scores at the different time steps t = 5, 10, 50, 100 are shown in Figures 2(b)-(e). 
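A small script along these lines reproduces the qualitative behaviour of the two-moons experiment. This is a sketch: the moon geometry, noise level, σ and α below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

# Sample a two-moons data set: two interleaved half-circles with Gaussian noise.
rng = np.random.default_rng(1)
n = 100  # points per moon
t = rng.uniform(0.0, np.pi, n)
upper = np.c_[np.cos(t), np.sin(t)] + rng.normal(scale=0.05, size=(n, 2))
lower = np.c_[1.0 - np.cos(t), 0.5 - np.sin(t)] + rng.normal(scale=0.05, size=(n, 2))
X = np.vstack([upper, lower])

# A single query: the first point of the upper moon.
y = np.zeros(2 * n)
y[0] = 1.0

# RBF affinity with zero diagonal, symmetric normalization, closed-form scores (2).
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / (2 * 0.1 ** 2))
np.fill_diagonal(W, 0.0)
dis = 1.0 / np.sqrt(W.sum(1))
S = W * dis[:, None] * dis[None, :]
f = np.linalg.solve(np.eye(2 * n) - 0.99 * S, y)

# The query's moon should dominate the ranking, as in Figure 2.
upper_mean, lower_mean = f[1:n].mean(), f[n:].mean()
```

With a kernel width much smaller than the gap between the moons, the scores diffuse along each arc, so the query's moon receives far more ranking mass than the other moon.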
Note that the scores on each moon decrease along the moon shape away from the query, and the scores on the moon containing the query point are larger than those on the other moon. Ranking by Euclidean distance is shown in Figure 2(f), which fails to capture the two-moons structure.

It is worth mentioning that simply ranking the data according to shortest paths [7] on the graph does not work well. In particular, we draw the reader's attention to the long edge in Figure 2(a) which links the two moons. It appears that shortest paths are sensitive to small changes in the graph. The robust solution is to assemble all paths between two points and weight them by a decreasing factor. This is exactly what our method does: note that the closed form can be expanded as f* = Σ_i α^i S^i y.

5.2 Image Ranking

In this experiment we address the task of ranking on the USPS dataset of 16 × 16 handwritten digit images. We rank digits from 1 to 6 in our experiments. There are 1269, 929, 824, 852, 716 and 834 examples of each class, for a total of 5424 examples.

Figure 2: Ranking on the pattern of two moons. 
(a) connected graph; (b)-(e) ranking at the different time steps t = 5, 10, 50, 100; (f) ranking by Euclidean distance.

Figure 3: ROC scores on USPS for queries from digits 1 to 6, plotted against the number of queries, comparing manifold ranking with the Euclidean-distance baseline. Note that these experimental results also provide indirect evidence of the intrinsic manifold structure in USPS.

Figure 4: Ranking digits on USPS. The top-left digit in each panel is the query. The left panel shows the top 99 returned by manifold ranking; the right panel shows the top 99 returned by the Euclidean-distance-based ranking. Note that there are many more 2s with knots in the right panel.

We randomly select examples from one class of digits to be the query set over 30 trials, and then rank the remaining digits with respect to these query sets. We use an RBF kernel with width σ = 1.25 to construct the affinity matrix W, but the diagonal elements are set to zero. 
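The ROC score used throughout Section 5 can be computed with a small helper. This is a sketch of the standard pairwise (AUC) formulation, not necessarily the exact implementation behind [3].

```python
def roc_score(scores, relevant):
    """ROC (AUC) score of a ranking: the probability that a randomly chosen
    relevant item is scored above a randomly chosen irrelevant one, with
    ties counted as half. A score of 1 indicates a perfect ranking."""
    pos = [s for s, r in zip(scores, relevant) if r]
    neg = [s for s, r in zip(scores, relevant) if not r]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# A ranking that places both relevant items first is perfect:
# roc_score([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]) -> 1.0
```

Applied to the experiments here, `scores` are the ranking values f* (or the baseline distances, negated) and `relevant` marks whether each point shares the query's class.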
The Euclidean-distance-based ranking method is used as the baseline: given a query set {xs} (s ∈ S), the points x are ranked in increasing order of the score min_{s∈S} ||x - xs||, so that the highest rank is given to the point with the lowest score.

The results, measured as ROC scores, are summarized in Figure 3; each plot corresponds to a different query class, from digit one to six respectively. Our algorithm is comparable to the baseline when a digit 1 is the query. For the other digits, however, our algorithm significantly outperforms the baseline. This experimental result also provides indirect evidence of the underlying manifold structure in the USPS digit dataset [6, 7].

The top 99 images returned by our algorithm and by Euclidean distance, with a random digit 2 as the query, are shown in Figure 4. The top-left digit in each panel is the query. Note that there are some 3s in the right panel. Furthermore, there are many curly 2s in the right panel which do not match the query well: the 2s in the left panel are more similar to the query than the 2s in the right panel. This subtle superiority matters a great deal in real-world ranking tasks, in which users are only interested in the very few leading ranking results. The ROC measure is too coarse to reflect this subtle superiority, however.

5.3 Text Ranking

In this experiment we investigate the task of text ranking using the 20-newsgroups dataset. We choose the topic rec, which contains autos, motorcycles, baseball and hockey, from the version 20-news-18828. The articles are processed by the Rainbow software package with the following options: (1) passing all words through the Porter stemmer before counting them; (2) tossing out any token which is on the stoplist of the SMART system; (3) skipping any headers; (4) ignoring words that occur in 5 or fewer documents. 
No further preprocessing was done. After removing the empty documents, we obtain 3970 document vectors in an 8014-dimensional space. Finally, the documents are normalized to the TFIDF representation.

We use the ranking method based on normalized inner products as the baseline. The affinity matrix W is also constructed from inner products, i.e. a linear kernel. The ROC scores for 100 randomly selected queries for each class are given in Figure 5.

Figure 5: ROC score scatter plots (manifold ranking vs. inner product) of 100 random queries from the categories autos, motorcycles, baseball and hockey contained in the 20-newsgroups dataset.

6 Conclusion

Future research should address model selection. Potentially, if one were given a small labeled set or a query set of size greater than 1, one could use standard cross-validation techniques. In addition, it may be possible to look to the theory of algorithmic stability to choose appropriate hyperparameters. There are also a number of possible extensions to the approach. For example, one could implement an iterative feedback framework: as the user specifies positive feedback, this can be used to extend the query set and improve the ranking output. 
Finally, and most importantly, we are interested in applying this algorithm to wide-ranging real-world problems.

References

[1] R. Albert, H. Jeong, and A.-L. Barabási. Diameter of the world wide web. Nature, 401:130–131, 1999.

[2] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. 7th International World Wide Web Conf., 1998.

[3] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley-Interscience, 2nd edition, 2000.

[4] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 1989.

[5] J. Kleinberg. Authoritative sources in a hyperlinked environment. JACM, 46(5):604–632, 1999.

[6] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.

[7] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.

[8] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In 18th Annual Conf. on Neural Information Processing Systems, 2003.
", "award": [], "sourceid": 2447, "authors": [{"given_name": "Dengyong", "family_name": "Zhou", "institution": null}, {"given_name": "Jason", "family_name": "Weston", "institution": null}, {"given_name": "Arthur", "family_name": "Gretton", "institution": null}, {"given_name": "Olivier", "family_name": "Bousquet", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}]}