{"title": "Learning Semantic Similarity", "book": "Advances in Neural Information Processing Systems", "page_first": 673, "page_last": 680, "abstract": null, "full_text": "Learning Semantic Similarity \n\nJaz Kandola \nJohn Shawe-Taylor \nRoyal Holloway, University of London \n\n{jaz, john}@cs.rhul.ac.uk \n\nN ella Cristianini \n\nUniversity of California, Berkeley \n\nnello@support-vector.net \n\nAbstract \n\nThe standard representation of text documents as bags of words \nsuffers from well known limitations, mostly due to its inability to \nexploit semantic similarity between terms. Attempts to incorpo(cid:173)\nrate some notion of term similarity include latent semantic index(cid:173)\ning [8], the use of semantic networks [9], and probabilistic methods \n[5]. In this paper we propose two methods for inferring such sim(cid:173)\nilarity from a corpus. The first one defines word-similarity based \non document-similarity and viceversa, giving rise to a system of \nequations whose equilibrium point we use to obtain a semantic \nsimilarity measure. The second method models semantic relations \nby means of a diffusion process on a graph defined by lexicon and \nco-occurrence information. Both approaches produce valid kernel \nfunctions parametrised by a real number. The paper shows how \nthe alignment measure can be used to successfully perform model \nselection over this parameter. Combined with the use of support \nvector machines we obtain positive results. \n\n1 \n\nIntroduction \n\nKernel-based algorithms exploit the information encoded in the inner-products be(cid:173)\ntween all pairs of data items (see for example [1]). This matches very naturally the \nstandard representation used in text retrieval, known as the 'vector space model', \nwhere the similarity of two documents is given by the inner product between high \ndimensional vectors indexed by all the terms present in the corpus. 
The combination of these two methods, pioneered by [6] and subsequently explored by several others, produces powerful methods for text categorization. However, such an approach suffers from well known limitations, mostly due to its inability to exploit semantic similarity between terms: documents sharing terms that are different but semantically related will be considered as unrelated. A number of attempts have been made to incorporate semantic knowledge into the vector space representation. Semantic networks have been considered [9], whilst others use co-occurrence analysis, where a semantic relation is assumed between terms whose occurrence patterns in the documents of the corpus are correlated [3]. Such methods are limited in their flexibility, and the question of how to infer semantic relations between terms or documents from a corpus remains an open issue. In this paper we propose two methods to model such relations in an unsupervised way. The structure of the paper is as follows. Section 2 provides an introduction to how semantic similarity can be introduced into the vector space model. Section 3 derives a parametrised class of semantic proximity matrices from a recursive definition of similarity of terms and documents. A further parametrised class of kernels, based on alternative similarity measures inspired by considering diffusion on a weighted graph of documents, is given in Section 4. In Section 5 we show how the recently introduced alignment measure [2] can be used to perform model selection over the classes of kernels we have defined. Positive experimental results with the methods are also reported in Section 5, before we draw conclusions in Section 6. \n\n2 Representing Semantic Proximity \n\nKernel based methods are an attractive choice for inferring relations from textual data since they enable us to work in a document-by-document setting rather than in a term-by-term one [6]. 
In the vector space model, a document is represented by a vector indexed by the terms of the corpus. Hence, the vector will typically be sparse, with non-zero entries for those terms occurring in the document. Two documents that use semantically related but distinct words will therefore show no similarity. The aim of a semantic proximity matrix [3] is to correct for this by indicating the strength of the relationship between terms that, even though distinct, are semantically related. \n\nThe semantic proximity matrix P is indexed by pairs of terms a and b, with the entry P_ab = P_ba giving the strength of their semantic similarity. If the vectors corresponding to two documents are d_i, d_j, their inner product is now evaluated through the kernel \n\nk(d_i, d_j) = d_i' P d_j, \n\nwhere x' denotes the transpose of the vector or matrix x. The symmetry of P ensures that the kernel is symmetric. We must also require that P is positive semi-definite in order to satisfy Mercer's conditions. In this case we can decompose P = R'R for some matrix R, so that we can view the semantic similarity as a projection into a semantic space \n\nphi: d |-> Rd, \n\nsince \n\nk(d_i, d_j) = d_i' P d_j = <R d_i, R d_j>. \n\nThe purpose of this paper is to infer (or refine) the similarity measure between examples by taking into account higher order correlations, thereby performing unsupervised learning of the proximity matrix from a given corpus. We will propose two methods based on two different observations. \n\nThe first method exploits the fact that the standard representation of text documents as bags of words gives rise to an interesting duality: while documents can be seen as bags of words, terms can simultaneously be viewed as bags of documents - the documents that contain them. In such a model, two documents that have highly correlated term-vectors are considered as having similar content. 
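The role of P can be sketched numerically. Here P is a hypothetical proximity matrix built as B'B (plus a tiny ridge) purely so that it is symmetric positive definite, as Mercer's conditions require; B itself carries no meaning, and the document vectors are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical proximity matrix over 4 terms, constructed as P = B'B plus
# a tiny ridge so that it is symmetric positive definite; B is arbitrary.
B = rng.standard_normal((4, 4))
P = B.T @ B + 1e-9 * np.eye(4)

# Two invented document vectors indexed by the same 4 terms.
d_i = np.array([1.0, 0.0, 2.0, 0.0])
d_j = np.array([0.0, 1.0, 0.0, 3.0])

# Kernel through the proximity matrix: k(d_i, d_j) = d_i' P d_j.
k_ij = d_i @ P @ d_j

# Decomposing P = R'R (here via Cholesky) exhibits the kernel as an inner
# product in the semantic space phi(d) = R d.
L = np.linalg.cholesky(P)    # P = L L'
R = L.T                      # hence P = R'R
assert np.allclose(k_ij, (R @ d_i) @ (R @ d_j))
```

Even though d_i and d_j share no terms, k(d_i, d_j) is generally non-zero, because P couples distinct terms.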
Similarly, two terms that have correlated document-vectors are taken to have a semantic relation. This is of course only a first order approximation, since the knock-on effect of the two similarities on each other needs to be considered. We show that it is possible to define term-similarity based on document-similarity, and vice versa, to obtain a system of equations that can be solved in order to obtain a semantic proximity matrix P. \n\nThe second method exploits the representation of a lexicon (the set of all words in a given corpus) as a graph, where the nodes are indexed by words and where co-occurrence is used to establish links between nodes. Such a representation has been studied recently, giving rise to a number of topological properties [4]. We consider the idea that higher order correlations between terms can affect their semantic relations as a diffusion process on such a graph. Although there can be exponentially many paths connecting two given nodes in the graph, the use of diffusion kernels [7] enables us to obtain the level of semantic relation between any two nodes efficiently, so inferring the semantic proximity matrix from data. \n\n3 Equilibrium Equations for Semantic Similarity \n\nIn this section we consider the first of the two methods outlined in the previous section. Here the aim is to create recursive equations for the relations between documents and between terms. \n\nLet X be the feature/example (term/document in the case of text data) matrix in a possibly kernel-defined feature space, so that X'X gives the kernel matrix K and XX' gives the correlations between different features over the training set. We denote this latter matrix by G. Consider the similarity matrices defined recursively by \n\nK̂ = λ X'ĜX + K  and  Ĝ = λ XK̂X' + G        (1) \n\nWe can interpret this as augmenting the similarity given by K through indirect similarities measured by G, and vice versa. The factor λ < ||K||^{-1} ensures that the longer range effects decay exponentially. Our first result characterises the solution of the above recurrences. \n\nProposition 1 Provided λ < ||K||^{-1} = ||G||^{-1}, the kernels K̂ and Ĝ that solve the recurrences (1) are given by \n\nK̂ = K(I - λK)^{-1}  and  Ĝ = G(I - λG)^{-1}. \n\nProof: First observe that \n\nK(I - λK)^{-1} = (1/λ)(I - λK)^{-1} - (1/λ)(I - λK)(I - λK)^{-1} = (1/λ)(I - λK)^{-1} - (1/λ)I. \n\nNow if we substitute the second recurrence into the first we obtain \n\nλ X'ĜX + K = λ^2 X'XK̂X'X + λ X'XX'X + K \n= λ^2 K(K(I - λK)^{-1})K + λK^2 + K \n= λ^2 K((1/λ)(I - λK)^{-1} - (1/λ)I)K + λK^2 + K \n= λK(I - λK)^{-1}K + K(I - λK)^{-1}(I - λK) \n= K(I - λK)^{-1}, \n\nshowing that the expression does indeed satisfy the recurrence. By the symmetry of the definition, the expression for Ĝ satisfies its recurrence as well. \n\nIn view of the form of the solution we introduce the following definition: \n\nDefinition 2 (von Neumann kernel) Given a kernel K, the derived kernel K(λ) = K(I - λK)^{-1} will be referred to as the von Neumann kernel. \n\nNote that we can view K(λ) as a kernel based on the semantic proximity matrix P = λĜ + I, since \n\nX'PX = X'(λĜ + I)X = λ X'ĜX + K = K(λ). \n\nHence, the solution Ĝ defines a refined similarity between terms/features. In the next section, we will consider the second method of introducing semantic similarity, derived from viewing the terms and documents as vertices of a weighted graph. \n\n4 Semantic Similarity as a Diffusion Process \n\nGraph-like structures within data occur frequently in many diverse settings. In the case of language, the topological structure of a lexicon graph has recently been analyzed [4]. 
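Proposition 1 can be checked numerically on toy data. In the sketch below the matrix sizes, the random features and the value of λ are arbitrary choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy feature-by-example matrix X (4 features, 3 examples; values invented).
X = rng.standard_normal((4, 3))
K = X.T @ X          # kernel matrix between examples
G = X @ X.T          # correlation matrix between features

# Any lambda below 1/||K|| keeps the recursion convergent; note that
# ||K|| = ||G|| since K and G share their non-zero eigenvalues.
lam = 0.5 / np.linalg.norm(K, 2)

# Closed-form solutions claimed by Proposition 1.
K_hat = K @ np.linalg.inv(np.eye(3) - lam * K)
G_hat = G @ np.linalg.inv(np.eye(4) - lam * G)

# They satisfy the coupled recurrences (1):
#   K_hat = lam * X' G_hat X + K   and   G_hat = lam * X K_hat X' + G.
assert np.allclose(K_hat, lam * X.T @ G_hat @ X + K)
assert np.allclose(G_hat, lam * X @ K_hat @ X.T + G)
```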
Such a graph has nodes indexed by all the terms in the corpus, and the edges are given by the co-occurrence between terms in documents of the corpus. Although terms that are connected are likely to have related meaning, terms with a higher degree of separation would not be considered as being related. \n\nA diffusion process on the graph can also be considered as a model of semantic relations existing between indirectly connected terms. Although the number of possible paths between any two given nodes can grow exponentially, results from spectral graph theory have recently been used by [7] to show that it is possible to compute the similarity between any two given nodes efficiently without examining all possible paths. It is also possible to show that the similarity measure obtained in this way is a valid kernel function: the exponentiation operation used in the definition naturally yields the Mercer conditions required for valid kernel functions. \n\nAn alternative insight into semantic similarity, to that presented in Section 2, is afforded if we multiply out the expression for K(λ): \n\nK(λ) = K(I - λK)^{-1} = Σ_{t=1}^∞ λ^{t-1} K^t. \n\nThe entries in the matrix K^t are given by \n\n(K^t)_{ij} = Σ_{u ∈ {1,...,m}^t : u_1 = i, u_t = j} Π_{ℓ=1}^{t-1} K_{u_ℓ u_{ℓ+1}}, \n\nthat is, the sum of the products of the weights over all paths of length t that start at vertex i and finish at vertex j in the weighted graph on the examples. If we view the connection strengths as channel capacities, the entry (K^t)_{ij} can be seen to measure the sum over all routes of the products of the capacities. If the entries are all positive and, for each vertex, the sum of the connections is 1, we can view the entry as the probability that a random walk beginning at vertex i is at vertex j after t steps. It is for these reasons that the kernels defined using these combinations of powers of the kernel matrix have been termed diffusion kernels [7]. 
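The agreement between the closed form K(I - λK)^{-1} and its path-sum series can be illustrated by truncating the series on a small example (toy data, with λ chosen below the convergence bound):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy positive semi-definite kernel matrix on 5 examples (data invented).
A = rng.standard_normal((5, 5))
K = A @ A.T

lam = 0.5 / np.linalg.norm(K, 2)   # keeps the geometric series convergent

# Closed form of the von Neumann kernel ...
K_lam = K @ np.linalg.inv(np.eye(5) - lam * K)

# ... against the truncated series sum_{t>=1} lam^(t-1) K^t, whose t-th
# term accumulates products of edge weights over all length-t paths.
series = np.zeros_like(K)
Kt = np.eye(5)
for t in range(1, 60):
    Kt = Kt @ K                    # Kt now holds K^t
    series += lam ** (t - 1) * Kt

assert np.allclose(K_lam, series)
```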
A similar equation holds for G^t. Hence, examples that lie in a cluster of similar examples become more strongly related, and similar features that occur in a cluster of related features are drawn together in the semantic proximity matrix P. We should stress that the emphasis of this work is not on its diffusion connections, but on its relation to semantic proximity. It is this link that motivates the alternative decay factors considered below. \n\nThe kernel K(λ) combines these indirect link kernels with an exponentially decaying weight. This suggests an alternative weighting scheme that shows faster decay for increasing path length, \n\nK~(λ) = K Σ_{t=0}^∞ (λ^t K^t / t!) = K exp(λK). \n\nThe next proposition gives the semantic proximity matrix corresponding to K~(λ). \n\nProposition 3 Let K~(λ) = K exp(λK). Then K~(λ) corresponds to the semantic proximity matrix exp(λG). \n\nProof: Let X = UΣV' be the singular value decomposition of X, so that K = VΛV' is the eigenvalue decomposition of K, where Λ = Σ'Σ. Then \n\nX' exp(λG) X = VΣ'U' U exp(λΣΣ') U' UΣV' = VΣ' exp(λΣΣ') ΣV' = VΛ exp(λΛ)V' = K exp(λK), \n\nas required. \n\nThe above leads to the definition of the second kernel that we consider. \n\nDefinition 4 Given a kernel K, the derived kernel K~(λ) = K exp(λK) will be referred to as the exponential kernel. \n\n5 Experimental Methods \n\nIn the previous sections we have introduced two new kernel adaptations, in both cases parameterised by a positive real parameter λ. In order to apply these kernels to real text data, we need to develop a method of choosing the parameter λ. Of course one possibility would be just to use cross-validation, as considered by [7]. 
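Proposition 3 can likewise be verified numerically; since G is symmetric, the matrix exponential exp(λG) is formed here from its eigendecomposition (toy sizes and data, invented for this sketch):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy feature-by-example matrix (sizes and values invented for this sketch).
X = rng.standard_normal((4, 3))
K = X.T @ X                        # kernel matrix between examples
G = X @ X.T                        # feature correlation matrix
lam = 0.1

# Exponential kernel K exp(lam K), via the eigendecomposition of K.
w, V = np.linalg.eigh(K)
K_exp = V @ np.diag(w * np.exp(lam * w)) @ V.T

# Proposition 3: the same kernel arises from the proximity matrix
# exp(lam G) acting in feature space, i.e. K exp(lam K) = X' exp(lam G) X.
wG, U = np.linalg.eigh(G)
P = U @ np.diag(np.exp(lam * wG)) @ U.T    # matrix exponential of lam * G
assert np.allclose(K_exp, X.T @ P @ X)
```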
Rather than adopt this rather expensive methodology, we will use a quantitative measure of agreement between the derived kernels and the learning task known as alignment, which measures the degree of agreement between a kernel and a target [2]. \n\nDefinition 5 (Alignment) The (empirical) alignment of a kernel k_1 with a kernel k_2 with respect to the sample S is the quantity \n\nA(S, k_1, k_2) = <K_1, K_2>_F / sqrt(<K_1, K_1>_F <K_2, K_2>_F), \n\nwhere K_i is the kernel matrix for the sample S using kernel k_i, and where we use the following definition of the inner product between Gram matrices, \n\n<K_1, K_2>_F = Σ_{i,j=1}^m K_1(x_i, x_j) K_2(x_i, x_j),        (2) \n\ncorresponding to the Frobenius inner product. From a text categorization perspective this can also be viewed as the cosine of the angle between the two Gram matrices K_1 and K_2 regarded as m^2-dimensional vectors. If we consider K_2 = yy', where y is the vector of outputs (+1/-1) for the sample, then \n\nA(S, K, yy') = <K, yy'>_F / sqrt(<K, K>_F <yy', yy'>_F) = y'Ky / (m ||K||_F).        (3) \n\nThe alignment has been shown to possess several convenient properties [2]. Most notably, it can be computed efficiently before any training of the kernel machine takes place, and based only on training data information; and since it is sharply concentrated around its expected value, its empirical value is stable with respect to different splits of the data. \n\nWe have developed a method for choosing λ to optimise the alignment of the resulting matrix K(λ) or K~(λ) to the target labels on the training set. This method follows similar results presented in [2], but here the parameterisation is non-linear in λ, so that we cannot solve for the optimal value in closed form. We instead seek the optimal value using a line search over the range of possible values of λ for the value at which the derivative of the alignment with respect to λ is zero. 
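The alignment-based model selection described above can be sketched as follows; here a simple grid search over λ stands in for locating the zero of the derivative, and the kernel matrix and labels are synthetic:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic sample: a PSD kernel matrix on m examples with +/-1 labels.
m = 20
A = rng.standard_normal((m, m))
K = A @ A.T / m
y = np.where(rng.standard_normal(m) >= 0, 1.0, -1.0)

def alignment(Kmat, y):
    # Empirical alignment A(S, K, yy') = y'Ky / (m * ||K||_F).
    return (y @ Kmat @ y) / (len(y) * np.linalg.norm(Kmat, 'fro'))

# Exponential kernel family K exp(lam K), via the eigendecomposition of K.
w, V = np.linalg.eigh(K)
def K_exp(lam):
    return V @ np.diag(w * np.exp(lam * w)) @ V.T

# Grid-based line search: evaluate the alignment on a range of lam values
# and keep the maximiser.
grid = np.linspace(0.0, 2.0, 201)
scores = [alignment(K_exp(lam), y) for lam in grid]
lam_star = grid[int(np.argmax(scores))]
print(lam_star, max(scores))
```

Note that evaluating the alignment needs no classifier training, which is what makes this search cheap compared with cross-validation.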
The next two propositions give equations that are satisfied at this point. \n\nProposition 6 If λ* is the solution of λ* = argmax_λ A(S, K~(λ), yy'), and v_i, λ_i are the eigenvector/eigenvalue pairs of the kernel matrix K, then \n\nΣ_{i=1}^m λ_i^2 exp(λ* λ_i) <v_i, y>^2 Σ_{j=1}^m λ_j^2 exp(2 λ* λ_j) = Σ_{i=1}^m λ_i exp(λ* λ_i) <v_i, y>^2 Σ_{j=1}^m λ_j^3 exp(2 λ* λ_j). \n\nProof: First observe that K~(λ) = VMV' = Σ_{i=1}^m μ_i v_i v_i', where M_ii = μ_i(λ) = λ_i exp(λ λ_i). We can express the alignment of K~(λ) as \n\nA(S, K~(λ), yy') = Σ_{i=1}^m μ_i(λ) <v_i, y>^2 / (m sqrt(Σ_{i=1}^m μ_i(λ)^2)). \n\nThe function is a differentiable function of λ, and so at its maximal value the derivative will be zero. Taking the derivative of this expression and setting it equal to zero gives the condition in the proposition statement. \n\nProposition 7 If λ* is the solution of λ* = argmax_{λ ∈ (0, ||K||^{-1})} A(S, K(λ), yy'), and v_i, λ_i are the eigenvector/eigenvalue pairs of the kernel matrix K, then \n\nΣ_{i=1}^m (λ_i^2 / (1 - λ* λ_i)^2) <v_i, y>^2 Σ_{j=1}^m λ_j^2 / (1 - λ* λ_j)^2 = Σ_{i=1}^m (λ_i / (1 - λ* λ_i)) <v_i, y>^2 Σ_{j=1}^m λ_j^3 / (1 - λ* λ_j)^3. \n\nProof: The proof is identical to that of Proposition 6, except that M_ii = μ_i(λ) = λ_i (1 - λ λ_i)^{-1}. \n\nDefinition 8 (Line Search) Optimisation of the alignment can take place by using a line search over the values of λ to find a maximum point of the alignment, by seeking points at which the equations given in Propositions 6 and 7 hold. \n\n5.1 Results \n\nTo demonstrate the performance of the proposed algorithm on text data, the Medline1033 dataset commonly used in text processing [3] was used. This dataset contains 1033 documents and 30 queries obtained from the National Library of Medicine. In this work we focus on query 20. A Bag of Words kernel was used [6]. Stop words and punctuation were removed from the documents and the Porter stemmer was applied to the words. The terms in the documents were weighted according to a variant of the tfidf scheme. 
It is given by log(1 + tf) * log(m/df), where tf represents the term frequency, df the document frequency, and m the total number of documents. A support vector classifier (SVC) was used to assess the performance of the derived kernels on the Medline dataset. A 10-fold cross validation procedure was used to find the optimal value for the capacity control parameter C. Having selected the optimal C parameter, the SVC was re-trained ten times using ten random training and test dataset splits. Error results for the different algorithms are presented together with F1 values. The F1 measure is a popular statistic used in the information retrieval community for comparing performance of \n\n        TRAIN ALIGN     SVC ERROR      F1             λ \nK80     0.851 (0.012)   0.017 (0.005)  0.795 (0.060)  0.197 (0.004) \nB80     0.423 (0.007)   0.022 (0.007)  0.256 (0.351)  - \nK50     0.863 (0.025)   0.018 (0.006)  0.783 (0.074)  0.185 (0.008) \nB50     0.390 (0.009)   0.024 (0.004)  0.456 (0.265)  - \nK20     0.867 (0.029)   0.019 (0.004)  0.731 (0.089)  0.147 (0.04) \nB20     0.325 (0.009)   0.030 (0.005)  0.349 (0.209)  - \n\nTable 1: Medline dataset - Mean and associated standard deviation of alignment, F1 and SVC error values for an SVC trained using the Bag of Words kernel (B) and the exponential kernel (K). The index represents the percentage of training points. 
\n\nTRAIN ALIGN \n0.758 (0.015) \n0.423(0.007) \n0.766 (0.025) \n0.390 (0 .009) \n0.728 (0.012) \n0.325 (0.009) \n\nSVC ERROR \n0.017 (0.004) \n0.022 (0.007) \n0.018 (0.005) \n0.024 (0.004) \n0.028 (0.004) \n0.030 (0.005) \n\nK80 \n\nB80 \n\nK50 \n\nB50 \n\nK 20 \n\nB 20 \n\nF1 \n\nA \n\n0.765 (0.020) \n0.256 (0.351) \n0.701 (0.066) \n0.456 (0.265) \n0.376 (0.089) \n0.349 (0.209) \n\n0.032 (0 .001) \n\n0.039 (0.008) \n\n0.029 (0 .07) \n\nTable 2: Medline dataset - Mean and associated standard deviation alignment, F1 \nand sve error values for a sve trained using the Bag of Words kernel (B) and the \nvon Neumann (K). The index represents the percentage of training points. \n\nalgorithms typically on uneven data. F1 can be computed using F1 = ~~~, where \nP represents precision i.e. a measure of the proportion of selected items that the \nsystem classified correctly, and R represents recall i.e. the proportion of the target \nitems that the system selected. \nApplying the line search procedure to find the optimal value of A for the diffusion \nkernels. All of the results are averaged over 10 random splits with the standard \ndeviation given in brackets. Table 1 shows the results of using the Bag of Words \nkernel matrix (B) and the exponential kernel matrix (K). Table 2 presents the results \nof using the von Neumann kernel matrix (K) together with the Bag of Words kernel \nmatrix for different sizes of the training data. The index represents the percentage \nof training points. The first column of both table 1 and 2 shows the alignments of \nthe Gram matrices to the rank 1 labels matrix for different sizes of training data. \n\nIn both cases the results presented indicate that the alignment of the diffusion \nkernels to the labels is greater than that of the Bag of Words kernel matrix by \nmore than the sum of the standard deviations across all sizes of training data. 
The second column of the tables gives the support vector classifier (SVC) error obtained using the diffusion Gram matrices and the Bag of Words Gram matrix. The SVC error for the diffusion kernels shows a decrease with increasing alignment value. F1 values are also shown and in all instances show an improvement for the diffusion kernel matrices. An interesting observation can be made regarding the F1 value for the von Neumann kernel matrix trained using 20% training data (K20). Despite an increase in alignment value and a reduction of SVC error, the F1 value does not increase as much as that for the exponential kernel trained using the same proportion of the data (K20). This observation implies that the von Neumann kernel needs more data to be effective. This will be investigated in future work. \n\n6 Conclusions \n\nWe have proposed and compared two different methods to model the notion of semantic similarity between documents, by implicitly defining a proximity matrix P in a way that exploits high order correlations between terms. The two methods differ in the way the matrix is constructed. In one view, we propose a recursive definition of document similarity that depends on term similarity, and vice versa. By solving the resulting system of kernel equations, we effectively learn the parameters of the model (P), and construct a kernel function for use in kernel based learning methods. In the other approach, we model semantic relations as a diffusion process in a graph whose nodes are the documents and whose edges incorporate first-order similarity. Diffusion efficiently takes into account all possible paths connecting two nodes, and propagates the 'similarity' between two remote documents that share 'similar terms'. The kernel resulting from this model is known in the literature as the 'diffusion kernel'. 
We have experimentally demonstrated the validity of the approach on text data, using a novel approach to set the adjustable parameter λ in the kernels by optimising their 'alignment' to the target on the training set. For the dataset partitions considered, substantial improvements in performance over the traditional Bag of Words kernel matrix were obtained using the diffusion kernels and the line search method. Despite this success, for large imbalanced datasets such as those encountered in text classification tasks the computational complexity of constructing the diffusion kernels may become prohibitive. Faster kernel construction methods are being investigated for this regime. \n\nReferences \n\n[1] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000. \n[2] Nello Cristianini, John Shawe-Taylor, and Jaz Kandola. On kernel target alignment. In Proceedings of Neural Information Processing Systems, NIPS '01, 2002. \n[3] Nello Cristianini, John Shawe-Taylor, and Huma Lodhi. Latent semantic kernels. Journal of Intelligent Information Systems, 18(2):127-152, 2002. \n[4] R. Ferrer and R. V. Sole. The small world of human language. Proceedings of the Royal Society of London Series B - Biological Sciences, pages 2261-2265, 2001. \n[5] Thomas Hofmann. Probabilistic latent semantic indexing. In Research and Development in Information Retrieval, pages 50-57, 1999. \n[6] T. Joachims. Text categorization with support vector machines. In Proceedings of the European Conference on Machine Learning (ECML), 1998. \n[7] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures. In Proceedings of the International Conference on Machine Learning (ICML 2002), 2002. \n[8] Todd A. Letsche and Michael W. Berry. Large-scale information retrieval with latent semantic indexing. Information Sciences, 100(1-4):105-137, 1997. \n[9] G. Siolas and F. d'Alché-Buc. Support vector machines based on a semantic kernel for text categorization. In Proceedings of IEEE-IJCNN 2000, 2000. \n", "award": [], "sourceid": 2316, "authors": [{"given_name": "Jaz", "family_name": "Kandola", "institution": null}, {"given_name": "Nello", "family_name": "Cristianini", "institution": null}, {"given_name": "John", "family_name": "Shawe-taylor", "institution": null}]}