{"title": "Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text", "book": "Advances in Neural Information Processing Systems", "page_first": 1945, "page_last": 1952, "abstract": "In this paper, we address the question of what kind of knowledge is generally transferable from unlabeled text. We suggest and analyze the semantic correlation of words as a generally transferable structure of the language and propose a new method to learn this structure using an appropriately chosen latent variable model. This semantic correlation contains structural information of the language space and can be used to control the joint shrinkage of model parameters for any specific task in the same space through regularization. In an empirical study, we construct 190 different text classification tasks from a real-world benchmark, and the unlabeled documents are a mixture from all these tasks. We test the ability of various algorithms to use the mixed unlabeled text to enhance all classification tasks. Empirical results show that the proposed approach is a reliable and scalable method for semi-supervised learning, regardless of the source of unlabeled data, the specific task to be enhanced, and the prediction model used.", "full_text": "Learning the Semantic Correlation: An\n\nAlternative Way to Gain from Unlabeled Text\n\nYi Zhang\n\nMachine Learning Department\n\nCarnegie Mellon University\nyizhang1@cs.cmu.edu\n\nJeff Schneider\n\nThe Robotics Institute\n\nCarnegie Mellon University\nschneide@cs.cmu.edu\n\nArtur Dubrawski\n\nThe Robotics Institute\n\nCarnegie Mellon University\n\nawd@cs.cmu.edu\n\nAbstract\n\nIn this paper, we address the question of what kind of knowledge is gen-\nerally transferable from unlabeled text. We suggest and analyze the se-\nmantic correlation of words as a generally transferable structure of the\nlanguage and propose a new method to learn this structure using an ap-\npropriately chosen latent variable model. 
This semantic correlation con-\ntains structural information of the language space and can be used to\ncontrol the joint shrinkage of model parameters for any speci\ufb01c task in\nthe same space through regularization. In an empirical study, we con-\nstruct 190 different text classi\ufb01cation tasks from a real-world benchmark,\nand the unlabeled documents are a mixture from all these tasks. We test\nthe ability of various algorithms to use the mixed unlabeled text to en-\nhance all classi\ufb01cation tasks. Empirical results show that the proposed\napproach is a reliable and scalable method for semi-supervised learn-\ning, regardless of the source of unlabeled data, the speci\ufb01c task to be\nenhanced, and the prediction model used.\n\n1 Introduction\n\nThe availability of large amounts of unlabeled data such as text on the Internet is a strong\nmotivation for research in semi-supervised learning [4]. Currently, most of these methods\nassume that the unlabeled data belong to the same classes or share the generative distri-\nbutions with the labeled examples, e.g., generative models [10], low-density separation\n[8, 13], and graph-based methods [3]. As indicated in [11], unlabeled data in real-world\napplications do not necessarily follow the classes or distribution of labeled examples, and\nsemi-supervised learning algorithms that give up this assumption have wider applicability\nin practice. As a result, some algorithms avoid using unlabeled examples directly in model\ntraining and instead focus on \u201cchanges of representation\u201d that \ufb01nd a more informative rep-\nresentation from unlabeled data and use it to encode the labeled examples [4, 1, 11].\n\n\fHowever, even algorithms for learning good features from unlabeled data still make a strong\nassumption: those learned high-level features will be relevant to the speci\ufb01c prediction task\nat hand. This assumption might be problematic. 
Many functions can be de\ufb01ned over an\ninput space and a speci\ufb01c task corresponds to only one of them. The feature extraction\non unlabeled data is an unsupervised process and thus a \u201cblindly\u201d learned representation\nmight be irrelevant to a speci\ufb01c task, especially when the unlabeled data are not from\nthe same task. To tackle this problem, some recent work avoids blind feature extraction by\nincorporating external knowledge about the task being enhanced [1]: the high-level features\nare learned by principal component analysis on the weights of several models, and these\nmodels are trained from some \u201cauxiliary\u201d tasks constructed by domain knowledge.\n\nIn this paper, we explore the possibility of extracting generally transferable knowledge\nfrom unlabeled text without information about the task to be enhanced. This knowledge\nis represented as the semantic correlation structure of the words in the text domain and\nis shown to be transferable among documents of different themes. This structure is ex-\ntracted using a latent topic model combined with a bootstrapping procedure. The rationale\nis that the latent topics (or more generally, high-level features) extracted from unlabeled\ndata might be irrelevant to a particular task, but the word distribution in these topics reveals\nthe structural information of the language, represented by the semantic correlation among\nwords. For any speci\ufb01c task de\ufb01ned on the same input space, this information can be used\nto control the joint shrinkage of model parameters through informative regularization.\n\nThe use of covariance or correlation structure has already been mentioned in transfer learn-\ning [12, 9]. A covariance structure can be transferred from a few related tasks to a target\ntask [12] or inferred from meta-features [9]. 
In fact, one way to view the present work\nis: 1) we automatically construct a large number of diverse but meaningful \u201ctasks\u201d from\nunlabeled text without using external knowledge, where each \u201ctask\u201d is actually extracted\nas a latent variable; 2) we propose to learn the semantic correlation structure of the word\nspace from these dummy tasks and show that this structure is generally transferable regard-\nless of the source of unlabeled data; 3) this structure can be ef\ufb01ciently incorporated into a\nbroad category of prediction models via regularization, which leads to a very scalable and\napplicable semi-supervised learning framework.\n\n2 Semantic Correlation: Transferable Structure from Unlabeled Text\n\n2.1 Latent Topics and Semantic Structure\n\nLatent topics extracted from unlabeled text might be irrelevant to a particular task, but the\ncomposition of these topics in terms of word distribution reveals information about the\nsemantic structure of the language. Assume a latent topic model [7, 2] of the word space\nX, or more generally, a latent variable model characterizing the input space X:\n\nx = Az\n\n(1)\n\nwhere x = [x1, x2, . . . , xp]T is the p-dimensional vector of input variables, and z =\n[z1, z2, . . . , zk]T represents latent variables in the k-dimensional latent space Z. A is a\np \u00d7 k matrix, representing a generative process from a probabilistic view or a projection\nfrom a deterministic view. For a latent topic model, x corresponds to the bag-of-words\nvector of a document divided by the document length, z is the distribution of k latent topics\nin the document, and A is the distribution of p words in k latent topics. Various models \ufb01t\nin this formula including PCA, ICA, sparse coding, and non-negative matrix factorization.\n\nDifferent documents have different topic distributions, z, and thus different word dis-\ntributions, x, but A can be considered an invariant structure of the language. 
Each p-dimensional column vector of A denotes the word distribution in a latent topic, and serves as an "observation" in the p-dimensional word space, indicating the semantic roles of the p words in this topic. Given a large set of k latent topics represented by k p-dimensional vectors {a_(·,1), a_(·,2), . . . , a_(·,k)}, we can define the semantic covariance of p words as follows. Let A denote the matrix formed by treating each vector a_(·,t), t = 1, 2, . . . , k as a column, and let a_(i,·) and a_(i,t) denote a row vector and an element of this matrix, respectively. The semantic covariance of word i and word j is defined as:

    cov_s(x_i, x_j) = (1/k) Σ_{t=1}^{k} (a_(i,t) − ā_(i,·))(a_(j,t) − ā_(j,·)) = (1/k) Σ_{t=1}^{k} a_(i,t) a_(j,t) − ā_(i,·) ā_(j,·)    (2)

where ā_(i,·) is the mean of the ith row in A. Naturally, the semantic correlation is:

    corr_s(x_i, x_j) = cov_s(x_i, x_j) / sqrt(cov_s(x_i, x_i) cov_s(x_j, x_j))    (3)

2.2 Comparing Semantic Correlation and Data Correlation

Suppose we observe a set of n documents in word space X, denoted by an n × p data matrix D_X where each document corresponds to a p-dimensional bag-of-words vector of counts. We refer to the correlation between words computed directly from D_X as the data correlation. This data correlation may not be transferable between tasks since documents from different themes may have distinct topic distributions and word distributions, which lead to different word correlations in data space.

Here we show intuitively why we expect the data correlation to have limited use across distinct tasks, while we expect the semantic correlation to be transferable. Consider the latent variable model in eq. (1), which relates A to data space X. We focus on semantic covariance and data covariance, and assume that the bag-of-words vector is divided by the length of the document so that it corresponds to x in eq. (1). From eq.
(1), an input variable x_i can be written as x_i = Σ_{t=1}^{k} a_(i,t) z_t, and therefore, the data covariance of word i and word j can be expressed as:

    cov(x_i, x_j) = E[(x_i − Ex_i)(x_j − Ex_j)]                                                    (4)
                  = E[ (Σ_{t=1}^{k} a_(i,t)(z_t − Ez_t)) (Σ_{t'=1}^{k} a_(j,t')(z_t' − Ez_t')) ]
                  = Σ_{t=1}^{k} Σ_{t'=1}^{k} a_(i,t) a_(j,t') E[(z_t − Ez_t)(z_t' − Ez_t')]
                  = Σ_{t=1}^{k} Σ_{t'=1}^{k} a_(i,t) a_(j,t') cov(z_t, z_t')

Thus, data covariance is directly related to the covariance among latent topics. Documents from different sources have different topic distributions and thus different covariance terms cov(z_t, z_t') in latent space. As a result, the data covariance learned from one source of documents may not be transferable to another class of documents. On the other hand, the semantic covariance in eq. (2) is completely determined by the structure of A.

Intuitively, the data covariance among words must contain some information about the semantic relationship of words. This can also be observed from eq. (4). If we ignore the effect of the covariance among topics by assuming that latent topics are independently distributed and have the same variance (denoted as σ²), eq. (4) can be written as:

    cov(x_i, x_j) = σ² Σ_{t=1}^{k} a_(i,t) a_(j,t)    (5)

Algorithm 1 Estimation of semantic correlation structure
    Input: data D = D_u ∪ D_l, latent variable model M
    Output: semantic correlation matrix Σ_s
    Parameters: α, k, N
    Initialize V ← ∅
    repeat
        D_samp ← Sampling(D, α)
        {(z_1, a_(·,1)), (z_2, a_(·,2)), . . . , (z_k, a_(·,k))} ← M(k, D_samp)
        V ← V ∪ {a_(·,1), a_(·,2), . . . , a_(·,k)}
    until |V| ≥ kN
    Compute Σ_s: Σ_s(i, j) ← corr_s(x_i, x_j)

Comparing this to the last form in eq. (2), we see the similarity between data and semantic covariance.
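The identity in eq. (4), Cov(x) = A Cov(z) Aᵀ, and the fact that the semantic covariance of eq. (2) depends only on A, can be checked numerically. This is a sketch on synthetic data: the matrix A below is random rather than a fitted topic model, and the topic covariance Σ_z is chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 6, 3                          # p words, k latent topics
A = rng.random((p, k))               # word-by-topic matrix from eq. (1): x = Az

# Semantic covariance, eq. (2): covariance of the rows of A across its k columns.
def semantic_cov(A):
    return np.cov(A, rowvar=True, bias=True)

# Data covariance under the model, eq. (4): Cov(x) = A Cov(z) A^T.
Sigma_z = np.diag([1.0, 4.0, 0.25])  # topic covariance for one document "source"
data_cov = A @ Sigma_z @ A.T

# Monte-Carlo check of eq. (4): sample z, form x = Az, estimate Cov(x).
z = rng.multivariate_normal(np.zeros(k), Sigma_z, size=500_000)
x = z @ A.T
emp_cov = np.cov(x, rowvar=False, bias=True)
print(np.allclose(emp_cov, data_cov, atol=0.05))   # True

# A different source (different Cov(z)) changes the data covariance,
# while semantic_cov(A) is fixed by A alone.
other_cov = A @ np.eye(k) @ A.T
print(np.allclose(data_cov, other_cov))            # False
```

The second check mirrors the argument in the text: data covariance moves with cov(z_t, z_t'), so it is source-dependent, whereas the semantic covariance is invariant once A is fixed.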
In fact, our empirical study shows that data correlation from unlabeled text does contain useful information, but is not as informative as semantic correlation.

3 Semantic Structure Learning and Informative Regularization

Consider a set of n_l labeled documents D_l = {(x_i^l, y_i^l) ∈ X × Y_l, i = 1, · · · , n_l}, where X ⊆ R^p is the p-dimensional word space, and Y_l = {−1, 1} for classification and Y_l ⊆ R for regression. Also assume that a large set of n_u unlabeled documents D_u = {x_i^u ∈ X, i = 1, · · · , n_u} is available. The goal is to learn a good function f_l : X → Y_l, which is a classifier or a regressor. In this section we introduce a framework to transfer knowledge from unlabeled text. Section 3.1 proposes an approach to learning the semantic structure of the word space from a set of unlabeled text. In section 3.2, we discuss how to efficiently apply the learned structure to a broad category of prediction models through regularization.

3.1 Learning the Semantic Correlation

The semantic correlation among words can be estimated using eq. (3) by observing a large number of different latent topics. However, obtaining a large set of diverse but meaningful topics is hard, since the number of meaningful topics extracted by a latent topic model is usually not very large. To solve this problem, resampling techniques such as bootstrapping [5] can be combined with a chosen latent variable model, which provides a principled way to estimate the semantic correlation. The procedure is given in Algorithm 1, which uses all the available data D = D_u ∪ D_l and a latent variable model M as the input. The algorithm repeats N iterations. In each iteration it draws an α percentage sample¹ from the data and extracts k latent topics from the sample by applying the model M.
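A minimal sketch of this bootstrap procedure, assuming a pluggable `topic_model(k, docs)` callable that returns a p × k matrix of topic word distributions (the matrix A in eq. (1)); the SVD-based stand-in below is purely illustrative and not the paper's latent Dirichlet allocation:

```python
import numpy as np

def semantic_correlation(D, topic_model, k=5, alpha=0.5, N=20, seed=0):
    """Bootstrap-style estimation of the p x p semantic correlation matrix.

    topic_model(k, docs) must return a p x k matrix whose columns are the
    word distributions of k latent topics."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    cols = []
    for _ in range(N):                                           # repeat N times
        idx = rng.choice(n, size=int(alpha * n), replace=False)  # alpha-sample, no replacement
        A = topic_model(k, D[idx])                               # k topics from the sample
        cols.append(A)                                           # V <- V ∪ {a_(·,1), ..., a_(·,k)}
    V = np.hstack(cols)                                          # p x (kN) topic observations
    return np.corrcoef(V)                                        # Sigma_s(i, j) = corr_s(x_i, x_j)

# Illustration only: a crude SVD "topic model" standing in for LDA.
def svd_topics(k, docs):
    U, _, _ = np.linalg.svd(docs.T, full_matrices=False)
    return np.abs(U[:, :k])

rng = np.random.default_rng(1)
D = rng.poisson(1.0, size=(200, 30)).astype(float)   # 200 docs, 30 words
Sigma_s = semantic_correlation(D, svd_topics, k=5, alpha=0.5, N=10)
print(Sigma_s.shape)   # (30, 30)
```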
After N iterations, the p × p semantic correlation matrix Σ_s is estimated from the kN observations of word distribution in latent topics. The algorithm requires an appropriate latent variable model M (e.g., latent Dirichlet allocation for text data), and a number k of latent variables extracted each iteration from the sampled data. The number of iterations N is set as large as necessary to obtain a reliable estimation.

¹In this paper, we use α = 50% sampling without replacement. Other choices can be made.

3.2 Knowledge Transfer by Informative Regularization

This section discusses how to use the semantic structure Σ_s in any specific learning task defined on the input space X. For the prediction model, we mainly consider regularized linear models with an l-2 norm penalty, e.g., support vector machines, ridge regression, logistic regression with a Gaussian prior, etc. The model is represented by a p-dimensional weight vector w and an intercept b. The prediction is computed as w^T x + b for regression, or by setting a threshold θ (usually θ = 0) on w^T x + b for classification. To learn w and b, we minimize a loss function L on the training examples plus a regularization term on w:

    argmin_{w,b} Σ_{i=1}^{n_l} L(y_i^l, w^T x_i^l + b) + λ w^T w    (6)

Different models correspond to different loss functions [6], e.g., SVMs use hinge loss, logistic regression uses log-likelihood loss, and ridge regression uses squared error loss. The regularization term λ w^T w = λ w^T I^{−1} w is well known to be equivalent to the Bayesian approach that imposes a Gaussian prior with zero mean and an identity correlation matrix. The correlation is often set to an identity matrix due to lack of knowledge about the input space. If a covariance or correlation structure is known, e.g., the semantic structure of the word space, the prior can be more informative [12].
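Concretely, such an informative Gaussian prior corresponds to a quadratic penalty λ w^T Σ^{−1} w, which reduces to standard l-2 regularization after transforming the inputs by Σ^{1/2}. A small numerical check with ridge regression (synthetic data; a random positive-definite matrix standing in for the semantic correlation; intercept omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 50, 8, 0.7
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# A synthetic positive-definite "semantic correlation" stand-in.
B = rng.standard_normal((p, p))
Sigma = B @ B.T + p * np.eye(p)

# Direct solution of the informative-penalty ridge problem:
#   argmin_w ||y - Xw||^2 + lam * w^T Sigma^{-1} w
w_direct = np.linalg.solve(X.T @ X + lam * np.linalg.inv(Sigma), X.T @ y)

# Equivalent three-step route: transform, solve standard ridge, map back.
evals, evecs = np.linalg.eigh(Sigma)
Sigma_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T   # symmetric square root
X_tilde = X @ Sigma_half                                 # x~ = Sigma^{1/2} x
w_tilde = np.linalg.solve(X_tilde.T @ X_tilde + lam * np.eye(p), X_tilde.T @ y)
w_transformed = Sigma_half @ w_tilde                     # w = Sigma^{1/2} w~

print(np.allclose(w_direct, w_transformed))   # True: the two routes agree
```

The agreement follows from w^T x = w̃^T x̃ and w^T Σ^{−1} w = w̃^T w̃ once w = Σ^{1/2} w̃, so any off-the-shelf l-2 solver can be reused.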
Incorporating Σ_s into the Gaussian prior leads to a new regularization term and the resulting model is:

    argmin_{w,b} Σ_{i=1}^{n_l} L(y_i^l, w^T x_i^l + b) + λ w^T Σ_s^{−1} w    (7)

Extending the discussion on SVMs in [9], all regularized linear models in the form of eq. (7) can be easily solved by three steps. First, transform the training examples by:

    x̃_i^l = Σ_s^{1/2} x_i^l    (8)

Second, learn the standard linear model in the transformed space:

    argmin_{w̃,b} Σ_{i=1}^{n_l} L(y_i^l, w̃^T x̃_i^l + b) + λ w̃^T w̃    (9)

Finally, the optimal solution for (7) is obtained by:

    w = Σ_s^{1/2} w̃    (10)

This equivalence is derived from w^T x_i^l = w̃^T x̃_i^l and w^T Σ_s^{−1} w = w̃^T w̃. Semantic correlation is transferable to any specific task and thus can be computed offline. As a result, semi-supervised learning for any task simply requires the linear transformation in eq. (8) before training on the labeled examples, which is very scalable.

4 Experiments

We use the by-date version of the 20-NewsGroups data set², where 11314 training and 7532 testing documents are divided by date and denoted as D_tr and D_ts here. Documents are represented by bag-of-words vectors. The vocabulary is built to include the most frequent 200 words in each of the 20 newsgroups, while the 20 most frequent words over all 20 newsgroups are removed. This yields an input space X with p = 1443 features (words). Documents come from 20 newsgroups, so we construct 190 binary classification tasks, one for each pair of newsgroups. For each task, a few documents in the two newsgroups are selected from D_tr as the labeled examples, denoted as D_l in section 3. The rest of the documents in D_tr are used as the unlabeled data, denoted by D_u. Note that D_u is a mixture from all the 20 newsgroups.
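The task construction just described (one binary task per pair of newsgroups) can be sketched as follows; the group names are stand-ins, not the actual 20-NewsGroups labels:

```python
from itertools import combinations

newsgroups = [f"group{i}" for i in range(20)]   # stand-ins for the 20 newsgroup names

# One binary classification task per pair of newsgroups: C(20, 2) = 190 tasks.
tasks = list(combinations(newsgroups, 2))
print(len(tasks))   # 190

# For a task (g_pos, g_neg), documents from g_pos are labeled +1, documents
# from g_neg are labeled -1, and all remaining documents serve as unlabeled
# data D_u -- hence D_u is a mixture over all 20 newsgroups.
g_pos, g_neg = tasks[0]
def label(doc_group):
    if doc_group == g_pos: return +1
    if doc_group == g_neg: return -1
    return None             # unlabeled for this task
print(label("group0"), label("group1"), label("group5"))   # 1 -1 None
```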
In this sense, semi-supervised learning algorithms that assume the unlabeled data come from the target task or the same generative distribution are unlikely to work very well. The test data for each binary task are all the relevant documents in D_ts, i.e., documents in D_ts that belong to one of the two chosen newsgroups. For any task we always have D_u ∪ D_l = D_tr, so Algorithm 1 is run only once on D_tr to learn the semantic correlation structure Σ_s that is used by all 190 tasks.

²http://people.csail.mit.edu/jrennie/20Newsgroups/

The documents are well distributed over the 20 newsgroups and thus there are large numbers of training documents in D_tr for each newsgroup. To limit the number of labeled examples for each binary prediction task, we use 5%, 10%, 20% of the relevant documents in D_tr as the labeled examples D_l, and the rest of the relevant and all irrelevant documents in D_tr as the unlabeled data D_u. We denote these tests as 5%-Test, 10%-Test, and 20%-Test. The result of each test is averaged over 10 random runs, with D_l randomly selected from D_tr. The testing data for each task are fixed to be all relevant documents in D_ts, which is invariant for a task among different tests and random runs. Methods for comparison are as follows.

(1) Comparison based on SVM.
For each classi\ufb01cation task, we compare: SVM di-\nrectly trained on labeled examples Dl (denoted SV M ), SVM trained on Dl in the latent\ntopic space extracted by latent dirichlet allocation on Dl \u222a Du [2] (denoted SV MLDA),\nSVM trained on Dl in principal component space extracted by PCA on Dl \u222a Du (denoted\nSV MP CA), SVM trained on Dl via informative regularization with semantic correlation\n\u03a3s in the prior (denoted SV MIR), SVM trained on Dl via informative regularization with\ndata correlation in the prior (denoted SV MIR(data)), where the data correlation \u03a3 is esti-\nmated from bag-of-words vectors of documents in Dl \u222a Du.\n(2) Comparison based on L-2 Regularized Logistic Regression. Analogous to the SVM\ncomparison with logistic regression (denoted LGR) as the base classi\ufb01er.\n(3) Comparison based on ridge regression. Ridge regression (denoted RR) is used as the\nbase classi\ufb01er: examples are labeled as +1 and \u22121, and prediction is made by wT x+b > 0.\n(4) Comparison to semi-supervised SVM. Recently a fast semi-supervised SVM using\nL-2 loss was proposed [13], which makes it possible to handle large-scale unlabeled doc-\numents. We compare: L2-SVM directly trained on Dl (L2-SV M ), semi-supervised L2-\nSVM trained on Dl \u222a Du (L2-S3V M ), and L2-SVM trained on Dl via informative regu-\nlarization with semantic correlation (L2-SV MIR). The semi-supervised SVM should not\nwork well since the unlabeled data is a mixture from all tasks. Therefore, we also test an\n\u201coracle\u201d semi-supervised SVM, using labeled examples together with unlabeled examples\ncoming only from the two relevant newsgroups (L2-S3V Moracle).\nHere are additional implementation details. The regularization parameter \u03bb for each\nmodel is determined by 5-fold cross-validation in the range 10\u22126 to 106. LibSVM 2.85\nis used for SVM. 
For PCA, we tried 10, 20, 30, 50, 100, 200, 400 principal components\nand report PCA using 200 principal components as the best result. For latent dirichlet\nallocation, we use the implementation at http://chasen.org/\u223cdaiti-m/dist/lda/. We tried\nk = 10, 20, 30, 50, 100, 200 latent topics with 30 topics performing best. For the proposed\nmethod, Algorithm 1 uses latent dirichlet allocation with k = 30 topics per sampling, re-\npeats N = 100 iterations, and \u03a3s is estimated from these 3000 latent topics. L2-S3V M\n(code available as SVMlin [13]) has a second parameter \u03bbu for unlabeled examples, which\nis set to 1 as in [13]. Unlabeled data for L2-S3V M is downsampled to 3000 documents\nfor each run to make training (and cross-validation) feasible.\n\nEmpirical results are shown in Tables 1- 4. For each semi-supervised learning algorithm,\nwe report two performance measures: the average classi\ufb01cation error over all 190 tasks,\nand the gain/loss ratio compared to the corresponding supervised learning method. The\nformer measures the effectiveness of using the unlabeled data, while the latter measures\nthe reliability of the knowledge transfer. From Tables 1 - 3, IR based methods with seman-\ntic correlation signi\ufb01cantly outperform standard supervised learning, LDA based methods,\nPCA based methods, and is also generally more effective than IR with data correlation. 
The LDA based algorithms slightly improve the prediction performance when using SVM or logistic regression as the base classifier, while decreasing the performance when using ridge regression. This is possibly because the loss function of ridge regression is not a good approximation to the 0/1 classification error, and therefore, ridge regression is more sensitive to irrelevant latent features extracted from mixed unlabeled documents.

Table 1: Comparison over 190 tasks, based on SVMs

                     5%-Test            10%-Test           20%-Test
    SVM              14.22%             10.34%             7.88%
    SVM_LDA(30)      9.76% (179/11)     8.01% (171/19)     6.90% (161/29)
    SVM_PCA(200)     13.32% (123/67)    10.31% (104/86)    8.29% (89/101)
    SVM_IR           7.58% (190/0)      6.11% (190/0)      5.13% (183/7)
    SVM_IR(data)     9.40% (185/5)      7.14% (183/7)      5.70% (180/10)

Table 2: Comparison over 190 tasks, based on regularized logistic regression

                     5%-Test            10%-Test           20%-Test
    LGR              11.70%             8.43%              6.67%
    LGR_LDA(30)      8.21% (171/19)     7.38% (156/34)     6.79% (134/56)
    LGR_PCA(200)     11.43% (105/85)    8.95% (65/125)     7.28% (64/122)
    LGR_IR           6.70% (189/1)      5.78% (181/9)      5.19% (169/21)
    LGR_IR(data)     8.46% (172/18)     7.21% (157/33)     6.46% (132/58)

Table 3: Comparison over 190 tasks, based on ridge regression

                     5%-Test            10%-Test           20%-Test
    RR               14.13%             10.73%             8.90%
    RR_LDA(30)       14.08% (111/101)   11.98% (67/102)    11.34% (42/148)
    RR_PCA(200)      15.50% (56/132)    12.80% (33/157)    11.53% (17/173)
    RR_IR            10.55% (182/8)     8.88% (161/29)     8.01% (134/56)
    RR_IR(data)      10.68% (176/14)    8.94% (157/33)     7.99% (139/51)

Table 4: Comparison to semi-supervised SVMs over 190 tasks, based on L2-SVM

                     5%-Test            10%-Test           20%-Test
    L2-SVM           11.18%             8.41%              6.65%
    L2-S3VM          14.14% (14/176)    11.64% (5/185)     10.04% (1/189)
    L2-S3VM_oracle   8.22% (189/1)      6.95% (185/5)      6.00% (164/24)
    L2-SVM_IR        6.87% (188/2)      5.73% (180/10)     4.98% (177/13)
The PCA based methods are generally worse than standard supervised learning, which indicates they are sensitive to the mixed unlabeled data. In Table 4, the L2-S3VM performs worse than standard L2-SVM, showing that traditional semi-supervised learning cannot handle unlabeled data outside the target task. We can also see that the L2-SVM_IR even outperforms the oracle version of semi-supervised SVM (L2-S3VM_oracle) by achieving a similar gain/loss ratio but a better average classification error. This is a very promising result since it shows that information can be gained from other tasks even in excess of what can be gained from a significant amount of unlabeled data on the task at hand. In conclusion, the empirical results show that the proposed approach is an effective and reliable (also scalable) method for semi-supervised learning, regardless of the source of unlabeled data, the specific task to be enhanced, and the base prediction model used.

Table 5: Top 10 distinct word pairs in terms of semantic correlation vs. data correlation

    word pair            corr_s/corr      word pair            corr_s/corr
    gaza/lebanes         0.956/0.007      toyota/mileag        0.934/0.009
    biker/yamaha         0.937/−0.004     batter/clemen        0.932/−0.002
    motorcycl/yamaha     0.970/0.030      mileag/mustang       0.923/−0.002
    palestin/lebanes     0.946/0.181      yanke/catcher        0.934/0.002
    cage/ama             0.921/−0.005     brave/batter         0.950/0.025

It is interesting to directly compare the semantic correlation Σ_s and the data correlation Σ matrices learned from the data. We make three observations: 1) The average value of entries is 0.0147 in the semantic correlation and 0.0341 in the data correlation. We have 1617834 entries with higher data correlation and 462972 entries with higher semantic correlation.
Thus overall word pairs tend to have higher values in the data correlation.\n2) However, if we list the top 1000 pairs of words with the largest absolute difference\nbetween the two correlations, they all have very high semantic correlation and low data\ncorrelation. 3) We list the top 10 such word pairs and their semantic/data correlations in\nTable 5. The words are indeed quite related. In conclusion, entries in \u03a3s seem to have\na power-law distribution where a few pairs of words have very high correlation and the\nrest have low correlation, which is consistent with our intuition about words. However,\nthe data correlation misses highly correlated words found by the semantic correlation even\nthough it generally assigns higher correlation to most word pairs. This is consistent with\nthe data correlation not being transferable among documents of different themes. When the\nunlabeled documents are a mixture from different sources, the estimation of data correlation\nis affected by the fact that the mixture of input documents is not consistent.\n\nAcknowledgments\n\nThis work was supported by the Centers of Disease Control and Prevention (award R01-PH\n000028) and by the National Science Foundation (grant IIS-0325581).\n\nReferences\n\n[1] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks\n\nand unlabeled data. JMLR, 6:1817\u20131853, 2005.\n\n[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. JMLR, 3:993\u20131022, 2003.\n[3] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In\n\nICML, pages 19\u201326, 2001.\n\n[4] O. Chapelle, B. Scholkopf, and A. Zien. Semi-supervised Learning. The MIT Press, 2006.\n[5] B. Efron. Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7, 1979.\n[6] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining,\n\nInference and Prediction. Springer, New York, 2001.\n\n[7] T. 
Hofmann. Probabilistic latent semantic analysis. In UAI, 1999.
[8] T. Joachims. Transductive inference for text classification using support vector machines. In ICML, pages 200–209, 1999.
[9] E. Krupka and N. Tishby. Incorporating prior knowledge on features into learning. In AISTATS, pages 227–234, 2007.
[10] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39:103–134, 2000.
[11] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: Transfer learning from unlabeled data. In ICML, pages 759–766, 2007.
[12] R. Raina, A. Y. Ng, and D. Koller. Constructing informative priors using transfer learning. In ICML, pages 713–720, 2006.
[13] V. Sindhwani and S. Keerthi. Large scale semi-supervised linear SVMs. In SIGIR, 2006.
", "award": [], "sourceid": 767, "authors": [{"given_name": "Yi", "family_name": "Zhang", "institution": null}, {"given_name": "Artur", "family_name": "Dubrawski", "institution": null}, {"given_name": "Jeff", "family_name": "Schneider", "institution": null}]}