{"title": "Latent Support Measure Machines for Bag-of-Words Data Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1961, "page_last": 1969, "abstract": "In many classification problems, the input is represented as a set of features, e.g., the bag-of-words (BoW) representation of documents. Support vector machines (SVMs) are widely used tools for such classification problems. The performance of the SVMs is generally determined by whether kernel values between data points can be defined properly. However, SVMs for BoW representations have a major weakness in that the co-occurrence of different but semantically similar words cannot be reflected in the kernel calculation. To overcome the weakness, we propose a kernel-based discriminative classifier for BoW data, which we call the latent support measure machine (latent SMM). With the latent SMM, a latent vector is associated with each vocabulary term, and each document is represented as a distribution of the latent vectors for words appearing in the document. To represent the distributions efficiently, we use the kernel embeddings of distributions that hold high order moment information about distributions. Then the latent SMM finds a separating hyperplane that maximizes the margins between distributions of different classes while estimating latent vectors for words to improve the classification performance. 
In the experiments, we show that the latent SMM achieves state-of-the-art accuracy for BoW text classification, is robust with respect to its own hyper-parameters, and is useful for visualizing words.", "full_text": "Latent Support Measure Machines for Bag-of-Words Data Classification

Yuya Yoshikawa
Nara Institute of Science and Technology
Nara, 630-0192, Japan
yoshikawa.yuya.yl9@is.naist.jp

Tomoharu Iwata
NTT Communication Science Laboratories
Kyoto, 619-0237, Japan
iwata.tomoharu@lab.ntt.co.jp

Hiroshi Sawada
NTT Service Evolution Laboratories
Kanagawa, 239-0847, Japan
sawada.hiroshi@lab.ntt.co.jp

Abstract

In many classification problems, the input is represented as a set of features, e.g., the bag-of-words (BoW) representation of documents. Support vector machines (SVMs) are widely used tools for such classification problems. The performance of SVMs is generally determined by whether kernel values between data points can be defined properly. However, SVMs for BoW representations have a major weakness in that the co-occurrence of different but semantically similar words cannot be reflected in the kernel calculation. To overcome this weakness, we propose a kernel-based discriminative classifier for BoW data, which we call the latent support measure machine (latent SMM). With the latent SMM, a latent vector is associated with each vocabulary term, and each document is represented as a distribution of the latent vectors for the words appearing in the document. To represent the distributions efficiently, we use kernel embeddings of distributions, which preserve high-order moment information about the distributions. The latent SMM then finds a separating hyperplane that maximizes the margins between distributions of different classes while estimating latent vectors for words to improve the classification performance. 
In the experiments, we show that the latent SMM achieves state-of-the-art accuracy for BoW text classification, is robust with respect to its own hyper-parameters, and is useful for visualizing words.

1 Introduction

In many classification problems, the input is represented as a set of features. A typical example of such features is the bag-of-words (BoW) representation, which represents a document (or sentence) as a multiset of the words appearing in it while ignoring their order. Support vector machines (SVMs) [1], which are kernel-based discriminative learning methods, are widely used tools for such classification problems in various domains, e.g., natural language processing [2], information retrieval [3, 4] and data mining [5]. The performance of SVMs generally depends on whether the kernel values between documents (data points) can be defined properly. SVMs for the BoW representation have a major weakness in that the co-occurrence of different but semantically similar words cannot be reflected in the kernel calculation. For example, in news classification, 'football' and 'soccer' are semantically similar and characteristic words for football news. Nevertheless, in the BoW representation, the two words might not affect the computation of the kernel value between documents, because many kernels, e.g., linear, polynomial and Gaussian RBF kernels, evaluate kernel values based on word co-occurrences in each document, and 'football' and 'soccer' might not co-occur.

To overcome this weakness, we can consider using a low-rank representation of each document, learnt by unsupervised topic models or matrix factorization. With a low-rank representation, the kernel value can be evaluated properly between documents that share no vocabulary terms. Blei et al. 
showed that an SVM using the topic proportions of each document extracted by latent Dirichlet allocation (LDA) outperforms an SVM using BoW features in terms of text classification accuracy [6]. Another naive approach is to use vector representations of words learnt by matrix factorization or by neural networks such as word2vec [7]. In this approach, each document is represented as the set of vectors corresponding to the words appearing in the document. To classify documents represented as sets of vectors, we can use support measure machines (SMMs), which are a kernel-based discriminative learning method on distributions [8]. However, these low-dimensional representations of documents or words might not help improve classification performance, because the learning criteria for obtaining the representations and for the classifiers are different.

In this paper, we propose a kernel-based discriminative learning method for BoW representation data, which we call the latent support measure machine (latent SMM). The latent SMM assumes that a latent vector is associated with each vocabulary term, and each document is represented as a distribution of the latent vectors for the words appearing in the document. By using the kernel embeddings of distributions [9], we can effectively represent the distributions without density estimation while preserving the necessary distribution information. In particular, the latent SMM maps each distribution into a reproducing kernel Hilbert space (RKHS), and finds a separating hyperplane that maximizes the margins between distributions from different classes in the RKHS. The learning procedure of the latent SMM alternates between maximizing the margin and estimating the latent vectors for words. The learnt latent vectors of semantically similar words are located close to each other in the latent space, and we can obtain kernel values that reflect the semantics. 
As a result, the latent SMM can classify unseen data using a richer and more useful representation than the BoW representation. The latent SMM also learns a latent vector representation of words that is itself useful: by obtaining two- or three-dimensional latent vectors, we can visualize the relationships between classes and between words for a given classification task.

In our experiments, we demonstrate the quantitative and qualitative effectiveness of the latent SMM on standard BoW text datasets. The experimental results first indicate that the latent SMM can achieve state-of-the-art classification accuracy. We then show that the performance of the latent SMM is robust with respect to its own hyper-parameters, and that the latent vectors for words can be restricted to a two-dimensional space while maintaining high classification performance. Finally, we show by visualizing the latent vectors that the characteristic words of each class are concentrated in a single region.

The latent SMM is a general framework of discriminative learning for BoW data. Thus, its idea can be applied to various machine learning problems for BoW data that have been solved using SVMs, for example, novelty detection [10], structure prediction [11], and learning to rank [12].

2 Related Work

The proposed method is based on the framework of support measure machines (SMMs), a kernel-based discriminative learning framework on distributions [8]. Muandet et al. showed that SMMs are more effective than SVMs when the observed feature vectors are numerical and dense, in experiments on handwritten digit recognition and natural scene categorization. 
On the other hand, when the observations are BoW features, the SMMs coincide with the SVMs, as described in Section 3.2. To receive the benefits of SMMs for BoW data, the proposed method represents each word as a numerical, dense vector, which is estimated from the given data.

The proposed method aims to achieve higher classification performance by learning a classifier and a feature representation simultaneously. Supervised topic models [13] and maximum margin topic models (MedLDA) [14] have been proposed based on a similar motivation but with different approaches, and they outperform classifiers using features extracted by unsupervised LDA. There are two main differences between these methods and the proposed method. First, the proposed method plugs the latent word vectors into its discriminant function, while the existing methods plug document-specific vectors into theirs. Second, the proposed method can naturally develop non-linear classifiers based on the kernel embeddings of distributions. We demonstrate the effectiveness of the proposed model by comparing it with the topic-model-based classifiers in our text classification experiments.

3 Preliminaries

In this section, we introduce the kernel embeddings of distributions and support measure machines. Our method in Section 4 builds upon these techniques.

3.1 Representations of Distributions via Kernel Embeddings

Suppose that we are given a set of $n$ distributions $\{P_i\}_{i=1}^{n}$, where $P_i$ is the $i$th distribution on space $\mathcal{X} \subset \mathbb{R}^q$. The kernel embedding of distributions embeds any distribution $P_i$ into a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ specified by kernel $k$ [15], and the distribution is represented as an element $\mu_{P_i}$ in the RKHS. 
More precisely, the element for the $i$th distribution $\mu_{P_i}$ is defined as follows:

$\mu_{P_i} := \mathbb{E}_{x \sim P_i}[k(\cdot, x)] = \int_{\mathcal{X}} k(\cdot, x) \, dP_i \in \mathcal{H}_k,$   (1)

where kernel $k$ is referred to as an embedding kernel. It is known that element $\mu_{P_i}$ preserves the properties of probability distribution $P_i$, such as its mean, covariance and higher-order moments, when characteristic kernels (e.g., the Gaussian RBF kernel) are used [15]. In practice, although distribution $P_i$ is unknown, we are given a set of samples $X_i = \{x_{im}\}_{m=1}^{M_i}$ drawn from the distribution. In this case, by interpreting sample set $X_i$ as empirical distribution $\hat{P}_i = \frac{1}{M_i} \sum_{m=1}^{M_i} \delta_{x_{im}}(\cdot)$, where $\delta_x(\cdot)$ is the Dirac delta function at point $x \in \mathcal{X}$, the empirical kernel embedding $\hat{\mu}_{P_i}$ is given by

$\hat{\mu}_{P_i} = \frac{1}{M_i} \sum_{m=1}^{M_i} k(\cdot, x_{im}) \in \mathcal{H}_k,$   (2)

which approximates $\mu_{P_i}$ with an error rate of $\|\hat{\mu}_{P_i} - \mu_{P_i}\|_{\mathcal{H}_k} = O_p(M_i^{-1/2})$ [9].

3.2 Support Measure Machines

Now we consider learning a separating hyperplane on distributions by employing support measure machines (SMMs). An SMM amounts to solving an SVM problem with a kernel between empirical embedded distributions $\{\hat{\mu}_{P_i}\}_{i=1}^{n}$, called a level-2 kernel. A level-2 kernel between the $i$th and $j$th distributions is given by

$K(\hat{P}_i, \hat{P}_j) = \langle \hat{\mu}_{P_i}, \hat{\mu}_{P_j} \rangle_{\mathcal{H}_k} = \frac{1}{M_i M_j} \sum_{g=1}^{M_i} \sum_{h=1}^{M_j} k(x_{ig}, x_{jh}),$   (3)

where kernel $k$ indicates the embedding kernel used in Eq. (2). Although the level-2 kernel in Eq. (3) is linear on the embedded distributions, we can also consider non-linear level-2 kernels. 
For example, a Gaussian RBF level-2 kernel with bandwidth parameter $\lambda > 0$ is given by

$K_{\mathrm{rbf}}(\hat{P}_i, \hat{P}_j) = \exp\left(-\frac{\lambda}{2} \|\hat{\mu}_{P_i} - \hat{\mu}_{P_j}\|^2_{\mathcal{H}_k}\right) = \exp\left(-\frac{\lambda}{2}\left(\langle \hat{\mu}_{P_i}, \hat{\mu}_{P_i} \rangle_{\mathcal{H}_k} - 2 \langle \hat{\mu}_{P_i}, \hat{\mu}_{P_j} \rangle_{\mathcal{H}_k} + \langle \hat{\mu}_{P_j}, \hat{\mu}_{P_j} \rangle_{\mathcal{H}_k}\right)\right).$   (4)

Note that each inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}_k}$ in Eq. (4) can be calculated by Eq. (3). By using these kernels, we can measure similarities between distributions based on their moment information.

The SMMs are a generalization of the standard SVMs. For example, suppose that a word is represented as a one-hot vector of vocabulary length, where all the elements are zero except for the entry corresponding to the vocabulary term. Then, a document is represented by adding the one-hot vectors of the words appearing in the document. This operation is equivalent to using a linear kernel as the embedding kernel in the SMMs. Then, by using a non-linear kernel as a level-2 kernel as in Eq. (4), the SMM for BoW documents is the same as an SVM with a non-linear kernel.

4 Latent Support Measure Machines

In this section, we propose latent support measure machines (latent SMMs), which are effective for BoW data classification because they learn a latent word representation that improves classification performance. The SMM assumes that a set of samples $X_i$ from distribution $P_i$ is observed. In contrast, as described later, the latent SMM assumes that $X_i$ is unobserved; instead, we consider a case where BoW features are given for each document. More formally, we are given a training set of $n$ pairs of documents and class labels $\{(d_i, y_i)\}_{i=1}^{n}$, where $d_i$ is the $i$th document, represented as a multiset of the words appearing in the document, and $y_i \in \mathcal{Y}$ is a class variable. Each word is included in vocabulary set $\mathcal{V}$. For simplicity, we consider a binary class variable $y_i \in \{+1, -1\}$. 
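As a concrete illustration of how a pair of documents, each a set of latent word vectors, yields a single kernel value, the sketch below (our own code, not the authors' implementation; all function names and the toy data are assumptions) computes the empirical level-2 kernels of Eqs. (3) and (4):

```python
import numpy as np

def embedding_gram(X, Y, gamma=1.0):
    # Gaussian RBF embedding-kernel Gram matrix between two sets of latent word vectors.
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * gamma * sq)

def level2_linear(X, Y, gamma=1.0):
    # Linear level-2 kernel (Eq. 3): K(P_i, P_j) = <mu_i, mu_j> = mean of the Gram matrix.
    return embedding_gram(X, Y, gamma).mean()

def level2_rbf(X, Y, gamma=1.0, lam=1.0):
    # Gaussian RBF level-2 kernel (Eq. 4), built from the inner products of Eq. (3).
    kxx = level2_linear(X, X, gamma)
    kyy = level2_linear(Y, Y, gamma)
    kxy = level2_linear(X, Y, gamma)
    return np.exp(-0.5 * lam * (kxx - 2.0 * kxy + kyy))

# Two toy "documents", each a set of 2-dimensional latent word vectors.
rng = np.random.default_rng(0)
Xi = rng.normal(size=(5, 2))   # document i: 5 words
Xj = rng.normal(size=(7, 2))   # document j: 7 words
print(level2_linear(Xi, Xj))   # scalar similarity between the two documents
print(level2_rbf(Xi, Xi))      # 1.0: a distribution is at RKHS distance 0 from itself
```

Feeding such level-2 kernel values into any off-the-shelf SVM solver gives the (non-latent) SMM of Section 3.2; the latent SMM additionally optimizes the vectors in `Xi` and `Xj` themselves.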
The proposed method is also applicable to multi-class classification problems by adopting one-versus-one or one-versus-rest strategies, as with standard SVMs [16].

With the latent SMM, each word $t \in \mathcal{V}$ is represented by a $q$-dimensional latent vector $x_t \in \mathbb{R}^q$, and the $i$th document is represented as the set of latent vectors for the words appearing in the document, $X_i = \{x_t\}_{t \in d_i}$. Then, using the kernel embeddings of distributions described in Section 3.1, we can obtain a representation of the $i$th document from $X_i$ as follows: $\hat{\mu}_{P_i} = \frac{1}{|d_i|} \sum_{t \in d_i} k(\cdot, x_t)$.

Using latent word vectors $X = \{x_t\}_{t \in \mathcal{V}}$ and document representations $\{\hat{\mu}_{P_i}\}_{i=1}^{n}$, the primal optimization problem for the latent SMM can be formulated in a way analogous to, but different from, the original SMMs as follows:

$\min_{w, b, \xi, X, \theta} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i + \frac{\rho}{2} \sum_{t \in \mathcal{V}} \|x_t\|_2^2 \quad \text{subject to} \quad y_i \left(\langle w, \hat{\mu}_{P_i} \rangle_{\mathcal{H}} - b\right) \geq 1 - \xi_i, \; \xi_i \geq 0,$   (5)

where $\{\xi_i\}_{i=1}^{n}$ denotes slack variables for handling soft margins. Unlike the primal form of the SMMs, that of the latent SMMs includes an $\ell_2$ regularization term with parameter $\rho > 0$ on the latent word vectors $X$. The latent SMM minimizes Eq. (5) with respect to the latent word vectors $X$ and kernel parameters $\theta$, along with weight parameters $w$, bias parameter $b$ and slack variables $\{\xi_i\}_{i=1}^{n}$.

It is extremely difficult to solve the primal problem Eq. (5) directly, because the inner product $\langle w, \hat{\mu}_{P_i} \rangle_{\mathcal{H}}$ in the constraints is calculated in an infinite dimensional space. Thus, we solve this problem by converting it into another optimization problem in which the inner product does not appear explicitly. Unfortunately, due to its non-convex nature, we cannot derive the dual form of Eq. (5) as with the standard SVMs. 
Thus we consider a min-max optimization problem, which is derived by first introducing Lagrange multipliers $A = \{a_1, a_2, \cdots, a_n\}$ and then plugging $w = \sum_{i=1}^{n} a_i \hat{\mu}_{P_i}$ into Eq. (5), as follows:

$\min_{X, \theta} \max_{A} \; L(A, X, \theta) \quad \text{subject to} \quad 0 \leq a_i \leq C, \; \sum_{i=1}^{n} a_i y_i = 0,$   (6a)

$\text{where} \quad L(A, X, \theta) = \sum_{i=1}^{n} a_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j y_i y_j K(\hat{P}_i, \hat{P}_j; X, \theta) + \frac{\rho}{2} \sum_{t \in \mathcal{V}} \|x_t\|_2^2,$   (6b)

where $K(\hat{P}_i, \hat{P}_j; X, \theta)$ is the kernel value between empirical distributions $\hat{P}_i$ and $\hat{P}_j$ specified by parameters $X$ and $\theta$, as shown in Eq. (3).

We solve this min-max problem by separating it into two partial optimization problems: 1) maximization over $A$ given current estimates $\bar{X}$ and $\bar{\theta}$, and 2) minimization over $X$ and $\theta$ given current estimate $\bar{A}$. This approach is analogous to wrapper methods in multiple kernel learning [17].

Maximization over A. When we fix $X$ and $\theta$ in Eq. (6) at current estimates $\bar{X}$ and $\bar{\theta}$, the maximization over $A$ becomes a quadratic programming problem:

$\max_{A} \; \sum_{i=1}^{n} a_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j y_i y_j K(\hat{P}_i, \hat{P}_j; \bar{X}, \bar{\theta}) \quad \text{subject to} \quad 0 \leq a_i \leq C, \; \sum_{i=1}^{n} a_i y_i = 0,$   (7)

which is identical to solving the dual problem of the standard SVMs. Thus, we can obtain the optimal $A$ by employing an existing SVM package.

Table 1: Dataset specifications.

Dataset          # samples   # features   # classes
WebKB                4,199        7,770           4
Reuters-21578        7,674       17,387           8
20 Newsgroups       18,821       70,216          20

Minimization over X and θ. When we fix $A$ in Eq. 
(6) at current estimate $\bar{A}$, the min-max problem can be replaced with a simpler minimization problem:

$\min_{X, \theta} \; l(X, \theta), \quad \text{where} \quad l(X, \theta) = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \bar{a}_i \bar{a}_j y_i y_j K(\hat{P}_i, \hat{P}_j; X, \theta) + \frac{\rho}{2} \sum_{t \in \mathcal{V}} \|x_t\|_2^2.$   (8)

To solve this problem, we use a quasi-Newton method [18], which needs the gradients of the parameters. For each word $m \in \mathcal{V}$, the gradient with respect to latent word vector $x_m$ is given by

$\frac{\partial l(X, \theta)}{\partial x_m} = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \bar{a}_i \bar{a}_j y_i y_j \frac{\partial K(\hat{P}_i, \hat{P}_j; X, \theta)}{\partial x_m} + \rho x_m,$   (9)

where the gradient of the kernel with respect to $x_m$ depends on the choice of kernels. For example, when choosing the embedding kernel as a Gaussian RBF kernel with bandwidth parameter $\gamma > 0$, $k_\gamma(x_s, x_t) = \exp(-\frac{\gamma}{2} \|x_s - x_t\|^2)$, and the level-2 kernel as a linear kernel, the gradient is given by

$\frac{\partial K(\hat{P}_i, \hat{P}_j; X, \theta)}{\partial x_m} = \frac{1}{|d_i| |d_j|} \sum_{s \in d_i} \sum_{t \in d_j} k_\gamma(x_s, x_t) \times \begin{cases} \gamma (x_t - x_s) & (m = s \wedge m \neq t) \\ \gamma (x_s - x_t) & (m = t \wedge m \neq s) \\ 0 & (m = t \wedge m = s). \end{cases}$   (10)

As with the estimation of $X$, kernel parameters $\theta$ can be obtained by calculating gradient $\frac{\partial l(X, \theta)}{\partial \theta}$. By alternately repeating these computations until the dual function Eq. (6) converges, we can find a local optimum of the min-max problem.

The parameters that need to be stored after learning are latent word vectors $X$, kernel parameters $\theta$ and Lagrange multipliers $A$. 
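To make the kernel gradient of Eqs. (9)-(10) concrete, the following sketch (our own illustration, not the authors' code; all names and the toy data are assumptions) computes $\partial K / \partial x_m$ for a Gaussian RBF embedding kernel with a linear level-2 kernel, and verifies it against a finite-difference approximation:

```python
import numpy as np

def k_rbf(xs, xt, gamma):
    # Gaussian RBF embedding kernel k_gamma(x_s, x_t) = exp(-gamma/2 * ||x_s - x_t||^2).
    return np.exp(-0.5 * gamma * np.sum((xs - xt) ** 2))

def level2_linear(Xi, Xj, gamma):
    # Linear level-2 kernel (Eq. 3): mean of embedding-kernel values over word pairs.
    return float(np.mean([k_rbf(s, t, gamma) for s in Xi for t in Xj]))

def grad_K(words_i, words_j, m, latent, gamma):
    # Gradient of K(P_i, P_j) with respect to latent word vector x_m (Eq. 10).
    g = np.zeros_like(latent[m])
    for s in words_i:
        for t in words_j:
            if s == t:
                continue  # the m = s = t case contributes zero
            k = k_rbf(latent[s], latent[t], gamma)
            if s == m:
                g += k * gamma * (latent[t] - latent[s])
            elif t == m:
                g += k * gamma * (latent[s] - latent[t])
    return g / (len(words_i) * len(words_j))

# Finite-difference check on toy data: 5 words with 2-dimensional latent vectors.
rng = np.random.default_rng(1)
latent = {w: rng.normal(size=2) for w in "abcde"}
di, dj, gamma = list("abc"), list("cde"), 0.7

def K_perturbed(xc):
    lat = dict(latent, c=xc)  # replace the vector of word 'c'
    return level2_linear([lat[w] for w in di], [lat[w] for w in dj], gamma)

analytic = grad_K(di, dj, "c", latent, gamma)
eps = 1e-6
numeric = np.array([(K_perturbed(latent["c"] + eps * e) - K_perturbed(latent["c"] - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```

In the actual algorithm, gradients of this form for all words would be summed as in Eq. (9) and passed to a quasi-Newton routine such as L-BFGS.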
Classification for a new document $d^*$ is performed by computing $y(d^*) = \sum_{i=1}^{n} a_i y_i K(\hat{P}_i, \hat{P}^*; X, \theta)$, where $\hat{P}^*$ is the distribution of the latent vectors for the words included in $d^*$.

5 Experiments with Bag-of-Words Text Classification

Data description. For the evaluation, we used the following three standard multi-class text classification datasets: WebKB, Reuters-21578 and 20 Newsgroups. These datasets, which have already been preprocessed by removing short and stop words, are found in [19] and can be downloaded from the author's website1. The specifications of these datasets are shown in Table 1. In our experimental setting, we ignored the original training/test data separations.

Setting. In our experiments, the proposed method, the latent SMM, uses a Gaussian RBF embedding kernel and a linear level-2 kernel. To demonstrate the effectiveness of the latent SMM, we compare it with several methods: MedLDA, SVD+SMM, word2vec+SMM and SVMs. MedLDA is a method that jointly learns LDA and a maximum margin classifier, and is a state-of-the-art discriminative learning method for BoW data [14]. We use the authors' implementation of MedLDA2. SVD+SMM is a two-step procedure: 1) extracting low-dimensional representations of words by singular value decomposition (SVD), and 2) learning a support measure machine using the distribution of the extracted representations of the words appearing in each document, with the same kernels as the latent SMM. word2vec+SMM employs the word representations learnt by word2vec [7] and uses them for the SMM as in SVD+SMM. Here we use pre-trained 300-dimensional word representation vectors from the Google News corpus, which can be downloaded from the author's website3. 
Note that word2vec+SMM utilizes an additional resource to represent the latent vectors for words, unlike the latent SMM, and the learning of word2vec requires n-gram information about documents, which is lost in the BoW representation. With SVMs, we use a Gaussian RBF kernel with parameter $\gamma$ and a quadratic polynomial kernel, and the features are represented as BoW. We use LIBSVM4 to estimate Lagrange multipliers $A$ in the latent SMM and to build the SVMs and SMMs. To deal with multi-class classification, we adopt a one-versus-one strategy [16] in the latent SMM, SVMs and SMMs. In our experiments, we choose the optimal parameters for these methods from the following variations: $\gamma \in \{10^{-3}, 10^{-2}, \cdots, 10^{3}\}$ in the latent SMM, SVD+SMM, word2vec+SMM and the SVM with a Gaussian RBF kernel; $C \in \{2^{-3}, 2^{-1}, \cdots, 2^{5}, 2^{7}\}$ in all the methods; regularizer parameter $\rho \in \{10^{-2}, 10^{-1}, 10^{0}\}$ and latent dimensionality $q \in \{2, 3, 4\}$ in the latent SMM; and the latent dimensionality of MedLDA and SVD+SMM ranging over $\{10, 20, \cdots, 50\}$.

1http://web.ist.utl.pt/acardoso/datasets/
2http://www.ml-thu.net/~jun/medlda.shtml
3https://code.google.com/p/word2vec/

Figure 1: Classification accuracy over the number of training samples ((a) WebKB, (b) Reuters-21578, (c) 20 Newsgroups).

Figure 2: Classification accuracy over the latent dimensionality ((a) WebKB, (b) Reuters-21578, (c) 20 Newsgroups).

Accuracy over number of training samples. We first show the classification accuracy when varying the number of training samples. Here we randomly chose five sets of training samples, and used the remaining samples for each of the training sets as the test set. We removed words that occurred in less than 1% of the training documents; below, we refer to this percentage as the word occurrence threshold. 
As shown in Figure 1, the latent SMM outperformed the other methods for every number of training samples on the WebKB and Reuters-21578 datasets. For the 20 Newsgroups dataset, the accuracies of the latent SMM, MedLDA and word2vec+SMM were close, and better than those of SVD+SMM and the SVMs.

The performance of SVD+SMM changed depending on the dataset: while SVD+SMM was the second best method on Reuters-21578, it placed fourth on the other datasets. This result indicates that the usefulness for classification of the low-rank representations obtained by SVD depends on the properties of the dataset. The high classification performance of the latent SMM on all of the datasets demonstrates the effectiveness of learning the latent word representations.

4http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Figure 3: Classification accuracy on WebKB when varying the word occurrence threshold.

Figure 4: Parameter sensitivity on Reuters-21578.

Figure 5: Distributions of latent vectors for words appearing in documents of each class ('project', 'course', 'faculty', 'student') on WebKB.

Robustness over latent dimensionality. Next we confirm the robustness of the latent SMM over the latent dimensionality. For this experiment, we changed the latent dimensionality of the latent SMM, MedLDA and SVD+SMM within $\{2, 4, \cdots, 12\}$. Figure 2 shows the accuracy when varying the latent dimensionality. Here the number of training samples in each dataset was 600, and the word occurrence threshold was 1%. For all latent dimensionalities, the accuracy of the latent SMM was consistently better than that of the other methods. Moreover, even with two-dimensional latent vectors, the latent SMM achieved high classification performance. 
On the other hand, MedLDA and SVD+SMM often performed poorly when the latent dimensionality was low. One reason why the latent SMM achieves good performance even with a very low latent dimensionality $q$ is that it can use $q |d_i|$ parameters to classify the $i$th document, while MedLDA uses only $q$ parameters. Since the latent word representation used in SVD+SMM is not optimized for the given classification problem, it does not contain useful features for classification, especially when the latent dimensionality is low.

Accuracy over word occurrence threshold. In the above experiments, we omitted words whose occurrence accounts for less than 1% of the training documents. By reducing this threshold, low-frequency words become included in the training documents. This might be a difficult situation for the latent SMM and SVD+SMM, because they cannot observe enough training data to estimate their latent word vectors. On the other hand, it would be an advantageous situation for SVMs using BoW features, because they can use low-frequency words that are useful for classification when computing their kernel values. Figure 3 shows the classification accuracy on WebKB when varying the word occurrence threshold within $\{0.4, 0.6, 0.8, 1.0\}$. The performance of the latent SMM did not change as the threshold was varied, and was better than that of the other methods in spite of the difficult situation.

Parameter sensitivity. Figure 4 shows how the performance of the latent SMM changes with the $\ell_2$ regularizer parameter $\rho$ and $C$ on the Reuters-21578 dataset with 1,000 training samples. Here the latent dimensionality of the latent SMM was fixed at $q = 2$ to eliminate the effect of $q$. The performance is insensitive to $\rho$ except when $C$ is too small. Moreover, we can see that the performance is improved by increasing the $C$ value. 
In general, the performance of SVM-based methods is very sensitive to $C$ and the kernel parameters [20]. Since kernel parameters $\theta$ in the latent SMM are estimated along with latent vectors $X$, the latent SMM can avoid the problem of sensitivity to the kernel parameters. In addition, Figure 2 has shown that the latent SMM is robust over the latent dimensionality. Thus, the latent SMM can achieve high classification accuracy by focusing only on tuning $C$, and experimentally the best $C$ takes a large value, e.g., $C \geq 2^5$.

Figure 6: Visualization of latent vectors for words on WebKB (four closeup regions (a)-(d) and a complete view showing a 50% sample of words). The font color of each word indicates the class in which the word occurs most frequently; the 'project', 'course', 'student' and 'faculty' classes correspond to yellow, red, blue and green fonts, respectively.

Visualization of classes. In the above experiments, we have shown that the latent SMM can achieve high classification accuracy with low-dimensional latent vectors. By using two- or three-dimensional latent vectors in the latent SMM and visualizing them, we can understand the relationships between classes. Figure 5 shows the distributions of latent vectors for words appearing in documents of each class. Each class has its own characteristic distribution that differs from those of the other classes. This result shows that the latent SMM can extract the differences between the distributions of the classes. For example, the distribution of 'course' is separated from those of the other classes, which indicates that documents categorized as 'course' share few words with documents categorized in other classes. 
On the other hand, the latent vectors for words used in the 'project' class are widely distributed, and their distribution overlaps those of the 'faculty' and 'student' classes. This would be because faculty and students work jointly on projects, so words from both the 'faculty' and 'student' classes appear simultaneously in 'project' documents.

Visualization of words. In addition to the visualization of classes, the latent SMM can visualize words using two- or three-dimensional latent vectors. Unlike unsupervised visualization methods for documents, e.g., [21], the latent SMM can gather the characteristic words of each class in a single region. Figure 6 shows the visualization result for words on the WebKB dataset. Here we used the same learning result as that used in Figure 5. As shown in the complete view, highly frequent words in each class tend to gather in distinct regions. On the right side of the figure, four regions from the complete view are displayed in closeup. Figures (a), (b) and (c) include words indicating the 'course', 'faculty' and 'student' classes, respectively. For example, figure (a) includes 'exercise', 'examine' and 'quiz', which relate to examinations in lectures. Figure (d) includes words of various classes, although the 'project' class dominates the region, as shown in Figure 5. This means that words appearing in the 'project' class are related to the other classes or are general words, e.g., 'occur' and 'differ'.

6 Conclusion

We have proposed the latent support measure machine (latent SMM), a kernel-based discriminative learning method that is effective for sets of features such as bag-of-words (BoW). The latent SMM represents each word as a latent vector, and each document to be classified as a distribution of the latent vectors for the words appearing in the document. 
Then the latent SMM finds a separating hyperplane that maximizes the margins between distributions of different classes while estimating the latent vectors for words to improve the classification performance. The experimental results can be summarized as follows. First, the latent SMM achieved state-of-the-art classification accuracy for BoW data. Second, we showed experimentally that the performance of the latent SMM is robust with respect to its own hyper-parameters. Third, since the latent SMM can represent each word as a two- or three-dimensional latent vector, we showed that it is useful for understanding the relationships between classes and between words by visualizing the latent vectors.

Acknowledgment. This work was supported by a JSPS Grant-in-Aid for JSPS Fellows (259867).

References

[1] Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine Learning, 20(3):273-297, September 1995.

[2] Taku Kudo and Yuji Matsumoto. Chunking with Support Vector Machines. Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, 2001.

[3] Dell Zhang and Wee Sun Lee. Question Classification Using Support Vector Machines. SIGIR, 2003.

[4] Changhua Yang, Kevin Hsin-Yih Lin, and Hsin-Hsi Chen. Emotion Classification Using Web Blog Corpora. IEEE/WIC/ACM International Conference on Web Intelligence, pages 275-278, November 2007.

[5] Pranam Kolari, Tim Finin, and Anupam Joshi. SVMs for the Blogosphere: Blog Identification and Splog Detection. AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, 2006.

[6] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. The Journal of Machine Learning Research, 3:993-1022, 2003.

[7] Tomas Mikolov, Ilya Sutskever, and Kai Chen. 
Distributed Representations of Words and Phrases and their Compositionality. NIPS, 2013.

[8] Krikamol Muandet and Kenji Fukumizu. Learning from Distributions via Support Measure Machines. NIPS, 2012.

[9] Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert Space Embedding for Distributions. Algorithmic Learning Theory, 2007.

[10] Bernhard Schölkopf, Robert Williamson, Alex Smola, John Shawe-Taylor, and John Platt. Support Vector Method for Novelty Detection. NIPS, pages 582-588, 1999.

[11] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. ICML, 2004.

[12] Thorsten Joachims. Optimizing Search Engines Using Clickthrough Data. SIGKDD, 2002.

[13] David M. Blei and Jon D. McAuliffe. Supervised Topic Models. NIPS, 2007.

[14] Jun Zhu, Amr Ahmed, and Eric P. Xing. MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification. ICML, 2009.

[15] Bharath K. Sriperumbudur and Arthur Gretton. Hilbert Space Embeddings and Metrics on Probability Measures. The Journal of Machine Learning Research, 11:1517-1561, 2010.

[16] Chih-Wei Hsu and Chih-Jen Lin. A Comparison of Methods for Multi-class Support Vector Machines. IEEE Transactions on Neural Networks, 13(2):415-425, 2002.

[17] Sören Sonnenburg and Gunnar Rätsch. Large Scale Multiple Kernel Learning. The Journal of Machine Learning Research, 7:1531-1565, 2006.

[18] Dong C. Liu and Jorge Nocedal. On the Limited Memory BFGS Method for Large Scale Optimization. Mathematical Programming, 45(1-3):503-528, August 1989.

[19] Ana Cardoso-Cachopo. Improving Methods for Single-label Text Categorization. PhD thesis, 2007.

[20] Vladimir Cherkassky and Yunqian Ma. Practical Selection of SVM Parameters and Noise Estimation for SVM Regression. 
Neural Networks, 17(1):113-126, January 2004.

[21] Tomoharu Iwata, T. Yamada, and N. Ueda. Probabilistic Latent Semantic Visualization: Topic Model for Visualizing Documents. SIGKDD, 2008.
", "award": [], "sourceid": 1074, "authors": [{"given_name": "Yuya", "family_name": "Yoshikawa", "institution": "Nara Institute of Science and Technology"}, {"given_name": "Tomoharu", "family_name": "Iwata", "institution": "NTT"}, {"given_name": "Hiroshi", "family_name": "Sawada", "institution": "NTT Service Evolution Laboratories"}]}