{"title": "Parametric Mixture Models for Multi-Labeled Text", "book": "Advances in Neural Information Processing Systems", "page_first": 737, "page_last": 744, "abstract": null, "full_text": "Parametric Mixture Models for\n\nMulti-Labeled Text\n\nNaonori Ueda\n\nKazumi Saito\n\nNTT Communication Science Laboratories\n\n2-4 Hikaridai, Seikacho, Kyoto 619-0237 Japan\n\nfueda,saitog@cslab.kecl.ntt.co.jp\n\nAbstract\n\nWe propose probabilistic generative models, called parametric mix-\nture models (PMMs), for multiclass, multi-labeled text categoriza-\ntion problem. Conventionally, the binary classi(cid:12)cation approach\nhas been employed, in which whether or not text belongs to a cat-\negory is judged by the binary classi(cid:12)er for every category. In con-\ntrast, our approach can simultaneously detect multiple categories of\ntext using PMMs. We derive e(cid:14)cient learning and prediction algo-\nrithms for PMMs. We also empirically show that our method could\nsigni(cid:12)cantly outperform the conventional binary methods when ap-\nplied to multi-labeled text categorization using real World Wide\nWeb pages.\n\n1\n\nIntroduction\n\nRecently, as the number of online documents has been rapidly increasing, auto-\nmatic text categorization is becoming a more important and fundamental task in\ninformation retrieval and text mining. Since a document often belongs to multiple\ncategories, the task of text categorization is generally de(cid:12)ned as assigning one or\nmore category labels to new text. This problem is more di(cid:14)cult than the traditional\npattern classi(cid:12)cation problems, in the sense that each sample is not assumed to be\nclassi(cid:12)ed into one of a number of prede(cid:12)ned exclusive categories. When there are\nL categories, the number of possible multi-labeled classes becomes 2L. 
Hence, this type of categorization problem has become a challenging research theme in the field of machine learning.

Conventionally, a binary classification approach has been used, in which the multi-category detection problem is decomposed into independent binary classification problems. This approach usually employs state-of-the-art methods such as support vector machines (SVMs) [9][4] and naive Bayes (NB) classifiers [5][7]. However, since the binary approach does not consider a generative model of multi-labeled text, we think that it has an important limitation when applied to multi-labeled text categorization.

In this paper, using the independent word-based representation known as the Bag-of-Words (BOW) representation [3], we present two types of probabilistic generative models for multi-labeled text, called parametric mixture models (PMM1, PMM2), where PMM2 is a more flexible version of PMM1. The basic assumption underlying PMMs is that multi-labeled text has a mixture of characteristic words appearing in single-labeled text belonging to each category of the multi-categories. This assumption leads us to construct quite simple generative models with a good feature: the objective function of PMM1 is convex (i.e., the global optimum solution can easily be found). We present efficient learning and prediction algorithms for PMMs. We also show the actual benefits of PMMs through an application to WWW page categorization, focusing on pages from the "yahoo.com" domain.

2 Parametric Mixture Models

2.1 Multi-labeled Text

According to the BOW representation, which ignores the order of word occurrence in a document, the nth document, d^n, can be represented by a word-frequency vector, x^n = (x^n_1, ..., x^n_V), where x^n_i denotes the frequency of occurrence of word w_i in d^n among the vocabulary V = <w_1, ..., w_V>. Here, V is the total number of words in the vocabulary. 
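For concreteness, the word-frequency vector construction described above can be sketched in a few lines of Python (an illustrative snippet; the toy vocabulary, document, and function name are ours, not from the paper):

```python
from collections import Counter

def bow_vector(tokens, vocab):
    """Bag-of-Words word-frequency vector: x_i counts how often
    vocabulary word w_i occurs in the document, ignoring word order."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

vocab = ["music", "sports", "game", "concert"]   # toy vocabulary (V = 4)
doc = "sports game sports concert".split()       # a toy document
print(bow_vector(doc, vocab))                    # [0, 2, 1, 1]
```
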
Next, let y^n = (y^n_1, ..., y^n_L) be a category vector for d^n, where y^n_l takes the value 1 (0) when d^n belongs (does not belong) to the lth category. L is the total number of categories. Note that the L categories are predefined and that a document always belongs to at least one category (i.e., \sum_l y_l > 0).

In the case of multi-class, single-labeled text, it is natural that x in the lth category should be generated from a multinomial distribution: P(x|l) \propto \prod_{i=1}^{V} (\theta_{l,i})^{x_i}. Here, \theta_{l,i} \ge 0 and \sum_{i=1}^{V} \theta_{l,i} = 1; \theta_{l,i} is the probability that the ith word w_i appears in a document belonging to the lth class. We generalize this to multi-class, multi-labeled text as:

    P(x|y) \propto \prod_{i=1}^{V} (\varphi_i(y))^{x_i}, where \varphi_i(y) \ge 0 and \sum_{i=1}^{V} \varphi_i(y) = 1.    (1)

Here, \varphi_i(y) is a class-dependent probability that the ith word appears in a document belonging to class y. Clearly, it is impractical to independently set a multinomial parameter vector for each distinct y, since there are 2^L - 1 possible classes. Thus, we try to parameterize them efficiently.

2.2 PMM1

In general, words in a document belonging to a multi-category class can be regarded as a mixture of characteristic words related to each of the categories. For example, a document that belongs to both "sports" and "music" would consist of a mixture of characteristic words mainly related to both categories. Let \theta_l = (\theta_{l,1}, ..., \theta_{l,V}). The above assumption indicates that \varphi(y) (= (\varphi_1(y), ..., \varphi_V(y))) can be represented by the following parametric mixture:

    \varphi(y) = \sum_{l=1}^{L} h_l(y) \theta_l, where h_l(y) = 0 for l such that y_l = 0.    (2)

Here, h_l(y) (> 0) is a mixing proportion (\sum_{l=1}^{L} h_l(y) = 1). Intuitively, h_l(y) can also be interpreted as the degree to which x has the lth category. 
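The parametric mixture of Eq. (2), with the uniform mixing proportions used by PMM1, can be sketched as follows (an illustrative implementation; the toy parameter matrix is ours):

```python
import numpy as np

def phi_pmm1(y, theta):
    """Word distribution phi(y) under PMM1: a mixture of the
    single-category multinomial parameters theta_l with uniform
    weights h_l(y) = y_l / sum_l' y_l' over the active categories."""
    y = np.asarray(y, dtype=float)   # binary category vector, length L
    h = y / y.sum()                  # mixing proportions h_l(y)
    return h @ theta                 # phi(y) = sum_l h_l(y) * theta_l

# Toy example with L = 3 categories and V = 4 words (made-up parameters):
theta = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.1, 0.7, 0.1, 0.1],
                  [0.1, 0.1, 0.7, 0.1]])
print(phi_pmm1([1, 1, 0], theta))    # = (theta_1 + theta_2) / 2
```

Each row of `theta` lies on the V-dimensional probability simplex, so any convex combination of rows is again a valid multinomial parameter vector.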
Actually, by experimental verification using about 3,000 real Web pages, we confirmed that the above assumption is reasonable.

Based on the parametric mixture assumption, we can construct a simple parametric mixture model, PMM1, in which the degree is uniform: h_l(y) = y_l / \sum_{l'=1}^{L} y_{l'}. For example, in the case of L = 3, \varphi((1,1,0)) = (\theta_1 + \theta_2)/2 and \varphi((1,1,1)) = (\theta_1 + \theta_2 + \theta_3)/3.

Substituting Eq. (2) into Eq. (1), PMM1 can be defined by

    P(x|y; \Theta) \propto \prod_{i=1}^{V} ( \sum_{l=1}^{L} y_l \theta_{l,i} / \sum_{l'=1}^{L} y_{l'} )^{x_i}.    (3)

The set of unknown model parameters in PMM1 is \Theta = {\theta_l}_{l=1}^{L}.

Of course, multi-category text may sometimes be weighted more toward one category than to the rest of the categories among multiple categories. However, averaged over all biases, these could cancel out, and therefore PMM1 would be reasonable. This motivates us to construct PMM1.

PMMs differ from the usual distributional mixture models in the sense that the mixing is performed in a parameter space, while in the latter several distributional components are mixed. Since the latter models assume that a sample is generated from one component, they cannot represent "multiplicity." On the other hand, PMM1 can represent 2^L - 1 multi-category classes with only L parameter vectors.

2.3 PMM2

In PMM1, shown in Eq. (2), \varphi(y) is approximated by {\theta_l}, which can be regarded as a "first-order" approximation. We consider a second-order model, PMM2, as a more flexible model, in which parameter vectors of duplicate categories, \theta_{l,m}, are also used to approximate \varphi(y):

    \varphi(y) = \sum_{l=1}^{L} \sum_{m=1}^{L} h_l(y) h_m(y) \theta_{l,m}, where \theta_{l,m} = \alpha_{l,m} \theta_l + \alpha_{m,l} \theta_m.    (4)

Here, \alpha_{l,m} is a non-negative bias parameter satisfying \alpha_{l,m} + \alpha_{m,l} = 1 for all l, m. Clearly, \alpha_{l,l} = 0.5. 
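The second-order mixture of Eq. (4) can likewise be sketched directly (an illustrative implementation; the variable names and the double loop are our choices):

```python
import numpy as np

def phi_pmm2(y, theta, alpha):
    """Word distribution phi(y) under PMM2 (Eq. 4): a second-order
    mixture over pairwise parameter vectors
        theta_{l,m} = alpha[l,m] * theta_l + alpha[m,l] * theta_m,
    where alpha must satisfy alpha[l,m] + alpha[m,l] == 1
    (hence alpha[l,l] == 0.5)."""
    y = np.asarray(y, dtype=float)
    h = y / y.sum()                       # same mixing proportions as PMM1
    L, V = theta.shape
    phi = np.zeros(V)
    for l in range(L):
        for m in range(L):
            theta_lm = alpha[l, m] * theta[l] + alpha[m, l] * theta[m]
            phi += h[l] * h[m] * theta_lm
    return phi
```

When all biases are 0.5, PMM2 reduces to the uniform average over active categories, i.e., to PMM1 with uniform mixing.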
For example, in the case of L = 3, \varphi((1,1,0)) = {(1 + 2\alpha_{1,2})\theta_1 + (1 + 2\alpha_{2,1})\theta_2}/4 and \varphi((1,1,1)) = {(1 + 2(\alpha_{1,2} + \alpha_{1,3}))\theta_1 + (1 + 2(\alpha_{2,1} + \alpha_{2,3}))\theta_2 + (1 + 2(\alpha_{3,1} + \alpha_{3,2}))\theta_3}/9. In PMM2, unlike in PMM1, the category biases themselves can be estimated from the given training data.

Based on Eq. (4), PMM2 can be defined by

    P(x|y; \Theta) \propto \prod_{i=1}^{V} ( \sum_{l=1}^{L} \sum_{m=1}^{L} y_l y_m \theta_{l,m,i} / (\sum_{l=1}^{L} y_l \sum_{m=1}^{L} y_m) )^{x_i}.    (5)

The set of unknown parameters in PMM2 becomes \Theta = {\theta_l, \alpha_{l,m}}_{l=1,m=1}^{L,L}.

2.4 Related Model

Very recently, a more general probabilistic model for multi-latent-topic text, called Latent Dirichlet Allocation (LDA), has been proposed [1]. However, LDA is formulated in an "unsupervised" manner. Blei et al. also perform single-labeled text categorization using LDA, in which an individual LDA model is fitted to each class. Namely, they do not explain how to model the observed class labels y in LDA.

In contrast, our PMMs can efficiently model class y, depending on other classes through the common basis vectors. Moreover, based on the PMM assumption, models much simpler than LDA can be constructed, as mentioned above. Furthermore, unlike in LDA, it is feasible to compute the objective functions for PMMs exactly, as shown below.

3 Learning & Prediction Algorithms

3.1 Objective functions

Let D = {(x^n, y^n)}_{n=1}^{N} denote the given training data (N labeled documents). The unknown parameter \Theta is estimated by maximizing the posterior p(\Theta|D). Assuming that P(y) is independent of \Theta, \hat{\Theta}_{map} = \arg\max_\Theta { \sum_{n=1}^{N} \log P(x^n|y^n; \Theta) + \log p(\Theta) }. Here, p(\Theta) is a prior over the parameters. We used the following conjugate priors (Dirichlet distributions) over \theta_l and \alpha_{l,m}: p(\Theta) \propto \prod_{l=1}^{L} \prod_{i=1}^{V} \theta_{l,i}^{\xi-1} for PMM1, and p(\Theta) \propto (\prod_{l=1}^{L} \prod_{i=1}^{V} \theta_{l,i}^{\xi-1})(\prod_{l=1}^{L} \prod_{m=1}^{L} \alpha_{l,m}^{\zeta-1}) for PMM2. Here, \xi and \zeta are hyperparameters, and in this paper we set \xi = 2 and \zeta = 2, each of which is equivalent to Laplace smoothing for \theta_{l,i} and \alpha_{l,m}, respectively.

Consequently, the objective function to find \hat{\Theta}_{map} is given by

    J(\Theta; D) = L(\Theta; D) + (\xi - 1) \sum_{l=1}^{L} \sum_{i=1}^{V} \log \theta_{l,i} + (\zeta - 1) \sum_{l=1}^{L} \sum_{m=1}^{L} \log \alpha_{l,m}.    (6)

Of course, the third term on the RHS of Eq. (6) is simply ignored for PMM1. The likelihood term, L, is given by

    PMM1: L(\Theta; D) = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \log \sum_{l=1}^{L} h^n_l \theta_{l,i},    (7)

    PMM2: L(\Theta; D) = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \log \sum_{l=1}^{L} \sum_{m=1}^{L} h^n_l h^n_m \theta_{l,m,i}.    (8)

Note that \theta_{l,m,i} = \alpha_{l,m} \theta_{l,i} + \alpha_{m,l} \theta_{m,i}.

3.2 Update formulae

The optimization problem given by Eq. (6) cannot be solved analytically; therefore, some iterative method must be applied. Although steepest-ascent algorithms, including Newton's method, are available, here we derive an efficient algorithm in a manner similar to the EM algorithm [2]. First, we derive the parameter update formulae for PMM2, because they are more general than those for PMM1. We then explain those for PMM1 as a special case.

Suppose that \Theta^{(t)} is obtained at step t. We then attempt to derive \Theta^{(t+1)} by using \Theta^{(t)}. 
For convenience, we define g^n_{l,m,i} and \lambda_{l,m,i} as follows:

    g^n_{l,m,i}(\Theta) = h^n_l h^n_m \theta_{l,m,i} / \sum_{l=1}^{L} \sum_{m=1}^{L} h^n_l h^n_m \theta_{l,m,i},    (9)

    \lambda_{l,m,i}(\theta_{l,m}) = \alpha_{l,m} \theta_{l,i} / \theta_{l,m,i},  \lambda_{m,l,i}(\theta_{l,m}) = \alpha_{m,l} \theta_{m,i} / \theta_{l,m,i}.    (10)

Noting that \sum_{l=1}^{L} \sum_{m=1}^{L} g^n_{l,m,i}(\Theta) = 1, L for PMM2 can be rewritten as

    L(\Theta; D) = \sum_{n,i} x_{n,i} \sum_{l,m} g^n_{l,m,i}(\Theta^{(t)}) \log { (h^n_l h^n_m \theta_{l,m,i} / h^n_l h^n_m \theta_{l,m,i}) \sum_{l',m'} h^n_{l'} h^n_{m'} \theta_{l',m',i} }
                 = \sum_{n,i} x_{n,i} \sum_{l,m} g^n_{l,m,i}(\Theta^{(t)}) \log h^n_l h^n_m \theta_{l,m,i} - \sum_{n,i} x_{n,i} \sum_{l,m} g^n_{l,m,i}(\Theta^{(t)}) \log g^n_{l,m,i}(\Theta).    (11)

Moreover, noting that \lambda_{l,m,i}(\theta_{l,m}) + \lambda_{m,l,i}(\theta_{l,m}) = 1, we rewrite the first term on the RHS of Eq. (11) as

    \sum_{n,i} x_{n,i} \sum_{l,m} g^n_{l,m,i}(\Theta^{(t)}) [ \lambda_{l,m,i}(\theta^{(t)}_{l,m}) \log { (\alpha_{l,m}\theta_{l,i} / \alpha_{l,m}\theta_{l,i}) h^n_l h^n_m \theta_{l,m,i} } + \lambda_{m,l,i}(\theta^{(t)}_{l,m}) \log { (\alpha_{m,l}\theta_{m,i} / \alpha_{m,l}\theta_{m,i}) h^n_l h^n_m \theta_{l,m,i} } ].    (12)

From Eqs. (11) and (12), we obtain the following important equation:

    L(\Theta; D) = U(\Theta|\Theta^{(t)}) - T(\Theta|\Theta^{(t)}).    (13)

Here, U and T are defined by

    U(\Theta|\Theta^{(t)}) = \sum_{n,i,l,m} x_{n,i} g^n_{l,m,i}(\Theta^{(t)}) { \lambda_{l,m,i}(\theta^{(t)}_{l,m}) \log h^n_l h^n_m \alpha_{l,m} \theta_{l,i} + \lambda_{m,l,i}(\theta^{(t)}_{l,m}) \log h^n_l h^n_m \alpha_{m,l} \theta_{m,i} },    (14)

    T(\Theta|\Theta^{(t)}) = \sum_{n,i,l,m} x_{n,i} g^n_{l,m,i}(\Theta^{(t)}) { \log g^n_{l,m,i}(\Theta) + \lambda_{l,m,i}(\theta^{(t)}_{l,m}) \log \lambda_{l,m,i}(\theta_{l,m}) + \lambda_{m,l,i}(\theta^{(t)}_{l,m}) \log \lambda_{m,l,i}(\theta_{l,m}) }.    (15)

From Jensen's inequality, T(\Theta|\Theta^{(t)}) \le T(\Theta^{(t)}|\Theta^{(t)}) holds. Thus, we just maximize U(\Theta|\Theta^{(t)}) + \log p(\Theta) w.r.t. \Theta to derive the parameter update formulae. Noting that \theta_{l,m,i} \equiv \theta_{m,l,i} and g^n_{l,m,i} \equiv g^n_{m,l,i}, we can derive the following formulae:

    \theta^{(t+1)}_{l,i} = ( 2 \sum_{n=1}^{N} x^n_i \sum_{m=1}^{L} g^n_{l,m,i}(\Theta^{(t)}) \lambda_{l,m,i}(\Theta^{(t)}) + \xi - 1 ) / ( 2 \sum_{i=1}^{V} \sum_{n=1}^{N} x^n_i \sum_{m=1}^{L} g^n_{l,m,i}(\Theta^{(t)}) \lambda_{l,m,i}(\Theta^{(t)}) + V(\xi - 1) ), for all l, i;    (16)

    \alpha^{(t+1)}_{l,m} = ( \sum_{n=1}^{N} \sum_{i=1}^{V} x^n_i g^n_{l,m,i}(\Theta^{(t)}) \lambda_{l,m,i}(\Theta^{(t)}) + (\zeta - 1)/2 ) / ( \sum_{i=1}^{V} \sum_{n=1}^{N} x^n_i g^n_{l,m,i}(\Theta^{(t)}) + \zeta - 1 ), for all l and m \ne l.    (17)

These parameter updates always converge to a local optimum of J given by Eq. (6).

In PMM1, since the only unknown parameter is {\theta_l}, by modifying Eq. (9) as

    g^n_{l,i}(\Theta) = h^n_l \theta_{l,i} / \sum_{l'=1}^{L} h^n_{l'} \theta_{l',i},    (18)

and rewriting Eq. (7) in a similar manner, we obtain

    L(\Theta; D) = \sum_{n,i} x_{n,i} \sum_{l} g^n_{l,i}(\Theta^{(t)}) \log h^n_l \theta_{l,i} - \sum_{n,i} x_{n,i} \sum_{l} g^n_{l,i}(\Theta^{(t)}) \log g^n_{l,i}(\Theta).    (19)

In this case, U takes the simpler form

    U(\Theta|\Theta^{(t)}) = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \sum_{l=1}^{L} g^n_{l,i}(\Theta^{(t)}) \log h^n_l \theta_{l,i}.    (20)

Therefore, maximizing U(\Theta|\Theta^{(t)}) + (\xi - 1) \sum_{l=1}^{L} \sum_{i=1}^{V} \log \theta_{l,i} w.r.t. \Theta under the constraints \sum_i \theta_{l,i} = 1 for all l, we obtain the following update formula for PMM1:

    \theta^{(t+1)}_{l,i} = ( \sum_{n=1}^{N} x_{n,i} g^n_{l,i}(\Theta^{(t)}) + \xi - 1 ) / ( \sum_{i=1}^{V} \sum_{n=1}^{N} x_{n,i} g^n_{l,i}(\Theta^{(t)}) + V(\xi - 1) ), for all l, i.    (21)

Remark: The parameter update given by Eq. (21) of PMM1 always converges to the global optimum solution.

Proof: The Hessian matrix, H, of the objective function, J, of PMM1 satisfies

    H = \Phi^T (\partial^2 J(\Theta; D) / \partial\Theta \partial\Theta^T) \Phi = d^2 J(\Theta + \kappa\Phi; D) / d\kappa^2 |_{\kappa=0}
      = - \sum_{n,i} x^n_i ( \sum_l h^n_l \phi_{l,i} / \sum_l h^n_l \theta_{l,i} )^2 - (\xi - 1) \sum_{l,i} ( \phi_{l,i} / \theta_{l,i} )^2.    (22)

Here, \Phi is an arbitrary vector in the \Theta space. Noting that x^n_i \ge 0, \xi > 1, and \Phi \ne 0, H is negative definite; therefore, J is a strictly convex function of \Theta. Moreover, since the feasible region defined by J and the constraints \sum_i \theta_{l,i} = 1 for all l is a convex set, the maximization problem here becomes a convex programming problem and has a unique global solution. Since Eq. (21) always increases J at each iteration, the learning algorithm given above always converges to the global optimum solution, irrespective of the initial parameter values.

3.3 Prediction

Let \hat{\Theta} denote the estimated parameter. Then, applying Bayes' rule, the optimum category vector y* for a new document x* is defined as y* = \arg\max_y P(y|x*; \hat{\Theta}) under a uniform class prior assumption. Since this maximization problem is a zero-one integer programming problem (i.e., NP-hard), an exhaustive search is prohibitive for large L. Therefore, we solve this problem approximately with the help of the following greedy-search algorithm. That is, first, only one value y_{l_1} is set to 1 so that P(y|x*; \hat{\Theta}) is maximized. Then, among the remaining elements, the single value y_{l_2} that most increases P(y|x*; \hat{\Theta}) is set to 1, with y_{l_1} held fixed. 
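As an aside on the learning step above, the PMM1 fixed-point update of Eq. (21) can be sketched in a few lines (an illustrative implementation under our own choices of initialization and iteration count, not the authors' code):

```python
import numpy as np

def fit_pmm1(X, Y, xi=2.0, n_iter=100):
    """MAP estimation of PMM1 parameters via the update of Eq. (21).
    X: (N, V) word-frequency matrix; Y: (N, L) binary label matrix
    with at least one active label per row. Returns theta of shape (L, V)."""
    N, V = X.shape
    L = Y.shape[1]
    H = Y / Y.sum(axis=1, keepdims=True)         # h_l^n = y_l^n / sum_l' y_l'^n
    rng = np.random.default_rng(0)
    theta = rng.dirichlet(np.ones(V), size=L)    # random start on the simplex
    for _ in range(n_iter):
        # E-like step: g_{l,i}^n = h_l^n theta_{l,i} / sum_l' h_l'^n theta_{l',i}
        num = H[:, :, None] * theta[None, :, :]              # (N, L, V)
        g = num / num.sum(axis=1, keepdims=True)
        # M-like step (Eq. 21); xi - 1 acts as Laplace smoothing
        s = (X[:, None, :] * g).sum(axis=0) + (xi - 1.0)     # (L, V)
        theta = s / s.sum(axis=1, keepdims=True)
    return theta
```

Because J is strictly convex for xi > 1 (see the Remark above), the iteration is insensitive to the random initialization.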
This procedure is repeated until P(y|x*; \hat{\Theta}) cannot be increased any further. In other words, the algorithm successively sets elements of y to 1 to increase the posterior probability until its value no longer improves. This is very efficient because it requires computing the posterior probability at most L(L+1)/2 times, while the exhaustive search needs 2^L - 1 evaluations.

4 Experiments

4.1 Automatic Web Page Categorization

We tried to categorize real Web pages linked from the "yahoo.com" domain.¹ More specifically, Yahoo consists of 14 top-level categories (i.e., "Arts & Humanities," "Business & Economy," "Computers & Internet," and so on), and each category is divided into a number of second-level subcategories. By focusing on the second-level categories, we can construct 14 independent text categorization problems. We used 11 of these 14 problems.² In those 11 problems, the minimum (maximum) values of L and V were 21 (40) and 21924 (52350), respectively. About 30-45% of the pages are multi-labeled over the 11 problems. To collect a set of related Web pages for each problem, we used a software robot called GNU Wget (version 1.5.3). A text's multi-label can be obtained by following its hyperlinks in reverse toward the page of origin.

We compared our PMMs with the conventional methods: naive Bayes (NB), SVM, k-nearest neighbor (kNN), and three-layer neural networks (NN). We used linear SVMlight (version 4.0), tuning the C (penalty cost) and J (cost factor for negative and positive samples) parameters for each binary classification to improve the SVM results [6].³ In addition, it is worth mentioning that when applying the SVM, each x^n was normalized so that \sum_{i=1}^{V} x^n_i = 1, because discrimination is much easier in the (V-1)-dimensional simplex than in the original V-dimensional space. In other words, classification is generally not determined by the number of words on the page; indeed, this normalization significantly improved the performance.

¹ This domain is a famous portal site, and most related pages linked from the domain are registered by site recommendation; therefore, the category labels should be reliable.
² We could not collect enough pages for three categories due to our communication network security. However, we believe that 11 independent problems are sufficient for evaluating our method.
³ Since the ratio of the number of positive samples to negative samples per category was quite small in our Web pages, SVM without the J option produced poor results.

Table 1: Performance for 3000 test data using 2000 training data.

No.   NB           SVM          kNN          NN           PMM1         PMM2
1     41.6 (1.9)   47.1 (0.3)   40.0 (1.1)   43.3 (0.2)   50.6 (1.0)   48.6 (1.0)
2     75.0 (0.6)   74.5 (0.8)   78.4 (0.4)   77.4 (0.5)   75.5 (0.9)   72.1 (1.2)
3     56.5 (1.3)   56.2 (1.1)   51.1 (0.8)   53.8 (1.3)   61.0 (0.4)   59.9 (0.6)
4     39.3 (1.0)   47.8 (0.8)   42.9 (0.9)   44.1 (1.0)   51.3 (2.8)   48.3 (0.5)
5     54.5 (0.8)   56.9 (0.5)   47.6 (1.0)   54.9 (0.5)   59.7 (0.4)   58.4 (0.6)
6     66.4 (0.8)   67.1 (0.3)   60.4 (0.5)   66.0 (0.4)   66.2 (0.5)   65.1 (0.3)
7     51.8 (0.8)   52.1 (0.8)   44.4 (1.1)   49.6 (1.3)   55.2 (0.5)   52.4 (0.6)
8     52.6 (1.1)   55.4 (0.6)   53.3 (0.5)   55.0 (1.1)   61.1 (1.4)   60.1 (1.2)
9     42.4 (0.9)   49.2 (0.7)   43.9 (0.6)   45.8 (1.3)   51.4 (0.7)   49.9 (0.8)
10    41.7 (10.7)  65.0 (1.1)   59.5 (0.9)   62.2 (2.3)   62.0 (5.1)   56.4 (6.3)
11    47.2 (0.9)   51.4 (0.6)   46.4 (1.2)   50.5 (0.4)   54.2 (0.2)   52.5 (0.7)

We employed the cosine similarity for the kNN method (see [8] for more details). As for NNs, each NN consists of V input units and L output units for estimating a category vector from each frequency vector. We used 50 hidden units. 
Each NN was trained to maximize the sum of cross-entropy functions for the target and estimated category vectors of the training samples, together with a regularization term consisting of the sum of squared NN weights. Note that we did not perform any feature transformation such as TFIDF (see, e.g., [8]) because we wanted to evaluate the basic performance of each detection method purely.

We used the F-measure as the performance measure, defined as the weighted harmonic average of two well-known statistics: precision, P, and recall, R. Let y^n = (y^n_1, ..., y^n_L) and \hat{y}^n = (\hat{y}^n_1, ..., \hat{y}^n_L) be the actual and predicted category vectors for x^n, respectively. Then F_n = 2 P_n R_n / (P_n + R_n), where P_n = \sum_{l=1}^{L} y^n_l \hat{y}^n_l / \sum_{l=1}^{L} \hat{y}^n_l and R_n = \sum_{l=1}^{L} y^n_l \hat{y}^n_l / \sum_{l=1}^{L} y^n_l. We evaluated the performance by \bar{F} = (1/3000) \sum_{n=1}^{3000} F_n using 3000 test data independent of the training data. Although micro- and macro-averages can also be used, we think that the sample-based F-measure is the most suitable for evaluating generalization performance, since it is natural to adopt the i.i.d. assumption for documents.

4.2 Results

For each of the 11 problems, we used five pairs of training and test data sets. In Table 1 (Table 2) we compare the means of the \bar{F} values over five trials using 2000 (500) training documents. Each number in parentheses in the tables denotes the standard deviation over the five trials. PMMs took about five minutes for training (2000 data) and only about one minute for the test (3000 data) on a 2.0-GHz Pentium PC, averaged over the 11 problems. The PMMs were much faster than kNN and NN.

In the binary approach, SVMs with optimally tuned parameters produced rather better results than the NB method. The performance of SVMs, however, was inferior to that of PMMs in almost all problems. 
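The sample-based F-measure defined above can be computed as follows (a sketch; the function name and the guards against empty label sets are ours):

```python
import numpy as np

def sample_f_measure(y_true, y_pred):
    """Mean of per-document F_n = 2 P_n R_n / (P_n + R_n), where P_n and
    R_n are per-document precision and recall over the L binary labels.
    y_true, y_pred: (N, L) binary arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    tp = (y_true * y_pred).sum(axis=1)                    # per-document overlap
    P = tp / np.maximum(y_pred.sum(axis=1), 1.0)          # precision P_n
    R = tp / np.maximum(y_true.sum(axis=1), 1.0)          # recall R_n
    F = np.where(P + R > 0, 2 * P * R / np.maximum(P + R, 1e-12), 0.0)
    return F.mean()
```

Averaging F_n over test documents gives the \bar{F} statistic reported in the tables.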
These experimental results support the importance of considering generative models of multi-category text.

When the training sample size was 2000, kNN provided results comparable to the NB method. On the other hand, when the training sample size was 500, the kNN method obtained results similar to or slightly better than those of SVM. However, in both cases, PMMs significantly outperformed kNN. We think that the memory-based approach is limited in its generalization ability for multi-labeled text categorization.

The results of the well-regularized NN were fair, although it took an intolerable amount of training time, indicating that flexible discrimination would not be necessary for discriminating high-dimensional, sparse text data.

Table 2: Performance for 3000 test data using 500 training data.

No.   NB           SVM          kNN          NN           PMM1         PMM2
1     21.2 (1.0)   32.5 (0.5)   34.7 (0.4)   33.8 (0.4)   43.9 (1.0)   43.2 (0.8)
2     73.9 (0.7)   73.8 (1.2)   75.6 (0.6)   74.8 (0.9)   75.2 (0.4)   69.7 (8.9)
3     46.1 (2.9)   44.9 (1.9)   44.1 (1.2)   45.1 (1.0)   56.4 (0.3)   55.4 (0.5)
4     15.2 (0.9)   33.6 (0.5)   37.1 (1.0)   33.8 (1.1)   41.8 (1.2)   41.9 (0.7)
5     34.1 (1.6)   42.7 (1.3)   43.9 (1.0)   45.3 (0.9)   53.0 (0.3)   53.1 (0.6)
6     50.2 (0.3)   56.0 (1.0)   54.4 (0.9)   57.2 (0.7)   58.9 (0.9)   59.4 (1.0)
7     22.1 (0.8)   32.1 (0.5)   37.4 (1.1)   33.9 (0.8)   46.5 (1.3)   45.5 (0.9)
8     32.7 (4.4)   38.8 (0.6)   48.1 (1.3)   43.1 (1.0)   54.1 (1.5)   53.5 (1.5)
9     17.6 (1.6)   32.5 (1.0)   35.3 (0.4)   31.6 (1.7)   40.3 (0.7)   41.0 (0.5)
10    40.6 (12.3)  55.0 (1.1)   53.7 (0.6)   55.8 (4.0)   57.8 (6.5)   57.9 (5.9)
11    34.2 (2.2)   38.3 (4.7)   40.2 (0.7)   40.9 (1.2)   49.7 (0.9)   49.0 (0.5)

The results obtained by PMM1 were better than those by PMM2, which indicates that a model with a fixed \alpha_{l,m} = 0.5 seems sufficient, at least for the WWW pages used in the experiments.

5 Concluding Remarks

We have proposed new types of mixture models (PMMs) for multi-labeled text categorization, together with efficient algorithms for both learning and prediction. We have taken some important steps along this path, and we are encouraged by our current results using real World Wide Web pages. Moreover, we have confirmed that studying the generative model for multi-labeled text is beneficial in improving the performance.

References

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. To appear in Advances in Neural Information Processing Systems 14. MIT Press.
[2] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1977.
[3] S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proc. of ACM-CIKM'98, 1998.
[4] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proc. of the European Conference on Machine Learning, 137-142, Berlin, 1998.
[5] D. Lewis and M. Ringuette. A comparison of two learning algorithms for text categorization. In Third Annual Symposium on Document Analysis and Information Retrieval, 81-93, 1994.
[6] K. Morik, P. Brockhausen, and T. Joachims. Combining statistical learning with a knowledge-based approach: A case study in intensive care monitoring. In Proc. of the International Conference on Machine Learning (ICML'99), 1999.
[7] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39:103-134, 2000.
[8] Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. In Proc. of the International Conference on Machine Learning, 412-420, 1997.
[9] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., New York, 1998.
", "award": [], "sourceid": 2244, "authors": [{"given_name": "Naonori", "family_name": "Ueda", "institution": null}, {"given_name": "Kazumi", "family_name": "Saito", "institution": null}]}