{"title": "Maximal Margin Labeling for Multi-Topic Text Categorization", "book": "Advances in Neural Information Processing Systems", "page_first": 649, "page_last": 656, "abstract": null, "full_text": "Maximal Margin Labeling for Multi-Topic Text\n\nCategorization\n\nHideto Kazawa, Tomonori Izumitani, Hirotoshi Taira and Eisaku Maeda\n\nNTT Communication Science Laboratories\nNippon Telegraph and Telephone Corporation\n\n2-4 Hikaridai, Seikacho, Sorakugun, Kyoto 619-0237 Japan\n\n{kazawa,izumi,taira,maeda}@cslab.kecl.ntt.co.jp\n\nAbstract\n\nIn this paper, we address the problem of statistical learning for multi-\ntopic text categorization (MTC), whose goal is to choose all relevant top-\nics (a label) from a given set of topics. The proposed algorithm, Max-\nimal Margin Labeling (MML), treats all possible labels as independent\nclasses and learns a multi-class classi\ufb01er on the induced multi-class cate-\ngorization problem. To cope with the data sparseness caused by the huge\nnumber of possible labels, MML combines some prior knowledge about\nlabel prototypes and a maximal margin criterion in a novel way. Experi-\nments with multi-topic Web pages show that MML outperforms existing\nlearning algorithms including Support Vector Machines.\n\n1 Multi-topic Text Categorization (MTC)\n\nThis paper addresses the problem of learning for multi-topic text categorization (MTC),\nwhose goal is to select all topics relevant to a text from a given set of topics. In MTC,\nmultiple topics may be relevant to a single text. We thus call a set of topics label, and say\nthat a text is assigned a label, not a topic.\n\nIn almost all previous text categorization studies (e.g. [1, 2]), the label is predicted by\njudging each topic\u2019s relevance to the text. In this decomposition approach, the features\nspeci\ufb01c to a topic, not a label, are regarded as important features. 
However, this approach may result in inefficient learning, as the following example illustrates.

Imagine an MTC problem over scientific papers in which quantum computing papers are assigned the multi-topic label "quantum physics (QP) & computer science (CS)". (QP and CS are topics in this example.) Since there are some words specific to quantum computing, such as "qbit"1, an efficient MTC learner should use such words to assign the label QP & CS. However, the decomposition approach is likely to ignore these words: they are specific to only a small portion of all QP or CS papers (there are many more QP and CS papers than quantum computing papers), and therefore are not discriminative features for either topic QP or topic CS.

1 A qbit is a unit of quantum information; it appears frequently in the quantum computing literature, but is rarely seen elsewhere.

Symbol                  | Meaning
x (\in R^d)             | A document vector
t_1, t_2, ..., t_l      | Topics
T                       | The set of all topics
L, \lambda (\subset T)  | A label
L[j]                    | The binary representation of L: 1 if t_j \in L and 0 otherwise
\Lambda (= 2^T)         | The set of all possible labels
{(x_i, L_i)}_{i=1}^m    | Training samples

Table 1: Notation

Parametric Mixture Model (PMM) [3] adopts another approach to MTC. PMM assumes that multi-topic texts are generated from a mixture of topic-specific word distributions. Its labeling decision is made at once, not separately for each topic. However, PMM also has a problem with multi-topic-specific features such as "qbit", since texts cannot have such features under PMM's mixture process.

These problems with multi-topic-specific features are caused by dependency assumptions between labels, which are explicitly or implicitly made in existing methods. 
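To make the notation of Table 1 concrete, here is a small sketch (the topic names and the helper function are ours, chosen only for illustration) showing a label as a topic set and its binary representation L[j]:

```python
# Illustrative sketch of the notation in Table 1 (topic names are invented).
topics = ["QP", "CS", "Bio"]  # t_1, ..., t_l with l = 3

def to_binary(label, topics):
    """Binary representation of a label: L[j] = 1 if t_j is in L, else 0."""
    return [1 if t in label else 0 for t in topics]

label = {"QP", "CS"}              # a multi-topic label L, a subset of T
print(to_binary(label, topics))   # [1, 1, 0]
print(2 ** len(topics))           # |Lambda| = 2^l = 8 possible labels
```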
To solve these problems, we propose Maximal Margin Labeling (MML), which treats labels as independent classes and learns a multi-class classifier on the induced multi-class problem.

In this paper, we first discuss why multi-class classifiers cannot be directly applied to MTC in Section 2. We then propose MML in Section 3, and address implementation issues in Section 4. In Section 5, MML is experimentally compared with existing methods using a collection of multi-topic Web pages. We summarize this paper in Section 6.

2 Solving MTC as a Multi-Class Categorization

To discuss why existing multi-class classifiers do not work in MTC, we start from the multi-class classifier proposed in [4]. Hereafter we use the notation given in Table 1. The multi-class classifier in [4] categorizes an object into the class whose prototype vector is closest to the object's feature vector. By substituting labels for classes, the classifier can be written as follows:

f(x) = \arg\max_{\lambda \in \Lambda} \langle x, m_\lambda \rangle_X,   (1)

where \langle \cdot, \cdot \rangle_X is the inner product of R^d, and m_\lambda \in R^d is the prototype vector of label \lambda. Following an argument similar to that in [4], the prototype vectors are learned by solving the following maximal margin problem2:

\min_M \frac{1}{2} \|M\|^2 + C \sum_{1 \le i \le m} \sum_{\lambda \in \Lambda, \lambda \ne L_i} \xi_i^\lambda
s.t. \langle x_i, m_{L_i} \rangle_X - \langle x_i, m_\lambda \rangle_X \ge 1 - \xi_i^\lambda, \quad \xi_i^\lambda \ge 0 \quad for 1 \le i \le m, \forall \lambda \ne L_i,   (2)

where M is the prototype matrix whose columns are the prototype vectors, and \|M\| is the Frobenius norm of M.
Note that Eq. (1) and Eq. (2) cover not only the training samples' labels but also all possible labels. This is because labels unseen in the training samples may be relevant to test samples. In
2 In Eq. (2), we penalize all violations of the margin constraints. 
On the other hand, Crammer and Singer penalize only the largest violation of the margin constraints for each training sample [4]. We chose the "penalize-all" approach since it leads to an optimization problem without equality constraints (see Eq. (7)), which is much easier to solve than the one in [4].

usual multi-class problems, such unseen labels seldom exist. In MTC, however, the number of labels is generally very large (e.g. one of our datasets has 1,054 labels (Table 2)), and unseen labels often exist. It is thus necessary to consider all possible labels in Eq. (1) and Eq. (2), since it is impossible to know which unseen labels will appear in the test samples.

There are two problems with Eq. (1) and Eq. (2). The first is that they involve the prototype vectors of seldom or never seen labels. Without prior knowledge about where those prototype vectors should be, it is impossible to obtain appropriate prototype vectors for such labels. The second is that these equations are computationally too demanding, since they involve combinatorial maximization and summation over all possible labels, whose number can be quite large. (For example, the number is around 2^30 in the datasets used in our experiments.) We address the first problem in Section 3 and the second problem in Section 4.

3 Maximal Margin Labeling

In this section, we incorporate prior knowledge about the location of the prototype vectors into Eq. (1) and Eq. (2), and propose a novel MTC learning algorithm, Maximal Margin Labeling (MML).

As prior knowledge, we simply assume that the prototype vectors of similar labels should be placed close to each other. Based on this assumption, we first rewrite Eq. (1) to yield

f(x) = \arg\max_{\lambda \in \Lambda} \langle M^T x, e_\lambda \rangle_L,   (3)

where \langle \cdot, \cdot \rangle_L is the inner product of R^{|\Lambda|} and \{e_\lambda\}_{\lambda \in \Lambda} is the orthonormal basis of R^{|\Lambda|}. 
The classifier of Eq. (3) can be interpreted as a two-step process: the first step maps the vector x into R^{|\Lambda|} by M^T, and the second step finds the e_\lambda closest to the image M^T x. We then replace \{e_\lambda\}_{\lambda \in \Lambda} with (generally) non-orthogonal vectors \{\phi(\lambda)\}_{\lambda \in \Lambda} whose geometrical configuration reflects label similarity. More formally, we use vectors \{\phi(\lambda)\}_{\lambda \in \Lambda} that satisfy the condition

\langle \phi(\lambda_1), \phi(\lambda_2) \rangle_S = S(\lambda_1, \lambda_2) \quad for \forall \lambda_1, \lambda_2 \in \Lambda,   (4)

where \langle \cdot, \cdot \rangle_S is an inner product of the vector space spanned by \{\phi(\lambda)\}_{\lambda \in \Lambda}, and S is a Mercer kernel [5] on \Lambda \times \Lambda that serves as a similarity measure between labels. We call the vector space spanned by \{\phi(\lambda)\} V_S.

With this replacement, MML's classifier is written as follows:

f(x) = \arg\max_{\lambda \in \Lambda} \langle W x, \phi(\lambda) \rangle_S,   (5)

where W is a linear map from R^d to V_S. W is the solution of the following problem:

\min_W \frac{1}{2} \|W\|^2 + C \sum_{i=1}^m \sum_{\lambda \in \Lambda, \lambda \ne L_i} \xi_i^\lambda
s.t. \left\langle W x_i, \frac{\phi(L_i) - \phi(\lambda)}{\|\phi(L_i) - \phi(\lambda)\|} \right\rangle \ge 1 - \xi_i^\lambda, \quad \xi_i^\lambda \ge 0 \quad for 1 \le i \le m, \forall \lambda \ne L_i.   (6)

Note that if \phi(\lambda) is replaced by e_\lambda, Eq. (6) becomes identical to Eq. (2) except for a scale factor. Thus Eq. (5) and Eq. (6) are natural extensions of the multi-class classifier in [4]. We call the MTC classifier of Eq. (5) and Eq. (6) "Maximal Margin Labeling (MML)".

Figure 1 illustrates the margin (the inner product in Eq. (6)) in MML. The margin represents the distance from the image of the training sample x_i to the boundary between the correct label L_i and a wrong label \lambda. 
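Since S is a Mercer kernel, an embedding \phi satisfying Eq. (4) always exists; for a finite label set, one concrete construction is a Cholesky (or eigen) decomposition of the label Gram matrix. The sketch below illustrates this with a tiny toy label set and the Dice similarity defined later in this section (the labels and helper are ours, chosen only for illustration):

```python
import numpy as np

# A toy finite label set and a Dice-style set-overlap similarity (Eq. (9)).
labels = [frozenset({"QP"}), frozenset({"CS"}), frozenset({"QP", "CS"})]

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

# Gram matrix of pairwise label similarities; a Mercer kernel makes it PSD.
G = np.array([[dice(a, b) for b in labels] for a in labels])

# Rows of a Cholesky factor give one concrete phi with <phi(a), phi(b)> = S(a, b).
# The tiny jitter guards against numerically semi-definite G.
Phi = np.linalg.cholesky(G + 1e-12 * np.eye(len(labels)))
print(np.allclose(Phi @ Phi.T, G, atol=1e-6))  # True
```

In MML itself this embedding never needs to be computed explicitly, because the dual form (Eq. (7)) accesses \phi only through S.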
MML optimizes the linear map W so that the smallest margin over all training samples and all possible labels is maximized, with a penalty C for samples that penetrate the margin.

Figure 1: Maximal Margin Labeling

Dual Form  For numerical computation, the following Wolfe dual form of Eq. (6) is more convenient. (We omit its derivation due to space limits.)

\max_{\alpha_i^\lambda} \sum_{i,\lambda} \alpha_i^\lambda - \frac{1}{2} \sum_{i,\lambda} \sum_{i',\lambda'} \alpha_i^\lambda \alpha_{i'}^{\lambda'} (x_i \cdot x_{i'}) \frac{S(L_i, L_{i'}) - S(L_i, \lambda') - S(\lambda, L_{i'}) + S(\lambda, \lambda')}{2\sqrt{(1 - S(L_i, \lambda))(1 - S(L_{i'}, \lambda'))}}
s.t. 0 \le \alpha_i^\lambda \le C \quad for 1 \le i \le m, \forall \lambda \ne L_i,   (7)

where we denote \sum_{i=1}^m \sum_{\lambda \in \Lambda, \lambda \ne L_i} by \sum_{i,\lambda}, and \alpha_i^\lambda are the dual variables corresponding to the first inequality constraints in Eq. (6). Note that Eq. (7) does not contain \phi(\lambda): all the computations involving \phi can be done through the label similarity S. Additionally, x_i only appears in inner products, and can therefore be replaced by any kernel of x.

Using the solution \alpha_i^\lambda of Eq. (7), the MML classifier in Eq. 
(5) can be written as follows:

f(x) = \arg\max_{L \in \Lambda} \sum_{i,\lambda} \alpha_i^\lambda (x \cdot x_i) \frac{S(L_i, L) - S(\lambda, L)}{\sqrt{2(1 - S(L_i, \lambda))}}.   (8)

Label Similarity3  As examples of label similarity, we use two similarity measures: the Dice measure and the cosine measure.

Dice measure4:
S_D(\lambda_1, \lambda_2) = \frac{2|\lambda_1 \cap \lambda_2|}{|\lambda_1| + |\lambda_2|} = \frac{2 \sum_{j=1}^l \lambda_1[j] \lambda_2[j]}{\sum_{j=1}^l \lambda_1[j] + \sum_{j=1}^l \lambda_2[j]}.   (9)

Cosine measure:
S_C(\lambda_1, \lambda_2) = \frac{|\lambda_1 \cap \lambda_2|}{\sqrt{|\lambda_1|}\sqrt{|\lambda_2|}} = \frac{\sum_{j=1}^l \lambda_1[j] \lambda_2[j]}{\sqrt{\sum_{j=1}^l \lambda_1[j]} \sqrt{\sum_{j=1}^l \lambda_2[j]}}.   (10)

4 Efficient Implementation

4.1 Approximation in Learning

Eq. (7) contains a sum over all possible labels. As the number of topics l increases, this summation rapidly becomes intractable, since |\Lambda| grows exponentially as 2^l. To circumvent
3 The following discussion is easily extended to include the case that both \lambda_1 and \lambda_2 are empty, although we do not discuss that case due to space limits.

this problem, we approximate the sum over all possible labels in Eq. (7) by the partial sum over the \alpha_i^\lambda with |(L_i \cap \lambda^c) \cup (L_i^c \cap \lambda)| = 1, and set all the other \alpha_i^\lambda to zero. This approximation greatly reduces the burden of the summation: the number of summands is reduced from 2^l to l, a huge reduction especially when many topics exist.

To understand the rationale behind the approximation, first note that \alpha_i^\lambda is the dual variable corresponding to the first inequality constraint (the margin constraint) in Eq. (6). Thus \alpha_i^\lambda is non-zero if and only if W x_i falls in the margin between \phi(L_i) and \phi(\lambda). We assume that this margin violation mainly occurs when \phi(\lambda) is "close" to \phi(L_i), i.e. when \lambda differs from L_i in exactly one topic: |(L_i \cap \lambda^c) \cup (L_i^c \cap \lambda)| = 1. 
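The restricted support of this approximation, the labels whose symmetric difference with L_i contains exactly one topic, can be enumerated cheaply; a minimal sketch (function name and index encoding are ours):

```python
# Sketch of the approximation's support: for a correct label Li, keep only the
# dual variables of labels whose symmetric difference with Li has size one.
def one_topic_neighbors(Li, num_topics):
    """Li is a set of topic indices; returns the l labels at symmetric difference 1."""
    return [Li ^ {j} for j in range(num_topics)]

Li = {0, 1}  # e.g. the label {t_1, t_2} with l = 5 topics
neighbors = one_topic_neighbors(Li, 5)
print(len(neighbors))  # 5, i.e. l summands instead of 2^l
```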
If this assumption holds well, the proposed approximation of the sum will lead to a good approximation of the exact solution.

4.2 Polynomial Time Algorithms for Classification

The classification of MML (Eq. (8)) involves a combinatorial maximization over all possible labels, so it can be a computationally demanding process. However, efficient classification algorithms are available when either the cosine measure or the Dice measure is used as the label similarity.

Eq. (8) can be divided into subproblems by the number of topics in a label:

f(x) = \arg\max_{L \in \{\hat{L}_1, \hat{L}_2, \ldots, \hat{L}_l\}} g(x, L),   (11)

\hat{L}_n = \arg\max_{L \in \Lambda, |L| = n} g(x, L),   (12)

where g(x, L) is

g(x, L) = \sum_{j=1}^l c_n[j] L[j],   (13)

c_n[j] = \sum_{i,\lambda} \frac{\alpha_i^\lambda (x \cdot x_i)}{\sqrt{2(1 - S_D(L_i, \lambda))}} \left( \frac{2 L_i[j]}{|L_i| + n} - \frac{2 \lambda[j]}{|\lambda| + n} \right) \quad if S_D is used,

c_n[j] = \sum_{i,\lambda} \frac{\alpha_i^\lambda (x \cdot x_i)}{\sqrt{2(1 - S_C(L_i, \lambda))}} \left( \frac{L_i[j]}{\sqrt{|L_i|}\sqrt{n}} - \frac{\lambda[j]}{\sqrt{|\lambda|}\sqrt{n}} \right) \quad if S_C is used.

Here n = |L|. The computational cost of Eq. (13) for all j is O(n_\alpha l) (n_\alpha is the number of non-zero \alpha), and that of Eq. (12) is O(l \log l). Thus the total cost of the classification by Eq. (11) is O(n_\alpha l^2 + l^2 \log l). On the other hand, n_\alpha is O(ml) under the approximation described above. Therefore, the classification can be done within O(ml^3) computational steps, a significant reduction from a brute-force search over Eq. (8).

5 Experiments

In this section, we report experiments that compared MML to PMM [3], SVM5 [6], and BoosTexter [2] using a collection of Web pages. We used a normalized linear kernel k(x, x') = x \cdot x' / (\|x\| \|x'\|) in MML and SVM. 
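The normalized linear kernel used here is simply the cosine of the two document vectors; a minimal sketch (function name is ours):

```python
import math

# Sketch of the normalized linear kernel used for MML and SVM:
# k(x, x') = (x . x') / (||x|| ||x'||), i.e. the cosine of the two vectors.
def normalized_linear_kernel(x, xp):
    dot = sum(a * b for a, b in zip(x, xp))
    nx = math.sqrt(sum(a * a for a in x))
    nxp = math.sqrt(sum(b * b for b in xp))
    return dot / (nx * nxp)

print(normalized_linear_kernel([1.0, 0.0], [1.0, 1.0]))  # ~0.7071
```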
As for BoosTexter, "real abstaining AdaBoost.MH" was used as the weak learner.

5.1 Experimental Setup

The datasets used in our experiment represent the Web page collection used in [3] (Table 2). The Web pages were collected through the hyperlinks from Yahoo!'s top directory
5 For each topic, an SVM classifier is trained to predict whether the topic is relevant (positive) or irrelevant (negative) to input documents.

Dataset Name (Abbrev.)     | #Text  | #Voc   | #Tpc | #Lbl  | Label Size Frequency (%): 1 / 2 / 3 / 4 / >=5
Arts & Humanities (Ar)     |  7,484 | 23,146 |  26  |   599 | 55.6 / 30.5 /  9.7 / 2.8 / 1.4
Business & Economy (Bu)    | 11,214 | 21,924 |  30  |   233 | 57.6 / 28.8 / 11.0 / 1.7 / 0.8
Computers & Internet (Co)  | 12,444 | 34,096 |  33  |   428 | 69.8 / 18.2 /  7.8 / 3.0 / 1.1
Education (Ed)             | 12,030 | 27,534 |  33  |   511 | 66.9 / 23.4 /  7.3 / 1.9 / 0.6
Entertainment (En)         | 12,730 | 32,001 |  21  |   337 | 72.3 / 21.1 /  4.5 / 1.0 / 1.1
Health (He)                |  9,205 | 30,605 |  32  |   335 | 53.2 / 34.0 /  9.5 / 2.4 / 0.9
Recreation (Rc)            | 12,828 | 30,324 |  22  |   530 | 69.2 / 23.1 /  5.6 / 1.4 / 0.6
Reference (Rf)             |  8,027 | 39,679 |  33  |   275 | 85.5 / 12.6 /  1.5 / 0.3 / 0.1
Science (Si)               |  6,428 | 37,187 |  40  |   457 | 68.0 / 22.3 /  7.3 / 1.9 / 0.5
Social Science (SS)        | 12,111 | 52,350 |  39  |   361 | 78.4 / 17.0 /  3.7 / 0.7 / 0.3
Society & Culture (SC)     | 14,512 | 31,802 |  27  | 1,054 | 59.6 / 26.1 /  9.2 / 2.9 / 2.2

Table 2: A summary of the Web page datasets. "#Text" is the number of texts in the dataset, "#Voc" the vocabulary size (i.e. the number of features), "#Tpc" the number of topics, "#Lbl" the number of labels, and "Label Size Frequency" the relative frequency of each label size. (The label size is the number of topics in a label.)

Method | Feature Type | Parameter
MML    | TF, TF×IDF   | C = 0.1, 1, 10
PMM    | TF           | Model1, Model2
SVM    | TF, TF×IDF   | C = 0.1, 1, 10
Boost  | Binary       | R = {2, 4, 6, 8, 10}×10^3

Table 3: Candidate feature types and learning parameters. (R is the number of weak hypotheses.) 
The underlined features and parameters were selected for the evaluation with the test data.

(www.yahoo.com), and then divided into 11 datasets by Yahoo!'s top category. Each page is labeled with the Yahoo!'s second-level sub-categories from which the page is hyperlinked. (Thus, the sub-categories are topics in our terms.) See [3] for more details about the collection. The Web pages were then converted into three types of feature vectors: (a) Binary vectors, where each feature indicates the presence (1) or absence (0) of a term; (b) TF vectors, where each feature is the number of appearances of a term (term frequency); and (c) TF×IDF vectors, where each feature is the product of term frequency and inverse document frequency [7].

To select the best combinations of feature types and learning parameters, such as the penalty C for MML, the learners were trained on 2,000 Web pages with all combinations of features and parameters listed in Table 3, and were then evaluated by labeling F-measure on independently drawn development data. The combinations that achieved the best labeling F-measures (underlined in Table 3) were used in the following experiments.

5.2 Evaluation Measures

We used three measures to evaluate labeling performance: labeling F-measure, exact match ratio, and retrieval F-measure. 
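As a concrete reference for these three measures (defined formally below), here is a small sketch computing them from binary indicator vectors. The variable names are ours, and the treatment of empty denominators as 0 is a convention we assume, not one stated in the paper:

```python
# Sketch of the three evaluation measures on binary label-indicator vectors.
def labeling_f(pred, true):
    """Per-sample 2|P ∩ T| / (|P| + |T|), averaged over samples (assumes each
    sample has at least one predicted or true topic)."""
    return sum(
        2 * sum(p * t for p, t in zip(P, T)) / sum(p + t for p, t in zip(P, T))
        for P, T in zip(pred, true)
    ) / len(pred)

def exact_match(pred, true):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(P == T for P, T in zip(pred, true)) / len(pred)

def retrieval_f(pred, true):
    """Macro average over topics of the per-topic F-measure; empty topics score 0."""
    l = len(pred[0])
    score = 0.0
    for j in range(l):
        num = 2 * sum(P[j] * T[j] for P, T in zip(pred, true))
        den = sum(P[j] + T[j] for P, T in zip(pred, true))
        score += num / den if den else 0.0
    return score / l

pred = [[1, 1, 0], [0, 1, 0]]
true = [[1, 0, 0], [0, 1, 0]]
print(labeling_f(pred, true), exact_match(pred, true), retrieval_f(pred, true))
```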
In the following definitions, \{L_i^{pred}\}_{i=1}^n and \{L_i^{true}\}_{i=1}^n denote the predicted labels and the true labels, respectively.

Labeling F-measure  The labeling F-measure F_L evaluates the average labeling performance while taking partial matches into account:

F_L = \frac{1}{n} \sum_{i=1}^n \frac{2 |L_i^{pred} \cap L_i^{true}|}{|L_i^{pred}| + |L_i^{true}|} = \frac{1}{n} \sum_{i=1}^n \frac{2 \sum_{j=1}^l L_i^{pred}[j] L_i^{true}[j]}{\sum_{j=1}^l (L_i^{pred}[j] + L_i^{true}[j])}.   (14)

Dataset | Labeling F-measure (MD MC PM SV BO) | Exact Match Ratio (MD MC PM SV BO) | Retrieval F-measure (MD MC PM SV BO)
Ar  | 0.55 0.44 0.50 0.46 0.38 | 0.44 0.32 0.21 0.29 0.22 | 0.30 0.26 0.24 0.29 0.22
Bu  | 0.80 0.81 0.75 0.76 0.75 | 0.63 0.62 0.48 0.57 0.53 | 0.25 0.27 0.20 0.29 0.20
Co  | 0.62 0.59 0.61 0.55 0.47 | 0.51 0.46 0.35 0.41 0.34 | 0.27 0.25 0.19 0.30 0.17
Ed  | 0.56 0.43 0.51 0.48 0.37 | 0.45 0.34 0.19 0.30 0.23 | 0.25 0.23 0.21 0.25 0.16
En  | 0.64 0.52 0.61 0.54 0.49 | 0.55 0.44 0.31 0.42 0.36 | 0.37 0.33 0.30 0.35 0.29
He  | 0.74 0.74 0.66 0.67 0.60 | 0.58 0.53 0.34 0.47 0.39 | 0.35 0.35 0.23 0.35 0.26
Rc  | 0.63 0.46 0.55 0.49 0.44 | 0.54 0.38 0.25 0.37 0.33 | 0.47 0.39 0.36 0.40 0.33
Rf  | 0.67 0.58 0.63 0.56 0.50 | 0.60 0.51 0.39 0.49 0.41 | 0.29 0.25 0.24 0.25 0.16
Si  | 0.61 0.54 0.52 0.47 0.39 | 0.52 0.43 0.22 0.36 0.28 | 0.37 0.35 0.28 0.31 0.19
SS  | 0.73 0.71 0.66 0.64 0.59 | 0.65 0.60 0.45 0.55 0.49 | 0.36 0.35 0.18 0.31 0.15
SC  | 0.60 0.55 0.54 0.49 0.44 | 0.44 0.40 0.21 0.32 0.27 | 0.29 0.28 0.25 0.26 0.20
Avg | 0.65 0.58 0.59 0.56 0.49 | 0.54 0.46 0.31 0.41 0.35 | 0.32 0.30 0.24 0.31 0.21

Table 4: The performance comparison by labeling F-measure (left), exact match ratio (middle) and retrieval F-measure (right). The bold figures are the best ones among the five methods, and the underlined figures the second best ones. 
MD, MC, PM, SV, and BO represent MML with S_D, MML with S_C, PMM, SVM and BoosTexter, respectively.

Exact Match Ratio  The exact match ratio EX counts only exact matches between the predicted label and the true label:

EX = \frac{1}{n} \sum_{i=1}^n I[L_i^{pred} = L_i^{true}],   (15)

where I[S] is 1 if the statement S is true and 0 otherwise.

Retrieval F-measure6  For real tasks, it is also important to evaluate retrieval performance, i.e. how accurately classifiers can find relevant texts for a given topic. The retrieval F-measure F_R measures the average retrieval performance over all topics:

F_R = \frac{1}{l} \sum_{j=1}^l \frac{2 \sum_{i=1}^n L_i^{pred}[j] L_i^{true}[j]}{\sum_{i=1}^n (L_i^{pred}[j] + L_i^{true}[j])}.   (16)

5.3 Results

First we trained the classifiers with 2,000 randomly chosen samples. We then calculated the three evaluation measures on 3,000 other randomly chosen samples. This process was repeated five times, and the resulting averaged values are shown in Table 4. Table 4 shows that MML with the Dice measure outperforms the other methods in labeling F-measure and exact match ratio. The MMLs also show the best performance with regard to retrieval F-measure, although the margins over the other methods are not as large as those observed in labeling F-measure and exact match ratio. Note that no classifier except MML with the Dice measure achieves good labeling on all three measures. For example, PMM shows high labeling F-measures, but its performance is rather poor when evaluated by retrieval F-measure.

As the second experiment, we evaluated the classifiers trained with 250–2,000 training samples on the same test samples. Figure 2 shows each measure averaged over all datasets. It is observed that the MMLs show high generalization even when the training data is small. 
An interesting point is that MML with the cosine measure achieves rather high labeling F-measures and retrieval F-measures with smaller training data. Such high performance, however, does not continue when it is trained on larger data.

6 F_R is called "the macro average of F-measures" in the text categorization community.

Figure 2: The learning curves of labeling F-measure (left), exact match ratio (middle) and retrieval F-measure (right). MD, MC, PM, SV, BO mean the same as in Table 4.

6 Conclusion

In this paper, we proposed a novel learning algorithm for multi-topic text categorization. The algorithm, Maximal Margin Labeling, embeds labels (sets of topics) into a similarity-induced vector space, and learns a large margin classifier in that space. To overcome the demanding computational cost of MML, we provided an approximation method for learning and efficient classification algorithms. In experiments on a collection of Web pages, MML outperformed other methods, including SVM, and showed better generalization.

Acknowledgement

The authors would like to thank Naonori Ueda, Kazumi Saito and Yuji Kaneda of Nippon Telegraph and Telephone Corporation for providing PMM's code and the datasets.

References

[1] Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Claire Nédellec and Céline Rouveirol, editors, Proc. of the 10th European Conference on Machine Learning, number 1398, pages 137–142, 1998.

[2] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000.

[3] Naonori Ueda and Kazumi Saito. Parametric mixture models for multi-topic text. In Advances in Neural Information Processing Systems 15, pages 1261–1268, 2003.

[4] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. 
Journal of Machine Learning Research, 2:265–292, 2001.

[5] Klaus-Robert Müller, Sebastian Mika, Gunnar Rätsch, Koji Tsuda, and Bernhard Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201, 2001.

[6] Vladimir N. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., 1998.

[7] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
", "award": [], "sourceid": 2669, "authors": [{"given_name": "Hideto", "family_name": "Kazawa", "institution": null}, {"given_name": "Tomonori", "family_name": "Izumitani", "institution": null}, {"given_name": "Hirotoshi", "family_name": "Taira", "institution": null}, {"given_name": "Eisaku", "family_name": "Maeda", "institution": null}]}