{"title": "Incremental Algorithms for Hierarchical Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 233, "page_last": 240, "abstract": null, "full_text": " Incremental Algorithms\n for Hierarchical Classification\n\n\n\n Nicol`o Cesa-Bianchi Claudio Gentile Andrea Tironi Luca Zaniboni\n Universit`a di Milano Universit`a dell'Insubria Universit`a di Milano\n Milano, Italy Varese, Italy Crema, Italy\n\n\n Abstract\n\n We study the problem of hierarchical classification when labels corre-\n sponding to partial and/or multiple paths in the underlying taxonomy are\n allowed. We introduce a new hierarchical loss function, the H-loss, im-\n plementing the simple intuition that additional mistakes in the subtree of\n a mistaken class should not be charged for. Based on a probabilistic data\n model introduced in earlier work, we derive the Bayes-optimal classifier\n for the H-loss. We then empirically compare two incremental approx-\n imations of the Bayes-optimal classifier with a flat SVM classifier and\n with classifiers obtained by using hierarchical versions of the Perceptron\n and SVM algorithms. The experiments show that our simplest incremen-\n tal approximation of the Bayes-optimal classifier performs, after just one\n training epoch, nearly as well as the hierarchical SVM classifier (which\n performs best). For the same incremental algorithm we also derive an\n H-loss bound showing, when data are generated by our probabilistic data\n model, exponentially fast convergence to the H-loss of the hierarchical\n classifier based on the true model parameters.\n\n\n1 Introduction and basic definitions\n\nWe study the problem of classifying data in a given taxonomy of labels, where the tax-\nonomy is specified as a tree forest. 
We assume that every data instance is labelled with a (possibly empty) set of class labels called a multilabel, with the only requirement that multilabels including some node i in the taxonomy must also include all ancestors of i. Thus, each multilabel corresponds to the union of one or more paths in the forest, where each path must start from a root but may terminate at an internal node (rather than a leaf).\n\nLearning algorithms for hierarchical classification have been investigated in, e.g., [8, 9, 10, 11, 12, 14, 15, 17, 20]. However, the scenario where the labelling includes multiple and partial paths has received very little attention. The analysis in [5], which is mainly theoretical, shows, in the multiple and partial path case, a 0/1-loss bound for a hierarchical learning algorithm based on regularized least-squares estimates.\nIn this work we extend [5] in several ways. First, we introduce a new hierarchical loss function, the H-loss, which is better suited than the 0/1-loss to analyzing hierarchical classification tasks, and we derive the corresponding Bayes-optimal classifier under the parametric data model introduced in [5]. Second, considering various loss functions, including the H-loss, we empirically compare the performance of the following three incremental kernel-based algorithms: 1) a hierarchical version of the classical Perceptron algorithm [16]; 2) an approximation to the Bayes-optimal classifier; 3) a simplified variant of this approximation. Finally, we show that, assuming data are indeed generated according to the parametric model mentioned before, the H-loss of the algorithm in 3) converges to the H-loss of the classifier based on the true model parameters. Our incremental algorithms are based on training linear-threshold classifiers in each node of the taxonomy.\n\n This work was supported in part by the PASCAL Network of Excellence under EC grant no. 506778. This publication only reflects the authors' views.\n\n\f\n
A similar approach has been studied in [8], though their model does not consider multiple-path classifications as we do.\nIncremental algorithms are the main focus of this research, since we strongly believe that they are a key tool for coping with tasks where large quantities of data items are generated and the classification system needs to be frequently adjusted to keep up with new items. However, we found it useful to provide a reference point for our empirical results. Thus we have also included in our experiments the results achieved by nonincremental algorithms. In particular, we have chosen a flat and a hierarchical version of SVM [21, 7, 19], which are known to perform well on the textual datasets considered here.\nWe assume data elements are encoded as real vectors $x \in \mathbb{R}^d$ which we call instances. A multilabel for an instance $x$ is any subset of the set $\{1, \dots, N\}$ of all labels/classes, including the empty set. We denote the multilabel associated with $x$ by a vector $y = (y_1, \dots, y_N) \in \{0,1\}^N$, where $i$ belongs to the multilabel of $x$ if and only if $y_i = 1$. A taxonomy $G$ is a forest whose trees are defined over the set of labels. A multilabel $y \in \{0,1\}^N$ is said to respect a taxonomy $G$ if and only if $y$ is the union of one or more paths in $G$, where each path starts from a root but need not terminate at a leaf. See Figure 1. We assume the data-generating mechanism produces examples $(x, y)$ such that $y$ respects some fixed underlying taxonomy $G$ with $N$ nodes. The set of roots in $G$ is denoted by $\mathrm{root}(G)$. 
We use $\mathrm{par}(i)$ to denote the unique parent of node $i$, $\mathrm{anc}(i)$ to denote the set of ancestors of $i$, and $\mathrm{sub}(i)$ to denote the set of nodes in the subtree rooted at $i$ (including $i$). Finally, given a predicate $\phi$ over a set $\Omega$, we will use $\{\phi\}$ to denote both the subset of $\Omega$ where $\phi$ is true and the indicator function of this subset.\n\n2 The H-loss\n\nThough several hierarchical losses have been proposed in the literature (e.g., in [11, 20]), none has emerged as a standard yet. Since hierarchical losses are defined over multilabels, we start by considering two very simple functions measuring the discrepancy between multilabels $\hat{y} = (\hat{y}_1, \dots, \hat{y}_N)$ and $y = (y_1, \dots, y_N)$: the 0/1-loss $\ell_{0/1}(\hat{y}, y) = \{\exists i : \hat{y}_i \neq y_i\}$ and the symmetric difference loss $\ell_\Delta(\hat{y}, y) = \{\hat{y}_1 \neq y_1\} + \dots + \{\hat{y}_N \neq y_N\}$.\nThere are several ways of making these losses depend on a given taxonomy $G$. In this work, we follow the intuition "if a mistake is made at node $i$, then further mistakes made in the subtree rooted at $i$ are unimportant". That is, we do not require the algorithm to be able to make fine-grained distinctions on tasks when it is unable to make coarse-grained ones. For example, if an algorithm failed to label a document with the class SPORTS, then the algorithm should not be charged more loss because it also failed to label the same document with the subclass SOCCER and the sub-subclass CHAMPIONS LEAGUE. A function implementing this intuition is defined by\n $\ell_H(\hat{y}, y) = \sum_{i=1}^N c_i \{\hat{y}_i \neq y_i \wedge \hat{y}_j = y_j,\ j \in \mathrm{anc}(i)\}$,\nwhere $c_1, \dots, c_N > 0$ are fixed cost coefficients. This loss, which we call H-loss, can also be described as follows: all paths in $G$ from a root down to a leaf are examined and, whenever we encounter a node $i$ such that $\hat{y}_i \neq y_i$, we add $c_i$ to the loss, whereas all the loss contributions in the subtree rooted at $i$ are discarded. Note that if $c_1 = \dots = c_N = 1$ then $\ell_{0/1} \leq \ell_H \leq \ell_\Delta$. Choices of $c_i$ depending on the structure of $G$ are proposed in Section 4. 
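To make the definition concrete, the H-loss can be computed in a few lines of code. The sketch below is illustrative only (not the authors' implementation) and assumes the taxonomy is encoded as a parent list, with par[i] = -1 for roots, and multilabels given as 0/1 lists:

```python
def h_loss(y_hat, y, par, c):
    """H-loss of Section 2: sum the cost c[i] over every node i where the
    prediction disagrees with the truth while all ancestors of i agree."""
    def ancestors_agree(i):
        j = par[i]
        while j != -1:
            if y_hat[j] != y[j]:
                return False  # a mistake above i absorbs the charge at i
            j = par[j]
        return True
    return sum(c[i] for i in range(len(y))
               if y_hat[i] != y[i] and ancestors_agree(i))
```

For a chain 0 -> 1 -> 2 with unit costs, predicting (1, 0, 0) against the true multilabel (1, 1, 1) is charged only c[1]: the mistake at node 2 lies below the mistaken node 1 and is discarded.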
Given a multilabel $y \in \{0,1\}^N$, define its $G$-truncation as the multilabel $y' = (y'_1, \dots, y'_N) \in \{0,1\}^N$ where, for each $i = 1, \dots, N$, $y'_i = 1$ iff $y_i = 1$ and $y_j = 1$ for all $j \in \mathrm{anc}(i)$. Note that the $G$-truncation of any multilabel always respects $G$. A graphical representation of the notions introduced so far is given in Figure 1.\n\n\f\n (a) (b) (c) (d)\n\nFigure 1: A one-tree forest (repeated four times). Each node corresponds to a class in the taxonomy $G$, hence in this case $N = 12$. Gray nodes are included in the multilabel under consideration, white nodes are not. (a) A generic multilabel which does not respect $G$; (b) its $G$-truncation. (c) A second multilabel that respects $G$. (d) Superposition of multilabel (b) on multilabel (c): only the checked nodes contribute to the H-loss between (b) and (c).\n\nIn the next lemma we show that whenever $y$ respects $G$, then $\ell_H(\hat{y}, y)$ cannot be smaller than $\ell_H(\hat{y}', y)$, where $\hat{y}'$ is the $G$-truncation of $\hat{y}$. In other words, when the multilabel $y$ to be predicted respects a taxonomy $G$, there is no loss of generality in restricting to predictions which respect $G$.\n\nLemma 1 Let $G$ be a taxonomy, let $\hat{y}, y \in \{0,1\}^N$ be two multilabels such that $y$ respects $G$, and let $\hat{y}'$ be the $G$-truncation of $\hat{y}$. Then $\ell_H(\hat{y}', y) \leq \ell_H(\hat{y}, y)$.\n\nProof. For each $i = 1, \dots, N$ we show that $\hat{y}'_i \neq y_i$ and $\hat{y}'_j = y_j$ for all $j \in \mathrm{anc}(i)$ implies $\hat{y}_i \neq y_i$ and $\hat{y}_j = y_j$ for all $j \in \mathrm{anc}(i)$. Pick some $i$ and suppose $\hat{y}'_i \neq y_i$ and $\hat{y}'_j = y_j$ for all $j \in \mathrm{anc}(i)$. Now suppose $\hat{y}'_j = 0$ (and thus $y_j = 0$) for some $j \in \mathrm{anc}(i)$. Then $y_i = 0$ since $y$ respects $G$, and $\hat{y}'_i = 0$ since the $G$-truncation $\hat{y}'$ respects $G$; but this contradicts $\hat{y}'_i \neq y_i$. Therefore, it must be the case that $\hat{y}'_j = y_j = 1$ for all $j \in \mathrm{anc}(i)$. Hence the $G$-truncation of $\hat{y}$ left each node $j \in \mathrm{anc}(i)$ unchanged, implying $\hat{y}_j = y_j$ for all $j \in \mathrm{anc}(i)$. But, since the $G$-truncation of $\hat{y}$ does not change the value of a node $i$ whose ancestors $j$ are such that $\hat{y}_j = 1$, this also implies $\hat{y}_i = \hat{y}'_i$. Therefore $\hat{y}_i \neq y_i$ and the proof is concluded.\n\n\n3 A probabilistic data model\n\nOur learning algorithms are based on the following statistical model for the data, originally introduced in [5]. The model defines a probability distribution $f_G$ over the set of multilabels respecting a given taxonomy $G$ by associating with each node $i$ of $G$ a Bernoulli random variable $Y_i$ and defining\n $f_G(y \mid x) = \prod_{i=1}^N P\big( Y_i = y_i \mid Y_{\mathrm{par}(i)} = y_{\mathrm{par}(i)},\ X = x \big)$.\nTo guarantee that $f_G(y \mid x) = 0$ whenever $y \in \{0,1\}^N$ does not respect $G$, we set $P\big( Y_i = 1 \mid Y_{\mathrm{par}(i)} = 0,\ X = x \big) = 0$. Notice that this definition of $f_G$ makes the (rather simplistic) assumption that all $Y_k$ with the same parent node $i$ (i.e., the children of $i$) are independent when conditioned on $Y_i$ and $x$. Through $f_G$ we specify an i.i.d. process $\{(X_1, Y_1), (X_2, Y_2), \dots\}$, where, for $t = 1, 2, \dots$, the multilabel $Y_t$ is distributed according to $f_G(\cdot \mid X_t)$ and $X_t$ is distributed according to a fixed and unknown distribution $D$. Each example $(x_t, y_t)$ is thus a realization of the corresponding pair $(X_t, Y_t)$ of random variables. Our parametric model for $f_G$ is described as follows. First, we assume that the support of $D$ is the surface of the $d$-dimensional unit sphere (i.e., instances $x \in \mathbb{R}^d$ are such that $||x|| = 1$). With each node $i$ in the taxonomy, we associate a unit-norm weight vector $u_i \in \mathbb{R}^d$. Then, we define the conditional probabilities for a nonroot node $i$ with parent $j$ by $P(Y_i = 1 \mid Y_j = 1, X = x) = (1 + u_i^\top x)/2$. If $i$ is a root node, the previous equation simplifies to $P(Y_i = 1 \mid X = x) = (1 + u_i^\top x)/2$.\n\n\f\n3.1 The Bayes-optimal classifier for the H-loss\n\nWe now describe a classifier, called H-BAYES, that is the Bayes-optimal classifier for the H-loss. In other words, H-BAYES classifies any instance $x$ with the multilabel $\hat{y} = \mathrm{argmin}_{y \in \{0,1\}^N} E[\,\ell_H(y, Y) \mid x\,]$. Define $p_i(x) = P\big( Y_i = 1 \mid Y_{\mathrm{par}(i)} = 1,\ X = x \big)$. When no ambiguity arises, we write $p_i$ instead of $p_i(x)$. 
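As an illustration of the data model of Section 3, a multilabel can be generated top-down by flipping one Bernoulli coin per node with bias $(1 + u_i^\top x)/2$, forcing a node to 0 whenever its parent is 0. The sketch below is hypothetical (not the authors' code) and assumes nodes are ordered so that parents precede children:

```python
import numpy as np

def sample_multilabel(x, u, par, rng):
    """Draw Y ~ f_G(. | x): node i fires with probability (1 + u[i].x)/2
    given that its parent fired, and is forced to 0 otherwise."""
    y = [0] * len(u)
    for i in range(len(u)):  # parents assumed to precede children
        p_i = (1.0 + float(np.dot(u[i], x))) / 2.0
        parent_on = par[i] == -1 or y[par[i]] == 1
        if parent_on and rng.random() < p_i:
            y[i] = 1
    return y  # the sampled multilabel always respects the taxonomy G
```

Note the degenerate cases: if $u_i = x$ then $p_i = 1$ and every reachable node fires; if $u_i = -x$ then $p_i = 0$ and the node never fires.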
Now, fix any unit-length instance $x$ and let $\hat{y}$ be a multilabel that respects $G$. For each node $i$ in $G$, recursively define\n $H_{i,x}(\hat{y}) = c_i \big( p_i (1 - \hat{y}_i) + (1 - p_i) \hat{y}_i \big) + p_i \hat{y}_i \sum_{k \in \mathrm{child}(i)} H_{k,x}(\hat{y})$.\nThe classifier H-BAYES operates as follows. It starts by putting all nodes of $G$ in a set $S$; nodes are then removed from $S$ one by one. A node $i$ can be removed only if $i$ is a leaf or if all nodes $j$ in the subtree rooted at $i$ have already been removed. When $i$ is removed, its value $\hat{y}_i$ is set to 1 if and only if\n $p_i \Big( 2 - \sum_{k \in \mathrm{child}(i)} H_{k,x}(\hat{y}) / c_i \Big) \geq 1$. (1)\n(Note that if $i$ is a leaf then (1) is equivalent to $\hat{y}_i = \{p_i \geq 1/2\}$.) If $\hat{y}_i$ is set to zero, then all nodes in the subtree rooted at $i$ are set to zero.\n\nTheorem 2 For any taxonomy $G$ and all unit-length $x \in \mathbb{R}^d$, the multilabel generated by H-BAYES is the Bayes-optimal classification of $x$ for the H-loss.\n\nProof sketch. Let $\hat{y}$ be the multilabel assigned by H-BAYES and $y^*$ be any multilabel minimizing the expected H-loss. Introducing the short-hand $E_x[\cdot] = E[\,\cdot \mid x\,]$, we can write\n $E_x\, \ell_H(\hat{y}, Y) = \sum_{i=1}^N c_i \big( p_i (1 - \hat{y}_i) + (1 - p_i) \hat{y}_i \big) \prod_{j \in \mathrm{anc}(i)} p_j \{\hat{y}_j = 1\}$.\nNote that we can recursively decompose the expected H-loss as\n $E_x\, \ell_H(\hat{y}, Y) = \sum_{i \in \mathrm{root}(G)} E_x H_i(\hat{y}, Y)$,\nwhere\n $E_x H_i(\hat{y}, Y) = c_i \big( p_i (1 - \hat{y}_i) + (1 - p_i) \hat{y}_i \big) \prod_{j \in \mathrm{anc}(i)} p_j \{\hat{y}_j = 1\} + \sum_{k \in \mathrm{child}(i)} E_x H_k(\hat{y}, Y)$. (2)\nPick a node $i$. If $i$ is a leaf, then the sum in the RHS of (2) disappears and $\hat{y}_i = \{p_i \geq 1/2\}$, which is also the minimizer of $H_{i,x}(\hat{y}) = c_i ( p_i (1 - \hat{y}_i) + (1 - p_i) \hat{y}_i )$, implying $\hat{y}_i = y^*_i$. Now let $i$ be an internal node and inductively assume $\hat{y}_j = y^*_j$ for all $j \in \mathrm{sub}(i) \setminus \{i\}$. Notice that the factors $\prod_{j \in \mathrm{anc}(i)} p_j \{\hat{y}_j = 1\}$ occur in both terms in the RHS of (2). 
Hence $y^*_i$ does not depend on these factors and we can equivalently minimize\n $c_i \big( p_i (1 - \hat{y}_i) + (1 - p_i) \hat{y}_i \big) + p_i \{\hat{y}_i = 1\} \sum_{k \in \mathrm{child}(i)} H_{k,x}(\hat{y})$, (3)\nwhere we noted that, for each $k \in \mathrm{child}(i)$,\n $E_x H_k(\hat{y}, Y) = \Big( \prod_{j \in \mathrm{anc}(i)} p_j \{\hat{y}_j = 1\} \Big)\, p_i \{\hat{y}_i = 1\}\, H_{k,x}(\hat{y})$.\nNow observe that the value of $\hat{y}_i$ minimizing (3) is equivalent to the assignment produced by H-BAYES. To conclude the proof, note that whenever $\hat{y}_i = 0$, Lemma 1 requires that $\hat{y}_j = 0$ for all nodes $j \in \mathrm{sub}(i)$, which is exactly what H-BAYES does.\n\n\n4 The algorithms\n\nWe consider three incremental algorithms. Each one of these algorithms learns a hierarchical classifier by training a decision function $g_i : \mathbb{R}^d \to \{0,1\}$ at each node $i = 1, \dots, N$. For a given set $g_1, \dots, g_N$ of decision functions, the hierarchical classifier generated by these algorithms classifies an instance $x$ through a multilabel $\hat{y} = (\hat{y}_1, \dots, \hat{y}_N)$ defined as follows:\n\n\f\n $\hat{y}_i = \begin{cases} g_i(x) & \text{if } i \in \mathrm{root}(G) \text{ or } \hat{y}_j = 1 \text{ for all } j \in \mathrm{anc}(i) \\ 0 & \text{otherwise.} \end{cases}$ (4)\n\nNote that $\hat{y}$ computed this way respects $G$. The classifiers (4) are trained incrementally. Let $g_{i,t}$ be the decision function at node $i$ after training on the first $t-1$ examples. When the next training example $(x_t, y_t)$ is available, the algorithms compute the multilabel $\hat{y}_t$ using classifier (4) based on $g_{1,t}(x_t), \dots, g_{N,t}(x_t)$. Then, the algorithms consider for an update only those decision functions sitting at nodes $i$ satisfying either $i \in \mathrm{root}(G)$ or $y_{\mathrm{par}(i),t} = 1$. We call such nodes eligible at time $t$. The decision functions of all other nodes are left unchanged. The first algorithm we consider is a simple hierarchical version of the Perceptron algorithm [16], which we call H-PERC. The decision functions at time $t$ are defined by $g_{i,t}(x_t) = \{w_{i,t}^\top x_t \geq 0\}$. 
In the update phase, the Perceptron rule $w_{i,t+1} = w_{i,t} + y_{i,t} x_t$ is applied to every node $i$ eligible at time $t$ and such that $\hat{y}_{i,t} \neq y_{i,t}$.\nThe second algorithm, called APPROX-H-BAYES, approximates the H-BAYES classifier of Section 3.1 by replacing the unknown quantities $p_i(x_t)$ with the estimates $(1 + w_{i,t}^\top x_t)/2$. The weights $w_{i,t}$ are regularized least-squares estimates defined by\n $w_{i,t} = \big( I + S_{i,t-1} S_{i,t-1}^\top + x_t x_t^\top \big)^{-1} S_{i,t-1}\, y_{t-1}^{(i)}$. (5)\nThe columns of the matrix $S_{i,t-1}$ are all past instances $x_s$ that have been stored at node $i$; the $s$-th component of the vector $y_{t-1}^{(i)}$ is the $i$-th component $y_{i,s}$ of the multilabel $y_s$ associated with instance $x_s$. In the update phase, an instance $x_t$ is stored at node $i$, causing an update of $w_{i,t}$, whenever $i$ is eligible at time $t$ and $|w_{i,t}^\top x_t| \leq \sqrt{(5 \ln t)/N_{i,t}}$, where $N_{i,t}$ is the number of instances stored at node $i$ up to time $t-1$. The corresponding decision functions $g_{i,t}$ are of the form $g_{i,t}(x_t) = \{w_{i,t}^\top x_t \geq \theta_{i,t}\}$, where the threshold $\theta_{i,t} \geq 0$ at node $i$ depends on the margin values $w_{j,t}^\top x_t$ achieved by nodes $j \in \mathrm{sub}(i)$ -- recall (1). Note that $g_{i,t}$ is not a linear-threshold function, as $x_t$ appears in the definition of $w_{i,t}$. The margin threshold $\sqrt{(5 \ln t)/N_{i,t}}$, controlling the update of node $i$ at time $t$, reduces the space requirements of the classifier by keeping the matrices $S_{i,t}$ suitably small. This threshold is motivated by the work [4] on selective sampling.\nThe third algorithm, which we call H-RLS (Hierarchical Regularized Least Squares), is a simplified variant of APPROX-H-BAYES in which the thresholds $\theta_{i,t}$ are set to zero. That is, we have $g_{i,t}(x_t) = \{w_{i,t}^\top x_t \geq 0\}$, where the weights $w_{i,t}$ are defined as in (5) and updated as in the APPROX-H-BAYES algorithm. 
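For concreteness, the regularized least-squares estimate (5) and the H-RLS decision can be sketched in a few lines. This is an illustrative primal-form sketch under assumed variable names; the paper actually runs the computation in dual variables [3]:

```python
import numpy as np

def rls_weights(S, y_vec, x_t):
    """Eq. (5): w = (I + S S^T + x_t x_t^T)^{-1} S y, where the columns of S
    are the instances stored at this node and y_vec holds their 0/1 labels."""
    d = S.shape[0]
    A = np.eye(d) + S @ S.T + np.outer(x_t, x_t)
    return np.linalg.solve(A, S @ y_vec)

def h_rls_predict(w, x_t):
    """H-RLS decision function: threshold the margin at zero."""
    return 1 if float(w @ x_t) >= 0.0 else 0
```

With a single stored instance $(1,0)$ labelled 1 and current instance $(0,1)$, the matrix to invert is $2I$, so the weight vector is $(0.5, 0)$ and any instance with positive first coordinate is classified 1 at this node.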
Details on how to run APPROX-H-BAYES and H-RLS in dual variables and perform an update at node $i$ in time $O(N_{i,t}^2)$ are found in [3] (where a mistake-driven version of H-RLS is analyzed).\n\n\n5 Experimental results\n\nThe empirical evaluation of the algorithms was carried out on two well-known datasets of free-text documents. The first dataset consists of the first (in chronological order) 100,000 newswire stories from the Reuters Corpus Volume 1, RCV1 [2]. The associated taxonomy of labels, which are the topics of the documents, has 101 nodes organized in a forest of 4 trees. The forest is shallow: the longest path has length 3 and the distribution of nodes, sorted by increasing path length, is {0.04, 0.53, 0.42, 0.01}. For this dataset, we used the bag-of-words vectorization performed by Xerox Research Center Europe within the EC project KerMIT (see [4] for details on preprocessing). The 100,000 documents were divided into 5 equally sized groups of chronologically consecutive documents. We then used each adjacent pair of groups as training and test set in an experiment (here the fifth and first group are considered adjacent), and averaged the test set performance over the 5 experiments.\n\nThe second dataset is a specific subtree of the OHSUMED corpus of medical abstracts [1]: the subtree rooted in "Quality of Health Care" (MeSH code N05.715). After removing overlapping classes (OHSUMED is not quite a tree but a DAG), we ended up with 94 classes and 55,503 documents.\n\n\f\nTable 1: Experimental results on two hierarchical text classification tasks under various loss functions. We report average test errors along with standard deviations (in parentheses). In bold are the best performance figures among the incremental algorithms.\n\n RCV1 0/1-loss unif. H-loss norm. H-loss Δ-loss\n PERC 0.702(0.045) 1.196(0.127) 0.100(0.029) 1.695(0.182)\n H-PERC 0.655(0.040) 1.224(0.114) 0.099(0.028) 1.861(0.172)\n H-RLS 0.456(0.010) 0.743(0.026) 0.057(0.001) 1.086(0.036)\n AH-BAY 0.550(0.010) 0.815(0.028) 0.090(0.001) 1.465(0.040)\n SVM 0.482(0.009) 0.790(0.023) 0.057(0.001) 1.173(0.051)\n H-SVM 0.440(0.008) 0.712(0.021) 0.055(0.001) 1.050(0.027)\n\n OHSU. 0/1-loss unif. H-loss norm. H-loss Δ-loss\n PERC 0.899(0.024) 1.938(0.219) 0.058(0.005) 2.639(0.226)\n H-PERC 0.846(0.024) 1.560(0.155) 0.057(0.005) 2.528(0.251)\n H-RLS 0.769(0.004) 1.200(0.007) 0.045(0.000) 1.957(0.011)\n AH-BAY 0.819(0.004) 1.197(0.006) 0.047(0.000) 2.029(0.009)\n SVM 0.784(0.003) 1.206(0.003) 0.044(0.000) 1.872(0.005)\n H-SVM 0.759(0.002) 1.170(0.005) 0.044(0.000) 1.910(0.007)\n\n\nWe made this choice based only on the structure of the subtree: the longest path has length 4, the distribution of nodes sorted by increasing path length is {0.26, 0.37, 0.22, 0.12, 0.03}, and there is a significant number of partial and multiple path multilabels. The vectorization of the subtree was carried out as follows: after tokenization, we removed all stopwords and also those words that did not occur at least 3 times in the corpus. Then, we vectorized the documents using the Bow library [13] with a log(1 + TF) log(IDF) encoding. We ran 5 experiments by randomly splitting the corpus into a training set of 40,000 documents and a test set of 15,503 documents. Test set performances are averages over these 5 experiments. We kept more documents in the training set than in the RCV1 splits since the OHSUMED corpus turned out to be a harder classification problem than RCV1. In both datasets instances have been normalized to unit length. We tested the hierarchical Perceptron algorithm (H-PERC), the hierarchical regularized least-squares algorithm (H-RLS), and the approximated Bayes-optimal algorithm (APPROX-H-BAYES), all described in Section 4. 
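All the classifiers above share the top-down evaluation scheme (4) of Section 4. A minimal sketch, assuming the taxonomy is given as a parent list (with -1 for roots, parents preceding children) and g is a list of per-node decision functions:

```python
def hierarchical_classify(x, g, par):
    """Scheme (4): a node may use its own decision function only if it is a
    root or its whole ancestor chain predicted 1; otherwise it is forced to 0."""
    y_hat = [0] * len(g)
    for i in range(len(g)):  # parents assumed to precede children
        if par[i] == -1 or y_hat[par[i]] == 1:
            y_hat[i] = 1 if g[i](x) else 0
    return y_hat  # the output always respects the taxonomy G
```

Because processing is top-down, checking only the parent's predicted label is equivalent to checking the whole ancestor chain, which is what makes the output respect the taxonomy by construction.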
The results are summarized in Table 1. APPROX-H-BAYES (AH-BAY in Table 1) was trained using cost coefficients $c_i$ chosen as follows: if $i \in \mathrm{root}(G)$ then $c_i = |\mathrm{root}(G)|^{-1}$; otherwise, $c_i = c_j / |\mathrm{child}(j)|$, where $j$ is the parent of $i$. Note that this choice of coefficients amounts to splitting a unit cost equally among the roots and then recursively splitting each node's cost equally among its children. Since, in this case, $0 \leq \ell_H \leq 1$, we call the resulting loss the normalized H-loss. We also tested a hierarchical version of SVM (denoted by H-SVM in Table 1) in which each node is an SVM classifier trained using a batch version of our hierarchical learning protocol. More precisely, each node $i$ was trained only on those examples $(x_t, y_t)$ such that $y_{\mathrm{par}(i),t} = 1$ (note that, as no conditions are imposed on $y_{i,t}$, node $i$ is actually trained on both positive and negative examples). The resulting set of linear-threshold functions was then evaluated on the test set using the hierarchical classification scheme (4). We tried both the $C$ and $\nu$ parametrizations [18] for SVM and found the setting $C = 1$ to work best for our data.1 We finally tested the "flat" variants of Perceptron and SVM, denoted by PERC and SVM. In these variants, each node is trained and evaluated independently of the others, disregarding all taxonomical information. All SVM experiments were carried out using the LIBSVM implementation [6]. All the tested algorithms used a linear kernel.\n\n\n 1It should be emphasized that this tuning of C was actually chosen in hindsight, with no cross-validation.\n\n\f\nAs far as loss functions are concerned, we considered the 0/1-loss, the H-loss with cost coefficients set to 1 (denoted by uniform H-loss), the normalized H-loss, and the symmetric difference loss (denoted by Δ-loss). Note that H-SVM performs best, but our incremental algorithms were trained for a single epoch on the training set. The good performance of SVM (the flat variant of H-SVM) is surprising. 
However, with a single epoch of training, H-RLS does not perform worse than SVM (except on OHSUMED under the normalized H-loss) and comes reasonably close to H-SVM. On the other hand, the performance of APPROX-H-BAYES is disappointing: on OHSUMED it is the best algorithm only for the uniform H-loss, though it was trained using the normalized H-loss; on RCV1 it never outperforms H-RLS, though it always does better than PERC and H-PERC. A possible explanation for this behavior is that APPROX-H-BAYES is very sensitive to errors in the estimates of $p_i(x)$ (recall Section 3.1). Indeed, the least-squares estimates (5), which we used to approximate H-BAYES, seem to work better in practice with simpler (and possibly more robust) algorithms, such as H-RLS. The lower values of the normalized H-loss on OHSUMED (a harder corpus than RCV1) can be explained by the fact that a quarter of the 94 nodes in the OHSUMED taxonomy are roots, and thus each top-level mistake is only charged about 4/94. As a final remark, we observe that the normalized H-loss gave too small a range of values to afford fine comparisons among the best performing algorithms.\n\n\n6 Regret bounds for the H-loss\n\nIn this section we prove a theoretical bound on the H-loss of a slight variant of the algorithm H-RLS tested in Section 5. More precisely, we assume data are generated according to the probabilistic model introduced in Section 3 with unknown instance distribution $D$ and unknown coefficients $u_1, \dots, u_N$. We define the regret of a classifier assigning label $\hat{y}$ to instance $X$ as $E\, \ell_H(\hat{y}, Y) - E\, \ell_H(\bar{y}, Y)$, where the expected value is with respect to the random draw of $(X, Y)$ and $\bar{y}$ is the multilabel assigned by classifier (4) when the decision functions $g_i$ are the zero-threshold functions $g_i(x) = \{u_i^\top x \geq 0\}$. The theorem below shows that the regret of the classifier learned by a variant of H-RLS after $t$ training examples, with $t$ large enough, is exponentially small in $t$. 
In other words, H-RLS learns to classify as well as the algorithm that is given the true parameters $u_1, \dots, u_N$ of the underlying data-generating process. We have been able to prove the theorem only for the variant of H-RLS storing all instances at each node. That is, every eligible node at time $t$ is updated, irrespective of whether $|w_{i,t}^\top x_t| \leq \sqrt{(5 \ln t)/N_{i,t}}$.\nGiven the i.i.d. data-generating process $(X_1, Y_1), (X_2, Y_2), \dots$, for each node $k$ we define the derived process $X_{k_1}, X_{k_2}, \dots$ including all and only the instances $X_s$ of the original process that satisfy $Y_{\mathrm{par}(k),s} = 1$. We call this derived process the process at node $k$. Note that, for each $k$, the process at node $k$ is an i.i.d. process. However, its distribution might depend on $k$. The spectrum of the process at node $k$ is the set of eigenvalues of the correlation matrix with entries $E[X_{k_1,i}\, X_{k_1,j}]$ for $i, j = 1, \dots, d$. We have the following theorem, whose proof is omitted due to space limitations.\n\nTheorem 3 Let $G$ be a taxonomy with $N$ nodes and let $f_G$ be a joint density for $G$ parametrized by $N$ unit-norm vectors $u_1, \dots, u_N \in \mathbb{R}^d$. Assume the instance distribution is such that there exist $\epsilon_1, \dots, \epsilon_N > 0$ satisfying $P\big( |u_i^\top X_t| \geq \epsilon_i \big) = 1$ for $i = 1, \dots, N$. Then, for all $t > \max\Big( \max_{i=1,\dots,N} \frac{16}{\lambda_i \alpha_i},\ \max_{i=1,\dots,N} \frac{192\, d}{\lambda_i \alpha_i \epsilon_i^2} \Big)$, the regret $E\, \ell_H(\hat{y}_t, Y_t) - E\, \ell_H(\bar{y}_t, Y_t)$ of the modified H-RLS algorithm is at most\n $\sum_{i=1}^N \Big( \alpha_i\, t\, e^{-\kappa_1 \lambda_i^2 \alpha_i t} + \alpha_i\, t^2 e^{-\kappa_2 \epsilon_i^2 \alpha_i t} \Big) \sum_{j \in \mathrm{sub}(i)} c_j$,\nwhere $\kappa_1, \kappa_2$ are constants, $\alpha_i = E \prod_{j \in \mathrm{anc}(i)} (1 + u_j^\top X)/2$ and $\lambda_i$ is the smallest eigenvalue in the spectrum of the process at node $i$.\n\n\f\n7 Conclusions and open problems\n\nIn this work we have studied the problem of hierarchical classification of data instances in the presence of partial and multiple path labellings. 
We have introduced a new hierarchical loss function, the H-loss, derived the corresponding Bayes-optimal classifier, and empirically compared an incremental approximation to this classifier with some other incremental and nonincremental algorithms. Finally, we have derived a theoretical guarantee on the H-loss of a simplified variant of the approximated Bayes-optimal algorithm.\n\nOur investigation leaves several open issues. The current approximation to the Bayes-optimal classifier is not satisfying, and this could be due to a bad choice of the model, of the estimators, of the datasets, or of a combination of them. Also, the normalized H-loss is not fully satisfying, since the resulting values are often too small. From the theoretical viewpoint, we would like to analyze the regret of our algorithms with respect to the Bayes-optimal classifier, rather than with respect to a classifier that makes a suboptimal use of the true model parameters.\n\nReferences\n\n [1] The OHSUMED test collection. URL: medir.ohsu.edu/pub/ohsumed/.\n [2] Reuters corpus volume 1. URL: about.reuters.com/researchandstandards/corpus/.\n [3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order Perceptron algorithm. In Proc. 15th COLT, pages 121-137. Springer, 2002.\n [4] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Learning probabilistic linear-threshold classifiers via selective sampling. In Proc. 16th COLT, pages 373-386. Springer, 2003.\n [5] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Regret bounds for hierarchical classification with linear-threshold functions. In Proc. 17th COLT. Springer, 2004. To appear.\n [6] C.-C. Chang and C.-J. Lin. LIBSVM -- a library for support vector machines. URL: www.csie.ntu.edu.tw/~cjlin/libsvm/.\n [7] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2001.\n [8] O. Dekel, J. Keshet, and Y. Singer. Large margin hierarchical classification. In Proc. 
21st ICML. Omnipress, 2004.\n [9] S.T. Dumais and H. Chen. Hierarchical classification of web content. In Proc. 23rd ACM Int. Conf. RDIR, pages 256-263. ACM Press, 2000.\n[10] M. Granitzer. Hierarchical Text Classification using Methods from Machine Learning. PhD thesis, Graz University of Technology, 2003.\n[11] T. Hofmann, L. Cai, and M. Ciaramita. Learning with taxonomies: Classifying documents and words. In NIPS Workshop on Syntax, Semantics, and Statistics, 2003.\n[12] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In Proc. 14th ICML. Morgan Kaufmann, 1997.\n[13] A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. URL: www-2.cs.cmu.edu/~mccallum/bow/.\n[14] A.K. McCallum, R. Rosenfeld, T.M. Mitchell, and A.Y. Ng. Improving text classification by shrinkage in a hierarchy of classes. In Proc. 15th ICML. Morgan Kaufmann, 1998.\n[15] D. Mladenic. Turning Yahoo into an automatic web-page classifier. In Proc. 13th European Conference on Artificial Intelligence, pages 473-474, 1998.\n[16] F. Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Review, 65:386-408, 1958.\n[17] M.E. Ruiz and P. Srinivasan. Hierarchical text categorization using neural networks. Information Retrieval, 5(1):87-118, 2002.\n[18] B. Schölkopf, A.J. Smola, R.C. Williamson, and P.L. Bartlett. New support vector algorithms. Neural Computation, 12:1207-1245, 2000.\n[19] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.\n[20] A. Sun and E.-P. Lim. Hierarchical text classification and evaluation. In Proc. 2001 Int. Conf. Data Mining, pages 521-528. IEEE Press, 2001.\n[21] V.N. Vapnik. Statistical Learning Theory. 
Wiley, 1998.\n\n\f\n", "award": [], "sourceid": 2742, "authors": [{"given_name": "Nicol\u00f2", "family_name": "Cesa-bianchi", "institution": null}, {"given_name": "Claudio", "family_name": "Gentile", "institution": null}, {"given_name": "Andrea", "family_name": "Tironi", "institution": null}, {"given_name": "Luca", "family_name": "Zaniboni", "institution": null}]}