{"title": "Distribution-Calibrated Hierarchical Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 450, "page_last": 458, "abstract": "While many advances have already been made on the topic of hierarchical classification learning, we take a step back and examine how a hierarchical classification problem should be formally defined. We pay particular attention to the fact that many arbitrary decisions go into the design of the label taxonomy that is provided with the training data, and that this taxonomy is often unbalanced. We correct this problem by using the data distribution to calibrate the hierarchical classification loss function. This distribution-based correction must be done with care, to avoid introducing unmanageable statistical dependencies into the learning problem. This leads us off the beaten path of binomial-type estimation and into the uncharted waters of geometric-type estimation. We present a new calibrated definition of statistical risk for hierarchical classification, an unbiased geometric estimator for this risk, and a new algorithmic reduction from hierarchical classification to cost-sensitive classification.", "full_text": "Distribution-Calibrated Hierarchical Classification

Ofer Dekel
Microsoft Research
One Microsoft Way, Redmond, WA 98052, USA
oferd@microsoft.com

Abstract

While many advances have already been made in hierarchical classification learning, we take a step back and examine how a hierarchical classification problem should be formally defined. We pay particular attention to the fact that many arbitrary decisions go into the design of the label taxonomy that is given with the training data. Moreover, many hand-designed taxonomies are unbalanced and misrepresent the class structure in the underlying data distribution.
We attempt to correct these problems by using the data distribution itself to calibrate the hierarchical classification loss function. This distribution-based correction must be done with care, to avoid introducing unmanageable statistical dependencies into the learning problem. This leads us off the beaten path of binomial-type estimation and into the unfamiliar waters of geometric-type estimation. In this paper, we present a new calibrated definition of statistical risk for hierarchical classification, an unbiased estimator for this risk, and a new algorithmic reduction from hierarchical classification to cost-sensitive classification.

1 Introduction

Multiclass classification is the task of assigning labels from a predefined label-set to instances in a given domain. For example, consider the task of assigning a topic to each document in a corpus. If a training set of labeled documents is available, then a multiclass classifier can be trained using a supervised machine learning algorithm. Often, large label-sets can be organized in a taxonomy. Examples of popular label taxonomies are the ODP taxonomy of web pages [2], the gene ontology [6], and the LCC ontology of book topics [1]. A taxonomy is a hierarchical structure over labels, where some labels define very general concepts, and other labels define more specific specializations of those general concepts. A taxonomy of document topics could include the labels MUSIC, CLASSICAL MUSIC, and POPULAR MUSIC, where the last two are special cases of the first. Some label taxonomies form trees (each label has a single parent) while others form directed acyclic graphs. When a label taxonomy is given alongside a training set, the multiclass classification problem is often called a hierarchical classification problem.
The label taxonomy defines a structure over the multiclass problem, and this structure should be used both in the formal definition of the hierarchical classification problem and in the design of learning algorithms to solve this problem.

Most hierarchical classification learning algorithms treat the taxonomy as an indisputable, definitive model of the world, never questioning its accuracy. However, most taxonomies are authored by human editors, and subjective matters of style and taste play a major role in their design. Many arbitrary decisions go into the design of a taxonomy, and when multiple editors are involved, these arbitrary decisions are made inconsistently. Figure 1 shows two versions of a simple taxonomy, both equally reasonable; choosing between them is a matter of personal preference. Arbitrary decisions that go into the taxonomy design can have a significant influence on the outcome of the learning algorithm [19]. Ideally, we want learning algorithms that are immune to the arbitrariness in the taxonomy.

The arbitrary factor in popular label taxonomies is a well-known phenomenon. [17] gives the example of the Library of Congress Classification system (LCC), a widely adopted and constantly updated taxonomy of "all knowledge", which includes the category WORLD HISTORY and four of its direct subcategories: ASIA, AFRICA, NETHERLANDS, and BALKAN PENINSULA. There is a clear imbalance between the level of granularity of ASIA versus its sibling BALKAN PENINSULA. The Dewey Decimal Classification (DDC), another widely accepted taxonomy of "all knowledge", defines ten main classes, each of which has exactly ten subclasses, and each of those again has exactly ten subclasses. The rigid choice of a decimal fan-out is an arbitrary one, and stems from an aesthetic ideal rather than a notion of informativeness.
Incidentally, the ten subclasses of RELIGION in the DDC include six categories about Christianity and the additional category OTHER RELIGIONS, demonstrating the editor's clear subjective predilection for Christianity. The ODP taxonomy of web-page topics is optimized for navigability rather than informativeness, and is therefore very flat and often unbalanced. As a result, two of the direct children of the label GAMES are VIDEO GAMES (with over 42,000 websites listed) and PAPER AND PENCIL GAMES (with only 32 websites). These examples are not intended to show that these useful taxonomies are flawed; they merely demonstrate the arbitrary, subjective aspect of their design.

Our goal is to define the problem such that it is invariant to many of these subjective and arbitrary design choices, while still exploiting much of the available information. Some older approaches to hierarchical classification do not use the taxonomy in the definition of the classification problem [12, 13, 18, 9, 16]. Namely, these approaches consider all classification mistakes to be equally bad, and use the taxonomy only to the extent that it reduces computational complexity and the number of classification mistakes. More recent approaches [3, 8, 5, 4] exploit the label taxonomy more thoroughly, by using it to induce a hierarchy-dependent loss function, which captures the intuitive idea that not all classification mistakes are equally bad: incorrectly classifying a document as CLASSICAL MUSIC when its true topic is actually JAZZ is not nearly as bad as classifying that document as COMPUTER HARDWARE. When this interpretation of the taxonomy can be made, ignoring it is effectively wasting a valuable signal in the problem input.
For example, [8] define the loss of predicting a label u when the correct label is y as the number of edges along the path between the two labels in the taxonomy graph.

Additionally, a taxonomy provides a very natural framework for balancing the tradeoff between specificity and accuracy in classification. Ideally, we would like our classifier to assign the most specific label possible to an instance, and the loss function should reward it adequately for doing so. However, when a specific label cannot be assigned with sufficiently high confidence, it is often better to fall back on a more general correct label than it is to assign an incorrect specific label. For example, classifying a document on JAZZ as the broader topic MUSIC is better than classifying it as the more specific yet incorrect topic COUNTRY MUSIC. A hierarchical classification problem should be defined in a way that penalizes both over-confidence and under-confidence in a balanced way.

The graph-distance based loss function introduced by [8] captures both of the ideas mentioned above, but it is very sensitive to arbitrary choices that go into the taxonomy design. Once again consider the example in Fig. 1: each hierarchy would induce a different graph-distance, which would lead to a different outcome of the learning algorithm. We can make the difference between the two outcomes arbitrarily large by making some regions of the taxonomy very deep and other regions very flat. Additionally, we note that the simple graph-distance based loss works best when the taxonomy is balanced, namely, when all of the splits in the taxonomy convey roughly the same amount of information. For example, in the taxonomy of Fig.
1, the children of CLASSICAL MUSIC are VIVALDI and NON-VIVALDI, where the vast majority of classical music falls in the latter. If the correct label is NON-VIVALDI and our classifier predicts the more general label CLASSICAL MUSIC, the loss should be small, since the two labels are essentially equivalent. On the other hand, if the correct label is VIVALDI then predicting CLASSICAL MUSIC should incur a larger loss, since important detail was excluded. A simple graph-distance based loss will penalize both errors equally.

On one hand, we want to use the hierarchy to define the problem. On the other hand, we don't want arbitrary choices and unbalanced splits in the taxonomy to have a significant effect on the outcome. Can we have our cake and eat it too? Our proposed solution is to leave the taxonomy structure as-is, and to stick with a graph-distance based loss, but to introduce non-uniform edge weights. Namely, the loss of predicting u when the true label is y is defined as the sum of edge-weights along the shortest path from u to y. We use the underlying distribution over labels to set the edge weights in a way that adds balance to the taxonomy and compensates for certain arbitrary design choices. Specifically, we set edge weights using the information-theoretic notion of conditional self-information [7]. The weight of the edge between a label u and its parent u′ is the negative log-probability of observing the label u given that the example is also labeled by u′.

Figure 1: Two equally-reasonable label taxonomies. Note the subjective decision to include/exclude the label ROCK, and note the unbalanced split of CLASSICAL into the small class VIVALDI and the much larger class NON-VIVALDI.

Others [19] have previously tried to use the training data to "fix" the hierarchy, as a preprocessing step to classification.
However, it is unclear whether it is statistically permissible to reuse the training data twice: once to fix the hierarchy and then again in the actual learning procedure. The problem is that the preprocessing step may introduce strong statistical dependencies into our problem. These dependencies could prove detrimental to our learning algorithm, which expects to see a set of independent examples. The key to our approach is that we can estimate our distribution-dependent loss using the same data used to define it, without introducing any significant bias. It turns out that to accomplish this, we must deviate from the prevalent binomial-type estimation scheme that currently dominates machine learning and turn to a more peculiar geometric-distribution-type estimator. A binomial-type estimator essentially counts things (such as mistakes), while a geometric-type estimator measures the amount of time that passes before something occurs. Geometric-type estimators have the interesting property that they might occasionally fail, which we investigate in detail below. Moreover, we show how to control the variance of our estimate without adding bias. Since empirical estimation is the basis of supervised machine learning, we can now extrapolate hierarchical learning algorithms from our unbiased estimation technique. Specifically, we present a reduction from hierarchical classification to cost-sensitive multiclass classification, which is based on our new geometric-type estimator.

This paper is organized as follows. We formally set the problem in Sec. 2 and present our new distribution-dependent loss function in Sec. 3. In Sec. 4 we discuss how to control the variance of our empirical estimates, which is a critical step towards the learning algorithm described in Sec. 5. We conclude with a discussion in Sec. 6.
We omit technical proofs due to space constraints.

2 Problem Setting

We now define our problem more formally. Let X be an instance space and let T be a taxonomy of labels. For simplicity, we focus on tree hierarchies. T is formally defined as the pair (U, π), where U is a finite set of labels and π is the function that specifies the parent of each label in U. U contains both general labels and specific labels. Specifically, we assume that U contains the special label ALL, and that all other labels in U are special cases of ALL. π : U → U is a function that defines the structure of the taxonomy by assigning a parent π(u) to each label u ∈ U. Semantically, π(u) is a more general label than u that contains u as a special case. In other words, we can say that "u is a specific type of π(u)". For completeness, we define π(ALL) = ALL. The n'th generation parent function π^n : U → U is defined by recursively applying π to itself n times. Formally,

π^n(u) = π(π(· · · π(u) · · ·)), with π applied n times.

For completeness, define π^0 as the identity function over U. T is acyclic, namely, for all u ≠ ALL and for all n ≥ 1 it holds that π^n(u) ≠ u. The ancestor function π⋆ maps each label to its set of ancestors, and is defined as π⋆(u) = ∪_{n=0}^∞ {π^n(u)}. In other words, π⋆(u) includes u, its parent, its parent's parent, and so on. We assume that T is connected, and specifically that ALL is an ancestor of all labels, meaning that ALL ∈ π⋆(u) for all u ∈ U. The inverse of the ancestor function is the descendent function τ, which maps u ∈ U to the subset {u′ ∈ U : u ∈ π⋆(u′)}. In other words, u is a descendent of u′ if and only if u′ is an ancestor of u.
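The taxonomy primitives just defined can be sketched in a few lines of code. The following Python sketch represents a taxonomy as a parent map; the helper names and the toy label set are illustrative assumptions, not notation from the paper.

```python
# Minimal sketch of the taxonomy primitives pi^n, pi*, and tau.
# A taxonomy is an assumed dict mapping each label to its parent,
# with the root satisfying parent["ALL"] == "ALL".

def pi_n(parent, u, n):
    """The n'th generation parent pi^n(u); pi^0 is the identity."""
    for _ in range(n):
        u = parent[u]
    return u

def ancestors(parent, u):
    """pi*(u): u, its parent, its parent's parent, ..., ending at ALL."""
    chain = [u]
    while parent[u] != u:          # pi(ALL) = ALL terminates the walk
        u = parent[u]
        chain.append(u)
    return chain

def descendants(parent, u):
    """tau(u): every label u' such that u is an ancestor of u'."""
    return {v for v in parent if u in ancestors(parent, v)}

# An illustrative toy taxonomy (invented labels, not the paper's data):
parent = {"ALL": "ALL", "MUSIC": "ALL", "CLASSICAL": "MUSIC", "JAZZ": "MUSIC"}
```

Under these definitions, `ancestors(parent, "CLASSICAL")` walks up to ALL, and `descendants(parent, "MUSIC")` recovers the subtree rooted at MUSIC, including MUSIC itself.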
Graphically, we can depict T as a rooted tree: U defines the tree nodes, ALL is the root, and {(u, π(u)) : u ∈ U \ ALL} is the set of edges. In this graphical representation, τ(u) includes the nodes in the subtree rooted at u. Using this representation, we define the graph distance d(u, u′) between any two labels as the number of edges along the path between u and u′ in the tree. The lowest common ancestor function λ : U × U → U maps any pair of labels to their lowest common ancestor in the taxonomy, where "lowest" is in the sense of tree depth. Formally, λ(u, u′) = π^j(u) where j = min{i : π^i(u) ∈ π⋆(u′)}. In words, λ(u, u′) is the closest ancestor of u that is also an ancestor of u′. It is straightforward to verify that λ(u, u′) = λ(u′, u). The leaves of a taxonomy are the labels that are not parents of any other labels. We denote the set of leaves by Y and note that Y ⊂ U.

Now, let D be a distribution on the product space X × Y. In other words, D is a joint distribution over instances and their corresponding labels. Note that we assume that the labels that occur in the distribution are always leaves of the taxonomy T. This assumption can be made without loss of generality: if this is not the case then we can always add a leaf to each interior node, and relabel all of the examples accordingly. More formally, for each label u ∈ U \ Y, we add a new node y to U with π(y) = u, and whenever we sample (x, u) from D we replace it with (x, y). Initially, we do not know anything about D, other than the fact that it is supported on X × Y. We sample m independent points from D, to obtain the sample S = {(x_i, y_i)}_{i=1}^m.

A classifier is a function f : X → U that assigns a label to each instance of X.
Note that a classifier is allowed to predict any label in U, even though it knows that only leaf labels are ever observed in the real world. We feel that this property captures a fundamental characteristic of hierarchical classification: although the truth is always specific, a good hierarchical classifier will fall back to a more general label when it cannot confidently give a specific prediction. The quality of f is measured using a loss function ℓ : U × Y → R+. For any instance-label pair (x, y), the loss ℓ(f(x), y) should be interpreted as the penalty associated with predicting the label f(x) when the true label is y. We require ℓ to be weakly monotonic, in the following sense: if u′ lies along the path from u to y then ℓ(u′, y) ≤ ℓ(u, y). Although the error indicator function, ℓ(u, y) = 1_{u≠y}, satisfies our requirements, it is not what we have in mind. Another fundamental characteristic of hierarchical classification problems is that not all prediction errors are equally bad, and the definition of the loss should reflect this. More specifically, if u′ lies along the path from u to y and u is not semantically equivalent to u′, we actually expect that ℓ(u′, y) < ℓ(u, y).

3 A Distribution-Calibrated Loss for Hierarchical Classification

As mentioned above, we want to calibrate the hierarchical classification loss function using the distribution D, through its empirical proxy S. In other words, we want D to differentiate between informative splits in the taxonomy and redundant ones. We follow [8] in using graph-distance to define the loss function, but instead of setting all of the edge weights to 1, we define edge weights using D.

For each y ∈ Y, let p(y) be the marginal probability of the label y in the distribution D. For each u ∈ U, define p(u) = Σ_{y ∈ Y ∩ τ(u)} p(y).
In words, for any u ∈ U, p(u) is the probability of observing any descendent of u. We assume henceforth that p(u) > 0 for all u ∈ U. With these definitions handy, define the weight of the edge between u and π(u) as log(p(π(u))/p(u)). This weight is essentially the definition of conditional self-information from information theory [7]. The nice thing about this definition is that the weighted graph-distance between labels u and y telescopes, between u and λ(u, y) and between y and λ(u, y), and becomes

ℓ(u, y) = 2 log(p(λ(u, y))) − log(p(u)) − log(p(y)) .   (1)

Since this loss function depends only on u, y, and λ(u, y), and their frequencies according to D, it is completely invariant to the number of labels along the path from u to y. It is also invariant to inconsistent degrees of flatness of the taxonomy in different regions. Finally, it is even invariant to the addition or subtraction of new leaves or entire subtrees, so long as the marginal probabilities p(u), p(y), and p(λ(u, y)) remain unchanged. This loss also balances uneven splits in the taxonomy. Recalling the example in Fig. 1, where CLASSICAL is split into VIVALDI and NON-VIVALDI, the edge to the former will have a very high weight, whereas the edge to the latter will have a weight close to zero.

Now, define the risk of a classifier f as R(f) = E_{(X,Y)∼D}[ℓ(f(X), Y)], the expected loss over examples sampled from D. Our goal is to obtain a classifier with a small risk. However, before we tackle the problem of finding a low-risk classifier, we address the intermediate task of estimating the risk of a given classifier f using the sample S. The solution is not straightforward, since we cannot even compute the loss on an individual example, ℓ(f(x_i), y_i), as this requires knowledge of D.
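As a concrete illustration of the calibrated loss ℓ(u, y) = 2 log p(λ(u, y)) − log p(u) − log p(y), here is a small Python sketch. The parent map and the marginal probabilities below are invented for illustration (mimicking the unbalanced CLASSICAL split of Fig. 1); they are not taken from any real data.

```python
import math

def ancestors(parent, u):
    """pi*(u) as a list from u up to the root."""
    chain = [u]
    while parent[u] != u:
        u = parent[u]
        chain.append(u)
    return chain

def calibrated_loss(parent, p, u, y):
    """The distribution-calibrated loss of Eq. (1):
    2*log p(lca(u, y)) - log p(u) - log p(y)."""
    anc_y = set(ancestors(parent, y))
    lam = next(a for a in ancestors(parent, u) if a in anc_y)  # lambda(u, y)
    return 2 * math.log(p[lam]) - math.log(p[u]) - math.log(p[y])

# Illustrative taxonomy and marginals (assumed numbers, not real data):
parent = {"ALL": "ALL", "CLASSICAL": "ALL",
          "VIVALDI": "CLASSICAL", "NON-VIVALDI": "CLASSICAL"}
p = {"ALL": 1.0, "CLASSICAL": 0.5, "VIVALDI": 0.01, "NON-VIVALDI": 0.49}

# Falling back to CLASSICAL is nearly free when the truth is NON-VIVALDI,
# but expensive when the truth is the rare label VIVALDI.
cheap = calibrated_loss(parent, p, "CLASSICAL", "NON-VIVALDI")   # log(0.50/0.49)
costly = calibrated_loss(parent, p, "CLASSICAL", "VIVALDI")      # log(0.50/0.01)
```

Note how the loss depends only on the marginals, not on how many intermediate labels sit along the path, which is exactly the invariance discussed above.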
A naive way to estimate ℓ(f(x_i), y_i) using the sample S is to first estimate each p(y) by (1/m) Σ_{i=1}^m 1_{y_i=y}, and to plug these values into the definition of ℓ. This estimator tends to suffer from a strong bias, due to the non-linearity of the logarithm, and is considered to be unreliable¹. Instead, we want an unbiased estimator.

First, we write the definition of risk more explicitly, using the definition of the loss function in Eq. (1). Define q(f, u) = Pr(f(X) = u), the probability that f outputs u when X is drawn according to the marginal distribution of D over X. Also define r(f, u) = Pr(λ(f(X), Y) = u), the probability that the lowest common ancestor of f(X) and Y is u, when (X, Y) is drawn from D. R(f) can be rewritten as

R(f) = Σ_{u∈U} (2r(f, u) − q(f, u)) log(p(u)) − Σ_{y∈Y} p(y) log(p(y)) .   (2)

Notice that the second term in the definition of risk is a constant, independent of f. This constant is simply H(Y), the Shannon entropy [7] of the label distribution. Our ultimate goal is to compare the risk values of different classifiers and to choose the best one, so we don't really care about this constant, and we can discard it henceforth. From here on, we focus on estimating the augmented risk R̄(f) = R(f) − H(Y).

The main building block of our estimator is the estimation technique presented in [14]. Assume for a moment that the sample S is infinite. Recall that the harmonic number h_n is defined as Σ_{i=1}^n 1/i, with h_0 = 0.
Define the random variables A_i and B_i as follows:

A_i = min{j ∈ N : y_{i+j} ∈ τ(f(x_i))} − 1
B_i = min{j ∈ N : y_{i+j} ∈ τ(λ(f(x_i), y_i))} − 1

For example, A_1 + 2 is the index of the first example after (x_1, y_1) whose label is contained in the subtree rooted at f(x_1), and B_1 + 2 is the index of the first example after (x_1, y_1) whose label is contained in the subtree rooted at λ(f(x_1), y_1). Note that B_i ≤ A_i, since λ(u, y) is, by definition, an ancestor of u, so y′ ∈ τ(u) implies y′ ∈ τ(λ(u, y)). Next, define the random variable L_1 = h_{A_1} − 2h_{B_1}.

Theorem 1. L_1 is an unbiased estimator of R̄(f).

Proof. We have that

E[L_1 | f(X_1) = u, Y_1 = y] = p(u) Σ_{j=0}^∞ h_j (1 − p(u))^j − 2 p(λ(u, y)) Σ_{j=0}^∞ h_j (1 − p(λ(u, y)))^j .

Using the fact that for any α ∈ [0, 1) it holds that Σ_{n=0}^∞ h_n α^n = −log(1−α)/(1−α), we get E[L_1 | f(X_1) = u, Y_1 = y] = −log(p(u)) + 2 log(p(λ(u, y))). Therefore,

E[L_1] = Σ_{u∈U} Σ_{y∈Y} Pr(f(X) = u, Y = y) E[L_1 | f(X_1) = u, Y_1 = y] = Σ_{u∈U} (2r(f, u) − q(f, u)) log(p(u)) = R̄(f) .

We now recall that our sample S is actually of finite size m. The problem that now occurs is that A_1 and B_1 are not well defined when f(X_1) does not appear anywhere in Y_2, . . . , Y_m. When this happens, we say that the estimator L_1 fails. If f outputs a label u with p(u) = 0 then L_1 will fail with probability 1.

¹The interested reader is referred to the extensive literature on the closely related problem of estimating the entropy of a distribution from a finite sample.
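To make the construction concrete, here is a small Python sketch of a single geometric estimator. The function name and the set-valued interface (passing the leaf sets under f(x_1) and under λ(f(x_1), y_1) directly) are assumptions of this sketch, not the paper's notation.

```python
def harmonic(n):
    """h_n = 1 + 1/2 + ... + 1/n, with h_0 = 0."""
    return sum(1.0 / i for i in range(1, n + 1))

def geometric_L1(labels, subtree_u, subtree_lam):
    """One geometric estimator L_1 = h_{A_1} - 2 h_{B_1}.
    labels: the observed label sequence y_1, ..., y_m;
    subtree_u: the leaf labels under the prediction f(x_1);
    subtree_lam: the leaf labels under lca(f(x_1), y_1), a superset of subtree_u.
    Returns None when the estimator fails, i.e. no later label lands under f(x_1)."""
    A = B = None
    for j in range(1, len(labels)):          # examines y_{1+j} for j = 1, 2, ...
        y = labels[j]
        if B is None and y in subtree_lam:
            B = j - 1                        # B_1: first hit under the LCA subtree
        if y in subtree_u:
            A = j - 1                        # A_1: first hit under f(x_1)'s subtree
            break
    if A is None:
        return None                          # the failure event
    return harmonic(A) - 2 * harmonic(B)
```

Since the subtree of the LCA contains the subtree of the prediction, B is always set by the time A is, mirroring B_1 ≤ A_1 above.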
On the other hand, the probability of failure is negligible when m is large enough, and when f does not output labels with tiny probabilities. Formally, let β(f) = min_{u : q(f,u)>0} p(u) be the smallest probability of any label that f outputs.

Theorem 2. The probability of failure is at most e^{−(m−1)β(f)}.

The estimator E[L_1 | no-fail] is no longer an unbiased estimator of R̄(f), but the bias is small. Specifically, since we are after a classifier f with a small risk, we prove an upper bound on R̄(f).

Theorem 3. It holds that E[L_1 | no-fail] ≥ R̄(f) − (m−1)e^{−β(f)(m−1)} / β²(f).

For example, with β = 0.01 and m = 2500, the bias term in Thm. 3 is less than 0.0004. With m = 5000 it is already less than 10^{−14}.

4 Decreasing the Variance of the Estimator

Say that we have k classifiers and we want to choose the best one. The estimator L_1 suffers from an unnecessarily high variance, because it typically uses a short prefix of the sample S and wastes the remaining examples. To reliably compare k empirical risk estimates, we need to reduce the variance of each estimator. The exact value of Var(L_1) depends on the distributions p, q, and r in a non-trivial way, but we can give a simple upper bound on Var(L_1) in terms of β(f).

Theorem 4. Var(L_1) ≤ −9 log(β(f)) + 9 log²(β(f)).

We reduce the variance of the estimator by repeating the estimation multiple times, without reusing any sample points. Formally, define S_1 = 1, and define for all i ≥ 2 the random variables S_i = S_{i−1} + A_{S_{i−1}} + 2, and L_i = h_{A_{S_i}} − 2h_{B_{S_i}}. In words: the first estimator L_1 starts at S_1 = 1 and uses A_1 + 2 examples, namely, the examples 1, . . . , (A_1 + 2). Now, S_2 = A_1 + 3 is the first untouched example in the sequence.
The second estimator, L_2, starts at example S_2 and uses A_{S_2} + 2 examples, namely, the examples S_2, . . . , (S_2 + A_{S_2} + 1), and so on. If we had an infinite sample and chose some threshold t, the random variables L_1, . . . , L_t would all be unbiased estimators of R̄(f), and therefore the aggregate estimator L = (1/t) Σ_{i=1}^t L_i would also be an unbiased estimate of R̄(f). Since L_1, . . . , L_t are also independent, the variance of the aggregate estimator would be (1/t) Var(L_1).

In the finite-sample case, aggregating multiple estimators is not as straightforward. Again, the event where the estimation fails introduces a small bias. Additionally, the number of independent estimations that fit in a sample of fixed size m is itself a random variable T. Moreover, the value of T depends on the values of the risk estimators. In other words, if L_1, L_2, . . . take large values then T will take a small value. The precise definition of T should be handled with care, to ensure that the individual estimators remain independent and that the aggregate estimator maintains a small bias. For example, the first thing that comes to mind is to set T to be the largest number t such that S_t ≤ m; this is a bad idea. To see why, note that if T = 2 and A_1 = m − 4 then we know with certainty that A_{S_2} = 0. This clearly demonstrates a strong statistical dependence between L_1, L_2 and T, which both interferes with the variance reduction and introduces a bias. Instead, we define T as follows: choose a positive integer l ≤ m and set T using the last l examples in S:

T = min{t ∈ N : S_{t+1} ≥ m − l} .   (3)

In words, we think of the last l examples in S as the "landing strip" of our procedure: we keep jumping forward in the sequence of samples, from S_1 to S_2, to S_3, and so on, until the first time we land on the landing strip.
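The restart-and-aggregate procedure can be sketched as follows. The callable interface (`subtree(i)` and `lam_subtree(i)` returning the leaf sets for example i) and the 0-based index bookkeeping are assumptions of this sketch; the boundary convention for the landing strip follows Eq. (3) only up to such indexing choices.

```python
def harmonic(n):
    """h_n = 1 + 1/2 + ... + 1/n, with h_0 = 0."""
    return sum(1.0 / i for i in range(1, n + 1))

def aggregate_estimate(labels, subtree, lam_subtree, l):
    """Landing-strip aggregation: restart the geometric estimator at S_1 = 1,
    then S_{i+1} = S_i + A_{S_i} + 2, stopping the first time a start index
    falls in the last l examples, and summing the individual estimators.
    subtree(i) / lam_subtree(i): leaf sets under f(x_i) and lca(f(x_i), y_i).
    Returns (L, T), or None when a jump overshoots the landing strip."""
    m = len(labels)
    total, T, s = 0.0, 0, 0                  # s is the 0-based index of S_i
    while s < m - l:                         # not yet on the landing strip
        A = B = None
        for j in range(1, m - s):
            y = labels[s + j]
            if B is None and y in lam_subtree(s):
                B = j - 1
            if y in subtree(s):
                A = j - 1
                break
        if A is None:
            return None                      # failure: overshot the strip
        total += harmonic(A) - 2 * harmonic(B)
        T += 1
        s += A + 2                           # jump to the next untouched example
    return total, T
```

Note that `total` sums the L_i rather than averaging them, matching the definition of the aggregate estimator L in the text.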
Our new failure scenario occurs when our last jump overshoots the strip, and no S_i falls on any one of the last l examples. If L does not fail, define the aggregate estimator as L = Σ_{i=1}^T L_i. Note that we are summing the L_i rather than averaging them; we explain this later on.

Theorem 5. The probability of failure of the estimator L is at most e^{−lβ(f)}.

We now prove that our definition of T indeed decreases the variance without adding bias. We give a simplified version of the analysis, assuming that S is infinite, and assuming that the limit m is merely a recommendation. In other words, T is still defined as before, but estimation never fails, even in the rare case where S_T + A_{S_T} + 1 > m (the index of the last example used in the estimation exceeds the predefined limit m). We note that a very similar theorem can be stated in the finite-sample case, at the price of a significantly more complicated analysis. The complication stems from the fact that we are estimating the risk of k classifiers simultaneously, and the failure of one estimator depends on the values of the other estimators. We allow ourselves to ignore failures because they occur with such small probability, and because they introduce an insignificant bias.

INPUTS: a training set S = {(x_i, y_i)}_{i=1}^m, a label taxonomy T.
1  for i = 1, . . . , m
2    generate random permutation ψ : {1, . . . , (m − 1)} → {1, . . . , (i − 1), (i + 1), . . . , m}
3    for u = 1, . . . , d
4      a = −1 + min{j ∈ {1, . . . , (m − 1)} : y_ψ(j) ∈ τ(u)}
5      b = −1 + min{j ∈ {1, . . . , (m − 1)} : y_ψ(j) ∈ τ(λ(u, y_i))}
6      M(i, u) = 1/(b+1) + 1/(b+2) + · · · + 1/a
OUTPUT: M

Figure 2: A reduction from hierarchical multiclass to cost-sensitive multiclass.

Theorem 6.
Assuming that S is infinite, but T is still defined as in Eq. (3), it holds that E[L] = E[T] R̄(f) and Var(L) ≤ E[T] σ², where σ² = Var(L_i).

The proof follows from variations on Wald's theorem [15].

Recall that we have k competing classifiers, f_1, . . . , f_k, and we want to choose one with a small risk. We overload our notation to support multiple concurrent estimations, and define T(f_j) as the stopping time (previously defined as T in Eq. (3)) of the estimation process for R̄(f_j). Also let L_i(f_j) be the i'th unbiased estimator of R̄(f_j). To conduct a fair comparison of the k classifiers, we redefine T = min_{j=1,...,k} T(f_j), and let L(f_j) = Σ_{i=1}^T L_i(f_j). In other words, we aggregate the same number of estimators for each classifier. We then choose the classifier with the smallest risk estimate, arg min_j L(f_j). Theorem 6 still holds for each individual classifier, because the new definition of T remains a stopping time for each of the individual estimation processes. Although we may not know the exact value of E[T], it is just a number that we can use to reason about the bias and the variance of L. We note that finding j that minimizes L(f_j) is equivalent to finding j that minimizes L(f_j)/E[T]. The latter, according to Thm. 6, is an unbiased estimate of R̄(f_j). Moreover, the variance of each L(f_j)/E[T] is Var(L(f_j)/E[T]) = σ²/E[T], so the effective variance of our unbiased estimate decreases like 1/E[T], which is what we would expect. Using the one-tailed Chebyshev inequality [11], we get that for any ε > 0, Pr(R̄(f_j) ≥ L(f_j)/E[T] + ε) < σ²/(σ² + E[T]ε²). By the union bound, the probability that this bound fails for any of the k classifiers is at most kσ²/(σ² + E[T]ε²). The variance of the estimation depends on E[T], and we expect E[T] to grow linearly with m.
For example, we can prove the following crude lower bound.

Theorem 7. $E[T] \ge (m - l)/c$, where $c = k + \sum_{j=1}^k 1/\beta(f_j)$.

5 Reducing Hierarchical Classification to Cost-Sensitive Classification

In this section, we propose a method for learning low-risk hierarchical classifiers, using our new definition of risk. More precisely, we describe a reduction from hierarchical classification to cost-sensitive multiclass classification. The appeal of this approach is the abundance of existing cost-sensitive learning algorithms. The reduction is itself an algorithm whose input is a training set of $m$ examples and a taxonomy over $d$ labels, and whose output is an $m \times d$ matrix of non-negative reals, denoted by $M$. Entry $M(i, u)$ is the cost of classifying example $i$ with label $u$. This cost matrix, together with the original training set, is given to a cost-aware multiclass learning algorithm, which attempts to find a classifier $f$ with a small empirical loss $\sum_{i=1}^m M(i, f(x_i))$.

For example, a common approach to multiclass problems is to train a model $f_u : X \to \mathbb{R}$ for each label $u \in U$ and to define the classifier $f(x) = \arg\max_{u \in U} f_u(x)$. An SVM-flavored way to train a cost-sensitive classifier is to assume that the functions $f_u$ live in a Hilbert space, and to minimize

$$\sum_{u=1}^{d} \|f_u\|^2 \;+\; C \sum_{i=1}^{m} \sum_{u \ne y_i} \big[\, M(i,u) + f_u(x_i) - f_{y_i}(x_i) \,\big]_+ \,, \qquad (4)$$

where $C > 0$ is a parameter and $[\alpha]_+ = \max\{0, \alpha\}$. The first term is a regularizer and the second is an empirical loss, justified by the fact that $M(i, f(x_i)) \le \sum_{u \ne y_i} \big[ M(i,u) + f_u(x_i) - f_{y_i}(x_i) \big]_+$.

Coming back to the reduction algorithm, we generate $M$ using the procedure outlined in Fig. 2. Based on the analysis of the previous sections, it is easy to see that, for all $i$, $M(i, f(x_i))$ is an unbiased estimator of the risk $\bar{R}(f)$.
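The cost-matrix construction of Fig. 2 can be sketched in Python as follows. This is a minimal sketch, not the paper's implementation: `tau(u)` (the set of labels in the subtree rooted at $u$) and `lca(u, y)` (standing in for $\lambda(u, y)$) are assumed to be supplied by the caller, and labels are encoded as integers $0, \ldots, d-1$.

```python
import random

def cost_matrix(labels, tau, lca, num_labels):
    """Compute the m x d cost matrix of Fig. 2 (a sketch).

    labels     -- list of the m training labels y_1, ..., y_m
    tau        -- tau(u): set of labels in the subtree rooted at u (assumed helper)
    lca        -- lca(u, y): least common ancestor of u and y (assumed helper)
    num_labels -- d, the total number of labels in the taxonomy
    """
    m = len(labels)
    M = [[0.0] * num_labels for _ in range(m)]
    for i in range(m):
        # random permutation psi over the other m-1 example indices
        others = [j for j in range(m) if j != i]
        random.shuffle(others)
        for u in range(num_labels):
            # a: number of draws strictly before the first example whose
            # label falls in the subtree of u
            a = next(j for j, k in enumerate(others) if labels[k] in tau(u))
            # b: same, for the subtree of the least common ancestor of u, y_i
            b = next(j for j, k in enumerate(others)
                     if labels[k] in tau(lca(u, labels[i])))
            # M(i,u) = 1/(b+1) + 1/(b+2) + ... + 1/a  (empty sum when a == b)
            M[i][u] = sum(1.0 / t for t in range(b + 1, a + 1))
    return M
```

Note that the correct label always costs zero, since $\tau(u) = \tau(\lambda(u, y_i))$ when $u = y_i$, so $a = b$ and the harmonic sum is empty; the sketch also assumes every queried subtree contains at least one other training example, as otherwise the estimation fails in the sense discussed above.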
This unbiasedness holds even if $\psi$ (as defined in Fig. 2) is a fixed function, because the training set is assumed to be i.i.d. Therefore, $\frac{1}{m}\sum_{i=1}^m M(i, f(x_i))$ is also an unbiased estimator of $\bar{R}(f)$. The cost-sensitive learning algorithm will try to minimize this empirical estimate. The random permutation at each step is intended to decrease the variance of the overall estimate, by decreasing the dependencies between the different individual estimators. We acknowledge that a rigorous analysis of the variance of this estimator is missing from this work. Ideally, we would like to show that, with high probability, the empirical estimate $\frac{1}{m}\sum_{i=1}^m M(i, f(x_i))$ is $\epsilon$-close to its expectation $\bar{R}(f)$, uniformly for all classifiers $f$ in our function class. This is a challenging problem due to the complex dependencies in the estimator.

The learning algorithm used to solve this problem can (and should) use the hierarchical structure to guide its search for a good classifier. Our reduction to an unstructured cost-sensitive problem should not be misinterpreted as a recommendation not to use the structure in the learning process. For example, following [10, 8], we could augment the SVM approach described in Eq. (4) by replacing the unstructured regularizer $\sum_{u=1}^d \|f_u\|^2$ with the structured regularizer $\sum_{u=1}^d \|f_u - f_{\pi(u)}\|^2$, where $\pi(u)$ is the parent label of $u$. [8] showed significant gains on hierarchical problems using this regularizer.

6 Discussion

We started by taking a step back from the typical setup of a hierarchical classification machine learning problem. As a consequence, our focus was on the fundamental aspects of the hierarchical problem definition, rather than on the equally important algorithmic issues.
Our discussion was restricted to the simplistic model of single-label hierarchical classification with single-linked taxonomies, and our first goal going forward is to relax these assumptions.

We point out that many of the theorems proven in this paper depend on the value of $\beta(f)$, which is defined as $\min_{u : q(u) > 0}\, p(u)$. Specifically, if $f$ occasionally outputs a very rare label, then $\beta(f)$ is tiny and much of our analysis breaks down. This provides a strong indication that an empirical estimate of $\beta(f)$ would make a good regularization term in a hierarchical learning scheme. In other words, we should deter the learning algorithm from choosing a classifier that predicts very rare labels. As mentioned in the introduction, the label taxonomy provides the perfect mechanism for backing off and predicting a more common and less risky ancestor of such a label.

We believe that our work is significant in the broader context of structured learning. Most structured learning algorithms blindly trust the structure that they are given, and arbitrary design choices are likely to appear in many types of structured learning. The idea of using the data distribution to calibrate, correct, and balance the side-information extends to other structured learning scenarios. The geometric-type estimation procedure outlined in this paper may play an important role in those settings as well.

Acknowledgment

The author would like to thank Paul Bennett for suggesting the loss function for its information-theoretic properties, its reduction to a tree-weighted distance, and its ability to capture other desirable characteristics of hierarchical loss functions, such as weak monotonicity. The author also thanks Ohad Shamir, Chris Burges, and Yael Dekel for helpful discussions.

References

[1] The Library of Congress Classification. http://www.loc.gov/aba/cataloging/classification/.
[2] The Open Directory Project.
http://www.dmoz.org/about.html.
[3] L. Cai and T. Hofmann. Hierarchical document categorization with support vector machines. In Proceedings of the 13th ACM Conference on Information and Knowledge Management, 2004.
[4] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Hierarchical classification: combining Bayes with SVM. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
[5] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Incremental algorithms for hierarchical classification. Journal of Machine Learning Research, 7:31–54, 2007.
[6] The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nature Genetics, 25:25–29, 2000.
[7] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.
[8] O. Dekel, J. Keshet, and Y. Singer. Large margin hierarchical classification. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004.
[9] S. T. Dumais and H. Chen. Hierarchical classification of Web content. In Proceedings of SIGIR-00, pages 256–263, 2000.
[10] T. Evgeniou, C. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.
[11] W. Feller. An Introduction to Probability Theory and its Applications, volume 2. John Wiley and Sons, second edition, 1970.
[12] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 171–178, 1997.
[13] A. K. McCallum, R. Rosenfeld, T. M. Mitchell, and A. Y. Ng. Improving text classification by shrinkage in a hierarchy of classes. In Proceedings of ICML-98, pages 359–367, 1998.
[14] S. Montgomery-Smith and T. Schurmann. Unbiased estimators for entropy and class number.
[15] S. M. Ross and E. A. Pekoz. A second course in probability theory. 2007.
[16] E. Ruiz and P. Srinivasan.
Hierarchical text categorization using neural networks. Information Retrieval, 5(1):87–118, 2002.
[17] C. Shirky. Ontology is overrated: Categories, links, and tags. In O'Reilly Media Emerging Technology Conference, 2005.
[18] A. S. Weigend, E. D. Wiener, and J. O. Pedersen. Exploiting hierarchy in text categorization. Information Retrieval, 1(3):193–216, 1999.
[19] J. Zhang, L. Tang, and H. Liu. Automatically adjusting content taxonomies for hierarchical classification. In Proceedings of the Fourth Workshop on Text Mining, SDM06, 2006.