{"title": "Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 1233, "page_last": 1240, "abstract": null, "full_text": "Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods\nMatthias W. Seeger Max Planck Institute for Biological Cybernetics P.O. Box 2169, 72012 Tubingen, Germany seeger@tuebingen.mpg.de\n\nAbstract\nWe propose a highly efficient framework for kernel multi-class models with a large and structured set of classes. Kernel parameters are learned automatically by maximizing the cross-validation log likelihood, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical class structure, achieving state-of-the-art results in an order of magnitude less time than previous work.\n\n1\n\nIntroduction\n\nIn many real-world statistical problems, we would like to fit a model with a large number of dependent variables to a training sample with very many cases. For example, in multi-way classification problems with a structured label space, modern applications demand predictions on thousands of classes, and very large datasets become available. If n and C denote dataset size and number of classes respectively, nonparametric kernel methods like SVMs or Gaussian processes typically scale superlinearly in n C , if dependencies between the latent class functions are properly represented. Furthermore, most large scale kernel methods proposed so far refrain from solving the problem of learning hyperparameters (kernel or loss function parameters). The user has to run cross-validation schemes, which require frequent human interaction and are not suitable for learning more than a few hyperparameters. In this paper, we propose a general framework for learning in probabilistic kernel classification models. 
While the basic model is standard, a major feature of our approach is the high computational efficiency with which the primary fitting (for fixed hyperparameters) is done, allowing us to deal with hundreds of classes and thousands of datapoints within a few minutes. The primary fitting scales linearly in C, and depends on n mainly via a fixed number of matrix-vector multiplications (MVM) with n × n kernel matrices. In many situations, these MVM primitives can be computed very efficiently, as will be demonstrated. Furthermore, we optimize hyperparameters automatically by minimizing the cross-validation log likelihood, making use of our primary fitting technology as inner loop in order to compute the CV criterion and its gradient. Our approach can be used to learn a large number of hyperparameters and does not need user interaction. Our framework is generally applicable to structured label spaces, which we demonstrate here for hierarchical classification of text documents. The hierarchy is represented through an ANOVA setup. While the C latent class functions are fully dependent a priori, the scaling of our method stays within a factor of two compared to unstructured classification. We test our framework on the same tasks treated in [1], achieving comparable results in at least an order of magnitude less time. Our method estimates predictive probabilities for each test point, which can allow better predictions w.r.t. loss functions different from zero-one. The primary fitting method is given in Section 2, the extension to hierarchical classification in Section 3. Hyperparameter learning is discussed in Section 4. Computational details are provided in Section 5. We present experimental results in Section 6. Our highly efficient implementation is publicly available, as project klr in the LHOTSE^1 toolbox for adaptive statistical models.

2 Penalized Multiple Logistic Regression

Our problem is to predict y ∈ {1, . . . , C} from x ∈ X, given some i.i.d. 
data D = {(x_i, y_i) | i = 1, . . . , n}. We use zero-one coding, i.e. y_i ∈ {0, 1}^C, 1^T y_i = 1. We employ the multiple logistic regression model, consisting of C latent (unobserved) class functions u_c feeding into the multiple logistic (or softmax) likelihood P(y_{i,c} = 1 | x_i, u_i) = e^{u_c(x_i)} / (Σ_{c'} e^{u_{c'}(x_i)}). We write u_c = f_c + b_c for intercept parameters b_c ∈ R and functions f_c living in a reproducing kernel Hilbert space (RKHS) with kernel K^{(c)}, and consider the penalized negative log likelihood Φ = -Σ_{i=1}^n log P(y_i | u_i) + (1/2) Σ_{c=1}^C ‖f_c‖_c^2 + (1/2) σ^{-2} ‖b‖^2, which we minimize for primary fitting. ‖·‖_c is the RKHS norm for kernel K^{(c)}. Details on such setups can be found in [4]. Our notation for nC vectors^2 (and matrices) uses the ordering y = (y_{1,1}, y_{2,1}, . . . , y_{n,1}, y_{1,2}, . . . ). We set u = (u_c(x_i)) ∈ R^{nC}. ⊗ denotes the Kronecker product, 1 is the vector of all ones. Selection indexes I are applied to i only: y_I = (y_{i,c})_{i∈I,c} ∈ R^{|I|C}. Since the likelihood depends on the f_c only through the f_c(x_i), every minimizer of Φ must be a kernel expansion: f_c = Σ_i α_{i,c} K^{(c)}(·, x_i) (representer theorem, see [4]). Plugging this in, the regularizer becomes (1/2) α^T K α + (1/2) σ^{-2} ‖b‖^2, where K^{(c)} = (K^{(c)}(x_i, x_j))_{i,j} ∈ R^{n,n} and K = diag(K^{(c)})_c is block-diagonal. We refer to this setup as flat classification model. The b_c may be eliminated as b = σ^2 (I ⊗ 1^T) α. Thus, if K̃ = K + σ^2 (I ⊗ 1)(I ⊗ 1^T), then Φ becomes
Φ = Φ_lh + (1/2) α^T K̃ α, Φ_lh = -y^T u + 1^T l, l_i = log 1^T exp(u_i), u = K̃ α. (1)
Φ is strictly convex in α (because the likelihood is log-concave), so it has a unique minimum point α̂. The corresponding kernel expansions are û_c = Σ_i α̂_{i,c} (K^{(c)}(·, x_i) + σ^2). Estimates of the conditional probability on test points x_* are obtained by plugging û_c(x_*) into the likelihood. We note that this setup can also be seen as MAP approximation to a Bayesian model, where the f_c are given independent Gaussian process priors, e.g. [7]. 
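As a concrete illustration, the criterion of Eq. 1 can be evaluated in a few lines. This is a minimal sketch only, assuming a single kernel matrix shared across all classes and an (n, C) array layout for y and α; the names `penalized_nll` and `K_tilde` are ours, not from the paper.

```python
import numpy as np

def penalized_nll(alpha, K_tilde, Y):
    # Penalized negative log likelihood of Eq. 1 (flat model).
    # alpha   : (n, C) dual coefficients, one column per class
    # K_tilde : (n, n) kernel matrix, assumed shared across classes
    # Y       : (n, C) zero-one coded labels, each row sums to one
    U = K_tilde @ alpha                       # u = K~ alpha, laid out (n, C)
    # log-sum-exp per case, computed stably
    m = U.max(axis=1, keepdims=True)
    l = m[:, 0] + np.log(np.exp(U - m).sum(axis=1))
    phi_lh = -(Y * U).sum() + l.sum()         # -y^T u + 1^T l
    reg = 0.5 * np.sum(alpha * U)             # (1/2) alpha^T K~ alpha, reusing U
    return phi_lh + reg
```

At α = 0 all class functions vanish, so each case contributes log C, which gives a quick sanity check.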
It is also related to the multi-class SVM [2], where -log P(y_i | u_i) is replaced by the margin loss -u_{y_i}(x_i) + max_c {u_c(x_i) + 1 - δ_{c,y_i}}. The negative log multiple logistic likelihood has similar properties, but is smooth as a function of u, and the primary fitting of Φ does not require constrained convex optimization. We minimize Φ using the Newton-Raphson (NR) algorithm; the details are provided in Section 5. The complexity of our fitting algorithm is dominated by k1 (k2 + 2) matrix-vector multiplications with K̃, where k1 is the number of NR iterations, k2 the number of linear conjugate gradient (LCG) steps for computing each Newton direction. Since NR is a second-order convergent method, k1 can be chosen small. k2 determines the quality of each Newton direction; for both, fairly small values are sufficient (see Section 6.2).

3 Hierarchical Classification

So far we dealt with flat classification, the classes being independent a priori, with block-diagonal kernel matrix K. However, if the label set has a known structure^3, we can benefit from representing it in the model. Here we focus on hierarchical classification, the label set {1, . . . , C} being the leaf nodes of a tree. Classes with lower common ancestor should be more closely related. In this Section, we propose a model for this setup and show how it can be dealt with in our framework with minor modifications and minor extra cost. In flat classification, the latent class functions u_c are modelled as a priori independent, in that the regularizer (which plays the role of a log prior) is a sum of individual terms for each u_c, without any interaction terms.
^1 See www.kyb.tuebingen.mpg.de/bs/people/seeger/lhotse/.
^2 In Matlab, reshape(y,n,C) would give the matrix (y_{i,c}) ∈ R^{n,C}.
^3 Learning an unknown label set structure may be achieved by expectation maximization techniques, but this is subject to future work.
Analysis of variance (ANOVA) models go beyond this independent design; they have previously been applied to text classification by [1]. Let {0, . . . , P} be the nodes of the tree, 0 being the root, and the numbers are assigned breadth first (1, 2, . . . are the root's children). The tree is determined by P and n_p, p = 0, . . . , P, the number of children of node p. Let L be the set of leaf nodes, |L| = C. Assign a pair of latent functions u_p, ŭ_p to each node, except the root. The ŭ_p are assumed a priori independent, as in flat classification. u_p is the sum of the ŭ_{p'}, p' running over the nodes (including p) on the path from the root to p. The class functions to be fed into the likelihood are the u_{L(c)} of the leafs. This setup represents similarities conditioned on the hierarchy. For example, if leafs L(c), L(c') have the common parent p, then u_{L(c)} = u_p + ŭ_{L(c)}, u_{L(c')} = u_p + ŭ_{L(c')}, so the class functions share the effect u_p. Since regularization forces all independent effects ŭ_p to be smooth, the classes c, c' are urged to behave similarly a priori. Let u = (u_p(x_i))_{i,p}, ŭ = (ŭ_p(x_i))_{i,p} ∈ R^{nP}. The vectors are related as u = (Γ ⊗ I) ŭ, Γ ∈ {0, 1}^{P,P}. Importantly, Γ has a simple structure which allows MVM with Γ or Γ^T to be computed easily in O(P), without having to compute or store Γ explicitly. MVM with Γ is described in Algorithm 1, and MVM with Γ^T works in a similar manner [8]. Under the hierarchical model, the class functions u_{L(c)} are strongly dependent a priori. We may represent this prior coupling in our framework by simply plugging in the implied kernel matrix K:
K = (Γ_{L,·} ⊗ I) K̆ (Γ_{L,·}^T ⊗ I), (2)
where the inner K̆ is block-diagonal. K is not sparse and certainly not block-diagonal, but the important point is that we are still able to do kernel MVMs efficiently: pre- and postmultiplying by Γ_{L,·} is cheap, and K̆ is block-diagonal just as in the flat case. 
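To make the path-sum coupling concrete, here is a sketch of the mapping from independent node effects to accumulated node values for a single input point (the paper applies it to whole n-vectors per node). The breadth-first numbering, the list `num_children`, and the function name are our assumptions for illustration.

```python
import numpy as np

def path_sum_mvm(x, num_children):
    # Tree-coupling MVM: y[p-1] = sum of x over the root-to-p path.
    # x            : independent effects of the non-root nodes, x[p-1] for node p = 1..P
    # num_children : num_children[p] = n_p for nodes p = 0..P (0 for leaves)
    # Children of node p occupy consecutive numbers, assigned breadth first.
    P = len(x)                      # number of non-root nodes
    y = np.zeros(P + 1)             # y[0] is the root, fixed to 0
    s = 0
    for p in range(P + 1):
        n_p = num_children[p]
        if n_p > 0:                 # p is not a leaf
            for j in range(s + 1, s + n_p + 1):
                y[j] = y[p] + x[j - 1]   # parent accumulation plus own effect
            s += n_p
    return y[1:]                    # drop the root entry
```

For a root with children 1, 2 and node 1 with children 3, 4, the accumulated values are (x_1, x_2, x_1 + x_3, x_1 + x_4), i.e. siblings 3 and 4 share their parent's effect.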
We note that the step from flat to hierarchical classification requires minor modifications of existing code only. If code for representing a block-diagonal K is available, we can use it to represent the inner K̆, just replacing C by P. This simplicity carries through to the hyperparameter learning case (see Section 4). The cost of a kernel MVM is increased by a factor P/C < 2, which in most hierarchies in practice is close to 1.

Algorithm 1: Matrix-vector multiplication y = Γ x
  y ← (y_0), y_0 := 0. s := 0.
  for p = 0, . . . , P do
    if n_p > 0 (p not a leaf node) then
      Let J(p) = {s + 1, . . . , s + n_p}.
      y ← (y^T, y_p 1^T + x_{J(p)}^T)^T. s ← s + n_p.
    end if
  end for

However, it would be wrong to claim that hierarchical classification in general comes as cheap as flat classification. The subtle issue is that the primary fitting becomes more costly, precisely because there is more coupling between the variables. In the flat case, the Hessian of Φ is close to block-diagonal. The LCG algorithm to compute Newton directions converges quickly, because it nearly decomposes into C independent ones, and fewer NR steps are required (see Section 5). In the hierarchical case, both LCG and NR need more iterations to attain the same accuracy. In numerical mathematics, much work has been done to approximately decouple linear systems by preconditioning. In some of these strategies, knowledge about the structure of the system matrix (in our case: the hierarchy) can be used to drive preconditioning. An important point for future research is to find a good preconditioning strategy for the system of Eq. 5. However, in all our experiments so far the fitting of the hierarchical model took less than twice the time required for the flat model on the same task. Some further extensions, such as learning with incomplete label information, are discussed in [8].

4 Hyperparameter Learning

In any model of interest, there will be free hyperparameters h, for example parameters of the kernels K^{(c)}. 
These were assumed to be fixed in the primary fitting method introduced in Section 2. In this Section, we describe a scheme for learning h which makes use of the primary fitting algorithm as inner loop. Note that such nested strategies are commonplace in Bayesian statistics, where (marginal) inference is typically used as subroutine for parameter learning. Recall that primary fitting consists of minimizing Φ of Eq. 1 w.r.t. α. If we minimize w.r.t. h as well, we run into the problem of overfitting. A common remedy is to minimize the negative cross-validation log likelihood instead. Let {I_k} be a partition of {1, . . . , n}, with J_k = {1, . . . , n} \\ I_k, and let Φ_{J_k} = u_{[J_k]}^T ((1/2) α_{[J_k]} - y_{J_k}) + 1^T l_{[J_k]} be the primary criterion on the subset J_k of the data. Here, u_{[J_k]} = K̃_{J_k} α_{[J_k]}. The α_{[J_k]} are independent variables, not part of a common α. The CV criterion is
Ψ = Σ_k Ψ_{I_k}, Ψ_{I_k} = -y_{I_k}^T u_{[I_k]} + 1^T l_{[I_k]}, u_{[I_k]} = K̃_{I_k,J_k} α_{[J_k]}, (3)
where α_{[J_k]} minimizes Φ_{J_k}. Since for each k, we fit and evaluate on disjoint parts of y, Ψ is an unbiased estimator of the test negative log likelihood, and minimizing Ψ should be robust to overfitting. In order to select h, we pick a fixed partition at random, then do gradient-based minimization of Ψ w.r.t. h. To this end, we keep the set {α_{[J_k]}} of primary variables, and iterate between re-fitting those for each fold I_k, and computing Ψ and ∇_h Ψ. The latter can be determined analytically, requiring us to solve a linear system with the Hessian matrix I + V_{[J_k]}^T K̃_{J_k} V_{[J_k]} already encountered during primary fitting (see Section 5). This means that the same LCG code used to compute Newton directions there can be applied here in order to compute the gradient of Ψ. The details are given in Section 5. As for the complexity, suppose there are q folds. 
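For concreteness, the CV criterion Ψ of Eq. 3 can be sketched as a loop over the q folds; a minimal sketch, again assuming one shared kernel matrix and (·, C) array layouts, with `alphas[k]` taken as already fitted on fold J_k (the function and argument names are ours):

```python
import numpy as np

def cv_criterion(K_tilde, Y, folds, alphas):
    # Negative CV log likelihood Psi of Eq. 3.
    # K_tilde : (n, n) kernel matrix
    # Y       : (n, C) zero-one labels
    # folds   : list of index arrays I_k partitioning range(n)
    # alphas  : list of (|J_k|, C) dual coefficients fitted on J_k
    n = K_tilde.shape[0]
    psi = 0.0
    for I, alpha in zip(folds, alphas):
        J = np.setdiff1d(np.arange(n), I)     # training part of the fold
        U = K_tilde[np.ix_(I, J)] @ alpha     # u_[I] = K~_{I,J} alpha_[J]
        m = U.max(axis=1, keepdims=True)
        l = m[:, 0] + np.log(np.exp(U - m).sum(axis=1))
        psi += -(Y[I] * U).sum() + l.sum()    # -y_I^T u_[I] + 1^T l_[I]
    return psi
```

Each fold is scored only on held-out cases I_k, which is what makes Ψ an unbiased estimate of the test negative log likelihood.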
The update of the α_{[J_k]} requires q primary fitting applications, but since they are initialized with the previous values α_{[J_k]}, they do converge very rapidly, especially during later outer iterations. Computing Ψ based on the α_{[J_k]} comes basically for free. The gradient computation decomposes into two parts: accumulation, and kernel derivative MVMs. The accumulation part requires solving q systems of size ((q - 1)/q) nC, thus q k3 kernel MVMs on the K̃_{J_k} if linear conjugate gradients (LCG) is used, k3 being the number of LCG steps. We also need two buffer matrices E, F of q nC elements each. Note that the accumulation step is independent of the number of hyperparameters. The kernel derivative MVM part consists of q derivative MVM calls for each independent component of h, see Section 5.1. As opposed to the accumulation part, this part consists of a simple large matrix operation and can be run very efficiently using specialized numerical linear algebra code. As shown in Section 5, the extension of hyperparameter learning to the hierarchical case of Section 3 is simply done by wrapping the accumulation part, the coding and additional memory effort being minimal. Given a method for computing Ψ and ∇_h Ψ, we plug these into a custom optimizer such as Quasi-Newton in order to learn h.

5 Computational Details

In this Section, we provide details for the general plan laid out above. It is precisely these which characterize our framework and allow us to apply a standard model to domains beyond its usual applications, but of interest to Machine Learning. Recall Section 2. We minimize Φ by choosing search directions s, and doing line minimizations along α + λ s, λ > 0. For the latter, we maintain the pair (α, u), u = K̃ α. We have:
∇_u Φ_lh = π - y, π = exp(u - 1 ⊗ l), i.e. π_{i,c} = P(y_{i,c} = 1 | u_i). (4)
Given (α, u), Φ and ∇_u Φ_lh can be computed in O(nC), without requiring MVMs. 
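The O(nC) quantities of Eq. 4, together with the MVM with W = D - D P D needed below for the Newton system, can be sketched on (n, C)-shaped arrays; the layout and function names are our assumptions:

```python
import numpy as np

def softmax_pi(U):
    # pi of Eq. 4: pi[i, c] = P(y_ic = 1 | u_i), computed stably per case.
    m = U.max(axis=1, keepdims=True)
    E = np.exp(U - m)
    return E / E.sum(axis=1, keepdims=True)

def w_mvm(pi, x):
    # MVM with the likelihood Hessian W = D - D P D, without forming W:
    # (W x)[i, c] = pi[i, c] * x[i, c] - pi[i, c] * sum_c' pi[i, c'] * x[i, c'].
    dx = pi * x                          # D x
    s = dx.sum(axis=1, keepdims=True)    # per-case sums, i.e. (1^T kron I) D x
    return dx - pi * s                   # D x - D P D x
```

Since the probabilities of each case sum to one, W annihilates per-case constant vectors, a useful correctness check.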
This suggests to perform the line search in u, along the direction s̃ = K̃ s; the corresponding α can be constructed from the final λ. Since kernel MVMs are significantly more expensive than these O(nC) operations, the line searches basically come for free! We choose search directions by Newton-Raphson (NR)^4, since the Hessian of Φ is required anyway for hyperparameter learning. Let D = diag π, P = (1 ⊗ I)(1^T ⊗ I), and W = D - D P D. We have ∇∇_u Φ_lh = W, and g = ∇_u Φ_lh = π - y from Eq. 4. The NR system is (I + W K̃) α' = W u - g, with the NR direction being s = α' - α. If V = (I - D P) D^{1/2}, then W = V V^T and g = V D^{-1/2} g (using (1^T ⊗ I) g = 0), and we see that the equivalent symmetric system is
(I + V^T K̃ V) β = V^T u - D^{-1/2} (π - y), α' = V β, (5)
and we can obtain α' from β because (1^T ⊗ I) D (1 ⊗ I) = I. Note that P x = (Σ_{c'} x^{(c')})_c, so that MVM with V can be done in O(nC) (details are in [8]). The NR direction is obtained by solving this system approximately by the linear conjugate gradients (LCG) method, requiring a MVM with the system matrix in each iteration, thus a single MVM with K̃. Our implementation includes diagonal preconditioning and numerical stability safeguards [8]. The NR system need not be solved to high accuracy (see Section 6.2). Initially, β = D^{-1/2} α, because then V β = α if only (1^T ⊗ I) α = 0, which is true if the initial α fulfils it. We now show how to compute the gradient ∇_h Ψ of the CV criterion (Eq. 3). Note that α_{[J]} is determined by the stationary equation α_{[J]} + g_{[J]} = 0. Taking the derivative gives dα_{[J]} = -W_{[J]} ((dK̃_J) α_{[J]} + K̃_J (dα_{[J]})). We obtain a system for dα_{[J]} which is symmetrized as above: (I + V_{[J]}^T K̃_J V_{[J]}) β = -V_{[J]}^T (dK̃_J) α_{[J]}, dα_{[J]} = V_{[J]} β. Also, dΨ_I = (π_{[I]} - y_I)^T ((dK̃_{I,J}) α_{[J]} + K̃_{I,J} (dα_{[J]})).
^4 Initial experiments with conjugate gradients in α gave very slow convergence, due to poor conditioning, but experiments with a different dual criterion are in preparation.
With s = I_{·,I} (π_{[I]} - y_I) - I_{·,J} V_{[J]} (I + V_{[J]}^T K̃_J V_{[J]})^{-1} V_{[J]}^T K̃_{J,I} (π_{[I]} - y_I), we have that dΨ_I = (I_{·,J} α_{[J]})^T (dK̃) s. If we collect these vectors as columns of E, F ∈ R^{nC,q}, we have that dΨ = tr E^T (dK̃) F. In the hierarchical setup, we use Eq. 2: Ĕ = (Γ_{L,·}^T ⊗ I) E ∈ R^{nP,q}, F̆ accordingly, then dΨ = tr Ĕ^T (dK̆) F̆. Here, we build Ĕ, F̆ in the buffers allocated for E, F, then transform them later in place. We finally mention some of the computational \"tricks\", without which we could not have dealt with the largest tasks in Section 6.2 (for section B, a single nC vector requires 88M of memory). For the linear kernel (see Section 5.1), the main primitive A → X X^T A can be coded very efficiently using a standard sparse matrix format for X. If A is stored row-major (a_{1,1}, a_{1,2}, . . . ), the computation becomes faster by a factor of 4 to 6 compared to the standard column-major format^5. For hyperparameter learning, we work on subsets J_k and need MVMs with K̃_{J_k}. \"Covariance representation shuffling\" permutes the representation s.t. K̃_{J_k} sits in the upper left part, and MVM can use flat rather than indexed code, which is many times faster. We also share memory blocks of size nC between LCG, gradient accumulation, and line searches, keeping the overall memory requirements at r nC for a small constant r, and avoiding frequent reallocations.

5.1 Matrix-Vector Multiplication

MVM with K̃ is the bottleneck of our framework, and all efforts should be concentrated on this primitive. We can tap into much prior work in numerical mathematics. With many classes C, we may share kernels: K^{(c)} = v_c M^{(l_c)}, v_c > 0 variance parameters, M^{(l)} independent correlation functions. Our generic implementation stores two symmetric matrices M^{(l)} in a single n × n buffer. The linear kernel K^{(c)}(x, x') = v_c x^T x' is frequently used for text classification (see Section 6.2). 
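The linear-kernel primitive just mentioned can be sketched with a sparse data matrix; SciPy's sparse format stands in for the paper's custom representation, and `linear_kernel_mvm` is our name for the routine:

```python
import numpy as np
from scipy import sparse

def linear_kernel_mvm(X, A, v):
    # Linear-kernel MVM: column c of the result is v[c] * X X^T a_c,
    # without ever forming the n x n kernel matrix.
    # X : (n, d) scipy sparse data matrix (e.g. bag-of-words rows)
    # A : (n, C) dense matrix of vectors to multiply
    # v : (C,) per-class variance parameters v_c
    # Two sparse passes: cost O(nnz(X) * C) instead of O(n^2 * C).
    return (X @ (X.T @ A)) * v[np.newaxis, :]
```

This works even when the dimension d is far larger than n, since only the nonzeros of X are touched.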
If the data matrix X is sparse, kernel MVM can be done in much less than the generic O(C n^2), typically in O(C n), requiring O(n) storage for X only, even if the dimension of x is way beyond n. If the K^{(c)} are isotropic kernels (depending on ‖x - x'‖ only) and the x are low-dimensional, MVM with K^{(c)} can be approximated using specialized nearest neighbour data structures such as KD trees [12, 9]. Again, the MVM cost is typically O(C n) in this case. For general kernels whose kernel matrices have a rapidly decaying eigenspectrum, one can approximate MVM by using low-rank matrices instead of the K^{(c)} [10], whence MVM is O(C n d), d the rank. In Section 4 we also need MVM with the derivatives (∂/∂h_j) K^{(c)}. Note that (∂/∂ log v_c) K^{(c)} = K^{(c)}, reducing to kernel MVM. For isotropic kernels, K^{(c)} = f(A), a_{i,j} = ‖x_i - x_j‖, so (∂/∂h_j) K^{(c)} = g_j(A). If KD trees are used to approximate A, they can be used equivalently (and with little additional cost) for computing derivative MVMs.
^5 The innermost vector operations work on contiguous chunks of memory, rather than strided ones, thus supporting cacheing or vector functions of the processor.

6 Experiments

In this Section, we provide experimental results for our framework on data from remote sensing, and on a set of large text classification tasks with very many classes; the latter are hierarchical.

6.1 Flat Classification: Remote Sensing

We use the satimage remote sensing task from the statlog repository.^6 This task has been used in the extensive SVM multi-class study of [5], where it is among the datasets on which the different methods show the most variance. It has n = 4435 training, m = 2000 test cases, and C = 6 classes. We use the isotropic Gaussian (RBF) kernel
K^{(c)}(x, x') = v_c exp(-(w_c / (2d)) ‖x - x'‖^2), w_c > 0, x, x' ∈ R^d. (6)
We compare the methods mc-sep (ours with separate kernels for each class; 12 hyperparameters), mc-tied (ours with a single shared kernel; 2 hyperparameters), and 1rest (one-against-rest: C binary classifiers are trained separately to discriminate c from the rest, and are voted by log probability upon prediction; 12 hyperparameters). Note that 1rest is arguably the most efficient method which can be used for multi-class, because its binary classifiers can be fitted separately and in parallel. Even if run sequentially, 1rest requires less memory by a factor of C than a joint multi-class method. We use our 5-fold CV criterion for each method. Results here are averaged over ten randomly drawn 5-partitions of the training set (the same partitions are used for the different methods). The test error (in percent) of mc-sep is 7.81 vs. 8.01 for 1rest. The result for mc-sep is state-of-the-art; for example, the best SVM technique tested in [5] attained 7.65, and SVM one-against-rest attained 8.30 in this study. Note that while 1rest also may choose 12 independent kernel parameters, it does not make good use of this possibility, as opposed to mc-sep. mc-tied has test error 8.37, suggesting that tying kernels leads to significant degradation. ROC curves for the different methods are given in [8], showing that mc-sep also profits from estimating the predictive probabilities in a better way.

6.2 Hierarchical Classification: Patent Text Classification

We use the WIPO-alpha collection^7 previously studied in [1], where patents (title and claim text) are to be classified w.r.t. the standard taxonomy IPC, a tree with 4 levels and 5229 nodes. Sections A, B, . . . , H form the first level. As in [1], we concentrate on the 8 subtasks rooted at the sections, ranging from D (n = 1140, C = 160, P = 187) to B (n = 9794, C = 1172, P = 1319). We use linear kernels (see Section 5.1) with variance parameters v_c. 
All experiments are averaged over three training/test splits, different methods using the same ones. Ψ is used with a different 5-partition per section and split, the same across all methods. Our method outputs a predictive p_j ∈ R^C for each test case x_j. The standard prediction y(x_j) = argmax_c p_{j,c} maximizes expected accuracy; classes are ranked as r_j(c) ≤ r_j(c') iff p_{j,c} ≥ p_{j,c'}. The test scores are the same as in [1]: accuracy (acc) m^{-1} Σ_j I_{y(x_j)=y_j}, precision (prec) m^{-1} Σ_j r_j(y_j)^{-1}, parent accuracy (pacc) m^{-1} Σ_j I_{par(y(x_j))=par(y_j)}, par(c) being the parent of L(c). Let Δ(c, c') be half the length of the shortest path between leafs L(c), L(c'). The taxo-loss (taxo) is m^{-1} Σ_j Δ(y(x_j), y_j). These scores are motivated in [1]. For taxo-loss and parent accuracy, we better choose y(x_j) to minimize expected loss^8, different from the standard prediction. We compare methods F1, F2, H1, H2 (F: flat; H: hierarchical). F1: all v_c shared (1); H1: v_c shared across each level of the tree (3). F2, H2: v_c shared across each subtree rooted at the root's children (A: 15, B: 34, C: 17, D: 7, E: 7, F: 17, G: 12, H: 5). Recall that there are 3 accuracy parameters. For hyperparameter learning: k1 = 8, k2 = 4, k3 = 15 (F1, F2); k1 = 10, k2 = 4, k3 = 25 (H1, H2)^9.
^6 Available at http://www.niaad.liacc.up.pt/old/statlog/.
^7 Raw data from www.wipo.int/ibis/datasets. Label hierarchy described at www.wipo.int/classifications/en. Thanks to L. Cai, T. Hofmann for providing us with the count data and dictionary. We did Porter stemming, stop word removal, and removal of empty categories. The attributes are bag-of-words over the dictionary of occurring words. All cases x_i were scaled to unit norm.
^8 For parent accuracy, let p(j) be the node with maximal mass (under p_j) of its children which are leafs; then y(x_j) must be a child of p(j).
^9 Except for section C, where k1 = 14, k2 = 6, k3 = 35.

Table 1: Results on tasks A-H. Methods F1, F2 flat; H1, H2 hierarchical. taxo[0-1], pacc[0-1]: argmax_c p_{j,c} rule, rather than minimizing expected loss.

acc (%)        F1    H1    F2    H2
A            40.6  41.9  40.5  41.9
B            32.0  32.9  31.7  32.7
C            33.7  34.7  34.1  34.5
D            40.0  40.6  39.7  40.8
E            33.0  34.2  32.8  34.1
F            31.4  32.4  31.4  32.5
G            40.1  40.7  40.2  40.7
H            39.3  39.6  39.4  39.7

prec (%)       F1    H1    F2    H2
A            51.6  53.4  51.4  53.4
B            41.8  43.8  41.6  43.7
C            45.2  46.6  45.4  46.4
D            52.4  54.1  52.2  54.3
E            45.1  47.1  45.0  47.1
F            42.8  44.9  42.8  45.0
G            51.2  52.5  51.3  52.5
H            52.4  53.3  52.5  53.4

taxo           F1    H1    F2    H2
A            1.27  1.19  1.29  1.19
B            1.52  1.44  1.55  1.44
C            1.34  1.26  1.35  1.27
D            1.19  1.11  1.18  1.11
E            1.39  1.31  1.38  1.31
F            1.43  1.34  1.43  1.34
G            1.32  1.26  1.32  1.26
H            1.17  1.15  1.17  1.14

taxo[0-1]      F1    H1    F2    H2
A            1.28  1.19  1.29  1.18
B            1.54  1.44  1.56  1.44
C            1.33  1.26  1.32  1.26
D            1.20  1.12  1.22  1.12
E            1.43  1.33  1.44  1.34
F            1.43  1.34  1.44  1.34
G            1.32  1.26  1.32  1.26
H            1.19  1.16  1.19  1.15

pacc (%)       F1    H1    F2    H2
A            58.9  61.6  58.2  61.5
B            53.6  56.4  52.7  56.6
C            58.9  62.6  58.5  62.0
D            64.6  67.0  64.4  67.1
E            56.0  59.1  56.2  59.2
F            56.8  59.7  56.8  59.8
G            58.0  59.7  57.6  59.6
H            61.6  62.5  61.8  62.5

pacc[0-1] (%)  F1    H1    F2    H2
A            57.2  61.3  56.9  61.4
B            51.9  55.9  51.4  55.9
C            58.6  61.8  58.9  61.6
D            63.5  67.1  62.6  67.0
E            54.0  58.2  53.5  57.9
F            54.9  58.7  54.6  58.9
G            56.8  59.2  56.6  58.9
H            59.9  61.6  60.0  61.8

Table 2: Running times for tasks A-H. Method F1 flat, H1 hierarchical. CV Fold: re-optimization of α_{[J_k]} and gradient accumulation for a single fold.

        Final NR (s)     CV Fold (s)
          F1      H1      F1      H1
A       2030    3873     573     598
B       3751    8657     873    1720
C       4237    7422     719    1326
D       56.3   118.5    9.32    20.2
E      131.5   203.4    32.2    49.6
F       1202    2871     426     568
G       1342    2947     232     579
H      971.7    1052     146     230

For final fitting: k1 = 25, k2 = 12 (F1, F2); k1 = 30, k2 = 17 (H1, H2). The optimization is started from v_c = 5 for all methods. Results are given in Table 1. The hierarchical model outperforms the flat one consistently. 
While the differences in accuracy and precision are hardly significant (as also found in [1]), they (partly) are in taxo-loss and parent accuracy. Also, minimizing expected loss is consistently better than using the standard rule for the latter, although the differences are very small. H1 and H2 do not perform differently: choosing many different v_c in the linear kernel seems no advantage here (but see Section 6.1). The results are very similar to the ones of [1]. However, for our method, the recommendation in [1] to use v_c = 1 leads to significantly worse results in all scores; the v_c chosen by our methods are generally larger. In Table 2, we present running times^10 for the final fitting and for a single fold during hyperparameter optimization (5 of them are required for Ψ, ∇_h Ψ). Cai and Hofmann [1] quote a final fitting time of 2200s on the D section, while we require 119s (more than 18 times faster). It is precisely this high efficiency of primary fitting which allows us to use it as inner loop for hyperparameter learning.
^10 Processor time on 64bit 2.33GHz AMD machines.

7 Discussion

We presented a general framework for very efficient large scale kernel multi-way classification with structured label spaces and demonstrated its features on hierarchical text classification tasks with many classes. As shown for the hierarchical case, the framework is easily extended to novel structural priors or covariance functions, and while not shown here, it is also easy to extend it to different likelihoods (as long as they are log-concave). We solve the kernel parameter learning problem by optimizing the CV log likelihood, whose gradient can be computed within the framework. Our method provides estimates of the predictive distribution at test points, which may result in better predictions for non-standard losses or ROC curves. Efficient and easily extendable code is publicly available (see Section 1). An extension to multi-label classification is planned. 
More advanced label set structures can be addressed, noting that Hessian vector products can often be computed in about the same way as gradients. An application to label sequence learning is work in progress, which may even be combined with a hierarchical prior. Inferring a hierarchy from data is possible in principle, using expectation maximization techniques (note that the primary fitting can deal with target distributions y_i), as well as incorporating uncertain data. Empirical Bayesian methods or approximate CV scores for hyperparameter learning have been proposed in [11, 3, 6], but they are orders of magnitude more expensive than our proposal here, and do not apply to a massive number of classes. Many multi-class SVM techniques are available (see [2, 5] for references). Here, fitting is a constrained convex problem, and often fairly sparse solutions (many zeros in α) are found. However, if the degree of sparsity is not large, the first-order conditional gradient methods typically applied can be slow^11. SVM methods typically do not come with efficient automatic kernel parameter learning schemes, and they do not provide estimates of predictive probabilities which are asymptotically correct.

Acknowledgments

Thanks to Olivier Chapelle for many useful discussions. Supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.

References

[1] L. Cai and T. Hofmann. Hierarchical document categorization with support vector machines. In CIKM 13, pages 78-87, 2004.
[2] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265-292, 2001.
[3] P. Craven and G. Wahba. Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31:377-403, 1979.
[4] P. J. Green and B. Silverman. Nonparametric Regression and Generalized Linear Models. 
Monographs on Statistics and Probability. Chapman & Hall, 1994.
[5] C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13:415-425, 2002.
[6] Y. Qi, T. Minka, R. Picard, and Z. Ghahramani. Predictive automatic relevance determination by expectation propagation. In Proceedings of ICML 21, 2004.
[7] M. Seeger. Gaussian processes for machine learning. International Journal of Neural Systems, 14(2):69-106, 2004.
[8] M. Seeger. Cross-validation optimization for structured Hessian kernel methods. Technical report, Max Planck Institute for Biological Cybernetics, Tübingen, Germany, 2006. See www.kyb.tuebingen.mpg.de/bs/people/seeger.
[9] Y. Shen, A. Ng, and M. Seeger. Fast Gaussian process regression using KD-trees. In Advances in NIPS 18, 2006.
[10] A. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In Advances in NIPS 13, pages 619-625, 2001.
[11] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE PAMI, 20(12):1342-1351, 1998.
[12] C. Yang, R. Duraiswami, and L. Davis. Efficient kernel machines using the improved fast Gauss transform. In Advances in NIPS 17, pages 1561-1568, 2005.
^11 These methods solve a very large number of small problems iteratively, as opposed to ours which does few expensive Newton steps. The latter kind, if feasible at all, often makes better use of hardware features such as cacheing and vector operations, and therefore is the preferred approach in numerical optimization.
", "award": [], "sourceid": 3044, "authors": [{"given_name": "Matthias", "family_name": "Seeger", "institution": null}]}