Part of Advances in Neural Information Processing Systems 15 (NIPS 2002)
Koby Crammer, Joseph Keshet, Yoram Singer
The focus of the paper is the problem of learning kernel operators from empirical data. We cast the kernel design problem as the construction of an accurate kernel from simple (and less accurate) base kernels. We use the boosting paradigm to perform the kernel construction process. To do so, we modify the booster so as to accommodate kernel operators. We also devise an efficient weak-learner for simple kernels that is based on generalized eigenvector decomposition. We demonstrate the effectiveness of our approach on synthetic data and on the USPS dataset. On the USPS dataset, the performance of the Perceptron algorithm with learned kernels is systematically better than a fixed RBF kernel.
1 Introduction and Problem Setting
The last decade brought a voluminous amount of work on the design, analysis and experimentation of kernel machines. Algorithms based on kernels can be used for various machine learning tasks such as classification, regression, ranking, and principal component analysis. The most prominent learning algorithm that employs kernels is the Support Vector Machine (SVM) [1, 2], designed for classification and regression. A key component in a kernel machine is a kernel operator which computes, for any pair of instances, their inner-product in some abstract vector space. Intuitively and informally, a kernel operator is a means for measuring similarity between instances. Almost all of the work that employed kernel operators concentrated on various machine learning problems that involved a predefined kernel. A typical approach when using kernels is to choose a kernel before learning starts. Examples of popular predefined kernels are the Radial Basis Functions and the polynomial kernels (see for instance [1]). Despite the simplicity of modifying a learning algorithm into a “kernelized” version, the success of such algorithms is not yet well understood. More recently, special efforts have been devoted to crafting kernels for specific tasks such as text categorization [3] and protein classification problems [4].
Our work attempts to give a computational alternative to predefined kernels by learning kernel operators from data. We start with a few definitions. Let $\mathcal{X}$ be an instance space. A kernel is an inner-product operator $K : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$. An explicit way to describe $K$ is via a mapping $\phi : \mathcal{X} \rightarrow \mathcal{H}$ from $\mathcal{X}$ to an inner-product space $\mathcal{H}$ such that $K(x, x') = \phi(x) \cdot \phi(x')$. Given a kernel operator and a finite set of instances $S = \{x_i, y_i\}_{i=1}^m$, the kernel matrix (a.k.a. the Gram matrix) is the matrix of all possible inner-products of pairs from $S$, $K_{i,j} = K(x_i, x_j)$. We therefore refer to the general form of $K$ as the kernel operator and to the application of the kernel operator to a set of pairs of instances as the kernel matrix.
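To make the definitions above concrete, the following sketch (not from the paper) builds the Gram matrix of a small sample under an RBF kernel; the kernel choice, the bandwidth parameter `gamma`, and the variable names are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

def rbf_kernel(x, x_prime, gamma=1.0):
    # One popular predefined kernel: K(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

def gram_matrix(X, kernel):
    # Kernel (Gram) matrix on the sample: K[i, j] = K(x_i, x_j)
    m = X.shape[0]
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.random.randn(5, 3)        # five instances in R^3
K = gram_matrix(X, rbf_kernel)   # 5x5 kernel matrix
```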
The specific setting of kernel design we consider assumes that we have access to a base kernel learner and we are given a target kernel $K^\star$ manifested as a kernel matrix on a set of examples. Upon calling the base kernel learner, it returns a kernel operator denoted $K_j$. The goal thereafter is to find a weighted combination of kernels
$\hat{K}(x, x') = \sum_j \alpha_j K_j(x, x')$ that is similar, in a sense that will be defined shortly, to the target kernel, $\hat{K} \sim K^\star$. Cristianini et al. [5] in their pioneering work on kernel target alignment employed as the notion of similarity the inner-product between the kernel matrices, $\langle K, K' \rangle_F = \sum_{i,j=1}^m K(x_i, x_j) K'(x_i, x_j)$. Given this definition, they defined the kernel-similarity, or alignment, to be the above inner-product normalized by the norm of each kernel, $\hat{A}(S, \hat{K}, K^\star) = \langle \hat{K}, K^\star \rangle_F \big/ \sqrt{\langle \hat{K}, \hat{K} \rangle_F \, \langle K^\star, K^\star \rangle_F}$, where $S$
is, as above, a finite sample of $m$ instances. Put another way, the kernel alignment Cristianini et al. employed is the cosine of the angle between the kernel matrices, where each matrix is “flattened” into a vector of dimension $m^2$. Therefore, this definition implies that the alignment is bounded above by 1 and can attain this value iff the two kernel matrices are identical. Given a (column) vector of $m$ labels $y$, where $y_i \in \{-1, +1\}$ is the label of the instance $x_i$, Cristianini et al. used the outer-product of $y$ as the target kernel, $K^\star = y y^T$. Therefore, an optimal alignment is achieved if $\hat{K}(x_i, x_j) = y_i y_j$. Clearly, if such a kernel is used for classifying instances from $\mathcal{X}$, then the kernel itself suffices to construct an excellent classifier $f : \mathcal{X} \rightarrow \{-1, +1\}$ by setting $f(x) = \mathrm{sign}(y_i K(x_i, x))$, where $(x_i, y_i)$ is any instance-label pair. Cristianini et al. then devised a procedure that works with both labelled and unlabelled examples to find a Gram matrix which attains a good alignment with $K^\star$ on the labelled part of the matrix. While this approach can clearly construct powerful kernels, a few problems arise from the notion of kernel alignment they employed. For instance, a kernel operator such that $\mathrm{sign}(K(x_i, x_j))$ is equal to $y_i y_j$ but whose magnitude, $|K(x_i, x_j)|$, is not necessarily 1, might achieve a poor alignment score while it can constitute a classifier whose empirical loss is zero. Furthermore, the task of finding a good kernel when it is not always possible to find a kernel whose sign on each pair of instances is equal to the product of the labels (termed the soft-margin case in [5, 6]) becomes rather tricky. We thus propose a different approach which attempts to overcome some of the difficulties above.
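The alignment score of Cristianini et al. [5] is straightforward to compute once the kernel matrices are in hand. The sketch below (an illustration with hypothetical names, not code from [5]) treats each $m \times m$ matrix as a flattened $m^2$-dimensional vector and also constructs the label-based target kernel $K^\star = y y^T$.

```python
import numpy as np

def frobenius_inner(K1, K2):
    # <K1, K2>_F = sum_{i,j} K1[i, j] * K2[i, j]
    return np.sum(K1 * K2)

def alignment(K_hat, K_star):
    # Cosine of the angle between the two matrices viewed as m^2-dim vectors;
    # bounded above by 1, attained when the matrices coincide up to scaling.
    return frobenius_inner(K_hat, K_star) / np.sqrt(
        frobenius_inner(K_hat, K_hat) * frobenius_inner(K_star, K_star)
    )

y = np.array([+1, -1, +1, -1])    # labels y_i in {-1, +1}
K_star = np.outer(y, y)           # target kernel K* = y y^T
print(alignment(K_star, K_star))  # perfect alignment: 1.0
```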
Like Cristianini et al., we assume that we are given a set of labelled instances $S = \{(x_i, y_i) \mid x_i \in \mathcal{X},\ y_i \in \{-1, +1\},\ i = 1, \ldots, m\}$. We are also given a set of unlabelled examples $\tilde{S} = \{\tilde{x}_i\}_{i=1}^{\tilde{m}}$. If such a set is not provided, we can simply use the labelled instances (without the labels themselves) as the set $\tilde{S}$. The set $\tilde{S}$ is used for constructing the primitive kernels that are combined to constitute the learned kernel $\hat{K}$. The labelled set is used to form the target kernel matrix and its instances are used for evaluating the learned kernel $\hat{K}$. This approach, known as transductive learning, was suggested in [5, 6] for kernel alignment tasks when the distribution of the instances in the test data is different from that of the training data. This setting is particularly handy for datasets where the test data was collected in a different scheme than the training data. We next discuss the notion of kernel goodness employed in this paper. This notion builds on the objective function that several variants of boosting algorithms maintain [7, 8]. We therefore first discuss in brief the form of boosting algorithms for kernels.
2 Using Boosting to Combine Kernels
Numerous interpretations of AdaBoost and its variants cast the boosting process as a procedure that attempts to minimize, or make small, a continuous bound on the classification error (see for instance [9, 7] and the references therein). A recent work by Collins et al. [8] unifies the boosting process for two popular loss functions, the exponential-loss (denoted henceforth as ExpLoss) and logarithmic-loss (denoted as LogLoss), that bound the empirical classification error.
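For reference, a standard rendering of the two losses on a set of signed margins $z_i = y_i f(x_i)$ is sketched below; this is an assumed textbook form rather than a quotation from [8].

```python
import numpy as np

def exp_loss(margins):
    # ExpLoss = sum_i exp(-y_i f(x_i)); since exp(-z) >= 1 whenever z <= 0,
    # this quantity upper-bounds the number of training mistakes.
    return np.sum(np.exp(-margins))

def log_loss(margins):
    # LogLoss = sum_i log(1 + exp(-y_i f(x_i))); a smooth surrogate that
    # bounds the training error up to a constant factor.
    return np.sum(np.log(1.0 + np.exp(-margins)))
```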
Input: Labelled and unlabelled sets of examples: $S = \{(x_i, y_i)\}_{i=1}^m$; $\tilde{S} = \{\tilde{x}_i\}_{i=1}^{\tilde{m}}$
Initialize: $K \leftarrow 0$ (all-zeros matrix)
For $t = 1, 2, \ldots, T$:
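The algorithm box above is truncated in this excerpt. Purely as an illustration of the boosting-style combination $\hat{K} = \sum_j \alpha_j K_j$ described in Section 1, the following is a minimal, hypothetical loop that greedily adds base kernel matrices so as to reduce the exponential loss of the classifier induced by the combined kernel on the labelled sample. It is a sketch under assumed design choices (greedy coordinate selection, a fixed step size, a simple kernel-expansion classifier), not the paper's weak-learner or update rule.

```python
import numpy as np

def exp_loss_of_kernel(K_hat, y):
    # Classifier induced by the combined kernel: f(x_k) = sum_i y_i K_hat(x_i, x_k);
    # the margin of example k is y_k * f(x_k).
    margins = y * (K_hat @ y)
    return np.sum(np.exp(-margins))

def greedy_kernel_combination(base_kernels, y, T=10, step=0.1):
    # base_kernels: list of m x m base kernel (Gram) matrices K_j on the labelled sample
    m = len(y)
    K_hat = np.zeros((m, m))          # Initialize: K <- 0 (all-zeros matrix)
    alphas = np.zeros(len(base_kernels))
    for t in range(T):                # For t = 1, 2, ..., T
        # Pick the base kernel whose (fixed-step) addition most reduces the loss.
        losses = [exp_loss_of_kernel(K_hat + step * Kj, y) for Kj in base_kernels]
        j = int(np.argmin(losses))
        alphas[j] += step
        K_hat = K_hat + step * base_kernels[j]
    return alphas, K_hat
```

In this toy setting the base kernel matrices could come from any base kernel learner evaluated on the sample; the efficient weak-learner based on generalized eigenvector decomposition mentioned in the abstract is not reproduced here.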