Part of Advances in Neural Information Processing Systems 15 (NIPS 2002)

*Koby Crammer, Joseph Keshet, Yoram Singer*

The focus of the paper is the problem of learning kernel operators from empirical data. We cast the kernel design problem as the construction of an accurate kernel from simple (and less accurate) base kernels. We use the boosting paradigm to perform the kernel construction process. To do so, we modify the booster so as to accommodate kernel operators. We also devise an efﬁcient weak-learner for simple kernels that is based on generalized eigen vector decomposition. We demonstrate the effective- ness of our approach on synthetic data and on the USPS dataset. On the USPS dataset, the performance of the Perceptron algorithm with learned kernels is systematically better than a ﬁxed RBF kernel.

1 Introduction and problem Setting

The last decade brought voluminous amount of work on the design, analysis and experi- mentation of kernel machines. Algorithm based on kernels can be used for various ma- chine learning tasks such as classiﬁcation, regression, ranking, and principle component analysis. The most prominent learning algorithm that employs kernels is the Support Vec- tor Machines (SVM) [1, 2] designed for classiﬁcation and regression. A key component in a kernel machine is a kernel operator which computes for any pair of instances their inner-product in some abstract vector space. Intuitively and informally, a kernel operator is a means for measuring similarity between instances. Almost all of the work that em- ployed kernel operators concentrated on various machine learning problems that involved a predeﬁned kernel. A typical approach when using kernels is to choose a kernel before learning starts. Examples to popular predeﬁned kernels are the Radial Basis Functions and the polynomial kernels (see for instance [1]). Despite the simplicity required in modifying a learning algorithm to a “kernelized” version, the success of such algorithms is not well understood yet. More recently, special efforts have been devoted to crafting kernels for speciﬁc tasks such as text categorization [3] and protein classiﬁcation problems [4].

Our work attempts to give a computational alternative to predeﬁned kernels by learning kernel operators from data. We start with a few deﬁnitions. Let X be an instance space. . An explicit way to describe K A kernel is an inner-product operator K : X (cid:2) X ! is via a mapping (cid:30) : X ! H from X to an inner-products space H such that K(x; x0) = (cid:30)(x)(cid:1)(cid:30)(x0). Given a kernel operator and a ﬁnite set of instances S = fxi; yigm i=1, the kernel matrix (a.k.a the Gram matrix) is the matrix of all possible inner-products of pairs from S, Ki;j = K(xi; xj). We therefore refer to the general form of K as the kernel operator and to the application of the kernel operator to a set of pairs of instances as the kernel matrix.

The speciﬁc setting of kernel design we consider assumes that we have access to a base kernel learner and we are given a target kernel K ? manifested as a kernel ma- trix on a set of examples. Upon calling the base kernel learner it returns a kernel op- erator denote Kj. The goal thereafter is to ﬁnd a weighted combination of kernels

^K(x; x0) = Pj (cid:11)jKj(x; x0) that is similar, in a sense that will be deﬁned shortly, to the target kernel, ^K (cid:24) K ?. Cristianini et al. [5] in their pioneering work on kernel target alignment employed as the notion of similarity the inner-product between the kernel ma- trices < K; K 0 >F =Pm i;j=1 K(xi; xj)K 0(xi; xj). Given this deﬁnition, they deﬁned the kernel-similarity, or alignment, to be the above inner-product normalized by the norm of each kernel, ^A(S; ^K; K ?) = (cid:16)< ^K; K ? >F(cid:17) =q< ^K; ^K >F < K ?; K ? >F ; where S

is, as above, a ﬁnite sample of m instances. Put another way, the kernel alignment Cris- tianini et al. employed is the cosine of the angle between the kernel matrices where each matrix is “ﬂattened” into a vector of dimension m2. Therefore, this deﬁnition implies that the alignment is bounded above by 1 and can attain this value iff the two kernel matrices are identical. Given a (column) vector of m labels y where yi 2 f(cid:0)1; +1g is the label of the instance xi, Cristianini et al. used the outer-product of y as the the target kernel, K ? = yyT . Therefore, an optimal alignment is achieved if ^K(xi; xj) = yiyj. Clearly, if such a kernel is used for classifying instances from X , then the kernel itself sufﬁces to construct an excellent classiﬁer f : X ! f(cid:0)1; +1g by setting, f (x) = sign(yiK(xi; x)) where (xi; yi) is any instance-label pair. Cristianini et al. then devised a procedure that works with both labelled and unlabelled examples to ﬁnd a Gram matrix which attains a good alignment with K ? on the labelled part of the matrix. While this approach can clearly construct powerful kernels, a few problems arise from the notion of kernel alignment they employed. For instance, a kernel operator such that the sign(K(xi; xj)) is equal to yiyj but its magnitude, jK(xi; xj)j, is not necessarily 1, might achieve a poor alignment score while it can constitute a classiﬁer whose empirical loss is zero. Furthermore, the task of ﬁnding a good kernel when it is not always possible to ﬁnd a kernel whose sign on each pair of instances is equal to the products of the labels (termed the soft-margin case in [5, 6]) becomes rather tricky. We thus propose a different approach which attempts to overcome some of the difﬁculties above.

Like Cristianini et al. we assume that we are given a set of labelled instances S = f(xi; yi) j xi 2 X ; yi 2 f(cid:0)1; +1g; i = 1; : : : ; mg : We are also given a set of unlabelled examples ~S = f~xig ~m i=1. If such a set is not provided we can simply use the labelled in- stances (without the labels themselves) as the set ~S. The set ~S is used for constructing the primitive kernels that are combined to constitute the learned kernel ^K. The labelled set is used to form the target kernel matrix and its instances are used for evaluating the learned kernel ^K. This approach, known as transductive learning, was suggested in [5, 6] for kernel alignment tasks when the distribution of the instances in the test data is different from that of the training data. This setting becomes in particular handy in datasets where the test data was collected in a different scheme than the training data. We next discuss the notion of kernel goodness employed in this paper. This notion builds on the objective function that several variants of boosting algorithms maintain [7, 8]. We therefore ﬁrst discuss in brief the form of boosting algorithms for kernels.

2 Using Boosting to Combine Kernels

Numerous interpretations of AdaBoost and its variants cast the boosting process as a pro- cedure that attempts to minimize, or make small, a continuous bound on the classiﬁcation error (see for instance [9, 7] and the references therein). A recent work by Collins et al. [8] uniﬁes the boosting process for two popular loss functions, the exponential-loss (denoted henceforth as ExpLoss) and logarithmic-loss (denoted as LogLoss) that bound the empir-

Input: Labelled and unlabelled sets of examples: S = f(xi; yi)gm Initialize: K 0 (all zeros matrix) For t = 1; 2; : : : ; T :

i=1

Do not remove: This comment is monitored to verify that the site is working properly