{"title": "Kernel Design Using Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 553, "page_last": 560, "abstract": null, "full_text": "Kernel Design Using Boosting\n\nKoby Crammer\n\nJoseph Keshet Yoram Singer\n\nSchool of Computer Science & Engineering\n\nThe Hebrew University, Jerusalem 91904, Israel\n\nfkobics,jkeshet,singerg@cs.huji.ac.il\n\nAbstract\n\nThe focus of the paper is the problem of learning kernel operators from\nempirical data. We cast the kernel design problem as the construction of\nan accurate kernel from simple (and less accurate) base kernels. We use\nthe boosting paradigm to perform the kernel construction process. To do\nso, we modify the booster so as to accommodate kernel operators. We\nalso devise an ef\ufb01cient weak-learner for simple kernels that is based on\ngeneralized eigen vector decomposition. We demonstrate the effective-\nness of our approach on synthetic data and on the USPS dataset. On the\nUSPS dataset, the performance of the Perceptron algorithm with learned\nkernels is systematically better than a \ufb01xed RBF kernel.\n\n1 Introduction and problem Setting\n\nThe last decade brought voluminous amount of work on the design, analysis and experi-\nmentation of kernel machines. Algorithm based on kernels can be used for various ma-\nchine learning tasks such as classi\ufb01cation, regression, ranking, and principle component\nanalysis. The most prominent learning algorithm that employs kernels is the Support Vec-\ntor Machines (SVM) [1, 2] designed for classi\ufb01cation and regression. A key component\nin a kernel machine is a kernel operator which computes for any pair of instances their\ninner-product in some abstract vector space. Intuitively and informally, a kernel operator\nis a means for measuring similarity between instances. Almost all of the work that em-\nployed kernel operators concentrated on various machine learning problems that involved\na prede\ufb01ned kernel. 
A typical approach when using kernels is to choose a kernel before learning starts. Examples of popular predefined kernels are the Radial Basis Functions and the polynomial kernels (see for instance [1]). Despite the simplicity of modifying a learning algorithm into a "kernelized" version, the success of such algorithms is not yet well understood. More recently, special efforts have been devoted to crafting kernels for specific tasks such as text categorization [3] and protein classification problems [4].

Our work attempts to give a computational alternative to predefined kernels by learning kernel operators from data. We start with a few definitions. Let X be an instance space. A kernel is an inner-product operator K : X × X → R. An explicit way to describe K is via a mapping φ : X → H from X to an inner-product space H such that K(x, x') = φ(x)·φ(x'). Given a kernel operator and a finite set of instances S = {xi, yi}_{i=1}^m, the kernel matrix (a.k.a. the Gram matrix) is the matrix of all possible inner-products of pairs from S, Ki,j = K(xi, xj). We therefore refer to the general form of K as the kernel operator and to the application of the kernel operator to a set of pairs of instances as the kernel matrix.

The specific setting of kernel design we consider assumes that we have access to a base kernel learner and we are given a target kernel K* manifested as a kernel matrix on a set of examples. Upon calling the base kernel learner it returns a kernel operator denoted Kj. The goal thereafter is to find a weighted combination of kernels K̂(x, x') = Σ_j αj Kj(x, x') that is similar, in a sense that will be defined shortly, to the target kernel, K̂ ∼ K*. Cristianini et al. 
[5] in their pioneering work on kernel target alignment employed as the notion of similarity the inner-product between the kernel matrices, ⟨K, K'⟩_F = Σ_{i,j=1}^m K(xi, xj) K'(xi, xj). Given this definition, they defined the kernel-similarity, or alignment, to be the above inner-product normalized by the norm of each kernel,

Â(S, K̂, K*) = ⟨K̂, K*⟩_F / sqrt(⟨K̂, K̂⟩_F ⟨K*, K*⟩_F) ,

where S is, as above, a finite sample of m instances. Put another way, the kernel alignment Cristianini et al. employed is the cosine of the angle between the kernel matrices, where each matrix is "flattened" into a vector of dimension m². Therefore, this definition implies that the alignment is bounded above by 1 and can attain this value iff the two kernel matrices are identical. Given a (column) vector of m labels y, where yi ∈ {−1, +1} is the label of the instance xi, Cristianini et al. used the outer-product of y as the target kernel, K* = y y^T. Therefore, an optimal alignment is achieved if K̂(xi, xj) = yi yj. Clearly, if such a kernel is used for classifying instances from X, then the kernel itself suffices to construct an excellent classifier f : X → {−1, +1} by setting f(x) = sign(yi K(xi, x)), where (xi, yi) is any instance-label pair. Cristianini et al. then devised a procedure that works with both labelled and unlabelled examples to find a Gram matrix which attains a good alignment with K* on the labelled part of the matrix. While this approach can clearly construct powerful kernels, a few problems arise from the notion of kernel alignment they employed. For instance, a kernel operator for which sign(K(xi, xj)) equals yi yj but whose magnitude |K(xi, xj)| is not necessarily 1 might achieve a poor alignment score while constituting a classifier whose empirical loss is zero. 
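To make the alignment measure of [5] concrete, here is a minimal numpy sketch (the function names are ours, not from [5]); it computes the Frobenius inner-product between two kernel matrices and normalizes by their norms:

```python
import numpy as np

def frobenius_inner(K1, K2):
    """<K1, K2>_F = sum over i,j of K1[i,j] * K2[i,j]."""
    return float(np.sum(K1 * K2))

def alignment(K_hat, K_star):
    """Cosine of the angle between two kernel matrices, each
    viewed as a flattened vector of dimension m^2."""
    return frobenius_inner(K_hat, K_star) / np.sqrt(
        frobenius_inner(K_hat, K_hat) * frobenius_inner(K_star, K_star))

# With the ideal target K* = y y^T the alignment attains its maximum, 1
y = np.array([1.0, -1.0, 1.0])
K_star = np.outer(y, y)
print(alignment(K_star, K_star))  # 1.0
```

Note that the alignment is scale-invariant, which is exactly the property criticized above: scaling a kernel matrix leaves the score unchanged, while a sign-correct kernel of non-unit magnitude may score poorly.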
Furthermore, the task of finding a good kernel when it is not always possible to find a kernel whose sign on each pair of instances is equal to the product of the labels (termed the soft-margin case in [5, 6]) becomes rather tricky. We thus propose a different approach which attempts to overcome some of the difficulties above.

Like Cristianini et al. we assume that we are given a set of labelled instances S = {(xi, yi) | xi ∈ X, yi ∈ {−1, +1}, i = 1, ..., m}. We are also given a set of unlabelled examples S̃ = {x̃i}_{i=1}^{m̃}. If such a set is not provided we can simply use the labelled instances (without the labels themselves) as the set S̃. The set S̃ is used for constructing the primitive kernels that are combined to constitute the learned kernel K̂. The labelled set is used to form the target kernel matrix and its instances are used for evaluating the learned kernel K̂. This approach, known as transductive learning, was suggested in [5, 6] for kernel alignment tasks when the distribution of the instances in the test data is different from that of the training data. This setting is particularly handy for datasets in which the test data was collected in a different scheme than the training data. We next discuss the notion of kernel goodness employed in this paper. This notion builds on the objective function that several variants of boosting algorithms maintain [7, 8]. We therefore first discuss in brief the form of boosting algorithms for kernels.

2 Using Boosting to Combine Kernels

Numerous interpretations of AdaBoost and its variants cast the boosting process as a procedure that attempts to minimize, or make small, a continuous bound on the classification error (see for instance [9, 7] and the references therein). A recent work by Collins et al. 
[8] unifies the boosting process for two popular loss functions, the exponential-loss (denoted henceforth as ExpLoss) and the logarithmic-loss (denoted as LogLoss), that bound the empirical classification error. Given the prediction of a classifier f on an instance x and a label y ∈ {−1, +1}, the ExpLoss and the LogLoss are defined as,

ExpLoss(f(x), y) = exp(−y f(x))
LogLoss(f(x), y) = log(1 + exp(−y f(x))) .

Input: Labelled and unlabelled sets of examples: S = {(xi, yi)}_{i=1}^m ; S̃ = {x̃i}_{i=1}^{m̃}
Initialize: K ← 0 (all-zeros matrix)
For t = 1, 2, ..., T:
  • Calculate distribution over pairs 1 ≤ i, j ≤ m:
      Dt(i, j) = exp(−yi yj K(xi, xj))                (ExpLoss)
      Dt(i, j) = 1/(1 + exp(−yi yj K(xi, xj)))        (LogLoss)
  • Call base-kernel-learner with (Dt, S, S̃) and receive Kt
  • Calculate:
      S+_t = {(i, j) | yi yj Kt(xi, xj) > 0} ;  S−_t = {(i, j) | yi yj Kt(xi, xj) < 0}
      W+_t = Σ_{(i,j) ∈ S+_t} Dt(i, j) |Kt(xi, xj)| ;  W−_t = Σ_{(i,j) ∈ S−_t} Dt(i, j) |Kt(xi, xj)|
  • Set: αt = (1/2) ln(W+_t / W−_t) ;  K ← K + αt Kt
Return: kernel operator K : X × X → R

Figure 1: The skeleton of the boosting algorithm for kernels.

Collins et al. described a single algorithm for the two losses above that can be used within the boosting framework to construct a strong-hypothesis which is a classifier f(x). This classifier is a weighted combination of (possibly very simple) base classifiers. (In the boosting framework, the base classifiers are referred to as weak-hypotheses.) The strong-hypothesis is of the form f(x) = Σ_{t=1}^T αt ht(x). Collins et al. discussed a few ways to select the weak-hypotheses ht and to find a good set of weights αt. Our starting point in this paper is the first sequential algorithm from [8] that enables the construction or creation of weak-hypotheses on-the-fly. 
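For concreteness, the skeleton in Fig. 1 can be sketched in numpy as follows. This is a rough illustration, not the authors' implementation: `base_learner` stands for any callback that, given the current distribution, returns a base kernel matrix evaluated on the training set.

```python
import numpy as np

def boost_kernel(X, y, base_learner, T=30, loss="LogLoss"):
    """Sketch of the booster in Fig. 1: maintain a distribution over
    pairs of instances and accumulate K <- K + alpha_t * K_t."""
    m = len(y)
    K = np.zeros((m, m))              # learned kernel matrix, starts at zero
    yy = np.outer(y, y)               # yy[i, j] = y_i * y_j
    for t in range(T):
        margins = yy * K              # pairwise "margins" y_i y_j K(x_i, x_j)
        if loss == "ExpLoss":
            D = np.exp(-margins)
        else:                         # LogLoss
            D = 1.0 / (1.0 + np.exp(-margins))
        Kt = base_learner(D, X)       # base kernel matrix w.r.t. D
        signed = yy * Kt
        W_plus = np.sum(D[signed > 0] * np.abs(Kt[signed > 0]))
        W_minus = np.sum(D[signed < 0] * np.abs(Kt[signed < 0]))
        if W_plus == 0 or W_minus == 0:
            break                     # degenerate base kernel; stop early
        alpha = 0.5 * np.log(W_plus / W_minus)
        K += alpha * Kt
    return K
```

A trivial usage is to pass `lambda D, X_: X_ @ X_.T` as the base learner, which ignores the distribution and always proposes the linear kernel; the proper base learner of Section 3 uses D to pick a data-dependent rank-one kernel.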
We would like to note, however, that it is possible to use other variants of boosting to design kernels.

In order to use boosting to design kernels we extend the algorithm to operate over pairs of instances. Building on the notion of alignment from [5, 6], we say that the inner-product of x1 and x2 is aligned with the labels y1 and y2 if sign(K(x1, x2)) = y1 y2. Furthermore, we would like the magnitude of K(x, x') to be as large as possible. We therefore use one of the following two alignment losses for a pair of examples (x1, y1) and (x2, y2),

ExpLoss(K(x1, x2), y1 y2) = exp(−y1 y2 K(x1, x2))
LogLoss(K(x1, x2), y1 y2) = log(1 + exp(−y1 y2 K(x1, x2))) .

Put another way, we view a pair of instances as a single example and cast the pairs of instances that attain the same label as positively labelled examples, while pairs of opposite labels are cast as negatively labelled examples. Clearly, this approach can be applied to both losses. In the boosting process we therefore maintain a distribution over pairs of instances. The weight of each pair reflects how difficult it is to predict whether the labels of the two instances are the same or different. The core boosting algorithm follows similar lines to boosting algorithms for classification. The pseudo-code of the booster is given in Fig. 1. The pseudo-code is an adaptation to the problem of kernel design of the sequential-update algorithm from [8]. As with other boosting algorithms, the base-learner, which in our case is in charge of returning a good kernel with respect to the current distribution, is left unspecified. We therefore turn our attention to the algorithmic implementation of the base-learning algorithm for kernels.

3 Learning Base Kernels

Input: A distribution Dt. 
Labelled and unlabelled sets: S = {(xi, yi)}_{i=1}^m ; S̃ = {x̃i}_{i=1}^{m̃}.

The base kernel learner is provided with a training set S and a distribution Dt over pairs of instances from the training set. It is also provided with a set of unlabelled examples S̃. Without any knowledge of the topology of the space of instances a learning algorithm is likely to fail. Therefore, we assume the existence of an initial inner-product over the input space. We assume for now that this initial inner-product is the standard scalar product over vectors in R^n. We later discuss a way to relax the assumption on the form of the inner-product. Equipped with an inner-product, we define the family of base kernels to be the possible outer-products Kw = w w^T between a vector w ∈ R^n and itself. Using this definition we get, Kw(xi, xj) = (xi·w)(xj·w). Therefore, the similarity between two instances xi and xj is high iff both xi and xj are similar (w.r.t. the standard inner-product) to a third vector w. Analogously, if both xi and xj seem to be dissimilar to the vector w then they are similar to each other. Despite the restrictive form of the inner-products, this family is still too rich for our setting and we further impose two restrictions on the inner-products. First, we assume that w is restricted to a linear combination of vectors from S̃. Second, since scaling of the base kernels is performed by the booster, we constrain the norm of w to be 1. The resulting class of kernels is therefore, C = {Kw = w w^T | w = Σ_{r=1}^{m̃} βr x̃r, ||w|| = 1}. In the boosting process we need to choose a specific base-kernel Kw from C. We therefore need to devise a notion of how good a candidate base kernel is, given a labelled set S and a distribution function Dt. In this work we use the simplest version suggested by Collins et al. This version can be viewed as a linear approximation of the loss function. 
We define the score of a kernel Kw w.r.t. the current distribution Dt to be,

Score(Kw) = Σ_{i,j} Dt(i, j) yi yj Kw(xi, xj) .   (1)

Compute:
  A ∈ R^{m×m̃} ,  Ai,r = xi · x̃r
  B ∈ R^{m×m} ,  Bi,j = Dt(i, j) yi yj
  K ∈ R^{m̃×m̃} ,  Kr,s = x̃r · x̃s
• Find the generalized eigenvector v ∈ R^{m̃} for the problem A^T B A v = λ K v which attains the largest eigenvalue λ
• Set: w = (Σr vr x̃r) / ||Σr vr x̃r||.
Return: kernel operator Kw = w w^T.

Figure 2: The base kernel learning algorithm.

The higher the value of the score, the better Kw fits the training data. Note that if Dt(i, j) = 1/m² (as is D0) then Score(Kw) is proportional to the alignment, since ||w|| = 1. Under mild assumptions the score can also provide a lower bound on the loss function. To see this, let c be the derivative of the loss function at margin zero, c = |Loss'(0)|. If all the training examples xi ∈ S lie in a ball of radius sqrt(c), we get that Loss(Kw(xi, xj), yi yj) ≥ 1 − c Kw(xi, xj) yi yj ≥ 0, and therefore,

Σ_{i,j} Dt(i, j) Loss(Kw(xi, xj), yi yj) ≥ 1 − c Σ_{i,j} Dt(i, j) Kw(xi, xj) yi yj .

Using the explicit form of Kw in the Score function (Eq. (1)) we get, Score(Kw) = Σ_{i,j} D(i, j) yi yj (w·xi)(w·xj). Further developing the above equation using the constraint w = Σ_{r=1}^{m̃} βr x̃r we get,

Score(Kw) = Σ_{r,s} βs βr Σ_{i,j} D(i, j) yi yj (xi · x̃r)(xj · x̃s) .

To compute the base kernel score efficiently, without an explicit enumeration, we exploit the fact that if the initial distribution D0 is symmetric (D0(i, j) = D0(j, i)) then all the distributions generated along the run of the boosting process, Dt, are also symmetric. We now define a matrix A ∈ R^{m×m̃} where Ai,r = xi · x̃r and a symmetric matrix B ∈ R^{m×m} with Bi,j = Dt(i, j) yi yj. 
Simple algebraic manipulations yield that the score function can be written as the following quadratic form, Score(β) = β^T (A^T B A) β, where β is an m̃-dimensional column vector. Note that since B is symmetric, so is A^T B A. Finding a good base kernel is equivalent to finding a vector β which maximizes this quadratic form under the norm equality constraint ||w||² = ||Σ_{r=1}^{m̃} βr x̃r||² = β^T K β = 1, where Kr,s = x̃r · x̃s. Finding the maximum of Score(β) subject to the norm constraint is a well-known maximization problem known as the generalized eigenvector problem (cf. [10]). Applying simple algebraic manipulations it is easy to show that the matrix A^T B A is positive semi-definite. Assuming that the matrix K is invertible, the vector β which maximizes the quadratic form is proportional to the eigenvector of K^{−1} A^T B A which is associated with the largest generalized eigenvalue. Denoting this vector by v we get that w ∝ Σ_{r=1}^{m̃} vr x̃r. Adding the norm constraint we get that w = (Σ_{r=1}^{m̃} vr x̃r) / ||Σ_{r=1}^{m̃} vr x̃r||. The skeleton of the algorithm for finding a base kernel is given in Fig. 2. To conclude the description of the kernel learning algorithm we describe how to extend the algorithm to be employed with general kernel functions.

Kernelizing the Kernel: As described above, we assumed that the standard scalar-product constitutes the template for the class of base-kernels C. However, since the procedure for choosing a base kernel depends on S and S̃ only through the inner-products matrix A, we can replace the scalar-product itself with a general kernel operator κ : X × X → R, where κ(xi, xj) = φ(xi)·φ(xj). Using a general kernel function κ we cannot, however, compute the vector w explicitly. 
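The generalized eigenvector step of the base learner (Fig. 2) can be sketched as follows, assuming scipy is available. This is an illustration under our own assumptions, not the authors' code: the function name is ours, and the small ridge added to K is our addition to keep the generalized problem well-posed when K is singular.

```python
import numpy as np
from scipy.linalg import eigh

def base_kernel_learner(D, X, X_tilde, y):
    """Sketch of Fig. 2: find beta maximizing beta^T (A^T B A) beta
    subject to beta^T K beta = 1, and return the resulting unit vector w."""
    A = X @ X_tilde.T                    # A[i, r] = x_i . x~_r
    B = D * np.outer(y, y)               # B[i, j] = D(i, j) y_i y_j
    K = X_tilde @ X_tilde.T              # K[r, s] = x~_r . x~_s
    M = A.T @ B @ A                      # quadratic-form matrix, symmetric
    K = K + 1e-10 * np.eye(K.shape[0])   # ridge (our assumption) if K is singular
    # generalized symmetric eigenproblem M v = lambda K v;
    # eigh returns eigenvalues in ascending order, so take the last vector
    vals, vecs = eigh(M, K)
    v = vecs[:, -1]
    w = X_tilde.T @ v                    # w = sum_r v_r x~_r
    return w / np.linalg.norm(w)         # base kernel is K_w(x, x') = (w.x)(w.x')
```

With explicit vectors the returned w can be used directly; in the kernelized variant discussed next, one keeps v and evaluates Kw through κ instead.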
We therefore need to show that the norm of w, and the evaluation of Kw on any two examples, can still be performed efficiently. First note that given the vector v we can compute the norm of w as follows,

||w||² = (Σr vr x̃r)^T (Σs vs x̃s) = Σ_{r,s} vr vs κ(x̃r, x̃s) .

Next, given two vectors xi and xj the value of their inner-product is,

Kw(xi, xj) = Σ_{r,s} vr vs κ(xi, x̃r) κ(xj, x̃s) .

Therefore, although we cannot compute the vector w explicitly we can still compute its norm and evaluate any of the kernels from the class C.

4 Experiments

Synthetic data: We generated binary-labelled data using as input space the vectors in R^100. The labels, in {−1, +1}, were picked uniformly at random. Let y designate the label of a particular example. Then, the first two components of each instance were drawn from a two-dimensional normal distribution, N(μ, Δ Σ Δ^{−1}), with the following parameters,

μ = y (0.03, 0.03)^T ,  Δ = (1/√2) [ 1 −1 ; 1 1 ] ,  Σ = [ 0.1 0 ; 0 0.01 ] .

That is, the label of each example determined the mean of the distribution from which the first two components were generated. The rest of the components in the vector (98

Figure 3: Results on a toy data set prior to learning a kernel (first and third from left) and after learning (second and fourth). 
For each of the two settings we show the first two components of the training data (left) and the matrix of inner-products between the train and the test data (right).

altogether) were generated independently using the normal distribution with a zero mean and a standard deviation of 0.05. We generated 100 training and test sets of size 300 and 200, respectively. We used the standard dot-product as the initial kernel operator.

In each experiment we first learned a linear classifier that separates the classes using the Perceptron [11] algorithm. We ran the algorithm for 10 epochs on the training set. After each epoch we evaluated the performance of the current classifier on the test set. We then used the boosting algorithm for kernels with the LogLoss for 30 rounds to build a kernel for each random training set. After learning the kernel we re-trained a classifier with the Perceptron algorithm and recorded the results. A summary of the online performance is given in Fig. 4. The plot on the left-hand side of the figure shows the instantaneous error (achieved during the run of the algorithm). Clearly, the Perceptron algorithm with the learned kernel converges much faster than with the original kernel. The middle plot shows the test error after each epoch. The plot on the right shows the test error on a noisy test set in which we added Gaussian noise of zero mean and a standard deviation of 0.03 to the first two features. In all plots, each bar indicates a 95% confidence level. It is clear from the figure that the original kernel is much slower to converge than the learned kernel. Furthermore, though the kernel learning algorithm was not exposed to the test set noise, the learned kernel better reflects the structure of the feature space, which makes it more robust to noise.

Fig. 3 further illustrates the benefits of using a boutique kernel. 
The first and third plots from the left correspond to results obtained using the original kernel and the second and fourth plots show results using the learned kernel. The left plots show the empirical distribution of the two informative components on the test data. For the learned kernel we took each input vector and projected it onto the two eigenvectors of the learned kernel operator matrix that correspond to the two largest eigenvalues. Note that the distribution after the projection is bimodal and well separated along the first eigen-direction (x-axis) and shows rather little deviation along the second eigen-direction (y-axis). This indicates that the kernel learning algorithm indeed found the most informative projection for separating the labelled data with a large margin. It is worth noting that, in this particular setting, any algorithm which chooses a single feature at a time is prone to failure, since both the first and second features are mandatory for correctly classifying the data.

The two plots on the right-hand side of Fig. 3 use a gray-level color-map to designate the value of the inner-product between each pair of instances, one from the training set (y-axis) and the other from the test set. The examples were ordered such that the first group consists of the positively labelled instances while the second group consists of the negatively labelled instances. Since most of the features are non-relevant, the original inner-products are noisy and do not exhibit any structure. In contrast, the inner-products using the learned kernel yield a 2 × 2 block matrix indicating that the inner-products between instances sharing the same label obtain large positive values. 
Similarly, for instances of opposite labels the inner-products are large and negative. The form of the inner-products matrix of the learned kernel indicates that the learning problem itself becomes much easier. Indeed, the Perceptron algorithm with the standard kernel required around 94 training examples on average before converging to a hyperplane which perfectly separates the training data, while the Perceptron algorithm with the learned kernel required a single example to reach a perfect separation on all 100 random training sets.

Figure 4: The online training error (left) and test error (middle) on clean synthetic data using a standard kernel and a learned kernel. Right: the online test error for the two kernels on a noisy test set.

USPS dataset: The USPS (US Postal Service) dataset is known as a challenging classification problem in which the training set and the test set were collected in a different manner. The USPS contains 7,291 training examples and 2,007 test examples. Each example is represented as a 16 × 16 matrix where each entry in the matrix is a pixel that can take values in {0, ..., 255}. Each example is associated with a label in {0, ..., 9} which is the digit content of the image. Since the kernel learning algorithm is designed for binary problems, we broke the 10-class problem into 45 binary problems by comparing all pairs of classes. 
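The one-vs-one decomposition described above is mechanical; a minimal sketch (the helper names are hypothetical, not from the paper):

```python
from itertools import combinations

# All one-vs-one tasks for the 10 USPS digit classes: each unordered
# pair of distinct digits defines one binary problem.
digit_pairs = list(combinations(range(10), 2))
print(len(digit_pairs))  # 45 binary problems

def binary_subset(labels, pos, neg):
    """Indices of the examples belonging to one binary task,
    together with their +/-1 targets (pos -> +1, neg -> -1)."""
    idx = [i for i, c in enumerate(labels) if c in (pos, neg)]
    y = [1 if labels[i] == pos else -1 for i in idx]
    return idx, y
```

Each of the 45 subsets is then handed to the (binary) kernel learning algorithm independently.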
The interesting question of how to learn kernels for multiclass problems is beyond the scope of this short paper. We thus concentrate on the binary error results for the 45 binary problems described above. For the original kernel we chose an RBF kernel with σ = 1, which is the value employed in the experiments reported in [12]. We used the kernelized version of the kernel design algorithm to learn a different kernel operator for each of the binary problems. We then used a variant of the Perceptron [11] with the original RBF kernel and with the learned kernels. One of the motivations for using the Perceptron is its simplicity, which can underscore differences between the kernels. We ran the kernel learning algorithm with LogLoss and ExpLoss, using both the training set and the test set as S̃. Thus, we obtained four different sets of kernels where each set consists of 45 kernels. By examining the training loss, we set the number of rounds of boosting to be 30 for the LogLoss and 50 for the ExpLoss when using the training set. When using the test set, the number of rounds of boosting was set to 100 for both losses. Since the algorithm exhibits a slower rate of convergence with the test data, we chose a higher value without attempting to optimize the actual value. The left plot of Fig. 5 is a scatter plot comparing the test error of each of the binary classifiers when trained with the original RBF kernel versus the performance achieved on the same binary problem with a learned kernel. The kernels were built using boosting with the LogLoss, and S̃ was the training data. In almost all of the 45 binary classification problems, the learned kernels yielded lower error rates when combined with the Perceptron algorithm. The right plot of Fig. 5 compares two learned kernels: the first was built using the training instances as the templates constituting S̃ while the second used the test instances. 
Although the difference between the two versions is not as significant as the difference on the left plot, we still achieve an overall improvement in about 25% of the binary problems by using the test instances.

Figure 5: Left: a scatter plot comparing the error rate of 45 binary classifiers trained using an RBF kernel (x-axis) and a learned kernel built from training instances. Right: a similar scatter plot comparing a learned kernel built from training instances (x-axis) and one built from test instances.

5 Discussion

In this paper we showed how to use the boosting framework to design kernels. Our approach is especially appealing in transductive learning tasks where the test data distribution is different from the distribution of the training data. For example, in speech recognition tasks the training data is often clean and well recorded while the test data often passes through a noisy channel that distorts the signal. An interesting and challenging question that stems from this research is how to extend the framework to accommodate more complex decision tasks such as multiclass and regression problems. Finally, we would like to note that alternative approaches to the kernel design problem have been devised in parallel and independently. See [13, 14] for further details.

Acknowledgements: Special thanks to Cyril Goutte and to John Shawe-Taylor for pointing out the connection to the generalized eigenvector problem. Thanks also to the anonymous reviewers for constructive comments.

References
[1] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[2] N. Cristianini and J. Shawe-Taylor. 
An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[3] Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002.
[4] C. Leslie, E. Eskin, and W. Stafford Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing, 2002.
[5] Nello Cristianini, Andre Elisseeff, John Shawe-Taylor, and Jaz Kandola. On kernel target alignment. In Advances in Neural Information Processing Systems 14, 2001.
[6] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semi-definite programming. In Proc. of the 19th Intl. Conf. on Machine Learning, 2002.
[7] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–374, April 2000.
[8] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 47(2/3):253–285, 2002.
[9] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999.
[10] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
[11] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
[12] B. Schölkopf, S. Mika, C.J.C. Burges, P. Knirsch, K. Müller, G. Rätsch, and A.J. Smola. Input space vs. feature space in kernel-based methods. IEEE Trans. on NN, 10(5):1000–1017, 1999.
[13] O. Bousquet and D.J.L. Herrmann. On the complexity of learning the kernel matrix. NIPS, 2002.
[14] C.S. Ong, A.J. Smola, and R.C. 
Williamson. Hyperkernels. NIPS, 2002.
", "award": [], "sourceid": 2202, "authors": [{"given_name": "Koby", "family_name": "Crammer", "institution": null}, {"given_name": "Joseph", "family_name": "Keshet", "institution": null}, {"given_name": "Yoram", "family_name": "Singer", "institution": null}]}