{"title": "Convex Methods for Transduction", "book": "Advances in Neural Information Processing Systems", "page_first": 73, "page_last": 80, "abstract": "", "full_text": "Convex Methods for Transduction\n\nTijl De Bie\n\nNello Cristianini\n\nESAT-SCD/SISTA, K.U.Leuven\n\nDepartment of Statistics, U.C.Davis\n\nKasteelpark Arenberg 10\n\n3001 Leuven, Belgium\n\n360 Kerr Hall One Shields Ave.\n\nDavis, CA-95616\n\ntijl.debie@esat.kuleuven.ac.be\n\nnello@support-vector.net\n\nAbstract\n\nThe 2-class transduction problem, as formulated by Vapnik [1],\ninvolves \ufb01nding a separating hyperplane for a labelled data set\nthat is also maximally distant from a given set of unlabelled test\npoints. In this form, the problem has exponential computational\ncomplexity in the size of the working set. So far it has been attacked\nby means of integer programming techniques [2] that do not scale\nto reasonable problem sizes, or by local search procedures [3].\nIn this paper we present a relaxation of this task based on semi-\nde\ufb01nite programming (SDP), resulting in a convex optimization\nproblem that has polynomial complexity in the size of the data set.\nThe results are very encouraging for mid sized data sets, however\nthe cost is still too high for large scale problems, due to the high di-\nmensional search space. To this end, we restrict the feasible region\nby introducing an approximation based on solving an eigenproblem.\nWith this approximation, the computational cost of the algorithm\nis such that problems with more than 1000 points can be treated.\n\n1 Introduction\n\nThe general transduction task is the following: given a training set of labelled data,\nand a working set of unlabelled data (also called transduction samples), estimate the\nvalue of a classi\ufb01cation function at the given points in the working set. 
Statistical learning results [1] suggest that this setting should deliver better results than the traditional 'inductive' setting, where a function needs to be learned first and only later tested on a test set of points chosen after the learning has been completed. Different algorithms have been proposed so far to take advantage of this advance knowledge of the test points (such as in [1], [2], [3], [4], [5], [6] and others).

Given this general task, much research has concentrated on a specific approach to transduction (first proposed by Vapnik [1]), based on the use of Support Vector Machines (SVMs). In this case, the algorithm is aimed at finding a separating hyperplane for the training set that is also maximally distant from the (unlabelled) working set. This hyperplane is used to predict the labels for the working set points.

In this form, the problem has exponential computational complexity, and several approaches have been attempted to solve it. Generally they involve some form of local search [3], or integer programming methods [2].

A recent development of convex optimization theory is Semi Definite Programming (SDP), a branch of that field aimed at optimizing over the cone of semi positive definite (SPD) matrices. One of its main attractions is that it has proven successful in constructing tight convex relaxations of hard combinatorial optimization problems [7]. SDP has recently been applied successfully to machine learning problems [8].

In this paper we show how to relax the problem of transduction into an SDP problem, which can then be solved by (polynomial time) convex optimization methods. Empirical results on mid-sized data sets are very promising; however, due to the dimensionality of the feasible region of the relaxed parameters, the algorithm complexity still appears too large to tackle large scale problems. Therefore, we subsequently shrink the feasible region by making an approximation that is based on a spectral clustering method. Positive empirical results will be given.

Formal definition: transductive SVM. Based on the dual of the 1-norm soft margin SVM with zero bias1, the dual formulation of the transductive SVM optimization problem can be written as a minimization of the dual SVM cost function (which is the inverse margin plus training errors) over the label matrix Γ ([1], p. 437):

    min_Γ  max_α  2α′e − α′(K ⊙ Γ)α          (1)
    s.t.   C ≥ α_i ≥ 0                       (2)
           Γ = (yt; yw)(yt; yw)′             (3)
           yw_i ∈ {1, −1}                    (4)

The (symmetric) matrix Γ is thus parameterized by the unknown working set label vector yw ∈ {−1, 1}^nw (with nw the size of the working set). The vector yt ∈ {−1, 1}^nt (with nt the number of training points) is the given fixed vector containing the known labels for the training points. The (symmetric) matrix K ∈ ℝ^((nw+nt)×(nw+nt)) is the entire kernel matrix on the training set together with the working set. The dual vector is denoted by α ∈ ℝ^(nw+nt), and e is a vector of appropriate size containing all ones. The symbol ⊙ represents the elementwise matrix product. It is clear that this is a combinatorial problem: the computational complexity scales exponentially in the size of the working set.

Further notation. Scalars are lower case; vectors boldface lower case; matrices boldface upper case. The unit matrix is denoted by I. A pseudo-inverse is denoted with a superscript †, a transpose with a ′.
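To make the combinatorial nature of problem (1)-(4) concrete, the following sketch (our own illustration, not the authors' code) brute-forces the working set labels on a tiny toy problem; the RBF kernel, the value of C, and the projected-gradient inner solver for the box-constrained dual are all illustrative assumptions. For each of the 2^nw candidate label vectors it maximizes the dual objective and keeps the labeling with the smallest optimum.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two well-separated clusters. The first 4 points are the labelled
# training set (2 positive, 2 negative); the remaining 6 form the working set.
pos = rng.normal(+3.0, 0.25, (5, 2))
neg = rng.normal(-3.0, 0.25, (5, 2))
X = np.vstack([pos[:2], neg[:2], pos[2:], neg[2:]])
y_t = np.array([1, 1, -1, -1])
n_w = len(X) - len(y_t)

K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 2.0)  # RBF kernel matrix
e = np.ones(len(X))
C = 10.0

def dual_optimum(y_w):
    """max_a 2a'e - a'(K o Gamma)a s.t. 0 <= a_i <= C, by projected gradient."""
    y = np.concatenate([y_t, y_w])
    Q = K * np.outer(y, y)                 # K o Gamma with Gamma = yy' as in (3)
    a = np.zeros(len(X))
    for _ in range(5000):
        a = np.clip(a + 0.02 * (2 * e - 2 * Q @ a), 0.0, C)
    return 2 * a @ e - a @ Q @ a

# Exhaustive minimization over all 2^{n_w} candidate working-set labelings.
labelings = [np.array(s) for s in itertools.product([1, -1], repeat=n_w)]
best = min(labelings, key=dual_optimum)
print(len(labelings), best)
```

Already at nw = 6 the search visits 64 labelings; every additional working point doubles the cost, which is exactly what the SDP relaxation below avoids.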
For ease of notation, the training part of the label matrix (and thus also of the kernel matrix) is always assumed to be its upper nt × nt block (as is assumed already in (3)). Furthermore, the nt+ positive training samples are assumed to correspond to the first entries of yt, the nt− negative samples being at the end of this vector.

2 Relaxation to an SDP problem

In this section, we will gradually derive a relaxed version of the transductive SVM formulation. To start with, we replace some of the constraints by an equivalent set:

Proposition 2.1 (3) and (4) are equivalent with the following set of constraints:

    [Γ]_(i,j ∈ {1:nt, 1:nt}) = yt_i yt_j     (5)
    diag(Γ) = e                              (6)
    rank(Γ) = 1                              (7)

The values of Γ will then indeed be equal to 1 or −1. It is basically the rank constraint that makes the resulting constrained optimization problem combinatorial. Note that these constraints imply that Γ is semi positive definite (SPD): Γ ⪰ 0 (this follows trivially from (3), or from (6) together with (7)). Now, in the literature (see e.g. [7]) it is observed that such an SPD rank one constraint can often be relaxed to only the SPD constraint without sacrificing too much of the performance. Furthermore:

Proposition 2.2 If we relax the constraints by replacing (7) with

    Γ ⪰ 0,                                   (8)

the optimization problem becomes convex.

This follows from the fact that Γ appears linearly in the cost function, and that the constraints (2), (5), (6) and (8) consist of only linear equalities and linear (matrix) inequalities in the variables.

1We do not include a bias term since this would make the problem non-convex. However, this does not impair the result, as is explained in [9].
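Proposition 2.1 is easy to verify numerically: any valid rank-one label matrix built as in (3) satisfies constraints (5)-(7), and these indeed imply Γ ⪰ 0. A small sketch of ours (random labels; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
n_t, n_w = 4, 6

y_t = rng.choice([-1.0, 1.0], n_t)
y_w = rng.choice([-1.0, 1.0], n_w)
y = np.concatenate([y_t, y_w])
G = np.outer(y, y)                      # Gamma as in constraint (3)

print(np.allclose(G[:n_t, :n_t], np.outer(y_t, y_t)))   # (5): training block
print(np.allclose(np.diag(G), 1.0))                     # (6): unit diagonal
print(np.linalg.matrix_rank(G) == 1)                    # (7): rank one
print(np.linalg.eigvalsh(G).min() >= -1e-10)            # hence (8): G is SPD
```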
Further on it will be shown to be an SDP problem.

While this relaxation of the rank constraint makes the optimization problem convex, the result will not be a rank one matrix anymore; it will only provide an approximation for the optimal rank one matrix. Thus the values of Γ will not be equal to 1 or −1 anymore. However, it is well known that:

Lemma 2.1 A principal submatrix of an SPD matrix is also SPD [10].

By applying this lemma to all 2 × 2 principal submatrices of Γ, it is shown that

Corollary 2.1 From constraints (6) and (8) follows: −1 ≤ [Γ]_(i,j) ≤ 1.

This is the problem we will solve here: optimize (1) subject to (2), (5), (6) and (8). In the remainder of this section we will reformulate the optimization problem into a standard form of SDP, make further simplifications based on the problem structure, and show how to extract an approximation for the labels from the result.

2.1 Formulation as a standard SDP problem

In the derivations in this subsection the equality constraints (5) and (6) will not be stated for brevity. Their consequences will be treated further in the paper. Furthermore, in the implementation, they will be enforced explicitly by the parameterization, so they will not appear as constraints in the optimization problem. Also the SPD constraint (8) is not written every time; it should be understood. Let 2ν ≥ 0 be the Lagrange dual variables corresponding to the constraint α_i ≥ 0, and 2µ ≥ 0 those corresponding to the constraint α_i ≤ C.
Then, since the problem is convex and thus the minimization and maximization are exchangeable (strong duality, see [8] for a brief introduction to duality), the optimization problem is equivalent with:

    min_(Γ, ν≥0, µ≥0)  max_α  2α′(e + ν − µ) − α′(K ⊙ Γ)α + 2Cµ′e

In case K ⊙ Γ is rank deficient, (e + ν − µ) will be orthogonal to the null space of K ⊙ Γ (otherwise the objective function could grow to infinity, whereas ν and µ on the contrary are minimizing the objective). The maximum over α is then reached for α = (K ⊙ Γ)†(e + ν − µ). Substituting this in the objective function gives:

    min_(Γ, ν≥0, µ≥0)  (e + ν − µ)′(K ⊙ Γ)†(e + ν − µ) + 2Cµ′e

or equivalently:

    min_(Γ, ν≥0, µ≥0, t)  t
    s.t.  t ≥ (e + ν − µ)′(K ⊙ Γ)†(e + ν − µ) + 2Cµ′e,

with as additional constraint that (e + ν − µ) is orthogonal to the null space of K ⊙ Γ. This latter constraint and the quadratic constraint can be reformulated as one SPD constraint thanks to the following extension of the Schur complement lemma [10] (the proof is omitted due to space restrictions):

Lemma 2.2 (Extended Schur complement lemma) For symmetric A ⪰ 0 and C ≻ 0:

    { the column space of B ⊥ the null space of A,  and  C ⪰ B′A†B }
    ⇔
    [ A    B
      B′   C ] ⪰ 0.

Indeed, applying this lemma to our problem with A = K ⊙ Γ, B = e + ν − µ and C = t − 2Cµ′e leads to the problem formulation in the standard SDP form:

    min_(Γ, ν≥0, µ≥0, t)  t                                   (9)
    s.t.  [ K ⊙ Γ          e + ν − µ
            (e + ν − µ)′   t − 2Cµ′e ] ⪰ 0                    (10)

together with the constraints (5), (6) and (8). The relaxation for the hard margin SVM is found by following a very similar derivation, or by just equating µ to 0.

The number of variables specifying Γ and the size of constraint (8) can be greatly reduced due to structure in the problem. This is the subject of what follows now.

2.2 Simplifications due to the problem structure

The matrix Γ can be parameterized as

    Γ = [ ytyt′   Γc
          Γc′     Γw ]

where we have a training block ytyt′ ∈ ℝ^(nt×nt), cross blocks Γc ∈ ℝ^(nt×nw) and Γc′, and a transduction block Γw ∈ ℝ^(nw×nw), which is a symmetric matrix with diagonal entries equal to 1. We now use Lemma 2.1: by choosing a submatrix that contains all rows and columns corresponding to the training block, and just one row and column corresponding to the transduction part, the SPD constraint on Γ is seen to imply that

    [ ytyt′   γc_i
      γc_i′   1    ] ⪰ 0

where γc_i represents the ith column of Γc. Using the extended Schur complement lemma 2.2, it follows that γc_i is proportional to yt (denoted by γc_i = g_i yt), and 1 ⪰ γc_i′ (ytyt′)† γc_i. This implies that 1 ≥ g_i yt′ (ytyt′/‖yt‖^4) yt g_i = g_i², such that −1 ≤ g_i ≤ 1.
(Note that this is a corollary of the SPD constraint and does not need to be imposed explicitly.) Thus, the parameterization of Γ can be reduced to:

    Γ = [ ytyt′   ytg′
          gyt′    Γw  ]    with [Γw]_ii = 1,

where g is the vector with g_i as its ith entry. We can now show that:

Proposition 2.3 The constraint Γ ⪰ 0 is equivalent to (and can thus be replaced by) the following SPD constraint on a smaller matrix Γ̃:

    Γ̃ = [ 1   g′
          g   Γw ] ⪰ 0.

Since Γ̃ is a principal submatrix of Γ (assuming at least one training label is equal to 1), Lemma 2.1 indeed shows that Γ ⪰ 0 implies Γ̃ ⪰ 0. On the other hand, note that by adding a column and corresponding row to Γ̃ as in the parameterization above, the rank is not increased; thus, an eigenvalue equal to 0 is added. Due to the interlacing property for bordered matrices [10] and the fact that Γ̃ ⪰ 0, we know this can only be the smallest eigenvalue of the resulting matrix. By induction this shows that also Γ̃ ⪰ 0 implies Γ ⪰ 0.

This is the final formulation of the problem. For the soft margin case, the number of parameters is now 1 + 2nt + (nw² + 5nw)/2. For the hard margin case, this is 1 + nt + (nw² + 3nw)/2.

2.3 Extraction of an estimate for the labels from Γ

In general, the optimal Γ will of course not be rank one. We can approximate it by a rank one matrix however, by taking g as an approximation for the labels optimizing the unrelaxed problem.
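Proposition 2.3 above can be sanity-checked numerically: building Γ from (yt, g, Γw), it is SPD exactly when the much smaller Γ̃ is. A sketch of ours (random instance; numpy assumed), exploiting that Γ = PΓ̃P′ for P = [[yt, 0], [0, I]]:

```python
import numpy as np

rng = np.random.default_rng(3)
n_t, n_w = 4, 6

# A random SPD Gamma-tilde with unit diagonal, split into g and Gamma_w.
A = rng.normal(size=(1 + n_w, 1 + n_w))
S = A @ A.T
s = 1.0 / np.sqrt(np.diag(S))
Gt = s[:, None] * S * s[None, :]          # Gamma-tilde: SPD, unit diagonal
g, Gw = Gt[1:, 0], Gt[1:, 1:]

y_t = rng.choice([-1.0, 1.0], n_t)

def full_gamma(g, Gw):
    """Gamma = [[yt yt', yt g'], [g yt', Gw]] = P Gt P' with P = [[yt, 0], [0, I]]."""
    top = np.hstack([np.outer(y_t, y_t), np.outer(y_t, g)])
    bot = np.hstack([np.outer(g, y_t), Gw])
    return np.vstack([top, bot])

print(np.linalg.eigvalsh(full_gamma(g, Gw)).min())       # >= 0: Gt SPD => Gamma SPD

# Conversely, an indefinite Gamma-tilde yields an indefinite Gamma.
g_bad = g.copy()
g_bad[0] = 2.0                            # violates -1 <= g_i <= 1
print(np.linalg.eigvalsh(full_gamma(g_bad, Gw)).min())   # strictly negative
```

Since P has full column rank, positive semidefiniteness transfers in both directions, which is exactly the content of the proposition.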
This is the approach we adopt: a thresholded value of the entries of g will be taken as a guess for the labels of the working set.

Note that the minimum of the relaxed problem is always smaller than or equal to the minimum of the unrelaxed problem. Furthermore, the minimum of the unrelaxed problem is smaller than or equal to the value achieved by the thresholded relaxed labels. Thus, we obtain a lower and an upper bound for the true optimal cost.

2.4 Remarks

The performance of this method is very good, as is seen on a toy problem (figure 1 shows an illustrative example). However, due to the (even though polynomial) complexity of SDP, in combination with the quadratic dependence of the number of variables on the number of transduction points2, problems with more than about 1000 training samples and 100 transduction samples cannot practically be solved with general purpose SDP algorithms. Especially the limitation on the working set is a drawback, since the advantage of transduction becomes apparent especially for a working set that is large as compared to the number of training samples. This makes the applicability of this approach to large real life problems rather limited.

3 Subspace SDP formulation

However, if we knew a subspace (spanned by the d columns of a matrix V ∈ ℝ^((nt+nw)×d)) in which (or close to which) the label vector lies, we could restrict the feasible region for Γ, leading to a much more efficient algorithm. In the next section a fast method to estimate such a space V will be provided. In this section we assume V is known, and explain how to reduce the feasible region.

If we know that the true label vector y lies in the column space of a matrix V, we know the true label matrix can be written in the form Γ = VMV′, with M a symmetric matrix. The number of parameters is now only d(d + 1)/2.
Furthermore, constraint (8) that Γ ⪰ 0 is then equivalent to M ⪰ 0, which is a cheaper constraint. Note however that in practical cases, the true label vector will not lie within but only close to the subspace spanned by the columns of V. Then the diagonal of the label matrix Γ cannot always be made exactly equal to e as required by (6). We thus relax this constraint to the requirement that the diagonal is not larger than e. Similarly, the block in the label matrix corresponding to the training samples may not contain 1's and −1's exactly (constraint (5)). However, the better V is chosen, the better this constraint will be met. Thus we optimize (9) subject to (10) together with three constraints that replace the constraints (5), (6) and (8):

    Γ = VMV′
    diag(Γ) ≤ e
    M ⪰ 0

2The worst case complexity for the problem at hand is O((nt + nw²)²(nt + nw)^2.5), which is of order 6.5 in the number of transduction points nw.

Figure 1: The left picture shows 10 labelled samples represented by a 'o' or a '+', depending on their class, together with 60 unlabelled samples represented by a '·'. The middle picture shows the labels for the working set as estimated using the SDP method before thresholding: all are already invisibly close to 1 or −1. The right picture shows contour lines of the classification surface obtained by training an SVM using all labels as found by the SDP method. The method clearly finds a visually good label assignment that takes cluster structure in the data into account.

Thus we can approximate the relaxed transductive SVM using this reduced parameterization for Γ.
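A numerical sketch of this reduced parameterization (our own illustration; numpy assumed): with Γ = VMV′ and M ⪰ 0, positive semidefiniteness of Γ comes for free, only d(d + 1)/2 parameters describe the label matrix, and the relaxed diagonal constraint can be met by rescaling M.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 50, 3                    # n = n_t + n_w points, d-dimensional subspace

V = np.linalg.qr(rng.normal(size=(n, d)))[0]   # orthonormal basis of the subspace
R = rng.normal(size=(d, d))
M = R @ R.T                                    # any SPD d x d matrix

Gamma = V @ M @ V.T
Gamma /= np.diag(Gamma).max()                  # rescale so that diag(Gamma) <= e

print(np.linalg.eigvalsh(Gamma).min())         # >= 0 with no extra constraint
print(np.diag(Gamma).max())                    # exactly 1, so diag(Gamma) <= e
print(d * (d + 1) // 2, "free parameters instead of", n * (n + 1) // 2)
```

Note that rescaling Γ by a scalar is the same as rescaling M, so the form Γ = VMV′ is preserved.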
The number of effective variables is now only a linear function of nw: 1 + nt + nw + d(d + 1)/2 for a hard margin and 1 + 2(nt + nw) + d(d + 1)/2 for a soft margin SVM. Furthermore, one of the SPD constraints is now a constraint on a d × d matrix instead of a potentially large (nw + 1) × (nw + 1) matrix. For a constant d, the worst case complexity is thus reduced to O((nt + nw)^4.5).

The quality of the approximation can be determined by the user: the number of components d can be chosen depending on the available computing resources; however, empirical results show a good performance already for relatively small d.

4 Spectral transduction to find the subspace

In this section we will discuss how to find a subspace V close to which the label vector will lie. Our approach is based on the spectral clustering algorithm proposed in [11]. They start with computing the eigenvectors corresponding to the largest eigenvalues of D^(−1/2)KD^(−1/2), where d = Ke contains all row sums of K, and D = diag(d). The dominant eigenvectors are shown to reflect the cluster structure of the data. The optimization problem corresponding to this eigenvalue problem is:

    max_v  v′D^(−1/2)KD^(−1/2)v = v′K̃v    s.t. v′v = 1.    (11)

4.1 Constrained spectral clustering

We could apply this algorithm to the kernel matrix K, but we can do more, since we already know some of the labels: we will constrain the estimates of the labels
for the training samples that are known to be in the same class to be equal to each other. Then we optimize the same objective function with respect to these additional constraints. This can be achieved by choosing the following parameterization for v:

    v = [ e_(nt+)/√nt+   0              0
          0              e_(nt−)/√nt−   0
          0              0              I  ] · [ h_(t+) ; h_(t−) ; h_w ]  =  Lh

where e_(nt+) and e_(nt−) denote the vectors containing nt+ (the number of positive training samples) and nt− (the number of negative training samples) ones. Then:

Proposition 4.1 Optimization problem (11) is equivalent with:

    max_h  h′L′D^(−1/2)KD^(−1/2)Lh    s.t. h′h = 1

which corresponds to the eigenvalue problem L′D^(−1/2)KD^(−1/2)Lh = λh. Then v is found as v = Lh.

This is an extension of spectral clustering towards transduction3. We will use a subscript i to denote the ith eigenvector and eigenvalue, where λi ≥ λj for i < j.

4.2 Spectral transduction provides a good V

By construction, all entries of vi corresponding to positive training samples will be equal to h_(t+,i)/√nt+; entries corresponding to the negative ones will all be equal to h_(t−,i)/√nt−. Furthermore, as in spectral clustering, the other entries of vectors vi with large eigenvalue λi will reflect the cluster structure of the entire data set, while respecting the label assignment of the training points4. This means that such a vi will provide a good approximation for the labels. More specifically, the label vector will lie close to the column space of V, having d dominant 'centered' vi as its columns; the larger d, the better the approximation. The way we 'center' vi is by adding a constant so that entries for positive training samples become equal to minus those for the negative ones.
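The constrained spectral step is cheap to implement. The sketch below (our own toy reconstruction with numpy; the data, RBF kernel, and the rule for selecting among the two dominant constrained eigenvectors are illustrative assumptions) builds L, solves the reduced eigenproblem of Proposition 4.1, centers the result as described above, and thresholds it to read off working set labels.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data ordered as the paper assumes: first the n_t+ positive training
# points, then the n_t- negative ones, then the working set.
pos = rng.normal(+3.0, 0.25, (6, 2))
neg = rng.normal(-3.0, 0.25, (6, 2))
X = np.vstack([pos[:2], neg[:2], pos[2:], neg[2:]])
ntp, ntn, n_w = 2, 2, 8

K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 2.0)  # RBF kernel
d = K @ np.ones(len(X))
Dm12 = np.diag(1.0 / np.sqrt(d))
Kt = Dm12 @ K @ Dm12                       # D^{-1/2} K D^{-1/2}

# L ties together the label estimates of same-class training points.
L = np.zeros((len(X), 2 + n_w))
L[:ntp, 0] = 1.0 / np.sqrt(ntp)
L[ntp:ntp + ntn, 1] = 1.0 / np.sqrt(ntn)
L[ntp + ntn:, 2:] = np.eye(n_w)

lam, H = np.linalg.eigh(L.T @ Kt @ L)      # eigenproblem of Proposition 4.1

# Center the two dominant constrained eigenvectors, keep the one separating
# the training classes most strongly (a selection heuristic of ours), and
# threshold it as a label guess.
best = None
for k in (-1, -2):
    v = L @ H[:, k]
    v = v - (v[0] + v[ntp]) / 2.0          # 'centering' step from the text
    v = v * np.sign(v[0])                  # orient: positive class -> positive
    if best is None or abs(v[0]) > abs(best[0]):
        best = v

labels_w = np.sign(best[ntp + ntn:]).astype(int)
print(labels_w)                            # working set: 4 positives, 4 negatives
```

The full method of the paper would feed the centered eigenvectors into the reduced SDP of Section 3 rather than threshold them directly; the direct threshold is shown here only to make the cluster-revealing behaviour of the constrained eigenvectors visible.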
Since then the first nt columns of the resulting Γ = VMV′ will be equal up to a sign, we can adopt basically the same approach as in section 2.3 to guess the labels: pick and threshold the first column of Γ.

5 Empirical results

To show the potential of the method, we extracted data from the USPS data set to form two classes. The positive class is formed by 100 randomly chosen samples representing a number 0, and 100 representing a 1; the negative class by 100 samples representing a 2 and 100 representing a 3. Thus, we have a balanced classification problem with two classes of 200 samples each. The training set is chosen to contain only 10 samples from each of the two classes, and is randomly drawn but evenly distributed over the 4 numbers. We used a hard margin SVM with an RBF kernel with σ = 7 (which is equal to the average distance of the samples to their nearest neighbors, verified to be a good value for the inductive as well as for the transductive case). The average ROC-score (area under the ROC-curve) over 10 randomizations is computed, giving 0.75 ± 0.03 as average for the inductive SVM, and 0.959 ± 0.03 for the method developed in this paper (we chose d = 4).

3We want to point out that the spectral transduction on its own is empirically observed to significantly improve over standard spectral clustering algorithms, and compares favorably with a recently proposed [5] extension of spectral clustering towards transduction. Furthermore, as also in [5], the method can be generalized towards a method for clustering with side-information (where side-information consists of sets of points that are known to be co-clustered). Space restrictions do not permit us to go into this in the current paper.

4Note: to reduce the influence from outliers, large entries of the vi can be thresholded.
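For reference, the ROC-score used in these experiments (the area under the ROC curve) can be computed directly from real-valued scores and true labels via its rank-statistic form; a minimal sketch of ours (toy labels and scores, numpy assumed):

```python
import numpy as np

def roc_score(y_true, scores):
    """AUC: probability that a random positive outranks a random negative."""
    p = scores[y_true == 1]
    n = scores[y_true == -1]
    greater = (p[:, None] > n[None, :]).mean()   # correctly ranked pairs
    ties = (p[:, None] == n[None, :]).mean()     # tied pairs count half
    return greater + 0.5 * ties

y = np.array([1, 1, 1, -1, -1, -1])
s = np.array([0.9, 0.8, 0.3, 0.7, 0.2, 0.1])
print(roc_score(y, s))   # 8 of the 9 positive/negative pairs are ranked correctly
```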
To illustrate the scalability of the method, and to show that a larger working set is effectively exploited, we used a similar setting (same training set size) but with 1000 samples and d = 3, giving an average ROC-score of 0.993 ± 0.004.

6 Conclusions

We developed a relaxation for the transductive SVM as first proposed by Vapnik. It is shown how this combinatorial problem can be relaxed to an SDP problem.

Unfortunately, the number of variables in combination with the complexity of SDP is too high for it to scale to significant problem sizes. Therefore we show how, based on a new spectral method, the feasible region of the variables can be shrunk, leading to an approximation for the original SDP method. The complexity of the resulting algorithm is much more favorable. Positive empirical results are shown.

Acknowledgement

Tijl De Bie is a Research Assistant with the Fund for Scientific Research – Flanders (F.W.O.–Vlaanderen).

References

[1] V. N. Vapnik. Statistical Learning Theory. Springer, 1998.

[2] K. Bennett and A. Demiriz. Semi-supervised support vector machines. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, Cambridge, MA, 1999. MIT Press.

[3] T. Joachims. Transductive inference for text classification using support vector machines. In Proc. of the International Conference on Machine Learning (ICML), 1999.

[4] N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor. On optimizing kernel alignment. Submitted for publication, 2003.

[5] S. D. Kamvar, D. Klein, and C. D. Manning. Spectral learning. In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), 2003.

[6] O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In S. Becker, S. Thrun, and K.
Obermayer, editors, Advances in Neural Information Processing Systems 15, Cambridge, MA, 2003. MIT Press.

[7] C. Helmberg. Semidefinite Programming for Combinatorial Optimization. Habilitationsschrift, TU Berlin, January 2000. ZIB-Report ZR-00-34, Konrad-Zuse-Zentrum Berlin, 2000.

[8] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research (JMLR), 5:27–72, 2004.

[9] T. Poggio, S. Mukherjee, R. Rifkin, A. Rakhlin, and A. Verri. b. In Proceedings of the Conference on Uncertainty in Geometric Computations, 2001.

[10] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.

[11] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
", "award": [], "sourceid": 2507, "authors": [{"given_name": "Tijl", "family_name": "Bie", "institution": null}, {"given_name": "Nello", "family_name": "Cristianini", "institution": null}]}