{"title": "Two view learning: SVM-2K, Theory and Practice", "book": "Advances in Neural Information Processing Systems", "page_first": 355, "page_last": 362, "abstract": null, "full_text": "Two view learning: SVM-2K, Theory and Practice\nJason D.R. Farquhar jdrf99r@ecs.soton.ac.uk Hongying Meng hongying@cs.york.ac.uk David R. Hardoon drh@ecs.soton.ac.uk John Shawe-Taylor jst@ecs.soton.ac.uk\n\nSandor Szedmak ss03v@ecs.soton.ac.uk School of Electronics and Computer Science, University of Southampton, Southampton, England\n\nAbstract\nKernel methods make it relatively easy to define complex high-dimensional feature spaces. This raises the question of how we can identify the relevant subspaces for a particular learning task. When two views of the same phenomenon are available, kernel Canonical Correlation Analysis (KCCA) has been shown to be an effective preprocessing step that can improve the performance of classification algorithms such as the Support Vector Machine (SVM). This paper takes this observation to its logical conclusion and proposes a method that combines this two-stage learning (KCCA followed by SVM) into a single optimisation termed SVM-2K. We present both experimental and theoretical analysis of the approach, showing encouraging results and insights.\n\n1\n\nIntroduction\n\nKernel methods enable us to work with high-dimensional feature spaces by defining weight vectors implicitly as linear combinations of the training examples. This even makes it practical to learn in infinite-dimensional spaces, as for example when using the Gaussian kernel. The Gaussian kernel is an extreme example, but techniques have been developed to define kernels for a range of different datatypes, in many cases characterised by very high dimensionality. Examples are string kernels for text, graph kernels for graphs, marginal kernels, kernels for image data, etc.
With this plethora of high-dimensional representations it is frequently helpful to assist learning algorithms by preprocessing the feature space, projecting the data into a low-dimensional subspace that contains the relevant information for the learning task. Methods of performing this include principal components analysis (PCA) [7], partial least squares [8], kernel independent component analysis (KICA) [1] and kernel canonical correlation analysis (KCCA) [5].\n\nThe last method requires two views of the data, both of which contain all of the relevant information for the learning task, but which individually contain representation-specific details that are different and irrelevant. Perhaps the simplest example of this situation is a paired document corpus in which we have the same information in two languages. KCCA attempts to isolate feature space directions that correlate between the two views and hence might be expected to represent the common relevant information. Hence, one can view this preprocessing as a denoising of the individual representations through cross-correlating them. Experiments have shown how using this as a preprocessing step can improve subsequent analysis, for example in classification experiments using a support vector machine (SVM) [6]. This is explained by the fact that the signal to noise ratio has improved in the identified subspace. Though the combination of KCCA and SVM seems effective, there appears to be no guarantee that the directions identified by KCCA will be best suited to the classification task. This paper therefore looks at the possibility of combining the two distinct stages of KCCA and SVM into a single optimisation that will be termed SVM-2K. The next section introduces the new algorithm and discusses its structure. Experiments are then given showing the performance of the algorithm on an image classification task.
Though the performance is encouraging it is in many ways counter-intuitive, leading to speculation about why an improvement is seen. To investigate this question an analysis of its generalisation properties is given in the following two sections, before drawing conclusions.\n\n2\n\nSVM-2K Algorithm\n\nWe assume that we are given two views of the same data, one expressed through a feature projection φA with corresponding kernel κA and the other through a feature projection φB with kernel κB. A paired data set is then given by a set S = {(φA(x1), φB(x1)), ..., (φA(xℓ), φB(xℓ))}, where for example φA could be the feature vector associated with one language and φB that associated with a second language. For a classification task each data item would also include a label. The KCCA algorithm looks for directions in the two feature spaces such that, when the training data is projected onto those directions, the two vectors (one for each view) of values obtained are maximally correlated. One can also characterise these directions as those that minimise the two-norm between the two vectors under the constraint that they both have norm 1 [5]. We can think of this as constraining the choice of weight vectors in the two spaces. KCCA would typically find a sequence of projection directions of dimension anywhere between 50 and 500 that can then be used as the feature space for training an SVM [6]. An SVM can be thought of as a 1-dimensional projection followed by thresholding, so SVM-2K combines the two steps by introducing the constraint of similarity between two 1-dimensional projections identifying two distinct SVMs, one in each of the two feature spaces. The extra constraint is chosen slightly differently from the 2-norm that characterises KCCA.
We rather take an ε-insensitive 1-norm using slack variables η_i to measure the amount by which points fail to meet similarity:\n\n|⟨wA, φA(xi)⟩ + bA − ⟨wB, φB(xi)⟩ − bB| ≤ η_i + ε,\n\nwhere wA, bA (wB, bB) are the weight and threshold of the first (second) SVM. Combining this constraint with the usual 1-norm SVM constraints and allowing different regularisation constants gives the following optimisation:\n\nmin L = (1/2)‖wA‖² + (1/2)‖wB‖² + C^A Σ_{i=1}^ℓ ξ_i^A + C^B Σ_{i=1}^ℓ ξ_i^B + D Σ_{i=1}^ℓ η_i    (1)\n\nsuch that\n\n|⟨wA, φA(xi)⟩ + bA − ⟨wB, φB(xi)⟩ − bB| ≤ η_i + ε,\nyi (⟨wA, φA(xi)⟩ + bA) ≥ 1 − ξ_i^A,\nyi (⟨wB, φB(xi)⟩ + bB) ≥ 1 − ξ_i^B,\nξ_i^A ≥ 0, ξ_i^B ≥ 0, η_i ≥ 0, all for 1 ≤ i ≤ ℓ.\n\nApplying the usual Lagrange multiplier techniques we arrive at the following dual problem:\n\nmax W = Σ_{i=1}^ℓ (α_i^A + α_i^B) − ε Σ_{i=1}^ℓ (β_i^+ + β_i^−) − (1/2) Σ_{i,j=1}^ℓ [g_i^A g_j^A κA(xi, xj) + g_i^B g_j^B κB(xi, xj)]\n\nsuch that\n\ng_i^A = α_i^A yi − β_i^+ + β_i^−,  g_i^B = α_i^B yi + β_i^+ − β_i^−,\nΣ_{i=1}^ℓ g_i^A = 0 = Σ_{i=1}^ℓ g_i^B,\n0 ≤ α_i^{A/B} ≤ C^{A/B},  β_i^+ ≥ 0, β_i^− ≥ 0,  β_i^+ + β_i^− ≤ D.\n\nLet ŵA, ŵB, b̂A, b̂B be the solution to this optimisation problem. The final SVM-2K decision function is then h(x) = sign(f(x)), where\n\nf(x) = 0.5 (fA(x) + fB(x)) = 0.5 (⟨ŵA, φA(x)⟩ + b̂A + ⟨ŵB, φB(x)⟩ + b̂B),\n\nwith the functions\n\nf_{A/B}(x) = Σ_{i=1}^ℓ g_i^{A/B} κ_{A/B}(xi, x) + b_{A/B}.\n\n3\n\nExperimental results\n\nFigure 1: Typical example images from the PASCAL VOC challenge database. Classes are: Bikes (top-left), People (top-right), Cars (bottom-left) and Motorbikes (bottom-right).\n\nThe performance of the algorithms developed in this paper was evaluated on the PASCAL Visual Object Classes (VOC) challenge dataset test1¹. This is a new dataset consisting of four object classes in realistic scenes. The object classes are motorbikes (M), bicycles (B), people (P) and cars (C), with the dataset containing 684 training set images consisting of (214, 114, 84, 272) images in each class and 689 test set images with (216, 114, 84, 275) for each class.
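To make the optimisation of Section 2 concrete, the primal (1) can be rewritten in unconstrained form by eliminating the slack variables: the ξ terms become hinge losses and the η terms become ε-insensitive penalties on |fA(x) − fB(x)|. Below is a minimal linear-kernel sketch by subgradient descent on that form, using synthetic toy data; all data, step sizes and parameter values here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Synthetic paired two-view data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n, dA, dB = 200, 5, 8
y = np.where(rng.random(n) < 0.5, -1.0, 1.0)
XA = rng.standard_normal((n, dA)) + y[:, None]        # view A
XB = rng.standard_normal((n, dB)) + 0.7 * y[:, None]  # view B (weaker signal)

CA = CB = 1.0            # per-view SVM regularisation constants
D, eps = 1.0, 0.1        # similarity penalty and epsilon insensitivity
wA, wB = np.zeros(dA), np.zeros(dB)
bA = bB = 0.0

for t in range(2000):
    fA = XA @ wA + bA
    fB = XB @ wB + bB
    # active hinge constraints of the two individual SVMs
    hA = (y * fA < 1).astype(float)
    hB = (y * fB < 1).astype(float)
    # active epsilon-insensitive similarity violations |fA - fB| > eps
    diff = fA - fB
    s = np.where(np.abs(diff) > eps, np.sign(diff), 0.0)
    # subgradients of the unconstrained SVM-2K objective
    gwA = wA - CA * XA.T @ (hA * y) + D * XA.T @ s
    gwB = wB - CB * XB.T @ (hB * y) - D * XB.T @ s
    gbA = -CA * np.sum(hA * y) + D * np.sum(s)
    gbB = -CB * np.sum(hB * y) - D * np.sum(s)
    lr = 1e-3
    wA -= lr * gwA; wB -= lr * gwB
    bA -= lr * gbA; bB -= lr * gbB

def predict(xa, xb):
    # SVM-2K decision: sign of the averaged projections of the two views
    return np.sign(0.5 * (xa @ wA + bA) + 0.5 * (xb @ wB + bB))

train_acc = np.mean(predict(XA, XB) == y)
```

For general kernels one would solve the dual QP of Section 2 instead; the sketch only illustrates how the two hinge terms and the similarity penalty interact.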
As can be seen in Figure 1 this is a very challenging dataset, with objects of widely varying type, pose, illumination, occlusion, background, etc. The task is to classify the image according to whether it contains a given object type. We tested the images containing the object (i.e. categories M, B, C and P) against non-object images from the database (i.e. category N). The training set contained 100 positive and 100 negative images. The tests are carried out on 100 new images, half belonging to the learned class and half not. Like many other successful methods [3, 4] we take a \"set-of-patches\" approach to this problem. These methods represent an image in terms of the features of a set of small image patches. By carefully choosing the patches and their features this representation can be made largely robust to the common types of image transformation, e.g. scale, rotation, perspective, occlusion. Two views were provided of each image through the use of different patch types. One was from affine invariant interest point detectors, with a moment invariant descriptor calculated for each interest point. The second was key point features from SIFT detectors. For one image, several hundred characteristic patches were detected according to the complexity of the image. These were then clustered around K = 400 centres for each feature space. Each image is then represented as a histogram over these centres. So finally, for one image there are two feature vectors of length 400 that provide the two views.\n\n            Motorbike  Bicycle  People  Car\nSVM 1           94.05    91.58   91.58  87.95\nSVM 2           91.15    91.15   90.57  86.21\nKCCA + SVM      94.19    90.28   90.57  88.68\nSVM-2K          94.34    93.47   92.74  90.13\n\nTable 1: Results for 4 datasets showing test accuracy of the individual SVMs, KCCA followed by an SVM, and SVM-2K.\n\nTable 1 shows the test accuracies obtained for the different categories for the individual SVMs and the SVM-2K.
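The set-of-patches pipeline described above, quantising each image's patch descriptors against K cluster centres and representing the image by a normalised K-bin histogram, can be sketched as follows. The descriptor dimension, the value of K and the random data here are illustrative assumptions (the paper uses K = 400 with moment-invariant and SIFT descriptors):

```python
import numpy as np

def bow_histogram(descriptors, centres):
    """Quantise patch descriptors against K cluster centres and return
    a normalised K-bin histogram (one view of the image)."""
    # squared distances from every patch to every centre: (n_patches, K)
    d2 = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)                 # nearest centre per patch
    hist = np.bincount(assign, minlength=len(centres)).astype(float)
    return hist / hist.sum()

# hypothetical toy example: 300 patch descriptors of dimension 16, K = 4
rng = np.random.default_rng(1)
patches = rng.standard_normal((300, 16))
centres = rng.standard_normal((4, 16))
h = bow_histogram(patches, centres)
```

In the experiments each image would yield one such histogram per view (one from the moment-invariant descriptors, one from the SIFT descriptors), giving the paired feature vectors fed to SVM-2K.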
There is a clear improvement in performance of the SVM-2K over the two individual SVMs in all four categories. If we examine the structure of the optimisation, the restriction that the outputs of the two linear functions be similar seems an arbitrary one, particularly for points that are far from the margin or are misclassified. Intuitively it would appear better to take advantage of the abilities of the different representations to better fit the data. In order to understand this apparent contradiction we now consider a theoretical analysis of the generalisation of the SVM-2K using the framework provided by Rademacher complexity bounds.\n\n4\n\nBackground theory\n\nWe begin with the definitions required for Rademacher complexity, see for example Bartlett and Mendelson [2] (see also [9] for an introductory exposition).\n\nDefinition 1. For a sample S = {x1, ..., xℓ} generated by a distribution D on a set X and a real-valued function class F with domain X, the empirical Rademacher complexity of F is the random variable\n\nR̂ℓ(F) = E_σ [ sup_{f∈F} | (2/ℓ) Σ_{i=1}^ℓ σi f(xi) | | x1, ..., xℓ ],\n\nwhere σ = {σ1, ..., σℓ} are independent uniform {±1}-valued Rademacher random variables. The Rademacher complexity of F is\n\nRℓ(F) = E_S [ R̂ℓ(F) ] = E_{Sσ} [ sup_{f∈F} | (2/ℓ) Σ_{i=1}^ℓ σi f(xi) | ].\n\nWe use E_D to denote expectation with respect to a distribution D and E_S when the distribution is the uniform (empirical) distribution on a sample S.\n\n¹ Available from http://www.pascal-network.org/challenges/VOC/voc/ 160305 VOCdata.tar.gz\n\nGiven a training set S, the class of functions that we will primarily be considering are linear functions with bounded norm\n\nF_B = {x ↦ ⟨w, φ(x)⟩ : ‖w‖ ≤ B} ⊇ {x ↦ Σ_{i=1}^ℓ αi κ(xi, x) : α′Kα ≤ B²},\n\nwhere φ is the feature mapping corresponding to the kernel κ and K is the corresponding kernel matrix for the sample S. The following result bounds the Rademacher complexity of linear function classes.\n\nTheorem 1. Fix δ ∈ (0, 1) and let F be a class of functions mapping from S to [0, 1].
Let (xi)_{i=1}^ℓ be drawn independently according to a probability distribution D. Then with probability at least 1 − δ over random draws of samples of size ℓ, every f ∈ F satisfies\n\nE_D[f(x)] ≤ E_S[f(x)] + Rℓ(F) + 3√(ln(2/δ)/(2ℓ))\n ≤ E_S[f(x)] + R̂ℓ(F) + 3√(ln(2/δ)/(2ℓ)).\n\nTheorem 2. [2] If κ : X × X → R is a kernel, and S = {x1, ..., xℓ} is a sample of points from X, then the empirical Rademacher complexity of the class F_B satisfies\n\nR̂ℓ(F_B) ≤ (2B/ℓ) √(Σ_{i=1}^ℓ κ(xi, xi)) = (2B/ℓ) √(tr(K)).\n\n4.1 Analysing SVM-2K\n\nFor SVM-2K, the two feature sets from the same objects are (φA(xi))_{i=1}^ℓ and (φB(xi))_{i=1}^ℓ respectively. We assume the notation and optimisation of SVM-2K given in section 2, equation (1). First observe that an application of Theorem 1 shows that\n\nE_D[|fA(x) − fB(x)|] ≤ E_S[|⟨ŵA, φA(x)⟩ + b̂A − ⟨ŵB, φB(x)⟩ − b̂B|] + (2C/ℓ)√(tr(KA) + tr(KB)) + 3√(ln(2/δ)/(2ℓ))\n ≤ (1/ℓ) Σ_{i=1}^ℓ (η̂i + ε) + (2C/ℓ)√(tr(KA) + tr(KB)) + 3√(ln(2/δ)/(2ℓ)) =: D\n\nwith probability at least 1 − δ. We have assumed that ‖wA‖² + bA² ≤ C² and ‖wB‖² + bB² ≤ C² for some prefixed C. Hence, the class of functions we are considering when applying SVM-2K to this problem can be restricted to\n\nF_{C,D} = { x ↦ 0.5 ( Σ_{i=1}^ℓ [g_i^A κA(xi, x) + g_i^B κB(xi, x)] + bA + bB ) :\n g_A′ KA g_A + bA² ≤ C², g_B′ KB g_B + bB² ≤ C², E_D[|fA(x) − fB(x)|] ≤ D }.\n\nThe class F_{C,D} is clearly closed under negation. Applying the usual Rademacher techniques for margin bounds on generalisation we obtain the following result.\n\nTheorem 3. Fix δ ∈ (0, 1) and let F_{C,D} be the class of functions described above. Let (xi)_{i=1}^ℓ be drawn independently according to a probability distribution D. Then with probability at least 1 − δ over random draws of samples of size ℓ, every f ∈ F_{C,D} satisfies\n\nP_{(x,y)~D}(sign(f(x)) ≠ y) ≤ (1/(2ℓ)) Σ_{i=1}^ℓ (ξ̂_i^A + ξ̂_i^B) + R̂ℓ(F_{C,D}) + 3√(ln(2/δ)/(2ℓ)).\n\nIt therefore remains to compute the empirical Rademacher complexity of F_{C,D}, which is the critical discriminator between the bounds for the individual SVMs and that of the SVM-2K.
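Before specialising to F_{C,D}, it may help to see how an empirical Rademacher complexity is estimated in practice. For the norm-bounded linear class F_B of Theorem 2, the supremum over the ball has a closed form for each fixed σ, namely (2B/ℓ)‖Σ_i σi xi‖ (by Cauchy-Schwarz), so a Monte-Carlo estimate over draws of σ is straightforward. A sketch on toy data follows; sample sizes and seeds are arbitrary assumptions:

```python
import numpy as np

def empirical_rademacher_linear(X, B, n_draws=10, seed=0):
    """Monte-Carlo estimate of the empirical Rademacher complexity of
    {x -> <w, x> : ||w|| <= B} on the sample X (rows are examples).
    For a fixed sigma the supremum over the ball has the closed form
    (2B / l) * ||sum_i sigma_i x_i||."""
    rng = np.random.default_rng(seed)
    l = X.shape[0]
    draws = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=l)   # Rademacher variables
        draws.append(2.0 * B / l * np.linalg.norm(sigma @ X))
    return float(np.mean(draws)), float(np.std(draws))

# toy sample: l = 100 points in 20 dimensions
X = np.random.default_rng(1).standard_normal((100, 20))
mean, std = empirical_rademacher_linear(X, B=1.0)

# Theorem 2 gives the deterministic bound (2B / l) * sqrt(tr(K)); for the
# linear kernel tr(K) = sum_i <x_i, x_i> = ||X||_F^2.
bound = 2.0 / 100 * np.linalg.norm(X, "fro")
```

For F_{C,D} itself the per-σ supremum has no closed form and must be obtained by solving the constrained optimisation described in Section 4.2, but the sampling scheme is the same one used for the experiments of Table 2 (10 draws of σ).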
4.2 Empirical Rademacher complexity of F_{C,D}\n\nWe now define an auxiliary function of two weight vectors wA and wB,\n\nD(wA, wB) := E_D[|⟨wA, φA(x)⟩ + bA − ⟨wB, φB(x)⟩ − bB|].\n\nWith this notation we can consider computing the Rademacher complexity of the class F_{C,D}:\n\nR̂ℓ(F_{C,D}) = E_σ [ sup_{f∈F_{C,D}} (2/ℓ) Σ_{i=1}^ℓ σi f(xi) ]\n = E_σ [ sup_{‖wA‖≤C, ‖wB‖≤C, D(wA,wB)≤D} (1/ℓ) Σ_{i=1}^ℓ σi (⟨wA, φA(xi)⟩ + bA + ⟨wB, φB(xi)⟩ + bB) ].\n\nOur next observation follows from a reversed version of the basic Rademacher complexity theorem, reworked to reverse the roles of the empirical and true expectations:\n\nTheorem 4. Fix δ ∈ (0, 1) and let F be a class of functions mapping from S to [0, 1]. Let (xi)_{i=1}^ℓ be drawn independently according to a probability distribution D. Then with probability at least 1 − δ over random draws of samples of size ℓ, every f ∈ F satisfies\n\nE_S[f(x)] ≤ E_D[f(x)] + Rℓ(F) + 3√(ln(2/δ)/(2ℓ))\n ≤ E_D[f(x)] + R̂ℓ(F) + 3√(ln(2/δ)/(2ℓ)).\n\nThe proof tracks that of Theorem 1 but is omitted through lack of space.\n\nFor weight vectors wA and wB satisfying D(wA, wB) ≤ D, an application of Theorem 4 shows that with probability at least 1 − δ we have\n\nD̂(wA, wB) := E_S[|⟨wA, φA(x)⟩ + bA − ⟨wB, φB(x)⟩ − bB|]\n ≤ D + (2C/ℓ)√(tr(KA) + tr(KB)) + 3√(ln(2/δ)/(2ℓ))\n ≤ (1/ℓ) Σ_{i=1}^ℓ (η̂i + ε) + (4C/ℓ)√(tr(KA) + tr(KB)) + 6√(ln(2/δ)/(2ℓ)) =: D̂.\n\nWe now return to bounding the Rademacher complexity of F_{C,D}. The above result shows that with probability greater than 1 − δ\n\nR̂ℓ(F_{C,D}) ≤ E_σ [ sup_{‖wA‖≤C, ‖wB‖≤C, D̂(wA,wB)≤D̂} (1/ℓ) Σ_{i=1}^ℓ σi (⟨wA, φA(xi)⟩ + bA + ⟨wB, φB(xi)⟩ + bB) ].\n\nFirst note that the expression in square brackets is concentrated under the uniform distribution of Rademacher variables. Hence, we can estimate the complexity for a fixed instantiation of the Rademacher variables σ.
We now must find the values of wA and wB that maximise the expression\n\n(1/ℓ) [ ⟨wA, Σ_{i=1}^ℓ σi φA(xi)⟩ + ⟨wB, Σ_{i=1}^ℓ σi φB(xi)⟩ + (bA + bB) Σ_{i=1}^ℓ σi ]\n = (1/ℓ) [ σ′ KA g_A + σ′ KB g_B + (bA + bB) σ′ 1 ],\n\nsubject to the constraints\n\ng_A′ KA g_A + bA² ≤ C², g_B′ KB g_B + bB² ≤ C², and\n(1/ℓ) 1′ abs(KA g_A − KB g_B + (bA − bB) 1) ≤ D̂,\n\nwhere 1 is the all-ones vector and abs(u) is the vector obtained by applying the abs function to u component-wise. The resulting value of the objective function is the estimate of the Rademacher complexity. This is the optimisation solved in the brief experiments described below.\n\n4.3 Experiments with Rademacher complexity\n\nWe computed the Rademacher complexity for the problems considered in the experimental section above. We wished to verify that the Rademacher complexities of the space F_{C,D}, where C and D are determined by applying the SVM-2K, are indeed significantly lower than those obtained for the SVMs in each space individually.\n\n         Motorbike  Bicycle  People  Car\nSVM 1        94.05    91.58   91.58  87.95\nRad 1         1.65     0.93    0.91   1.60\nSVM 2        91.15    91.15   90.57  86.21\nRad 2         1.72     1.48    0.87   1.64\nSVM 2K       94.34    93.47   92.74  90.13\nRad 2K        1.26     1.28    0.82   1.26\n\nTable 2: Results for 4 datasets showing test accuracy and Rademacher complexity (Rad) of the individual SVMs and SVM-2K.\n\nTable 2 shows the results for the motorbike, bicycle, people and car datasets. We show the Rademacher complexities for the individual SVMs and for the SVM-2K along with the generalisation results already given in Table 1. In the case of SVM-2K we sampled the Rademacher variables 10 times and give the corresponding standard deviation. As predicted, the Rademacher complexity is significantly smaller for SVM-2K, hence confirming the intuition that led to the introduction of the approach, namely that the complexity of the class is reduced by restricting the weight vectors to align on the training data.
Provided both representations contain the necessary information we can therefore expect an improvement in generalisation, as observed in the reported experiments.\n\n5\n\nConclusions\n\nWith the plethora of data now being collected in a wide range of fields there is frequently the luxury of having two views of the same phenomenon. The simplest example is paired corpora of documents in different languages, but equally we can think of examples from bioinformatics, machine vision, etc. Frequently it is also reasonable to assume that both views contain all of the relevant information required for a classification task. We have demonstrated that in such cases it can be possible to leverage the correlation between the two views to improve classification accuracy. This has been demonstrated in experiments with a machine vision task. Furthermore, we have undertaken a theoretical analysis to illuminate the source and extent of the advantage that can be obtained, showing in the cases considered a significant reduction in the Rademacher complexity of the corresponding function classes.\n\nReferences\n\n[1] Francis R. Bach and Michael I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1-48, 2002.\n\n[2] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.\n\n[3] G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In XRCE Research Reports, XEROX. The 8th European Conference on Computer Vision - ECCV, Prague, 2004.\n\n[4] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2003.\n\n[5] David Hardoon, Sandor Szedmak, and John Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16:2639-2664, 2004.
[6] Yaoyong Li and John Shawe-Taylor. Using KCCA for Japanese-English cross-language information retrieval and classification. To appear in Journal of Intelligent Information Systems, 2005.\n\n[7] S. Mika, B. Schölkopf, A. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In Advances in Neural Information Processing Systems 11, 1998.\n\n[8] R. Rosipal and L. J. Trejo. Kernel partial least squares regression in reproducing kernel Hilbert space. Journal of Machine Learning Research, 2:97-123, 2001.\n\n[9] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.\n", "award": [], "sourceid": 2829, "authors": [{"given_name": "Jason", "family_name": "Farquhar", "institution": null}, {"given_name": "David", "family_name": "Hardoon", "institution": null}, {"given_name": "Hongying", "family_name": "Meng", "institution": null}, {"given_name": "John", "family_name": "Shawe-taylor", "institution": null}, {"given_name": "S\u00e1ndor", "family_name": "Szedm\u00e1k", "institution": null}]}