of supporting patterns per separating hypersurface. The results compare favorably to neural network classifiers which minimize the mean squared error with backpropagation. For the one-layer network (linear classifier), the error on the test set is 12.7% on DB1 and larger than 25% on DB2. The lowest error rate for DB2, 4.9%, obtained with a fourth order polynomial, is comparable to the 5.1% error obtained with a multi-layer neural network with a sophisticated architecture trained and tested on the same data [6].

The results obtained with polynomial classifiers of order q are summarized in table 1. Also listed is the number of adjustable parameters, N. This quantity increases rapidly with q and quickly reaches a level that is computationally intractable for algorithms that explicitly compute each parameter [5]. Moreover, as N increases, the learning problem becomes grossly underdetermined: the number of training patterns (p = 600 for DB1 and p = 7300 for DB2) becomes very small compared to N. Nevertheless, good generalization is achieved, as shown by the experimental results listed in the table. This is a consequence of the inherent regularization of the algorithm.

An important concern is the sensitivity of the maximum margin solution to the presence of outliers in the training data. It is indeed important to remove undesired outliers (such as meaningless or mislabeled patterns) to get the best generalization performance. Conversely, "good" outliers (such as examples of rare styles) must be kept. Cleaning techniques have been developed, based on the re-examination by a human supervisor of those supporting patterns which result in the largest increase of the margin when removed, and thus are the most likely candidates for outliers [3]. In our experiments on DB2 with linear classifiers, the error rate on the test set dropped from 15.2% to 10.5% after cleaning the training data (not the test data).
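The growth of N with the order q noted above can be made concrete. One common way to count the adjustable parameters of a polynomial classifier is one weight per monomial of degree at most q in the n input components, i.e. N = C(n+q, q). A minimal sketch; the input dimension 256 below is an illustrative assumption, not a value taken from the table:

```python
from math import comb

def num_parameters(n, q):
    """Adjustable parameters of a polynomial classifier of order q:
    one weight per monomial of degree <= q in n input variables."""
    return comb(n + q, q)

# Illustration: N grows rapidly with q for a 256-dimensional input.
for q in range(1, 5):
    print(f"q = {q}: N = {num_parameters(256, q):,}")
```

Already at q = 4 this count exceeds 10^8, which is why algorithms that compute each parameter explicitly become intractable, while the dual approach only ever handles p coefficients.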
2 ALGORITHM DESIGN

The properties of the GP algorithm arise from merging two separate ideas: training in dual space, and minimizing the maximum loss. For large VC-dimension classifiers (N ≫ p), the first idea reduces the number of effective parameters to be actually computed from N to p. The second idea reduces it from p to m.

Automatic Capacity Tuning of Very Large VC-dimension Classifiers    151

2.1 DUALITY

We seek a decision function for pattern vectors x of dimension n belonging to either of two classes A and B. The input to the training algorithm is a set of p examples x_i with labels y_i:

(x_1, y_1), (x_2, y_2), ..., (x_p, y_p),    (3)

where y_k = 1 if x_k ∈ class A and y_k = −1 if x_k ∈ class B.

From these training examples the algorithm finds the parameters of the decision function D(x) during a learning phase. After training, the classification of unknown patterns is predicted according to the following rule:

x ∈ A if D(x) > 0,
x ∈ B otherwise.    (4)

We limit ourselves to classifiers linear in their parameters, but not restricted to linear dependences in their input components, such as Perceptrons and kernel-based classifiers. Perceptrons [5] have a decision function defined as:

D(x) = w · φ(x) + b,

where the φ_i are predefined functions of x, and w and b are the adjustable parameters. Training amounts to maximizing the dual objective

J(α) = Σ_k α_k − (1/2) Σ_{k,l} α_k α_l H_kl

subject to α_k ≥ 0 [4, 2]. The p × p square matrix H has elements:

H_kl = y_k y_l K(x_k, x_l),

where K(x, x') is a kernel, such as the ones proposed in (9), which can be expanded as in (8). Examples are shown in figure 2. K(x, x') is not restricted to the dot product K(x, x') = x · x' as in the original formulation of the GP algorithm [2]. In order for a unique solution to exist, H must be positive definite. The bias b can be either fixed or optimized together with the parameters α_k. The latter case introduces another set of constraints: Σ_k y_k α_k = 0 [4].

The quadratic programming problem thus defined can be solved efficiently by standard numerical methods [11].
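The dual training problem above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the projected-gradient loop stands in for a standard QP solver [11], the bias b is fixed at 0, and the kernel and toy data in the usage below are assumptions:

```python
import numpy as np

def solve_dual(X, y, kernel, iters=2000, lr=0.01):
    """Hard-margin dual with fixed bias b = 0: maximize
    sum_k a_k - 0.5 * a^T H a  subject to a_k >= 0, where
    H_kl = y_k y_l K(x_k, x_l).  Projected gradient ascent is
    used here only for simplicity; any QP routine does the job."""
    K = np.array([[kernel(xk, xl) for xl in X] for xk in X])
    H = np.outer(y, y) * K
    a = np.zeros(len(y))
    for _ in range(iters):
        grad = 1.0 - H @ a                   # gradient of the dual objective
        a = np.maximum(a + lr * grad, 0.0)   # project onto a_k >= 0
    return a

def decision(x, X, y, a, kernel):
    """D(x) = sum_k y_k a_k K(x_k, x); only patterns with a_k > 0
    (the supporting patterns) contribute."""
    return sum(yk * ak * kernel(xk, x) for xk, yk, ak in zip(X, y, a))
```

For example, on four points on the vertical axis split by sign of the second coordinate, with the plain dot-product kernel, the learned D(x) separates the two classes and only the two margin patterns receive nonzero multipliers.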
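The dual formulation also lends itself to processing the training data iteratively in small chunks, as discussed below: only the supporting patterns of the current solution need to be carried from one chunk to the next. A self-contained sketch, again with an illustrative stand-in solver (projected gradient ascent, bias fixed at 0) in place of a standard QP routine, and with the chunk size and toy data as assumptions:

```python
import numpy as np

def dual_solve(K, y, iters=3000, lr=0.01):
    """Stand-in dual solver: projected gradient ascent on the
    hard-margin dual (bias fixed at 0) with H_kl = y_k y_l K_kl."""
    H = np.outer(y, y) * K
    a = np.zeros(len(y))
    for _ in range(iters):
        a = np.maximum(a + lr * (1.0 - H @ a), 0.0)
    return a

def train_in_chunks(X, y, kernel, chunk=50, tol=1e-6):
    """Solve the dual on the current working set, keep only the
    supporting patterns (a_k > tol), merge them with the next chunk,
    and repeat.  The working set stays of order chunk + m rather
    than growing to the full p training patterns."""
    work_X, work_y, a = [], [], np.zeros(0)
    for start in range(0, len(X), chunk):
        work_X += list(X[start:start + chunk])
        work_y += list(y[start:start + chunk])
        K = np.array([[kernel(u, v) for v in work_X] for u in work_X])
        a = dual_solve(K, np.array(work_y))
        keep = a > tol
        work_X = [x for x, k in zip(work_X, keep) if k]
        work_y = [t for t, k in zip(work_y, keep) if k]
        a = a[keep]
    return work_X, work_y, a
```

The returned working set is the list of supporting patterns with their multipliers, from which D(x) is evaluated exactly as in the non-chunked case.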
Numerical computation can be further reduced by processing iteratively small chunks of data [2]. The computational time is linear in the dimension n of x-space (not the dimension N of φ-space) and in the number p of training examples, and polynomial in the number m < min(N + 1, p) of supporting

154    Guyon, Boser, and Vapnik