{"title": "Asymptotic Universality for Learning Curves of Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 479, "page_last": 486, "abstract": null, "full_text": "Asymptotic Universality for Learning \nCurves of Support Vector Machines \n\nM. Opper\u00b9 \n\nR. Urbanczik\u00b2 \n\n\u00b9 Neural Computing Research Group \nSchool of Engineering and Applied Science \nAston University, Birmingham B4 7ET, UK. \nopperm@aston.ac.uk \n\n\u00b2 Institut f\u00fcr Theoretische Physik, \nUniversit\u00e4t W\u00fcrzburg, Am Hubland, D-97074 W\u00fcrzburg, Germany \nrobert@physik.uni-wuerzburg.de \n\nAbstract \n\nUsing methods of Statistical Physics, we investigate the role \nof model complexity in learning with support vector machines \n(SVMs). We show the advantages of using SVMs with kernels \nof infinite complexity on noisy target rules, which, in contrast to \ncommon theoretical beliefs, are found to achieve optimal generalization \nerror although the training error does not converge to the \ngeneralization error. Moreover, we find universal asymptotics of \nthe learning curves, which depend only on the target rule and not \non the SVM kernel. \n\n1 Introduction \n\nPowerful systems for data inference, like neural networks, implement complex input-output \nrelations by learning from example data. The price one has to pay for the \nflexibility of these models is the need to choose the proper model complexity for a \ngiven task, i.e. the system architecture which gives good generalization ability for \nnovel data. This has become an important problem also for support vector machines \n[1]. The main advantage of SVMs is that the learning task is a convex optimization \nproblem which can be reliably solved even when the example data require the fitting \nof a very complicated function. \n
A common argument in computational learning \ntheory suggests that it is dangerous to utilize the full flexibility of the SVM to learn \nthe training data perfectly when these contain noise. By fitting more \nand more noisy data, the machine may implement a rapidly oscillating function \nrather than the smooth mapping which characterizes most practical learning tasks. \nIts prediction ability could be no better than random guessing in that case. Hence, \nmodifications of SVM training [2] that allow for training errors were suggested to be \nnecessary for realistic noisy scenarios. This has the drawback of introducing extra \nmodel parameters and spoils much of the original elegance of SVMs. \n\nSurprisingly, the results of this paper show that the picture is rather different in \nthe important case of high dimensional data spaces. Using methods of Statistical \nPhysics, we show that asymptotically, the SVM achieves optimal generalization \nability for noisy data already for zero training error. Moreover, the asymptotic rate \nof decay of the generalization error is universal, i.e. independent of the kernel that \ndefines the SVM. These results have been published previously only in a physics \njournal [3]. \n\nAs is well known, SVMs classify inputs $y$ using a nonlinear mapping into a feature \nvector $\\Psi(y)$ which is an element of a Hilbert space. Based on a training set of $m$ \ninputs $x^\\mu$ and their desired classifications $\\tau^\\mu$, SVMs construct the maximum margin \nhyperplane $P$ in the feature space. $P$ can be expressed as a linear combination of \nthe feature vectors $\\Psi(x^\\mu)$, and to classify an input $y$, that is to decide on which side \nof $P$ the image $\\Psi(y)$ lies, one basically has to evaluate inner products $\\Psi(x^\\mu) \\cdot \\Psi(y)$. \nFor carefully chosen mappings $\\Psi$ and Hilbert spaces, inner products $\\Psi(x) \\cdot \\Psi(y)$ \ncan be evaluated efficiently using a kernel function $k(x, y) = \\Psi(x) \\cdot \\Psi(y)$, without \nhaving to individually calculate the feature vectors $\\Psi(x)$ and $\\Psi(y)$. \n
In this manner \nit becomes computationally feasible to use very high and even infinite dimensional \nfeature vectors. \n\nThis raises the intriguing question whether the use of a very high dimensional \nfeature space may typically be helpful. So far, recent results [4, 5] obtained by \nusing methods of Statistical Mechanics (which are naturally well suited for analysing \nstochastic models in high dimensional spaces) have been largely negative in this \nrespect. They suggest (as one might perhaps expect) that it is rather important to \nmatch the complexity of the kernel to the target rule. The analysis in [4] considers \nthe case of $N$-dimensional inputs with binary components and assumes that the \ntarget rule giving the correct classification $\\tau$ of an input $x$ is obtained as the sign \nof a function $t(x)$ which is polynomial in the input components and of degree $L$. \nThe SVM uses a kernel which is a polynomial of the inner product $x \\cdot y$ in input \nspace of degree $K \\geq L$, and the feature space dimension is thus $O(N^K)$. In this \nscenario it is shown, under mild regularity conditions on the kernel and for large $N$, \nthat the SVM generalizes well when the number of training examples $m$ is on the \norder of $N^L$. So the scale of the learning curve is determined by the complexity \nof the target rule and not by the kernel. However, considering the rate with which \nthe generalization error approaches zero, one finds the optimal $N^L/m$ decay only \nwhen $K$ is equal to $L$, and the convergence is substantially slower when $K > L$. So \nit is important to match the complexity of the kernel to the target rule, and using \na large value of $K$ is only justified if $L$ is assumed large and if one can use on the \norder of $N^L$ examples for training. \n\nIn this paper we show that the situation is very different when one considers the \narguably more realistic scenario of a target rule corrupted by noise. \n
Now one can no \nlonger use $K = L$ since no separating hyperplane $P$ will exist when $m$ is sufficiently \nlarge compared to $N^L$. However when $K > L$, this plane will exist and we will show \nthat it achieves optimal generalization performance in the limit that $N^L/m$ is small. \nRemarkably, the asymptotic rate of decay of the generalization error is independent \nof the kernel in this case and a general characterization of the asymptote in terms of \nproperties of the target rule is possible. In a second step we show that under mild \nregularity conditions these findings also hold when $k(x, y)$ is an arbitrary function \nof $x \\cdot y$ or when the kernel is a function of the Euclidean distance $|x - y|$. The latter \ntype of kernel is widely used in practical applications of SVMs. \n\n2 Learning with Noise: Polynomial Kernels \n\nWe begin by assuming a polynomial kernel $k(x, y) = f(x \\cdot y)$ where $f(z) = \\sum_{k=0}^{K} c_k z^k$ \nis of degree $K$. Denoting by $p$ a multi-index $p = (p_1, \\ldots, p_N)$ with $p_i \\in \\mathbb{N}_0$, \nwe set $x_p = \\sqrt{|p|!} \\prod_{i=1}^{N} \\frac{x_i^{p_i}}{\\sqrt{p_i!}}$, and the degree of $x_p$ is $|p| = \\sum_{i=1}^{N} p_i$. The kernel \ncan then be described by features $\\Psi_p(x) = \\sqrt{c_{|p|}} \\, x_p$ since $k(x, y) = \\sum_p \\Psi_p(x) \\Psi_p(y)$, \nwhere the summation runs over all multi-indices of degree up to $K$. To assure that \nthe features are real, we assume that the coefficients $c_k$ in the kernel are nonnegative. \nA hyperplane in feature space is parameterized by a weight vector $w$ with \ncomponents $w_p$, and if $0 < \\tau^\\mu w \\cdot \\Psi(x^\\mu)$, a point $(x^\\mu, \\tau^\\mu)$ of the training set lies on \nthe correct side of the plane. To express that the plane $P$ has maximal distance to \nthe points of the training set, we choose an arbitrary positive stability parameter $\\kappa$ \nand require that the weight vector $w^*$ of $P$ minimize $w \\cdot w$ subject to the constraints \n$\\kappa < \\tau^\\mu w \\cdot \\Psi(x^\\mu)$, for $\\mu = 1, \\ldots, m$. 
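
The feature expansion of the polynomial kernel described above can be checked numerically. The following is a minimal Python sketch under stated assumptions, not the authors' code: the helper names (multi_indices, monomial_feature, feature_map, kernel), the example inputs and the example coefficients c_k are all hypothetical and chosen only for illustration. It builds the features Psi_p(x) = sqrt(c_|p|) x_p for a small input dimension and compares their inner product with f(x . y).

```python
import itertools
import math


def multi_indices(n, degree):
    # All tuples p of n non-negative integers whose entries sum to the given degree.
    return [p for p in itertools.product(range(degree + 1), repeat=n)
            if sum(p) == degree]


def monomial_feature(x, p):
    # x_p = sqrt(|p|! / prod_i p_i!) * prod_i x_i ** p_i
    multinomial = math.factorial(sum(p))
    for pi in p:
        multinomial //= math.factorial(pi)  # stays an exact integer at every step
    value = math.sqrt(multinomial)
    for xi, pi in zip(x, p):
        value *= xi ** pi
    return value


def feature_map(x, coeffs):
    # Psi_p(x) = sqrt(c_|p|) * x_p for every multi-index of degree 0..K,
    # where coeffs[k] is the (nonnegative) kernel coefficient c_k.
    feats = []
    for k, c in enumerate(coeffs):
        for p in multi_indices(len(x), k):
            feats.append(math.sqrt(c) * monomial_feature(x, p))
    return feats


def kernel(x, y, coeffs):
    # k(x, y) = f(x . y) with f(z) = sum_k c_k z ** k
    z = sum(a * b for a, b in zip(x, y))
    return sum(c * z ** k for k, c in enumerate(coeffs))


# Hypothetical example: N = 3 inputs, kernel polynomial of degree K = 3.
x = [0.3, -0.5, 0.2]
y = [0.1, 0.4, -0.7]
coeffs = [0.5, 1.0, 0.25, 2.0]

lhs = kernel(x, y, coeffs)
rhs = sum(a * b for a, b in zip(feature_map(x, coeffs), feature_map(y, coeffs)))
print(abs(lhs - rhs) < 1e-12)
```

For N = 3 and K = 3 the explicit feature vector already has 1 + 3 + 6 + 10 = 20 components, and the count grows like N^K, which is why evaluating the kernel directly is preferred over building the features.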
\n\n2.1 The Statistical Mechanics Formulation \n\nStatistical Mechanics allows one to analyze a variety of learning scenarios exactly in the \n\"thermodynamic limit\" of high input dimensionality, when the data distributions \nare simple enough. In this approach, one computes a partition function which \nserves as a generating function for important averages such as the generalization \nerror. To define the partition function for SVMs one first analyzes a soft version of \nthe optimization problem characterized by an inverse temperature $\\beta$. One considers \nthe partition function \n\n$$Z = \\int dw \\, e^{-\\frac{1}{2} \\beta w \\cdot w} \\prod_{\\mu=1}^{m} \\Theta(\\tau^\\mu w \\cdot \\Psi(x^\\mu) - \\kappa), \\qquad (1)$$ \n\nwhere the SVM constraints are enforced strictly using the Heaviside step function \n$\\Theta$. Properties of $w^*$ can be computed from $\\ln Z$ by taking the limit $\\beta \\to \\infty$. \nTo model the training data, we assume that the random and independent input \ncomponents have zero mean and variance $1/N$. This scaling assures that the variance \nof $w \\cdot \\Psi(x^\\mu)$ stays finite in the large $N$ limit. For the target rule we assume \nthat its deterministic part is given by the polynomial $t(x) = \\sum_p \\sqrt{c_{|p|}} \\, B_p x_p$ with \nreal parameters $B_p$. The magnitude of the contribution of each degree $k$ to the \nvalue of $t(x)$ is measured by the quantities \n\n$$T_k = \\frac{c_k}{N_k} \\sum_{p, |p|=k} B_p^2, \\qquad (2)$$ \n\nwhere $N_k = \\binom{N+k-1}{k}$ is the number of terms in the sum. The degree of $t(x)$ is $L$ \nand lower than $K$, so $T_L > 0$ and $T_{L+1} = \\ldots = T_K = 0$. Note that this definition \nof $t(x)$ ensures that any feature necessary for computing $t(x)$ is available to the \nSVM. For brevity we assume that the constant term in $t(x)$ vanishes ($T_0 = 0$) and \nthe normalization is $\\sum_k T_k = 1$. \n\n2.2 The Noise Model \n\nIn the deterministic case the label of a point $x$ would simply be the sign of $t(x)$. \nHere we consider a nondeterministic rule and the output label is obtained using a \nrandom variable $\\tau_u \\in \\{-1, 1\\}$ parameterized by a scalar $u$. \n
The observable instances \nof the rule, and in particular the elements of the training set, are then obtained by \nindependently sampling the random variable $(x, \\tau_{t(x)})$. Simple examples are additive \nnoise, $\\tau_{t(x)} = \\mathrm{sgn}(t(x) + \\eta)$, or multiplicative noise, $\\tau_{t(x)} = \\mathrm{sgn}(t(x) \\eta)$, where $\\eta$ is \na noise term independent of $x$. In general, we will assume that the noise does not \nsystematically corrupt the deterministic component, formally \n\n$$1 > \\mathrm{Prob}(\\tau_u = \\mathrm{sgn}(u)) > \\frac{1}{2} \\quad \\text{for all } u. \\qquad (3)$$ \n\nSo $\\mathrm{sgn}(t(x))$ is the best possible prediction of the output label of $x$, and the minimal \nachievable generalization error is $\\epsilon_{\\min} = \\langle \\Theta(-t(x) \\tau_{t(x)}) \\rangle_x$. In the limit of many \ninput dimensions $N$, a central limit argument yields that for a typical target rule \n$\\epsilon_{\\min} = 2 \\langle \\Theta(-u) \\theta(u) \\rangle_u$, where $u$ is a zero mean, unit variance Gaussian. The \nfunction $\\theta$ will play a considerable role in the sequel. It is a symmetrized form of \nthe probability $p(u)$ that $\\tau_u$ is equal to 1, $\\theta(u) = \\frac{1}{2}(p(u) + 1 - p(-u))$. \n\n2.3 Order Parameter Equations \n\nOne now evaluates the average of $\\ln Z$ (Eq. 1) over random drawings of training \ndata for large $N$ in terms of the order parameters \n\n$$Q = \\langle \\langle (w \\cdot \\Psi(x))^2 \\rangle_x \\rangle_w, \\quad q = \\langle (\\langle w \\rangle_w \\cdot \\Psi(x))^2 \\rangle_x \\quad \\text{and} \\quad r = Q^{-1/2} \\langle \\langle w \\cdot \\Psi(x) \\rangle_w \\, B \\cdot \\Psi(x) \\rangle_x. \\qquad (4)$$ \n\nHere the thermal average over $w$ refers to the Gibbs distribution (1). For the large \n$N$ limit, a scaling of the training set size $m$ must be specified, for which we make \nthe generic Ansatz $m = \\alpha N^l$, where $l = 1, \\ldots, L$. Focusing on the limit of large $\\beta$, \nwhere the density on the weight vectors converges to a delta peak and $q \\to Q$, we \nintroduce the rescaled order parameter $\\chi = \\beta (Q - q) / S_l$, with \n\n$$S_l = f(1) - \\sum_{i=0}^{l} c_i. \\qquad (5)$$ \n\nNote that this scaling with $S_l$ is only possible since the degree $K$ of the kernel \npolynomial $f$ is greater than $l$, and thus $S_l \\neq 0$. \n
Finally, we obtain an expression for $f_l = \\lim_{\\beta \\to \\infty} \\lim_{N \\to \\infty} \\langle\\langle \\ln Z \\rangle\\rangle \\, S_l / (\\beta N^l)$, \nwhere the double brackets denote averaging over \nall training sets of size $m$. The value of $f_l$ results from extremizing, with respect to \n$r$, $q$ and $\\chi$, the function \nit(r,q,X) = \n\n\\ \n\n-aq /0(-u)G (ru + ~v -~)) \nu,v \nX \n~ (~: - X ~ 1) (1- -(X -1)TzS~;Ct + L~=l TJ \n\nv0 \n\n(6) \n\nwhere $G(z) = \\Theta(z) z^2$, and $u$, $v$ are independent Gaussian random variables with \nzero mean and unit variance. \n\nSince the stationary value of $f_l$ is finite, $\\langle\\langle w^* \\cdot w^* \\rangle\\rangle$ is of the order $N^l$. So the \nhigher order components of $w^*$ are small, $(w^*_p)^2 \\ll 1$ for $|p| > l$, although these \ncomponents play a crucial role in ensuring that a hyperplane separating the training \npoints exists even for large $\\alpha$. But the key quantity obtained from Eq. (6) is the \nstationary value of $r$, which determines the generalization error of the SVM via \n$\\epsilon_g = 2 \\langle \\theta(-u) \\Theta(ru + \\sqrt{1 - r^2} \\, v) \\rangle_{u,v}$, and in particular $\\epsilon_g = \\epsilon_{\\min}$ for $r = 1$. \n\n2.4 Universal Asymptotics \n\nWe now specialize to the case that $l$ equals $L$, the degree of the polynomial $t(x)$ in \nthe target rule. So $m = \\alpha N^L$ and for large $\\alpha$, after some algebra, Eq. (6) yields \n\n$$r = 1 - \\frac{A(q^*)}{4 B(q^*)^2} \\frac{1}{\\alpha}, \\qquad (7)$$ \n\nwhere $B(q) = \\langle \\theta(y) \\Theta(-y + \\kappa/\\sqrt{q}) \\rangle_y$ and \n$A(q) = \\langle \\theta(y) \\Theta(-y + \\kappa/\\sqrt{q}) (-y + \\kappa/\\sqrt{q})^2 \\rangle_y$. Further $q^* = \\mathrm{argmin}_q \\, q A(q)$, and considering \nthe derivatives of $q A(q)$ for $q \\to 0$ and $q \\to \\infty$, one may show that \ncondition (3) assures that $q A(q)$ does have a minimum. \n\nEquation (7) shows that optimal generalization performance is achieved on this scale \nin the limit of large $\\alpha$. Remarkably, as long as $K > L$, the asymptote is invariant \nto the choice of the kernel since $A(q)$ and $B(q)$ are defined solely in terms of the \ntarget rule. \n\n3 Extension to other Kernels \n\nOur next goal is to understand cases where the kernel is a general function of the \ninner product or of the distance between the vectors. \n
We still assume that the \ntarget rule is of finite complexity, i.e. defined by a polynomial and corrupted by \nnoise, and that the number of examples is polynomial in $N$. Remarkably, the more \ngeneral kernels then reduce to the polynomial case in the thermodynamic limit. \nSince it is difficult to find a description of the Hilbert space for $k(x, y)$ which is useful \nfor a Statistical Physics calculation, our starting point is the dual representation: \nThe weight vector $w^*$ defining the maximal margin hyperplane $P$ can be written \nas a linear combination of the feature vectors $\\Psi(x^\\mu)$ and hence $w^* \\cdot \\Psi(y) = h(y)$, \nwhere \n\n$$h(y) = \\sum_{\\mu=1}^{m} \\lambda_\\mu \\tau^\\mu k(x^\\mu, y). \\qquad (8)$$ \n\nBy standard results of convex optimization theory the $\\lambda_\\mu$ are uniquely defined by the \nKuhn-Tucker conditions $\\lambda_\\mu \\geq 0$, $\\tau^\\mu h(x^\\mu) \\geq \\kappa$ (for all patterns), further requiring \nthat for positive $\\lambda_\\mu$ the second of the two inequalities holds as an equality. One also \nfinds that $w^* \\cdot w^* = \\sum_{\\mu=1}^{m} \\lambda_\\mu$ and for a polynomial kernel we thus obtain a bound \non $\\sum_{\\mu=1}^{m} \\lambda_\\mu$ since $w^* \\cdot w^*$ is $O(m)$. \n\nWe first consider kernels $\\phi(x \\cdot y)$, with a general continuous function $\\phi$ of the inner \nproduct, and assume that $\\phi$ can be approximated by a polynomial $f$ in the sense \nthat $\\phi(1) = f(1)$ and $\\phi(z) - f(z) = O(z^K)$ for $z \\to 0$. Now, with a probability \napproaching 1 with increasing $N$, the magnitude of $x^\\mu \\cdot x^\\nu$ is smaller than, say, $N^{-1/3}$ \nfor all different indices $\\mu$ and $\\nu$ as long as $m$ is polynomial in $N$. So, considering Eq. \n(8), for large $N$ the functions $\\phi(z)$ and $f(z)$ will only be evaluated in a small region \naround zero and at $z = 1$ when used as kernels of an SVM trained on $m = \\alpha N^L$ \nexamples. Using the fact that $\\sum_{\\mu=1}^{m} \\lambda_\\mu = O(m)$ we conclude that for large $N$ and \n$K > 3L$ the solution of the Kuhn-Tucker conditions for $f$ converges to the one for \n$\\phi$. So Eqs. (6,7) can be used to calculate the generalization error for $\\phi$ by setting \n$c_l = \\phi^{(l)}(0)/l!$ for $l = 1, \\ldots, L$, when $\\phi$ is an analytic function. Note that results in \n[4] assure that $c_l \\geq 0$ if the kernel $\\phi(x \\cdot y)$ is positive definite for all input dimensions \n$N$. Further, the same reduction to the polynomial case will hold in many instances \nwhere $\\phi$ is not analytic but just sufficiently smooth close to 0. \n\n3.1 RBF Kernels \n\nWe next turn to radial basis function (RBF) kernels where $k(x, y)$ depends only \non the Euclidean distance between two inputs, $k(x, y) = \\Phi(|x - y|^2)$. For binary \ninput components ($x_i = \\pm N^{-1/2}$) this is just the inner product case since \n$\\Phi(|x - y|^2) = \\Phi(2 - 2 x \\cdot y)$. However, for more general input distributions, e.g. Gaussian \ninput components, the fluctuations of $|x|$ around its mean value 1 have the same \nmagnitude as $x \\cdot y$ even for large $N$, and an equivalence with inner product kernels \nis not evident. \n\nOur starting point is the observation that any kernel $\\Phi(|x - y|^2)$ which is positive \ndefinite for all input dimensions $N$ is a positive mixture of Gaussians [6]. More \nprecisely $\\Phi(z) = \\int_0^\\infty e^{-kz} \\, da(k)$ where the transform $a(k)$ is nondecreasing. For \nthe special case of a single Gaussian one easily obtains features $\\Psi_p$ by rewriting \n$\\Phi(|x - y|^2) = e^{-|x-y|^2/2} = e^{-|x|^2/2} e^{x \\cdot y} e^{-|y|^2/2}$. Expanding the kernel $e^{x \\cdot y}$ into \npolynomial features yields the features $\\Psi_p(x) = e^{-|x|^2/2} x_p / \\sqrt{|p|!}$ for $\\Phi(|x - y|^2)$. \nBut, for a generic weight vector $w$ in feature space, \n\n$$w \\cdot \\Psi(x) = \\sum_p w_p \\Psi_p(x) = e^{-\\frac{1}{2}|x|^2} \\sum_p w_p \\frac{x_p}{\\sqrt{|p|!}} \\qquad (9)$$ \n\nis of order 1, and thus for large $N$ the fluctuations of $|x|$ can be neglected. \n\nThis line of argument can be extended to the case that the kernel is a finite mixture \nof Gaussians, $\\Phi(z) = \\sum_i a_i e^{-\\gamma_i^2 z/2}$ with positive coefficients $a_i$. Applying the \nreasoning for a single Gaussian to each term in the sum, one obtains a doubly \nindexed feature vector with components $\\Psi_{p,i}(x) = e^{-\\gamma_i^2 |x|^2/2} (a_i \\gamma_i^{2|p|} / |p|!)^{1/2} x_p$. \n
It \nis then straightforward to adapt the calculation of the partition function (Eqs. 1-6) \nto the doubly indexed features, showing that the kernel $\\Phi(|x - y|^2)$ yields the \nsame generalization behavior as the inner product kernel $\\Phi(2 - 2 x \\cdot y)$. Based on the \ncalculation, we expect the same equivalence to hold for general radial basis function \nkernels, i.e. an infinite mixture of Gaussians, even if it would be involved to prove \nthat the limit of many Gaussians commutes with the large $N$ limit. \n\n4 Simulations \n\nTo illustrate the general results we first consider a scenario where a linear target rule, \ncorrupted by additive Gaussian noise, is learned using different transcendental RBF \nkernels (Fig. 1). While Eq. (7) predicts that the asymptote of the generalization \nerror does not depend on the kernel, remarkably, the dependence on the kernel \nis very weak for all values of $\\alpha$. In contrast, the generalization error depends \nsubstantially on the nature of the noise process. Figure 2 shows the findings for \na quadratic rule with additive noise for the cases that the noise is Gaussian and \nbinary. In the Gaussian case a $1/\\alpha$ decay of $\\epsilon_g$ to $\\epsilon_{\\min}$ is found, whereas for binary \nnoise the decay is exponential in $\\alpha$. Note that in both cases the order parameter $r$ \napproaches 1 as $1/\\alpha$. \n\n5 Summary \n\nThe general characterization of learning curves obtained in this paper demonstrates \nthat support vector machines with high order or even transcendental kernels have \ndefinite advantages when the training data is noisy. Further, the calculations leading \nto Eq. (6) show that maximizing the margin is an essential ingredient of the \napproach: If one just chooses any hyperplane which classifies the training data correctly, \nthe scale of the learning curve is not determined by the target rule and far \nmore examples are needed to achieve good generalization. \n
Nevertheless our findings \nare at odds with many of the current theoretical motivations for maximizing the \nmargin which argue that this minimizes the effective Vapnik Chervonenkis dimension \nof the classifier and thus ensures fast convergence of the training error to the \ngeneralization error [1, 2]. Since we have considered hard margins, in contrast to the \ngeneralization error, the training error is zero, and we find no convergence between \nthe two quantities. But close to optimal generalization is achieved since maximizing \nthe margin biases the SVM to explain as much as possible of the data in terms of a \nlow order polynomial. While the Statistical Physics approach used in this paper is \nonly exactly valid in the thermodynamic limit, the numerical simulations indicate \nthat the theory is already a good approximation for a realistic number of input \ndimensions. \n\nWe thank Rainer Dietrich for useful discussion and for giving us his code for the simulations. \nThe work of M.O. was supported by the EPSRC (grant no. GR/M81601) \nand the British Council (ARC project 1037); R.U. was supported by the DFG and \nthe DAAD. \n\nReferences \n\n[1] C. Cortes and V. Vapnik. Machine Learning, 20:273-297, 1995. \n[2] N. Cristianini and J. Shawe-Taylor. Support Vector Machines. Cambridge University Press, 2000. \n[3] M. Opper and R. Urbanczik. Phys. Rev. Lett., 86:4410-4413, 2001. \n[4] R. Dietrich, M. Opper, and H. Sompolinsky. Phys. Rev. Lett., 82:2975-2978, 1999. \n[5] S. Risau-Gusman and M. Gordon. Phys. Rev. E, 62:7092-7099, 2000. \n[6] I. Schoenberg. Ann. Math., 39:811-841, 1938. \n\n[Figure 1 plot: generalization error $\\epsilon_g$ (vertical axis, with $\\epsilon_{\\min}$ marked) versus $\\alpha = P/N$ (horizontal axis); legend entries (A) to (E).] \n\nFigure 1: Linear target rule corrupted by additive Gaussian noise $\\eta$ ($\\langle \\eta \\rangle = 0$, $\\langle \\eta^2 \\rangle = 1/16$) \nand learned using different kernels. The curves are the theoretical prediction; \nsymbols show simulation results for $N = 600$ with Gaussian inputs and error bars \nare approximately the size of the symbols. (A) Gaussian kernel, $\\Phi(z) = e^{-kz}$ with \n$k = 2/3$. (B) Wiener kernel given by the non-analytic function $\\Phi(z) = e^{-c\\sqrt{z}}$. We \nchose $c \\approx 0.065$ so that the theoretical prediction for this case coincides with (A). \n(C) Gaussian kernel with $k = 1/20$; the influence of the parameter change on the \nlearning curve is minimal. (D) Perceptron, $\\phi(z) = z$. Above $\\alpha_c \\approx 7.5$ vanishing \ntraining error cannot be achieved and the SVM is undefined. (E) Kernel invariant \nasymptote for (A,B,C). \n\n[Figure 2 plot: generalization error (vertical axis, with $\\epsilon_{\\min}$ marked) versus $\\alpha = P/N^2$ (horizontal axis).] \n\nFigure 2: A noisy quadratic rule ($T_1 = 0$, $T_2 = 1$) learned using the Gaussian kernel \nwith $k = 1/20$. The upper curve (simulations) is for additive Gaussian noise as \nin Fig. 1. The lower curve (simulations) is for binary noise, $\\eta = \\pm s$ with equal \nprobability. We chose $s \\approx 0.20$ so that the value of $\\epsilon_{\\min}$ is the same for the two \nnoise processes. The inset shows that $\\epsilon_g$ decays as $1/\\alpha$ for Gaussian noise, whereas \nan exponential decay is found in the binary case. The dashed curves are the kernel \ninvariant asymptotes. The simulations are for $N = 90$ with Gaussian inputs, and \nstandard errors are approximately the size of the symbols. \n\n", "award": [], "sourceid": 2123, "authors": [{"given_name": "Manfred", "family_name": "Opper", "institution": null}, {"given_name": "Robert", "family_name": "Urbanczik", "institution": null}]}*