{"title": "Understanding Stepwise Generalization of Support Vector Machines: a Toy Model", "book": "Advances in Neural Information Processing Systems", "page_first": 321, "page_last": 327, "abstract": null, "full_text": "Understanding stepwise generalization of Support Vector Machines: a toy model

Sebastian Risau-Gusman and Mirta B. Gordon
DRFMC/SPSMS CEA Grenoble, 17 av. des Martyrs
38054 Grenoble cedex 09, France

Abstract

In this article we study the effects of introducing structure in the input distribution of the data to be learnt by a simple perceptron. We determine the learning curves within the framework of Statistical Mechanics. Stepwise generalization occurs as a function of the number of examples when the distribution of patterns is highly anisotropic. Although extremely simple, the model seems to capture the relevant features of a class of Support Vector Machines which was recently shown to present this behavior.

1 Introduction

A new approach to learning has recently been proposed as an alternative to feedforward neural networks: the Support Vector Machines (SVM) [1]. Instead of trying to learn a non-linear mapping between the input patterns and internal representations, as in multilayered perceptrons, the SVMs choose a priori a non-linear kernel that transforms the input space into a high-dimensional feature space. In binary classification tasks like those considered in the present paper, the SVMs look for the linear separation with optimal margin in feature space. The main advantage of SVMs is that learning becomes a convex optimization problem. The difficulties raised by the many local minima that hinder the training of multilayered neural networks are thus avoided. One of the questions raised by this approach is why SVMs do not overfit the data in spite of the extremely large dimensions of the feature spaces considered.
Two recent theoretical papers [2, 3] studied a family of SVMs with the tools of Statistical Mechanics, predicting typical properties in the limit of large dimensional spaces. Both papers considered mappings generated by polynomial kernels, and more specifically quadratic ones. In these, the input vectors x \in R^N are transformed into N(N+1)/2-dimensional feature vectors \Phi(x). More precisely, the mapping \Phi_1(x) = (x, x_1 x, x_2 x, \dots, x_k x) has been studied in [3] as a function of k, the number of quadratic features, and \Phi_2(x) = (x, x_1 x/N, x_2 x/N, \dots, x_N x/N) has been considered in [2], leading to different results. These mappings are particular cases of quadratic kernels. In particular, in the case of learning quadratically separable tasks with mapping \Phi_2, the generalization error decreases up to a lower bound for a number of examples proportional to N, followed by a further decrease if the number of examples increases proportionally to the dimension of the feature space, i.e. to N^2. In fact, this behavior is not specific to SVMs. It also arises in the typical case of Gibbs learning (defined below) in quadratic feature spaces [4]: on increasing the training set size, the quadratic components of the discriminating surface are learnt after the linear ones. In the case of learning linearly separable tasks in quadratic feature spaces, the effect of overfitting is harmless, as it only slows down the decrease of the generalization error with the training set size. In the case of mapping \Phi_1, ... The case \sigma > 1 may be deduced from the former through a straightforward rescaling of N_c and N_u. Hereafter, the subspace of dimension N_c and variance \sigma^2 will be called the compressed subspace. The corresponding orthogonal subspace, of dimension N_u = N - N_c, will be called the uncompressed subspace.
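As an aside, a quadratic feature map of this kind is easy to write down explicitly. The sketch below is our own illustration, not code from [2] or [3]: it keeps the N linear components of x followed by the N(N+1)/2 distinct products x_i x_j, with an optional 1/N scaling of the quadratic part in the spirit of \Phi_2; the exact component conventions of the papers differ slightly.

```python
import numpy as np

def quadratic_features(x, scale=True):
    """Map x in R^N to its N linear components followed by the N(N+1)/2
    distinct quadratic products x_i * x_j (i <= j); if scale is True the
    quadratic part is compressed by 1/N, in the spirit of the mapping Phi_2."""
    N = x.shape[0]
    i, j = np.triu_indices(N)     # all index pairs with i <= j
    quad = x[i] * x[j]
    if scale:
        quad = quad / N           # the 1/N compression of the quadratic features
    return np.concatenate([x, quad])

phi = quadratic_features(np.ones(4))
print(phi.shape)  # (14,): 4 linear + 10 quadratic components
```

For large N the quadratic part dominates in dimension (about N^2/2 components), which is why the scaling chosen for those components matters.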
We study the typical generalization error of a student perceptron learning the classification task, using the tools of Statistical Mechanics. The pertinent cost function is the number of misclassified patterns:

E(w; V_\alpha) = \sum_{\mu=1}^{P} \Theta(-\tau^\mu \, \xi^\mu \cdot w).  (2)

The weight vectors in version space correspond to a vanishing cost (2). Choosing a w at random from the a posteriori distribution

P(w|V_\alpha) = Z^{-1} P_0(w) \exp(-\beta E(w; V_\alpha)),  (3)

in the limit \beta \to \infty is called Gibbs learning. In eq. (3), \beta is equivalent to an inverse temperature in the Statistical Mechanics formulation, the cost (2) being the energy function. We assume that P_0, the a priori distribution of the weights, is uniform on the hypersphere of radius \sqrt{N}:

P_0(w) = (2\pi e)^{-N/2} \, \delta(w \cdot w - N).  (4)

The normalization constant (2\pi e)^{N/2} is the leading-order term of the hypersphere's surface in N-dimensional space. Z is the partition function ensuring the correct normalization of P(w|V_\alpha):

Z(\beta; V_\alpha) = \int dw \, P_0(w) \exp(-\beta E(w; V_\alpha)).  (5)

In general, the properties of the student are related to those of the free energy F(\beta; V_\alpha) = -\ln Z(\beta; V_\alpha)/\beta. In the limit N \to \infty with the training set size per input dimension \alpha \equiv P/N constant, the properties of the student weights become independent of the particular training set V_\alpha. They are deduced from the averaged free energy per degree of freedom, calculated using the replica trick:

f(\beta) = -\lim_{N \to \infty} \frac{1}{\beta N} \overline{\ln Z(\beta; V_\alpha)}, \qquad \overline{\ln Z} = \lim_{n \to 0} \frac{\overline{Z^n} - 1}{n},  (6)

where the overline represents the average over V_\alpha, composed of patterns selected according to (1). In the case of Gibbs learning, the typical behavior of any intensive quantity is obtained in the zero-temperature limit \beta \to \infty. In this limit, only error-free solutions, with vanishing cost, have non-vanishing posterior probability (3).
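Since at zero temperature the posterior is uniform on the version space, Gibbs learning can be mimicked in small dimensions by plain rejection sampling: draw w uniformly on the sphere w \cdot w = N and keep the first draw with vanishing cost (2). The sketch below is a toy illustration of this definition; the sizes, names and sampling strategy are ours, not the authors' method of computation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_student(xi, tau, rng, max_tries=200000):
    """Zero-temperature Gibbs learning by rejection sampling: draw w uniformly
    on the sphere w.w = N until all P patterns are correctly classified,
    i.e. until w lies in the version space."""
    N = xi.shape[1]
    for _ in range(max_tries):
        w = rng.standard_normal(N)
        w *= np.sqrt(N) / np.linalg.norm(w)      # normalize onto w.w = N
        if np.all(tau * (xi @ w) > 0):           # vanishing cost E(w) = 0
            return w
    raise RuntimeError("no error-free student found")

# teacher-generated toy task with alpha = P/N = 0.5
N, P = 10, 5
teacher = rng.standard_normal(N)
teacher *= np.sqrt(N) / np.linalg.norm(teacher)
xi = rng.standard_normal((P, N))                 # isotropic patterns
tau = np.sign(xi @ teacher)                      # teacher labels
w = gibbs_student(xi, tau, rng)
print(np.all(tau * (xi @ w) > 0))                # True: w is in the version space
```

Rejection sampling is exponentially slow in P and only serves to make the definition concrete; the Statistical Mechanics computation replaces it by the replica average (6).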
Thus, Gibbs learning corresponds to picking at random a student in version space, i.e. a vector w that classifies correctly the training set V_\alpha, with a probability proportional to P_0(w).

In the case of an isotropic pattern distribution, which corresponds to \sigma = 1 in (1), the properties of cost function (2) have been extensively studied [5]. The case of patterns drawn from two gaussian clusters whose symmetry axis is the same as [6], or different from [7], the teacher's axis has recently been addressed. Here we consider the problem where, instead of a single direction along which the patterns' distribution is contracted (or expanded), there is a finite fraction of compressed dimensions. In this case, all the properties of the student perceptron may be expressed in terms of the following order parameters, which have to satisfy the corresponding extremum conditions of the free energy:

\bar{q}_c^{ab} = \frac{1}{N} \sum_{i \in N_c} \langle w_i^a w_i^b \rangle,  (7)

\bar{q}_u^{ab} = \frac{1}{N} \sum_{i \in N_u} \langle w_i^a w_i^b \rangle,  (8)

\bar{R}_c^a = \frac{1}{N} \sum_{i \in N_c} \langle w_i^a w_i^* \rangle,  (9)

\bar{R}_u^a = \frac{1}{N} \sum_{i \in N_u} \langle w_i^a w_i^* \rangle,  (10)

Q^a = \frac{1}{N} \sum_{i \in N_c} \langle (w_i^a)^2 \rangle,  (11)

where \langle \cdots \rangle indicates the average over the posterior (3); a, b are replica indices, and the subscripts c and u stand for compressed and uncompressed respectively. Notice that we do not impose that Q^a, the typical squared norm of the student's components in the compressed subspace, be equal to the corresponding teacher's norm Q^* = \sum_{i \in N_c} (w_i^*)^2 / N.

3 Order parameters and learning curves

Assuming that the order parameters are invariant under permutation of replicas, we can drop the replica indices in equations (7) to (11). We expect this hypothesis of replica symmetry to be consistent, as it is in other cases of perceptrons learning realizable tasks. The problem is thus reduced to the determination of five order parameters.
Their meaning becomes clearer if we consider the following combinations:

q_c = \bar{q}_c / Q,  (12)

q_u = \bar{q}_u / (1 - Q),  (13)

R_c = \bar{R}_c / (\sqrt{Q} \sqrt{Q^*}),  (14)

R_u = \bar{R}_u / (\sqrt{1-Q} \sqrt{1-Q^*}),  (15)

Q = \frac{1}{N} \sum_{i \in N_c} \langle (w_i)^2 \rangle.  (16)

q_c and q_u are the typical overlaps between the components of two student vectors in the compressed and the uncompressed subspaces respectively. Similarly, R_c and R_u are the corresponding overlaps between a typical student and the teacher. In terms of this set of parameters, the typical generalization error is \epsilon_g = (1/\pi) \arccos R, with

R = \frac{\sigma^2 R_c \sqrt{Q Q^*} + R_u \sqrt{(1-Q)(1-Q^*)}}{\sqrt{\sigma^2 Q + (1-Q)} \, \sqrt{\sigma^2 Q^* + (1-Q^*)}}.  (17)

Given \alpha, the general solution of the extremum conditions depends on the three parameters of the problem, namely \sigma, Q^* and n_c \equiv N_c/N. An interesting case is the one where the teacher's anisotropy is consistent with that of the patterns' distribution, i.e. Q^* = n_c. In this case, it is easy to show that Q = Q^*, q_c = R_c and q_u = R_u. Thus,

R = \frac{n_u R_u + \sigma^2 n_c R_c}{n_u + \sigma^2 n_c},  (18)

where n_u \equiv N_u/N, and R_c and R_u are given by the following equations:

\frac{R_c}{1 - R_c} = \frac{\sigma^2}{\sigma^2 n_c + n_u} \, \frac{\alpha}{\pi \sqrt{1-R}} \int Dt \, \frac{\exp(-R t^2/2)}{H(t \sqrt{R})},  (19)
\frac{R_u}{1 - R_u} = \frac{1}{\sigma^2 n_c + n_u} \, \frac{\alpha}{\pi \sqrt{1-R}} \int Dt \, \frac{\exp(-R t^2/2)}{H(t \sqrt{R})},  (20)

where Dt = dt \exp(-t^2/2)/\sqrt{2\pi} and H(x) = \int_x^\infty Dt. If \sigma^2 = 1, we recover the equations corresponding to Gibbs learning of isotropic pattern distributions [5].

Figure 1: Order parameters and generalization error for the case Q^* = n_c = 0.9, \sigma^2 = 10^{-2}. The curves for the case of spherically distributed patterns are shown for comparison. The inset shows the first step of learning and its plateau (see text).

The order parameters are represented as a function of \alpha in Figure 1, for a particular choice of n_c and \sigma. R_u grows much faster than R_c, meaning that it is easier to learn the components of the uncompressed space. As a result, R (and therefore the generalization error \epsilon_g) presents a cross-over between two behaviors. At small \alpha, both R_u \ll 1 and R_c \ll 1, so that R(\alpha, \sigma^2) = R_G(\alpha (n_u + \sigma^4 n_c)/(n_u + \sigma^2 n_c)^2), where R_G is the overlap for Gibbs learning with an isotropic (\sigma^2 = 1) distribution [5]. Learning the anisotropic distribution is faster (in \alpha) than learning the isotropic one. If \sigma \ll 1 the anisotropy is very large and R increases like R_G but with an effective training set size per input dimension ~ \alpha/n_u > \alpha. On increasing \alpha, there is an intermediate regime in which R_u increases but R_c \ll 1, so that R \approx R_u n_u/(n_u + \sigma^2 n_c). The corresponding generalization error seems to reach a plateau corresponding to R_u = 1 and R_c = 0. At \alpha \gg 1, R(\alpha, \sigma^2) \approx R_G(\alpha); the asymptotic behavior is independent of the details of the distribution, as in [7]. The crossover between these two regimes, when \sigma^2 \ll 1, occurs at \alpha_0 \approx \sqrt{2(n_u + \sigma^2 n_c)}/(\sigma^2 n_c).

The cases Q^* = 1 and Q^* = 0 are also of interest. Q^* = 1 corresponds to a teacher having all the weight components in the compressed subspace, whereas Q^* = 0
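The coupled equations (18)-(20) can be solved by a damped fixed-point iteration, which is how curves like those of Figure 1 may be reproduced. The sketch below is our own; since equations (19) and (20) are transcribed from a scanned source, the prefactors encoded here should be checked against the original before being relied upon.

```python
import math

def H(x):
    """Gaussian tail H(x) = integral_x^inf Dt, with Dt = dt exp(-t^2/2)/sqrt(2 pi)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def gibbs_rhs(R, alpha, npts=4001):
    """(alpha / (pi sqrt(1-R))) * integral Dt exp(-R t^2/2) / H(t sqrt(R)),
    the common right-hand side of eqs. (19)-(20), by a Riemann sum on [-8, 8]."""
    dt = 16.0 / (npts - 1)
    s = sum(math.exp(-(1.0 + R) * t * t / 2.0) / H(t * math.sqrt(R))
            for t in (-8.0 + k * dt for k in range(npts)))
    return alpha * s * dt / (math.sqrt(2.0 * math.pi) * math.pi * math.sqrt(1.0 - R))

def solve_order_parameters(alpha, nc, sigma2, iters=500, damp=0.5):
    """Damped fixed-point iteration of eqs. (18)-(20); returns (Rc, Ru, eg)."""
    nu = 1.0 - nc
    Rc = Ru = 0.01
    for _ in range(iters):
        R = (nu * Ru + sigma2 * nc * Rc) / (nu + sigma2 * nc)   # eq. (18)
        F = gibbs_rhs(min(R, 1.0 - 1e-9), alpha)
        Ac = sigma2 * F / (sigma2 * nc + nu)                    # eq. (19): Rc/(1-Rc)
        Au = F / (sigma2 * nc + nu)                             # eq. (20): Ru/(1-Ru)
        Rc = (1.0 - damp) * Rc + damp * Ac / (1.0 + Ac)
        Ru = (1.0 - damp) * Ru + damp * Au / (1.0 + Au)
    R = (nu * Ru + sigma2 * nc * Rc) / (nu + sigma2 * nc)
    return Rc, Ru, math.acos(R) / math.pi

Rc, Ru, eg = solve_order_parameters(alpha=2.0, nc=0.9, sigma2=0.01)
print(Ru > Rc)  # True: the uncompressed components are learnt first
```

With \sigma^2 \ll 1 the iteration displays the intermediate regime discussed above, where R_u approaches 1 while R_c remains small.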
corresponds to a teacher orthogonal to the compressed subspace, i.e. with all the components in the uncompressed subspace. They correspond respectively to tasks where either the uncompressed or the compressed components are irrelevant for the patterns' classification. In Figure 2 we show all the generalization error curves, including the generalization error \epsilon_g^G for a uniform distribution [5] for comparison.

Figure 2: Generalization errors as a function of \alpha for different teachers (Q^* = 1, Q^* = 0.9 and Q^* = 0), for the case n_c = 0.9 and \sigma^2 = 10^{-2}. The curve for spherically distributed patterns [5] is included for comparison. The inset shows the large-\alpha behavior.

The behavior of \epsilon_g(\alpha) is very sensitive to the value of Q^*. If Q^* = 1, the teacher is in the compressed subspace, where learning is difficult. Consequently, \epsilon_g(\alpha) > \epsilon_g^G(\alpha), as expected. On the contrary, for Q^* = 0, only the components in the uncompressed space are relevant for the classification task. In this subspace learning is easy and \epsilon_g(\alpha) < \epsilon_g^G(\alpha). For Q^* \neq 0, 1 there is a crossover between these regimes, as already discussed. All the curves merge in the asymptotic regime \alpha \to \infty, as may be seen in the inset of Figure 2.

4 Discussion

We analyzed the typical learning behavior of a toy perceptron model that allows us to clarify some aspects of generalization in high-dimensional feature spaces.
In particular, it captures an element essential to obtain stepwise learning, which is shown to stem from the compression of high-order features. The components in the compressed space are more difficult to learn than those that are not compressed. Thus, if the training set is not large enough, mainly the latter are learnt.

Our results allow us to understand the importance of the scaling of high-order features in the SVM kernels. In fact, with SVMs one has to choose a priori the kernel that maps the input space to the feature space. If high-order features are conveniently compressed, hierarchical learning occurs. That is, low-order features are learnt first; higher-order features are only learnt if the training set is large enough. In the cases where the higher-order features are irrelevant, it is likely that they will not hinder the learning process. This interesting behavior allows overfitting to be avoided. Computer simulations currently in progress, of SVMs generated by quadratic kernels with and without the 1/N scaling, show a behavior consistent with the theoretical predictions [2, 3]. These may be understood with the present toy model.

References

[1] V. Vapnik (1995) The nature of statistical learning theory. Springer Verlag, New York.

[2] R. Dietrich, M. Opper, and H. Sompolinsky (1999) Statistical Mechanics of Support Vector Networks. Phys. Rev. Lett. 82, 2975-2978.

[3] A. Buhot and M. B. Gordon (1999) Statistical mechanics of support vector machines. ESANN'99 - European Symposium on Artificial Neural Networks Proceedings, Michel Verleysen ed., 201-206; A. Buhot and M. B. Gordon (1998) Learning properties of support vector machines. cond-mat/9802179.

[4] H. Yoon and J.-H. Oh (1998) Learning of higher order perceptrons with tunable complexities. J. Phys. A: Math. Gen. 31, 7771-7784.

[5] G. Gyorgyi and N.
Tishby (1990) Statistical Theory of Learning a Rule. In Neural Networks and Spin Glasses (W. K. Theumann and R. Koberle, eds., World Scientific), 3-36.

[6] R. Meir (1995) Empirical risk minimization. A case study. Neural Comp. 7, 144-157.

[7] C. Marangi, M. Biehl, S. A. Solla (1995) Supervised Learning from Clustered Examples. Europhys. Lett. 30 (2), 117-122.", "award": [], "sourceid": 1769, "authors": [{"given_name": "Sebastian", "family_name": "Risau-Gusman", "institution": null}, {"given_name": "Mirta", "family_name": "Gordon", "institution": null}]}