{"title": "Semiparametric Support Vector and Linear Programming Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 585, "page_last": 591, "abstract": null, "full_text": "Semiparametric Support Vector and \n\nLinear Programming Machines \n\nAlex J. Smola, Thilo T. Frie6, and Bernhard Scholkopf \n\nGMD FIRST, Rudower Chaussee 5, 12489 Berlin \n\n{smola, friess, bs }@first.gmd.de \n\nAbstract \n\nSemiparametric models are useful tools in the case where domain \nknowledge exists about the function to be estimated or emphasis is \nput onto understandability of the model. We extend two learning \nalgorithms - Support Vector machines and Linear Programming \nmachines to this case and give experimental results for SV ma(cid:173)\nchines. \n\n1 \n\nIntroduction \n\nOne of the strengths of Support Vector (SV) machines is that they are nonparamet(cid:173)\nric techniques, where one does not have to e.g. specify the number of basis functions \nbeforehand. In fact, for many of the kernels used (not the polynomial kernels) like \nGaussian rbf- kernels it can be shown [6] that SV machines are universal approxi(cid:173)\nmators. \n\nWhile this is advantageous in general, parametric models are useful techniques in \ntheir own right. Especially if one happens to have additional knowledge about the \nproblem, it would be unwise not to take advantage of it. For instance it might be \nthe case that the major properties of the data are described by a combination of a \nsmall set of linear independent basis functions {\u00a2Jt (.), ... , \u00a2n (.)}. Or one may want \nto correct the data for some (e.g. linear) trends. Secondly it also may be the case \nthat the user wants to have an understandable model, without sacrificing accuracy. \nFor instance many people in life sciences tend to have a preference for linear models. 
\nThis may be some motivation to construct semiparametric models, which are both easy to understand (for the parametric part) and perform well (often due to the nonparametric term). For more advocacy on semiparametric models see [1]. \n\nA common approach is to fit the data with the parametric model and train the nonparametric add-on on the errors of the parametric part, i.e. to fit the nonparametric part to the residuals. We show in Sec. 4 that this is useful only in a very restricted \n\n586 \n\nA. J. Smola, T. T. Frieß and B. Schölkopf \n\nsituation. In general it is impossible to find the best model amongst a given class for different cost functions by doing so. The better way is to solve a convex optimization problem as in standard SV machines, however with a different set of admissible functions \n\nf(x) = ⟨w, ψ(x)⟩ + Σ_{i=1}^n β_i φ_i(x). (1) \n\nNote that this is not so much different from the classical SV setting [10], where one uses functions of the type \n\nf(x) = ⟨w, ψ(x)⟩ + b. (2) \n\n2 Semiparametric Support Vector Machines \n\nLet us now treat this setting more formally. For the sake of simplicity in the exposition we will restrict ourselves to the case of SV regression and only deal with the ε-insensitive loss function |ξ|_ε = max{0, |ξ| − ε}. Extensions of this setting are straightforward and follow the lines of [7]. \nGiven a training set of size ℓ, X := {(x_1, y_1), ..., (x_ℓ, y_ℓ)}, one tries to find a function f that minimizes the functional of the expected risk¹ \n\nR[f] = ∫ c(f(x) − y) p(x, y) dx dy. (3) \n\nHere c(ξ) denotes a cost function, i.e. how much deviations between prediction and actual training data should be penalized. Unless stated otherwise we will use c(ξ) = |ξ|_ε. \nAs we do not know p(x, y) we can only compute the empirical risk R_emp[f] (i.e. the training error). 
Yet, minimizing the latter is not a good idea if the model class is sufficiently rich, as it will lead to overfitting. Hence one adds a regularization term T[f] and minimizes the regularized risk functional \n\nR_reg[f] = Σ_{i=1}^ℓ c(f(x_i) − y_i) + λ T[f] with λ > 0. (4) \n\nThe standard choice in SV regression is to set T[f] = ½‖w‖². \nThis is the point of departure from the standard SV approach. While in the latter f is described by (2), we will expand f in terms of (1). Effectively this means that there exist functions φ_1(·), ..., φ_n(·) whose contribution is not regularized at all. If n is sufficiently smaller than ℓ this need not be a major concern, as the VC-dimension of this additional class of linear models is n; hence the overall capacity control will still work, provided the nonparametric part is restricted sufficiently. Figure 1 explains the effect of choosing a different structure in detail. \n\nSolving the optimization equations for this particular choice of a regularization term, with expansion (1), the ε-insensitive loss function, and introducing kernels \n\n¹ More general definitions, mainly in terms of the cost function, do exist, but for the sake of clarity in the exposition we ignored these cases. See [10] or [7] for further details on alternative definitions of risk functionals. \n\nFigure 1: Two different nested subsets (solid and dotted lines) of hypotheses and the optimal model (+) in the realizable case. 
Observe that the optimal model is already contained in a much smaller subset of the structure with solid lines (in this diagram size corresponds to the capacity of a subset) than in the structure denoted by the dotted lines. Hence prior knowledge in choosing the structure can have a large effect on generalization bounds and performance. \n\nFollowing [2] we arrive at the following primal optimization problem: \n\nminimize ½‖w‖² + (1/λ) Σ_{i=1}^ℓ (ξ_i + ξ_i*) \n\nsubject to ⟨w, ψ(x_i)⟩ + Σ_{j=1}^n β_j φ_j(x_i) − y_i ≤ ε + ξ_i \n y_i − ⟨w, ψ(x_i)⟩ − Σ_{j=1}^n β_j φ_j(x_i) ≤ ε + ξ_i* \n ξ_i, ξ_i* ≥ 0 (5) \n\nHere k(x, x') has been written as ⟨ψ(x), ψ(x')⟩. Solving (5) for its Wolfe dual yields \n\nmaximize −½ Σ_{i,j=1}^ℓ (α_i − α_i*)(α_j − α_j*) k(x_i, x_j) − ε Σ_{i=1}^ℓ (α_i + α_i*) + Σ_{i=1}^ℓ y_i (α_i − α_i*) \n\nsubject to Σ_{i=1}^ℓ (α_i − α_i*) φ_j(x_i) = 0 for all 1 ≤ j ≤ n \n α_i, α_i* ∈ [0, 1/λ] (6) \n\nNote the similarity to the standard SV regression model. The objective function and the box constraints on the Lagrange multipliers α_i, α_i* remain unchanged. The only modification comes from the additional unregularized basis functions. Whereas in the standard SV case we only had a single (constant) function b·1, we now have an expansion in the basis β_i φ_i(·). This gives rise to n constraints instead of one. Finally f can be found as \n\nf(x) = Σ_{i=1}^ℓ (α_i − α_i*) k(x_i, x) + Σ_{i=1}^n β_i φ_i(x), since w = Σ_{i=1}^ℓ (α_i − α_i*) ψ(x_i). (7) \n\nThe only difficulty remaining is how to determine the β_i. This can be done by exploiting the Karush-Kuhn-Tucker optimality conditions or, much more easily, by using an interior point optimization code [9]. In the latter case the variables β_i can be obtained as the dual variables of the dual (dual dual = primal) optimization problem (6) as a by-product of the optimization process. 
This is also how these variables have been obtained in the experiments in the current paper. \n\n3 Semiparametric Linear Programming Machines \n\nEquation (4) gives rise to the question whether completely different choices of regularization functionals would not also lead to good algorithms. Again we will allow functions as described in (7). Possible choices are \n\nT[f] = ½‖w‖² + Σ_{i=1}^n |β_i| (8) \n\nor \n\nT[f] = Σ_{i=1}^ℓ |α_i − α_i*| (9) \n\nor \n\nT[f] = Σ_{i=1}^ℓ |α_i − α_i*| + ½ Σ_{i,j=1}^n β_i β_j M_ij (10) \n\nfor some positive semidefinite matrix M. This is a simple extension of existing methods like Basis Pursuit [3] or Linear Programming Machines for classification (see e.g. [4]). The basic idea in all these approaches is to have two different sets of basis functions that are regularized differently, or where a subset may not be regularized at all. This is an efficient way of encoding prior knowledge or the preference of the user, as the emphasis obviously will be put mainly on the functions with little or no regularization at all. Eq. (8) is essentially the SV estimation model where an additional linear regularization term has been added for the parametric part. In this case the constraints of the optimization problem (6) change into \n\n−1 ≤ Σ_{i=1}^ℓ (α_i − α_i*) φ_j(x_i) ≤ 1 for all 1 ≤ j ≤ n \n α_i, α_i* ∈ [0, 1/λ] (11) \n\nIt makes little sense (from a technical viewpoint) to compute Wolfe's dual objective function in (10), as the problem does not get significantly easier by doing so. The best approach is to solve the corresponding optimization problem directly by some linear or quadratic programming code, e.g. [9]. Finally, (10) can be reduced to the case of (8) by renaming variables accordingly and with a proper choice of M. 
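As a numerical illustration of the joint optimization of Secs. 2-3 (this is our own minimal sketch, not the paper's algorithm: the ε-insensitive loss is replaced by a squared loss so that the joint problem over the kernel coefficients a and the unregularized parametric weights b has a closed-form solution; the helper names `fit_semiparametric` and `rbf_kernel` and all parameter values are our own):

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    """Gaussian rbf kernel k(x, x') = exp(-(x - x')^2 / (2 sigma^2)) for 1-d inputs."""
    return np.exp(-(X[:, None] - Z[None, :]) ** 2 / (2.0 * sigma ** 2))

def fit_semiparametric(K, Phi, y, lam):
    """Jointly minimize ||K a + Phi b - y||^2 + lam * a^T K a over the kernel
    coefficients a and the *unregularized* parametric weights b.  Setting both
    gradients to zero gives a single symmetric linear system in (a, b)."""
    l, n = Phi.shape
    A = np.block([[K @ K + lam * K, K @ Phi],
                  [Phi.T @ K,       Phi.T @ Phi]])
    rhs = np.concatenate([K @ y, Phi.T @ y])
    sol = np.linalg.solve(A + 1e-8 * np.eye(l + n), rhs)  # tiny ridge for stability
    return sol[:l], sol[l:]

# Toy check: constant data is explained entirely by the unregularized constant
# basis function, so the penalized kernel part stays (numerically) at zero.
X = np.linspace(0.0, 10.0, 30)
y = np.full(30, 5.0)
K = rbf_kernel(X, X)
Phi = np.ones((30, 1))  # parametric part: the constant function
a, b = fit_semiparametric(K, Phi, y, lam=1.0)
```

Because the parametric weights carry no penalty, the fit assigns the constant level entirely to b, mirroring how the constraints in (6) replace the single bias constraint of standard SV regression by one constraint per unregularized basis function.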
\n4 Why Backfitting is not sufficient \n\nOne might think that the approach presented above is quite unnecessary and overly complicated for semiparametric modelling. In fact, one could try to fit the data to the parametric model first, and then fit the nonparametric part to the residuals. In most cases, however, this does not lead to finding the minimum of (4). We will show this with a simple example. \nTake a SV machine with linear kernel (i.e. k(x, x') = ⟨x, x'⟩) in one dimension and a constant term as parametric part (i.e. f(x) = wx + β). This is one of the simplest semiparametric SV machines possible. Now suppose the data was generated without noise by \n\ny_i = x_i where x_i ≥ 1. (12) \n\nClearly then also y_i ≥ 1 for all i. By construction the best overall fit of the pair (β, w) will be arbitrarily close to (0, 1) if the regularization parameter λ is chosen sufficiently small. \n\nFor backfitting one first carries out the parametric fit to find a constant β minimizing the term Σ_{i=1}^ℓ c(y_i − β). Depending on the chosen cost function c(·), β will be the mean (L2 error), the median (L1 error), etc., of the set {y_1, ..., y_ℓ}. As all y_i ≥ 1, also β ≥ 1, which is surely not the optimal solution of the overall problem, as there β would be close to 0 as seen above. \n\nFigure 2: Left: Basis functions used in the toy example. Note the different length scales of sin x and sinc 2πx. For convenience the functions were shifted by offsets of 2 and 4 respectively. Right: Training data denoted by '+', nonparametric (dash-dotted line), semiparametric (solid line), and parametric regression (dots). The regularization constant was set to λ = 2. Observe that the semiparametric model picks up the characteristic wiggles of the original function. 
Hence not even in the simplest of all settings does backfitting minimize the regularized risk functional, and thus one cannot expect it to do so in the more complex case either. There exists only one case in which backfitting would suffice, namely if the function spaces spanned by the kernel expansion {k(x_i, ·)} and by {φ_i(·)} were orthogonal. Consequently, in general one has to solve jointly for both the parametric and the nonparametric part. \n\n5 Experiments \n\nThe main goal of the experiments shown is a proof of concept and to display the properties of the new algorithm. We study a modification of the Mexican hat function, namely \n\nf(x) = sin x + sinc(2π(x − 5)). (13) \n\nData is generated by an additive noise process, i.e. y_i = f(x_i) + ξ_i, where ξ_i is additive noise. For the experiments we chose Gaussian rbf-kernels with width σ = 1/4, normalized to maximum output 1. The noise is uniform with 0.2 standard deviation, and the cost function is the ε-insensitive loss |·|_ε with ε = 0.05. Unless stated otherwise, averaging is done over 100 datasets with 50 samples each. The x_i are drawn uniformly from the interval [0, 10]. L1 and L2 errors are computed on the interval [0, 10] with uniform measure. Figure 2 shows the function and typical predictions in the nonparametric, semiparametric, and parametric settings. One can observe that the semiparametric model including sin x, cos x and the constant function as basis functions generalizes better than the standard SV machine. Fig. 3 shows that the generalization performance is better in the semiparametric case. The length of the weight vector of the kernel expansion, ‖w‖, is displayed in Fig. 4. It is smaller in the semiparametric case for practical values of the regularization strength. To make a more realistic comparison, model selection (how to determine 1/λ) was carried out by 10-fold cross validation for both algorithms independently for all 100 datasets. 
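The toy setting can be made concrete in a few lines. The following self-contained sketch is ours, not the paper's code: a squared loss and a closed-form joint solve stand in for the ε-insensitive SV optimization, and the noise is omitted so the comparison is deterministic; all function names and parameter values are our own.

```python
import numpy as np

def rbf(X, Z, sigma=0.25):
    # Gaussian rbf kernel with width sigma = 1/4, maximum output 1
    return np.exp(-(X[:, None] - Z[None, :]) ** 2 / (2.0 * sigma ** 2))

def fit(K, Phi, y, lam):
    # Joint squared-loss fit: min ||K a + Phi b - y||^2 + lam * a^T K a,
    # with the parametric weights b left unregularized (cf. Sec. 2).
    l, n = Phi.shape
    A = np.block([[K @ K + lam * K, K @ Phi], [Phi.T @ K, Phi.T @ Phi]])
    rhs = np.concatenate([K @ y, Phi.T @ y])
    sol = np.linalg.solve(A + 1e-8 * np.eye(l + n), rhs)
    return sol[:l], sol[l:]

def target(x):
    # np.sinc(u) = sin(pi u)/(pi u), so sinc(2 pi (x - 5)) = np.sinc(2 (x - 5))
    return np.sin(x) + np.sinc(2.0 * (x - 5.0))

def basis(x, parametric=False):
    cols = [np.ones_like(x)]            # constant function (standard SV bias)
    if parametric:
        cols += [np.sin(x), np.cos(x)]  # unregularized semiparametric basis
    return np.stack(cols, axis=1)

Xtr = np.linspace(0.0, 10.0, 50)
ytr = target(Xtr)                       # noise-free for a deterministic comparison
Xte = np.linspace(0.0, 10.0, 500)
K = rbf(Xtr, Xtr)

errors = {}
for name, par in [("nonparametric", False), ("semiparametric", True)]:
    a, b = fit(K, basis(Xtr, par), ytr, lam=1.0)
    pred = rbf(Xte, Xtr) @ a + basis(Xte, par) @ b
    errors[name] = np.sqrt(np.mean((pred - target(Xte)) ** 2))
```

With regularization this strong, the plain kernel expansion shrinks the sin x component along with everything else, while the semiparametric model keeps it at full amplitude, so its L2 test error comes out smaller, qualitatively in line with Fig. 3.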
\nTable 1 shows the generalization performance for a nonparametric model, a correctly chosen, and an incorrectly chosen semiparametric model. The experiments indicate that cases in which prior knowledge exists on the type of functions to be used will benefit from semiparametric modelling. Future experiments will show how much can be gained in real world examples. \n\nFigure 3: L1 error (left) and L2 error (right) of the nonparametric / semiparametric regression computed on the interval [0, 10] vs. the regularization strength 1/λ. The dotted lines (although hardly visible) denote the variance of the estimate. Note that in both error measures the semiparametric model consistently outperforms the nonparametric one. \n\nFigure 4: Length of the weight vector w in feature space, (Σ_{i,j} (α_i − α_i*)(α_j − α_j*) k(x_i, x_j))^{1/2}, vs. regularization strength. Note that ‖w‖, controlling the capacity of the part of the function belonging to the kernel expansion, is smaller (for practical choices of the regularization term) in the semiparametric than in the nonparametric model. If this difference is sufficiently large, the overall capacity of the resulting model is smaller in the semiparametric approach. As before, dotted lines indicate the variance. \n\nFigure 5: Estimates of the parameters for sin x (top picture) and cos x (bottom picture) in the semiparametric model vs. regularization strength 1/λ. The dotted lines above and below show the variation of the estimate given by its variance. Training set size was ℓ 
= 50. Note the small variation of the estimate. Also note that even in the parametric case 1/λ → 0 neither the coefficient for sin x converges to 1, nor does the corresponding term for cos x converge to 0. This is due to the additional frequency contributions of sinc 2πx. \n\n             Nonparam.               Semiparam. (sin x, cos x, 1)   Semiparam. (sin 2x, cos 2x, 1) \nL1 error     0.1263 ± 0.0064 (12)    0.0887 ± 0.0018 (82)           0.1267 ± 0.0064 (6) \nL2 error     0.1760 ± 0.0097 (12)    0.1197 ± 0.0046 (82)           0.1864 ± 0.0124 (6) \n\nTable 1: L1 and L2 errors for model selection by 10-fold cross validation. The correct semiparametric model (sin x, cos x, 1) outperforms the nonparametric model by at least 30%, and has significantly smaller variance. The wrongly chosen semiparametric model (sin 2x, cos 2x, 1), on the other hand, gives performance comparable to the nonparametric one; in fact, no significant performance degradation was noticeable. The numbers in parentheses denote the number of trials in which the corresponding model was the best among the three models. \n\n6 Discussion and Outlook \n\nSimilar models have been proposed and explored in the context of smoothing splines. In fact, expansion (7) is a direct result of the representer theorem, however only in the case of regularization in feature space (aka Reproducing Kernel Hilbert Space, RKHS). One can show [5] that the expansion (7) is optimal in the space spanned by the RKHS and the additional set of basis functions. \n\nMoreover, the semiparametric setting arises naturally in the context of conditionally positive definite kernels of order m (see [8]). There, in order to use a set of kernels which do not satisfy Mercer's condition, one has to exclude polynomials up to order m − 1. 
Hence one has to add these polynomials back in 'manually', and our approach presents a way of doing that. \n\nAnother application of semiparametric models, besides the conventional approach of treating the nonparametric part as nuisance parameters [1], is the domain of hypothesis testing, e.g. to test whether a parametric model fits the data sufficiently well. This can be achieved in the framework of structural risk minimization [10]: given the different models (nonparametric vs. semiparametric vs. parametric) one can evaluate the bounds on the expected risk and then choose the model with the lowest error bound. Future work will tackle the problem of computing good error bounds for compound hypothesis classes. Moreover, it should be easily possible to apply the methods proposed in this paper to Gaussian processes. \n\nAcknowledgements This work was supported in part by grants of the DFG Ja 379/51 and ESPRIT Project Nr. 25387-STORM. The authors thank Peter Bartlett, Klaus-Robert Müller, Noboru Murata, Takashi Onoda, and Bob Williamson for helpful discussions and comments. \n\nReferences \n\n[1] P.J. Bickel, C.A.J. Klaassen, Y. Ritov, and J.A. Wellner. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore, MD, 1994. \n[2] B.E. Boser, I.M. Guyon, and V.N. Vapnik. A training algorithm for optimal margin classifiers. In COLT'92, pages 144-152, Pittsburgh, PA, 1992. \n[3] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. Technical Report 479, Department of Statistics, Stanford University, 1995. \n[4] T.T. Frieß and R.F. Harrison. Perceptrons in kernel feature spaces. TR RR-720, University of Sheffield, Sheffield, UK, 1998. \n[5] G.S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann. Math. Statist., 2:495-502, 1971. \n[6] C.A. Micchelli. 
Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constructive Approximation, 2:11-22, 1986. \n[7] A.J. Smola and B. Schölkopf. On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica, 22:211-231, 1998. \n[8] A.J. Smola, B. Schölkopf, and K. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11:637-649, 1998. \n[9] R.J. Vanderbei. LOQO: an interior point code for quadratic programming. TR SOR-94-15, Statistics and Operations Research, Princeton University, NJ, 1994. \n[10] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995. \n", "award": [], "sourceid": 1575, "authors": [{"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "Thilo-Thomas", "family_name": "Frie\u00df", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}]}