{"title": "Sparsity of Data Representation of Optimal Kernel Machine and Leave-one-out Estimator", "book": "Advances in Neural Information Processing Systems", "page_first": 252, "page_last": 258, "abstract": null, "full_text": "Sparsity of data representation of optimal kernel \n\nmachine and leave-one-out estimator \n\nA. Kowalczyk \n\nChief Technology Office, Telstra \n\n770 Blackburn Road, Clayton, Vic. 3168, Australia \n\n(adam.kowalczy k@team.telstra.com) \n\nAbstract \n\nVapnik's result that the expectation of the generalisation error ofthe opti(cid:173)\nmal hyperplane is bounded by the expectation of the ratio of the number \nof support vectors to the number of training examples is extended to a \nbroad class of kernel machines. The class includes Support Vector Ma(cid:173)\nchines for soft margin classification and regression, and Regularization \nNetworks with a variety of kernels and cost functions. We show that key \ninequalities in Vapnik's result become equalities once \"the classification \nerror\" is replaced by \"the margin error\", with the latter defined as an in(cid:173)\nstance with positive cost. In particular we show that expectations of the \ntrue margin error and the empirical margin error are equal, and that the \nsparse solutions for kernel machines are possible only if the cost function \nis \"partially\" insensitive. \n\n1 Introduction \n\nMinimization of regularized risk is a backbone of several recent advances in machine learn(cid:173)\ning, including Support Vector Machines (SVM) [13], Regularization Networks (RN) [5] or \nGaussian Processes [15]. Such a machine is typically implemented as a weighted sum of \na kernel function evaluated for pairs composed of a data vector in question and a number \nof selected training vectors, so called support vectors. For practical machines it is desired \nto have as few support vectors as possible. 
It has been observed empirically that SVM solutions often have very few support vectors, that is, they are sparse, while RN solutions are not. This paper shows that this behaviour is determined by the properties of the cost function used (its partial insensitivity, to be precise). \n\nAnother motivation for interest in sparsity of solutions comes from a celebrated result of Vapnik [13] which links the number of support vectors to the generalization error of SVM via a bound on the leave-one-out estimator [9]. This result was originally shown for the special case of classification with a hard margin cost function (the optimal hyperplane). The papers by Opper and Winther [10], Jaakkola and Haussler [6], and Joachims [7] extend Vapnik's result in the direction of bounds on the classification error of SVMs. The first of those papers deals with the hard margin case, while the other two derive tighter bounds on the classification error of soft margin SVMs with eps-insensitive linear cost. \n\nIn this paper we extend Vapnik's result in another direction. Firstly, we show that it holds for a wide range of kernel machines optimized for a variety of cost functions, for both classification and regression tasks. Secondly, we find that Vapnik's key inequalities become equalities once \"the misclassification error\" is replaced by \"the margin error\" (defined as the rate of data instances incurring positive costs). In particular, we find that for margin errors the following three expectations are equal to each other: (i) of the empirical risk, (ii) of the true risk and (iii) of the leave-one-out risk estimator. Moreover, we show that they are equal to the expectation of the ratio of the number of support vectors to the number of training examples. \n\nThe main results are given in Section 2. A brief discussion of the results is given in Section 3. \n\n2 Main results \n\nGiven an l-sample {(x_1, y_1), ... 
, (x_l, y_l)} of patterns x_i \in X \subset R^n and target values y_i \in Y \subset R. The learning algorithms used by SVMs [13], RNs [5] and Gaussian Processes [15] minimise the regularized risk functional of the form: \n\n min_{(f,b) \in H x R} R_reg[f, b] = \sum_{i=1}^l c(x_i, y_i, \xi_i[f, b]) + (\lambda/2) ||f||_H^2. (1) \n\nHere H denotes a reproducing kernel Hilbert space (RKHS) [1], ||.||_H is the corresponding norm, \lambda > 0 is a regularization constant, c : X x Y x R -> R_+ is a non-negative cost function penalising the deviation \xi_i[f, b] = y_i - \hat y_i of the estimator \hat y_i := f(x_i) + \beta b from the target y_i at location x_i, b \in R is a constant (bias) and \beta \in {0, 1} is another constant (\beta = 0 is used to switch the bias off). \n\nThe important Representer Theorem [8, 4] states that the minimizer of (1) has the expansion: \n\n f(x) = \sum_{i=1}^l \alpha_i k(x_i, x), (2) \n\nwhere k : X x X -> R is the kernel corresponding to the RKHS H. In the following section we shall show that under general assumptions this expansion is unique. If \alpha_i != 0, then x_i is called a support vector of f(.). \n\n2.1 Unique Representer Theorem \n\nWe recall that a function is called a real analytic function on a domain in R^q if for every point of this domain the Taylor series for the function converges to this function in some neighborhood of that point.^1 \n\nA proof of the following crucial lemma is omitted due to lack of space. \n\nLemma 2.1. If \phi : X -> R is an analytic function on an open connected subset X \subset R^n, then the subset \phi^{-1}(0) \subset X is either equal to X or has Lebesgue measure 0. \n\nAnalyticity is essential for the above result, which in general does not hold even for infinitely differentiable functions. Indeed, for every closed subset V \subset R^n there exists an infinitely differentiable (C^infty) function \phi on R^n such that \phi^{-1}(0) = V, and there exist closed subsets with positive Lebesgue measure and empty interior. 
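A quick numerical illustration of the role Lemma 2.1 plays below: for sample points drawn from a continuous density, the Gram matrix of an analytic kernel such as the Gaussian is nonsingular, i.e. the functions k(x_i, .) are linearly independent with probability 1. This is only a sketch under the assumption of a Gaussian kernel of unit width; the specific sample size and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((8, 2))  # 8 "training" points in R^2, drawn from a density

# Gram matrix K[i, j] = k(x_i, x_j) for the Gaussian kernel k(x, x') = exp(-||x - x'||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists)

det = np.linalg.det(K)
print(det != 0.0)  # True: the functions k(x_i, .) are linearly independent for this sample
```

The determinant of the Gram matrix is an analytic function of the sample, so by Lemma 2.1 its zero set has Lebesgue measure zero; a random continuous sample lands on it with probability 0.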
Hence the lemma, and consequently the subsequent results, do not hold for the broader class of all C^infty functions. \n\n^1 Examples of analytic functions are polynomials. Ordinary functions such as sin(x), cos(x) and exp(x) are examples of non-polynomial analytic functions. The function \psi(x) := exp(-1/x^2) for x > 0, and 0 otherwise, is an example of an infinitely differentiable function on the real line which is not analytic (locally it is not equal to its Taylor series expansion at zero). \n\nStanding assumptions. The following is assumed. \n\n1. The set X \subset R^n is open and connected, and either Y = {+-1} (the case of classification) or Y \subset R is an open segment (the case of regression). \n2. The kernel k : X x X -> R is a real analytic function on its domain. \n3. The cost function \xi -> c(x, y, \xi) is convex, differentiable on R and c(x, y, 0) = 0 for every (x, y) \in X x Y. It can be shown that \n\n c(x, y, \xi) > 0 <=> (dc/d\xi)(x, y, \xi) != 0. (3) \n\n4. l is a fixed integer, 1 < l <= dim(H), and the training sample (x_1, y_1), ..., (x_l, y_l) is iid drawn from a continuous probability density p(x, y) on X x Y. \n5. The phrase \"with probability 1\" means with probability 1 with respect to the selection of the training sample. \n\nNote that the standard polynomial kernel k(x, x') = (1 + x . x')^d, x, x' \in R^n, satisfies the above assumptions with dim(H) = C(n+d, d). Similarly, the Gaussian kernel k(x, x') = exp(-||x - x'||^2 / \sigma) satisfies them with dim(H) = infinity. \n\nTypical cost functions such as the super-linear loss c_p(x, y, \xi) = ((y \xi)_+)^p := (max(0, y \xi))^p used for SVM classification, c_{p,eps}(x, y, \xi) = ((|\xi| - eps)_+)^p used for SVM regression, or the super-linear loss c_p(x, y, \xi) = |\xi|^p, p > 1, used for RN regression, satisfy the above assumptions.^2 Similarly, variations of Huber's robust loss [11, 14] satisfy these assumptions. \n\nThe following result strengthens the Representer Theorem [8, 4]. \n\nTheorem 2.2. 
If l <= dim(H), then both the minimizer of the regularized risk (1) and its expansion (2) are unique with probability 1. \n\nProof outline. Convexity of the functional (f, b) -> R_reg[f, b] and its strict convexity with respect to f \in H imply the uniqueness of the f \in H minimizing the regularized risk (1); cf. [3]. From the assumption l <= dim(H) we derive the existence of x_1, ..., x_l \in X such that the functions k(x_i, .), i = 1, ..., l, are linearly independent, or equivalently, such that the Gram determinant det[k(x_i, x_j)]_{i,j=1}^l != 0. This determinant is an analytic function of (x_1, ..., x_l), so by Lemma 2.1 it is != 0 with probability 1, which makes the coefficients of the expansion (2) unique. Q.E.D. \n\nLet (f^(l), b^(l)) denote the minimizer of (1) for the full training l-sample, with expansion coefficients \alpha^(l)_i, and let (f^(l\\i), b^(l\\i)) denote the minimizer of (1) for the training sample with the i-th example removed. \n\nLemma 2.3. With probability 1, for every i = 1, ..., l: \n\n \alpha^(l)_i != 0 <=> c(x_i, y_i, \xi_i[f^(l), b^(l)]) > 0, (4) \n \alpha^(l)_i != 0 <=> c(x_i, y_i, \xi_i[f^(l\\i), b^(l\\i)]) > 0. (5) \n\n^2 Note that in general, if a function \phi : R -> R is convex, differentiable and such that (d\phi/d\xi)(0) = 0, then the cost function c(x, y, \xi) := \phi((\xi)_+) is convex and differentiable. \n\nProof outline. With probability 1, the functions k(x_j, .), j = 1, ..., l, are linearly independent (cf. the proof of Theorem 2.2) and there exists a feature map \Phi : X -> R^l such that the vectors z_j := \Phi(x_j), j = 1, ..., l, are linearly independent, k(x_j, x) = z_j . \Phi(x) and f^(l)(x) = z^(l) . \Phi(x) + \beta b^(l) for every x \in X, where z^(l) := \sum_{j=1}^l \alpha^(l)_j z_j. The pair (z^(l), b^(l)) minimizes the function \n\n R^(l)_reg(z, b) := \sum_{j=1}^l c(x_j, y_j, \xi_j(z, b)) + (\lambda/2) ||z||^2, (6) \n\nwhere \xi_j(z, b) := y_j - z . z_j - \beta b. This function is differentiable due to the standing assumptions on the cost c. Hence, necessarily, grad R^(l)_reg = 0 at the minimum (z^(l), b^(l)), which, due to the linear independence of the vectors z_j, gives \n\n \alpha^(l)_j = -(1/\lambda) (dc/d\xi)(x_j, y_j, \xi_j(z^(l), b^(l))) (7) \n\nfor every j = 1, ..., l. This equality combined with equivalence (3) proves (4). \n\nNow we proceed to the proof of (5). Note that the pair (z^(l\\i), b^(l\\i)), where z^(l\\i) := \sum_{j != i} \alpha^(l\\i)_j z_j, corresponds in the feature space to the minimizer (f^(l\\i), b^(l\\i)) of the reduced regularized risk: \n\n R^(l\\i)_reg(z, b) := \sum_{j != i} c(x_j, y_j, \xi_j(z, b)) + (\lambda/2) ||z||^2. \n\nSufficiency in (5). 
From (4) and the characterization (7) of the critical point it follows immediately that if \alpha^(l)_i = 0, then the minimizers for the full and the reduced data sets are identical. \n\nNecessity in (5). The supposition that \alpha^(l)_i != 0 and c(x_i, y_i, \xi_i[f^(l\\i), b^(l\\i)]) = 0 leads to a contradiction. Indeed, from (4), c(x_i, y_i, \xi_i(z^(l), b^(l))) > 0, hence: \n\n R^(l)_reg(z^(l\\i), b^(l\\i)) = R^(l\\i)_reg(z^(l\\i), b^(l\\i)) \n <= R^(l\\i)_reg(z^(l), b^(l)) = R^(l)_reg(z^(l), b^(l)) - c(x_i, y_i, \xi_i(z^(l), b^(l))) \n < R^(l)_reg(z^(l), b^(l)) = min_{(z,b) \in R^l x R} R^(l)_reg(z, b). \n\nThis contradiction completes the proof. Q.E.D. \n\nWe say that x_i is a sensitive support vector if \alpha^(l)_i != 0 and f^(l) != f^(l\\i), i.e., if its removal from the training set changes the solution. \n\nCorollary 2.4. Every support vector is sensitive with probability 1. \n\nProof. If \alpha^(l)_i != 0, then the vector z^(l) is not in Lin_R(z_1, ..., z_{i-1}, z_{i+1}, ..., z_l), since z^(l) has a non-trivial component \alpha^(l)_i z_i in the direction of the i-th feature vector z_i, while z^(l\\i) \in Lin_R(z_1, ..., z_{i-1}, z_{i+1}, ..., z_l). Thus z^(l) and z^(l\\i) have different directions in Lin_R(z_1, ..., z_l) \subset Z and there exists j' \in {1, ..., l} such that f^(l)(x_{j'}) != f^(l\\i)(x_{j'}). Q.E.D. \n\nWe define the empirical risk and the expected (true) risk of margin error: \n\n R_emp[f, b] := (1/l) \sum_{i=1}^l I_{c(x_i, y_i, \xi_i[f,b]) > 0} = #{i ; c(x_i, y_i, \xi_i[f, b]) > 0} / l, \n R_exp[f, b] := Prob[c(x, y, y - f(x) - \beta b) > 0], \n\nwhere (f, b) \in H x R, I_{(.)} denotes the indicator function and # denotes the cardinality (number of elements) of a set. \n\nFrom the above lemma we obtain immediately the following result. \n\nCorollary 2.5. With probability 1: \n\n #{i ; c(x_i, y_i, \xi_i[f^(l\\i), b^(l\\i)]) > 0} / l = #{i ; \alpha^(l)_i != 0} / l = R_emp[f^(l), b^(l)]. \n\nThere exist counter-examples showing that the phrase \"with probability 1\" above cannot be omitted. The sum on the left hand side above is the leave-one-out estimator of the risk of margin error [14] for the minimizer of the regularized risk (1). The above corollary shows that this estimator is uniquely determined by the number of support vectors, as well as by the number of training margin errors. \n\nNow, from the Lunts-Brailovsky Theorem [14, Theorem 10.8] applied to the risk Q(x, y; f, b) := I_{c(x, y, y - f(x) - \beta b) > 0}, the following result is obtained. \n\nTheorem 2.6. \n\n E[R_exp(f^(l-1), b^(l-1))] = E[R_emp(f^(l), b^(l))] = E[#{i ; \alpha^(l)_i != 0} / l], (8) \n\nwhere the first expectation is with respect to the selection of a training (l-1)-sample and the remaining two are with respect to the selection of a training l-sample. \n\nA cost function is called partially insensitive if there exist (x, y) \in X x Y and \xi_1 != \xi_2 such that c(x, y, \xi_1) = c(x, y, \xi_2) = 0. Otherwise, the cost c is called sensitive. Typical SVM cost functions are partially insensitive, while typical RN cost functions are sensitive. The following result can be derived from Theorem 2.6 and Lemma 2.3. \n\nCorollary 2.7. If the number of support vectors is < l with a probability > 0, then the cost function has to be partially insensitive. \n\nTypical cost functions penalize an allocation of the wrong sign, i.e. \n\n y \hat y < 0 ==> c(x, y, y - \hat y) > 0 for all (x, y, \hat y) \in X x Y x R. (9) \n\nLet us define the risk of misclassification of the kernel machine \hat y(x) = f(x) + \beta b for (f, b) \in H x R as R_clas[f, b] := Prob[y \hat y(x) < 0]. Assuming (9), we have R_clas[f, b] <= R_exp[f, b]. Combining this observation with (8) we obtain an extension of Vapnik's result [14, Theorem 10.5]. \n\nCorollary 2.8. If condition (9) holds, then \n\n E[R_clas(f^(l-1), b^(l-1))] <= E[#{i ; \alpha^(l)_i != 0} / l] = E[R_emp(f^(l), b^(l))]. (10) \n\nNote that the original Vapnik result consists of an inequality analogous to the inequality in the above condition for the specific case of classification by optimal hyperplanes (hard margin support vector machines). \n\n3 Brief Discussion of Results \n\nEssentiality of assumptions. For every formal result in this paper and every one of the standing assumptions there exists an example of a minimizer of (1) which violates the conclusions of the result. In this sense all of those assumptions are essential. \n\nLinear combinations of admissible cost functions. Any weighted sum of cost functions satisfying our Standing Assumption 3 satisfies this assumption as well, hence our formalism applies to it. An illustrative example is the following cost function for classification: c(x, y, \xi) = \sum_j C_j (max(0, y(\xi - eps_j)))^{p_j}, where C_j > 0, eps_j >= 0 and p_j > 1 are constants and y \in Y = {+-1}. \n\nNon-differentiable cost functions. Our formal results can be extended, with minor modifications, to the case of typical non-differentiable linear cost functions such as c = (y \xi)_+ = max(0, y \xi) for SVM classification and c = (|\xi| - eps)_+ for SVM regression, and to classification with hard margin SVMs (optimal hyperplanes). Details are beyond the scope of this paper. Note that the above linear cost functions can be uniformly approximated by differentiable cost functions, e.g. by the Huber cost function [11, 14], to which our formalism applies. This implies that our formalism \"applies approximately\" to the linear loss case, and some partial extension of it can be obtained directly using limit arguments. However, using a direct algebraic approach based on an evaluation of the Kuhn-Tucker conditions, one can come to stronger conclusions. Details will be presented elsewhere. \n\nTheory of generalization. 
The equality of the expectations of the empirical and the expected risk provided by Theorem 2.6 implies that minimizers of the regularized risk (1) are on average consistent. We should emphasize that this result holds for small training samples, of size l smaller than the VC dimension of the function class, which is dim(H) + 1 in our case. This should be contrasted with uniform convergence bounds [2, 13, 14], which are vacuous unless l >> VC dimension. \n\nSignificance of approximate solutions for RNs. Corollary 2.7 shows that sparsity of solutions is practically not achievable for optimal RN solutions, since they use sensitive cost functions. This emphasizes the significance of research into approximately optimal solution algorithms in such a case, cf. [12]. \n\nApplication to selection of the regularization constant. The bound provided by Corollary 2.8 and the equality given by Theorem 2.6 can be used as a justification of the heuristic that the optimal value of the regularization constant \lambda is the one which minimizes the number of margin errors (cf. [14]). This is especially appealing in the case of regression with an eps-insensitive cost, where a margin error has the straightforward interpretation of a sample lying outside of the eps-tube. \n\nApplication to modelling of additive noise. Let us suppose that the data is iid drawn from a distribution of the form y = f(x) + eps_noise, where eps_noise is random noise independent of x, with mean 0. Theorem 2.6 implies the following heuristic for approximating the noise distribution in the regression model y = f(x) + eps_noise: \n\n Prob[|eps_noise| > eps] ~ #{i ; \alpha^(l)_i != 0} / l. \n\nHere (f^(l), b^(l)) is a minimizer of the regularized risk (1) with an eps-insensitive cost function, i.e. one such that c(x, y, \xi) > 0 iff |\xi| > eps. \n\nAcknowledgement. The permission of the Chief Technology Officer, Telstra, to publish this paper is gratefully acknowledged. 
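The lack of sparsity under a sensitive cost (Corollary 2.7) is easy to observe numerically. With the squared loss c(x, y, \xi) = \xi^2 (and \beta = 0), the minimizer of (1) is the standard kernel ridge regression solution, \alpha = (K + lam I)^{-1} y (with lam absorbing constant factors), and generically every coefficient is nonzero. A minimal sketch, assuming a Gaussian kernel; the data, seed and constants are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
X = rng.uniform(-1.0, 1.0, size=(n, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(n)

# Gram matrix of the Gaussian kernel on the 1-D inputs
K = np.exp(-((X - X.T) ** 2) / 0.5)

# Kernel ridge regression: squared loss is a "sensitive" cost, and the closed-form
# coefficients alpha = (K + lam*I)^{-1} y have, with probability 1, no zero entries.
lam = 0.1
alpha = np.linalg.solve(K + lam * np.eye(n), y)

n_support_vectors = int(np.sum(np.abs(alpha) > 1e-12))
print(n_support_vectors, "of", n)  # generically every training point is a support vector
```

By contrast, with a partially insensitive cost such as the eps-insensitive loss, relation (7) forces \alpha^(l)_j = 0 for every example inside the eps-tube, which is exactly where sparsity comes from.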
\n\nReferences \n\n[1] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337-404, 1950. \n[2] P. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Scholkopf et al., eds., Advances in Kernel Methods, pages 43-54. MIT Press, 1998. \n[3] C. Burges and D. J. Crisp. Uniqueness of the SVM solution. In S. Solla et al., eds., Advances in Neural Information Processing Systems 12, pages 144-152. MIT Press, 2000. \n[4] D. Cox and F. O'Sullivan. Asymptotic analysis of penalized likelihood and related estimators. Annals of Statistics, 18:1676-1695, 1990. \n[5] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219-269, 1995. \n[6] T. Jaakkola and D. Haussler. Probabilistic kernel regression models. In Proceedings of the Seventh Workshop on AI and Statistics, San Francisco, 1999. Morgan Kaufmann. \n[7] T. Joachims. Estimating the generalization performance of an SVM efficiently. In Proceedings of the International Conference on Machine Learning, 2000. Morgan Kaufmann. \n[8] G. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41:495-502, 1970. \n[9] A. Lunts and V. Brailovsky. Evaluation of attributes obtained in statistical decision rules. Engineering Cybernetics, 3:98-109, 1967. \n[10] M. Opper and O. Winther. Gaussian process classification and SVM: Mean field results and leave-one-out estimator. In P. Bartlett et al., eds., Advances in Large Margin Classifiers, pages 301-316. MIT Press, 2000. \n[11] A. Smola and B. Scholkopf. A tutorial on support vector regression. Statistics and Computing, 1998. In press. \n[12] A. J. Smola and B. Scholkopf. Sparse greedy matrix approximation for machine learning. Typescript, March 2000. \n[13] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995. \n[14] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998. \n[15] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan, ed., Learning and Inference in Graphical Models. Kluwer, 1998. \n", "award": [], "sourceid": 1868, "authors": [{"given_name": "Adam", "family_name": "Kowalczyk", "institution": null}]}