{"title": "The Entropy Regularization Information Criterion", "book": "Advances in Neural Information Processing Systems", "page_first": 342, "page_last": 348, "abstract": null, "full_text": "The Entropy Regularization \n\nInformation Criterion \n\nAlex J. Smola \n\nDept. of Engineering and RSISE \nAustralian National University \nCanberra ACT 0200, Australia \n\nAlex.Smola@anu.edu.au \n\nJohn Shawe-Taylor \n\nRoyal Holloway College \n\nUniversity of London \n\nEgham, Surrey 1W20 OEX, UK \n\njohn@dcs.rhbnc.ac.uk \n\nBernhard Scholkopf \n\nMicrosoft Research Limited \n\nSt. George House, 1 Guildhall Street \n\nCambridge CB2 3NH \nbsc@microsoft.com \n\nRobert C. Williamson \nDept. of Engineering \n\nAustralian National University \nCanberra ACT 0200, Australia \nBob. Williamson @anu.edu.au \n\nAbstract \n\nEffective methods of capacity control via uniform convergence bounds \nfor function expansions have been largely limited to Support Vector ma(cid:173)\nchines, where good bounds are obtainable by the entropy number ap(cid:173)\nproach. We extend these methods to systems with expansions in terms of \narbitrary (parametrized) basis functions and a wide range of regulariza(cid:173)\ntion methods covering the whole range of general linear additive models. \nThis is achieved by a data dependent analysis of the eigenvalues of the \ncorresponding design matrix. \n\n1 \n\nINTRODUCTION \n\nModel selection criteria based on the Vapnik-Chervonenkis (VC) dimension are known to \nbe difficult to obtain, worst case, and often not very tight. Yet they have the theoretical \nappeal of providing bounds, with few or no assumptions made. \n\nRecently new methods [8, 7, 6] have been developed which are able to provide a better \ncharacterization of the complexity of function classes than the VC dimension, and more(cid:173)\nover, are easily obtainable and take advantage of the data at hand (i.e. they employ the \nconcept of luckiness). 
These techniques, however, have been limited to linear functions or to expansions of functions in terms of kernels, as happens to be the case in Support Vector (SV) machines.

In this paper we show that the previously mentioned techniques can be extended to expansions in terms of arbitrary basis functions, covering a large range of practical algorithms such as general linear models, weight decay, sparsity regularization [3], and regularization networks [4].

2 SUPPORT VECTOR MACHINES

Support Vector machines carry out an effective means of capacity control by minimizing a weighted sum of the training error

Remp[f] := (1/m) Σ_{i=1}^m c(x_i, y_i, f(x_i))    (1)

and a regularization term Ω[f] = (1/2)||w||^2; i.e. they minimize the regularized risk functional

Rreg[f] := Remp[f] + λΩ[f] = (1/m) Σ_{i=1}^m c(x_i, y_i, f(x_i)) + (λ/2)||w||^2.    (2)

Here X := {x_1, ..., x_m} ⊂ 𝒳 denotes the training set, Y := {y_1, ..., y_m} ⊂ 𝒴 the corresponding labels (target values), 𝒳, 𝒴 the corresponding domains, λ > 0 a regularization constant, c : 𝒳 × 𝒴 × 𝒴 → [0, ∞) a cost function, and f : 𝒳 → 𝒴 is given by

f(x) := (x, w), or in the nonlinear case f(x) := (φ(x), w).    (3)

Here φ : 𝒳 → ℱ is a map into a feature space ℱ. Finally, dot products in feature space can be written as (φ(x), φ(x')) = k(x, x'), where k is a so-called Mercer kernel.

For n ∈ ℕ, ℝ^n denotes the n-dimensional space of vectors x = (x_1, ..., x_n). We define spaces ℓ_p^n as follows: as vector spaces they are identical to ℝ^n; in addition, they are endowed with the p-norms

||x||_{ℓ_p^n} := (Σ_{i=1}^n |x_i|^p)^{1/p} for 0 < p < ∞, and ||x||_{ℓ_∞^n} := max_{1≤i≤n} |x_i| for p = ∞.

We write ℓ_p = ℓ_p^∞. Furthermore let U_{ℓ_p^n} := {x : ||x||_{ℓ_p^n} ≤ 1} be the unit ℓ_p^n-ball.

For model selection purposes one wants to obtain bounds on the richness of the map S_X,

S_X : w ↦ (f(x_1), ..., f(x_m)) = ((φ(x_1), w), ..., (φ(x_m), w)).
\n\n(4) \n\nwhere w is restricted to an f2 unit ball of some radius A (this is equivalent to choosing an \nappropriate value of A -\nan increase in A decreases A and vice versa). By the \"richness\" \nof S x specificaUy we mean the f: \u20ac-covering numbers N( \u20ac, S X (AUe;;, ), f1:J of the set \nSx(AUlm). In the standard COLT notation, we mean \n\np \n\nN(\u20ac, SX(AUl;;')' f:) := min n \n\n{\n\nSee [8] for further details. \n\nThere exists a set {Zl, ... zn} C F such that for all } \nZ E Sx(AUem) we have min liz - zililm < \u20ac \n\np \n\nl::;i::;n \n\n00 \n\n-\n\nWhen carrying out model selection in this case, advanced methods [6] exploit the distribu(cid:173)\ntion of X mapped into feature space 1', and thus of the spectral properties of the operator \nSx by analyzing the spectrum of the Gram matrix G = [gij]ij, where gij := k(Xi, Xj). \nAll this is possible since k(Xi,Xj) can be seen as a dot product of Xi,Xj mapped into \nsome feature space 1', i.e. k(Xi, Xj) = (4l(Xi), 4l(Xj )) . This property, whilst true for SV \nmachines with Mercer kernels, does not hold in general case where f is expanded in terms \nof more or less arbitrary basis functions. \n\n\f344 \n\nA. J. Smola. J. Shawe-Taylor, B. Sch61kopf and R. C. Williamson \n\n3 THE BASIC PROBLEMS \n\nOne basic problem is that when expanding 1 into \n\nn \n\n(5) \n\ni=l \n\nwith Ii (x) being arbitrary functions, it is not immediately obvious how to regard 1 as a \ndot product in some feature space. One can show that the VC dimension of a set of n \nlinearly independent functions is n. Hence one would intuitively try to restrict the class of \nadmissible models by controlling the number of basis functions n in terms of which 1 can \nbe expanded. \nNow consider an extreme case. In addition to the n basis functions Ii defined previously, \nwe are given n further basis functions II, linearly independent of the previous ones, which \ndiffer from Ii only on a small domain X', i.e. Iilx\\x1 = IIlx\\xl. 
Since this new set of functions is linearly independent, the VC dimension of the joint set is 2n. On the other hand, if hardly any data occurs on the domain X', one would not notice the difference between f_i and f_i'. In other words, the joint system of functions would behave as if we only had the initial system of n basis functions.

An analogous situation occurs if f_i' = f_i + εg_i, where ε is a small constant and g_i is bounded, say, within [0, 1]. Again, in this case, the additional effect of the set of functions f_i' would be hardly noticeable, but the joint set of functions would still count as one with VC dimension 2n. This already indicates that simply counting the number of basis functions may not be a good idea after all.

Figure 1: From left to right: (a) initial set of functions f_1, ..., f_5 (dots on the x-axis indicate sampling points); (b) additional set of functions f_1', ..., f_5' which differ globally, but only by a small amount; (c) additional set of functions f_1', ..., f_5' which differ locally, however by a large amount; (d) spectra of the corresponding design matrices; the bars denote the cases (a)-(c) in the corresponding order. Note that the differences are quite small.

On the other hand, the spectra of the corresponding design matrices (see Figure 1) are very similar. This suggests the use of the latter for a model selection criterion.

Finally we have the practical problem that capacity control, which in SV machines was carried out by minimizing the length of the \"weight vector\" w in feature space, cannot be done in an analogous way either. There are several ways to deal with this. Below we consider three that have appeared in the literature and for which effective algorithms exist.

Example 1 (Weight Decay) Define Ω[f] := (1/2) Σ_i α_i^2, i.e. the coefficients α_i of the function expansion are constrained to an ℓ_2 ball.
In this case we can consider the following operator S_X^{(1)} : ℓ_2^n → ℓ_∞^m, where

S_X^{(1)} : α ↦ (f(x_1), ..., f(x_m)) = ((f(x_1), α), ..., (f(x_m), α)) = Fα.    (6)

Here f(x) := (f_1(x), ..., f_n(x)), F_ij := f_j(x_i), α := (α_1, ..., α_n), and α ∈ Λ U_{ℓ_2^n} for some Λ > 0.

Example 2 (Sparsity Regularization) In this case Ω[f] := Σ_i |α_i|, i.e. the coefficients α_i of the function expansion are constrained to an ℓ_1 ball to enforce sparseness [3]. Thus S_X^{(2)} : ℓ_1^n → ℓ_∞^m, with S_X^{(2)} mapping α as in (6) except that α ∈ Λ U_{ℓ_1^n}. This is similar to expansions encountered in boosting or in linear programming machines.

Example 3 (Regularization Networks) Finally one could set Ω[f] := (1/2) α^T Q α for some positive definite matrix Q. For instance, Q_ij could be obtained from (P f_i, P f_j), where P is a regularization operator penalizing non-smooth functions [4]. In this case α lives inside some n-dimensional ellipsoid. By substituting α' := Q^{1/2} α one can reduce this setting to the case of Example 1 with a different set of basis functions (f'(x) = Q^{-1/2} f(x)) and consider an evaluation operator S_X^{(3)} : ℓ_2^n → ℓ_∞^m given by

S_X^{(3)} : α' ↦ (f(x_1), ..., f(x_m)) = ((Q^{-1/2} f(x_1), α'), ..., (Q^{-1/2} f(x_m), α')) = F Q^{-1/2} α',    (7)

where α' ∈ Λ U_{ℓ_2^n} for some Λ > 0 and F_ij = f_j(x_i) as in Example 1.

Example 4 (Support Vector Machines) An important special case of Example 3 are Support Vector Machines, where Q_ij = k(x_i, x_j) and f_i(x) = k(x_i, x), hence Q = F. The possible values generated by a Support Vector Machine can thus be written as

S_X^{(3)} : α' ↦ (f(x_1), ..., f(x_m)) = ((Q^{-1/2} f(x_1), α'), ..., (Q^{-1/2} f(x_m), α')) = F^{1/2} α',    (8)

where α' ∈ Λ U_{ℓ_2^m} for some Λ > 0.

4 ENTROPY NUMBERS

Covering numbers characterize the difficulty of learning elements of a function class.
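To make these objects concrete before the formal definitions, here is a small numerical sketch (ours, not the paper's; all function and variable names are illustrative): an ε-cover of a finite point set under the sup-norm can be built greedily, and the number of centers it uses upper-bounds the covering number of the set.

```python
import numpy as np

def greedy_cover_size(points, eps):
    # Greedily pick centers from the set itself; every point ends up
    # within eps (sup-norm) of some center, so the number of centers
    # upper-bounds the covering number of the set.
    centers = []
    for p in (np.asarray(q, dtype=float) for q in points):
        if not any(np.max(np.abs(p - c)) <= eps for c in centers):
            centers.append(p)
    return len(centers)

# two tight clusters are covered by two sup-norm balls of radius 0.1
cloud = [(0.0, 0.0), (0.05, 0.02), (1.0, 1.0), (0.98, 1.03)]
n_cover = greedy_cover_size(cloud, 0.1)
```

The greedy construction is crude (it can overshoot the optimal cover by a constant factor), but it makes the min-over-covers definition below tangible.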
Entropy numbers of operators can be used to compute covering numbers more easily and more tightly than the traditional techniques based on VC-like dimensions such as the fat shattering dimension [1]. Knowing e_l(S_X) = ε (see below for the definition) tells one that log N(ε, ℱ, ℓ_∞^m) ≤ l, where ℱ is the effective class of functions used by the regularised learning machines under consideration. In this section we summarize a few basic definitions and results as presented in [8] and [2].

The l-th entropy number ε_l(F) of a set F with a corresponding metric d is the precision up to which F can be approximated by l elements of F; i.e. for all f ∈ F there exists some f_i ∈ {f_1, ..., f_l} such that d(f, f_i) ≤ ε_l. Hence ε_l(F) is the functional inverse of the covering number of F.

The entropy number of a bounded linear operator T : A → B between normed linear spaces A and B is defined as ε_l(T) := ε_l(T(U_A)), with the metric d induced by ||·||_B. The dyadic entropy numbers e_l are defined by e_l := ε_{2^{l-1}} (the latter quantity is often more convenient to deal with, since it corresponds to the log of the covering number).

We make use of the following three results on entropy numbers of the identity mapping from ℓ_{p1}^n into ℓ_{p2}^n, of diagonal operators, and of products of operators. Let

id_{p1,p2}^n : ℓ_{p1}^n → ℓ_{p2}^n ; id_{p1,p2}^n : x ↦ x.

The following result is due to Schütt; the constants 9.94 and 1.86 were obtained in [9].

Proposition 1 (Entropy numbers for identity operators) Let m ∈ ℕ. Then

e_l(id_{1,2}^m) ≤ 9.94 (l^{-1} log(1 + m/l))^{1/2} and e_l(id_{2,∞}^m) ≤ 1.86 (l^{-1} log(1 + m/l))^{1/2}.    (9)

Proposition 2 (Carl and Stephani [2, p. 11]) Let E, F, G be Banach spaces, R : F → G, and S : E → F.
Then, for n, t ∈ ℕ,

e_{n+t-1}(RS) ≤ e_n(R) e_t(S), e_n(RS) ≤ e_n(R) ||S||, and e_n(RS) ≤ e_n(S) ||R||.    (10)

Note that the latter two inequalities follow directly from the fact that ε_1(R) = ||R|| for all R : F → G, by definition of the operator norm ||R||.

Proposition 3 Let σ_1 ≥ σ_2 ≥ ... ≥ σ_j ≥ ... ≥ 0, let 1 ≤ p ≤ ∞, and let

D : (x_1, x_2, ..., x_j, ...) ↦ (σ_1 x_1, σ_2 x_2, ..., σ_j x_j, ...)    (11)

for x = (x_1, x_2, ..., x_j, ...) ∈ ℓ_p be the diagonal operator from ℓ_p into itself, generated by the sequence (σ_j)_j. Then for all n ∈ ℕ,

e_n(D) ≤ 6 sup_{j∈ℕ} 2^{-n/j} (σ_1 σ_2 ⋯ σ_j)^{1/j}.    (12)

5 THE MAIN RESULT

We can now state the main theorem, which gives bounds on the entropy numbers of S_X^{(i)} for the first three examples of model selection described above (since Support Vector Machines are a special case of Example 3, we do not deal with them separately).

Proposition 4 Let f be expanded in a linear combination of basis functions as f := Σ_{i=1}^n α_i f_i, and let the coefficients α be restricted to one of the convex sets described in Examples 1 to 3. Moreover denote by F_ij := f_j(x_i) the design matrix on a particular sample X, and by Q the regularization matrix in the case of Example 3. Then the following bounds on S_X hold.

1. In the case of weight decay (Ex. 1) (with l_1 + l_2 ≤ l + 1),

e_l(S_X^{(1)}) ≤ 1.86 (l_1^{-1} log(1 + m/l_1))^{1/2} e_{l_2}(Σ).    (13)

2. In the case of sparsity regularization (Ex. 2) (with l_1 + l_2 + l_3 ≤ l + 2),

e_l(S_X^{(2)}) ≤ 18.48 (l_1^{-1} log(1 + m/l_1))^{1/2} e_{l_2}(Σ) (l_3^{-1} log(1 + m/l_3))^{1/2}.    (14)

3. Finally, in the case of regularization networks (Ex. 3) (with l_1 + l_2 ≤ l + 1),

e_l(S_X^{(3)}) ≤ 1.86 (l_1^{-1} log(1 + m/l_1))^{1/2} e_{l_2}(Σ).    (15)

Here Σ is a diagonal scaling operator (matrix) with (i, i) entries √σ_i, where (σ_i)_i are the eigenvalues (sorted in decreasing order) of the matrix F F^T in the case of Examples 1 and 2, and of F Q^{-1} F^T in the case of Example 3.

The entropy number of Σ is readily bounded in terms of (σ_i)_i by using Proposition 3.
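As an illustrative sketch of how such a data-dependent quantity can be evaluated on a concrete sample (our code, not the paper's; it assumes the diagonal-operator bound of Proposition 3 in the classical Carl-Stephani form e_n(D) ≤ 6 sup_j 2^{-n/j} (σ_1 ⋯ σ_j)^{1/j}, and uses a random design matrix F purely for illustration):

```python
import numpy as np

def dyadic_entropy_bound(singular_values, n):
    # Upper bound on the dyadic entropy number e_n of the diagonal
    # operator with the given singular values, via the assumed bound
    # e_n(D) <= 6 * sup_j 2^(-n/j) * (s_1 * ... * s_j)^(1/j).
    s = np.sort(np.asarray(singular_values, dtype=float))[::-1]
    s = np.clip(s, 1e-12, None)              # guard against round-off zeros
    j = np.arange(1, len(s) + 1)
    geo = np.exp(np.cumsum(np.log(s)) / j)   # geometric means (s_1...s_j)^(1/j)
    return 6.0 * float(np.max(2.0 ** (-n / j) * geo))

# hypothetical design matrix F_ij = f_j(x_i); its singular values are the
# square roots of the eigenvalues of F F^T
rng = np.random.default_rng(0)
F = rng.standard_normal((20, 10))
sigma = np.sqrt(np.clip(np.linalg.eigvalsh(F @ F.T), 0.0, None))

bounds = [dyadic_entropy_bound(sigma, n) for n in (1, 5, 10)]
```

Since each term 2^{-n/j} shrinks as n grows, the resulting bound decays with n, mirroring how larger covers allow finer precision.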
One can see that the first setting (weight decay) is a special case of the third one, namely when Q = 1, i.e. when Q is just the identity matrix.

Proof The proof relies on a factorization of S_X^{(i)} (i = 1, 2, 3) in the following way. First we consider the equivalent operator S_X mapping from ℓ_2^n to ℓ_2^m and perform a singular value decomposition [5] of the latter, S_X = V Σ W, where V, W are operators of norm 1 and Σ contains the singular values of S_X^{(i)}, i.e. the singular values of F and F Q^{-1/2} respectively. The latter, however, are identical to the square roots of the eigenvalues of F F^T or F Q^{-1} F^T. Consequently we can factorize S_X^{(i)} as in the diagram

S_X^{(i)} = id_{2,∞}^m V Σ W for i ∈ {1, 3}, and S_X^{(2)} = id_{2,∞}^m V Σ W id_{1,2}^n.    (16)

Finally, in order to compute the entropy number of the overall operator, one only has to apply Proposition 2 to this factorization several times. We also exploit the fact that for singular value decompositions ||V||, ||W|| ≤ 1. ∎

The present theorem allows us to compute the entropy numbers (and thus the complexity) of a class of functions on the current sample X. Going back to the examples of Section 3, which led to large bounds on the VC dimension, one can see that the new result is much less susceptible to such modifications: the addition of f_1', ..., f_n' to f_1, ..., f_n does not change the eigenspectrum Σ of the design matrix significantly (possibly only doubling the nominal value of the singular values) if the functions f_i' differ from the f_i only slightly. Consequently the bounds will not change significantly, even though the number of basis functions just doubled.

Also note that the current error bounds reduce to the results of [6] in the SV case: here Q_ij = F_ij = k(x_i, x_j) (both the design matrix F and the regularization matrix Q are determined by kernels) and therefore F Q^{-1} F^T = Q.
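This collapse in the SV case can be checked numerically; a minimal sketch (our construction with an illustrative Gaussian kernel and random sample, not code from the paper):

```python
import numpy as np

# Sketch of the SV special case (Example 4): with Q = F = Gram matrix,
# F Q^{-1} F^T collapses to Q, so the spectrum analysed is exactly the
# eigenvalue spectrum of the kernel matrix itself.
rng = np.random.default_rng(1)
X = rng.standard_normal((15, 3))

# Gaussian Mercer kernel k(x, y) = exp(-0.5 * ||x - y||^2)
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dist)

FQinvFT = K @ np.linalg.solve(K, K)      # F Q^{-1} F^T with F = Q = K
collapse_ok = np.allclose(FQinvFT, K)

# spectrum entering the bound == spectrum of the Gram matrix
spec_match = np.allclose(np.linalg.eigvalsh(FQinvFT), np.linalg.eigvalsh(K))
```

Both flags come out true (up to floating-point tolerance), matching the algebraic identity Q Q^{-1} Q = Q.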
Thus the analysis of the singular values of F Q^{-1} F^T leads to an analysis of the eigenvalues of the kernel matrix, which is exactly what is done when dealing with SV machines.

6 ERROR BOUNDS

To use the above result we need a bound on the expected error of a hypothesis f in terms of the empirical error (training error) and the observed entropy numbers ε_n(ℱ). We use [6, Theorem 4.1] with a small modification.

Theorem 1 Let ℱ be a set of linear functions as described in the previous examples, with e_n(S_X) the corresponding bound on the observed entropy numbers of ℱ on the dataset X. Moreover suppose that, for a fixed threshold b ∈ ℝ and some f ∈ ℱ, sgn(f - b) correctly classifies the set X with a margin γ := min_{1≤i≤m} |f(x_i) - b|.

Finally let U := min{n ∈ ℕ : e_n(S_X) ≤ γ/8.001} and α(U, δ) := 3.08 (1 + ln(1/δ)). Then with confidence 1 - δ over X (drawn randomly from P^m, where P is some probability distribution), the expected error of sgn(f - b) is bounded from above by

ε(m, U, δ) = (2/m) (U (1 + α(U, δ) log(5m) log(17m)) + log(16/δ)).    (17)

The proof is essentially identical to that of [6, Theorem 4.1] and is omitted. [6] also shows how to compute e_n(S_X) efficiently, including an explicit formula for evaluating e_l(Σ).

7 DISCUSSION

We showed how improved bounds can be obtained on the entropy numbers of a wide class of popular statistical estimators, ranging from weight decay to sparsity regularization (with SV machines being a special case thereof). The results are given in a way that is directly usable for practitioners, without any tedious calculations of the VC dimension or similar combinatorial quantities. In particular, our method ignores (nearly) linearly dependent basis functions automatically.
Finally, it takes advantage of favourable distributions of data by using the observed entropy numbers as a basis for stating bounds on the true entropy numbers with respect to the function class under consideration.

Whilst this leads to significantly improved bounds on the expected risk (in our experiments we achieved an improvement of approximately two orders of magnitude over previous VC-type bounds involving only the radius R of the data and the length of the weight vector ||w||), the bounds are still not good enough to be predictive. This indicates that, rather than using the standard uniform convergence bounds (as in the previous section), one might want to use other techniques, such as a PAC-Bayesian treatment (as recently suggested by Herbrich and Graepel), in combination with the bounds on the eigenvalues of the design matrix.

Acknowledgements: This work was supported by the Australian Research Council and a grant of the Deutsche Forschungsgemeinschaft, SM 62/1-1.

References

[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615-631, 1997.

[2] B. Carl and I. Stephani. Entropy, Compactness, and the Approximation of Operators. Cambridge University Press, Cambridge, UK, 1990.

[3] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. Technical Report 479, Department of Statistics, Stanford University, 1995.

[4] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7:219-269, 1995.

[5] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, 1992.

[6] B. Scholkopf, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Generalization bounds via eigenvalues of the Gram matrix.
Technical Report NC-TR-99-035, NeuroCOLT2, University of London, UK, 1999.

[7] J. Shawe-Taylor and R. C. Williamson. Generalization performance of classifiers in terms of observed covering numbers. In Proc. EUROCOLT'99, 1999.

[8] R. C. Williamson, A. J. Smola, and B. Scholkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. Technical Report NC-TR-98-019, NeuroCOLT, Royal Holloway College, 1998.

[9] R. C. Williamson, A. J. Smola, and B. Scholkopf. A Maximum Margin Miscellany. Typescript, 1999.", "award": [], "sourceid": 1677, "authors": [{"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "John", "family_name": "Shawe-Taylor", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}, {"given_name": "Robert", "family_name": "Williamson", "institution": null}]}