{"title": "The representer theorem for Hilbert spaces: a necessary and sufficient condition", "book": "Advances in Neural Information Processing Systems", "page_first": 189, "page_last": 196, "abstract": "The representer theorem is a property that lies at the foundation of regularization theory and kernel methods. A class of regularization functionals is said to admit a linear representer theorem if every member of the class admits minimizers that lie in the finite dimensional subspace spanned by the representers of the data. A recent characterization states that certain classes of regularization functionals with differentiable regularization term admit a linear representer theorem for any choice of the data if and only if the regularization term is a radial nondecreasing function. In this paper, we extend such result by weakening the assumptions on the regularization term. In particular, the main result of this paper implies that, for a sufficiently large family of regularization functionals, radial nondecreasing functions are the only lower semicontinuous regularization terms that guarantee existence of a representer theorem for any choice of the data.", "full_text": "The representer theorem for Hilbert spaces: a\n\nnecessary and suf\ufb01cient condition\n\nFrancesco Dinuzzo and Bernhard Sch\u00a8olkopf\nMax Planck Institute for Intelligent Systems\n\nSpemannstrasse 38,72076 T\u00a8ubingen\n\n[fdinuzzo@tuebingen.mpg.de, bs@tuebingen.mpg.de]\n\nGermany\n\nAbstract\n\nThe representer theorem is a property that lies at the foundation of regularization\ntheory and kernel methods. 
A class of regularization functionals is said to admit a linear representer theorem if every member of the class admits minimizers that lie in the finite-dimensional subspace spanned by the representers of the data. A recent characterization states that certain classes of regularization functionals with differentiable regularization term admit a linear representer theorem for any choice of the data if and only if the regularization term is a radial nondecreasing function. In this paper, we extend this result by weakening the assumptions on the regularization term. In particular, the main result of this paper implies that, for a sufficiently large family of regularization functionals, radial nondecreasing functions are the only lower semicontinuous regularization terms that guarantee existence of a representer theorem for any choice of the data.\n\n1 Introduction\n\nRegularization [1] is a popular and well-studied methodology to address ill-posed estimation problems [2] and learning from examples [3]. In this paper, we focus on regularization problems defined over a real Hilbert space H. A Hilbert space is a vector space endowed with an inner product and a norm that is complete¹. This setting is general enough to take into account a broad family of finite-dimensional regularization techniques such as regularized least squares or support vector machines (SVM) for classification or regression, kernel principal component analysis, as well as a variety of methods based on regularization over reproducing kernel Hilbert spaces (RKHS).\nThe focus of our study is the general problem of minimizing an extended real-valued regularization functional J : H → R ∪ {+∞} of the form\n\nJ(w) = f(L1w, . . . , Lℓw) + Ω(w),    (1)\n\nwhere L1, . . . , Lℓ are bounded linear functionals on H. 
The functional J is the sum of an error term f, which typically depends on empirical data, and a regularization term Ω that enforces certain desirable properties on the solution. By allowing the error term f to take the value +∞, problems with hard constraints on the values Liw (for instance, interpolation problems) are included in the framework. Moreover, by allowing Ω to take the value +∞, regularization problems of the Ivanov type are also taken into account.\nIn machine learning, the most common class of regularization problems concerns a situation where a set of data pairs (xi, yi) is available, H is a space of real-valued functions, and the objective functional to be minimized is of the form\n\nJ(w) = c((x1, y1, w(x1)), · · · , (xℓ, yℓ, w(xℓ))) + Ω(w).\n\n¹ Meaning that Cauchy sequences are convergent.\n\nIt is easy to see that this setting is a particular case of (1), where the dependence on the data pairs (xi, yi) can be absorbed into the definition of f, and the Li are point-wise evaluation functionals, i.e. such that Liw = w(xi). Several popular techniques can be cast in this regularization framework.\nExample 1 (Regularized least squares). Also known as ridge regression when H is finite-dimensional. Corresponds to the choice\n\nc((x1, y1, w(x1)), · · · , (xℓ, yℓ, w(xℓ))) = γ Σ_{i=1}^{ℓ} (yi − w(xi))²,\n\nand Ω(w) = ‖w‖², where the complexity parameter γ ≥ 0 controls the trade-off between fitting of training data and regularity of the solution.\nExample 2 (Support vector machine). Given binary labels yi = ±1, the SVM classifier (without bias) can be interpreted as a regularization method corresponding to the choice\n\nc((x1, y1, w(x1)), · · · , (xℓ, yℓ, w(xℓ))) = γ Σ_{i=1}^{ℓ} max{0, 1 − yi w(xi)},\n\nand Ω(w) = ‖w‖². The hard-margin SVM can be recovered by letting γ → +∞.\nExample 3 (Kernel principal component analysis). Kernel PCA can be shown to be equivalent to a regularization problem where\n\nc((x1, y1, w(x1)), · · · , (xℓ, yℓ, w(xℓ))) = 0 if (1/ℓ) Σ_{i=1}^{ℓ} (w(xi) − (1/ℓ) Σ_{j=1}^{ℓ} w(xj))² = 1, and +∞ otherwise,\n\nand Ω is any strictly monotonically increasing function of the norm ‖w‖, see [4]. In this problem, there are no labels yi, but the feature extractor function w is constrained to produce vectors with unitary empirical variance.\nThe possibility of choosing general continuous linear functionals Li in (1) allows one to consider a much broader class of regularization problems. Some examples are the following.\nExample 4 (Tikhonov deconvolution). Given an \u201cinput signal\u201d u : X → R, assume that the convolution u ∗ w is well-defined for any w ∈ H, and that the point-wise evaluated convolution functionals\n\nLiw = (u ∗ w)(xi) = ∫_X u(s) w(xi − s) ds\n\nare continuous. A possible way to recover w from noisy measurements yi of the \u201coutput signal\u201d is to solve regularization problems such as\n\nmin_{w∈H} ( γ Σ_{i=1}^{ℓ} (yi − (u ∗ w)(xi))² + ‖w‖² ),\n\nwhere the objective functional is of the form (1).\nExample 5 (Learning from probability measures). In certain learning problems, it may be appropriate to represent input data as probability distributions. Given a finite set of probability measures Pi on a measurable space (X, A), where A is a σ-algebra of subsets of X, introduce the expectations\n\nLiw = E_{Pi}(w) = ∫_X w(x) dPi(x).\n\nThen, given output labels yi, one can learn an input-output relationship by solving regularization problems of the form\n\nmin_{w∈H} ( c((y1, E_{P1}(w)), · · · , (yℓ, E_{Pℓ}(w))) + ‖w‖² ).\n\nIf the expectations are bounded linear functionals, such a regularization functional is of the form (1).\nExample 6 (Ivanov regularization). By allowing the regularization term Ω to take the value +∞, we can also take into account the whole class of Ivanov-type regularization problems of the form\n\nmin_{w∈H} f(L1w, . . . , Lℓw), subject to φ(w) ≤ 1,\n\nby reformulating them as the minimization of a functional of the type (1), where\n\nΩ(w) = 0 if φ(w) ≤ 1, and +∞ otherwise.\n\n1.1 The representer theorem\n\nLet us now go back to the general formulation (1). By the Riesz representation theorem [5, 6], J can be rewritten as\n\nJ(w) = f(⟨w, w1⟩, . . . , ⟨w, wℓ⟩) + Ω(w),\n\nwhere wi is the representer of the linear functional Li with respect to the inner product. Consider the following definition.\nDefinition 1. A family F of regularization functionals of the form (1) is said to admit a linear representer theorem if, for any J ∈ F, and any choice of bounded linear functionals Li, there exists a minimizer w∗ that can be written as a linear combination of the representers:\n\nw∗ = Σ_{i=1}^{ℓ} ci wi.\n\nIf a linear representer theorem holds, the regularization problem under study can be reduced to an ℓ-dimensional optimization problem on the scalar coefficients ci, independently of the dimension of H. 
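To make the reduction concrete, here is a small numerical sketch (illustrative code, not part of the paper; the data and dimensions are invented) using the regularized least squares functional of Example 1 with inner-product data functionals Liw = ⟨w, wi⟩ on H = R^dim. The ℓ-dimensional problem in the coefficients ci recovers exactly the minimizer computed directly in H:

```python
import numpy as np

# Regularized least squares (Example 1): J(w) = gamma * sum_i (y_i - <w, w_i>)^2 + ||w||^2.
# By the representer theorem, w* = sum_i c_i w_i, so the problem reduces to minimizing
# gamma * ||y - G c||^2 + c^T G c over c in R^ell, where G_ij = <w_i, w_j>.
rng = np.random.default_rng(0)
ell, dim, gamma = 5, 50, 10.0          # few data functionals, high-dimensional H
W = rng.standard_normal((ell, dim))    # rows are the representers w_i
y = rng.standard_normal(ell)

# ell-dimensional problem: setting the gradient to zero gives G(gamma*(G c - y) + c) = 0,
# solved (for invertible G) by c = (G + I/gamma)^{-1} y.
G = W @ W.T                            # Gram matrix of the representers
c = np.linalg.solve(G + np.eye(ell) / gamma, y)
w_repr = W.T @ c                       # minimizer expressed in span{w_1, ..., w_ell}

# Direct minimization in H: (gamma W^T W + I) w = gamma W^T y.
w_direct = np.linalg.solve(gamma * W.T @ W + np.eye(dim), gamma * W.T @ y)

print(np.allclose(w_repr, w_direct))   # the two minimizers coincide
```

The agreement of the two solutions is the push-through identity Wᵀ(WWᵀ + I/γ)⁻¹ = (WᵀW + I/γ)⁻¹Wᵀ in disguise: the 50-dimensional problem collapses to a 5-dimensional one.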
This property is fundamental in practice: without a finite-dimensional parametrization, it would not be possible to employ numerical optimization techniques to compute a solution. Sufficient conditions under which a family of functionals admits a representer theorem have been widely studied in the literature of statistics, inverse problems, and machine learning. The theorem also provides the foundations of learning techniques such as regularized kernel methods and support vector machines, see [7, 8, 9] and references therein.\nRepresenter theorems are of particular interest when H is a reproducing kernel Hilbert space (RKHS) [10]. Given a non-empty set X, an RKHS is a space of functions w : X → R such that point-wise evaluation functionals are bounded, namely, for any x ∈ X, there exists a non-negative real number Cx such that\n\n|w(x)| ≤ Cx ‖w‖, ∀w ∈ H.\n\nIt can be shown that an RKHS can be uniquely associated with a positive-semidefinite kernel function K : X × X → R (called the reproducing kernel), such that the so-called reproducing property holds:\n\nw(x) = ⟨w, Kx⟩, ∀(x, w) ∈ X × H,\n\nwhere the kernel sections Kx are defined as\n\nKx(y) = K(x, y), ∀y ∈ X.\n\nThe reproducing property states that the representers of point-wise evaluation functionals coincide with the kernel sections. 
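These definitions can be sanity-checked in the simplest finite-dimensional case (an illustrative sketch, not from the paper): for the linear kernel K(x, y) = ⟨x, y⟩ on X = R^d, the RKHS is R^d itself with w(x) = ⟨w, x⟩, the kernel section Kx is the vector x, and Cx = ‖x‖ works by the Cauchy-Schwarz inequality.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
w = rng.standard_normal(d)             # an element of the RKHS H = R^d
x = rng.standard_normal(d)             # an input point in X = R^d

K_x = x                                # kernel section of K(x, y) = <x, y> is K_x = x
w_at_x = w @ x                         # point-wise evaluation w(x) = <w, x>

# Reproducing property: w(x) = <w, K_x>.
assert np.isclose(w_at_x, w @ K_x)

# Boundedness of evaluation: |w(x)| <= C_x ||w|| with C_x = ||x|| (Cauchy-Schwarz).
assert abs(w_at_x) <= np.linalg.norm(x) * np.linalg.norm(w) + 1e-12
print("reproducing property and boundedness verified")
```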
Starting from the reproducing property, it is also easy to show that the representer of any bounded linear functional L is given by a function KL ∈ H such that\n\nKL(x) = L Kx, ∀x ∈ X.\n\nTherefore, in an RKHS, the representer of any bounded linear functional can be obtained explicitly in terms of the reproducing kernel.\nIf the regularization functional (1) admits minimizers, and the regularization term Ω is a nondecreasing function of the norm, i.e.\n\nΩ(w) = h(‖w‖), with h : R → R ∪ {+∞} nondecreasing,    (2)\n\nthe linear representer theorem follows easily from the Pythagorean identity. A proof that condition (2) is sufficient appeared in [11] for the case where H is an RKHS and the Li are point-wise evaluation functionals. Earlier instances of representer theorems can be found in [12, 13, 14]. More recently, the question of whether condition (2) is also necessary for the existence of linear representer theorems has been investigated [15]. In particular, [15] shows that, if Ω is differentiable (and certain technical existence conditions hold), then (2) is a necessary and sufficient condition for certain classes of regularization functionals to admit a representer theorem. The proof of [15] heavily exploits the differentiability of Ω, but the authors conjecture that this hypothesis can be relaxed. In the following, we indeed show that (2) is necessary and sufficient for the family of regularization functionals of the form (1) to admit a linear representer theorem, by merely assuming that Ω is lower semicontinuous and satisfies basic conditions for the existence of minimizers. 
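The role of the Pythagorean identity can be checked numerically: for orthogonal x and y, ‖x + y‖² = ‖x‖² + ‖y‖², so any Ω of the form (2) satisfies Ω(x + y) ≥ max{Ω(x), Ω(y)} on orthogonal pairs. A quick check (illustrative code; the particular nondecreasing h below is an arbitrary choice, not from the paper):

```python
import numpy as np

# For Omega(w) = h(||w||) with h nondecreasing, orthogonal x, y satisfy
# ||x + y||^2 = ||x||^2 + ||y||^2, hence Omega(x + y) >= max(Omega(x), Omega(y)).
h = lambda t: np.log1p(t**2)           # an arbitrary nondecreasing h on [0, +inf)
omega = lambda w: h(np.linalg.norm(w))

rng = np.random.default_rng(2)
for _ in range(1000):
    x = rng.standard_normal(6)
    y = rng.standard_normal(6)
    y -= (x @ y) / (x @ x) * x         # Gram-Schmidt: make y orthogonal to x
    assert abs(x @ y) < 1e-9
    assert omega(x + y) >= max(omega(x), omega(y)) - 1e-12
print("Omega(x + y) >= max(Omega(x), Omega(y)) verified on orthogonal pairs")
```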
The proof is based on a characterization of radial nondecreasing functions defined on a Hilbert space.\n\n2 A characterization of radial nondecreasing functions\n\nIn this section, we present a characterization of radial nondecreasing functions defined over Hilbert spaces. We will make use of the following definition.\nDefinition 2. A subset S of a Hilbert space H is called star-shaped with respect to a point z ∈ H if\n\n(1 − λ)z + λx ∈ S, ∀x ∈ S, ∀λ ∈ [0, 1].\n\nIt is easy to verify that a convex set is star-shaped with respect to any point of the set, whereas a star-shaped set does not have to be convex.\nThe following theorem provides a geometric characterization of radial nondecreasing functions defined on a Hilbert space that generalizes the analogous result of [15] for differentiable functions.\nTheorem 1. Let H denote a Hilbert space such that dim H ≥ 2, and Ω : H → R ∪ {+∞} a lower semicontinuous function. Then, (2) holds if and only if\n\nΩ(x + y) ≥ max{Ω(x), Ω(y)}, ∀x, y ∈ H : ⟨x, y⟩ = 0.    (3)\n\nProof. Assume that (2) holds. Then, for any pair of orthogonal vectors x, y ∈ H, we have\n\nΩ(x + y) = h(‖x + y‖) = h(√(‖x‖² + ‖y‖²)) ≥ max{h(‖x‖), h(‖y‖)} = max{Ω(x), Ω(y)}.\n\nConversely, assume that condition (3) holds. 
Since dim H ≥ 2, for a generic fixed vector x ∈ H \\ {0} and a number λ ∈ [0, 1], there exists a vector y such that ‖y‖ = 1 and\n\nλ = 1 − cos² θ, where cos θ = ⟨x, y⟩ / (‖x‖ ‖y‖).\n\nIn view of (3), we have\n\nΩ(x) = Ω(x − ⟨x, y⟩y + ⟨x, y⟩y) ≥ Ω(x − ⟨x, y⟩y) = Ω(x − cos² θ x + cos² θ x − ⟨x, y⟩y) ≥ Ω(λx).\n\nSince the last inequality trivially holds also when x = 0, we conclude that\n\nΩ(x) ≥ Ω(λx), ∀x ∈ H, ∀λ ∈ [0, 1],    (4)\n\nso that Ω is nondecreasing along all the rays passing through the origin. In particular, the minimum of Ω is attained at x = 0.\nNow, for any c ≥ Ω(0), consider the sublevel sets\n\nSc = {x ∈ H : Ω(x) ≤ c}.\n\nFrom (4), it follows that Sc is nonempty and star-shaped with respect to the origin. In addition, since Ω is lower semicontinuous, Sc is also closed. We now show that Sc is either a closed ball centered at the origin, or the whole space. To this end, we show that, for any x ∈ Sc, the whole ball\n\nB = {y ∈ H : ‖y‖ ≤ ‖x‖}\n\nis contained in Sc. First, take any y ∈ int(B) \\ span{x}, where int denotes the interior. Then, y has norm strictly less than ‖x‖, that is\n\n0 < ‖y‖ < ‖x‖,\n\nand is not aligned with x, i.e.\n\ny ≠ λx, ∀λ ∈ R.\n\nLet θ ∈ R denote the angle between x and y. 
Now, construct a sequence of points xk as follows:\n\nx0 = y, xk+1 = xk + ak uk,\n\nwhere\n\nak = ‖xk‖ tan(θ/n), n ∈ N,\n\nand uk is the unique unit vector that is orthogonal to xk, belongs to the two-dimensional subspace span{x, y}, and is such that ⟨uk, x⟩ > 0, that is\n\n‖uk‖ = 1, uk ∈ span{x, y}, ⟨uk, xk⟩ = 0, ⟨uk, x⟩ > 0.\n\nSee Figure 1 for a geometrical illustration of the sequence xk.\nBy orthogonality, we have\n\n‖xk+1‖² = ‖xk‖² + ak² = ‖xk‖² (1 + tan²(θ/n)) = ‖y‖² (1 + tan²(θ/n))^{k+1}.    (5)\n\nIn addition, the angle between xk+1 and xk is given by\n\nθk = arctan(ak / ‖xk‖) = θ/n,\n\nso that the total angle between y and xn is given by\n\nΣ_{k=0}^{n−1} θk = θ.\n\nSince all the points xk belong to the subspace spanned by x and y, and the angle between x and xn is zero, we have that xn is positively aligned with x, that is\n\nxn = λx, λ ≥ 0.\n\nNow, we show that n can be chosen in such a way that λ ≤ 1. Indeed, from (5) we have\n\nλ² = (‖xn‖/‖x‖)² = (‖y‖/‖x‖)² (1 + tan²(θ/n))ⁿ,\n\nand it can be verified that\n\nlim_{n→+∞} (1 + tan²(θ/n))ⁿ = 1,\n\ntherefore λ ≤ 1 for a sufficiently large n. Now, write the difference vector in the form\n\nλx − y = Σ_{k=0}^{n−1} (xk+1 − xk),\n\nand observe that\n\n⟨xk+1 − xk, xk⟩ = 0.\n\nBy using (4) and proceeding by induction, we have\n\nc ≥ Ω(λx) = Ω(xn − xn−1 + xn−1) ≥ Ω(xn−1) ≥ · · · ≥ Ω(x0) = Ω(y),\n\nso that y ∈ Sc. Since Sc is closed and the closure of int(B) \\ span{x} is the whole ball B, every point y ∈ B is also included in Sc. This proves that Sc is either a closed ball centered at the origin, or the whole space H.\nFinally, for any pair of points such that ‖x‖ = ‖y‖, we have x ∈ S_{Ω(y)} and y ∈ S_{Ω(x)}, so that\n\nΩ(x) = Ω(y).\n\nFigure 1: The sequence xk constructed in the proof of Theorem 1 is associated with a geometrical construction known as the spiral of Theodorus. Starting from any y in the interior of the ball (excluding points aligned with x), a point of the type λx (with 0 ≤ λ ≤ 1) can be reached by using a finite number of right triangles.\n\n3 Representer theorem: a necessary and sufficient condition\n\nIn this section, we prove that condition (2) is necessary and sufficient for suitable families of regularization functionals of the type (1) to admit a linear representer theorem.\nTheorem 2. Let H denote a Hilbert space of dimension at least 2. Let F denote a family of functionals J : H → R ∪ {+∞} of the form (1) that admit minimizers, and assume that F contains a set of functionals of the form\n\nJ_p^γ(w) = γ f(⟨w, p⟩) + Ω(w), ∀γ ∈ R₊, ∀p ∈ H,    (6)\n\nwhere f(z) is uniquely minimized at z = 1. Then, for any lower semicontinuous Ω, the family F admits a linear representer theorem if and only if (2) holds.\n\nProof. The first part of the theorem (sufficiency) follows from an orthogonality argument. 
Take any functional J ∈ F. Let R = span{w1, . . . , wℓ} and let R⊥ denote its orthogonal complement. Any minimizer w∗ of J can be uniquely decomposed as\n\nw∗ = u + v, u ∈ R, v ∈ R⊥.\n\nIf (2) holds, then we have\n\nJ(w∗) − J(u) = h(‖w∗‖) − h(‖u‖) ≥ 0,\n\nso that u ∈ R is also a minimizer.\nNow, let us prove the second part of the theorem (necessity). First of all, observe that the functional\n\nJ_0^γ(w) = γ f(0) + Ω(w),\n\nobtained by setting p = 0 in (6), belongs to F. By hypothesis, J_0^γ admits minimizers. In addition, by the representer theorem, the only admissible minimizer of J_0^γ is the origin, that is\n\nΩ(y) ≥ Ω(0), ∀y ∈ H.    (7)\n\nNow take any x ∈ H \\ {0} and let\n\np = x / ‖x‖².\n\nBy the representer theorem, the functional J_p^γ of the form (6) admits a minimizer of the type\n\nw = λ(γ) x.\n\nNow, take any y ∈ H such that ⟨x, y⟩ = 0. 
By using the fact that f(z) is minimized at z = 1, and the linear representer theorem, we have\n\nγ f(1) + Ω(λ(γ)x) ≤ γ f(λ(γ)) + Ω(λ(γ)x) = J_p^γ(λ(γ)x) ≤ J_p^γ(x + y) = γ f(1) + Ω(x + y).\n\nBy combining this last inequality with (7), we conclude that\n\nΩ(x + y) ≥ Ω(λ(γ)x), ∀x, y ∈ H : ⟨x, y⟩ = 0, ∀γ ∈ R₊.    (8)\n\nNow, there are two cases:\n\n• Ω(x + y) = +∞;\n• Ω(x + y) = C < +∞.\n\nIn the first case, we trivially have\n\nΩ(x + y) ≥ Ω(x).\n\nIn the second case, using (7) and (8), we obtain\n\n0 ≤ γ (f(λ(γ)) − f(1)) ≤ Ω(x + y) − Ω(λ(γ)x) ≤ C − Ω(0) < +∞, ∀γ ∈ R₊.    (9)\n\nLet γk denote a sequence such that lim_{k→+∞} γk = +∞, and consider the sequence\n\nak = γk (f(λ(γk)) − f(1)).\n\nFrom (9), it follows that ak is bounded. Since z = 1 is the only minimizer of f(z), the sequence ak can remain bounded only if\n\nlim_{k→+∞} λ(γk) = 1.\n\nBy taking the limit inferior in (8) for γ → +∞, and using the fact that Ω is lower semicontinuous, we obtain condition (3). It follows that Ω satisfies the hypotheses of Theorem 1, therefore (2) holds.\n\nThe second part of Theorem 2 states that any lower semicontinuous regularization term Ω has to be of the form (2) in order for the family F to admit a linear representer theorem. Observe that Ω is not required to be differentiable or even continuous. Moreover, it need not have bounded lower level sets. For the necessary condition to hold, the family F has to be broad enough to contain at least a set of regularization functionals of the form (6). 
The following examples show how to apply the necessary condition of Theorem 2 to classes of regularization problems with standard loss functions.\n\n• Let L : R² → R ∪ {+∞} denote any loss function of the type\n\nL(y, z) = L̃(y − z),\n\nsuch that L̃(t) is uniquely minimized at t = 0. Then, for any lower semicontinuous regularization term Ω, the family of regularization functionals of the form\n\nJ(w) = γ Σ_{i=1}^{ℓ} L(yi, ⟨w, wi⟩) + Ω(w)\n\nadmits a linear representer theorem if and only if (2) holds. To see that the hypotheses of Theorem 2 are satisfied, it is sufficient to consider the subset of functionals with ℓ = 1, y1 = 1, and w1 = p ∈ H. These functionals can be written in the form (6) with\n\nf(z) = L(1, z).\n\n• The class of regularization problems with the hinge (SVM) loss of the form\n\nJ(w) = γ Σ_{i=1}^{ℓ} max{0, 1 − yi ⟨w, wi⟩} + Ω(w),\n\nwith Ω lower semicontinuous, admits a linear representer theorem if and only if Ω satisfies (2). For instance, by choosing ℓ = 2, and\n\n(y1, w1) = (1, p), (y2, w2) = (−1, p/2),\n\nwe obtain regularization functionals of the form (6) with\n\nf(z) = max{0, 1 − z} + max{0, 1 + z/2},\n\nand it is easy to verify that f is uniquely minimized at z = 1.\n\n4 Conclusions\n\nSufficiently broad families of regularization functionals defined over a Hilbert space with lower semicontinuous regularization term admit a linear representer theorem if and only if the regularization term is a radial nondecreasing function. 
More precisely, the main result of this paper (Theorem 2) implies that, for any sufficiently large family of regularization functionals, nondecreasing functions of the norm are the only lower semicontinuous (extended real-valued) regularization terms that guarantee existence of a representer theorem for any choice of the data functionals Li.\nAs a concluding remark, it is important to observe that other types of regularization terms are possible if the representer theorem is only required to hold for a restricted subset of the data functionals. Exploring necessary conditions for the existence of representer theorems under different types of restrictions on the data functionals is an interesting future research direction.\n\n5 Acknowledgments\n\nThe authors would like to thank Andreas Argyriou for useful discussions.\n\nReferences\n\n[1] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems. V. H. Winston & Sons, Washington, D.C., 1977.\n[2] G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, USA, 1990.\n[3] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39:1–49, 2001.\n[4] B. Schölkopf, A. J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.\n[5] F. Riesz. Sur une espèce de géométrie analytique des systèmes de fonctions sommables. Comptes rendus de l'Académie des sciences Paris, 144:1409–1411, 1907.\n[6] M. Fréchet. Sur les ensembles de fonctions et les opérations linéaires. Comptes rendus de l'Académie des sciences Paris, 144:1414–1416, 1907.\n[7] V. Vapnik. Statistical Learning Theory. Wiley, New York, NY, USA, 1998.\n[8] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. 
(Adaptive Computation and Machine Learning). MIT Press, 2001.\n[9] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.\n[10] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.\n[11] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the Annual Conference on Computational Learning Theory, pages 416–426, 2001.\n[12] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1):82–95, 1971.\n[13] D. Cox and F. O'Sullivan. Asymptotic analysis of penalized likelihood and related estimators. The Annals of Statistics, 18:1676–1695, 1990.\n[14] T. Poggio and F. Girosi. Networks for approximation and learning. In Proceedings of the IEEE, volume 78, pages 1481–1497, 1990.\n[15] A. Argyriou, C. A. Micchelli, and M. Pontil. When is there a representer theorem? Vector versus matrix regularizers. Journal of Machine Learning Research, 10:2507–2529, 2009.", "award": [], "sourceid": 117, "authors": [{"given_name": "Francesco", "family_name": "Dinuzzo", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}]}