{"title": "Asymptotics of Gaussian Regularized Least Squares", "book": "Advances in Neural Information Processing Systems", "page_first": 803, "page_last": 810, "abstract": null, "full_text": "Asymptotics of Gaussian Regularized\n\nLeast-Squares\n\nM.I.T., Department of Mathematics\n\nHonda Research Institute USA, Inc.\n\nRoss A. Lippert\n\n77 Massachusetts Avenue\n\nCambridge, MA 02139-4307\nlippert@math.mit.edu\n\nRyan M. Rifkin\n\n145 Tremont Street\nBoston, MA 02111\n\nrrifkin@honda-ri.com\n\nAbstract\n\nWe consider regularized least-squares (RLS) with a Gaussian kernel. We\nprove that if we let the Gaussian bandwidth \u03c3 \u2192 \u221e while letting the\nregularization parameter \u03bb \u2192 0, the RLS solution tends to a polynomial\nwhose order is controlled by the rielative rates of decay of 1\n\u03c32 and \u03bb: if\n\u03bb = \u03c3\u2212(2k+1), then, as \u03c3 \u2192 \u221e, the RLS solution tends to the kth order\npolynomial with minimal empirical error. We illustrate the result with an\nexample.\n\n1\n\nIntroduction\n\nGiven a data set (x1, y1), (x2, y2), . . . , (xn, yn), the inductive learning task is to build a\nfunction f(x) that, given a new x point, can predict the associated y value. 
We study the Regularized Least-Squares (RLS) algorithm for finding f, a common and popular algorithm [2, 5] that can be used for either regression or classification:

min_{f ∈ H} (1/n) Σ_{i=1}^{n} (f(x_i) − y_i)² + λ ||f||²_K.

Here, H is a Reproducing Kernel Hilbert Space (RKHS) [1] with associated kernel function K, ||f||²_K is the squared norm in the RKHS, and λ is a regularization constant controlling the tradeoff between fitting the training set accurately and forcing smoothness of f. The Representer Theorem [7] proves that the RLS solution will have the form f(x) = Σ_{i=1}^{n} c_i K(x_i, x), and it is easy to show [5] that we can find the coefficients c by solving the linear system

(K + λnI)c = y,    (1)

where K is the n × n matrix satisfying K_ij = K(x_i, x_j). We focus on the Gaussian kernel K(x_i, x_j) = exp(−||x_i − x_j||²/2σ²).

Our work was originally motivated by the empirical observation that on a range of benchmark classification tasks, we achieved surprisingly accurate classification using a Gaussian kernel with a very large σ and a very small λ (Figure 1; additional examples in [6]). This prompted us to study the large-σ asymptotics of RLS. As σ → ∞, K(x_i, x_j) → 1 for arbitrary x_i and x_j. Consider a single test point x_0. RLS will first find c using Equation 1,

Fig. 1. RLS classification accuracy results for the UCI Galaxy dataset over a range of σ (along the x-axis) and λ (different lines) values. The vertical labelled lines show m, the smallest entry in the kernel matrix for a given σ. We see that when λ = 1e−11, we can classify quite accurately when the smallest entry of the kernel matrix is .99999.

then compute f(x_0) = c^t k, where k is the kernel vector, k_i = K(x_i, x_0). 
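In code, the training step (Equation 1) and this prediction step look as follows. This is a minimal numpy sketch with a one-dimensional illustrative dataset; the function name, the data, and the σ and λ values are our own arbitrary choices, not from the paper.

```python
import numpy as np

def rls_fit_predict(X, y, X_test, sigma, lam):
    """Gaussian-kernel RLS: solve (K + lam*n*I) c = y, then predict c^t k."""
    n = len(X)
    # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for 1-d inputs
    K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * sigma**2))
    c = np.linalg.solve(K + lam * n * np.eye(n), y)
    # kernel vector between each test point and the training points
    k_test = np.exp(-(X_test[:, None] - X[None, :]) ** 2 / (2 * sigma**2))
    return k_test @ c

# Illustrative 1-d regression problem
X = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * X)
preds = rls_fit_predict(X, y, X, sigma=0.3, lam=1e-8)
train_err = np.max(np.abs(preds - y))  # small lam => near-interpolation
```

With λ this small the fit nearly interpolates the training data; increasing λ trades training accuracy for smoothness, exactly the tradeoff in the objective above.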
Combining the training and testing steps, we see that f(x_0) = y^t (K + λnI)^{−1} k.

Both K and k are close to 1 for large σ, i.e. K_ij = 1 + ε_ij and k_i = 1 + ε_i. If we directly compute c = (K + λnI)^{−1} y, we will tend to wash out the effects of the ε_ij terms as σ becomes large. If, instead, we compute f(x_0) by associating to the right, first computing point affinities (K + λnI)^{−1} k, then the ε_ij and ε_j interact meaningfully; this interaction is crucial to our analysis.

Our approach is to Taylor expand the kernel elements (and thus K and k) in 1/σ, noting that as σ → ∞, consecutive terms in the expansion differ enormously. In computing (K + λnI)^{−1} k, these scalings cancel each other out, and result in finite point affinities even as σ → ∞. The asymptotic affinity formula can then be “transposed” to create an alternate expression for f(x_0). Our main result is that if we set σ² = s² and λ = s^{−(2k+1)}, then, as s → ∞, the RLS solution tends to the kth order polynomial with minimal empirical error.

The main theorem is proved in full. Due to space restrictions, the proofs of supporting lemmas and corollaries are omitted; an expanded version containing all proofs is available [4].

2 Notation and definitions

Definition 1. Let x_i, 0 ≤ i ≤ n, be a set of n + 1 points in a d-dimensional space. The scalar x_ia denotes the value of the ath vector component of the ith point. The n × d matrix X is given by X_ia = x_ia.

We think of X as the matrix of training data x_1, . . .
, x_n, and x_0 as a 1 × d matrix consisting of the test point.

Let 1_m, 1_{l×m} denote the m-dimensional vector and the l × m matrix with components all 1, and similarly for 0_m, 0_{l×m}. We will dispense with such subscripts when the dimensions are clear from context.

Definition 2 (Hadamard products and powers). For two l × m matrices N, M, N ⊙ M denotes the l × m matrix given by (N ⊙ M)_ij = N_ij M_ij. Analogously, we set (N^{⊙c})_ij = N_ij^c.

Definition 3 (polynomials in the data). Let I ∈ Z^d_{≥0} (non-negative multi-indices) and let Y be a k × d matrix. Y^I is the k-dimensional vector given by (Y^I)_i = Π_{a=1}^{d} Y_{ia}^{I_a}. If h : R^d → R then h(Y) is the k-dimensional vector given by (h(Y))_i = h(Y_{i1}, . . . , Y_{id}). The d canonical vectors e_a ∈ Z^d_{≥0} are given by (e_a)_b = δ_ab.

Any scalar function f : R → R, applied to any matrix or vector A, will be assumed to denote the elementwise application of f. We will treat y → e^y as a scalar function (we have no need of matrix exponentials in this work, so the notation is unambiguous).

We can re-express the kernel matrix and kernel vector in this notation:

K = diag(e^{−||X||²/2σ²}) e^{XX^t/σ²} diag(e^{−||X||²/2σ²})    (2)
  = e^{(1/2σ²) Σ_{a=1}^{d} (2 X^{e_a} (X^{e_a})^t − X^{2e_a} 1_n^t − 1_n (X^{2e_a})^t)}    (3)

k = diag(e^{−||X||²/2σ²}) e^{X x_0^t/σ²} e^{−||x_0||²/2σ²}    (4)
  = e^{(1/2σ²) Σ_{a=1}^{d} (2 X^{e_a} x_0^{e_a} − X^{2e_a} − 1_n x_0^{2e_a})}    (5)

(here ||X||² = Σ_{a=1}^{d} X^{2e_a} denotes the vector of squared row norms of X).

3 Orthogonal polynomial bases

Let V_c = span{X^I : |I| = c} and V_{≤c} = span{X^I : |I| ≤ c}, which can be thought of as the set of all d-variable polynomials of degree c, evaluated on the training data. Since the data are finite, there exists b such that V_{≤c} = V_{≤b} for all c ≥ b. 
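The elementwise-exponential factorization of the Gaussian kernel matrix used above — K = diag(e^{−||X||²/2σ²}) e^{XX^t/σ²} diag(e^{−||X||²/2σ²}), with e^{·} applied elementwise — is easy to check numerically. A small sketch (the random data, sizes, and σ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 5, 2, 1.7          # arbitrary illustrative sizes
X = rng.normal(size=(n, d))

# Direct kernel matrix: K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_direct = np.exp(-D2 / (2 * sigma**2))

# Factorized form: diag(e^{-||X||^2/2s^2}) e^{XX^t/s^2} diag(e^{-||X||^2/2s^2}),
# where e^{.} is the elementwise exponential, per the convention above
N = np.diag(np.exp(-(X**2).sum(axis=1) / (2 * sigma**2)))
P = np.exp(X @ X.T / sigma**2)
err = np.abs(K_direct - N @ P @ N).max()
```

The two forms agree to machine precision, since −||x_i||² + 2x_i·x_j − ||x_j||² = −||x_i − x_j||² entrywise.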
Generically, b is the smallest c such that (c + d choose d) ≥ n.

Let Q be an orthonormal matrix in R^{n×n} whose columns progressively span the V_{≤c} spaces, i.e. Q = ( B_0 B_1 · · · B_b ), where Q^t Q = I and colspan{( B_0 · · · B_c )} = V_{≤c}. We might imagine building such a Q via the Gram-Schmidt process on the vectors X^0, X^{e_1}, . . . , X^{e_d}, . . . , X^I, . . . taken in order of non-decreasing |I|.

Letting C_I = |I|!/(I_1! · · · I_d!) be the multinomial coefficients, the following relations between Q, X, and x_0 are easily proved:

(X x_0^t)^{⊙c} = Σ_{|I|=c} C_I X^I (x_0^I)^t,    hence    (X x_0^t)^{⊙c} ∈ V_c,
(X X^t)^{⊙c} = Σ_{|I|=c} C_I X^I (X^I)^t,    hence    colspan{(X X^t)^{⊙c}} = V_c,

and thus B_i^t (X x_0^t)^{⊙c} = 0 if i > c, B_i^t (X X^t)^{⊙c} B_j = 0 if i > c or j > c, and B_c^t (X X^t)^{⊙c} B_c is non-singular.

Finally, we note that argmin_{v ∈ V_{≤c}} ||y − v|| = Σ_{a≤c} B_a (B_a^t y).

4 Taking the σ → ∞ limit

We will begin with a few simple lemmas about the limiting solutions of linear systems. At the end of this section we will arrive at the limiting form of suitably modified RLS equations.

Lemma 1. Let i_1 < · · · < i_q be positive integers. 
Let A(s), y(s) be a block matrix and block vector given by

A(s) = ( A_00(s), s^{i_1} A_01(s), · · · , s^{i_q} A_0q(s) ; s^{i_1} A_10(s), s^{i_1} A_11(s), · · · , s^{i_q} A_1q(s) ; · · · ; s^{i_q} A_q0(s), s^{i_q} A_q1(s), · · · , s^{i_q} A_qq(s) ),
y(s) = ( b_0(s) ; s^{i_1} b_1(s) ; · · · ; s^{i_q} b_q(s) )

(block rows separated by semicolons), where A_ij(s) and b_i(s) are continuous matrix-valued and vector-valued functions of s, with A_ii(0) non-singular for all i. Then

lim_{s→0} A^{−1}(s) y(s) = ( A_00(0), 0, · · · , 0 ; A_10(0), A_11(0), · · · , 0 ; · · · ; A_q0(0), A_q1(0), · · · , A_qq(0) )^{−1} ( b_0(0) ; b_1(0) ; · · · ; b_q(0) ).

We are now ready to state and prove the main result of this section, characterizing the limiting large-σ solution of Gaussian RLS.

Theorem 1. Let q be an integer satisfying q < b, and let p = 2q + 1. Let λ = Cσ^{−p} for some constant C. Define A(c)_ij = (1/c!) B_i^t (X X^t)^{⊙c} B_j and b(c)_i = (1/c!) B_i^t (X x_0^t)^{⊙c}. Then

lim_{σ→∞} (K + nCσ^{−p} I)^{−1} k = v,    (6)

where v = ( B_0 · · · B_q ) w and

( A(0)_00, 0, · · · , 0 ; A(1)_10, A(1)_11, · · · , 0 ; · · · ; A(q)_q0, A(q)_q1, · · · , A(q)_qq ) w = ( b(0)_0 ; b(1)_1 ; · · · ; b(q)_q ).    (7)

Proof. We first manipulate the equation (K + nλI)v = k according to the factorizations in (3) and (5):

K = diag(e^{−||X||²/2σ²}) e^{XX^t/σ²} diag(e^{−||X||²/2σ²}) = N P N,
k = diag(e^{−||X||²/2σ²}) e^{X x_0^t/σ²} e^{−||x_0||²/2σ²} = N w α,

with N = diag(e^{−||X||²/2σ²}), P = e^{XX^t/σ²}, w = e^{X x_0^t/σ²}, α = e^{−||x_0||²/2σ²}, and β = nCσ^{−p}. Noting that lim_{σ→∞} α N^{−1} = lim_{σ→∞} e^{−||x_0||²/2σ²} diag(e^{||X||²/2σ²}) = I, we have

v ≡ lim_{σ→∞} (K + nCσ^{−p} I)^{−1} k
  = lim_{σ→∞} (N P N + β I)^{−1} N w α
  = lim_{σ→∞} α N^{−1} (P + β N^{−2})^{−1} w
  = lim_{σ→∞} ( e^{XX^t/σ²} + nCσ^{−p} diag(e^{||X||²/σ²}) )^{−1} e^{X x_0^t/σ²}.

Changing bases with Q,

Q^t v = lim_{σ→∞} ( Q^t e^{XX^t/σ²} Q + nCσ^{−p} Q^t diag(e^{||X||²/σ²}) Q )^{−1} Q^t e^{X x_0^t/σ²}.

Expanding via Taylor series and writing in block form (in the b × b block structure of Q),

Q^t e^{XX^t/σ²} Q = Q^t (X X^t)^{⊙0} Q + (1/1!σ²) Q^t (X X^t)^{⊙1} Q + (1/2!σ⁴) Q^t (X X^t)^{⊙2} Q + · · ·
  = ( A(0)_00, 0, · · · ; 0, 0, · · · ; · · · ) + (1/σ²) ( A(1)_00, A(1)_01, 0, · · · ; A(1)_10, A(1)_11, 0, · · · ; 0, 0, 0, · · · ; · · · ) + · · · ,

Q^t e^{X x_0^t/σ²} = Q^t (X x_0^t)^{⊙0} + (1/1!σ²) Q^t (X x_0^t)^{⊙1} + (1/2!σ⁴) Q^t (X x_0^t)^{⊙2} + · · ·
  = ( b(0)_0 ; 0 ; · · · ; 0 ) + (1/σ²) ( b(1)_0 ; b(1)_1 ; 0 ; · · · ; 0 ) + · · · ,

and

nCσ^{−p} Q^t diag(e^{||X||²/σ²}) Q = nCσ^{−p} I + · · · .

Since the A(c)_cc are non-singular, Lemma 3 applies, giving our result. □

5 The classification function

When performing RLS, the actual prediction of the limiting classifier is given via

f_∞(x_0) ≡ lim_{σ→∞} y^t (K + nCσ^{−p} I)^{−1} k.

Theorem 1 determines v = lim_{σ→∞} (K + nCσ^{−p} I)^{−1} k, showing that f_∞(x_0) is a polynomial in the training data X. In this section, we show that f_∞(x_0) is, in fact, a polynomial in the test point x_0. 
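Before continuing, we note that the limit in Theorem 1 can be observed numerically for small k even in double precision. The following sketch uses our own illustrative data and parameter choices (s is kept moderate because, as discussed in Section 6, larger k quickly requires extended precision); it compares the RLS prediction y^t (K + nλI)^{−1} k at σ² = s², λ = s^{−(2k+1)}, C = 1 with the best degree-k least-squares polynomial fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 1                             # k = 1: expect the best-fit line in the limit
s = 1000.0                               # large, but still within double precision for k = 1
sigma2, lam = s**2, s ** (-(2 * k + 1))  # sigma^2 = s^2, lambda = C s^{-(2k+1)}, C = 1

x = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * x)

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma2))

def f_rls(x0):
    kvec = np.exp(-(x - x0) ** 2 / (2 * sigma2))
    # associate to the right: y^t [(K + n*lam*I)^{-1} k]
    return y @ np.linalg.solve(K + n * lam * np.eye(n), kvec)

coef = np.polyfit(x, y, k)               # best degree-k least-squares polynomial
x_test = np.array([0.1, 0.5, 0.9])
rls_vals = np.array([f_rls(t) for t in x_test])
gap = np.abs(rls_vals - np.polyval(coef, x_test)).max()
```

The RLS predictions track the degree-1 fit rather than, say, the degree-0 fit (the mean of y). Pushing to larger k (or much larger s) runs into the double-precision limits discussed in Section 6, which is why the paper's own experiments use arbitrary-precision arithmetic.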
We continue to work with the orthonormal vectors B_i as well as the auxiliary quantities A(c)_ij and b(c)_i from Theorem 1. Theorem 1 shows that v ∈ V_{≤q}: the point affinity function is a polynomial of degree q in the training data, determined by (7). Because

Σ_{i,j≤c} c! B_i A(c)_ij B_j^t = (X X^t)^{⊙c},    hence    Σ_{j≤c} c! B_c A(c)_cj B_j^t = B_c B_c^t (X X^t)^{⊙c},
Σ_{i≤c} c! B_i b(c)_i = (X x_0^t)^{⊙c},    hence    c! B_c b(c)_c = B_c B_c^t (X x_0^t)^{⊙c},

we can restate Equation 7 in an equivalent form:

( 0!b(0)_0 ; 1!b(1)_1 ; · · · ; q!b(q)_q ) − ( 0!A(0)_00, 0, · · · , 0 ; 1!A(1)_10, 1!A(1)_11, · · · , 0 ; · · · ; q!A(q)_q0, q!A(q)_q1, · · · , q!A(q)_qq ) ( B_0^t ; · · · ; B_q^t ) v = 0.    (8)

Left-multiplying by ( B_0 · · · B_q ) gives

Σ_{c≤q} ( c! B_c b(c)_c − Σ_{j≤c} c! B_c A(c)_cj B_j^t v ) = 0,    (9)

i.e.,

Σ_{c≤q} B_c B_c^t ( (X x_0^t)^{⊙c} − (X X^t)^{⊙c} v ) = 0.    (10)

Up to this point, our results hold for arbitrary training data X. To proceed, we require a mild condition on our training set.

Definition 4. X is called generic if X^{I_1}, . . . , X^{I_n} are linearly independent for any distinct multi-indices {I_i}.

Lemma 2. For generic X, the solution to Equation 7 (or equivalently, Equation 10) is determined by the conditions v ∈ V_{≤q} and (X^I)^t v = x_0^I for all I with |I| ≤ q.

Theorem 2. For generic data, let v be the solution to Equation 10. For any y ∈ R^n, f_∞(x_0) = y^t v = h(x_0), where h(x) = Σ_{|I|≤q} a_I x^I is a multivariate polynomial of degree q minimizing ||y − h(X)||.

We see that as σ → ∞, the RLS solution tends to the minimum empirical error kth order polynomial.

6 Experimental Verification

In this section, we present a simple experiment that illustrates our results. We consider a fifth-degree polynomial function f. Figure 2 plots f, along with a 150-point dataset drawn by choosing x_i uniformly in [0, 1] and choosing y = f(x) + ε_i, where ε_i is a Gaussian random variable with mean 0 and standard deviation .05. Figure 2 also shows (in red) the best polynomial approximations to the data (not to the ideal f) of various orders. (We omit third order because it is nearly indistinguishable from second order.)

According to Theorem 1, if we parametrize our system by a variable s, and solve a Gaussian regularized least-squares problem with σ² = s² and λ = Cs^{−(2k+1)} for some integer k, then, as s → ∞, we expect the solution of the system to tend to the kth-order data-based polynomial approximation to f. Asymptotically, the value of the constant C does not matter, so we (arbitrarily) set it to 1. Figure 3 demonstrates this result.

We note that these experiments frequently require setting λ much smaller than machine-ε. As a consequence, we need more precision than IEEE double-precision floating-point, and our results cannot be obtained via many standard tools (e.g., MATLAB(TM)). We performed our experiments using CLISP, an implementation of Common Lisp that includes arithmetic operations on arbitrary-precision floating-point numbers.

7 Discussion

Our result provides insight into the asymptotic behavior of RLS, and (partially) explains Figure 1: in conjunction with additional experiments not reported here, we believe that

Fig. 2. 
f(x) = .5(1 − x) + 150x(x − .25)(x − .3)(x − .75)(x − .95), a random dataset drawn from f(x) with added Gaussian noise, and data-based polynomial approximations to f.

we are recovering second-order polynomial behavior, with the drop-off in performance at various λ's occurring at the transition to third-order behavior, which cannot be accurately recovered in IEEE double-precision floating-point. Although we used the specific details of RLS in deriving our solution, we expect that in practice, a similar result would hold for Support Vector Machines, and perhaps for Tikhonov regularization with convex loss more generally.

An interesting implication of our theorem is that for very large σ, we can obtain various order polynomial classifications by sweeping λ. In [6], we present an algorithm for solving for a wide range of λ for essentially the same cost as using a single λ. This algorithm is not currently practical for large σ, due to the need for extended-precision floating point.

Our work also has implications for approximations to the Gaussian kernel. Yang et al. use the Fast Gauss Transform (FGT) to speed up matrix-vector multiplications when performing RLS [8]. In [6], we studied this work; we found that while Yang et al. used moderate-to-small values of σ (and did not tune λ), the FGT sacrificed substantial accuracy compared to the best achievable results on their datasets. We showed empirically that the FGT becomes much more accurate at larger values of σ; however, at large σ, it seems likely we are merely recovering low-order polynomial behavior. We suggest that approximations to the Gaussian kernel must be checked carefully, to show that they produce sufficiently good results at moderate values of σ; this is a topic for future work.

References

1. Aronszajn. Theory of reproducing kernels. 
Transactions of the American Mathematical Society, 68:337–404, 1950.

2. Evgeniou, Pontil, and Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1–50, 2000.

Fig. 3. As s → ∞, with σ² = s² and λ = s^{−(2k+1)}, the solution to Gaussian RLS approaches the kth order polynomial solution. [Panels show the 0th, 1st, 4th, and 5th order polynomial solutions together with the RLS solutions for successive values of s from 1e+1 to 1e+6.]

3. Keerthi and Lin. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7):1667–1689, 2003.

4. Ross Lippert and Ryan Rifkin. Asymptotics of Gaussian regularized least-squares. Technical Report MIT-CSAIL-TR-2005-067, MIT Computer Science and Artificial Intelligence Laboratory, 2005.

5. Rifkin. Everything Old Is New Again: A Fresh Look at Historical Approaches to Machine Learning. PhD thesis, Massachusetts Institute of Technology, 2002.

6. Rifkin and Lippert. Practical regularized least-squares: λ-selection and fast leave-one-out computation. In preparation, 2005.

7. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial & Applied Mathematics, 1990.

8. Yang, Duraiswami, and Davis. Efficient kernel machines using the improved fast Gauss transform. In Advances in Neural Information Processing Systems, volume 16, 2004.
", "award": [], "sourceid": 2828, "authors": [{"given_name": "Ross", "family_name": "Lippert", "institution": null}, {"given_name": "Ryan", "family_name": "Rifkin", "institution": null}]}