{"title": "Optimal Kernel Shapes for Local Linear Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 540, "page_last": 546, "abstract": null, "full_text": "Optimal Kernel Shapes for Local Linear Regression

Dirk Ormoneit
Trevor Hastie
Department of Statistics
Stanford University
Stanford, CA 94305-4065
ormoneit@stat.stanford.edu

Abstract

Local linear regression performs very well in many low-dimensional forecasting problems. In high-dimensional spaces, its performance typically decays due to the well-known "curse of dimensionality". A possible way to approach this problem is by varying the "shape" of the weighting kernel. In this work we suggest a new, data-driven method for estimating the optimal kernel shape. Experiments using an artificially generated data set and data from the UC Irvine repository show the benefits of kernel shaping.

1 Introduction

Local linear regression has attracted considerable attention in both the statistical and the machine learning literature as a flexible tool for nonparametric regression analysis [Cle79, FG96, AMS97]. Like most statistical smoothing approaches, local modeling suffers from the so-called "curse of dimensionality", the well-known fact that the proportion of the training data that lie in a fixed-radius neighborhood of a point decreases to zero at an exponential rate with increasing dimension of the input space. Due to this problem, the bandwidth of a weighting kernel must be chosen very large so as to contain a reasonable sample fraction. As a result, the estimates produced are typically highly biased. One possible way to reduce the bias of local linear estimates is to vary the "shape" of the weighting kernel. In this work, we suggest a method for estimating the optimal kernel shape using the training data.
For this purpose, we parameterize the kernel in terms of a suitable "shape matrix" L, and minimize the mean squared forecasting error with respect to L. For such an approach to be meaningful, the "size" of the weighting kernel must be constrained during the minimization to avoid overfitting. We propose a new, entropy-based measure of the kernel size as a constraint. By analogy to the nearest neighbor approach to bandwidth selection [FG96], the suggested measure is adaptive with regard to the local data density. In addition, it leads to an efficient gradient descent algorithm for the computation of the optimal kernel shape. Experiments using an artificially generated data set and data from the UC Irvine repository show that kernel shaping can improve the performance of local linear estimates substantially.

The remainder of this work is organized as follows. In Section 2 we briefly review local linear models and introduce our notation. In Section 3 we formulate an objective function for kernel shaping, and in Section 4 we discuss entropic neighborhoods. Section 5 describes our experimental results and Section 6 presents conclusions.

2 Local Linear Models

Consider a nonlinear regression problem where a continuous response y \in \mathbb{R} is to be predicted based on a d-dimensional predictor x \in \mathbb{R}^d. Let D \equiv \{(x_t, y_t), t = 1, \ldots, T\} denote a set of training data. To estimate the conditional expectation f(x_0) \equiv E[y \mid x_0], we consider the local linear expansion f(x) \approx \alpha_0 + (x - x_0)'\beta_0 in the neighborhood of x_0. In detail, we minimize the weighted least squares criterion

C(\alpha, \beta; x_0) \equiv \sum_{t=1}^{T} \left( y_t - \alpha - (x_t - x_0)'\beta \right)^2 k(x_t, x_0)    (1)

to determine estimates of the parameters \alpha_0 and \beta_0. Here k(x_t, x_0) is a non-negative weighting kernel that assigns more weight to residuals in the neighborhood of x_0 than to residuals distant from x_0.
In multivariate problems, a standard way of defining k(x_t, x_0) is by applying a univariate, non-negative "mother kernel" \phi(z) to the distance measure \|x_t - x_0\|_\Omega \equiv \sqrt{(x_t - x_0)' \Omega (x_t - x_0)}:

k(x_t, x_0) \equiv \frac{\phi(\|x_t - x_0\|_\Omega)}{\sum_{s=1}^{T} \phi(\|x_s - x_0\|_\Omega)}.    (2)

Here \Omega is a positive definite d \times d matrix determining the relative importance assigned to different directions of the input space. For example, if \phi(z) is a standard normal density, k(x_t, x_0) is a normalized multivariate Gaussian with mean x_0 and covariance matrix \Omega^{-1}. Note that k(x_t, x_0) is normalized so as to satisfy \sum_{t=1}^{T} k(x_t, x_0) = 1. Even though this restriction is not directly relevant to the estimation of \alpha_0 and \beta_0, it will be needed in our discussion of entropic neighborhoods in Section 4.

Using the shorthand notation \hat{\gamma}(x_0, \Omega) \equiv (\hat{\alpha}_0, \hat{\beta}_0')', the solution of the minimization problem (1) may be written conveniently as

\hat{\gamma}(x_0, \Omega) = (X'WX)^{-1} X'WY,    (3)

where X is the T \times (d+1) design matrix with rows (1, (x_t - x_0)'), Y is the vector of response values, and W is a T \times T diagonal matrix with entries W_{t,t} = k(x_t, x_0). The resulting local linear fit at x_0 using the inverse covariance matrix \Omega is simply \hat{f}(x_0; \Omega) \equiv \hat{\alpha}_0. Obviously, \hat{f}(x_0; \Omega) depends on \Omega through the definition of the weighting kernel (2). In the discussion below, our focus is on choices of \Omega that lead to favorable estimates of the unknown function value f(x_0).

3 Kernel Shaping

The local linear estimates resulting from different choices of \Omega vary considerably in practice. A common strategy is to choose \Omega proportional to the inverse sample covariance matrix. The remaining problem of finding the optimal scaling factor is equivalent to the problem of bandwidth selection in univariate smoothing [FG96, BBB99].
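The weighted least squares fit of eqs. (1)-(3) is straightforward to compute. The following sketch is our own illustration, not the authors' code; it assumes a standard-normal mother kernel and returns the local linear estimate \hat{f}(x_0; \Omega) = \hat{\alpha}_0:

```python
import numpy as np

def local_linear_fit(x0, X, Y, Omega):
    """Local linear estimate f_hat(x0; Omega) = alpha_hat_0 of eqs. (1)-(3).

    X is T x d, Y has length T, Omega is a positive definite d x d matrix.
    A standard-normal mother kernel is assumed; naming is illustrative.
    """
    diffs = X - x0                                   # rows x_t - x0
    # squared distances (x_t - x0)' Omega (x_t - x0), cf. eq. (2)
    d2 = np.einsum('ti,ij,tj->t', diffs, Omega, diffs)
    w = np.exp(-0.5 * d2)
    w = w / w.sum()                                  # weights sum to one
    Xd = np.hstack([np.ones((len(X), 1)), diffs])    # rows (1, (x_t - x0)')
    W = np.diag(w)
    # gamma_hat = (X'WX)^{-1} X'WY, eq. (3)
    gamma = np.linalg.solve(Xd.T @ W @ Xd, Xd.T @ W @ Y)
    return gamma[0]                                  # alpha_hat_0
```

Because the local linear family contains all affine functions, the estimate reproduces a linear target exactly, which makes a convenient sanity check for any implementation.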
For example, the bandwidth is frequently chosen as a function of the distance between x_0 and its kth nearest neighbor in practical applications [FG96]. In this paper, we take a different viewpoint and argue that optimizing the "shape" of the weighting kernel is at least as important as optimizing the bandwidth. More specifically, for a fixed "volume" of the weighting kernel, the bias of the estimate can be reduced drastically by shrinking the kernel in directions of large nonlinear variation of f(x), and stretching it in directions of small nonlinear variation. This idea is illustrated using the example shown in Figure 1. The plotted function is sigmoidal along an index vector \kappa, and constant in directions orthogonal to \kappa. Therefore, a "shaped" weighting kernel is shrunk in the direction \kappa and stretched orthogonally to \kappa, minimizing the exposure of the kernel to the nonlinear variation.

Figure 1: Left: Example of a single index model of the form y = g(x'\kappa) with \kappa = (1, 1)' and g(z) = \tanh(3z). Right: The contours of g(z) are straight lines orthogonal to \kappa.

To distinguish formally the metric and the bandwidth of the weighting kernel, we rewrite \Omega as follows:

\Omega \equiv \lambda \cdot (LL' + I).    (4)

Here \lambda corresponds to the inverse bandwidth, and L may be interpreted as a metric- or shape-matrix. Below we suggest an algorithm which is designed to minimize the bias with respect to the kernel metric. Clearly, for such an approach to be meaningful, we need to restrict the "volume" of the weighting kernel; otherwise, the bias of the estimate could be minimized trivially by choosing a zero bandwidth. For example, we might define \lambda contingent on L so as to satisfy |\Omega| = c for some constant c. A serious disadvantage of this idea is that, by contrast to the nearest neighbor approach, |\Omega| is independent of the design.
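The decomposition (4) can be coded directly. This is a minimal sketch with our own naming; in practice \lambda would be set by the entropy constraint of Section 4 rather than passed in by hand:

```python
import numpy as np

def make_omega(L, lam):
    """Kernel metric of eq. (4): Omega = lam * (L L' + I).

    lam > 0 acts as an inverse bandwidth; L is d x l (possibly with
    rank l < d) and plays the role of the shape matrix. Adding the
    identity keeps Omega positive definite even when L is rank-deficient.
    """
    d = L.shape[0]
    return lam * (L @ L.T + np.eye(d))
```

Adding the identity is what makes the reduced-rank parameterization of Section 5 safe: even with l < d, the resulting metric never collapses in any direction.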
As a more appropriate alternative, we \ndefine A in terms of a measure of the number of neighboring observations. In detail, \nwe fix the volume of k(xt, xo) in terms of the \"entropy\" of the weighting kernel. \nThen, we choose A so as to satisfy the resulting entropy constraint. Given this \ndefinition of the bandwidth, we determine the metric of k (Xt, xo) by minimizing the \nmean squared prediction error: \n\nT \n\nC(L; D) == I)Yt - f(Xt; 0\u00bb2 \n\nt=l \n\n(5) \n\nwith respect to L. In this way, we obtain an approximation of the optimal kernel \nshape because the expectation of C(L; D) differs from the bias only by a variance \nterm which is independent of L. Details of the entropic neighborhood criterion and \nof the numerical minimization procedure are described next. \n\n4 Entropic Neighborhoods \n\nWe mentioned previously that, for a given shape matrix L, we choose the bandwidth \nparameter A in (4) so as to fulfill a volume constraint on the weighting kernel. For \nthis purpose, we interpret the kernel weights k(xt, xo) as probabilities. In particular, \n\n\fOptimal Kernel Shapes for Local Linear Regression \n\n543 \n\nas k(Xt, xo) > 0 and Et k(xt, xo) = 1 by definition (2), we can formulate the local \nentropy of k(xt, xo): \n\nH(O) == - I: k(xt, xo) log k(xt, xo). \n\nT \n\nt=l \n\n(6) \n\nThe entropy of a probability distribution is typically thought of as a measure of \nuncertainty. In the context of the weighting kernel k(xt, xo), H(O) can be used \nas a smooth measure of the \"size\" of the neighborhood that is used for averaging. \nTo see this, note that in the extreme case where equal weights are placed on all \nobservations in D, the entropy is maximized. At the other extreme, if the single \nnearest neighbor of Xo is assigned the entire weight of one, the entropy attains its \nminimum value zero. Thus, fixing the entropy at a constant value c is similar to \nfixing the number k in the nearest neighbor approach. 
Besides justifying (6), the \ncorrespondence between k and c can also be used to derive a more intuitive volume \nparameter than the entropy level c. We specify c in terms of a hypothetical weighting \nkernel that places equal weight on the k nearest neighbors of Xo and zero weight \non the remaining observations. Note that the entropy of this hypothetical kernel \nis log k. Thus, it is natural to characterize the size of an entropic neighborhood in \nterms of k, and then to determine A by numerically solving the nonlinear equation \nsystem (for details, see [OH99]) \n\nH(O) = logk. \n\n(7) \n\nMore precisely, we report the number of neighbors in terms of the equivalent sample \nfraction p == kiT to further intuition. This idea is illustrated in Figure 2 using a \none- and a two-dimensional example. The equivalent sample fractions are p = 30% \nand p = 50%, respectively. Note that in both cases the weighting kernel is wider \nin regions with few observations, and narrower in regions with many observations. \nAs a consequence, the number of observations within contours of equal weighting \nremains approximately constant across the input space. \n\n\" \n\n0.2 . \n\n': .\n\n. \n\n. ,. 0\u00b7:\u00b7\u00b7:\u00b7\u00b7.\"\u00b7\u00b7\u00b7----------\n. . . .,', . ~', .~,' .,'. ': . \n\nFigure 2: Left: Univariate weighting kernel k(-, xo) evaluated at Xo = 0.3 and Xo = 0.7 \nbased on a sample data set of 100 observations (indicated by the bars at the bottom) . Right: \nMultivariate weighting kernel k(\u00b7, xo) based on a sample data set of 200 observations. The \ntwo ellipsoids correspond to 95% contours of a weighting kernel evaluated at (0.3,0.3)' and \n(0.6,0.6)' . \n\n03 \n\n04 \n\n0.1 \n\nO. \n\not \n\n1 \n\nTo summarize, we define the value of A by fixing the equivalent sample fraction \nparameter p, and subsequently minimize the prediction error on the training set \nwith respect to the shape matrix L. 
Note that we allow for the possibility that \nL may be of reduced rank I :::; d as a means of controlling the number of free \nparameters. As a minimization procedure, we use a variant of gradient descent that \n\n\f544 \n\nD. Ormoneit and T. Hastie \n\naccounts for the entropy constraint. In particular, our algorithm relies on the fact \nthat (7) is differentiable with respect to L. Due to space limitations, the interested \nreader is referred to [OH99] for a formal derivation of the involved gradients and \nfor a detailed description of the optimization procedure. \n\n5 Experiments \n\nIn this section we compare kernel shaping to standard local linear regression using \na fixed spherical kernel in two examples. First, we evaluate the performance using \na simple toy problem which allows us to estimate confidence intervals for the pre(cid:173)\ndiction accuracy using Monte Carlo simulation. Second, we investigate a data set \nfrom the machine learning data base at UC Irvine [BKM98]. \n\n5.1 Mexican Hat Function \n\nIn our first example, we employ Monte Carlo simulation to evaluate the performance \nof kernel shaping in a five-dimensional regression problem. For this purpose, 20 sets \nof 500 data points each are generated independently according to the model \n\ny = coS(SJxI + x~) . exp( -(xi + x~)). \n\n(8) \n\nHere the predictor variables Xl, ... ,X5 are drawn according to a five-dimensional \nstandard normal distribution. Note that, even though the regression is carried out \nin a five-dimensional predictor space, y is really only a function of the variables \nXl and X2 . In particular, as dimensions two through five do not contribute any \ninformation with regard to the value of y, kernel shaping should effectively discard \nthese variables. Note also that there is no noise in this example. \n\nFigure 3: \nLeft: \"True\" Mexican hat function. Middle: Local linear estimate using a \nspherical kernel (p = 2%). 
Right: Local linear estimate using kernel shaping (p = 2%) . \nBoth estimates are based on a training set consisting of 500 data points. \n\nFigure 3 shows a plot of the true function, the spherical estimate, and the estimate \nusing kernel shaping as functions of Xl and X2. The true function has the familiar \n\"Mexican hat\" shape, which is recovered by the estimates to different degrees. We \nevaluate the local linear estimates for values of the equivalent neighborhood fraction \nparameter p in the range from 1% to 15%. Note that, to warrant a fair comparison, \nwe used the entropic neighborhood also to determine the bandwith of the spherical \nestimate. For each value of p, 20 models are estimated using the 20 artificially \ngenerated training sets, and subsequently their performance is evaluated on the \ntraining set and on the test set of 31 x 31 grid points shown in Figure 3. The shape \nmatrix L has maximal rank 1 = 5 in this experiment. Our results for local linear \nregression using the spherical kernel and kernel shaping are summarized in Table \n1. Performance is measured in terms of the mean R 2-value of the 20 models, and \nstandard deviations are reported in parenthesis. \n\n\fOptimal Kernel Shapes for Local Linear Regression \n\n545 \n\nAlgorithm \nspherical kernel \nspherical kernel \nspherical kernel \nspherical kernel \nspherical kernel \nkernel shaping \nkernel shaping \nkernel shaping \nkernel shaping \n\np=l% \np=2% \np=5% \np = 10% \np= 20% \np= 1% \np=2% \np=5% \np= 15% \n\nTraining R2 \n0.961 (0.005) \n0.871 (0.014) \n0.680 (0.029) \n0.507 (0.038) \n0.341 (0.039) \n0.995 (0.001) \n0.984 (0.002) \n0.923 (0.009) \n0.628 (0.035) \n\n0.215 (0.126) \n0.293 (0.082) \n0.265 (0.043) \n0.213 (0.030) \n0.164 (0.021) \n0.882 (0.024) \n0.909 (0.017) \n0.836 (0.023) \n0.517 (0 .035) \n\nTable 1: Performances in the toy problem. The results for kernel shaping were obtained \nusing 200 gradient descent steps with step size a = 0.2. 
\n\nThe results in Table 1 indicate that the optimal performance on the test set is \nobtained using the parameter values p = 2% both for kernel shaping (R2 = 0.909) \nand for the spherical kernel (R2 = 0.293). Given the large difference between the \nR2 values, we conclude that kernel shaping clearly outperforms the spherical kernel \non this data set. \n\n----\n\nFigure 4: The eigenvectors of the estimate of n obtained on the first of 20 training sets. \nThe graphs are ordered from left to right by increasing eigenvalues (decreasing extension \nof the kernel in that direction): 0.76,0.76, 0.76, 33.24, 34.88. \n\nFinally, Figure 4 shows the eigenvectors of the optimized n on the first of the 20 \ntraining sets. The eigenvectors are arranged according to the size of the correspond(cid:173)\ning eigenvalues. Note that the two rightmost eigenvectors, which correspond to the \ndirections of minimum kernel extension, span exactly the Xl -x2-space where the \ntrue function lives. The kernel is stretched in the remaining directions, effectively \ndiscarding nonlinear contributions from X3, X4, and X5' \n\n5.2 Abalone Database \n\nThe task in our second example is to predict the age of abalone based on several \nmeasurements. More specifically, the response variable is obtained by counting \nthe number of rings in the shell in a time-consuming procedure. Preferably, the \nage of the abalone could be predicted from alternative measurements that may be \nobtained more easily. In the data set, eight candidate measurements including sex, \ndimensions, and various weights are reported along with the number of rings of the \nabalone as predictor variables. We normalize these variables to zero mean and unit \nvariance prior to estimation. Overall, the data set consists of 4177 observations. To \nprevent possible artifacts resulting from the order of the data records, we randomly \ndraw 2784 observations as a training set and use the remaining 1393 observations \nas a test set. 
Our results are summarized in Table 2 using various settings for the rank l, the equivalent fraction parameter p, and the gradient descent step size a. The optimal choice for p is 20% both for kernel shaping (R^2 = 0.582) and for the spherical kernel (R^2 = 0.572). Note that the performance improvement due to kernel shaping is negligible in this experiment.

Kernel             Settings                      Training R^2    Test R^2
spherical kernel   p = 0.05                      0.752           0.543
spherical kernel   p = 0.10                      0.686           0.564
spherical kernel   p = 0.20                      0.639           0.572
spherical kernel   p = 0.50                      0.595           0.565
spherical kernel   p = 0.70                      0.581           0.552
spherical kernel   p = 0.90                      0.568           0.533
kernel shaping     l = 5, p = 0.20, a = 0.5      0.705           0.575
kernel shaping     l = 5, p = 0.20, a = 0.2      0.698           0.577
kernel shaping     l = 2, p = 0.10, a = 0.2      0.729           0.574
kernel shaping     l = 2, p = 0.20, a = 0.2      0.663           0.582
kernel shaping     l = 2, p = 0.50, a = 0.2      0.603           0.571
kernel shaping     l = 2, p = 0.20, a = 0.5      0.669           0.582

Table 2: Results using the Abalone database after 200 gradient descent steps.

6 Conclusions

We introduced a data-driven method to improve the performance of local linear estimates in high dimensions by optimizing the shape of the weighting kernel. In our experiments we found that kernel shaping clearly outperformed local linear regression using a spherical kernel in a five-dimensional toy example, and led to a small performance improvement in a second, real-world example. To explain the results of the second experiment, we note that kernel shaping aims at exploiting global structure in the data. Thus, the absence of a larger performance improvement may simply suggest that no corresponding structure prevails in that data set.
That is, even though optimal kernel shapes exist locally, they may vary across the predictor space so that they cannot be approximated by any particular global shape. Preliminary experiments using a localized variant of kernel shaping did not lead to significant performance improvements.

Acknowledgments

The work of Dirk Ormoneit was supported by a grant of the Deutsche Forschungsgemeinschaft (DFG) as part of its post-doctoral program. Trevor Hastie was partially supported by NSF grant DMS-9803645 and NIH grant R01-CA-72028-01. Carrie Grimes pointed us to misleading formulations in earlier drafts of this work.

References

[AMS97] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11:11-73, 1997.

[BBB99] M. Birattari, G. Bontempi, and H. Bersini. Lazy learning meets the recursive least squares algorithm. In M. J. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11. The MIT Press, 1999.

[BKM98] C. Blake, E. Keogh, and C. J. Merz. UCI Repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.

[Cle79] W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74:829-836, 1979.

[FG96] J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications. Chapman & Hall, 1996.

[OH99] D. Ormoneit and T. Hastie. Optimal kernel shapes for local linear regression. Technical report 1999-11, Department of Statistics, Stanford University, 1999.
", "award": [], "sourceid": 1755, "authors": [{"given_name": "Dirk", "family_name": "Ormoneit", "institution": null}, {"given_name": "Trevor", "family_name": "Hastie", "institution": null}]}