{"title": "Gradient Weights help Nonparametric Regressors", "book": "Advances in Neural Information Processing Systems", "page_first": 2861, "page_last": 2869, "abstract": "In regression problems over $\\real^d$, the unknown function $f$ often varies more in some coordinates than in others. We show that weighting each coordinate $i$ with the estimated norm of the $i$th derivative of $f$ is an efficient way to significantly improve the performance of distance-based regressors, e.g. kernel and $k$-NN regressors. We propose a simple estimator of these derivative norms and prove its consistency. Moreover, the proposed estimator is efficiently learned online.", "full_text": "Gradient Weights help Nonparametric Regressors\n\nSamory Kpotufe*\nMax Planck Institute for Intelligent Systems\nsamory@tuebingen.mpg.de\n\nAbdeslam Boularias\nMax Planck Institute for Intelligent Systems\nboularias@tuebingen.mpg.de\n\nAbstract\n\nIn regression problems over ℝ^d, the unknown function f often varies more in some coordinates than in others. We show that weighting each coordinate i with the estimated norm of the ith derivative of f is an efficient way to significantly improve the performance of distance-based regressors, e.g. kernel and k-NN regressors. We propose a simple estimator of these derivative norms and prove its consistency. Moreover, the proposed estimator is efficiently learned online.\n\n1 Introduction\n\nIn regression problems over ℝ^d, the unknown function f might vary more in some coordinates than in others, even though all coordinates might be relevant. How much f varies with coordinate i can be captured by the norm ‖f′_i‖_{1,μ} = E_X |f′_i(X)| of the ith derivative f′_i = e_i^⊤∇f of f. A simple way to take advantage of the information in ‖f′_i‖_{1,μ} is to weight each coordinate proportionally to an estimate of ‖f′_i‖_{1,μ}.
The intuition, detailed in Section 2, is that the resulting data space behaves as a low-dimensional projection to coordinates with large norm ‖f′_i‖_{1,μ}, while maintaining information about all coordinates. We show that such weighting can be learned efficiently, both in batch-mode and online, and can significantly improve the performance of distance-based regressors in real-world applications. In this paper we focus on the distance-based methods of kernel and k-NN regression.\n\nFor distance-based methods, the weights can be incorporated into a distance function of the form ρ(x, x′) = √((x − x′)^⊤ W (x − x′)), where each element W_i of the diagonal matrix W is an estimate of ‖f′_i‖_{1,μ}. This is not metric learning [1, 2, 3, 4], where the best ρ is found by optimizing over a sufficiently large space of possible metrics. Clearly metric learning can only yield better performance, but the optimization over a larger space will result in heavier preprocessing time, often O(n²) on datasets of size n. Yet preprocessing time is especially important in many modern applications where both training and prediction are done online (e.g. robotics, finance, advertisement, recommendation systems). Here we do not optimize over a space of metrics, but rather estimate a single metric ρ based on the norms ‖f′_i‖_{1,μ}. Our metric ρ is efficiently obtained, can be estimated online, and still significantly improves the performance of distance-based regressors.\n\nTo estimate ‖f′_i‖_{1,μ}, one does not need to estimate f′_i well everywhere, just well on average. While many elaborate derivative estimators exist (see e.g. [5]), we have to keep in mind our need for a fast but consistent estimator of ‖f′_i‖_{1,μ}. We propose a simple estimator W_i which averages the differences along i of an estimator f_{n,h} of f. More precisely (see Section 3), W_i has the form E_n |f_{n,h}(X + t·e_i) − f_{n,h}(X − t·e_i)| / 2t, where E_n denotes the empirical expectation over a sample {X_i}_{i=1}^n. W_i can therefore be updated online at the cost of just two estimates of f_{n,h}. In this paper f_{n,h} is a kernel estimator, although any regression method might be used in estimating ‖f′_i‖_{1,μ}. We prove in Section 4 that, under mild conditions, W_i is a consistent estimator of the unknown norm ‖f′_i‖_{1,μ}. Moreover, we prove finite-sample convergence bounds to help guide the practical tuning of the two parameters t and h.\n\n(* Currently at Toyota Technological Institute Chicago, and affiliated with the Max Planck Institute.)\n\nFigure 1: Typical gradient weights {W_i ≈ ‖f′_i‖_{1,μ}}_{i∈[d]} for some real-world datasets. (a) SARCOS robot, joint 7. (b) Parkinson's. (c) Telecom.\n\nMost related work\n\nAs we mentioned above, metric learning is closest in spirit to the gradient-weighting approach presented here, but our approach is different from metric learning in that we do not search a space of possible metrics, but rather estimate a single metric based on gradients. This is far more time-efficient and can be implemented in online applications which require fast preprocessing.\n\nThere exist many metric learning approaches, mostly for classification and few for regression (e.g. [1, 2]). The approaches of [1, 2] for regression are meant for batch learning. Moreover, [1] is limited to Gaussian-kernel regression, and [2] is tuned to the particular problem of age estimation.
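The batch form of the estimator just described is easy to sketch. The following is a minimal illustration, not the authors' code: a brute-force box-kernel regressor stands in for f_{n,h}, and all function and variable names are ours.

```python
import numpy as np

def kernel_estimate(X_train, Y_train, x, h):
    """Box-kernel estimate f_{n,h}(x): mean of Y over the ball B(x, h),
    falling back to the global mean E_n Y when the ball is empty."""
    near = np.linalg.norm(X_train - x, axis=1) < h
    return Y_train[near].mean() if near.any() else Y_train.mean()

def gradient_weights(X_train, Y_train, h, t):
    """W_i = E_n |f_{n,h}(X + t e_i) - f_{n,h}(X - t e_i)| / (2 t):
    the empirical average of finite differences of the kernel estimate
    along coordinate i, taken over the sample itself."""
    n, d = X_train.shape
    W = np.zeros(d)
    for i in range(d):
        step = np.zeros(d)
        step[i] = t
        diffs = [abs(kernel_estimate(X_train, Y_train, x + step, h)
                     - kernel_estimate(X_train, Y_train, x - step, h))
                 for x in X_train]
        W[i] = np.mean(diffs) / (2 * t)
    return W

def weighted_distance(W):
    """rho(x, x') = sqrt((x - x')^T diag(W) (x - x'))."""
    return lambda x, xp: float(np.sqrt(np.sum(W * (x - xp) ** 2)))
```

An online update is the same computation for a single new sample: each W_i is a running average, so incorporating one point costs two evaluations of f_{n,h} per coordinate, as claimed above.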
For the problem of classification, the metric-learning approaches of [3, 4] are meant for online applications, but cannot be used in regression.\n\nIn the case of kernel regression and local polynomial regression, multiple bandwidths can be used, one for each coordinate [6]. However, tuning d bandwidth parameters requires searching a d×d grid, which is impractical even in batch mode. The method of [6] alleviates this problem, however only in the particular case of local linear regression. Our method applies to any distance-based regressor.\n\nFinally, the ideas presented here are related to recent notions of nonparametric sparsity where it is assumed that the target function is well approximated by a sparse function, i.e. one which varies little in most coordinates (e.g. [6], [7]). Here we do not need sparsity; instead we only need the target function to vary in some coordinates more than in others. Our approach therefore works even in cases where the target function is far from sparse.\n\n2 Technical motivation\n\nIn this section, we motivate the approach by considering the ideal situation where W_i = ‖f′_i‖_{1,μ}. Let's consider regression on (𝒳, ρ), where the input space 𝒳 ⊂ ℝ^d is connected. The prediction performance of a distance-based estimator (e.g. kernel or k-NN) is well known to be the sum of its variance and its bias [7]. Regression on (𝒳, ρ) decreases variance while keeping the bias controlled.\n\nRegression variance decreases on (𝒳, ρ): The variance of a distance-based estimate f_n(x) is inversely proportional to the number of samples (and hence the mass) in a neighborhood of x (see e.g. [8]). Let's therefore compare the masses of ρ-balls and Euclidean balls. Suppose some weights largely dominate others; for instance in ℝ², let ‖f′_2‖_{1,μ} ≫ ‖f′_1‖_{1,μ}. A ball B_ρ in (𝒳, ρ) then takes an ellipsoidal shape, elongated along e_1, which we may contrast with a Euclidean ball of similar radius inside it.\n\nRelative to a Euclidean ball, a ball B_ρ of similar¹ radius has more mass in the direction e_1, in which f varies least. This intuition is made more precise in Lemma 1 below, which is proved in the appendix. Essentially, let R ⊂ [d] be the set of coordinates with larger weights W_i; then the mass of balls B_ρ behaves like the mass of balls in ℝ^{|R|}. Thus, effectively, regression in (𝒳, ρ) has variance nearly as small as that for regression in the lower-dimensional space ℝ^{|R|}.\n\nNote that the assumptions on the marginal μ in the lemma statement are verified for instance when μ has a continuous lower-bounded density on 𝒳. For simplicity we let (𝒳, ‖·‖) have diameter 1.\n\nLemma 1 (Mass of ρ-balls). Consider any R ⊂ [d] such that max_{i∉R} W_i < min_{i∈R} W_i. Suppose 𝒳 ≡ (1/√d)[0,1]^d, and the marginal μ satisfies on (𝒳, ‖·‖), for some C₁, C₂: ∀x ∈ 𝒳, ∀r > 0, C₁ r^d ≤ μ(B(x, r)) ≤ C₂ r^d. Let κ ≜ √(max_{i∈R} W_i / min_{i∈R} W_i), ε_∉R ≜ max_{i∉R} W_i · √d, and let ρ(𝒳) ≜ sup_{x,x′∈𝒳} ρ(x, x′). Then for any ε·ρ(𝒳) > 2ε_∉R, μ(B_ρ(x, ε·ρ(𝒳))) ≥ C(2κ)^{−|R|} ε^{|R|}, where C is independent of ε.\n\nIdeally we would want |R| ≪ d and ε_∉R ≈ 0, which corresponds to a sparse metric.\n\nRegression bias remains bounded on (𝒳, ρ): The bias of distance-based regressors is controlled by the smoothness of the unknown function f on (𝒳, ρ), i.e. how much f might differ for two close points. Turning back to our earlier example in ℝ², some points x′ that were originally far from x along e_1 might now be included in the estimate f_n(x) on (𝒳, ρ).
Intuitively, this should not add bias to the estimate, since f does not vary much in e_1. We have the following lemma.\n\nLemma 2 (Change in Lipschitz smoothness for f). Suppose each derivative f′_i is bounded on 𝒳 by |f′_i|_sup. Assume W_i > 0 whenever |f′_i|_sup > 0. Denote by R the largest subset of [d] such that |f′_i|_sup > 0 for i ∈ R. We have for all x, x′ ∈ 𝒳:\n\n|f(x) − f(x′)| ≤ ( Σ_{i∈R} |f′_i|_sup / √(W_i) ) ρ(x, x′).\n\nApplying the above lemma with W_i = 1, we see that in the original Euclidean space, the variation in f relative to the distance between points x, x′ is of the order Σ_{i∈R} |f′_i|_sup. This variation in f is now increased in (𝒳, ρ) by a factor of 1 / inf_{i∈R} √(‖f′_i‖_{1,μ}) in the worst case. In this sense, the space (𝒳, ρ) maintains information about all relevant coordinates. In contrast, information is lost under a projection of the data in the likely scenario that all or most coordinates are relevant.\n\nFinally, note that if all weights were close, the space (𝒳, ρ) would be essentially equivalent to the original (𝒳, ‖·‖), and we would likely neither gain nor lose in performance, as confirmed by experiments. However, we observed that in practice, even when all coordinates are relevant, the gradient weights vary sufficiently (Figure 1) to observe significant performance gains for distance-based regressors.\n\n3 Estimating ‖f′_i‖_{1,μ}\n\nIn all that follows we are given n i.i.d. samples (X, Y) = {(X_i, Y_i)}_{i=1}^n from some unknown distribution with marginal μ. The marginal μ has support 𝒳 ⊂ ℝ^d, while the output Y ∈ ℝ.\n\nThe kernel estimate at x is defined using any kernel K(u), positive on [0, 1/2], and 0 for u > 1.
If B(x, h) ∩ X = ∅, set f_{n,h}(x) = E_n Y; otherwise\n\nf_{n,ρ̄,h}(x) = Σ_{i=1}^n [ K(ρ̄(x, X_i)/h) / Σ_{j=1}^n K(ρ̄(x, X_j)/h) ] · Y_i = Σ_{i=1}^n w_i(x) Y_i,   (1)\n\nfor some metric ρ̄ and a bandwidth parameter h.\n\nFor the kernel regressor f_{n,h} used to learn the metric ρ below, ρ̄ is the Euclidean metric. In the analysis we assume the bandwidth for f_{n,h} is set as h ≥ (log²(n/δ)/n)^{1/d}, given a confidence parameter 0 < δ < 1. In practice we would learn h by cross-validation, but for the analysis we only need to know the existence of a good setting of h.\n\n(Footnote 1: Accounting for the scale change induced by ρ on the space 𝒳.)\n\nThe metric is defined as\n\nW_i ≜ E_n [ Δ_{t,i} f_{n,h}(X) · 1{A_{n,i}(X)} ] = E_n [ (|f_{n,h}(X + t·e_i) − f_{n,h}(X − t·e_i)| / 2t) · 1{A_{n,i}(X)} ],   (2)\n\nwhere A_{n,i}(X) is the event that enough samples contribute to the estimate Δ_{t,i} f_{n,h}(X). For the consistency result, we assume the following setting:\n\nA_{n,i}(X) ≡ { min_{s∈{−t,t}} μ_n(B(X + s·e_i, h/2)) ≥ α_n }, where α_n ≜ (2d ln 2n + ln(4/δ)) / n.\n\n4 Consistency of the estimator W_i of ‖f′_i‖_{1,μ}\n\n4.1 Theoretical setup\n\n4.1.1 Marginal μ\n\nWithout loss of generality we assume 𝒳 has bounded diameter 1. The marginal is assumed to have a continuous density on 𝒳 and mass everywhere on 𝒳: ∀x ∈ 𝒳, ∀h > 0, μ(B(x, h)) ≥ C_μ h^d. This is for instance the case if μ has a lower-bounded density on 𝒳. Under this assumption, for samples X in dense regions, X ± t·e_i is also likely to be in a dense region.\n\n4.1.2 Regression function and noise\n\nThe output Y ∈ ℝ is given as Y = f(X) + η(X), where E η(X) = 0.
We assume the following general noise model: ∀δ > 0 there exists c > 0 such that sup_{x∈𝒳} P_{Y|X=x}(|η(x)| > c) ≤ δ. We denote by C_Y(δ) the infimum over all such c. For instance, suppose η(X) has an exponentially decreasing tail; then ∀δ > 0, C_Y(δ) ≤ O(ln 1/δ). A last assumption on the noise is that the variance of (Y | X = x) is upper-bounded by a constant σ_Y² uniformly over all x ∈ 𝒳.\n\nDefine the τ-envelope of 𝒳 as 𝒳 + B(0, τ) ≜ {z ∈ B(x, τ) : x ∈ 𝒳}. We assume there exists τ such that f is continuously differentiable on the τ-envelope 𝒳 + B(0, τ). Furthermore, each derivative f′_i(x) = e_i^⊤∇f(x) is upper bounded on 𝒳 + B(0, τ) by |f′_i|_sup and is uniformly continuous on 𝒳 + B(0, τ) (this is automatically the case if the support 𝒳 is compact).\n\n4.1.3 Parameters varying with t\n\nOur consistency results are expressed in terms of the following distributional quantities. For i ∈ [d], define the (t, i)-boundary of 𝒳 as ∂_{t,i}(𝒳) ≜ {x : {x + t·e_i, x − t·e_i} ⊄ 𝒳}. The smaller the mass μ(∂_{t,i}(𝒳)) at the boundary, the better we approximate ‖f′_i‖_{1,μ}.\n\nThe second type of quantity is ε_{t,i} ≜ sup_{x∈𝒳, s∈[−t,t]} |f′_i(x) − f′_i(x + s·e_i)|.\n\nSince μ has continuous density on 𝒳 and ∇f is uniformly continuous on 𝒳 + B(0, τ), we automatically have μ(∂_{t,i}(𝒳)) → 0 and ε_{t,i} → 0 as t → 0.\n\n4.2 Main theorem\n\nOur main theorem bounds the error in estimating each norm ‖f′_i‖_{1,μ} with W_i. The main technical hurdles are in handling the various sample inter-dependencies introduced by both the estimates f_{n,h}(X) and the events A_{n,i}(X), and in analyzing the estimates at the boundary of 𝒳.\n\nTheorem 1. Let t + h ≤ τ, and let 0 < δ < 1.
There exist C = C(μ, K(·)) and N = N(μ) such that the following holds with probability at least 1 − 2δ. Define A(n) ≜ Cd · log(n/δ) · C_Y²(δ/2n) · σ_Y² / log²(n/δ). Let n ≥ N; we have for all i ∈ [d]:\n\n|W_i − ‖f′_i‖_{1,μ}| ≤ (1/t) ( √(A(n)/(n h^d)) + h · Σ_{i∈[d]} |f′_i|_sup ) + 2|f′_i|_sup ( √(ln(2d/δ)/n) + μ(∂_{t,i}(𝒳)) ) + ε_{t,i}.\n\nThe bound suggests setting t of the order of h or larger. We need t to be small in order for μ(∂_{t,i}(𝒳)) and ε_{t,i} to be small, but t needs to be sufficiently large (relative to h) for the estimates f_{n,h}(X + t·e_i) and f_{n,h}(X − t·e_i) to differ sufficiently so as to capture the variation in f along e_i.\n\nThe theorem immediately implies consistency for t → 0, h → 0, h/t → 0, and (n/log n) h^d t² → ∞ as n → ∞. This is satisfied for many settings, for example t ∝ √h and h ∝ 1/log n.\n\n4.3 Proof of Theorem 1\n\nThe main difficulty in bounding |W_i − ‖f′_i‖_{1,μ}| is in circumventing certain dependencies: both quantities f_{n,h}(X) and A_{n,i}(X) depend not just on X ∈ X, but on other samples in X, and thus introduce inter-dependencies between the estimates Δ_{t,i} f_{n,h}(X) for different X ∈ X.\n\nTo handle these dependencies, we carefully decompose |W_i − ‖f′_i‖_{1,μ}|, i ∈ [d], starting with:\n\n|W_i − ‖f′_i‖_{1,μ}| ≤ |W_i − E_n |f′_i(X)|| + |E_n |f′_i(X)| − ‖f′_i‖_{1,μ}|.   (3)\n\nThe following simple lemma bounds the second term of (3).\n\nLemma 3.
With probability at least 1 − δ, we have for all i ∈ [d]:\n\n|E_n |f′_i(X)| − ‖f′_i‖_{1,μ}| ≤ |f′_i|_sup · √(ln(2d/δ)/n).\n\nProof. Apply a Chernoff bound, and a union bound on i ∈ [d].\n\nNow the first term of equation (3) can be further bounded as\n\n|W_i − E_n |f′_i(X)|| ≤ |W_i − E_n |f′_i(X)| · 1{A_{n,i}(X)}| + E_n |f′_i(X)| · 1{Ā_{n,i}(X)}\n≤ |W_i − E_n |f′_i(X)| · 1{A_{n,i}(X)}| + |f′_i|_sup · E_n 1{Ā_{n,i}(X)}.   (4)\n\nWe will bound each term of (4) separately.\n\nThe next lemma bounds the second term of (4). It is proved in the appendix. The main technicality in this lemma is that, for any X in the sample X, the event Ā_{n,i}(X) depends on other samples in X.\n\nLemma 4. Let ∂_{t,i}(𝒳) be defined as in Section 4.1.3. For n ≥ n(μ), with probability at least 1 − 2δ, we have for all i ∈ [d]:\n\nE_n 1{Ā_{n,i}(X)} ≤ √(ln(2d/δ)/n) + μ(∂_{t,i}(𝒳)).
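For concreteness, the Chernoff-bound step behind Lemma 3 can be written out as a standard Hoeffding plus union-bound computation (our expansion; the constant factor is absorbed into the form stated in the lemma):

```latex
% Each term |f'_i(X_j)| lies in [0, |f'_i|_sup], so by Hoeffding's
% inequality, for any fixed i and any eps > 0,
\Pr\left( \left| \mathbb{E}_n |f'_i(X)| - \|f'_i\|_{1,\mu} \right| > \epsilon \right)
  \;\le\; 2\exp\!\left( -\frac{2 n \epsilon^2}{|f'_i|_{\sup}^2} \right).
% Setting the right-hand side to \delta/d and solving for eps gives
\epsilon \;=\; |f'_i|_{\sup}\, \sqrt{\frac{\ln(2d/\delta)}{2n}},
% and a union bound over the d coordinates yields the lemma, with the
% factor 1/\sqrt{2} absorbed into the stated \sqrt{\ln(2d/\delta)/n}.
```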
It remains to bound |W_i − E_n |f′_i(X)| · 1{A_{n,i}(X)}|. To this end we need to bring in f through the following quantities:\n\nW̃_i ≜ E_n [ Δ_{t,i} f(X) · 1{A_{n,i}(X)} ] = E_n [ (|f(X + t·e_i) − f(X − t·e_i)| / 2t) · 1{A_{n,i}(X)} ],\n\nand for any x ∈ 𝒳, define f̃_{n,h}(x) ≜ E_{Y|X} f_{n,h}(x) = Σ_i w_i(x) f(x_i).\n\nThe quantity W̃_i is easily related to E_n |f′_i(X)| · 1{A_{n,i}(X)}. This is done in Lemma 5 below. The quantity f̃_{n,h}(x) is needed when relating W_i to W̃_i.\n\nLemma 5. Define ε_{t,i} as in Section 4.1.3. With probability at least 1 − δ, we have for all i ∈ [d]:\n\n|W̃_i − E_n |f′_i(X)| · 1{A_{n,i}(X)}| ≤ ε_{t,i}.\n\nProof. We have f(x + t·e_i) − f(x − t·e_i) = ∫_{−t}^{t} f′_i(x + s·e_i) ds, and therefore\n\n2t (f′_i(x) − ε_{t,i}) ≤ f(x + t·e_i) − f(x − t·e_i) ≤ 2t (f′_i(x) + ε_{t,i}).\n\nIt follows that | |f(x + t·e_i) − f(x − t·e_i)| / 2t − |f′_i(x)| | ≤ ε_{t,i}, and therefore\n\n|W̃_i − E_n |f′_i(X)| · 1{A_{n,i}(X)}| ≤ E_n | Δ_{t,i} f(X) − |f′_i(X)| | · 1{A_{n,i}(X)} ≤ ε_{t,i}.\n\nIt remains to relate W_i to W̃_i.
We have\n\n|W_i − W̃_i| = |E_n (Δ_{t,i} f_{n,h}(X) − Δ_{t,i} f(X)) · 1{A_{n,i}(X)}|\n≤ (1/t) max_{s∈{−t,t}} E_n |f_{n,h}(X + s·e_i) − f̃_{n,h}(X + s·e_i)| · 1{A_{n,i}(X)}   (5)\n+ (1/t) max_{s∈{−t,t}} E_n |f̃_{n,h}(X + s·e_i) − f(X + s·e_i)| · 1{A_{n,i}(X)}.   (6)\n\nWe first handle the bias term (6) in the next lemma, which is proved in the appendix.\n\nLemma 6 (Bias). Let t + h ≤ τ. We have for all i ∈ [d], and all s ∈ {−t, t}:\n\nE_n |f̃_{n,h}(X + s·e_i) − f(X + s·e_i)| · 1{A_{n,i}(X)} ≤ h · Σ_{i∈[d]} |f′_i|_sup.\n\nThe variance term in (5) is handled in the lemma below. The proof is given in the appendix.\n\nLemma 7 (Variance terms). There exists C = C(μ, K(·)) such that, with probability at least 1 − 2δ, we have for all i ∈ [d], and all s ∈ {−t, t}:\n\nE_n |f_{n,h}(X + s·e_i) − f̃_{n,h}(X + s·e_i)| · 1{A_{n,i}(X)} ≤ √( Cd · log(n/δ) · C_Y²(δ/2n) · σ_Y² / (n (h/2)^d) ).\n\nThe next lemma summarizes the above results:\n\nLemma 8. Let t + h ≤ τ and let 0 < δ < 1. There exists C = C(μ, K(·)) such that the following holds with probability at least 1 − 2δ.
Define A(n) ≜ Cd · log(n/δ) · C_Y²(δ/2n) · σ_Y² / log²(n/δ). We have\n\n|W_i − E_n |f′_i(X)| · 1{A_{n,i}(X)}| ≤ (1/t) ( √(A(n)/(n h^d)) + h · Σ_{i∈[d]} |f′_i|_sup ) + ε_{t,i}.\n\nProof. Apply Lemmas 5, 6 and 7, in combination with equations (5) and (6).\n\nTo complete the proof of Theorem 1, apply Lemmas 8 and 3 in combination with equations (3) and (4).\n\n5 Experiments\n\n5.1 Data description\n\nWe present experiments on several real-world regression datasets. The first two datasets describe the dynamics of 7 degrees of freedom of robotic arms, Barrett WAM and SARCOS [9, 10]. The input points are 21-dimensional and correspond to samples of the positions, velocities, and accelerations of the 7 joints. The output points correspond to the torque of each joint. The far joints (1, 5, 7) correspond to different regression problems and are the only results reported. Expectedly, results for the other joints are similarly good.\n\nTable 1: Normalized mean square prediction errors and average prediction time per point (in milliseconds). The top two tables are for KR vs KR-ρ and the bottom two for k-NN vs k-NN-ρ.\n\n| | Barrett joint 1 | Barrett joint 5 | SARCOS joint 1 | SARCOS joint 5 | Housing |\n| KR error | 0.50 ± 0.02 | 0.50 ± 0.03 | 0.16 ± 0.02 | 0.14 ± 0.02 | 0.37 ± 0.08 |\n| KR-ρ error | 0.38 ± 0.03 | 0.35 ± 0.02 | 0.14 ± 0.02 | 0.12 ± 0.01 | 0.25 ± 0.06 |\n| KR time | 0.39 ± 0.02 | 0.37 ± 0.01 | 0.28 ± 0.05 | 0.23 ± 0.03 | 0.10 ± 0.01 |\n| KR-ρ time | 0.41 ± 0.03 | 0.38 ± 0.02 | 0.32 ± 0.05 | 0.23 ± 0.02 | 0.11 ± 0.01 |\n\n| | Concrete Strength | Wine Quality | Telecom | Ailerons | Parkinson's |\n| KR error | 0.42 ± 0.05 | 0.75 ± 0.03 | 0.30 ± 0.02 | 0.40 ± 0.02 | 0.38 ± 0.03 |\n| KR-ρ error | 0.37 ± 0.03 | 0.75 ± 0.02 | 0.23 ± 0.02 | 0.39 ± 0.02 | 0.34 ± 0.03 |\n| KR time | 0.14 ± 0.02 | 0.19 ± 0.02 | 0.15 ± 0.01 | 0.20 ± 0.01 | 0.30 ± 0.03 |\n| KR-ρ time | 0.14 ± 0.01 | 0.19 ± 0.02 | 0.16 ± 0.01 | 0.21 ± 0.01 | 0.30 ± 0.03 |\n\n| | Barrett joint 1 | Barrett joint 5 | SARCOS joint 1 | SARCOS joint 5 | Housing |\n| k-NN error | 0.41 ± 0.02 | 0.40 ± 0.02 | 0.08 ± 0.01 | 0.08 ± 0.01 | 0.28 ± 0.09 |\n| k-NN-ρ error | 0.29 ± 0.01 | 0.30 ± 0.02 | 0.07 ± 0.01 | 0.07 ± 0.01 | 0.22 ± 0.06 |\n| k-NN time | 0.21 ± 0.04 | 0.16 ± 0.03 | 0.13 ± 0.01 | 0.13 ± 0.01 | 0.08 ± 0.01 |\n| k-NN-ρ time | 0.13 ± 0.04 | 0.16 ± 0.03 | 0.14 ± 0.01 | 0.13 ± 0.01 | 0.08 ± 0.01 |\n\n| | Concrete Strength | Wine Quality | Telecom | Ailerons | Parkinson's |\n| k-NN error | 0.40 ± 0.04 | 0.73 ± 0.04 | 0.13 ± 0.02 | 0.37 ± 0.01 | 0.22 ± 0.01 |\n| k-NN-ρ error | 0.38 ± 0.03 | 0.72 ± 0.03 | 0.17 ± 0.02 | 0.34 ± 0.01 | 0.20 ± 0.01 |\n| k-NN time | 0.10 ± 0.01 | 0.15 ± 0.01 | 0.16 ± 0.02 | 0.12 ± 0.01 | 0.14 ± 0.01 |\n| k-NN-ρ time | 0.11 ± 0.01 | 0.15 ± 0.01 | 0.15 ± 0.01 | 0.11 ± 0.01 | 0.15 ± 0.01 |\n\nFigure 2: Normalized mean square prediction error over 2000 points for varying training sizes. Results are shown for k-NN and kernel regression (KR), with and without the metric ρ. Panels: (a) SARCOS, joint 7, with KR; (b) Ailerons with KR; (c) Telecom with KR; (d) SARCOS, joint 7, with k-NN; (e) Ailerons with k-NN; (f) Telecom with k-NN.\n\nThe other datasets are taken from the UCI repository [12] and from [13]. The concrete strength dataset (Concrete Strength) contains 8-dimensional input points, describing age and ingredients of concrete; the output points are the compressive strength.
The wine quality dataset (Wine Quality) contains 11-dimensional input points corresponding to the physicochemistry of wine samples; the output points are the wine quality. The ailerons dataset (Ailerons) is taken from the problem of flying an F16 aircraft; the 5-dimensional input points describe the status of the aeroplane, while the goal is to predict the control action on the ailerons of the aircraft. The housing dataset (Housing) concerns the task of predicting housing values in areas of Boston; the input points are 13-dimensional. The Parkinson's Telemonitoring dataset (Parkinson's) is used to predict the clinician's Parkinson's disease symptom score using biomedical voice measurements represented by 21-dimensional input points. We also consider a telecommunication problem (Telecom), wherein the 47-dimensional input points and the output points describe the bandwidth usage in a network.\n\nFor all datasets we normalize each coordinate with its standard deviation from the training data.\n\n5.2 Experimental setup\n\nTo learn the metric, we set h by cross-validation on half the training points, and we set t = h/2 for all datasets. Note that in practice we might want to also tune t in the range of h for even better performance than reported here.
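This tuning step can be sketched as follows (our names and grid, purely illustrative; only the split-in-half validation and the choice t = h/2 come from the setup above):

```python
import numpy as np

def box_kernel_regress(X_train, Y_train, x, h):
    """Box-kernel estimate used while learning the metric."""
    near = np.linalg.norm(X_train - x, axis=1) < h
    return Y_train[near].mean() if near.any() else Y_train.mean()

def tune_h_then_t(X, Y, h_grid):
    """Cross-validate h on half of the points (fit on one half, score
    squared error on the other), then fix t = h/2."""
    m = len(X) // 2
    X_fit, Y_fit, X_val, Y_val = X[:m], Y[:m], X[m:], Y[m:]
    errors = [
        np.mean([(box_kernel_regress(X_fit, Y_fit, x, h) - y) ** 2
                 for x, y in zip(X_val, Y_val)])
        for h in h_grid
    ]
    h = h_grid[int(np.argmin(errors))]
    return h, h / 2
```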
The event A_{n,i}(X) is set to reject the gradient estimate Δ_{t,i} f_{n,h}(X) at X if no sample contributed to one of the estimates f_{n,h}(X ± t·e_i).\n\nIn each experiment, we compare kernel regression in the Euclidean metric space (KR) and in the learned metric space (KR-ρ), where we use a box kernel for both. Similar comparisons are made using k-NN and k-NN-ρ. All methods are implemented using a fast neighborhood search procedure, namely the cover-tree of [14], and we also report the average prediction times so as to confirm that, on average, time-performance is not affected by using the metric.\n\nThe parameter k in k-NN/k-NN-ρ, and the bandwidth in KR/KR-ρ, are learned by cross-validation on half of the training points. We try the same range of k (from 1 to 5 log n) for both k-NN and k-NN-ρ. We try the same range of bandwidth/space-diameter (a grid of step 0.02 from 1 down to 0.02) for both KR and KR-ρ; this is done efficiently by starting with a log search to detect a smaller range, followed by a grid search on that smaller range.\n\nTable 1 shows the normalized Mean Square Errors (nMSE), where the MSE on the test set is normalized by the variance of the test output. We use 1000 training points in the robotic datasets; 2000 training points in the Telecom, Parkinson's, Wine Quality, and Ailerons datasets; 730 training points in Concrete Strength; and 300 in Housing. We used 2000 test points in all of the problems, except for Concrete Strength (300 points) and Housing (200 points).
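To illustrate what k-NN-ρ computes, a weighted k-NN regressor is just k-NN under the rescaled distance. This is a toy sketch with hand-supplied weights and brute-force search, not the experimental implementation (which uses cover trees and learned weights):

```python
import numpy as np

def knn_regress(X_train, Y_train, X_test, k, weights=None):
    """k-NN regression under rho(x, x')^2 = sum_i W_i (x_i - x'_i)^2.
    weights=None gives plain Euclidean k-NN."""
    w = np.ones(X_train.shape[1]) if weights is None else weights
    preds = np.empty(len(X_test))
    for j, x in enumerate(X_test):
        d2 = ((X_train - x) ** 2 * w).sum(axis=1)  # squared rho-distances
        nearest = np.argpartition(d2, k)[:k]       # brute-force k-NN search
        preds[j] = Y_train[nearest].mean()
    return preds
```

On a target that varies mostly along one coordinate, up-weighting that coordinate concentrates the neighborhoods where they matter, which is exactly the variance reduction discussed in Section 2.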
Averages over 10 random experiments are reported.\n\nFor the larger datasets (SARCOS, Ailerons, Telecom) we also report the behavior of the algorithms, with and without the metric, as the training size n increases (Figure 2).\n\n5.3 Discussion of results\n\nFrom the results in Table 1 we see that on virtually all datasets the metric helps improve the performance of the distance-based regressor, even though we did not tune t to the particular problem (remember that t = h/2 for all experiments). The only exceptions are Wine Quality, where the learned weights are nearly uniform, and Telecom with k-NN. We noticed that the Telecom dataset has a lot of outliers, and this probably explains the discrepancy, besides the fact that we did not attempt to tune t. Also notice that the error of k-NN is already low for small sample sizes, making it harder to outperform. However, as shown in Figure 2, for larger training sizes k-NN-ρ gains on k-NN. The rest of the results in Figure 2, where we vary n, are self-descriptive: gradient weighting clearly improves the performance of the distance-based regressors.\n\nWe also report the average prediction times in Table 1. We see that running the distance-based methods with gradient weights does not affect estimation time. Last, remember that the metric can be learned online at the cost of only 2d times the average kernel estimation time reported.\n\n6 Final remarks\n\nGradient weighting is simple to implement, computationally efficient in batch-mode and online, and most importantly improves the performance of distance-based regressors in real-world applications. In our experiments, most or all coordinates of the data are relevant, yet some coordinates are more important than others. This is sufficient for gradient weighting to yield gains in performance. We believe there is yet room for improvement given the simplicity of our current method.\n\nReferences\n\n[1] Kilian Q. Weinberger and Gerald Tesauro.
Metric learning for kernel regression. Journal of Machine Learning Research - Proceedings Track, 2:612–619, 2007.\n\n[2] Bo Xiao, Xiaokang Yang, Yi Xu, and Hongyuan Zha. Learning distance metric for regression by semidefinite programming with application to human age estimation. In Proceedings of the 17th ACM International Conference on Multimedia, pages 451–460, 2009.\n\n[3] Shai Shalev-Shwartz, Yoram Singer, and Andrew Y. Ng. Online and batch learning of pseudo-metrics. In ICML, pages 743–750. ACM Press, 2004.\n\n[4] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, 2007.\n\n[5] W. Härdle and T. Gasser. On robust kernel estimation of derivatives of regression functions. Scandinavian Journal of Statistics, pages 233–240, 1985.\n\n[6] J. Lafferty and L. Wasserman. Rodeo: Sparse nonparametric regression in high dimensions. Arxiv preprint math/0506342, 2005.\n\n[7] L. Rosasco, S. Villa, S. Mosci, M. Santoro, and A. Verri. Nonparametric sparsity and regularization. http://arxiv.org/abs/1208.2572, 2012.\n\n[8] L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution Free Theory of Nonparametric Regression. Springer, New York, NY, 2002.\n\n[9] S. Kpotufe. k-NN regression adapts to local intrinsic dimension. NIPS, 2011.\n\n[10] Duy Nguyen-Tuong, Matthias W. Seeger, and Jan Peters. Model learning with local Gaussian process regression. Advanced Robotics, 23(15):2015–2034, 2009.\n\n[11] Duy Nguyen-Tuong and Jan Peters. Incremental online sparsification for model learning in real-time robot control. Neurocomputing, 74(11):1859–1867, 2011.\n\n[12] A. Frank and A. Asuncion. UCI machine learning repository. http://archive.ics.uci.edu/ml. University of California, Irvine, School of Information and Computer Sciences, 2012.\n\n[13] Luis Torgo. Regression datasets.
http://www.liaad.up.pt/~ltorgo. University of Porto, Department of Computer Science, 2012.\n\n[14] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbors. ICML, 2006.", "award": [], "sourceid": 1297, "authors": [{"given_name": "Samory", "family_name": "Kpotufe", "institution": null}, {"given_name": "Abdeslam", "family_name": "Boularias", "institution": null}]}