{"title": "Pointwise Tracking the Optimal Regression Function", "book": "Advances in Neural Information Processing Systems", "page_first": 2042, "page_last": 2050, "abstract": "This paper examines the possibility of a `reject option' in the context of least squares regression. It is shown that using rejection it is theoretically possible to learn `selective' regressors that can $\\epsilon$-pointwise track the best regressor in hindsight from the same hypothesis class, while rejecting only a bounded portion of the domain. Moreover, the rejected volume vanishes with the training set size, under certain conditions. We then develop efficient and exact implementation of these selective regressors for the case of linear regression. Empirical evaluation over a suite of real-world datasets corroborates the theoretical analysis and indicates that our selective regressors can provide substantial advantage by reducing estimation error.", "full_text": "Pointwise Tracking the Optimal Regression Function\n\nRan El-Yaniv and Yair Wiener\nComputer Science Department\n\nTechnion \u2013 Israel Institute of Technology\n\n{rani,wyair}@{cs,tx}.technion.ac.il\n\nAbstract\n\nThis paper examines the possibility of a \u2018reject option\u2019 in the context of least\nsquares regression. It is shown that using rejection it is theoretically possible to\nlearn \u2018selective\u2019 regressors that can \u01eb-pointwise track the best regressor in hind-\nsight from the same hypothesis class, while rejecting only a bounded portion of\nthe domain. Moreover, the rejected volume vanishes with the training set size,\nunder certain conditions. We then develop ef\ufb01cient and exact implementation of\nthese selective regressors for the case of linear regression. 
Empirical evaluation\nover a suite of real-world datasets corroborates the theoretical analysis and indi-\ncates that our selective regressors can provide substantial advantage by reducing\nestimation error.\n\n1 Introduction\n\nConsider a standard least squares regression problem. Given m input-output training pairs,\n(x1, y1), . . . , (xm, ym), we are required to learn a predictor, \u02c6f \u2208 F, capable of generating accurate\noutput predictions, \u02c6f (x) \u2208 R, for any input x. Assuming that input-output pairs are i.i.d. realiza-\ntions of some unknown stochastic source, P (x, y), we would like to choose \u02c6f so as to minimize the\nstandard least squares risk functional,\n\nR( \u02c6f ) =Z (y \u2212 \u02c6f (x))2dP (x, y).\n\nLet f \u2217 = argminf \u2208F R(f ) be the optimal predictor in hindsight (based on full knowledge of P ).\nA classical result in statistical learning is that under certain structural conditions on F and possibly\non P , one can learn a regressor that approaches the average optimal performance, R(f \u2217), when the\nsample size, m, approaches in\ufb01nity [1].\nIn this paper we contemplate the challenge of pointwise tracking the optimal predictions of f \u2217 after\nobserving only a \ufb01nite (and possibly small) set of training samples. It turns out that meeting this\ndif\ufb01cult task can be made possible by harnessing the \u2018reject option\u2019 compromise from classi\ufb01cation.\nInstead of predicting the output for the entire input domain, the regressor is allowed to abstain from\nprediction for part of the domain. We present here new techniques for regression with a reject\noption, capable of achieving pointwise optimality on substantial parts of the input domain, under\ncertain conditions.\n\nSection 3 introduces a general strategy for learning selective regressors. This strategy is guaranteed\nto achieve \u01eb-pointwise optimality (de\ufb01ned in Section 2) all through its region of action. 
This result is proved in Theorem 3.8, which also shows that the guaranteed coverage increases monotonically with the training sample size and converges to 1. This type of guarantee is quite strong, as it ensures tight tracking of individual optimal predictions made by f*, while covering a substantial portion of the input domain.

At the outset, the general strategy we propose appears to be out of reach because accept/reject decisions require the computation of a supremum over a very large, and possibly infinite, hypothesis subset. In Section 4, however, we show how to compute the strategy for each point of interest using only two constrained ERM calculations. This useful reduction, shown in Lemma 4.2, opens possibilities for efficient implementations of optimal selective regressors whenever the hypothesis class of interest allows for efficient (constrained) ERM (see Definition 4.1).

For the case of linear least squares regression we utilize known techniques for both ERM and constrained ERM and derive in Section 5 an exact implementation achieving pointwise optimal selective regression. The resulting algorithm is efficient and can be easily implemented using standard matrix operations, including (pseudo) inversion. Theorem 5.3 in this section states a novel pointwise bound on the difference between the prediction of an ERM linear regressor and the prediction of f* for each individual point. Finally, in Section 6 we present numerical examples over a suite of real-world regression datasets, demonstrating the effectiveness of our methods and indicating that substantial performance improvements can be gained by using selective regression.

Related work. Uses of a reject option are quite common in classification, where this technique was initiated more than 50 years ago with Chow's pioneering work [2, 3]. 
However, the reject\noption is only scarcely and anecdotally mentioned in the context of regression. In [4] a boosting\nalgorithm for regression is proposed and a few reject mechanisms are considered, applied both\non the aggregate decision and/or on the underlying weak regressors. A straightforward threshold-\nbased reject mechanism (rejecting low response values) is applied in [5] on top of support vector\nregression. This mechanism was found to improve false positive rates.\n\nThe present paper is inspired and draws upon recent results on selective classi\ufb01cation [6, 7, 8],\nand can be viewed as a natural continuation of the results of [8]. In particular, we adapt the basic\nde\ufb01nitions of selectivity and the general outline of the derivation and strategy presented in [8].\n\n2 Selective regression and other preliminary de\ufb01nitions\n\nWe begin with a de\ufb01nition of the following general and standard regression setting. A \ufb01nite training\nsample of m labeled examples, Sm , {(xi, yi)}m\ni=1 \u2286 (X \u00d7 Y)m, is observed, where X is some\nfeature space and Y \u2286 R. Using Sm we are required to select a regressor \u02c6f \u2208 F, where F is a \ufb01xed\nhypothesis class containing potential regressors of the form f : X \u2192 Y. It is desired that predictions\n\u02c6f (x), for unseen instances x, will be as accurate as possible. We assume that pairs (x, y), including\ntraining instances, are sampled i.i.d. from some unknown stochastic source, P (x, y), de\ufb01ned over\nX \u00d7 Y. 
Given a loss function, \u2113 : Y \u00d7 Y \u2192 [0,\u221e), we quantify the prediction quality of any f\nthrough its true error or risk, R(f ), de\ufb01ned as its expected loss with respect to P ,\n\nR(f ) , E(x,y) {\u2113(f (x), y)} =Z \u2113(f (x), y)dP (x, y).\n\nWhile R(f ) is an unknown quantity, we do observe the empirical error of f , de\ufb01ned as\n\n\u02c6R(f ) , 1\nm\n\nm\n\nXi=1\n\n\u2113(f (xi), yi).\n\nLet \u02c6f , arg inf f \u2208F \u02c6R(f ) be the empirical risk minimizer (ERM), and f \u2217 , arg inf f \u2208F R(f ), the\ntrue risk minimizer.\n\nNext we de\ufb01ne selective regression using the following de\ufb01nitions, which are taken, as is, from the\nselective classi\ufb01cation setting of [6]. Here again, we are given a training sample Sm as above, but\nare now required to output a selective regressor de\ufb01ned to be a pair (f, g), with f \u2208 F being a\nstandard regressor, and g : X \u2192 {0, 1} is a selection function, which is served as quali\ufb01er for f as\nfollows. For any x \u2208 X ,\n\n(1)\n\n(f, g)(x) ,(cid:26) reject,\n\nf (x),\n\nif g(x) = 0;\nif g(x) = 1.\n\nThus, the selective regressor abstains from prediction at a point x iff g(x) = 0. The general perfor-\nmance of a selective regressor is characterized in terms of two quantities: coverage and risk. The\ncoverage of (f, g) is\n\n\u03a6(f, g) , EP [g(x)] .\n\n2\n\n\fThe true risk of (f, g) is the risk of f restricted to its region of activity as quali\ufb01ed by g, and\nnormalized by its coverage,\n\nR(f, g) ,\n\nEP [\u2113(f (x), y) \u00b7 g(x)]\n\n\u03a6(f, g)\n\n.\n\nWe say that the selective regressor (f, g) is \u01eb-pointwise optimal if\n\n\u2200x \u2208 {x \u2208 X : g(x) = 1} ,\n\n|f (x) \u2212 f \u2217(x)| \u2264 \u01eb.\n\nNote that pointwise optimality is a considerably stronger property than risk, which only refers to\naverage performance.\nWe de\ufb01ne a (standard) distance metric over the hypothesis class F. 
For any probability measure \u00b5\non X , let L2(\u00b5) be the Hilbert space of functions from X to R, with the inner product de\ufb01ned as\n\nThe distance function induced by the inner product is\n\nhf, gi , E\u00b5(x)f (x)g(x).\n\n\u03c1(f, g) ,k f \u2212 g k=phf \u2212 g, f \u2212 gi =qE\u00b5(x) (f (x) \u2212 g(x))2.\n\nFinally, for any f \u2208 F we de\ufb01ne a ball in F of radius r around f ,\nB(f, r) , {f \u2032 \u2208 F : \u03c1(f, f \u2032) \u2264 r} .\n\n3 Pointwise optimality with bounded coverage\n\nIn this section we analyze the following strategy for learning a selective regressor, which turns out\nto ensure \u01eb-pointwise optimality with monotonically increasing coverage (with m). We call it a\nstrategy rather than an algorithm because it is not at all clear at the outset how to implement it. In\nsubsequent sections we develop ef\ufb01cient and precise implementation for linear regression.\nWe require the following de\ufb01nition. For any hypothesis class F, target hypothesis f \u2208 F, distribu-\ntion P , sample Sm, and real r > 0, de\ufb01ne,\n\u02c6V(f, r) ,nf \u2032 \u2208 F : \u02c6R(f \u2032) \u2264 \u02c6R(f ) + ro .\nV(f, r) , {f \u2032 \u2208 F : R(f \u2032) \u2264 R(f ) + r}\n\nand\n\n(2)\n\nStrategy 1 A learning strategy for \u01eb-pointwise optimal selective regressors\nInput: Sm, m, \u03b4, F, \u01eb\nOutput: A selective regressor ( \u02c6f , g) achieving \u01eb-pointwise optimality\n1: Set \u02c6f = ERM (F, Sm), i.e., \u02c6f is any empirical risk minimizer from F\n2: Set G = \u02c6V \u201c \u02c6f ,`\u03c3(m, \u03b4/4, F)2 \u2212 1\u00b4 \u00b7 \u02c6R( \u02c6f )\u201d\n3: Construct g such that g(x) = 1 \u21d0\u21d2 \u2200f \u2032 \u2208 G |f \u2032(x) \u2212 \u02c6f (x)| < \u01eb\n\n/* see De\ufb01nition 3.3 and (2) */\n\nFor the sake of brevity, throughout this section we often write f instead of f (x), where f is any\nregressor. The following Lemma 3.1 is based on the proof of Lemma A.12 in [9].\nLemma 3.1 ([9]). 
For any f \u2208 F. Let \u2113 : Y \u00d7 Y \u2192 [0,\u221e) be the squared loss function and F be\na convex hypothesis class. Then, E(x,y)(f \u2217(x) \u2212 y)(f (x) \u2212 f \u2217(x)) \u2265 0.\nLemma 3.2. Under the same conditions of Lemma 3.1, for any r > 0, V(f \u2217, r) \u2286 B (f \u2217,\u221ar) .\nProof. If f \u2208 V(f \u2217, r), then by de\ufb01nition,\n\nR(f ) \u2264 R(f \u2217) + r.\n\n(3)\n\nR(f ) \u2212 R(f \u2217) = E {\u2113(f, y) \u2212 \u2113(f \u2217, y)} = E(cid:8)(f \u2212 y)2 \u2212 (f \u2217 \u2212 y)2(cid:9)\nApplying Lemma 3.1 and (3) we get, \u03c1(f, f \u2217) \u2264pR(f ) \u2212 R(f \u2217) \u2264 \u221ar.\n\n= En(f \u2212 f \u2217)2 \u2212 2(y \u2212 f \u2217)(f \u2212 f \u2217)o = \u03c12(f, f \u2217) + 2E(f \u2217 \u2212 y)(f \u2212 f \u2217).\n\n3\n\n\fDe\ufb01nition 3.3 (Multiplicative Risk Bounds). Let \u03c3\u03b4 , \u03c3 (m, \u03b4,F ) be de\ufb01ned such that for any\n0 < \u03b4 < 1, with probability of at least 1 \u2212 \u03b4 over the choice of Sm from P m, any hypothesis f \u2208 F\nsatis\ufb01es\n\nR(f ) \u2264 \u02c6R(f ) \u00b7 \u03c3 (m, \u03b4,F ) .\n\nSimilarly, the reverse bound , \u02c6R(f ) \u2264 R(f ) \u00b7 \u03c3 (m,F , \u03b4), holds under the same conditions.\nRemark 3.1. The purpose of De\ufb01nition 3.3 is to facilitate the use of any (known) risk bound as a\nplug-in component in subsequent derivations. We de\ufb01ne \u03c3 as a multiplicative bound, which is com-\nmon in the treatment of unbounded loss functions such as the squared loss (see discussion by Vapnik\nin [10], page 993). Instances of such bounds can be extracted, e.g., from [11] (Theorem 1), and from\nbounds discussed in [10]. We also developed the entire set of results that follow while relying on\nadditive bounds, which are common when using bounded loss functions. These developments will\nbe presented in the full version of the paper.\n\nThe proof of the following lemma follows closely the proof of Lemma 5.3 in [8]. 
However, it\nconsiders a multiplicative risk bound rather than additive.\nLemma 3.4. For any r > 0, and 0 < \u03b4 < 1, with probability of at least 1 \u2212 \u03b4,\n\u03b4/2 \u2212 1) \u00b7 R(f \u2217) + r \u00b7 \u03c3\u03b4/2(cid:17) .\nLemma 3.5. Let F be a convex hypothesis space, \u2113 : Y \u00d7 Y \u2192 [0,\u221e), a convex loss function, and\n\u02c6f be an ERM. Then, with probability of at least 1 \u2212 \u03b4/2, for any x \u2208 X ,\n\n\u02c6V( \u02c6f , r) \u2286 V(cid:16)f \u2217, (\u03c32\n\n|f \u2217(x) \u2212 \u02c6f (x)| \u2264\n\nf \u2208 \u02c6V\u201c \u02c6f ,(\u03c32\n\nsup\n\u03b4/4\u22121)\u00b7 \u02c6R( \u02c6f )\u201d\n\n|f (x) \u2212 \u02c6f (x)|.\n\nProof. Applying the multiplicative risk bound, we get that with probability of at least 1 \u2212 \u03b4/4,\n\n\u02c6R(f \u2217) \u2264 R(f \u2217) \u00b7 \u03c3\u03b4/4.\n\nSince f \u2217 minimizes the true error, R(f \u2217) \u2264 R( \u02c6f ). Applying the multiplicative risk bound on \u02c6f,\nwe know also that with probability of at least 1 \u2212 \u03b4/4, R( \u02c6f ) \u2264 \u02c6R( \u02c6f ) \u00b7 \u03c3\u03b4/4. Combining the three\ninequalities by using the union bound we get that with probability of at least 1 \u2212 \u03b4/2,\n\n\u02c6R(f \u2217) \u2264 \u02c6R( \u02c6f ) \u00b7 \u03c32\n\n\u03b4/4 = \u02c6R( \u02c6f ) +(cid:16)\u03c32\n\n\u03b4/4 \u2212 1(cid:17) \u00b7 \u02c6R( \u02c6f ).\n\nHence, with probability of at least 1 \u2212 \u03b4/2 we get f \u2217 \u2208 \u02c6V(cid:16) \u02c6f , (\u03c32\nLet G \u2286 F. We generalize the concept of disagreement set [12, 6] to real-valued functions. The\n\u01eb-disagreement set w.r.t. G is de\ufb01ned as\n\n\u03b4/4 \u2212 1) \u00b7 \u02c6R( \u02c6f )(cid:17)\n\nDIS\u01eb(G) , {x \u2208 X : \u2203f1, f2 \u2208 G s.t.\n\n|f1(x) \u2212 f2(x)| \u2265 \u01eb} .\n\nFor any G \u2286 F, distribution P , and \u01eb > 0, we de\ufb01ne \u2206\u01ebG , P rP {DIS\u01eb(G)} . 
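To build intuition for these quantities, the empirical mass of the ε-disagreement set of a finite pool of real-valued hypotheses can be computed directly on a sample. The following minimal numpy sketch is an illustration only: the set G in the analysis is an infinite ERM ball, and the hypothesis pool and sample values below are made up.

```python
import numpy as np

def empirical_eps_disagreement(preds, eps):
    """Fraction of sample points on which some pair of hypotheses
    in the pool predicts values differing by at least eps.

    preds: (k, n) array; row j holds hypothesis j's predictions
           on the n sample points.
    """
    # max over pairs f1, f2 of |f1(x) - f2(x)| equals the per-point
    # spread max - min over the pool
    spread = preds.max(axis=0) - preds.min(axis=0)
    return float(np.mean(spread >= eps))

# Toy pool of 3 hypotheses evaluated on 4 sample points
preds = np.array([[0.0, 1.0, 2.0, 3.0],
                  [0.1, 1.4, 2.0, 3.2],
                  [0.0, 1.2, 2.1, 3.1]])
print(empirical_eps_disagreement(preds, eps=0.3))  # prints 0.25
```

Only the second sample point has a spread of at least 0.3 across the pool, so a quarter of the sample falls in the ε-disagreement set.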
In the following definition we extend Hanneke's disagreement coefficient [13] to the case of real-valued functions.¹

Definition 3.6 (ε-disagreement coefficient). The ε-disagreement coefficient of $\mathcal{F}$ under $P$ is,
$$\theta_\epsilon \triangleq \sup_{r > r_0} \frac{\Delta_\epsilon B(f^*, r)}{r}. \qquad (4)$$

Throughout this paper we set $r_0 = 0$. Our analyses for arbitrary $r_0 > 0$ will be presented in the full version of this paper.

The proof of the following technical statement relies on the same technique used for the proof of Theorem 5.4 in [8].

¹Our attempts to utilize a different known extension of the disagreement coefficient [14] were not successful. Specifically, the coefficient proposed there is unbounded for the squared loss function when $\mathcal{Y}$ is unbounded.

Lemma 3.7. Let $\mathcal{F}$ be a convex hypothesis class, and assume $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$ is the squared loss function. Let $\epsilon > 0$ be given. Assume that $\mathcal{F}$ has ε-disagreement coefficient $\theta_\epsilon$. Then, for any $r > 0$ and $0 < \delta < 1$, with probability of at least $1 - \delta$,
$$\Delta_\epsilon \hat{\mathcal{V}}(\hat{f}, r) \le \theta_\epsilon \sqrt{\left(\sigma_{\delta/2}^2 - 1\right) \cdot R(f^*) + r \cdot \sigma_{\delta/2}}.$$

The following theorem is the main result of this section, showing that Strategy 1 achieves ε-pointwise optimality with a meaningful coverage that converges to 1. Although $R(f^*)$ in the bound (5) is an unknown quantity, it is still a constant, and as $\sigma$ approaches 1, the coverage lower bound approaches 1 as well. When using a typical additive risk bound, $R(f^*)$ disappears from the RHS.

Theorem 3.8. Assume the conditions of Lemma 3.7 hold. Let $(f, g)$ be the selective regressor chosen by Strategy 1. 
Then, with probability of at least $1 - \delta$,
$$\Phi(f, g) \ge 1 - \theta_\epsilon \sqrt{\left(\sigma_{\delta/4}^2 - 1\right) \cdot \left(R(f^*) + \sigma_{\delta/4} \cdot \hat{R}(\hat{f})\right)}, \qquad (5)$$
and
$$\forall x \in \{x \in \mathcal{X} : g(x) = 1\} \qquad |f(x) - f^*(x)| < \epsilon.$$

Proof. According to Strategy 1, if $g(x) = 1$ then $\sup_{f \in \hat{\mathcal{V}}(\hat{f}, (\sigma_{\delta/4}^2 - 1) \cdot \hat{R}(\hat{f}))} |f(x) - \hat{f}(x)| < \epsilon$. Applying Lemma 3.5 we get that, with probability of at least $1 - \delta/2$,
$$\forall x \in \{x \in \mathcal{X} : g(x) = 1\} \qquad |f(x) - f^*(x)| < \epsilon.$$
Since $\hat{f} \in \hat{\mathcal{V}}(\hat{f}, (\sigma_{\delta/4}^2 - 1) \cdot \hat{R}(\hat{f})) = G$ we get
$$\Phi(f, g) = \mathbf{E}\{g(X)\} = \mathbf{E}\left\{\mathbb{I}\left(\sup_{f \in G} |f(x) - \hat{f}(x)| < \epsilon\right)\right\} = 1 - \mathbf{E}\left\{\mathbb{I}\left(\sup_{f \in G} |f(x) - \hat{f}(x)| \ge \epsilon\right)\right\}$$
$$\ge 1 - \mathbf{E}\left\{\mathbb{I}\left(\sup_{f_1, f_2 \in G} |f_1(x) - f_2(x)| \ge \epsilon\right)\right\} = 1 - \Delta_\epsilon G.$$
Applying Lemma 3.7 and the union bound we conclude that with probability of at least $1 - \delta$,
$$\Phi(f, g) = \mathbf{E}\{g(X)\} \ge 1 - \theta_\epsilon \sqrt{\left(\sigma_{\delta/4}^2 - 1\right) \cdot \left(R(f^*) + \sigma_{\delta/4} \cdot \hat{R}(\hat{f})\right)}.$$

4 Rejection via constrained ERM

In Strategy 1 we are required to track the supremum of a possibly infinite hypothesis subset, which in general might be intractable. The following Lemma 4.2 reduces the problem of calculating the supremum to a problem of calculating a constrained ERM for two hypotheses.

Definition 4.1 (constrained ERM). Let $x \in \mathcal{X}$ and $\epsilon \in \mathbb{R}$ be given. Define
$$\hat{f}_{\epsilon,x} \triangleq \operatorname*{argmin}_{f \in \mathcal{F} \,:\, f(x) = \hat{f}(x) + \epsilon} \hat{R}(f),$$
where $\hat{f}(x)$ is, as usual, the value of the unconstrained ERM regressor at point $x$.

Lemma 4.2. 
Let $\mathcal{F}$ be a convex hypothesis space, and $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$ a convex loss function. Let $\epsilon > 0$ be given, and let $(f, g)$ be a selective regressor chosen by Strategy 1 after observing the training sample $S_m$. Let $\hat{f}$ be an ERM. Then,
$$g(x) = 0 \iff \hat{R}(\hat{f}_{\epsilon,x}) \le \hat{R}(\hat{f}) \cdot \sigma_{\delta/4}^2 \;\vee\; \hat{R}(\hat{f}_{-\epsilon,x}) \le \hat{R}(\hat{f}) \cdot \sigma_{\delta/4}^2.$$

Proof. Let $G \triangleq \hat{\mathcal{V}}(\hat{f}, (\sigma_{\delta/4}^2 - 1) \cdot \hat{R}(\hat{f}))$, and assume there exists $f \in G$ such that $|f(x) - \hat{f}(x)| \ge \epsilon$. Assume w.l.o.g. (the other case is symmetric) that $f(x) - \hat{f}(x) = a \ge \epsilon$. Since $\mathcal{F}$ is convex,
$$f' = \left(1 - \frac{\epsilon}{a}\right) \cdot \hat{f} + \frac{\epsilon}{a} \cdot f \in \mathcal{F}.$$
We thus have,
$$f'(x) = \left(1 - \frac{\epsilon}{a}\right) \cdot \hat{f}(x) + \frac{\epsilon}{a} \cdot f(x) = \left(1 - \frac{\epsilon}{a}\right) \cdot \hat{f}(x) + \frac{\epsilon}{a} \cdot \left(\hat{f}(x) + a\right) = \hat{f}(x) + \epsilon.$$
Therefore, by the definition of $\hat{f}_{\epsilon,x}$, and using the convexity of $\ell$ together with Jensen's inequality,
$$\hat{R}(\hat{f}_{\epsilon,x}) \le \hat{R}(f') = \frac{1}{m} \sum_{i=1}^m \ell(f'(x_i), y_i) = \frac{1}{m} \sum_{i=1}^m \ell\left(\left(1 - \frac{\epsilon}{a}\right) \hat{f}(x_i) + \frac{\epsilon}{a} f(x_i),\, y_i\right)$$
$$\le \left(1 - \frac{\epsilon}{a}\right) \cdot \frac{1}{m} \sum_{i=1}^m \ell\left(\hat{f}(x_i), y_i\right) + \frac{\epsilon}{a} \cdot \frac{1}{m} \sum_{i=1}^m \ell(f(x_i), y_i) = \left(1 - \frac{\epsilon}{a}\right) \cdot \hat{R}(\hat{f}) + \frac{\epsilon}{a} \cdot \hat{R}(f)$$
$$\le \left(1 - \frac{\epsilon}{a}\right) \cdot \hat{R}(\hat{f}) + \frac{\epsilon}{a} \cdot \hat{R}(\hat{f}) \cdot \sigma_{\delta/4}^2 = \hat{R}(\hat{f}) + \frac{\epsilon}{a} \cdot \left(\sigma_{\delta/4}^2 - 1\right) \cdot \hat{R}(\hat{f}) \le \hat{R}(\hat{f}) \cdot \sigma_{\delta/4}^2.$$
As for the other direction, if $\hat{R}(\hat{f}_{\epsilon,x}) \le \hat{R}(\hat{f}) \cdot \sigma_{\delta/4}^2$, then $\hat{f}_{\epsilon,x} \in G$ and $|\hat{f}_{\epsilon,x}(x) - \hat{f}(x)| = \epsilon$.

So far we have discussed the case where $\epsilon$ is given, and our objective is to find an ε-pointwise optimal regressor. Lemma 4.2 provides the means to compute such an optimal regressor, assuming that a method to compute a constrained ERM is available (as is the case for squared loss linear regressors; see next section). However, as was discussed in [6], in many cases our objective is to explore the entire risk-coverage trade-off; in other words, to get a pointwise bound on $|f^*(x) - f(x)|$, i.e., individually for any test point $x$. The following theorem states such a pointwise bound.

Theorem 4.3. Let $\mathcal{F}$ be a convex hypothesis class, $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$ a convex loss function, and let $\hat{f}$ be an ERM. Then, with probability of at least $1 - \delta/2$ over the choice of $S_m$ from $P^m$, for any $x \in \mathcal{X}$,
$$|f^*(x) - \hat{f}(x)| \le \sup_{\epsilon \in \mathbb{R}} \left\{ |\epsilon| : \hat{R}(\hat{f}_{\epsilon,x}) \le \hat{R}(\hat{f}) \cdot \sigma_{\delta/4}^2 \right\}.$$

Proof. Define $\tilde{f} \triangleq \operatorname*{argmax}_{f \in \hat{\mathcal{V}}(\hat{f}, (\sigma_{\delta/4}^2 - 1) \cdot \hat{R}(\hat{f}))} |f(x) - \hat{f}(x)|$. Assume w.l.o.g. (the other case is symmetric) that $\tilde{f}(x) = \hat{f}(x) + a$. Following Definition 4.1 we get $\hat{R}(\hat{f}_{a,x}) \le \hat{R}(\tilde{f}) \le \hat{R}(\hat{f}) \cdot \sigma_{\delta/4}^2$. Define $\epsilon' = \sup_{\epsilon \in \mathbb{R}} \{ |\epsilon| : \hat{R}(\hat{f}_{\epsilon,x}) \le \hat{R}(\hat{f}) \cdot \sigma_{\delta/4}^2 \}$. 
We thus have,\n\nf \u2208 \u02c6V\u201c \u02c6f ,(\u03c32\n\nsup\n\u03b4/4\u22121)\u00b7 \u02c6R( \u02c6f)\u201d\n\n|f (x) \u2212 \u02c6f (x)| = a \u2264 \u01eb\u2032.\n\nAn application of Lemma 3.5 completes the proof.\n\nWe conclude this section with a general result on the monotonicity of the empirical risk attained by\nconstrained ERM regressors. This property, which will be utilized in the next section, can be easily\nproved using a simple application of Jensen\u2019s inequality.\nLemma 4.4 (Monotonicity). Let F be a convex hypothesis space, \u2113 : Y \u00d7 Y \u2192 [0,\u221e), a convex\n\u01eb2 (cid:16) \u02c6R( \u02c6f\u01eb2,x0) \u2212 \u02c6R( \u02c6f )(cid:17) . The\nloss function, and 0 \u2264 \u01eb1 < \u01eb2, be given. Then, \u02c6R(f\u01eb1,x0) \u2212 \u02c6R( \u02c6f ) \u2264 \u01eb1\nresult also holds for the case 0 \u2265 \u01eb1 > \u01eb2.\n\n6\n\n\f5 Selective linear regression\n\nWe now restrict attention to linear least squares regression (LLSR), and, relying on Theorem 4.3 and\nLemma 4.4, as well as on known closed-form expressions for LLSR, we derive ef\ufb01cient implemen-\ntation of Strategy 1 and a new pointwise bound. Let X be an m \u00d7 d training sample matrix whose\nith row, xi \u2208 Rd, is a feature vector. Let y \u2208 Rm be a column vector of training labels.\nLemma 5.1 (ordinary least-squares estimate [15]). The ordinary least square (OLS) solution of\nthe following optimization problem, min\u03b2 kX\u03b2 \u2212 yk2, is given by \u02c6\u03b2 , (X T X)+X T y, where the\nsign + represents the pseudoinverse.\nLemma 5.2 (constrained least-squares estimate [15], page 166). Let x0 be a row vector and c a\nlabel. The constrained least-squares (CLS) solution of the following optimization problem\n\nminimize kX\u03b2 \u2212 yk2\n0 (x0(X T X)+xT\n\ns.t x0\u03b2 = c,\n\n0 )+(cid:16)c \u2212 x0 \u02c6\u03b2(cid:17) , where \u02c6\u03b2 is the OLS solution.\nis given by \u02c6\u03b2C (c) , \u02c6\u03b2 + (X T X)+xT\nTheorem 5.3. 
Let F be the class of linear regressors, and let \u02c6f be an ERM. Then, with probability\nof at least 1 \u2212 \u03b4 over choices on Sm, for any test point x0 we have,\n|f \u2217(x0) \u2212 \u02c6f (x0)| \u2264 kX \u02c6\u03b2 \u2212 yk\nProof. According to Lemma 4.4, for squared loss, \u02c6R( \u02c6f\u01eb,x0) is strictly monotonically increasing for\n\u01eb > 0, and decreasing for \u01eb < 0. Therefore, the equation, \u02c6R( \u02c6f\u01eb,x0) = \u02c6R( \u02c6f ) \u00b7 \u03c32\n\u03b4/4, where \u01eb is the\nunknown, has precisely two solutions for any \u03c3 > 1. Denoting these solutions by \u01eb1, \u01eb2 we get,\n\nkXKk q\u03c32\n\nwhere K = (X T X)+xT\n\n0 (x0(X T X)+xT\n\n\u03b4/4 \u2212 1,\n\n0 )+.\n\nsup\n\n\u01eb\u2208Rn|\u01eb| : \u02c6R( \u02c6f\u01eb,x0) \u2264 \u02c6R( \u02c6f ) \u00b7 \u03c32\n\n\u03b4/4o = max(|\u01eb1|,|\u01eb2|).\n\n1\n\nApplying Lemma 5.1 and 5.2 and setting c = X0 \u02c6\u03b2 + \u01eb, we obtain,\n1\nmkX \u02c6\u03b2 \u2212 yk2 \u00b7 \u03c32\n\nHence, kX \u02c6\u03b2 + XK\u01eb \u2212 yk2 = kX \u02c6\u03b2 \u2212 yk2 \u00b7 \u03c32\nyk2 \u00b7 (\u03c32\n\nmkX \u02c6\u03b2C(cid:16)x0 \u02c6\u03b2 + \u01eb(cid:17) \u2212 yk2 = \u02c6R( \u02c6f\u01eb,x0) = \u02c6R( \u02c6f ) \u00b7 \u03c32\n\u03b4/4 \u2212 1). We note that by applying Lemma 5.1 on (X \u02c6\u03b2 \u2212 y)T X, we get,\n(X \u02c6\u03b2 \u2212 y)T X =(cid:0)X T (cid:0)X(X T X)+X T y \u2212 y(cid:1)(cid:1)T\n\n= (X T y \u2212 X T y)T = 0.\n\u03b4/4 \u2212 1). Application of Theorem 4.3 completes the proof.\n\n\u03b4/4 =\n\nkXKk2\n\n\u03b4/4.\n\n\u03b4/4, so, 2(X \u02c6\u03b2 \u2212 y)T XK\u01eb + kXKk2\u01eb2 = kX \u02c6\u03b2 \u2212\n\nTherefore, \u01eb2 = kX \u02c6\u03b2\u2212yk2\n\n\u00b7 (\u03c32\n6 Numerical examples\n\nFocusing on linear least squares regression, we empirically evaluated the proposed method. Given a\nlabeled dataset we randomly extracted two disjoint subsets: a training set Sm, and a test set Sn. 
The\nselective regressor (f, g) is computed as follows. The regressor f is an ERM over Sm, and for any\ncoverage value \u03a6, the function g selects a subset of Sn of size n \u00b7 \u03a6, including all test points with\nlowest value of the bound in Theorem 5.3.2\nWe compare our method relative to the following simple and natural 1-nearest neighbor (NN) tech-\nnique for selection. Given the training set Sm and the test set Sn, let N N (x) denote the nearest\n\nneighbor of x in Sm, with corresponding \u03c1(x) , pkN N (x) \u2212 xk2 distance to x. These \u03c1(x)\ndistances, corresponding to all x \u2208 Sn, were used as alternative method to reject test points in\ndecreasing order of their \u03c1(x) values.\nWe tested the algorithm on 10 of the 14 LIBSVM [16] regression datasets. From this repository we\ntook all sets that are not too small and have reasonable feature dimensionality.3 Figure 1 depicts\n\n2We use here the theorem only for ranking test points, so any constant > 1 can be used instead of \u03c32\n3Two datasets having less than 200 samples, and two that have over 150,000 features were excluded.\n\n\u03b4/4.\n\n7\n\n\fresults obtained for \ufb01ve different datasets, each with training sample size m = 30, and test set size\nn = 200. The \ufb01gure includes a matrix of 2\u00d7 5 graphs. Each column corresponds to a single dataset.\nEach of the graphs on the \ufb01rst row shows the average absolute difference between the selective\nregressor (f, g) and the optimal regressor f \u2217 (taken as an ERM over the entire dataset) as a function\nof coverage, where the average is taken over the accepted instances. Our method appears in solid\nred line, and the baseline NN method, in dashed black line. Each curve point is an average over 200\nindependent trials (error bars represent standard error of the mean). It is evident that for all datasets\nthe average distance monotonically increases with coverage. 
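This ranking can be carried out with standard matrix operations. The following numpy sketch computes the Theorem 5.3 quantity, the bound value $\|X\hat{\beta} - y\| / \|XK\| \cdot \sqrt{\sigma^2 - 1}$ for each test point, and accepts the requested fraction of points with the lowest values. It is an illustrative reimplementation under our reading of the closed forms, not the authors' code; the constant `sigma2` is an arbitrary placeholder, since any constant greater than 1 yields the same ranking, as noted above.

```python
import numpy as np

def pointwise_bounds(X, y, X_test, sigma2=2.0):
    """Per-point bound of Theorem 5.3 for each row x0 of X_test:
    ||X beta - y|| / ||X K|| * sqrt(sigma2 - 1),
    with K = (X^T X)^+ x0^T (x0 (X^T X)^+ x0^T)^+ (scalar case)."""
    G = np.linalg.pinv(X.T @ X)           # (X^T X)^+
    beta = G @ X.T @ y                    # OLS solution (Lemma 5.1)
    resid = np.linalg.norm(X @ beta - y)  # ||X beta - y||, constant over x0
    bounds = []
    for x0 in X_test:
        # x0 (X^T X)^+ x0^T is a scalar; its pseudo-inverse is 1/s (s > 0 assumed)
        K = G @ x0 * (1.0 / (x0 @ G @ x0))
        bounds.append(resid / np.linalg.norm(X @ K) * np.sqrt(sigma2 - 1))
    return np.array(bounds)

def select(X, y, X_test, coverage):
    """g(x) = 1 for the fraction `coverage` of test points with the
    lowest bound values (the acceptance rule used in Section 6)."""
    b = pointwise_bounds(X, y, X_test)
    k = int(round(coverage * len(X_test)))
    g = np.zeros(len(X_test), dtype=bool)
    g[np.argsort(b)[:k]] = True
    return g
```

Since the residual norm is shared by all test points, the ranking depends only on $\|XK\|$; for a full-rank design this reduces to ranking by the leverage-like quantity $x_0 (X^T X)^{-1} x_0^T$, so points far from the bulk of the training data are rejected first.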
Furthermore, in all cases the proposed method significantly outperforms the NN baseline.

[Figure 1 appears here: a 2 × 5 grid of log-scale plots, one column per dataset (bodyfat, cadata, cpusmall, housing, space); the top row plots |f* − f| against coverage c, and the bottom row plots R(f, g) against coverage c.]

Figure 1: (top row) absolute difference between the selective regressor (f, g) and the optimal regressor f*. (bottom row) test error of the selective regressor (f, g). Our proposed method in solid red line and the baseline method in dashed black line. In all curves the y-axis has logarithmic scale.

Each of the graphs in the second row shows the test error of the selective regressor (f, g) as a function of coverage. This curve is known as the RC (risk-coverage) trade-off curve [6]. In this case we see again that the test error is monotonically increasing with coverage. 
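An RC curve of this kind can be produced from per-point test losses and any rejection ranking (e.g., the Theorem 5.3 bound values for our method, or the nearest-neighbor distances ρ(x) for the baseline). A minimal numpy sketch, with made-up loss and score values:

```python
import numpy as np

def rc_curve(losses, scores, coverages):
    """Selective test risk at each coverage level: accept the fraction
    c of points with the lowest rejection score, then average their
    losses (risk normalized by coverage, as in Section 2)."""
    order = np.argsort(scores)  # most confident (lowest score) first
    sorted_losses = np.asarray(losses, dtype=float)[order]
    risks = []
    for c in coverages:
        k = max(1, int(round(c * len(sorted_losses))))
        risks.append(sorted_losses[:k].mean())
    return np.array(risks)

losses = [0.1, 0.4, 0.2, 0.9]
scores = [1.0, 3.0, 2.0, 4.0]  # acceptance order: points 0, 2, 1, 3
print(rc_curve(losses, scores, [0.25, 0.5, 1.0]))
```

When the scores correlate with the losses, as here, the resulting risk values increase with coverage, which is the monotone RC behavior observed in Figure 1.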
In four datasets out of the \ufb01ve\nwe observe a clear domination of the entire RC curve, and in one dataset the performance of our\nmethod is statistically indistinguishable from that of the NN baseline method.\n\n7 Concluding remarks\n\nRooted in the centuries-old linear least squares method of Gauss and Legendre, regression estima-\ntion remains an indispensable routine in statistical analysis, modeling and prediction. This paper\nproposes a novel rejection technique allowing for a least squares regressor, learned from a \ufb01nite and\npossibly small training sample, to pointwise track, within its selected region of activity, the predic-\ntions of the globally optimal regressor in hindsight (from the same class). The resulting algorithm,\nwhich is motivated and derived entirely from the theory, is ef\ufb01cient and practical.\n\nImmediate plausible extensions are the handling of other types of regressions including regularized,\nand kernel regression, as well as extensions to other convex loss functions such as the epsilon-\ninsensitive loss. The presence of the \u01eb-disagreement coef\ufb01cient in our coverage bound suggests a\npossible relation to active learning, since the standard version of this coef\ufb01cient has a key role in\ncharacterizing the ef\ufb01ciency of active learning in classi\ufb01cation [17]. Indeed, a formal reduction of\nactive learning to selective classi\ufb01cation was recently found, whereby rejected points are precisely\nthose points to be queried in a stream based active learning setting. Moreover, \u201cfast\u201d coverage\nbounds in selective classi\ufb01cation give rise to fast rates in active learning [7]. 
Borrowing their in-\ntuition to our setting, one could consider devising a querying function for active regression that is\nbased on the pointwise bound of Theorem 5.3.\n\nAcknowledgments\n\nThe research leading to these results has received funding from both Intel and the European Union\u2019s\nSeventh Framework Programme under grant agreement n\u25e6 216886.\n\n8\n\n\fReferences\n\n[1] V. Vapnik. Statistical learning theory. 1998. Wiley, New York, 1998.\n[2] C.K. Chow. An optimum character recognition system using decision function. IEEE Trans.\n\nComputer, 6(4):247\u2013254, 1957.\n\n[3] C.K. Chow. On optimum recognition error and reject trade-off. IEEE Trans. on Information\n\nTheory, 16:41\u201336, 1970.\n\n[4] B. K\u00b4egl. Robust regression by boosting the median. Learning Theory and Kernel Machines,\n\n[5]\n\npages 258\u2013272, 2003.\n\u00a8O. Ays\u00b8eg\u00a8ul, G. Mehmet, A. Ethem, and H. T\u00a8urkan. Machine learning integration for predicting\nthe effect of single amino acid substitutions on protein stability. BMC Structural Biology, 9.\n\n[6] R. El-Yaniv and Y. Wiener. On the foundations of noise-free selective classi\ufb01cation. The\n\nJournal of Machine Learning Research, 11:1605\u20131641, 2010.\n\n[7] R. El-Yaniv and Y. Wiener. Active learning via perfect selective classi\ufb01cation. Journal of\n\nMachine Learning Research, 13:255\u2013279, 2012.\n\n[8] R. El-Yaniv and Y. Wiener. Agnostic selective classi\ufb01cation. In Neural Information Processing\n\nSystems (NIPS), 2011.\n\n[9] W.S. Lee. Agnostic Learning and Single Hidden Layer Neural Networks. PhD thesis, Aus-\n\ntralian National University, 1996.\n\n[10] V.N. Vapnik. An overview of statistical learning theory. Neural Networks, IEEE Transactions\n\non, 10(5):988\u2013999, 1999.\n\n[11] R.M. Kil and I. Koo. Generalization bounds for the regression of real-valued functions. 
In\nProceedings of the 9th International Conference on Neural Information Processing, volume 4,\npages 1766\u20131770, 2002.\n\n[12] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, pages\n\n353\u2013360, 2007.\n\n[13] S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Carnegie Mellon Uni-\n\nversity, 2009.\n\n[14] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In ICML\n\u201909: Proceedings of the 26th Annual International Conference on Machine Learning, pages\n49\u201356. ACM, 2009.\n\n[15] J.E. Gentle. Numerical linear algebra for applications in statistics. Springer Verlag, 1998.\n[16] C.C. Chang and C.J. Lin. LIBSVM: A library for support vector machines. ACM Trans-\nactions on Intelligent Systems and Technology, 2:27:1\u201327:27, 2011. Software available at\n\u201dhttp://www.csie.ntu.edu.tw/ cjlin/libsvm\u201d.\n\n[17] S. Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333\u2013361,\n\n2011.\n\n9\n\n\f", "award": [], "sourceid": 1018, "authors": [{"given_name": "Yair", "family_name": "Wiener", "institution": null}, {"given_name": "Ran", "family_name": "El-Yaniv", "institution": null}]}