{"title": "Active Regression by Stratification", "book": "Advances in Neural Information Processing Systems", "page_first": 469, "page_last": 477, "abstract": "We propose a new active learning algorithm for parametric linear regression with random design. We provide finite sample convergence guarantees for general distributions in the misspecified model. This is the first active learner for this setting that provably can improve over passive learning. Unlike other learning settings (such as classification), in regression the passive learning rate of O(1/epsilon) cannot in general be improved upon. Nonetheless, the so-called `constant' in the rate of convergence, which is characterized by a distribution-dependent risk, can be improved in many cases. For a given distribution, achieving the optimal risk requires prior knowledge of the distribution. Following the stratification technique advocated in Monte-Carlo function integration, our active learner approaches the optimal risk using piecewise constant approximations.", "full_text": "Active Regression by Stratification\n\nSivan Sabato\n\nDepartment of Computer Science\n\nBen Gurion University, Beer Sheva, Israel\n\nsabatos@cs.bgu.ac.il\n\nRemi Munos\u2217\n\nINRIA\n\nLille, France\n\nremi.munos@inria.fr\n\nAbstract\n\nWe propose a new active learning algorithm for parametric linear regression with\nrandom design. We provide finite sample convergence guarantees for general\ndistributions in the misspecified model. This is the first active learner for this setting\nthat provably can improve over passive learning. Unlike other learning settings\n(such as classification), in regression the passive learning rate of O(1/\u03b5) cannot\nin general be improved upon. Nonetheless, the so-called \u2018constant\u2019 in the rate\nof convergence, which is characterized by a distribution-dependent risk, can be\nimproved in many cases. 
For a given distribution, achieving the optimal risk\nrequires prior knowledge of the distribution. Following the stratification technique\nadvocated in Monte-Carlo function integration, our active learner approaches the\noptimal risk using piecewise constant approximations.\n\n1 Introduction\n\nIn linear regression, the goal is to predict the real-valued labels of data points in Euclidean space\nusing a linear function. The quality of the predictor is measured by the expected squared error of\nits predictions. In the standard regression setting with random design, the input is a labeled sample\ndrawn i.i.d. from the joint distribution of data points and labels, and the cost of data is measured by\nthe size of the sample. This model, which we refer to here as passive learning, is useful when both\ndata and labels are costly to obtain. However, in domains where raw data is very cheap to obtain, a\nmore suitable model is that of active learning (see, e.g., Cohn et al., 1994). In this model we assume\nthat random data points are essentially free to obtain, and the learner can choose, for any observed\ndata point, whether to ask also for its label. The cost of data here is the total number of requested\nlabels.\nIn this work we propose a new active learning algorithm for linear regression. We provide finite\nsample convergence guarantees for general distributions, under a possibly misspecified model. For\nparametric linear regression, the sample complexity of passive learning as a function of the excess\nerror \u03b5 is of the order O(1/\u03b5). This rate cannot in general be improved by active learning, unlike\nin the case of classification (Balcan et al., 2009). 
Nonetheless, the so-called \u2018constant\u2019 in this rate\nof convergence depends on the distribution, and this is where the potential improvement by active\nlearning lies.\nFinite sample convergence of parametric linear regression in the passive setting has been studied by\nseveral (see, e.g., Gy\u00a8or\ufb01 et al., 2002; Hsu et al., 2012). The standard approach is Ordinary Least\nSquares (OLS), where the output predictor is simply the minimizer of the mean squared error on the\nsample. Recently, a new algorithm for linear regression has been proposed (Hsu and Sabato, 2014).\nThis algorithm obtains an improved convergence guarantee under less restrictive assumptions. An\nappealing property of this guarantee is that it provides a direct and tight relationship between the\npoint-wise error of the optimal predictor and the convergence rate of the predictor. We exploit this to\n\n\u2217Current Af\ufb01liation: Google DeepMind.\n\n1\n\n\fallow our active learner to adapt to the underlying distribution. Our approach employs a strati\ufb01cation\ntechnique, common in Monte-Carlo function integration (see, e.g., Glasserman, 2004). For any \ufb01nite\npartition of the data domain, an optimal oracle risk can be de\ufb01ned, and the convergence rate of our\nactive learner approaches the rate de\ufb01ned by this risk. By constructing an in\ufb01nite sequence of\npartitions that become increasingly re\ufb01ned, one can approach the globally optimal oracle risk.\nActive learning for parametric regression has been investigated in several works, some of them in\nthe context of statistical experimental design. One of the earliest works is Cohn et al. (1996), which\nproposes an active learning algorithm for locally weighted regression, assuming a well-speci\ufb01ed\nmodel and an unbiased learning function. Wiens (1998, 2000) calculates a minimax optimal de-\nsign for regression given the marginal data distribution, assuming that the model is approximately\nwell-speci\ufb01ed. 
Kanamori (2002) and Kanamori and Shimodaira (2003) propose an active learning\nalgorithm that \ufb01rst calculates a maximum likelihood estimator and then uses this estimator to come\nup with an optimal design. Asymptotic convergence rates are provided under asymptotic normal-\nity assumptions. Sugiyama (2006) assumes an approximately well-speci\ufb01ed model and i.i.d. label\nnoise, and selects a design from a \ufb01nite set of possibilities. The approach is adapted to pool-based\nactive learning by Sugiyama and Nakajima (2009). Burbidge et al. (2007) propose an adaptation\nof Query By Committee. Cai et al. (2013) propose guessing the potential of an example to change\nthe current model. Ganti and Gray (2012) propose a consistent pool-based active learner for the\nsquared loss. A different line of research, which we do not discuss here, focuses on active learning\nfor non-parameteric regression, e.g. Efromovich (2007).\nOutline In Section 2 the formal setting and preliminaries are introduced. In Section 3 the notion of\nan oracle risk for a given distribution is presented. The strati\ufb01cation technique is detailed in Section\n4. The new active learner algorithm and its analysis are provided in Section 5, with the main result\nstated in Theorem 5.1. In Section 6 we show via a simple example that in some cases the active\nlearner approaches the maximal possible improvement over passive learning.\n\n2 Setting and Preliminaries\n\nWe assume a data space in Rd and labels in R. For a distribution P over Rd \u00d7 R, denote by\nsuppX (P ) the support of the marginal of P over Rd. Denote the strictly positive reals by R\u2217\n+.\nWe assume that labeled examples are distributed according to a distribution D. A random labeled\nexample is (X, Y ) \u223c D, where X \u2208 Rd is the example and Y \u2208 R is the label. Throughout this\nwork, whenever P[\u00b7] or E[\u00b7] appear without a subscript, they are taken with respect to D. 
D_X is\nthe marginal distribution of X in pairs drawn from D. The conditional distribution of Y when the\nexample is X = x is denoted D_{Y|x}. The function x \u21a6 D_{Y|x} is denoted D_{Y|X}.\nA predictor is a function from R^d to R that predicts a label for every possible example. Linear\npredictors are functions of the form x \u21a6 x\u22a4w for some w \u2208 R^d. The squared loss of w \u2208 R^d\nfor an example x \u2208 R^d with a true label y \u2208 R is \u2113((x, y), w) = (x\u22a4w \u2212 y)^2. The expected\nsquared loss of w with respect to D is L(w, D) = E_{(X,Y)\u223cD}[(X\u22a4w \u2212 Y)^2]. The goal of the\nlearner is to find a w such that L(w) is small. The optimal loss achievable by a linear predictor is\nL\u22c6(D) = min_{w\u2208R^d} L(w, D). We denote by w\u22c6(D) a minimizer of L(w, D) such that L\u22c6(D) =\nL(w\u22c6(D), D). In all these notations the parameter D is dropped when clear from context.\nIn the passive learning setting, the learner draws random i.i.d. pairs (X, Y) \u223c D. The sample\ncomplexity of the learner is the number of drawn pairs. In the active learning setting, the learner\ndraws i.i.d. examples X \u223c D_X. For any drawn example, the learner may draw a label according to\nthe distribution D_{Y|X}. The label complexity of the learner is the number of drawn labels. In this\nsetting it is easy to approximate various properties of D_X to any accuracy, with zero label cost. Thus\nwe assume for simplicity direct access to some properties of D_X, such as the covariance matrix of\nD_X, denoted \u03a3_D = E_{X\u223cD_X}[XX\u22a4], and expectations of some other functions of X. We assume\nw.l.o.g. that \u03a3_D is not singular. For a matrix A \u2208 R^{d\u00d7d} and x \u2208 R^d, denote \u2016x\u2016_A = \u221a(x\u22a4Ax). Let\nR_D^2 = max_{x\u2208supp_X(D)} \u2016x\u2016^2_{\u03a3_D^{\u22121}}. This is the condition number of the marginal distribution D_X. We have\n\nE[\u2016X\u2016^2_{\u03a3_D^{\u22121}}] = E[tr(X\u22a4\u03a3_D^{\u22121}X)] = tr(\u03a3_D^{\u22121}E[XX\u22a4]) = d.    (1)\n\nHsu and Sabato (2014) provide a passive learning algorithm for least squares linear regression with a\nminimax optimal sample complexity (up to logarithmic factors). The algorithm is based on splitting\nthe labeled sample into several subsamples, performing OLS on each of the subsamples, and then\nchoosing one of the resulting predictors via a generalized median procedure. We give here a useful\nversion of the result.1\nTheorem 2.1 (Hsu and Sabato, 2014). There are universal constants C, c, c\u2032, c\u2033 > 0 such that the\nfollowing holds. Let D be a distribution over R^d \u00d7 R. There exists an efficient algorithm that accepts\nas input a confidence \u03b4 \u2208 (0, 1) and a labeled sample of size n drawn i.i.d. from D, and returns\nw\u0302 \u2208 R^d, such that if n \u2265 c R_D^2 log(c\u2032n) log(c\u2033/\u03b4), with probability 1 \u2212 \u03b4,\n\nL(w\u0302, D) \u2212 L\u22c6(D) = \u2016w\u22c6(D) \u2212 w\u0302\u2016^2_{\u03a3_D} \u2264 (C log(1/\u03b4)/n) \u00b7 E_D[\u2016X\u2016^2_{\u03a3_D^{\u22121}} (Y \u2212 X\u22a4w\u22c6(D))^2].    (2)\n\nThis result is particularly useful in the context of active learning, since it provides an explicit\ndependence on the point-wise errors of the labels, including in heteroscedastic settings, where this\nerror is not uniform. As we see below, in such cases active learning can potentially gain over passive\nlearning. We denote an execution of the algorithm on a labeled sample S by w\u0302 \u2190 REG(S, \u03b4). 
The algorithm\nis used as a black box, thus any other algorithm with similar guarantees could be used instead.\nFor instance, similar guarantees might hold for OLS for a more restricted class of distributions.\nThroughout the analysis we omit for readability details of integer rounding, whenever the effects are\nnegligible. We use the notation O(exp), where exp is a mathematical expression, as shorthand\nfor c\u0304 \u00b7 exp + C\u0304 for some universal constants c\u0304, C\u0304 \u2265 0, whose values can vary between statements.\n\n3 An Oracle Bound for Active Regression\n\nThe bound in Theorem 2.1 crucially depends on the input distribution D. In an active learning\nframework, rejection sampling (Von Neumann, 1951) can be used to simulate random draws of\nlabeled examples according to a different distribution, without additional label costs. By selecting a\nsuitable distribution, it might be possible to improve over Eq. (2). Rejection sampling for regression\nhas been explored in Kanamori (2002); Kanamori and Shimodaira (2003); Sugiyama (2006) and\nothers, mostly in an asymptotic regime. Here we use the explicit bound in Eq. (2) to obtain new\nfinite sample guarantees that hold for general distributions.\nLet \u03c6 : R^d \u2192 R^\u2217_+ be a strictly positive weight function such that E[\u03c6(X)] = 1. We define the\ndistribution P_\u03c6 over R^d \u00d7 R as follows: For x \u2208 R^d, y \u2208 R, let \u0393_\u03c6(x, y) = {(x\u0303, y\u0303) \u2208 R^d \u00d7 R | x = x\u0303/\u221a(\u03c6(x\u0303)), y = y\u0303/\u221a(\u03c6(x\u0303))}, and define P_\u03c6 by\n\n\u2200(X, Y) \u2208 R^d \u00d7 R,  P_\u03c6(X, Y) = \u222b_{(X\u0303, Y\u0303)\u2208\u0393_\u03c6(X,Y)} \u03c6(X\u0303) dD(X\u0303, Y\u0303).\n\nA labeled i.i.d. sample drawn according to P_\u03c6 can be simulated using rejection sampling without\nadditional label costs (see Alg. 2 in Appendix B). 
We denote drawing m random labeled examples\naccording to P by S \u2190 SAMPLE(P, m). For the squared loss on P\u03c6 we have\n\n(cid:90)\n\n(cid:90)\n\u02dcY(cid:113)\n\n(cid:90)\n(cid:90)\n(cid:90)\n(cid:90)\n\nL(w, P\u03c6) =\n\n(\u2217)\n=\n\n=\n\n=\n\n(cid:96)((X, Y ), w) dP\u03c6(X, Y )\n\n(cid:96)((X, Y ), w)\n\n\u03c6( \u02dcX) dD( \u02dcX, \u02dcY )\n\n( \u02dcX, \u02dcY )\u2208\u0393\u03c6(X,Y )\n\n(X,Y )\u2208Rd\n\n(X,Y )\u2208Rd\n\n\u02dcX(cid:113)\n\n(cid:96)((\n\n,\n\n), w) \u03c6( \u02dcX) dD( \u02dcX, \u02dcY )\n\n( \u02dcX, \u02dcY )\u2208Rd\n\n\u03c6( \u02dcX)\n\n\u03c6( \u02dcX)\n\n(cid:96)((X, Y ), w) dD(X, Y ) = L(w, D).\n\n(X,Y )\u2208Rd\n\nThe equality (\u2217) can be rigorously derived from the de\ufb01nition of Lebesgue integration. It follows\nthat also L(cid:63)(D) = L(cid:63)(P\u03c6) and that w(cid:63)(D) = w(cid:63)(P\u03c6). We thus denote these by L(cid:63) and w(cid:63). In\n\n1This is a slight variation of the original result of Hsu and Sabato (2014), see Appendix A.\n\n3\n\n\fa similar manner, we have \u03a3P\u03c6 =(cid:82) XX(cid:62) dP\u03c6(X, Y ) =(cid:82) XX(cid:62) dD(X, Y ) = \u03a3D. From now on\n\nwe denote this matrix simply \u03a3. We denote (cid:107) \u00b7 (cid:107)\u03a3 by (cid:107) \u00b7 (cid:107), and (cid:107) \u00b7 (cid:107)\u03a3\u22121 by (cid:107) \u00b7 (cid:107)\u2217. The condition\nnumber of P\u03c6 is R2\nP\u03c6\nIf the regression algorithm is applied to n labeled examples drawn from the simulated P\u03c6, then by\nEq. 
(2) and the equalities above, with probability 1 \u2212 \u03b4, if n \u2265 cR2\n\nlog(c(cid:48)n) log(c(cid:48)(cid:48)/\u03b4)),\n\n= maxx\u2208suppX (D)\n\n(cid:107)x(cid:107)2\u2217\n\u03c6(x) .\n\nP\u03c6\n\nL( \u02c6w) \u2212 L(cid:63) \u2264 C \u00b7 log(1/\u03b4)\nC \u00b7 log(1/\u03b4)\n\nn\n\n=\n\nn\n\n\u00b7 EP\u03c6[(cid:107)X(cid:107)2\u2217(X(cid:62)w(cid:63) \u2212 Y )2]\n\u00b7 ED[(cid:107)X(cid:107)2\u2217(X(cid:62)w(cid:63) \u2212 Y )2/\u03c6(X)].\n\nDenote \u03c82(x) := (cid:107)x(cid:107)2\u2217 \u00b7 ED[(X(cid:62)w(cid:63) \u2212 Y )2 | X = x]. Further denote \u03c1(\u03c6) := ED[\u03c82(X)/\u03c6(X)],\nwhich we term the risk of \u03c6. Then, if n \u2265 cR2\n\nlog(c(cid:48)n) log(c(cid:48)(cid:48)/\u03b4), with probability 1 \u2212 \u03b4,\n\nP\u03c6\n\nL( \u02c6w) \u2212 L(cid:63) \u2264 C \u00b7 \u03c1(\u03c6) log(1/\u03b4)\n\n.\n\n(3)\nA passive learner essentially uses the default \u03c6, which is constantly 1, for a risk of \u03c1(1) = E[\u03c82(X)].\nBut the \u03c6 that minimizes the bound is the solution to the following minimization problem:\n\nn\n\nMinimize\u03c6\nsubject to\n\nE[\u03c82(X)/\u03c6(X)]\nE[\u03c6(X)] = 1,\n\u03c6(x) \u2265 c log(c(cid:48)n) log(c(cid:48)(cid:48)/\u03b4)\n\nn\n\n(cid:107)x(cid:107)2\u2217,\n\n\u2200x \u2208 suppX (D).\n\n(4)\n\nP\u03c6\n\nlog(c(cid:48)n) log(c(cid:48)(cid:48)/\u03b4). The following lemma\n\nThe second constraint is due to the requirement n \u2265 cR2\nbounds the risk of the optimal \u03c6. Its proof is provided in Appendix C.\nLemma 3.1. Let \u03c6(cid:63) be the solution to the minimization problem in Eq. (4). Then for n \u2265\nO(d log(d) log(1/\u03b4)), E2[\u03c8(X)] \u2264 \u03c1(\u03c6(cid:63)) \u2264 E2[\u03c8(X)](1 + O(d log(n) log(1/\u03b4)/n)).\nThe ratio between the risk of \u03c6(cid:63) and the risk of the default \u03c6 thus approaches E[\u03c82(X)]/E2[\u03c8(X)],\nand this is also the optimal factor of label complexity reduction. 
The ratio is 1 for highly symmetric\ndistributions, where the support of D_X is on a sphere and all the noise variances are identical. In\nthese cases, active learning is not helpful, even asymptotically. However, in the general case, this\nratio is unbounded, and so is the potential for improvement from using active learning. The crucial\nchallenge is that without access to the conditional distribution D_{Y|X}, Eq. (4) cannot be solved\ndirectly. We consider the oracle risk \u03c1\u22c6 = E^2[\u03c8(X)], which can be approached if an oracle divulges\nthe optimal \u03c6 and n \u2192 \u221e. The goal of the active learner is to approach the oracle guarantee without\nprior knowledge of D_{Y|X}.\n\n4 Approaching the Oracle Bound with Strata\n\nTo approximate the oracle guarantee, we borrow the stratification approach used in Monte-Carlo\nfunction integration (e.g., Glasserman, 2004). Partition supp_X(D) into K disjoint subsets A =\n{A_1, . . . , A_K}, and consider for \u03c6 only functions that are constant on each A_i and such that\nE[\u03c6(X)] = 1. Each of the functions in this class can be described by a vector a = (a_1, . . . , a_K) \u2208\n(R^\u2217_+)^K. The value of the function on x \u2208 A_i is a_i / \u2211_{j\u2208[K]} p_j a_j, where p_j := P[X \u2208 A_j]. Let \u03c6_a denote\na function defined by a, leaving the dependence on the partition A implicit. To calculate the risk of\n\u03c6_a, denote \u00b5_i := E[\u2016X\u2016^2_\u2217 (X\u22a4w\u22c6 \u2212 Y)^2 | X \u2208 A_i]. From the definition of \u03c1(\u03c6),\n\n\u03c1(\u03c6_a) = (\u2211_{j\u2208[K]} p_j a_j) \u00b7 (\u2211_{i\u2208[K]} (p_i/a_i) \u00b5_i).    (5)\n\nIt is easy to verify that a\u22c6 such that a\u22c6_i = \u221a\u00b5_i minimizes \u03c1(\u03c6_a), and\n\n\u03c1\u22c6_A := inf_{a\u2208R^K_+} \u03c1(\u03c6_a) = \u03c1(\u03c6_{a\u22c6}) = (\u2211_{i\u2208[K]} p_i \u221a\u00b5_i)^2.    (6)\n\n\u03c1\u22c6_A is the oracle risk for the fixed partition A. In comparison, the standard passive learner has risk\n\u03c1(\u03c6_1) = \u2211_{i\u2208[K]} p_i \u00b5_i. Thus, the ratio between the optimal risk and the default risk can be as large as\n1/min_i p_i. Note that here, as in the definition of \u03c1\u22c6 above, \u03c1\u22c6_A might not be achievable for samples\nup to a certain size, because of the additional requirement that \u03c6 not be too small (see Eq. (4)).\nNonetheless, this optimistic value is useful as a comparison.\nConsider an infinite sequence of partitions: for j \u2208 N, A^j = {A^j_1, . . . , A^j_{K_j}}, with K_j \u2192 \u221e.\nSimilarly to Carpentier and Munos (2012), under mild regularity assumptions, if the partitions have\ndiameters and probabilities that approach zero, then \u03c1\u22c6_{A^j} \u2192 \u03c1(\u03c6\u22c6), achieving the optimal upper\nbound for Eq. (3). For a fixed partition A, the challenge is then to approach \u03c1\u22c6_A without prior\nknowledge of the true \u00b5_i\u2019s, using relatively few extra labeled examples. In the next section we\ndescribe our active learning algorithm that does just that.\n\n5 Active Learning for Regression\n\nTo approach the optimal risk \u03c1\u22c6_A, we need a good estimate of \u00b5_i for i \u2208 [K]. Note that \u00b5_i depends on\nthe optimal predictor w\u22c6, therefore its value depends on the entire distribution. 
We assume that the\nerror of the label relative to the optimal predictor is bounded as follows: There exists a b \u2265 0 such\nthat (x\u22a4w\u22c6 \u2212 y)^2 \u2264 b^2\u2016x\u2016^2_\u2217 for all (x, y) in the support of D. This boundedness assumption can be\nreplaced by an assumption on sub-Gaussian tails with similar results. Our assumption implies also\nL\u22c6 = E[(x\u22a4w\u22c6 \u2212 y)^2] \u2264 b^2 E[\u2016X\u2016^2_\u2217] = b^2 d, where the last equality follows from Eq. (1).\n\nAlgorithm 1 Active Regression\ninput Confidence \u03b4 \u2208 (0, 1), label budget m, partition A.\noutput w\u0302 \u2208 R^d\n1: m_1 \u2190 m^{4/5}/2, m_2 \u2190 m^{4/5}/2, m_3 \u2190 m \u2212 (m_1 + m_2).\n2: \u03b4_1 \u2190 \u03b4/4, \u03b4_2 \u2190 \u03b4/4, \u03b4_3 \u2190 \u03b4/2.\n3: S_1 \u2190 SAMPLE(P_{\u03c6[\u03a3]}, m_1)\n4: v\u0302 \u2190 REG(S_1, \u03b4_1)\n5: \u2206 \u2190 \u221a(C d^2 b^2 log(1/\u03b4_1)/m_1); \u03b3 \u2190 (b + 2\u2206)^2 \u221a(K log(2K/\u03b4_2)/m_2); t \u2190 m_2/K.\n6: for i = 1 to K do\n7:   T_i \u2190 SAMPLE(Q_i, t).\n8:   \u00b5\u0303_i \u2190 \u0398_i \u00b7 ((1/t) \u2211_{(x,y)\u2208T_i} (|x\u22a4v\u0302 \u2212 y| + \u2206)^2 + \u03b3).\n9:   a\u0302_i \u2190 \u221a\u00b5\u0303_i.\n10: end for\n11: \u03be \u2190 c log(c\u2032m_3) log(c\u2033/\u03b4_3)/m_3\n12: Set \u03c6\u0302 such that for x \u2208 A_i, \u03c6\u0302(x) := \u2016x\u2016^2_\u2217 \u00b7 \u03be + (1 \u2212 d\u03be) a\u0302_i / \u2211_j p_j a\u0302_j.\n13: S_3 \u2190 SAMPLE(P_{\u03c6\u0302}, m_3).\n14: w\u0302 \u2190 REG(S_3, \u03b4_3).\n\nOur active regression algorithm, listed in Alg. 1, operates in three stages. In the first stage, the goal is\nto find a crude loss optimizer v\u0302, so as to later estimate \u00b5_i. 
To find this optimizer, the algorithm draws\na labeled sample of size m_1 from the distribution P_{\u03c6[\u03a3]}, where \u03c6[\u03a3](x) := (1/d) x\u22a4\u03a3^{\u22121}x = (1/d)\u2016x\u2016^2_\u2217.\nNote that \u03c1(\u03c6[\u03a3]) = d \u00b7 E[(X\u22a4w\u22c6 \u2212 Y)^2] = dL\u22c6. In addition, R^2_{P_{\u03c6[\u03a3]}} = d. Consequently, by Eq. (3),\napplying REG to m_1 \u2265 O(d log(d) log(1/\u03b4_1)) random draws from P_{\u03c6[\u03a3]} gets, with probability 1 \u2212 \u03b4_1,\n\nL(v\u0302) \u2212 L\u22c6 = \u2016v\u0302 \u2212 w\u22c6\u2016^2 \u2264 C d L\u22c6 log(1/\u03b4_1)/m_1 \u2264 C d^2 b^2 log(1/\u03b4_1)/m_1.    (7)\n\nIn Needell et al. (2013) a similar distribution is used to speed up gradient descent for convex losses.\nHere, we make use of \u03c6[\u03a3] as a stepping stone in order to approach the optimal \u03c6 at a rate that does\nnot depend on the condition number of D. Denote by E the event that Eq. (7) holds.\nIn the second stage, estimates for \u00b5_i, denoted \u00b5\u0303_i, are calculated from labeled samples that are drawn\nfrom another set of probability distributions, Q_i for i \u2208 [K]. These distributions are defined as\nfollows. Denote \u0398_i = E[\u2016X\u2016^4_\u2217 | X \u2208 A_i]. For x \u2208 R^d, y \u2208 R, let \u0393_i(x, y) = {(x\u0303, y\u0303) \u2208 A_i \u00d7 R | x = x\u0303/\u2016x\u0303\u2016_\u2217, y = y\u0303/\u2016x\u0303\u2016_\u2217}, and define Q_i by dQ_i(X, Y) = (1/\u0398_i) \u222b_{(X\u0303, Y\u0303)\u2208\u0393_i(X,Y)} \u2016X\u0303\u2016^4_\u2217 dD(X\u0303, Y\u0303).\nClearly, for all x \u2208 supp_X(Q_i), \u2016x\u2016_\u2217 = 1. Drawing labeled examples from Q_i can be done using\nrejection sampling, similarly to P_\u03c6. 
The use of the Q_i distributions in the second stage again helps\navoid a dependence on the condition number of D in the convergence rates.\nIn the last stage, a weight function \u03c6\u0302 is determined based on the estimated \u00b5\u0303_i. A labeled sample is\ndrawn from P_{\u03c6\u0302}, and the algorithm returns the predictor resulting from running REG on this sample.\nThe following theorem gives our main result, a finite sample convergence rate guarantee.\nTheorem 5.1. Let b \u2265 0 such that (x\u22a4w\u22c6 \u2212 y)^2 \u2264 b^2\u2016x\u2016^2_\u2217 for all (x, y) in the support of D. Let\n\u039b_D = E[\u2016X\u2016^4_\u2217]. If Alg. 1 is executed with \u03b4 and m such that m \u2265 O(d log(d) log(1/\u03b4))^{5/4}, then it\ndraws m labels, and with probability 1 \u2212 \u03b4,\n\nL(w\u0302) \u2212 L\u22c6 \u2264 C \u03c1\u22c6_A log(3/\u03b4)/m + O( (log(1/\u03b4)/m^{6/5}) \u03c1\u22c6_A + (d^{1/2} \u039b_D^{1/4} log^{5/4}(1/\u03b4)/m^{6/5}) b^{1/2} (\u03c1\u22c6_A)^{3/4} + (d \u039b_D^{1/2} K^{1/4} log^{1/4}(K/\u03b4) log(1/\u03b4)/m^{6/5}) b (\u03c1\u22c6_A)^{1/2} ).\n\nThe theorem shows that the learning rate of the active learner approaches the oracle rate for the given\npartition. With an infinite sequence of partitions with K an increasing function of m, the optimal\noracle risk can also be approached. The rate of convergence to the oracle rate does not depend on the\ncondition number of D, unlike the passive learning rate. In addition, m = O(d log(d) log(1/\u03b4))^{5/4}\nsuffices to approach the optimal rate, whereas m = \u03a9(d) is obviously necessary for any learner. It\nis interesting that also in active learning for classification, it has been observed that active learning\nin a non-realizable setting requires a super-linear dependence on d (see, e.g., Dasgupta et al., 2008).\nWhether this dependence is unavoidable for active regression is an open question. Theorem 5.1 is\nproved via a series of lemmas. 
First, we show that if \u02dc\u00b5i is a good approximation of \u00b5i then \u03c1A( \u02c6\u03c6)\ncan be bounded as a function of the oracle risk for A.\nLemma 5.2. Suppose m3 \u2265 O(d log(d) log(1/\u03b43)), and let \u02c6\u03c6 as in Alg. 1. If, for some \u03b1, \u03b2 \u2265 0,\n\n\u00b5i \u2264 \u02dc\u00b5i \u2264 \u00b5i + \u03b1i\n\n\u00b5i + \u03b2i,\n\n\u221a\n\n(cid:88)\n\n(8)\n\npi\u03b2i)1/2\u03c1(cid:63)A1/2).\n\n(cid:88)\n\ni\n\ni\n\npi\u03b1i)1/2\u03c1(cid:63)A3/4 + (\n\nProof. We have \u2200x \u2208 Ai, \u02c6\u03c6(x) \u2265 (1 \u2212 d\u03be)\n\u03c1( \u02c6\u03c6) \u2261 E[\u03c82(X)/ \u02c6\u03c6(X)] \u2264 1\n\nthen\n\u03c1A( \u02c6\u03c6) \u2264 (1 + O(d log(m3) log(1/\u03b43)/m3))(\u03c1(cid:63)A + (\n(cid:88)\n(cid:88)\npi\u00b5i/\u02c6ai = (1 +\n1\u2212d\u03be \u2264 2d\u03be. It follows\n\u03c1( \u02c6\u03c6) \u2264 (1 + O(d log(m3) log(1/\u03b43)/m3))\u03c1(\u03c6\u02c6a).\n\nFor m3 \u2265 O(d log(d) log(1/\u03b43)), d\u03be \u2264 1\n\n\u02c6ai(cid:80)\n(cid:88)\n(cid:88)\n\n2,2 therefore d\u03be\n\n1 \u2212 d\u03be\n\n1 \u2212 d\u03be\n\npj \u02c6aj\n\npj \u02c6aj\n\nj pj \u02c6aj\n\nm3\n\n=\n\n1\n\nj\n\nj\n\ni\n\ni\n\n, where \u03be = c log(c(cid:48)m3) log(c(cid:48)(cid:48)/\u03b4)\n\n. Therefore\n\npi \u00b7 E[\u03c82(X)/ \u02c6ai | X \u2208 Ai]\n\nd\u03be\n1 \u2212 d\u03be\n\n)\u03c1(\u03c6\u02c6a).\n\n(9)\n\nBy Eq. 
(8),\n\n\u03c1A(\u03c6\u02c6a) =\n\n(cid:88)\n\u2264(cid:88)\n(cid:88)\n\nj\n\nj\n\n= (\n\npj\n\n\u221a\npj(\n\u221a\n\npi\n\ni\n\n= \u03c1(cid:63)A + (\n\n(cid:88)\n\n(cid:112)\u02dc\u00b5j\n\ni\n\n\u00b5j +\n\n\u221a\n\npi\u00b5i/(cid:112)\u02dc\u00b5i\nj +(cid:112)\u03b2j)\n(cid:88)\n\n\u03b1j\u00b51/4\n\u221a\n\npj\n\n\u03b1j\u00b51/4\n\nj\n\n)(\n\n(cid:88)\n\n\u00b5i)2 + (\n\u221a\n\nj\n\npj\n\n\u03b1j\u00b51/4\n\nj\n\n)\u03c1(cid:63)A1/2 + (\n\npj\n\n(cid:88)\n(cid:88)\n(cid:88)\n\ni\n\ni\n\n\u221a\n\npi\n\n\u00b5i\n\u221a\n\n(cid:88)\n(cid:112)\u03b2j)\u03c1(cid:63)A1/2.\n\n\u00b5i) + (\n\nj\n\npi\n\n(cid:112)\u03b2j)(\n\n(cid:88)\n\npj\n\ni\n\n\u221a\n\npi\n\n\u00b5i).\n\n2Using the fact that m \u2265 O(d log(d) log(1/\u03b43)) implies m \u2265 O(d log(m) log(1/\u03b43)).\n\nj\n\nj\n\n6\n\n\fThe last equality is since \u03c1(cid:63)A = ((cid:80)\n((cid:80)\ni pi\u03b1i)1/2\u03c1(cid:63)A3/4. By Jensen\u2019s inequality,(cid:80)\n\ni pi\n\n\u221a\n\nand Eq. (9), the lemma directly follows.\n\n\u00b5i)2. By Cauchy-Schwartz, ((cid:80)\n\n(cid:112)\u03b2j \u2264 ((cid:80)\n\n) \u2264\nj pj\u03b2j)1/2. Combined with Eq. (6)\n\n\u03b1j\u00b51/4\n\nj pj\n\nj\n\nj pj\n\n\u221a\n\nWe now show that Eq. (8) holds and provide explicit values for \u03b1 and \u03b2. De\ufb01ne\n\n\u03bdi := \u0398i \u00b7 EQi[(|X(cid:62) \u02c6w \u2212 Y | + \u2206)2],\n\nand\n\n\u02c6\u03bdi :=\n\n\u0398i\nt\n\n(|x(cid:62) \u02c6w \u2212 y| + \u2206)2.\n\n(cid:88)\n\n(x,y)\u2208Ti\n\nNote that \u02dc\u00b5i = \u02c6\u03bdi + \u0398i\u03b3. We will relate \u02c6\u03bdi to \u03bdi, and then \u03bdi to \u00b5i, to conclude a bound of the\nform in Eq. (8) for \u02dc\u00b5i. 
First, note that if m_1 \u2265 O(d log(d) log(1/\u03b4_1)) and E holds, then for any\nx \u2208 \u222a_{i\u2208[K]} supp_X(Q_i),\n\n|x\u22a4v\u0302 \u2212 x\u22a4w\u22c6| \u2264 \u2016x\u2016_\u2217 \u2016v\u0302 \u2212 w\u22c6\u2016 \u2264 \u221a(C d^2 b^2 log(1/\u03b4_1)/m_1) \u2261 \u2206.    (10)\n\nThe second inequality stems from \u2016x\u2016_\u2217 = 1 for x \u2208 \u222a_{i\u2208[K]} supp_X(Q_i), and Eq. (7). This is useful\nin the following lemma, which relates \u03bd\u0302_i with \u03bd_i.\nLemma 5.3. Suppose that m_1 \u2265 O(d log(d) log(1/\u03b4_1)) and E holds. Then with probability 1 \u2212 \u03b4_2\nover the draw of T_1, . . . , T_K, for all i \u2208 [K], |\u03bd\u0302_i \u2212 \u03bd_i| \u2264 \u0398_i (b + 2\u2206)^2 \u221a(K log(2K/\u03b4_2)/m_2) \u2261 \u0398_i \u03b3.\n\nProof. For a fixed v\u0302, \u03bd\u0302_i/\u0398_i is the empirical average of i.i.d. samples of the random variable Z =\n(|X\u22a4v\u0302 \u2212 Y| + \u2206)^2, where (X, Y) is drawn according to Q_i. We now give an upper bound for Z\nwith probability 1. Let (X\u0303, Y\u0303) in the support of D such that X = X\u0303/\u2016X\u0303\u2016_\u2217 and Y = Y\u0303/\u2016X\u0303\u2016_\u2217.\nThen |X\u22a4w\u22c6 \u2212 Y| = |X\u0303\u22a4w\u22c6 \u2212 Y\u0303|/\u2016X\u0303\u2016_\u2217 \u2264 b. If E holds and m_1 \u2265 O(d log(d) log(1/\u03b4_1)),\n\nZ \u2264 (|X\u22a4v\u0302 \u2212 X\u22a4w\u22c6| + |X\u22a4w\u22c6 \u2212 Y| + \u2206)^2 \u2264 (b + 2\u2206)^2,\n\nwhere the last inequality follows from Eq. (10). By Hoeffding\u2019s inequality, for every i, with probability\n1 \u2212 \u03b4_2, |\u03bd\u0302_i \u2212 \u03bd_i| \u2264 \u0398_i (b + 2\u2206)^2 \u221a(log(2/\u03b4_2)/t). The statement of the lemma follows from a\nunion bound over i \u2208 [K] and t = m_2/K.\n\nThe following lemma, proved in Appendix D, provides the desired relationship between \u03bd_i and \u00b5_i.\n\u221a\nLemma 5.4. 
If $m_1 \\geq O(d \\log(d) \\log(1/\\delta_1))$ and $E$ holds, then $\\mu_i \\leq \\nu_i \\leq \\mu_i + 4\\Delta\\sqrt{\\Theta_i \\mu_i} + 4\\Delta^2 \\Theta_i$.\n\nWe are now ready to prove Theorem 5.1.\n\nProof of Theorem 5.1. From the condition on $m$ and the definition of $m_1, m_3$ in Alg. 1 we have $m_1 \\geq O(d \\log(d/\\delta_1))$ and $m_3 \\geq O(d \\log(d/\\delta_3))$. Therefore the inequalities in Lemma 5.4, Lemma 5.3 and Eq. (3) (with $n, \\delta, \\phi$ substituted with $m_3, \\delta_3, \\hat{\\phi}$) hold simultaneously with probability $1 - \\delta_1 - \\delta_2 - \\delta_3$. For Eq. (3), note that $\\hat{\\phi}(x)/\\|x\\|_* \\geq \\xi$, thus $m_3 \\geq c R^2_{P_{\\hat{\\phi}}} \\log(c' n) \\log(c''/\\delta_3)$ as required. Combining Lemma 5.4 and Lemma 5.3, and noting that $\\tilde{\\mu}_i = \\hat{\\nu}_i + \\Theta_i \\gamma$, we conclude that\n\n$$\\mu_i \\leq \\tilde{\\mu}_i \\leq \\mu_i + 4\\Delta\\sqrt{\\Theta_i \\mu_i} + \\Theta_i (4\\Delta^2 + 2\\gamma).$$\n\nBy Lemma 5.2, it follows that\n\n$$\\rho_A(\\hat{\\phi}) \\leq \\rho^\\star_A + 2\\sqrt{\\Delta} \\Big(\\sum_{i \\in [K]} p_i \\sqrt{\\Theta_i}\\Big)^{1/2} (\\rho^\\star_A)^{3/4} + \\sqrt{4\\Delta^2 + 2\\gamma} \\cdot \\Big(\\sum_{i \\in [K]} p_i \\Theta_i\\Big)^{1/2} (\\rho^\\star_A)^{1/2} + \\bar{O}\\Big(\\frac{\\log(m_3)}{m_3}\\Big)$$\n\n$$\\leq \\rho^\\star_A + 2\\Delta^{1/2} \\Lambda_D^{1/4} (\\rho^\\star_A)^{3/4} + \\sqrt{4\\Delta^2 + 2\\gamma} \\cdot \\Lambda_D^{1/2} (\\rho^\\star_A)^{1/2} + \\bar{O}(\\log(m_3)/m_3).$$\n\nThe last inequality follows since $\\sum_{i \\in [K]} p_i \\Theta_i = \\Lambda_D$ and, by Jensen\u2019s inequality, $\\sum_{i \\in [K]} p_i \\sqrt{\\Theta_i} \\leq \\Lambda_D^{1/2}$. We use $\\bar{O}$ to absorb parameters that already appear in the other terms of the bound. Combining this with Eq. (3),\n\n$$L(\\hat{w}) - L_\\star \\leq \\frac{C \\rho^\\star_A \\log(1/\\delta_3)}{m_3} + \\frac{C \\log(1/\\delta_3)}{m_3} \\Big( 2\\Delta^{1/2} \\Lambda_D^{1/4} (\\rho^\\star_A)^{3/4} + (2\\Delta + \\sqrt{2\\gamma}) \\cdot \\Lambda_D^{1/2} (\\rho^\\star_A)^{1/2} \\Big) + \\bar{O}\\Big(\\frac{\\log(m_3)}{m_3^2}\\Big).$$\n\nWe have $\\gamma = (b + 2\\Delta)^2 \\sqrt{K \\log(2K/\\delta_2)/m_2}$, and $\\Delta = \\sqrt{\\frac{C d^2 b^2 \\log(1/\\delta_1)}{m_1}}$. For $m_1 \\geq C d \\log(1/\\delta_1)$, $\\Delta \\leq b\\sqrt{d}$, thus $\\gamma \\leq b^2 (2\\sqrt{d} + 1)^2 \\sqrt{K \\log(2K/\\delta_2)/m_2}$. Substituting for $\\Delta$ and $\\gamma$, we have\n\n$$L(\\hat{w}) - L_\\star \\leq \\frac{C \\rho^\\star_A \\log(1/\\delta_3)}{m_3} + \\frac{C \\log(1/\\delta_3)}{m_3} \\Big(\\frac{16 C d^2 b^2 \\log(1/\\delta_1)}{m_1}\\Big)^{1/4} \\Lambda_D^{1/4} (\\rho^\\star_A)^{3/4}$$\n\n$$+ \\frac{C \\log(1/\\delta_3)}{m_3} \\Big( \\Big(\\frac{4 C d^2 b^2 \\log(1/\\delta_1)}{m_1}\\Big)^{1/2} + \\sqrt{2}\\, b (2\\sqrt{d} + 1) \\Big(\\frac{K \\log(2K/\\delta_2)}{m_2}\\Big)^{1/4} \\Big) \\cdot \\Lambda_D^{1/2} (\\rho^\\star_A)^{1/2} + \\bar{O}\\Big(\\frac{\\log(m_3)}{m_3^2}\\Big).$$\n\nTo get the theorem, set $m_3 = m - m^{4/5}$, $m_2 = m_1 = m^{4/5}/2$, $\\delta_1 = \\delta_2 = \\delta/4$, and $\\delta_3 = \\delta/2$.\n\n6 Improvement over Passive Learning\n\nTheorem 5.1 shows that our active learner approaches the oracle rate, which can be strictly faster than the rate implied by Theorem 2.1 for passive learning. To complete the picture, observe that this better rate cannot be achieved by any passive learner. This can be seen by the following 1-dimensional example. Let $\\sigma > 0$, $\\alpha > 1/\\sqrt{2}$, $p = \\frac{1}{2\\alpha^2}$, and $\\eta \\in \\mathbb{R}$ such that $|\\eta| \\leq \\sigma/\\alpha$. Let $D_\\eta$ be a distribution over $\\mathbb{R} \\times \\mathbb{R}$ such that with probability $p$, $X = \\alpha$ and $Y = \\alpha\\eta + \\epsilon$, where $\\epsilon \\sim N(0, \\sigma^2)$, and with probability $1 - p$, $X = \\beta := \\sqrt{\\frac{1 - p\\alpha^2}{1 - p}}$ and $Y = 0$. Then $E[X^2] = 1$ and $w_\\star = p\\alpha^2\\eta$. Consider a partition of $\\mathbb{R}$ such that $\\alpha \\in A_1$ and $\\beta \\in A_2$. Then $p_1 = p$, $\\mu_1 = E_\\epsilon[\\alpha^2 (\\epsilon + \\alpha\\eta - \\alpha w_\\star)^2] = \\alpha^2 (\\sigma^2 + \\alpha^2\\eta^2 (1 - p\\alpha^2)^2) \\leq \\frac{3}{2}\\alpha^2\\sigma^2$. In addition, $p_2 = 1 - p$ and $\\mu_2 = \\beta^4 w_\\star^2 = \\big(\\frac{1 - p\\alpha^2}{1 - p}\\big)^2 p^2\\alpha^4\\eta^2 \\leq \\frac{p^2\\alpha^2\\sigma^2}{4(1 - p)^2}$. The oracle risk is\n\n$$\\rho^\\star_A = (p_1\\sqrt{\\mu_1} + p_2\\sqrt{\\mu_2})^2 \\leq \\Big(p\\sqrt{\\tfrac{3}{2}}\\,\\alpha\\sigma + (1 - p)\\frac{p\\alpha\\sigma}{2(1 - p)}\\Big)^2 = p^2\\alpha^2\\sigma^2 \\Big(\\sqrt{\\tfrac{3}{2}} + \\tfrac{1}{2}\\Big)^2 \\leq 2p\\sigma^2.$$\n\nTherefore, for the active learner, with probability $1 - \\delta$,\n\n$$L(\\hat{w}) - L_\\star \\leq \\frac{2Cp\\sigma^2 \\log(1/\\delta)}{m} + o\\Big(\\frac{1}{m}\\Big). \\qquad (11)$$\n\nIn contrast, consider any passive learner that receives $m$ labeled examples and outputs a predictor $\\hat{w}$. Consider the estimator for $\\eta$ defined by $\\hat{\\eta} = \\frac{\\hat{w}}{p\\alpha^2}$. $\\hat{\\eta}$ estimates the mean of a Gaussian distribution with variance $\\sigma^2/\\alpha^2$. The minimax optimal rate for such an estimator is $\\frac{\\sigma^2}{\\alpha^2 n}$, where $n$ is the number of examples with $X = \\alpha$.$^3$ With probability at least 1/2, $n \\leq 2mp$. Therefore, $E_{D^m}[(\\hat{\\eta} - \\eta)^2] \\geq \\frac{\\sigma^2}{4\\alpha^2 m p}$. It follows that $E_{D^m}[L(\\hat{w}) - L_\\star] = E_{D^m}[(\\hat{w} - w_\\star)^2] = p^2\\alpha^4 \\cdot E[(\\hat{\\eta} - \\eta)^2] \\geq \\frac{p\\alpha^2\\sigma^2}{4m} = \\frac{\\sigma^2}{8m}$, where the last equality uses $p\\alpha^2 = 1/2$. Comparing this to Eq. (11), one can see that the ratio between the rate of the best passive learner and the rate of the active learner approaches $O(1/p)$ for large $m$.\n\n7 Discussion\n\nMany questions remain open for active regression. For instance, it is of particular interest whether the convergence rates provided here are the best possible for this model. 
Second, we consider here only the plain vanilla finite-dimensional regression; however, we believe that the approach can be extended to ridge regression in a general Hilbert space. Lastly, the algorithm uses static allocation of samples to stages and to partitions. In Monte-Carlo estimation (Carpentier and Munos, 2012), dynamic allocation has been used to provide convergence to a pseudo-risk with better constants. It is an open question whether this type of approach can be useful in the case of active regression.\n\n$^3$Since $|\\eta| \\leq \\sigma/\\alpha$, this rate holds when $\\sigma^2/n \\ll \\sigma^2/\\alpha^2$, that is $n \\gg \\alpha^2$ (Casella and Strawderman, 1981).\n\nReferences\n\nM. F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78\u201389, 2009.\n\nR. Burbidge, J. J. Rowland, and R. D. King. Active learning for regression based on query by committee. In Intelligent Data Engineering and Automated Learning-IDEAL 2007, pages 209\u2013218. Springer, 2007.\n\nW. Cai, Y. Zhang, and J. Zhou. Maximizing expected model change for active learning in regression. In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pages 51\u201360. IEEE, 2013.\n\nA. Carpentier and R. Munos. Minimax number of strata for online stratified sampling given noisy samples. In N. H. Bshouty, G. Stoltz, N. Vayatis, and T. Zeugmann, editors, Algorithmic Learning Theory, volume 7568 of Lecture Notes in Computer Science, pages 229\u2013244. Springer Berlin Heidelberg, 2012.\n\nG. Casella and W. E. Strawderman. Estimating a bounded normal mean. The Annals of Statistics, 9(4):870\u2013878, 1981.\n\nD. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15:201\u2013221, 1994.\n\nD. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. 
Journal of Artificial Intelligence Research, 4:129\u2013145, 1996.\n\nS. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 353\u2013360. MIT Press, 2008.\n\nS. Efromovich. Sequential design and estimation in heteroscedastic nonparametric regression. Sequential Analysis, 26(1):3\u201325, 2007.\n\nR. Ganti and A. G. Gray. UPAL: Unbiased pool based active learning. In International Conference on Artificial Intelligence and Statistics, pages 422\u2013431, 2012.\n\nP. Glasserman. Monte Carlo methods in financial engineering, volume 53. Springer, 2004.\n\nL. Gy\u00f6rfi, M. Kohler, A. Krzyzak, and H. Walk. A distribution-free theory of nonparametric regression. Springer, 2002.\n\nD. Hsu and S. Sabato. Heavy-tailed regression with a generalized median-of-means. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 37\u201345. JMLR Workshop and Conference Proceedings, 2014.\n\nD. Hsu, S. M. Kakade, and T. Zhang. Random design analysis of ridge regression. In Twenty-Fifth Conference on Learning Theory, 2012.\n\nT. Kanamori. Statistical asymptotic theory of active learning. Annals of the Institute of Statistical Mathematics, 54(3):459\u2013475, 2002.\n\nT. Kanamori and H. Shimodaira. Active learning algorithm using the maximum weighted log-likelihood estimator. Journal of Statistical Planning and Inference, 116(1):149\u2013162, 2003.\n\nD. Needell, N. Srebro, and R. Ward. Stochastic gradient descent and the randomized Kaczmarz algorithm. arXiv preprint arXiv:1310.5715, 2013.\n\nM. Sugiyama. Active learning in approximately linear regression based on conditional expectation of generalization error. The Journal of Machine Learning Research, 7:141\u2013166, 2006.\n\nM. Sugiyama and S. Nakajima. 
Pool-based active learning in approximate linear regression. Machine Learning, 75(3):249\u2013274, 2009.\n\nJ. Von Neumann. Various techniques used in connection with random digits. Applied Math Series, 12(36-38):1, 1951.\n\nD. P. Wiens. Minimax robust designs and weights for approximately specified regression models with heteroscedastic errors. Journal of the American Statistical Association, 93(444):1440\u20131450, 1998.\n\nD. P. Wiens. Robust weights and designs for biased regression models: Least squares and generalized M-estimation. Journal of Statistical Planning and Inference, 83(2):395\u2013412, 2000.\n", "award": [], "sourceid": 294, "authors": [{"given_name": "Sivan", "family_name": "Sabato", "institution": "Ben Gurion University"}, {"given_name": "Remi", "family_name": "Munos", "institution": "INRIA / MSR"}]}