{"title": "Lower Bounds for Passive and Active Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1026, "page_last": 1034, "abstract": "We develop unified information-theoretic machinery for deriving lower bounds for passive and active learning schemes. Our bounds involve the so-called Alexander's capacity function. The supremum of this function has been recently rediscovered by Hanneke in the context of active learning under the name of \"disagreement coefficient.\" For passive learning, our lower bounds match the upper bounds of Gine and Koltchinskii up to constants and generalize analogous results of Massart and Nedelec. For active learning, we provide first known lower bounds based on the capacity function rather than the disagreement coefficient.", "full_text": "Lower Bounds for Passive and Active Learning\n\nMaxim Raginsky\u2217\n\nCoordinated Science Laboratory\n\nUniversity of Illinois at Urbana-Champaign\n\nAlexander Rakhlin\n\nDepartment of Statistics\nUniversity of Pennsylvania\n\nAbstract\n\nWe develop uni\ufb01ed information-theoretic machinery for deriving lower bounds\nfor passive and active learning schemes. Our bounds involve the so-called Alexan-\nder\u2019s capacity function. The supremum of this function has been recently redis-\ncovered by Hanneke in the context of active learning under the name of \u201cdisagree-\nment coef\ufb01cient.\u201d For passive learning, our lower bounds match the upper bounds\nof Gin\u00b4e and Koltchinskii up to constants and generalize analogous results of Mas-\nsart and N\u00b4ed\u00b4elec. For active learning, we provide \ufb01rst known lower bounds based\non the capacity function rather than the disagreement coef\ufb01cient.\n\n1\n\nIntroduction\n\nNot all Vapnik-Chervonenkis classes are created equal. 
This was observed by Massart and N\u00b4ed\u00b4elec\n[24], who showed that, when it comes to binary classi\ufb01cation rates on a sample of size n under a\nmargin condition, some classes admit rates of the order 1/n while others only (log n)/n. The latter\nclasses were called \u201crich\u201d in [24]. As noted by Gin\u00b4e and Koltchinskii [15], the \ufb01ne complexity\nnotion that de\ufb01nes this \u201crichness\u201d is in fact embodied in Alexander\u2019s capacity function.1 Somewhat\nsurprisingly, the supremum of this function (called the disagreement coef\ufb01cient by Hanneke [19])\nplays a key role in risk bounds for active learning. The contribution of this paper is twofold. First, we\nprove lower bounds for passive learning based on Alexander\u2019s capacity function, matching the upper\nbounds of [15] up to constants. Second, we prove lower bounds for the number of label requests in\nactive learning in terms of the capacity function. Our proof techniques are information-theoretic in\nnature and provide a uni\ufb01ed tool to study active and passive learning within the same framework.\nActive and passive learning. Let (X ,A) be an arbitrary measurable space. Let (X, Y ) be a random\nvariable taking values in X \u00d7 {0, 1} according to an unknown distribution P = \u03a0 \u2297 PY |X, where\n\u03a0 denotes the marginal distribution of X. Here, X is an instance (or a feature, a predictor variable)\nand Y is a binary response (or a label). Classical results in statistical learning assume availability of\nan i.i.d. sample {(Xi, Yi)}n\ni=1 from P . In this framework, the learner is passive and has no control\non how this sample is chosen. The classical setting is well studied, and the following question has\nrecently received attention: do we gain anything if data are obtained sequentially, and the learner\nis allowed to modify the design distribution \u03a0 of the predictor variable before receiving the next\npair (Xi, Yi)? 
That is, can the learner actively use the information obtained so far to facilitate faster\nlearning?\nTwo paradigms often appear in the literature: (i) the design distribution is a Dirac delta function\nat some xi that depends on (xi\u22121, Y i\u22121), or (ii) the design distribution is a restriction of the orig-\ninal distribution to some measurable set. There is rich literature on both approaches, and we only\nmention a few results here. The paradigm (i) is closely related to learning with membership queries\n[21], generalized binary search [25], and coding with noiseless feedback [6]. The goal is to actively\nchoose the next xi so that the observed Yi \u223c PY |X=xi is suf\ufb01ciently \u201cinformative\u201d for the clas-\nsi\ufb01cation task. In this paradigm, the sample no longer provides information about the distribution\n\n\u2217Af\ufb01liation until January, 2012: Department of Electrical and Computer Engineering, Duke University.\n1To be precise, the capacity function depends on the underlying probability distribution.\n\n1\n\n\f\u03a0 (see [7] for further discussion and references). The setting (ii) is often called selective sampling\n[9, 13, 8], although the term active learning is also used. In this paradigm, the aim is to sequentially\nchoose subsets Di \u2286 X based on the observations prior to the ith example, such that the label Yi\nis requested only if Xi \u2208 Di. The sequence {Xi}n\ni=1 is assumed to be i.i.d., and so, form the view\npoint of the learner, the Xi is sampled from the conditional distribution \u03a0(\u00b7|Di).\nIn recent years, several interesting algorithms for active learning and selective sampling have ap-\npeared in the literature, most notably: the A2 algorithm of Balcan et al. [4], which explicitly main-\ntains Di as a \u201cdisagreement\u201d set of a \u201cversion space\u201d; the empirical risk minimization (ERM) based\nalgorithm of Dasgupta et al. 
[11], which maintains the set D_i implicitly through synthetic and real examples; and the importance-weighted active learning algorithm of Beygelzimer et al. [5], which constructs the design distribution through careful reweighting in the feature space. An insightful analysis has been carried out by Hanneke [20, 19], who distilled the role of the so-called disagreement coefficient in governing the performance of several of these active learning algorithms. Finally, Koltchinskii [23] analyzed active learning procedures using localized Rademacher complexities and Alexander's capacity function, which we discuss next.

Alexander's capacity function. Let F denote a class of candidate classifiers, where a classifier is a measurable function f : X → {0, 1}. Suppose the VC dimension of F is finite: VC-dim(F) = d. The loss (or risk) of f is its probability of error, R_P(f) := E_P[1{f(X) ≠ Y}] = P(f(X) ≠ Y). It is well known that the risk is globally minimized by the Bayes classifier f* = f*_P, defined by f*(x) := 1{2η(x) ≥ 1}, where η(x) := E[Y | X = x] is the regression function. Define the margin as h := inf_{x ∈ X} |2η(x) − 1|. If h > 0, we say the problem satisfies Massart's noise condition. We define the excess risk of a classifier f by E_P(f) := R_P(f) − R_P(f*), so that E_P(f) ≥ 0, with equality if and only if f = f* Π-a.s. Given ε ∈ (0, 1], define

F_ε(f*) := {f ∈ F : Π(f(X) ≠ f*(X)) ≤ ε},
D_ε(f*) := {x ∈ X : ∃ f ∈ F_ε(f*) s.t. f(x) ≠ f*(x)}.

The set F_ε consists of all classifiers f ∈ F that are ε-close to f* in the L1(Π) sense, while the set D_ε consists of all points x ∈ X for which there exists a classifier f ∈ F_ε that disagrees with the Bayes classifier f* at x. Alexander's capacity function [15] is defined as

τ(ε) := Π(D_ε(f*)) / ε,    (1)

that is, τ(ε) measures the relative size (in terms of Π) of the disagreement region D_ε compared to ε. Clearly, τ(ε) is always bounded above by 1/ε; however, in some cases τ(ε) ≤ τ_0 with τ_0 < ∞.

The function τ was originally introduced by Alexander [1, 2] in the context of exponential inequalities for empirical processes indexed by VC classes of functions, and Giné and Koltchinskii [15] generalized Alexander's results. In particular, they proved (see [15, p. 1213]) that, for a VC class of binary-valued functions with VC-dim(F) = d, the ERM solution f̂_n = argmin_{f ∈ F} (1/n) ∑_{i=1}^n 1{f(X_i) ≠ Y_i} under Massart's noise condition satisfies

E_P(f̂_n) ≤ C [ (d/(nh)) log τ(d/(nh^2)) + s/(nh) ]    (2)

with probability at least 1 − K s^{-1} e^{-s/K} for some constants C, K and any s > 0. The upper bound (2) suggests the importance of Alexander's capacity function for passive learning, leaving open the question of necessity. Our first contribution is a lower bound which matches the upper bound (2) up to constants, showing that, in fact, dependence on the capacity is unavoidable.

Recently, Koltchinskii [23] made an important connection between Hanneke's disagreement coefficient and Alexander's capacity function. 
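For intuition, definition (1) can be evaluated directly when X is finite. The following sketch is an illustration only: the toy class, marginal, and function names are our own and do not appear in the paper. It computes F_ε(f*), D_ε(f*), and τ(ε) for a class of singleton indicators:

```python
def capacity(F, f_star, Pi, eps):
    """Compute Alexander's capacity tau(eps) = Pi(D_eps(f*)) / eps.

    F      : list of classifiers, each a frozenset of points labeled 1
    f_star : the Bayes classifier, an element of F
    Pi     : dict mapping each point x to its marginal probability Pi(x)
    """
    # F_eps(f*): classifiers eps-close to f* in L1(Pi), i.e.
    # Pi(f(X) != f*(X)) = Pi(symmetric difference) <= eps
    F_eps = [f for f in F if sum(Pi[x] for x in f ^ f_star) <= eps]
    # D_eps(f*): points where some f in F_eps disagrees with f*
    D_eps = set().union(*(f ^ f_star for f in F_eps))
    return sum(Pi[x] for x in D_eps) / eps

# Toy example: X = {0,...,9}, uniform Pi, F = all singleton indicators.
X = range(10)
Pi = {x: 0.1 for x in X}
F = [frozenset([x]) for x in X]
f_star = frozenset([0])

# With eps = 0.2, every singleton is within eps of f_star (each symmetric
# difference has mass at most 0.2), so the disagreement region is all of X
# and tau(eps) attains its cap 1/eps = 5.
print(capacity(F, f_star, Pi, 0.2))
```

This toy class sits at the "rich" end of the scale, τ(ε) = 1/ε; a class for which D_ε(f*) stays small relative to ε would instead give a bounded τ(ε) ≤ τ_0.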
Under Massart's noise condition, Koltchinskii showed (see [23, Corollary 1]) that, for achieving an excess loss of ε with confidence 1 − δ, the number of queries issued by his active learning algorithm is bounded above by

C (τ_0 log(1/ε) / h^2) [d log τ_0 + log(1/δ) + log log(1/ε) + log log(1/h)],    (3)

where τ_0 = sup_{ε ∈ (0,1]} τ(ε) is Hanneke's disagreement coefficient. Similar bounds based on the disagreement coefficient have appeared in [19, 20, 11]. The second contribution of this paper is a lower bound on the expected number of queries based on Alexander's capacity τ(ε).

Comparison to known lower bounds. For passive learning, Massart and Nédélec [24] proved two lower bounds which, in fact, correspond to τ(ε) = 1/ε and τ(ε) = τ_0, the two endpoints on the complexity scale for the capacity function. Without the capacity function at hand, the authors emphasize that "rich" VC classes yield a larger lower bound. Our Theorem 1 below gives a unified construction for all possible complexities τ(ε).

In the PAC framework, the lower bound Ω(d/ε + (1/ε) log(1/δ)) goes back to [12]. It follows from our results that in the noisy version of the problem (h ≠ 1), the lower bound is in fact Ω((d/ε) log(1/ε) + (1/ε) log(1/δ)) for classes with τ(ε) = Ω(1/ε).

For active learning, Castro and Nowak [7] derived lower bounds, but without the disagreement coefficient and under a Tsybakov-type noise condition; that setting is outside the scope of this paper. Hanneke [19] proved a lower bound on the number of label requests specifically for the A2 algorithm in terms of the disagreement coefficient. 
In contrast, lower bounds of Theorem 2 are valid for any\nalgorithm and are in terms of Alexander\u2019s capacity function. Finally, a result by K\u00a8a\u00a8ari\u00a8ainen [22]\n(strengthened by [5]) gives a lower bound of \u2126(\u03bd2/\u03b52) where \u03bd = inf f\u2208F EP (f). A closer look\nat the construction of the lower bound reveals that it is achieved by considering a speci\ufb01c margin\nh = \u03b5/\u03bd. Such an analysis is somewhat unsatisfying, as we would like to keep h as a free parameter,\nnot necessarily coupled with the desired accuracy \u03b5. This point of view is put forth by Massart and\nN\u00b4ed\u00b4elec [24, p. 2329], who argue for a non-asymptotic analysis where all the parameters of the\nproblem are made explicit. We also feel that this gives a better understanding of the problem.\n\n2 Setup and main results\nWe suppose that the instance space X is a countably in\ufb01nite set. Also, log(\u00b7) \u2261 loge(\u00b7) throughout.\nDe\ufb01nition 1. Given a VC function class F and a margin parameter h \u2208 [0, 1], let C(F, h) denote\nthe class of all conditional probability distributions PY |X of Y \u2208 {0, 1} given X \u2208 X , such that:\n(a) the Bayes classi\ufb01er f\u2217 \u2208 F, and (b) the corresponding regression function satis\ufb01es the Massart\ncondition with margin h > 0.\nLet P(X ) denote the space of all probability measures on X . We now introduce Alexander\u2019s ca-\npacity function (1) into the picture. Whenever we need to specify explicitly the dependence of \u03c4(\u03b5)\non f\u2217 and \u03a0, we will write \u03c4(\u03b5; f\u2217, \u03a0). We also denote by T the set of all admissible capacity\nfunctions \u03c4 : (0, 1] \u2192 R+, i.e., \u03c4 \u2208 T if and only if there exist some f\u2217 \u2208 F and \u03a0 \u2208 P(X ), such\nthat \u03c4(\u03b5) = \u03c4(\u03b5; f\u2217, \u03a0) for all \u03b5 \u2208 (0, 1]. Without loss of generality, we assume \u03c4(\u03b5) \u2265 2.\nDe\ufb01nition 2. 
Given some \u03a0 \u2208 P(X ) and a pair (F, h) as in Def. 1, we let P(\u03a0,F, h) denote the set\nof all joint distributions of (X, Y ) \u2208 X \u00d7 {0, 1} of the form \u03a0 \u2297 PY |X, such that PY |X \u2208 C(F, h).\nMoreover, given an admissible function \u03c4 \u2208 T and some \u03b5 \u2208 (0, 1], we let P(\u03a0,F, h, \u03c4, \u03b5) denote\nthe subset of P(\u03a0,F, h), such that \u03c4(\u03b5; f\u2217, \u03a0) = \u03c4(\u03b5).\n\nFinally, we specify the type of learning schemes we will be dealing with.\nDe\ufb01nition 3. An n-step learning scheme S consists of the following objects: n conditional proba-\nbility distributions \u03a0(t)\n\nXt|X t\u22121,Y t\u22121, t = 1, . . . , n, and a mapping \u03c8 : X n \u00d7 {0, 1}n \u2192 F.\n\nThis de\ufb01nition covers the passive case if we let\nXt|X t\u22121,Y t\u22121(\u00b7|xt\u22121, yt\u22121) = \u03a0(\u00b7),\n\u03a0(t)\n\n\u2200(xt\u22121, yt\u22121) \u2208 X t\u22121 \u00d7 {0, 1}t\u22121\n\nas well as the active case, in which \u03a0(t)\nXt|X t\u22121,Y t\u22121 is the user-controlled design distribution for\nthe feature at time t given all currently available information. The learning process takes place\nsequentially as follows: At each time step t = 1, . . . , n, a random feature Xt is drawn accord-\nt=1 are collected, the learner computes the candidate classi\ufb01er (cid:98)fn = \u03c8(X n, Y n).\nX t\u22121,Y t\u22121(\u00b7|X t\u22121, Y t\u22121), and then a label Yt is drawn given Xt. After the n samples\ning to \u03a0(t)\n{(Xt, Yt)}n\nTo quantify the performance of such a scheme, we need the concept of an induced measure, which\ngeneralizes the set-up of [14]. Speci\ufb01cally, given some P = \u03a0 \u2297 PY |X \u2208 P(\u03a0,F, h), de\ufb01ne the\n\n3\n\n\ffollowing probability measure on X n \u00d7 {0, 1}n:\n\nn(cid:89)\n\nt=1\n\nPS(xn, yn) =\n\nPY |X(yt|xt)\u03a0(t)\n\nXt|X t\u22121,Y t\u22121(xt|xt\u22121, yt\u22121).\n\nDe\ufb01nition 4. Let Q be a subset of P(\u03a0,F, h). 
Given an accuracy parameter \u03b5 \u2208 (0, 1) and a\ncon\ufb01dence parameter \u03b4 \u2208 (0, 1), an n-step learning scheme S is said to (\u03b5, \u03b4)-learn Q if\n\nPS(cid:16)\n\nEP ((cid:98)fn) \u2265 \u03b5h\n\n(cid:17) \u2264 \u03b4.\n\nsup\nP\u2208Q\n\n(4)\n\nRemark 1. Leaving the precision as \u03b5h makes the exposition a bit cleaner in light of the fact that,\nunder Massart\u2019s noise condition with margin h, EP (f) \u2265 h(cid:107)f \u2212 f\u2217\nP (X))\n(cf. Massart and N\u00b4ed\u00b4elec [24, p. 2352]).\n\nP(cid:107)L1(\u03a0) = h\u03a0(f(X) (cid:54)= f\u2217\n\nWith these preliminaries out of the way, we can state the main results of this paper:\nTheorem 1 (Lower bounds for passive learning). Given any \u03c4 \u2208 T , any suf\ufb01ciently large d \u2208 N and\nany \u03b5 \u2208 (0, 1], there exist a probability measure \u03a0 \u2208 P(X ) and a VC class F with VC-dim(F) = d\nwith the following properties:\n(1) Fix any K > 1 and \u03b4 \u2208 (0, 1/2). If there exists an n-step passive learning scheme that (\u03b5/2, \u03b4)-\nlearns P(\u03a0,F, h, \u03c4, \u03b5) for some h \u2208 (0, 1 \u2212 K\u22121], then\n\n(cid:19)\n\nlog 1\n\u03b4\nK\u03b5h2\n\n(cid:18)(1 \u2212 \u03b4)d log \u03c4(\u03b5)\n(cid:18)(1 \u2212 \u03b4)d\n\nK\u03b5h2\n\n+\n\n(cid:19)\n\nn = \u2126\n\n.\n\n(5)\n\n(2) If there exists an n-step passive learning scheme that (\u03b5/2, \u03b4)-learns P(\u03a0,F, 1, \u03c4, \u03b5), then\n\nn = \u2126\n\n(6)\nTheorem 2 (Lower bounds for active learning). Given any \u03c4 \u2208 T , any suf\ufb01ciently large d \u2208 N and\nany \u03b5 \u2208 (0, 1], there exist a probability measure \u03a0 \u2208 P(X ) and a VC class F with VC-dim(F) = d\nwith the following property: Fix any K > 1 and any \u03b4 \u2208 (0, 1/2). 
If there exists an n-step active\nlearning scheme that (\u03b5/2, \u03b4)-learns P(\u03a0,F, h, \u03c4, \u03b5) for some h \u2208 (0, 1 \u2212 K\u22121], then\n\n\u03b5\n\n.\n\n(cid:18)(1 \u2212 \u03b4)d log \u03c4(\u03b5)\n\nKh2\n\nn = \u2126\n\n+\n\n\u03c4(\u03b5) log 1\n\u03b4\n\nKh2\n\n(cid:19)\n\n.\n\n(7)\n\nRemark 2. The lower bound in (6) is well-known and goes back to [12]. We mention it because it\nnaturally arises from our construction. In fact, there is a smooth transition between (5) and (6), with\nthe extra log \u03c4(\u03b5) factor disappearing as h approaches 1. As for the active learning lower bound, we\nconjecture that d log \u03c4(\u03b5) is, in fact, optimal, and the extra factor of \u03c40 in d\u03c40 log \u03c40 log(1/\u03b5) in (3)\narises from the use of a passive learning algorithm as a black box.\n\nThe remainder of the paper is organized as follows: Section 3 describes the required information-\ntheoretic tools, which are then used in Section 4 to prove Theorems 1 and 2. The proofs of a number\nof technical lemmas can be found in the Supplementary Material.\n\nInformation-theoretic framework\n\n3\nLet P and Q be two probability distributions on a common measurable space W. Given a convex\nfunction \u03c6 : [0,\u221e) \u2192 R such that \u03c6(1) = 0, the \u03c6-divergence2 between P and Q [3, 10] is given by\n\n(cid:90)\n\n(cid:19)\n\n(cid:18) dP/d\u00b5\n\ndQ/d\u00b5\n\nD\u03c6(P(cid:107)Q) (cid:44)\n\ndQ\nd\u00b5\n\n\u03c6\n\nW\n\nd\u00b5,\n\n(8)\n\nwhere \u00b5 is an arbitrary \u03c3-\ufb01nite measure that dominates both P and Q.3 For the special case of\nW = {0, 1}, when P and Q are the distributions of a Bernoulli(p) and a Bernoulli(q) random\n\n2We deviate from the standard term \u201cf-divergence\u201d since f is already reserved for a generic classi\ufb01er.\n3For instance, one can always take \u00b5 = P + Q. 
It is easy to show that the value of D_φ(P‖Q) in (8) does not depend on the choice of the dominating measure.

variable, we will denote their φ-divergence by

d_φ(p‖q) = q · φ(p/q) + (1 − q) · φ((1 − p)/(1 − q)).    (9)

Two particular choices of φ are of interest: φ(u) = u log u, which gives the ordinary Kullback–Leibler (KL) divergence D(P‖Q), and φ(u) = − log u, which gives the reverse KL divergence D(Q‖P), which we will denote by D_re(P‖Q). We will write d(·‖·) for the binary KL divergence.

Our approach makes fundamental use of the data processing inequality that holds for any φ-divergence [10]: if P and Q are two possible probability distributions for a random variable W ∈ W and if P_{Z|W} is a conditional probability distribution of some other random variable Z given W, then

D_φ(P_Z‖Q_Z) ≤ D_φ(P‖Q),    (10)

where P_Z (resp., Q_Z) is the marginal distribution of Z when W has distribution P (resp., Q).

Consider now an arbitrary n-step learning scheme S. Let us fix a finite set {f_1, . . . , f_N} ⊂ F and assume that to each m ∈ [N] we can associate a probability measure P^m = Π ⊗ P^m_{Y|X} ∈ P(Π, F, h) with the Bayes classifier f*_{P^m} = f_m. 
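As a numerical sanity check on the binary case (9), the following sketch (illustrative only; the function names are ours) verifies that φ(u) = u log u recovers the binary KL divergence d(p‖q), that φ(u) = − log u recovers the reverse direction d(q‖p), and that d(η‖1 − η) = h log((1 + h)/(1 − h)) for η = (1 + h)/2, the closed form used later in the proofs:

```python
import math

def d_phi(p, q, phi):
    """Binary phi-divergence (9): d_phi(p||q) = q*phi(p/q) + (1-q)*phi((1-p)/(1-q))."""
    return q * phi(p / q) + (1 - q) * phi((1 - p) / (1 - q))

kl_gen = lambda u: u * math.log(u)   # phi(u) = u log u   -> KL divergence
rkl_gen = lambda u: -math.log(u)     # phi(u) = -log u    -> reverse KL

def d_kl(p, q):
    """Binary KL divergence d(p||q) in its usual closed form."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p, q = 0.75, 0.25  # e.g. eta = (1+h)/2 versus 1-eta, with margin h = 1/2
assert abs(d_phi(p, q, kl_gen) - d_kl(p, q)) < 1e-12   # forward KL
assert abs(d_phi(p, q, rkl_gen) - d_kl(q, p)) < 1e-12  # reverse KL

# d(eta || 1-eta) = (2*eta - 1) * log(eta/(1-eta)) = h * log((1+h)/(1-h))
h = 2 * p - 1
assert abs(d_kl(p, 1 - p) - h * math.log((1 + h) / (1 - h))) < 1e-12
```

The last identity is what makes the margin h appear multiplicatively in the divergence calculations below.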
For each m \u2208 [N], let us de\ufb01ne the induced measure\n\nY |X(yt|xt)\u03a0(t)\nP m\n\nXt|X t\u22121,Y t\u22121(xt|xt\u22121, yt\u22121).\n\n(11)\n\nPS,m(xn, yn) (cid:44) n(cid:89)\n\nPm\n\nt=1\n\nMoreover, given any probability distribution \u03c0 over [N], let PS,\u03c0(m, xn, yn) (cid:44) \u03c0(m)PS,m(xn, yn).\nIn other words, PS,\u03c0 is the joint distribution of (M, X n, Y n) \u2208 [N] \u00d7 X n \u00d7 {0, 1}n, under which\nM \u223c \u03c0 and P(X n, Y n|M = m) = PS,m(X n, Y n).\nThe \ufb01rst ingredient in our approach is standard [27, 14, 24]. Let {f1, . . . , fN} be an arbitrary 2\u03b5-\npacking subset of F (that is, (cid:107)fi \u2212 fj(cid:107)L1(\u03a0) > 2\u03b5 for all i (cid:54)= j). Suppose that S satis\ufb01es (4) on\nsome Q that contains {P 1, . . . , P N}. Now consider\n\n(cid:99)M \u2261 (cid:99)M(X n, Y n) (cid:44) arg min\nLemma 1. With the above de\ufb01nitions, PS,\u03c0((cid:99)M (cid:54)= M) \u2264 \u03b4.\n\nThen the following lemma is easily proved using triangle inequality:\n\n(cid:107)(cid:98)fn \u2212 fm(cid:107)L1(\u03a0).\n\n1\u2264m\u2264N\n\n(12)\n\nThe second ingredient of our approach is an application of the data processing inequality (10) with a\njudicious choice of \u03c6. Let W (cid:44) (M, X n, Y n), let M be uniformly distributed over [N], \u03c0(m) = 1\nfor all m \u2208 [N], and let P be the induced measure PS,\u03c0. Then we have the following lemma (see\nalso [17, 16]):\nLemma 2. Consider any probability measure Q for W , under which M is distributed according to\n\u03c0 and independent of (X n, Y n). Let the divergence-generating function \u03c6 be such that the mapping\np (cid:55)\u2192 d\u03c6(p(cid:107)q) is nondecreasing on the interval [q, 1]. Then, assuming that \u03b4 \u2264 1 \u2212 1\nN ,\n\nN\n\n(cid:18)\n\n(cid:19)\n\n(cid:19)\n\n(cid:18) N \u03b4\n\nN \u2212 1\n\nD\u03c6(P(cid:107)Q) \u2265 1\nN\n\n\u00b7 \u03c6 (N(1 \u2212 \u03b4)) +\n\n1 \u2212 1\nN\n\n\u00b7 \u03c6\n\n.\n\n(13)\n\nProof. 
De\ufb01ne the indicator random variable Z = 1{cM =M}. Then P(Z = 1) \u2265 1 \u2212 \u03b4 by Lemma 1.\nOn the other hand, since Q can be factored as Q(m, xn, yn) = 1\n\nN(cid:88)\n\nm=1\n\nQ(M = m,(cid:99)M = m) =\n\nN(cid:88)\n\n(cid:88)\n\nm=1\n\nxn,yn\n\n1\nN\n\nN\n\nQX n,Y n(xn, yn), we have\nQX n,Y n(xn, yn)1{cM (xn,yn)=m} =\n\n1\nN\n\n.\n\nQ(Z = 1) =\n\nTherefore,\n\nD\u03c6(P(cid:107)Q) \u2265 D\u03c6(PZ(cid:107)QZ) = d\u03c6(P(Z = 1)(cid:107)Q(Z = 1)) \u2265 d\u03c6(1 \u2212 \u03b4(cid:107)1/N),\n\nwhere the \ufb01rst step is by the data processing inequality (10), the second is due to the fact that Z is\nbinary, and the third is by the assumed monotonicity property of \u03c6. Using (9), we arrive at (13).\n\nNext, we need to choose the divergence-generating function \u03c6 and the auxiliary distribution Q.\n\n5\n\n\fif \u03c6(u) behaves like \u2212 log u for small u, then the lower bounds will be of the form \u2126(cid:0)log 1\nthe marginals PM \u2261 \u03c0 and PX n,Y n \u2261 N\u22121(cid:80)N\n\nChoice of \u03c6.\nInspection of the right-hand side of (13) suggests that the usual \u2126(log N) lower\nbounds [14, 27, 24] can be obtained if \u03c6(u) behaves like u log u for large u. On the other hand,\nThese observations naturally lead to the respective choices \u03c6(u) = u log u and \u03c6(u) = \u2212 log u,\ncorresponding to the KL divergence D(P(cid:107)Q) and the reverse KL divergence Dre(P(cid:107)Q) = D(Q(cid:107)P).\nChoice of Q. One obvious choice of Q satisfying the conditions of the lemma is the product of\nPS,m: Q = PM \u2297 PX n,Y n. 
With this Q and\n\n(cid:1).\n\n\u03b4\n\nm=1\n\u03c6(u) = u log u, the left-hand side of (13) is given by\n\nD(P(cid:107)Q) = D(PM,X n,Y n(cid:107)PM \u2297 PX n,Y n) = I(M; X n, Y n),\n\n(14)\nwhere I(M; X n, Y n) is the mutual information between M and (X n, Y n) with joint distribution P.\nOn the other hand, it is not hard to show that the right-hand side of (13) can be lower-bounded by\n(1 \u2212 \u03b4) log N \u2212 log 2. Combining with (14), we get\n\nI(M; X n, Y n) \u2265 (1 \u2212 \u03b4) log N \u2212 log 2,\n\n(cid:19)\n\n(cid:18)\n\nwhich is (a commonly used variant of) the well-known Fano\u2019s inequality [14, Lemma 4.1], [18,\np. 1250], [27, p. 1571]. The same steps, but with \u03c6(u) = \u2212 log u, lead to the bound\n\u2212 log 2,\n\nL(M; X n, Y n) \u2265\n\nlog\n\nlog\n\n1\n\u03b4\n\n\u2212 log 2 \u2265 1\n2\n\n1\n\u03b4\n\n1 \u2212 1\nN\n\nwhere L(M; X n, Y n) (cid:44) Dre(PM,X n,Y n(cid:107)PM \u2297 PX n,Y n) is the so-called lautum information be-\ntween M and (X n, Y n) [26], and the second inequality holds whenever N \u2265 2.\nHowever, it is often more convenient to choose Q as follows. Fix an arbitrary conditional distribution\nQY |X of Y \u2208 {0, 1} given X \u2208 X . Given a learning scheme S, de\ufb01ne the probability measure\n\nQY |X(yt|xt)\u03a0(t)\nQS(xn, yn) for all m \u2208 [N].\n\nt=1\n\nXt|X t\u22121,Y t\u22121(xt|xt\u22121, yt\u22121)\n\nand let Q(m, xn, yn) = 1\nLemma 3. For each xn \u2208 X n and y \u2208 X , let N(y|xn) (cid:44) |{1 \u2264 t \u2264 n : xt = y}|. 
Then\n\nN\n\nQS(xn, yn) (cid:44) n(cid:89)\n\nD(P(cid:107)Q) =\n\nDre(P(cid:107)Q) =\n\n1\nN\n\n1\nN\n\nN(cid:88)\nN(cid:88)\n\nm=1\n\n(cid:88)\n(cid:88)\n\nx\u2208X\n\nm=1\n\nx\u2208X\n\n(15)\n\n(16)\n\n(17)\n\n(18)\n\nD(P m\n\nY |X(\u00b7|x)(cid:107)QY |X(\u00b7|x))EPS,m [N(x|X n)] ;\n\nDre(P m\n\nY |X(\u00b7|x)(cid:107)QY |X(\u00b7|x))EQ [N(x|X n)] .\n\n(cid:104)\n\n(cid:105)\nY |X(\u00b7|X)(cid:107)QY |X(\u00b7|X))\n\n,\n\nDre(P M\n\nMoreover, if the scheme S is passive, then Eq. (17) becomes\n\nDre(P(cid:107)Q) = n \u00b7 EXEM\nand the same holds for Dre replaced by D.\n\ntance dH(\u03b2, \u03b2(cid:48)) (cid:44)(cid:80)k\n\n4 Proofs of Theorems 1 and 2\nCombinatorial preliminaries. Given k \u2208 N, onsider the k-dimensional Boolean cube {0, 1}k =\n{\u03b2 = (\u03b21, . . . , \u03b2k) : \u03b2i \u2208 {0, 1}, i \u2208 [k]}. For any two \u03b2, \u03b2(cid:48) \u2208 {0, 1}k, de\ufb01ne their Hamming dis-\ni}. The Hamming weight of any \u03b2 \u2208 {0, 1}k is the number of its\nnonzero coordinates. For k > d, let {0, 1}k\nd denote the subset of {0, 1}k consisting of all binary\nstrings with Hamming weight d. We are interested in large separated and well-balanced subsets of\n{0, 1}k\nLemma 4. Suppose that d is even and k > 2d. Then, for d suf\ufb01ciently large, there exists a set\nMk,d \u2282 {0, 1}k\n6d ; (ii) dH(\u03b2, \u03b2(cid:48)) > d for\nany two distinct \u03b2, \u03b2(cid:48) \u2208 M(2)\n\nd with the following properties: (i) log |Mk,d| \u2265 d\n\nd. To that end, we will use the following lemma:\n\ni=1 1{\u03b2i(cid:54)=\u03b2(cid:48)\n\n4 log k\n\n\u03b2j \u2264 3d\n2k\n\n(19)\n\n(cid:88)\nk,d ; (iii) for any j \u2208 [k],\n\n1\n\n\u2264\n\nd\n2k\n\n|Mk,d|\n\n\u03b2\u2208Mk,d\n\n6\n\n\fProof of Theorem 1. Without loss of generality, we take X = N. 
Let k = d\u03c4(\u03b5) (we increase \u03b5 if\nnecessary to ensure that k \u2208 N), and consider the probability measure \u03a0 that puts mass \u03b5/d on each\nx = 1 through x = k and the remaining mass 1 \u2212 \u03b5\u03c4(\u03b5) on x = k + 1. (Recall that \u03c4(\u03b5) \u2264 1/\u03b5.)\nLet F be the class of indicator functions of all subsets of X with cardinality d. Then VC-dim(F) =\nd. We will focus on a particular subclass F(cid:48) of F. For each \u03b2 \u2208 {0, 1}k\nd, de\ufb01ne f\u03b2 : X \u2192 {0, 1}\nby f\u03b2(x) = \u03b2x if x \u2208 [k] and 0 otherwise, and take F(cid:48) = {f\u03b2 : \u03b2 \u2208 {0, 1}k\nd}. For p \u2208 [0, 1], let \u03bdp\ndenote the probability distribution of a Bernoulli(p) random variable. Now, to each f\u03b2 \u2208 F(cid:48) let us\nassociate the following conditional probability measure P \u03b2\n\nY |X(y|x) =(cid:2)\u03bd(1+h)/2(y)\u03b2x + \u03bd(1\u2212h)/2(y)(1 \u2212 \u03b2x)(cid:3) 1{x\u2208[k]} + 1{y=0}1{x(cid:54)\u2208[k]}\n\nY |X:\n\nP \u03b2\n\nIt is easy to see that each P \u03b2\n\nY |X belongs to C(F, h). Moreover, for any two f\u03b2, f\u03b2(cid:48) \u2208 F we have\n\n(cid:107)f\u03b2 \u2212 f\u03b2(cid:48)(cid:107)L1(\u03a0) = \u03a0(f\u03b2(X) (cid:54)= f\u03b2(cid:48)(X)) = \u03b5\nd\n\n1{\u03b2i(cid:54)=\u03b2(cid:48)\n\ni} \u2261 \u03b5\n\nd\n\ndH(\u03b2, \u03b2(cid:48)).\n\nk(cid:88)\n\ni=1\n\nd, the probability measure P \u03b2 = \u03a0 \u2297 P \u03b2\n\nHence, for each choice of f\u2217 = f\u03b2\u2217 \u2208 F we have F\u03b5(f\u03b2\u2217) = {f\u03b2 : dH(\u03b2, \u03b2\u2217) \u2264 d}. This implies\nthat D\u03b5(f\u03b2\u2217) = [k], and therefore \u03c4(\u03b5; f\u03b2\u2217 , \u03a0) = \u03a0([k])/\u03b5 = \u03c4(\u03b5). We have thus established that,\nY |X is an element of P(\u03a0,F, h, \u03c4, \u03b5).\nfor each \u03b2 \u2208 {0, 1}k\nd be the set described in Lemma 4, and let G (cid:44) {f\u03b2 : \u03b2 \u2208 Mk,d}. 
Finally, let $M_{k,d} \subset \{0,1\}^k$ be as constructed above. Then for any two distinct $\beta, \beta' \in M_{k,d}$ we have $\|f_\beta - f_{\beta'}\|_{L_1(\Pi)} = \frac{\varepsilon}{d}\, d_H(\beta,\beta') > \varepsilon$. Hence, $G$ is an $\varepsilon$-packing of $F'$ in the $L_1(\Pi)$-norm.

Now we are in a position to apply the lemmas of Section 3. Let $\{\beta^{(1)},\dots,\beta^{(N)}\}$, $N = |M_{k,d}|$, be a fixed enumeration of the elements of $M_{k,d}$. For each $m \in [N]$, let us denote by $P^m_{Y|X}$ the conditional probability measure $P^{\beta^{(m)}}_{Y|X}$, by $P^m$ the measure $\Pi \otimes P^m_{Y|X}$ on $\mathcal{X} \times \{0,1\}$, and by $f_m \in G$ the corresponding Bayes classifier. Now consider any $n$-step passive learning scheme that $(\varepsilon/2,\delta)$-learns $\mathcal{P}(\Pi, F, h, \tau, \varepsilon)$, and define the probability measure $\mathbf{P}$ on $[N] \times \mathcal{X}^n \times \{0,1\}^n$ by $\mathbf{P}(m, x^n, y^n) = \frac{1}{N} P_{S,m}(x^n, y^n)$, where $P_{S,m}$ is constructed according to (11). In addition, for every $\gamma \in (0,1)$ define the auxiliary measure $\mathbf{Q}_\gamma$ on $[N] \times \mathcal{X}^n \times \{0,1\}^n$ by $\mathbf{Q}_\gamma(m, x^n, y^n) = \frac{1}{N} Q^S_\gamma(x^n, y^n)$, where $Q^S_\gamma$ is constructed according to (15) with

$$Q^\gamma_{Y|X}(y|x) \triangleq \nu_\gamma(y)\,1_{\{x \in [k]\}} + 1_{\{y=0\}}\,1_{\{x \notin [k]\}}.$$

Applying Lemma 2 with $\varphi(u) = u \log u$, we can write

$$D(\mathbf{P}\,\|\,\mathbf{Q}_\gamma) \ge (1-\delta)\log N - \log 2 \ge \frac{(1-\delta)d}{4}\log\frac{k}{6d} - \log 2. \qquad (20)$$

Next we apply Lemma 3. Defining $\eta = \frac{1+h}{2}$ and using the easily proved fact that

$$D\big(P^m_{Y|X}(\cdot|x)\,\big\|\,Q^\gamma_{Y|X}(\cdot|x)\big) = \big[d(\eta\|\gamma) - d(1-\eta\|\gamma)\big]f_m(x) + d(1-\eta\|\gamma)\,1_{\{x\in[k]\}},$$

we get

$$D(\mathbf{P}\,\|\,\mathbf{Q}_\gamma) = n\varepsilon\,\big[d(\eta\|\gamma) + (\tau(\varepsilon)-1)\,d(1-\eta\|\gamma)\big]. \qquad (21)$$

Therefore, combining Eqs. (20) and (21) and using the fact that $k = d\tau(\varepsilon)$, we obtain

$$n \ge \frac{(1-\delta)\,d\log\frac{\tau(\varepsilon)}{6} - \log 16}{4\varepsilon\,\big[d(\eta\|\gamma) + (\tau(\varepsilon)-1)\,d(1-\eta\|\gamma)\big]}, \qquad \forall\gamma\in(0,1). \qquad (22)$$

This bound is valid for all $h \in (0,1]$, and the optimal choice of $\gamma$ for a given $h$ can be calculated in closed form: $\gamma^*(h) = \frac{1-h}{2} + \frac{h}{\tau(\varepsilon)}$. We now turn to the reverse KL divergence. First, suppose that $h \neq 1$. Lemma 2 gives $D_{\mathrm{re}}(\mathbf{P}\,\|\,\mathbf{Q}_{1-\eta}) \ge \frac{1}{2}\log\frac{1}{\delta} - \log 2$. On the other hand, using the fact that

$$D_{\mathrm{re}}\big(P^m_{Y|X}(\cdot|x)\,\big\|\,Q^{1-\eta}_{Y|X}(\cdot|x)\big) = d(\eta\|1-\eta)\,f_m(x) \qquad (23)$$

and applying Eq. (18), we can write

$$D_{\mathrm{re}}(\mathbf{P}\,\|\,\mathbf{Q}_{1-\eta}) = n\varepsilon\cdot d(\eta\|1-\eta) = n\varepsilon\cdot h\log\frac{1+h}{1-h}. \qquad (24)$$

We conclude that

$$n \ge \frac{\frac{1}{2}\log\frac{1}{\delta} - \log 2}{\varepsilon\, h\log\frac{1+h}{1-h}}. \qquad (25)$$

For $h = 1$, we get the vacuous bound $n \ge 0$.

Now we consider the two cases of Theorem 1. (1) For a fixed $K > 1$, it follows from the inequality $\log u \le u - 1$ that $h\log\frac{1+h}{1-h} \le Kh^2$ for all $h \in (0, 1-K^{-1}]$. Choosing $\gamma = \frac{1-h}{2} + \frac{h}{\tau(\varepsilon)}$ and using Eqs. (22) and (25), we obtain (5). (2) For $h = 1$, we use (22) with the optimal setting $\gamma^*(1) = 1/\tau(\varepsilon)$, which gives (6). The transition between $h = 1$ and $h \neq 1$ is smooth and determined by $\gamma^*(h) = \frac{1-h}{2} + \frac{h}{\tau(\varepsilon)}$.

Proof of Theorem 2. We work with the same construction as in the proof of Theorem 1. First, let $Q_{X^n,Y^n} \triangleq \frac{1}{N}\sum_{m=1}^N P_{S,m}$ and $\mathbf{Q} = \pi \otimes Q_{X^n,Y^n}$, where $\pi$ is the uniform distribution on $[N]$. Then, by convexity,

$$D(\mathbf{P}\,\|\,\mathbf{Q}) \le \frac{1}{N^2}\sum_{m,m'=1}^N \mathbb{E}_{\mathbf{P}}\left[\sum_{t=1}^n \log\frac{P^m_{Y|X}(Y_t|X_t)}{P^{m'}_{Y|X}(Y_t|X_t)}\right] \le n \max_{m,m'\in[N]}\, \max_{x\in[k]}\, D\big(P^m_{Y|X}(\cdot|x)\,\big\|\,P^{m'}_{Y|X}(\cdot|x)\big),$$

which is upper bounded by $nh\log\frac{1+h}{1-h}$. Applying Lemma 2 with $\varphi(u) = u\log u$, we therefore obtain

$$n \ge \frac{(1-\delta)\,d\log\frac{k}{6d} - \log 16}{4h\log\frac{1+h}{1-h}}. \qquad (26)$$

Next, consider the auxiliary measure $\mathbf{Q}_{1-\eta}$ with $\eta = \frac{1+h}{2}$. Then

$$\begin{aligned}
D_{\mathrm{re}}(\mathbf{P}\,\|\,\mathbf{Q}_{1-\eta})
&\stackrel{(a)}{=} \frac{1}{N}\sum_{m=1}^N \sum_{x=1}^k D_{\mathrm{re}}\big(P^m_{Y|X}(\cdot|x)\,\big\|\,Q^{1-\eta}_{Y|X}(\cdot|x)\big)\,\mathbb{E}_{\mathbf{Q}_{1-\eta}}[N(x|X^n)] \\
&\stackrel{(b)}{=} d(\eta\|1-\eta)\,\frac{1}{N}\sum_{m=1}^N \sum_{x=1}^k f_m(x)\,\mathbb{E}_{\mathbf{Q}_{1-\eta}}[N(x|X^n)] \\
&\stackrel{(c)}{=} d(\eta\|1-\eta)\,\frac{1}{N}\sum_{m=1}^N \mathbb{E}_{\mathbf{Q}_{1-\eta}}\left[\sum_{x=1}^k \beta^{(m)}_x N(x|X^n)\right] \\
&\stackrel{(d)}{\le} \frac{3}{2\tau(\varepsilon)}\, h\log\frac{1+h}{1-h}\,\mathbb{E}_{\mathbf{Q}_{1-\eta}}\left[\sum_{x=1}^k N(x|X^n)\right] \\
&\stackrel{(e)}{\le} \frac{3n}{2\tau(\varepsilon)}\, h\log\frac{1+h}{1-h},
\end{aligned}$$

where (a) is by Lemma 3, (b) is by (23), (c) is by definition of $\{f_m\}$, (d) is by the balance condition (19) satisfied by $M_{k,d}$, and (e) is by the fact that $\sum_{x=1}^k N(x|X^n) \le \sum_{x\in\mathcal{X}} N(x|X^n) = n$. Applying Lemma 2 with $\varphi(u) = -\log u$, we get

$$n \ge \frac{\tau(\varepsilon)\left(\log\frac{1}{\delta} - \log 4\right)}{3h\log\frac{1+h}{1-h}}. \qquad (27)$$

Combining (26) and (27) and using the bound $h\log\frac{1+h}{1-h} \le Kh^2$ for $h \in (0, 1-K^{-1}]$, we get (7).

References

[1] K. S. Alexander. Rates of growth and sample moduli for weighted empirical processes indexed by sets. Probability Theory and Related Fields, 75(3):379–423, 1987.
[2] K. S. Alexander. The central limit theorem for weighted empirical processes indexed by sets. Journal of Multivariate Analysis, 22(2):313–339, 1987.
[3] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. J. Roy. Stat. Soc. Ser. B, 28:131–142, 1966.
[4] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 65–72, New York, NY, USA, 2006. ACM.
[5] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In ICML. ACM, New York, NY, USA, 2009.
[6] M. V. Burnashev and K. S. Zigangirov. An interval estimation problem for controlled observations. Problemy Peredachi Informatsii, 10(3):51–61, 1974.
[7] R. M. Castro and R. D. Nowak. Minimax bounds for active learning. IEEE Trans. Inform. Theory, 54(5):2339–2353, 2008.
[8] G. Cavallanti, N. Cesa-Bianchi, and C. Gentile. Linear classification and selective sampling under low noise conditions. Advances in Neural Information Processing Systems, 21, 2009.
[9] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
[10] I. Csiszár. Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar., 2:299–318, 1967.
[11] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems, volume 20, page 2, 2007.
[12] A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82(3):247–261, 1989.
[13] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2):133–168, 1997.
[14] C. Gentile and D. P. Helmbold. Improved lower bounds for learning from noisy examples: an information-theoretic approach. Inform. Comput., 166:133–155, 2001.
[15] E. Giné and V. Koltchinskii. Concentration inequalities and asymptotic results for ratio type empirical processes. Ann. Statist., 34(3):1143–1216, 2006.
[16] A. Guntuboyina. Lower bounds for the minimax risk using f-divergences, and applications. IEEE Trans. Inf. Theory, 57(4):2386–2399, 2011.
[17] A. A. Gushchin. On Fano's lemma and similar inequalities for the minimax risk. Theory of Probability and Mathematical Statistics, pages 29–42, 2003.
[18] T. S. Han and S. Verdú. Generalizing the Fano inequality. IEEE Trans. Inf. Theory, 40(4):1247–1251, 1994.
[19] S. Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning, page 360. ACM, 2007.
[20] S. Hanneke. Rates of convergence in active learning. Ann. Statist., 39(1):333–361, 2011.
[21] T. Hegedűs. Generalized teaching dimensions and the query complexity of learning. In COLT '95, pages 108–117, New York, NY, USA, 1995. ACM.
[22] M. Kääriäinen. Active learning in the non-realizable case. In ALT, pages 63–77, 2006.
[23] V. Koltchinskii. Rademacher complexities and bounding the excess risk of active learning. J. Machine Learn. Res., 11:2457–2485, 2010.
[24] P. Massart and É. Nédélec. Risk bounds for statistical learning. Ann. Statist., 34(5):2326–2366, 2006.
[25] R. D. Nowak. The geometry of generalized binary search. Preprint, October 2009.
[26] D. P. Palomar and S. Verdú. Lautum information. IEEE Trans. Inform. Theory, 54(3):964–975, March 2008.
[27] Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. Ann. Statist., 27(5):1564–1599, 1999.
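As an illustrative aside (ours, not part of the paper): the two binary-KL identities that drive the calculations in Eqs. (21) and (24) above are easy to verify numerically. The sketch below assumes $d(p\|q)$ is the binary KL divergence and that, as in the construction, the conditional label law at a point $x \in [k]$ is Bernoulli($\eta$) when $f_m(x) = 1$ and Bernoulli($1-\eta$) when $f_m(x) = 0$, while $Q^\gamma_{Y|X}$ labels by Bernoulli($\gamma$):

```python
import math

def bkl(p, q):
    """Binary KL divergence d(p || q) in nats."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)

h = 0.3                  # margin parameter, any value in (0, 1)
eta = (1.0 + h) / 2.0    # eta = (1 + h) / 2, as in the proofs

# Identity behind Eq. (24): d(eta || 1 - eta) = h * log((1 + h) / (1 - h)).
assert abs(bkl(eta, 1.0 - eta) - h * math.log((1.0 + h) / (1.0 - h))) < 1e-12

# Per-point identity behind Eq. (21), restricted to x in [k]: the stated
# expression, affine in f_m(x), interpolates between d(eta || gamma)
# (when f_m(x) = 1) and d(1 - eta || gamma) (when f_m(x) = 0).
gamma = 0.25
for f in (0, 1):
    direct = bkl(eta if f else 1.0 - eta, gamma)
    affine = (bkl(eta, gamma) - bkl(1.0 - eta, gamma)) * f + bkl(1.0 - eta, gamma)
    assert abs(direct - affine) < 1e-12
```

The first assertion follows algebraically since $d(\eta\|1-\eta) = (2\eta - 1)\log\frac{\eta}{1-\eta}$; the check holds for any $h \in (0,1)$ and $\gamma \in (0,1)$, not just the sample values used here.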