{"title": "The Label Complexity of Active Learning from Observational Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1810, "page_last": 1819, "abstract": "Counterfactual learning from observational data involves learning a classifier on an entire population based on data that is observed conditioned on a selection policy. This work considers this problem in an active setting, where the learner additionally has access to unlabeled examples and can choose to get a subset of these labeled by an oracle. \n\nPrior work on this problem uses disagreement-based active learning, along with an importance weighted loss estimator to account for counterfactuals, which leads to a high label complexity. We show how to instead incorporate a more efficient counterfactual risk minimizer into the active learning algorithm. This requires us to modify both the counterfactual risk to make it amenable to active learning, as well as the active learning process to make it amenable to the risk. 
We provably demonstrate that the result of this is an algorithm which is statistically consistent as well as more label-efficient than prior work.", "full_text": "The Label Complexity of Active Learning from Observational Data

Songbai Yan
University of California San Diego
yansongbai@eng.ucsd.edu

Kamalika Chaudhuri
University of California San Diego
kamalika@cs.ucsd.edu

Tara Javidi
University of California San Diego
tjavidi@eng.ucsd.edu

Abstract

Counterfactual learning from observational data involves learning a classifier on an entire population based on data that is observed conditioned on a selection policy. This work considers this problem in an active setting, where the learner additionally has access to unlabeled examples and can choose to get a subset of these labeled by an oracle.

Prior work on this problem uses disagreement-based active learning, along with an importance weighted loss estimator to account for counterfactuals, which leads to a high label complexity. We show how to instead incorporate a more efficient counterfactual risk minimizer into the active learning algorithm. This requires us to modify both the counterfactual risk to make it amenable to active learning, as well as the active learning process to make it amenable to the risk. We provably demonstrate that the result of this is an algorithm which is statistically consistent as well as more label-efficient than prior work.

1 Introduction

Counterfactual learning from observational data is an emerging problem that arises naturally in many applications. In this problem, the learner is given observational data – a set of examples selected according to some policy along with their labels – as well as access to the policy that selects the examples, and the goal is to construct a classifier with high performance on an entire population, not just the observational data distribution.
An example is learning to predict if a treatment will be effective based on features of a patient. Here, we have some observational data on how the treatment works for patients that were assigned to it, but if the treatment is given only to a certain category of patients, then the data is not reflective of the population. Thus the main challenge in counterfactual learning is how to counteract the effect of the observation policy and build a classifier that applies more widely.

This work considers counterfactual learning in the active setting, which has received very recent attention in a few different contexts [25, 21, 3]. In addition to observational data, the learner has an online stream of unlabeled examples drawn from the underlying population distribution, and the ability to selectively label a subset of these in an interactive manner. The learner's goal is again to build a classifier while using as few label queries as possible. The advantage of the active setting over the passive one is its potential for more label-efficient solutions; the question, however, is how to achieve this algorithmically.

Prior work on this problem has looked at both probabilistic inference [21, 3] and standard classification [25], which is the setting of our work.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

[25] uses a modified version of disagreement-based active learning [7, 9, 4, 11], along with an importance weighted empirical risk to account for the population. However, a problem with this approach is that the importance weighted risk estimator can have extremely high variance when the importance weights – which reflect the inverse of how frequently an instance in the population is selected by the policy – are large; this may happen if, for example, certain patients are rarely given the treatment.
This high variance in turn results in a high label requirement for the learner.

The problem of high variance in the loss estimator is addressed in the passive case by minimizing a form of counterfactual risk [22] – an importance weighted loss that combines a variance regularizer and importance weight clipping or truncation to achieve low generalization error. A plausible solution is to use this risk for active learning as well. However, this cannot be readily achieved, for two reasons. The first is that the variance regularizer itself is a function of the entire dataset, and is therefore challenging to use in interactive learning where data arrives sequentially. The second is that the minimizer of the (expected) counterfactual risk depends on n, the data size, which again is inconvenient for learning in an interactive manner.

In this work, we address both challenges. To address the first, we use, instead of a variance regularizer, a novel regularizer based on the second moment; the advantage is that it decomposes across multiple segments of the data set, which makes it amenable to active learning. We provide generalization bounds for this modified counterfactual risk minimizer, and show that it has almost the same performance as counterfactual risk minimization with a variance regularizer [22]. The second challenge arises because disagreement-based active learning ensures statistical consistency by maintaining a set of plausible minimizers of the expected risk. This is problematic when the minimizer of the expected risk itself changes between iterations, as is the case with our modified regularizer.
We address this challenge by introducing a novel variant of disagreement-based active learning which is always guaranteed to maintain the population error minimizer in its plausible set. Additionally, to improve sample efficiency, we propose a third novel component – a new sampling algorithm for correcting sample selection bias that selectively queries labels of those examples which are underrepresented in the observational data. Combining these three components gives us a new algorithm. We prove that this newly proposed algorithm is statistically consistent – in the sense that it converges to the true minimizer of the population risk given enough data. We also analyze its label complexity, show that it is better than that of prior work [25], and demonstrate the contribution of each component of the algorithm to the label complexity bound.

2 Related Work

We consider learning with logged observational data where the logging policy that selects the samples to be observed is known to the learner. The standard approach is importance sampling to derive an unbiased loss estimator [19], but this is known to suffer from high variance. One common approach for reducing variance is to clip or truncate the importance weights [6, 22], and we provide a new principled method for choosing the clipping threshold with theoretical guarantees. Another approach is to add a regularizer based on empirical variance to the loss function, to favor models with low loss variance [17, 22, 18]. Our second moment regularizer achieves a similar effect, but has the advantage of being applicable to active learning with theoretical guarantees.

In this work, in addition to logged observational data, we allow the learner to actively acquire additional labeled examples.
The closest to our work is [25], the only known work in the same setting. Both [25] and our work use the disagreement-based active learning (DBAL) framework [7, 9, 4, 11] and multiple importance sampling [24] for combining actively acquired examples with logged observational data. [25] uses an importance weighted loss estimator, which leads to high variance and hence a high sample complexity. In our work, we incorporate a more efficient variance-controlled importance sampling scheme into active learning and show that it leads to a better label complexity.

[3] and [21] consider active learning for predicting individual treatment effect, which is similar to our task. They take a Bayesian approach which does not need to know the logging policy, but assumes the true model is from a known distribution family. Additionally, they do not provide label complexity bounds. A related line of research considers active learning for domain adaptation; these methods are mostly based on heuristics [20, 27], utilizing a clustering structure [14], or non-parametric methods [15]. In other related settings, [26] considers warm-starting contextual bandits with the aim of minimizing the cumulative regret instead of the final prediction error; [16] studies active learning with bandit feedback, without any logged observational data.

3 Problem Setup

We are given an instance space $\mathcal{X}$, a label space $\mathcal{Y} = \{-1, +1\}$, and a hypothesis class $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$. Let $D$ be an underlying data distribution over $\mathcal{X} \times \mathcal{Y}$. For simplicity, we assume $\mathcal{H}$ is a finite set, but our results can be generalized to VC classes by standard arguments [23, 18].

In the passive setting for learning with observational data, the learner has access to a logged observational dataset generated from the following process. First, $m$ examples $\{(X_t, Y_t)\}_{t=1}^{m}$ are drawn i.i.d. from $D$. Then a logging policy $Q_0 : \mathcal{X} \to [0, 1]$ that describes the probability of observing the label is applied.
In particular, for each example $(X_t, Y_t)$ ($1 \le t \le m$), an independent Bernoulli random variable $Z_t$ with expectation $Q_0(X_t)$ is drawn, and the label $Y_t$ is revealed to the learner if $Z_t = 1$.¹ We call $T_0 = \{(X_t, Y_t, Z_t)\}_{t=1}^{m}$ the logged dataset. We assume the learner knows the logging policy $Q_0$, and observes only the instances $\{X_t\}_{t=1}^{m}$, the indicators $\{Z_t\}_{t=1}^{m}$, and the revealed labels $\{Y_t \mid Z_t = 1\}_{t=1}^{m}$.

In the active learning setting, in addition to the logged dataset, the learner has access to a stream of online data. In particular, there is a stream of $n$ additional examples $\{(X_t, Y_t)\}_{t=m+1}^{m+n}$ drawn i.i.d. from the distribution $D$. At time $t$ ($m < t \le m+n$), the learner applies a query policy to compute an indicator $Z_t \in \{0, 1\}$, and the label $Y_t$ is then revealed if $Z_t = 1$. The computation of $Z_t$ may in general be randomized, and is based on the observed logged data $T_0$, the previously observed instances $\{X_i\}_{i=m+1}^{t}$, the decisions $\{Z_i\}_{i=m+1}^{t-1}$, and the observed labels $\{Y_i \mid Z_i = 1\}_{i=m+1}^{t-1}$.

We focus on the active learning setting, where the goal of the learner is to learn a classifier $h \in \mathcal{H}$ from the observed logged data and online data. Fixing $D$, $Q_0$, $m$, $n$, the performance is measured by: (1) the error rate $l(h) := \Pr_D(h(X) \ne Y)$ of the output classifier, and (2) the number of label queries on the online data. Note that the error rate is over the entire population $D$ instead of conditioned on the logging policy, and that we assume the labels of the logged data $T_0$ come at no cost. In this work, we are interested in the situation where $n$, the size of the online stream, is smaller than $m$.

Notation. Unless otherwise specified, all probabilities and expectations are over the draw of all random variables $\{(X_t, Y_t, Z_t)\}_{t=1}^{m+n}$. Define $q_0 = \inf_x Q_0(x)$. Define the optimal classifier $h^\star = \arg\min_{h \in \mathcal{H}} l(h)$ and $\nu = l(h^\star)$.
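To make the data-generating process above concrete, the following sketch simulates a logged dataset. The one-dimensional instance space, the labeling rule, and the particular logging policy Q0 are illustrative assumptions, not taken from the paper:

```python
import random

def make_logged_dataset(m, seed=0):
    """Simulate m draws (X_t, Y_t) from D, then apply a logging policy Q0.

    The label Y_t is revealed (Z_t = 1) with probability Q0(X_t); the learner
    sees every X_t and Z_t, but Y_t only when Z_t = 1. The toy choices of D
    and Q0 below are illustrative assumptions.
    """
    rng = random.Random(seed)
    q0 = lambda x: 0.9 if x < 0.5 else 0.1    # labels rarely logged for x >= 0.5
    logged = []
    for _ in range(m):
        x = rng.random()                       # X_t ~ Uniform[0, 1]
        y = 1 if x > 0.3 else -1               # a deterministic toy labeling
        z = 1 if rng.random() < q0(x) else 0   # Z_t ~ Bernoulli(Q0(X_t))
        logged.append((x, y if z == 1 else None, z))  # hide Y_t when Z_t = 0
    return logged

data = make_logged_dataset(1000)
```

Note that unconfoundedness holds by construction: Z_t depends on X_t only, never on Y_t.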
For any $r > 0$ and $h \in \mathcal{H}$, define the $r$-ball around $h$ as $B(h, r) = \{h' \in \mathcal{H} : \Pr(h(X) \ne h'(X)) \le r\}$. For any $C \subseteq \mathcal{H}$, define the disagreement region $\mathrm{DIS}(C) = \{x \in \mathcal{X} : \exists h_1 \ne h_2 \in C, h_1(x) \ne h_2(x)\}$.

Due to space limits, all proofs are postponed to the Appendix.

4 Variance-Controlled Importance Sampling for Passive Learning with Observational Data

In the passive setting, the standard method to overcome sample selection bias is to optimize the importance weighted (IW) loss $l(h, T_0) = \frac{1}{m}\sum_{t=1}^{m}\frac{\mathbf{1}\{h(X_t) \ne Y_t\}Z_t}{Q_0(X_t)}$. This loss is an unbiased estimator of the population error $\Pr(h(X) \ne Y)$, but its variance $\frac{1}{m}\mathbb{E}\big(\frac{\mathbf{1}\{h(X) \ne Y\}Z}{Q_0(X)} - l(h)\big)^2$ can be high, leading to poor solutions. Previous work addresses this issue by adding a variance regularizer [17, 22, 18] and by clipping/truncating the importance weight [6, 22]. However, the variance regularizer is challenging to use in interactive learning where data arrives sequentially, and it is unclear how the clipping/truncating threshold should be chosen to yield good theoretical guarantees.

In this paper, as an alternative to the variance regularizer, we propose a novel second moment regularizer which achieves an error bound similar to that of the variance regularizer [18]; this also motivates a principled choice of the clipping threshold.

¹This generating process implies the standard unconfoundedness assumption in the counterfactual inference literature: $\Pr(Y_t, Z_t \mid X_t) = \Pr(Y_t \mid X_t)\Pr(Z_t \mid X_t)$. In other words, the label $Y_t$ is conditionally independent of the action $Z_t$ (indicating whether the label is observed) given the instance $X_t$.

4.1 Second-Moment-Regularized Empirical Risk Minimization

Intuitively, between two classifiers with similarly small training loss $l(h, T_0)$, the one with lower variance should be preferred, since its population error $l(h)$
would be small with a higher probability than the one with higher variance. Existing work encourages low variance by regularizing the loss with the estimated variance $\widehat{\mathrm{Var}}(h, T_0) = \frac{1}{m}\sum_i\big(\frac{\mathbf{1}\{h(X_i) \ne Y_i\}Z_i}{Q_0(X_i)}\big)^2 - l(h, T_0)^2$. Here, we propose to regularize instead with the estimated second moment $\hat{V}(h, T_0) = \frac{1}{m}\sum_i\big(\frac{\mathbf{1}\{h(X_i) \ne Y_i\}Z_i}{Q_0(X_i)}\big)^2$, an upper bound on $\widehat{\mathrm{Var}}(h, T_0)$. We have the following generalization error bound for the regularized ERM.

Theorem 1. Let $\hat{h} = \arg\min_{h \in \mathcal{H}} l(h, T_0) + \sqrt{\frac{4\log\frac{|\mathcal{H}|}{\delta}}{m}\hat{V}(h, T_0)}$. For any $\delta > 0$, with probability at least $1 - \delta$,
$$l(\hat{h}) - l(h^\star) \le \frac{28\log\frac{|\mathcal{H}|}{\delta}}{3mq_0} + \sqrt{\frac{4\log\frac{|\mathcal{H}|}{\delta}}{m}\,\mathbb{E}\,\frac{\mathbf{1}\{h^\star(X) \ne Y\}}{Q_0(X)}} + \frac{4\log\frac{|\mathcal{H}|}{\delta}}{m^{\frac{3}{2}}q_0^2}.$$

Theorem 1 shows an error rate similar to the one for the variance regularizer [18]. However, the advantage of using the second moment is its decomposability: $\hat{V}(h, S_1 \cup S_2) = \frac{|S_1|}{|S_1|+|S_2|}\hat{V}(h, S_1) + \frac{|S_2|}{|S_1|+|S_2|}\hat{V}(h, S_2)$. This makes it easier to analyze for active learning, as we will discuss later.

Recall that for the unregularized importance sampling loss minimizer $\hat{h}_{\mathrm{IW}} = \arg\min_{h \in \mathcal{H}} l(h, T_0)$, the error bound is $\tilde{O}\Big(\frac{\log|\mathcal{H}|}{mq_0} + \sqrt{\frac{\log|\mathcal{H}|}{m}\min\big(\frac{l(h^\star)}{q_0}, \mathbb{E}\frac{1}{Q_0(X)}\big)}\Big)$ [8, 25]. In Theorem 1, the extra $\frac{4\log\frac{|\mathcal{H}|}{\delta}}{m^{3/2}q_0^2}$ term is due to the deviation of $\sqrt{\hat{V}(h, T_0)}$ around $\sqrt{\mathbb{E}\frac{\mathbf{1}\{h(X) \ne Y\}}{Q_0(X)}}$, and is negligible when $m$ is large. In this case, learning with a second moment regularizer gives a better generalization bound, since $\mathbb{E}\frac{\mathbf{1}\{h^\star(X) \ne Y\}}{Q_0(X)} \le \min\big(\frac{l(h^\star)}{q_0}, \mathbb{E}\frac{1}{Q_0(X)}\big)$.

This improvement in generalization error is due to the regularizer, rather than to a tighter analysis.
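As an illustration of the estimators behind Theorem 1, the sketch below computes the importance weighted loss, the second moment regularizer, and the regularized ERM over a small finite hypothesis class. The data layout (triples (x, y, z)) and the toy classifiers are illustrative assumptions:

```python
import math

def iw_loss(h, logged, q0):
    """IW loss l(h, T0) = (1/m) * sum of 1{h(X) != Y} * Z / Q0(X)."""
    m = len(logged)
    return sum((h(x) != y) / q0(x) for x, y, z in logged if z == 1) / m

def second_moment(h, logged, q0):
    """Second moment V_hat(h, T0) = (1/m) * sum of (1{h(X) != Y} * Z / Q0(X))**2.

    Since the indicator squared equals the indicator, each term is 1{...} / Q0(X)**2.
    """
    m = len(logged)
    return sum((h(x) != y) / q0(x) ** 2 for x, y, z in logged if z == 1) / m

def regularized_erm(hypotheses, logged, q0, delta=0.05):
    """arg min over h of  l(h, T0) + sqrt((4 log(|H|/delta) / m) * V_hat(h, T0))."""
    m = len(logged)
    coef = 4.0 * math.log(len(hypotheses) / delta) / m
    return min(hypotheses,
               key=lambda h: iw_loss(h, logged, q0)
                             + math.sqrt(coef * second_moment(h, logged, q0)))
```

The decomposability property, V̂(h, S1 ∪ S2) = (|S1| V̂(h, S1) + |S2| V̂(h, S2)) / (|S1| + |S2|), can be checked numerically on any split of the data; this is what makes the regularizer convenient when samples arrive in epochs.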
Similar to [17, 18], we show in Theorem 2 that for some distributions, the error bound in Theorem 1 cannot be achieved by any algorithm that simply optimizes the unregularized empirical loss.

Theorem 2. For any $0 < \nu < \frac{1}{3}$ and $m \ge \frac{49}{\nu^2}$, there is a sample space $\mathcal{X} \times \mathcal{Y}$, a hypothesis class $\mathcal{H}$, a distribution $D$, and a logging policy $Q_0$ such that $\frac{\nu}{q_0} > \mathbb{E}\frac{\mathbf{1}\{h^\star(X) \ne Y\}}{Q_0(X)}$, and such that with probability at least $\frac{1}{100}$ over the draw of $S = \{(X_t, Y_t, Z_t)\}_{t=1}^{m}$, if $\hat{h} = \arg\min_{h \in \mathcal{H}} l(h, S)$, then $l(\hat{h}) \ge l(h^\star) + \sqrt{\frac{\nu}{mq_0}}$.

4.2 Clipped Importance Sampling

The variance, and hence the error bound, for second-moment-regularized ERM can still be high if $\frac{1}{Q_0(x)}$ is large. This $\frac{1}{Q_0(X)}$ factor arises inevitably to guarantee that the importance weighted estimator is unbiased. Existing work alleviates the variance issue, at the cost of some bias, by clipping or truncating the importance weight. In this paper, we focus on clipping, where the loss estimator becomes $l(h; T_0, M) := \frac{1}{m}\sum_{i=1}^{m}\frac{\mathbf{1}\{h(X_i) \ne Y_i\}Z_i}{Q_0(X_i)}\mathbf{1}\big[\frac{1}{Q_0(X_i)} \le M\big]$. This estimator is no longer unbiased, but since the weight is clipped at $M$, the variance is controlled as well. Although clipping has been studied previously [6, 22], to the best of our knowledge it remains unclear how the clipping threshold $M$ can be chosen in a principled way.

We propose to choose $M_0 = \inf\big\{M' \ge 1 \mid \frac{2M'\log\frac{|\mathcal{H}|}{\delta}}{m} \ge \Pr_X\big(\frac{1}{Q_0(X)} > M'\big)\big\}$. This choice minimizes an error bound for the clipped second-moment-regularized ERM, as we formally show in Appendix E. Example 30 in Appendix E shows that this clipping threshold avoids outputting suboptimal classifiers. The choice of $M_0$ implies that the clipping threshold should grow as the sample size $m$ increases, which confirms the intuition that with a larger sample size the variance becomes less of an issue than the bias.
We have the following generalization error bound.

Theorem 3. Let $\hat{h} = \arg\min_{h \in \mathcal{H}} l(h; T_0, M_0) + \sqrt{\frac{4\log\frac{|\mathcal{H}|}{\delta}}{m}\hat{V}(h; T_0, M_0)}$. For any $\delta > 0$, with probability at least $1 - \delta$,
$$l(\hat{h}) - l(h^\star) \le \frac{34\log\frac{|\mathcal{H}|}{\delta}}{3m}M_0 + \frac{4\log\frac{|\mathcal{H}|}{\delta}}{m^{\frac{3}{2}}}M_0^2 + \sqrt{\frac{4\log\frac{|\mathcal{H}|}{\delta}}{m}\,\mathbb{E}\,\frac{\mathbf{1}\{h^\star(X) \ne Y\}}{Q_0(X)}\mathbf{1}\Big[\frac{1}{Q_0(X)} \le M_0\Big]}.$$

We always have $M_0 \le \frac{1}{q_0}$, since $\Pr_X\big(\frac{1}{Q_0(X)} > \frac{1}{q_0}\big) = 0$. Thus, this error bound is asymptotically never worse than the one without clipping.

5 Active Learning with Observational Data

Next, we consider active learning, where in addition to a logged observational dataset the learner has access to a stream of unlabeled samples from which it can actively query for labels. The main challenges are how to control the variance due to the observational data with active learning, and how to leverage the logged observational data to reduce the number of label queries beyond simply using it for warm-starting.

To address these challenges, we first propose a nontrivial change to Disagreement-Based Active Learning (DBAL) so that the variance-controlled importance sampling objective can be incorporated. This modified algorithm also works in a general cost-sensitive active learning setting, which we believe is of independent interest. Second, we show how to combine logged observational data with active learning through multiple importance sampling (MIS). Finally, we propose a novel sample selection bias correction technique that queries regions under-explored in the observational data more frequently.
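As a brief aside before turning to the active techniques: the clipping-threshold rule of Section 4.2 can be implemented directly. In the sketch below, the tail probability Pr(1/Q0(X) > M') is estimated empirically from observed instances; this empirical stand-in and the candidate grid are illustrative assumptions, not the paper's exact procedure:

```python
import math

def choose_clipping_threshold(xs, q0, n_hypotheses, delta=0.05):
    """Approximate M0 = inf{ M' >= 1 : (2 M' log(|H|/delta)) / m >= Pr(1/Q0(X) > M') }.

    The tail probability is estimated from the observed instances xs (the
    learner sees every x even when the label is hidden). Since the left side
    grows in M' and the tail shrinks, the first candidate that satisfies the
    inequality approximates the infimum.
    """
    m = len(xs)
    inv = sorted(1.0 / q0(x) for x in xs)         # 1/Q0(X) >= 1 for probabilities
    coef = 2.0 * math.log(n_hypotheses / delta) / m
    for M in [1.0] + inv:
        tail = sum(v > M for v in inv) / m
        if coef * M >= tail:
            return M
    return inv[-1]                                 # clip at the maximum weight

def clipped_iw_loss(h, logged, q0, M):
    """l(h; T0, M) = (1/m) sum 1{h(X) != Y} * Z / Q0(X) * 1[1/Q0(X) <= M]."""
    m = len(logged)
    return sum((h(x) != y) / q0(x)
               for x, y, z in logged if z == 1 and 1.0 / q0(x) <= M) / m
```

Consistent with the discussion above, a larger sample size m shrinks `coef`, so the returned threshold grows and clipping bites less.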
We provide theoretical analysis demonstrating that the proposed method gives better label complexity guarantees than previous work [25] and alternative methods.

Key Technique 1: Disagreement-Based Active Learning with Variance-Controlled Importance Sampling

The DBAL framework is a widely used general framework for active learning [7, 9, 4, 11]. This framework iteratively maintains a candidate set $C_t$ as a confidence set for the optimal classifier. A disagreement region $D_t$ is then defined accordingly as the set of instances on which two classifiers in $C_t$ predict labels differently. At each iteration, the algorithm draws a set of unlabeled instances. The labels of instances falling inside the disagreement region are queried; otherwise, the labels are inferred according to the unanimous prediction of the candidate set. These instances, with their inferred or queried labels, are then used to shrink the candidate set.

The classical DBAL framework only considers the unregularized 0-1 loss. As discussed in the previous section, with observational data, the unregularized loss leads to a suboptimal label complexity. However, directly adding a regularizer breaks the statistical consistency of DBAL, since the proof of its consistency is contingent on two properties: (1) the minimizer of the population loss $l(h)$ stays in all candidate sets with high probability; (2) the loss difference $l(h_1, S) - l(h_2, S)$ for any $h_1, h_2 \in C_t$ does not change no matter how examples outside the disagreement region $D_t$ are labeled.

Unfortunately, if we add a variance-based regularizer (either the estimated variance or the second moment), the objective function $l(h, S) + \sqrt{\frac{\lambda}{n}\hat{V}(h, S)}$ has to change as the sample size $n$ increases, and so does the optimal classifier with respect to the regularized population loss, $\tilde{h}_n = \arg\min l(h) + \sqrt{\frac{\lambda}{n}V(h)}$. Consequently, $\tilde{h}_n$ may not stay in all candidate sets.
Besides, the difference of the regularized loss, $l(h_1, S) + \sqrt{\frac{\lambda}{n}\hat{V}(h_1, S)} - \big(l(h_2, S) + \sqrt{\frac{\lambda}{n}\hat{V}(h_2, S)}\big)$, changes if the labels of examples outside the disagreement region $D_t$ are modified, breaking the second property.

To resolve these consistency issues, we first carefully choose the definition of the candidate set and guarantee that the optimal classifier with respect to the prediction error, $h^\star = \arg\min l(h)$, rather than the minimizer of the regularized loss, $\tilde{h}_n$, stays in the candidate sets with high probability. Moreover, instead of the plain variance regularizer, we apply the second moment regularizer and exploit its decomposability property to bound the difference of the regularized loss, thereby ensuring consistency.

Key Technique 2: Multiple Importance Sampling

MIS addresses how to combine logged observational data with actively collected data for training classifiers [2, 25]. To illustrate this, for simplicity, assume a fixed query policy $Q_1$ is used for active learning. To make use of both $T_0 = \{(X_i, Y_i, Z_i)\}_{i=1}^{m}$ collected by $Q_0$ and $T_1 = \{(X_i, Y_i, Z_i)\}_{i=m+1}^{m+n}$ collected by $Q_1$, one could optimize the unbiased importance weighted error estimator $l_{\mathrm{IS}}(h, T_0 \cup T_1) = \sum_{i=1}^{m}\frac{\mathbf{1}\{h(X_i) \ne Y_i\}Z_i}{(m+n)Q_0(X_i)} + \sum_{i=m+1}^{m+n}\frac{\mathbf{1}\{h(X_i) \ne Y_i\}Z_i}{(m+n)Q_1(X_i)}$, which can have high variance and lead to poor generalization error. Here, we instead apply the MIS estimator $l_{\mathrm{MIS}}(h, T_0 \cup T_1) := \sum_{i=1}^{m+n}\frac{\mathbf{1}\{h(X_i) \ne Y_i\}Z_i}{mQ_0(X_i) + nQ_1(X_i)}$, which effectively treats the data $T_0 \cup T_1$ as drawn from the mixture policy $\frac{mQ_0 + nQ_1}{m + n}$.
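A minimal sketch of the MIS estimator follows; the data layout (triples (x, y, z) with z indicating whether the label was observed) is an illustrative assumption:

```python
def mis_loss(h, logged, online, q0, q1):
    """Multiple importance sampling estimator l_MIS over T0 (policy q0) and T1 (policy q1).

    Every example is weighted by 1 / (m*Q0(X) + n*Q1(X)), so the combined
    sample is effectively treated as drawn from the mixture policy
    (m*Q0 + n*Q1) / (m + n); no separate 1/(m+n) normalization is needed.
    """
    m, n = len(logged), len(online)
    total = 0.0
    for x, y, z in logged + online:
        if z == 1 and h(x) != y:
            total += 1.0 / (m * q0(x) + n * q1(x))
    return total
```

When Q0 and Q1 are both identically 1 and every label is observed, each weight reduces to 1/(m+n) and the estimator is just the empirical error rate.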
$l_{\mathrm{MIS}}$ is also unbiased, but has lower variance than $l_{\mathrm{IS}}$ and thus gives better error bounds.

Key Technique 3: Active Sample Selection Bias Correction

Another advantage of active learning is that the learner can apply a strategy to correct the sample selection bias, which improves label efficiency further. This strategy is inspired by the following intuition: due to the sample selection bias caused by the logging policy, labels in some regions of the sample space may be less likely to be observed in the logged data, which increases the uncertainty in those regions. To counter this effect, during active learning, the learner should query more labels from such regions.

We formalize this intuition as follows. Suppose we would like to design a single query strategy $Q_1 : \mathcal{X} \to [0, 1]$ that determines the probability of querying the label of an instance during the active learning phase. For any $Q_1$, we have the following generalization error bound for learning with $m$ logged examples and $n$ unlabeled examples from which the learner can select and query for labels (for simplicity of illustration, we use the unclipped estimator here):
$$l(h_1) - l(h_2) \le l(h_1, S) - l(h_2, S) + \frac{4\log\frac{2|\mathcal{H}|}{\delta}}{3(mq_0 + n)} + \sqrt{4\,\mathbb{E}\,\frac{\mathbf{1}\{h_1(X) \ne h_2(X)\}}{mQ_0(X) + nQ_1(X)}\log\frac{2|\mathcal{H}|}{\delta}}.$$

We propose to set $Q_1(x) = \mathbf{1}\{mQ_0(x) < \frac{m}{2}Q_0(x) + n\}$, which queries an instance only if $Q_0(x)$ is small. This leads to fewer queries, while guaranteeing an error bound close to the one achieved by setting $Q_1(x) \equiv 1$, which queries every instance. In Appendix E we give an example, Example 31, showing the reduction in queries due to this strategy.

The sample selection bias correction strategy is complementary to the DBAL technique.
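The proposed query rule is a one-line test; rearranging mQ0(x) < (m/2)Q0(x) + n shows it is equivalent to querying exactly where Q0(x) < 2n/m. A sketch:

```python
def query_policy(q0_x: float, m: int, n: int) -> int:
    """Q1(x) = 1{ m*Q0(x) < (m/2)*Q0(x) + n }.

    Queries only where the logging policy rarely observed labels:
    m*q0 < (m/2)*q0 + n  <=>  (m/2)*q0 < n  <=>  q0 < 2*n/m.
    """
    return 1 if m * q0_x < (m / 2) * q0_x + n else 0
```

For example, with m = 1000 logged examples and n = 100 online examples, only instances with Q0(x) < 0.2 are ever queried.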
We note that a similar query strategy is proposed in [25], but the strategy here stems from a tighter analysis and can be applied together with the variance control techniques discussed in Section 4, and thus gives better label complexity guarantees, as discussed in the analysis section.

5.1 Algorithm

Putting things together, our proposed algorithm is shown as Algorithm 1. It takes the logged data and an epoch schedule as input. It assumes the logging policy $Q_0$ and its distribution $f(x) = \Pr(Q_0(X) \le x)$ are known (otherwise, these quantities can be estimated with unlabeled data).

Algorithm 1 uses the DBAL framework, which recursively shrinks a candidate set $C$ and its corresponding disagreement region $D$ to save label queries by not querying examples outside $D$. In particular, at iteration $k$, it computes a clipping threshold $M_k$ (step 5) and MIS weights $w_k(x) := \frac{m + n_k}{mQ_0(x) + \sum_{j=1}^{k}\tau_j Q_j(x)}$, which are used to define the clipped MIS error estimator and two second moment estimators:
$$l(h; \tilde{S}_k, M_k) := \frac{1}{m + n_k}\sum_{i=1}^{m+n_k} w_k(X_i)Z_i\mathbf{1}\{h(X_i) \ne \tilde{Y}_i\}\mathbf{1}\{w_k(X_i) \le M_k\},$$
$$\hat{V}(h_1, h_2; \tilde{S}_k, M_k) := \frac{1}{m + n_k}\sum_{i=1}^{m+n_k} w_k^2(X_i)Z_i\mathbf{1}\{h_1(X_i) \ne h_2(X_i)\}\mathbf{1}\{w_k(X_i) \le M_k\},$$
$$\hat{V}(h; \tilde{S}_k, M_k) := \frac{1}{m + n_k}\sum_{i=1}^{m+n_k} w_k^2(X_i)Z_i\mathbf{1}\{h(X_i) \ne \tilde{Y}_i\}\mathbf{1}\{w_k(X_i) \le M_k\}.$$

The algorithm shrinks the candidate set $C_{k+1}$ by eliminating classifiers whose estimated error is larger than a threshold that takes the minimum empirical error and the second moment into account (step 7), and defines a corresponding disagreement region $D_{k+1} = \mathrm{DIS}(C_{k+1})$ as the set of all instances on which two classifiers in the candidate set $C_{k+1}$ predict labels differently. It derives a query policy $Q_{k+1}$ with the sample selection bias correction strategy
(step 9). At the end of iteration $k$, it draws $\tau_{k+1}$ unlabeled examples. For each example $X$ with $Q_{k+1}(X) > 0$: if $X \in D_{k+1}$, the algorithm queries for the actual label $Y$ and sets $\tilde{Y} = Y$; otherwise, it infers the label and sets $\tilde{Y} = \hat{h}_k(X)$. These examples $\{X\}$ and their inferred or queried labels $\{\tilde{Y}\}$ are then used in subsequent iterations. In the last step of the algorithm, a classifier that minimizes the clipped MIS error with the second moment regularizer over all received data is returned.

Algorithm 1 Disagreement-Based Active Learning with Logged Observational Data
1: Input: confidence $\delta$, logged data $T_0$, epoch schedule $\tau_1, \dots, \tau_K$, $n = \sum_{i=1}^{K}\tau_i$.
2: $\tilde{S}_0 \leftarrow T_0$; $C_0 \leftarrow \mathcal{H}$; $D_0 \leftarrow \mathcal{X}$; $n_0 \leftarrow 0$
3: for $k = 0, \dots, K-1$ do
4: $\sigma_1(k, \delta, M) \leftarrow \big(\frac{M}{m+n_k} + \frac{M^2}{(m+n_k)^{3/2}}\big)\log\frac{|\mathcal{H}|}{\delta}$; $\sigma_2(k, \delta) \leftarrow \frac{1}{m+n_k}\log\frac{|\mathcal{H}|}{\delta}$; $\delta_k \leftarrow \frac{\delta}{2(k+1)(k+2)}$
5: Choose $M_k = \inf\{M \ge 1 \mid \frac{2M}{m+n_k}\log\frac{|\mathcal{H}|}{\delta_k} \ge \Pr(\frac{m+n_k}{mQ_0(X)+n_k} > M/2)\}$
6: $\hat{h}_k \leftarrow \arg\min_{h \in C_k} l(h; \tilde{S}_k, M_k)$
7: Define the candidate set $C_{k+1} \leftarrow \{h \in C_k \mid l(h; \tilde{S}_k, M_k) \le l(\hat{h}_k; \tilde{S}_k, M_k) + \gamma_1\sigma_1(k, \delta_k, M_k) + \gamma_1\sqrt{\sigma_2(k, \delta_k)\,\hat{V}(h, \hat{h}_k; \tilde{S}_k, M_k)}\}$
8: Define the disagreement region $D_{k+1} \leftarrow \{x \in \mathcal{X} \mid \exists h_1, h_2 \in C_{k+1}$ s.t.
$h_1(x) \ne h_2(x)\}$
9: $n_{k+1} \leftarrow n_k + \tau_{k+1}$; $Q_{k+1}(x) \leftarrow \mathbf{1}\{mQ_0(x) + \sum_{i=1}^{k}\tau_i Q_i(x) < \frac{m}{2}Q_0(x) + n_{k+1}\}$
10: Draw $\tau_{k+1}$ samples $\{(X_t, Y_t)\}_{t=m+n_k+1}^{m+n_{k+1}}$, and present $\{X_t\}_{t=m+n_k+1}^{m+n_{k+1}}$ to the learner.
11: for $t = m+n_k+1$ to $m+n_{k+1}$ do
12: $Z_t \leftarrow Q_{k+1}(X_t)$
13: if $Z_t = 1$ then
14: If $X_t \in D_{k+1}$, query for the label: $\tilde{Y}_t \leftarrow Y_t$; otherwise, infer $\tilde{Y}_t \leftarrow \hat{h}_k(X_t)$.
15: end if
16: end for
17: $\tilde{T}_{k+1} \leftarrow \{(X_t, \tilde{Y}_t, Z_t)\}_{t=m+n_k+1}^{m+n_{k+1}}$
18: $\tilde{S}_{k+1} \leftarrow \tilde{S}_k \cup \tilde{T}_{k+1}$
19: end for
20: Output $\hat{h} = \arg\min_{h \in C_K} l(h; \tilde{S}_K, M_K) + \gamma_1\sqrt{\frac{1}{m+n}\log\frac{|\mathcal{H}|}{\delta_K}\,\hat{V}(h; \tilde{S}_K, M_K)}$.

5.2 Analysis

We have the following generalization error bound for Algorithm 1. Despite not querying for all labels, our algorithm achieves the same asymptotic bound as one that queries the labels of all online data.

Theorem 4. Let $M = \inf\{M' \ge 1 \mid \frac{2M'}{m+n}\log\frac{|\mathcal{H}|}{\delta_K} \ge \Pr(\frac{m+n}{mQ_0(X)+n} \ge M'/2)\}$ be the final clipping threshold used in step 20. There is an absolute constant $c_0 > 1$ such that for any $\delta > 0$, with probability at least $1 - \delta$,
$$l(\hat{h}) \le l(h^\star) + c_0\Bigg(\sqrt{\mathbb{E}\,\frac{\mathbf{1}\{h^\star(X) \ne Y\}}{mQ_0(X)+n}\mathbf{1}\Big\{\frac{m+n}{mQ_0(X)+n} \le M\Big\}\log\frac{|\mathcal{H}|}{\delta}} + \frac{M\log\frac{|\mathcal{H}|}{\delta}}{m+n} + \frac{M^2\log\frac{|\mathcal{H}|}{\delta}}{(m+n)^{3/2}}\Bigg).$$

Next, we analyze the number of labels queried by Algorithm 1 with the help of the following definitions.

Definition 5. For any $t \ge 1$ and $r > 0$, define the modified disagreement coefficient $\tilde{\theta}(r, t) := \frac{1}{r}\Pr\big(\mathrm{DIS}(B(h^\star, r)) \cap \{x : Q_0(x) \le \frac{1}{t}\}\big).$
Define $\tilde{\theta} := \sup_{r > 2\nu}\tilde{\theta}(r, \frac{2m}{n})$.

The modified disagreement coefficient $\tilde{\theta}(r, t)$ measures the probability of the intersection of two sets: the disagreement region of the $r$-ball around $h^\star$, and the region where the propensity score $Q_0(x)$ is smaller than $\frac{1}{t}$. It characterizes the size of the querying region of Algorithm 1. Note that the standard disagreement coefficient [10], which is widely used for analyzing DBAL in the classical active learning setting, can be written as $\theta(r) := \tilde{\theta}(r, 1)$. Here, the modified disagreement coefficient adjusts the standard definition to account for the reduction in the number of label queries due to the sample selection bias correction strategy: Algorithm 1 only queries examples on which $Q_0(x)$ is lower than some threshold, hence $\tilde{\theta}(r, t) \le \theta(r)$. Moreover, our modified disagreement coefficient $\tilde{\theta}$ is always smaller than the modified disagreement coefficient of [25] (denoted by $\theta'$), which is used to analyze their algorithm.

Additionally, define $\alpha = \frac{m}{n}$ to be the size ratio of logged and online data, let $\tau_k = 2^k$, define $\xi = \min_{1 \le k \le K}\big\{M_k / \frac{m+n_k}{mq_0+n_k}\big\}$ to be the minimum ratio between the clipping threshold $M_k$ and the maximum MIS weight $\frac{m+n_k}{mq_0+n_k}$ (note $\xi \le 1$, since $M_k \le \frac{m+n_k}{mq_0+n_k}$ by the choice of $M_k$), and define $\bar{M} = \max_{1 \le k \le K} M_k$ to be the maximum clipping threshold. Recall $q_0 = \inf_x Q_0(x)$.

The following theorem upper-bounds the number of labels queried by Algorithm 1.

Theorem 6.
There is an absolute constant c1 > 1 such that for any \u03b4 > 0, with probability at least\n1 \u2212 \u03b4, the number of labels queried by Algorithm 1 is at most:\n\n(\u03be \u2264 1 since Mk \u2264 m+nk\n\nmq0+nk\n\n\u02dc\u03b8 \u00b7 (n\u03bd +\n\nc1\n\nn\u03bd\u03be\n\n\u03b1q0 + 1\n\nlog\n\n|H| log n\n\n\u03b4\n\n+\n\n\u00afM \u03be log n\u221a\nn\u03b1\n\n|H| log n\n\n\u03b4\n\nlog\n\n+\n\n\u03be log n\n\u03b1q0 + 1\n\nlog\n\n|H| log n\n\n\u03b4\n\n).\n\n(cid:114)\n\n(cid:115)\n\n5.3 Discussion\n\n\u03bd \u02dc\u03b8 log |H| \u00b7 ( 1\n\u2265 E 1{h(cid:63)(X)(cid:54)=Y }\n1+\u03b1Q0(X) .\n\u03bd\u03b8 log |H| \u00b7 ( 1\n\n\u03bd \u02dc\u03b8 log |H| \u00b7(cid:16) M\n(cid:16)\n(cid:16)\n\u03bd \u02dc\u03b8 log |H| \u00b7(cid:16)\n(cid:16)\n(cid:16)\n\u03bd\u03b8 log |H| \u00b7(cid:16)\n(cid:16)\nlog |H| \u00b7(cid:16)\n(cid:16)\n(cid:16)\n\n\u2265 1\n\n(1+\u03b1)2q0\n\n1+\u03b1q0\n\nq0+\u03b1\n\n1+\u03b1Q0(X)\n\n\u00012 E 1{h(cid:63)(X)(cid:54)=Y }\n(cid:17)\n(cid:17)\n\n(cid:17)(cid:17)\n\n(cid:17)(cid:17)\n\n(cid:17)\n\nIn this subsection, we compare the theoretical performance of the proposed algorithm and some\nalternatives to understand the effect of proposed techniques. We present some empirical results in\nSection F in Appendix.\nThe theoretical performance of learning algorithms is captured by label complexity, which is de\ufb01ned\nas the number of label queries required during the active learning phase to guarantee the test error\nof the output classi\ufb01er to be at most \u03bd + \u0001 (here \u03bd = l(h(cid:63)) is the optimal error , and \u0001 is the target\nexcess error). This can be derived by combining the upper bounds on the error (Theorem 4) and the\nnumber of queries (Theorem 6).\n\u2022 The label complexity is \u02dcO\n\n1{\n\nfor\n\n\u00012 E 1{h(cid:63)(X)(cid:54)=Y }\n\n1+\u03b1Q0(X)\n\n\u0001(1+\u03b1) + 1\n\nAlgorithm 1. 
This is derived from Theorem 4, 6.\n\n\u2022 The label complexity is \u02dcO\n\n1\n\n\u0001(1+\u03b1q0) + 1\n\nis derived by setting the \ufb01nal clipping threshold MK = 1+\u03b1\n1+\u03b1q0\n\n. It is worse since 1+\u03b1\n1+\u03b1q0\n\n1+\u03b1Q0(X) \u2264 M}(cid:17)(cid:17)\n(cid:17)(cid:17)\n\n1+\u03b1\n\nwithout clipping. This\n\n\u2265 M.\n\n\u0001 + \u03bd\n\u00012 )\n\n1\n\n1+\u03b1q0\n\nif regularizers are removed further. This\n\n\u2022 The label complexity is \u02dcO\n\nis worse since\n\n\u03bd\n\n1+\u03b1q0\n\n\u2022 The label complexity is \u02dcO\n\n\u2022 The label complexity is \u02dcO\ntechnique. It can be shown\n\u2022 The label complexity is \u02dcO\n\nbias correction strategy. Here the standard disagreement coef\ufb01cient \u03b8 is used (\u03b8 \u2265 \u02dc\u03b8).\n\n\u0001 + \u03bd\n\u00012 )\n\n1\n\n1+\u03b1q0\n\nif we further remove the sample selection\n\n1\n\n\u0001(1+\u03b1q0) + \u03bd(q0+\u03b1)\n\u00012(1+\u03b1)2q0\n\nif we further remove the MIS\n\n, so MIS gives a better label complexity bound.\n\nif DBAL is further removed. Here,\nall n online examples are queried. This demonstrates that DBAL decreases the label complexity\nbound by a factor of \u03bd\u03b8 which is at most 1 by de\ufb01nition.\n\n\u0001(1+\u03b1q0) + \u03bd(q0+\u03b1)\n\u00012(1+\u03b1)2q0\n\n1\n\n\u2022 Finally, the label complexity is \u02dcO\n\nfor [25], the only known algorithm in\nour setting. Here, \u03b8(cid:48) \u2265 \u02dc\u03b8,\n\u2265 M\n1+\u03b1. Thus, the label complexity\nof the proposed algorithm is better than [25]. 
This improvement is made possible by the second\nmoment regularizer, the principled clipping technique, and thereby the improved sample selection\nbias correction strategy.\n\n1+\u03b1Q0(X) , and\n\n\u03bd\u03b8(cid:48) log |H| \u00b7 \u03bd+\u0001\n\u2265 E 1{h(cid:63)(X)(cid:54)=Y }\n\n1+\u03b1q0\n\n1+\u03b1q0\n\n1+\u03b1q0\n\n\u00012\n\n\u03bd\n\n1\n\n1\n\n6 Conclusion\n\nWe consider active learning with logged observational data where the learner is given an observational\ndata set selected according to some logging policy, and can actively query for additional labels\nfrom an online data stream. Previous work applies disagreement-based active learning with an\n\n8\n\n\fimportance weighted loss estimator to account for counterfactuals, which has high variance and\nleads to a high label complexity. In this work, we utilize variance control techniques for importance\nweighted estimators, and propose a novel variant of DBAL to make it amenable to variance-controlled\nimportance sampling. Based on these improvements, a new sample selection bias correction strategy\nis proposed to further boost label ef\ufb01ciency. Our theoretical analysis shows that the proposed\nalgorithm is statistically consistent and more label-ef\ufb01cient than prior work and alternative methods.\n\nAcknowledgement We thank NSF under CCF 1513883 and 1719133 for support.\n\nReferences\n[1] Vowpal Wabbit. https://github.com/JohnLangford/vowpal_wabbit/.\n\n[2] Aman Agarwal, Soumya Basu, Tobias Schnabel, and Thorsten Joachims. Effective evaluation\nusing logged bandit feedback from multiple loggers. arXiv preprint arXiv:1703.06180, 2017.\n\n[3] Onur Atan, William R. Zame, and Mihaela van der Schaar. Sequential patient recruitment and\nallocation for adaptive clinical trials. In Kamalika Chaudhuri and Masashi Sugiyama, editors,\nProceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning\nResearch, pages 1891\u20131900. PMLR, 16\u201318 Apr 2019.\n\n[4] M.-F. Balcan, A. 
Beygelzimer, and J. Langford. Agnostic active learning. J. Comput. Syst. Sci., 75(1):78–89, 2009.

[5] P. Borjesson and C.-E. Sundberg. Simple approximations of the error function Q(x) for communications applications. IEEE Transactions on Communications, 27(3):639–643, 1979.

[6] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.

[7] D. A. Cohn, L. E. Atlas, and R. E. Ladner. Improving generalization with active learning. Machine Learning, 15(2), 1994.

[8] Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems, pages 442–450, 2010.

[9] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In NIPS, 2007.

[10] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.

[11] Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends® in Machine Learning, 7(2-3):131–309, 2014.

[12] D. Hsu. Algorithms for Active Learning. PhD thesis, UC San Diego, 2010.

[13] Tzu-Kuo Huang, Alekh Agarwal, Daniel J. Hsu, John Langford, and Robert E. Schapire. Efficient and parsimonious agnostic active learning. In Advances in Neural Information Processing Systems, pages 2755–2763, 2015.

[14] David Kale, Marjan Ghazvininejad, Anil Ramakrishna, Jingrui He, and Yan Liu. Hierarchical active transfer learning. In Proceedings of the 2015 SIAM International Conference on Data Mining, pages 514–522. SIAM, 2015.

[15] Samory Kpotufe and Guillaume Martinet. Marginal singularity, and the benefits of labels in covariate-shift.
In Conference On Learning Theory, pages 1882–1886, 2018.

[16] Akshay Krishnamurthy, Alekh Agarwal, Tzu-Kuo Huang, Hal Daumé III, and John Langford. Active learning for cost-sensitive classification. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1915–1924, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[17] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample variance penalization. In COLT 2009 - The 22nd Conference on Learning Theory, 2009.

[18] Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971–2980, 2017.

[19] Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.

[20] Avishek Saha, Piyush Rai, Hal Daumé, Suresh Venkatasubramanian, and Scott L. DuVall. Active supervised domain adaptation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 97–112. Springer, 2011.

[21] Iiris Sundin, Peter Schulam, Eero Siivola, Aki Vehtari, Suchi Saria, and Samuel Kaski. Active learning for decision-making from imbalanced observational data. arXiv preprint arXiv:1904.05268, 2019.

[22] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823, 2015.

[23] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264, 1971.

[24] Eric Veach and Leonidas J. Guibas. Optimally combining sampling techniques for Monte Carlo rendering.
In Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, pages 419–428. ACM, 1995.

[25] Songbai Yan, Kamalika Chaudhuri, and Tara Javidi. Active learning with logged data. In International Conference on Machine Learning, pages 5517–5526, 2018.

[26] Chicheng Zhang, Alekh Agarwal, Hal Daumé III, John Langford, and Sahand N. Negahban. Warm-starting contextual bandits: Robustly combining supervised and bandit feedback. arXiv preprint arXiv:1901.00301, 2019.

[27] Zihan Zhang, Xiaoming Jin, Lianghao Li, Guiguang Ding, and Qiang Yang. Multi-domain active learning for recommendation. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[28] Andre M. Zubkov and Aleksandr A. Serov. A complete proof of universal inequalities for the distribution function of the binomial law. Theory of Probability & Its Applications, 57(3):539–544, 2013.