{"title": "Auditing: Active Learning with Outcome-Dependent Query Costs", "book": "Advances in Neural Information Processing Systems", "page_first": 512, "page_last": 520, "abstract": "We propose a learning setting in which unlabeled data is free, and the cost of a label depends on its value, which is not known in advance. We study binary classification in an extreme case, where the algorithm only pays for negative labels. Our motivation are applications such as fraud detection, in which investigating an honest transaction should be avoided if possible. We term the setting auditing, and consider the auditing complexity of an algorithm: The number of negative points it labels to learn a hypothesis with low relative error. We design auditing algorithms for thresholds on the line and axis-aligned rectangles, and show that with these algorithms, the auditing complexity can be significantly lower than the active label complexity. We discuss a general approach for auditing for a general hypothesis class, and describe several interesting directions for future work.", "full_text": "Auditing: Active Learning with\nOutcome-Dependent Query Costs\n\nSivan Sabato\n\nMicrosoft Research New England\n\nsivan.sabato@microsoft.com\n\nAnand D. Sarwate\n\nTTI-Chicago\n\nasarwate@ttic.edu\n\nTechnion-Israel Institute of Technology and TTI-Chicago\n\nNathan Srebro\n\nnati@ttic.edu\n\nAbstract\n\nWe propose a learning setting in which unlabeled data is free, and the cost of a\nlabel depends on its value, which is not known in advance. We study binary clas-\nsi\ufb01cation in an extreme case, where the algorithm only pays for negative labels.\nOur motivation are applications such as fraud detection, in which investigating\nan honest transaction should be avoided if possible. We term the setting audit-\ning, and consider the auditing complexity of an algorithm: the number of negative\nlabels the algorithm requires in order to learn a hypothesis with low relative er-\nror. 
We design auditing algorithms for simple hypothesis classes (thresholds and rectangles), and show that with these algorithms, the auditing complexity can be significantly lower than the active label complexity. We also show a general competitive approach for learning with outcome-dependent costs.

1 Introduction

Active learning algorithms seek to mitigate the cost of learning by using unlabeled data and sequentially selecting examples to query for their label, so as to minimize the total number of queries. In some cases, however, the actual cost of each query depends on the true label of the example and is thus not known before the label is requested. For instance, in detecting fraudulent credit transactions, a query with a positive answer is not wasteful, whereas a negative answer is the result of a wasteful investigation of an honest transaction, and perhaps a loss of good will. More generally, in a multiclass setting, different queries may entail different costs, depending on the outcome of the query. In this work we focus on the binary case, and on the extreme version of the problem, as described in the example of credit fraud, in which the algorithm only pays for queries which return a negative label. We term this setting auditing, and the cost incurred by the algorithm its auditing complexity.

There are several natural ways to measure performance in auditing. For example, we may wish the algorithm to maximize the number of positive labels it finds for a fixed "budget" of negative labels, or to minimize the number of negative labels while finding a certain number or fraction of positive labels. In this work we focus on the classical learning problem, in which one attempts to learn a classifier from a fixed hypothesis class, with an error close to the best possible. As in active learning, we assume we are given a large set of unlabeled examples, and aim to learn with minimal labeling cost.
But unlike active learning, we only incur a cost when requesting the label of an example that turns out to be negative.

The close relationship between auditing and active learning raises natural questions. Can the auditing complexity be significantly better than the label complexity in active learning? If so, should algorithms be optimized for auditing, or do optimal active learning algorithms also have low auditing complexity? To answer these questions, and to demonstrate the differences between active learning and auditing, we study the simple hypothesis classes of thresholds and of axis-aligned rectangles in Rd, in both the realizable and the agnostic settings. We then also consider a general competitive analysis for arbitrary hypothesis classes.

Other work. Existing work on active learning with costs (Margineantu, 2007; Kapoor et al., 2007; Settles et al., 2008; Golovin and Krause, 2011) typically assumes that the cost of labeling each point is known a priori, so the algorithm can use the costs directly to select a query. Our model is significantly different, as the costs depend on the outcome of the query itself. Kapoor et al. (2007) do mention the possibility of class-dependent costs, but this possibility is not studied in detail. An unrelated game-theoretic learning model addressing "auditing" was proposed by Blocki et al. (2011).

Notation and Setup

For an integer m, let [m] = {1, 2, ..., m}. The function I[A] is the indicator function of a set A. For a function f and a sub-domain X, f|X is the restriction of f to X. For vectors a and b in Rd, the inequality a ≤ b means that a[i] ≤ b[i] for all i ∈ [d].

We assume a data domain X and a distribution D over labeled data points in X × {−1, +1}. A learning algorithm may sample i.i.d. pairs (X, Y) ∼ D. It then has access to the value of X, but the label Y remains hidden until queried.
The algorithm returns a labeling function ĥ : X → {−1, +1}. The error of a function h : X → {−1, +1} on D is err(D, h) = P(X,Y)∼D[h(X) ≠ Y]. The error of h on a multiset S ⊆ X × {−1, +1} is err(S, h) = (1/|S|) Σ_{(x,y)∈S} I[h(x) ≠ y]. The passive sample complexity of an algorithm is the number of pairs it draws from D. The active label complexity of an algorithm is the total number of label queries the algorithm makes. Its auditing complexity is the number of queries the algorithm makes on points with negative labels.

We consider guarantees for learning algorithms relative to a hypothesis class H ⊆ {−1, +1}^X. We denote the error of the best hypothesis in H on D by err(D, H) = min_{h∈H} err(D, h). Similarly, err(S, H) = min_{h∈H} err(S, h). We usually denote the best error for D by η = err(D, H).

To describe our algorithms it will be convenient to define the following sample sizes, using universal constants C, c > 0. Let δ ∈ (0, 1) be a confidence parameter, and let ε ∈ (0, 1) be an error parameter. Let m_ag(ε, δ, d) = C(d + ln(c/δ))/ε². If a sample S is drawn from D with |S| = m_ag(ε, δ, d), then with probability 1 − δ, for all h ∈ H, err(D, h) ≤ err(S, h) + ε and err(S, H) ≤ err(D, H) + ε (Bartlett and Mendelson, 2002). Let m_ν(ε, δ, d) = C(d ln(c/(νε)) + ln(c/δ))/(ν²ε). Results of Vapnik and Chervonenkis (1971) show that if H has VC dimension d and S is drawn from D with |S| = m_ν(ε, δ, d), then for all h ∈ H,

err(S, h) ≤ max{err(D, h)(1 + ν), err(D, h) + νε} and
err(D, h) ≤ max{err(S, h)(1 + ν), err(S, h) + νε}.   (1)

2 Active Learning vs.
Auditing: Summary of Results

The main point of this paper is that the auditing complexity can be quite different from the active label complexity, and that algorithms tuned to minimizing the auditing complexity give improvements over standard active learning algorithms. Before presenting these differences, we note that in some regimes, neither active learning nor auditing can improve significantly over the passive sample complexity. In particular, a simple adaptation of a result of Beygelzimer et al. (2009) establishes the following lower bound.

Lemma 2.1. Let H be a hypothesis class with VC dimension d > 1. If an algorithm always finds a hypothesis ĥ with err(D, ĥ) ≤ err(D, H) + ε for ε > 0, then for any η ∈ (0, 1) there is a distribution D with η = err(D, H) such that the auditing complexity of this algorithm for D is Ω(dη²/ε²).

That is, when η is fixed while ε → 0, the auditing complexity scales as Ω(d/ε²), similar to the passive sample complexity. Therefore the two interesting regimes are the realizable case, corresponding to η = 0, and the agnostic case, in which we want to guarantee an excess error ε such that η/ε is bounded. We provide results for both of these regimes.

We first consider the realizable case, when η = 0. Here it is sufficient to consider the setting where a fixed pool S of m points is given and the algorithm must return a hypothesis ĥ such that err(S, ĥ) = 0 with probability 1. A pool labeling algorithm can be used to learn a hypothesis which is good for a distribution by drawing and labeling a large enough pool. We define the auditing complexity of an unlabeled pool as the minimal number of negative labels needed to perfectly classify it.
It is easy to see that there are pools whose auditing complexity is at least the VC dimension of the hypothesis class.

For the agnostic case, when η > 0, we denote α = ε/η and say that an algorithm (α, δ)-learns a class of distributions D with respect to H if for all D ∈ D, with probability 1 − δ, the ĥ returned by the algorithm satisfies err(D, ĥ) ≤ (1 + α)η. By Lemma 2.1 an auditing complexity of Ω(d/α²) is unavoidable, but we can hope to improve over the passive sample complexity lower bound of Ω(d/(ηα²)) (Devroye and Lugosi, 1995) by avoiding the dependence on η.

Our main results are summarized in Table 1, which shows the auditing and active learning complexities in the two regimes, for thresholds on [0, 1] and axis-aligned rectangles in Rd, where we assume that the hypotheses label the points in the rectangle as negative and points outside as positive.

                          Active                Auditing
Realizable  Thresholds    Θ(ln m)               1
            Rectangles    m                     2d
Agnostic    Thresholds    Ω(ln(1/η) + 1/α²)     O(1/α²)
            Rectangles    Ω(d(1/η + 1/α²))      O(d² ln²(1/η) · (1/α²) ln(1/α))

Table 1: Auditing complexity upper bounds vs. active label complexity lower bounds for the realizable (pool size m) and agnostic (err(D, H) = η) cases. Agnostic bounds are for (α, δ)-learning with a fixed δ, where α = ε/η.

In the realizable case, for thresholds, the optimal active learning algorithm performs binary search, resulting in Ω(ln m) labels in the worst case. This is a significant improvement over the passive label complexity of m.
However, a simple auditing procedure that scans from right to left queries only a single negative point, achieving an auditing complexity of 1. For rectangles, we present a simple coordinate-wise scanning procedure with an auditing complexity of at most 2d, demonstrating a huge gap versus active learning, where the labels of all m points might be required. Not all classes enjoy reduced auditing complexity: we also show that for rectangles with positive points on the inside, there exist pools of size m with an auditing complexity of m.

In the agnostic case we wish to (α, δ)-learn distributions with a true error of η = err(D, H), for constant α, δ. For active learning, it has been shown that in some cases, the Ω(d/η) passive sample complexity can be replaced by an exponentially smaller O(d ln(1/η)) active label complexity (Hanneke, 2011), albeit sometimes with a larger polynomial dependence on d. In other cases, an Ω(1/η) dependence exists also for active learning. Our main question is whether the dependence on η in the active label complexity can be further reduced for auditing.

For thresholds, active learning requires Ω(ln(1/η)) labels (Kulkarni et al., 1993). Using auditing, we show that the dependence on η can be completely removed, for any true error level η > 0, if we know η in advance. We also show that if η is not known at least approximately, the logarithmic dependence on 1/η is unavoidable also for auditing. For rectangles, we show that the active label complexity is at least Ω(d/η). In contrast, we propose an algorithm with an auditing complexity of O(d² ln²(1/η)), reducing the linear dependence on 1/η to a logarithmic dependence.
We do not know whether a linear dependence on d is possible together with a logarithmic dependence on 1/η. Omitted proofs of results below are provided in the extended version of this paper (Sabato et al., 2013).

3 Auditing for Thresholds on the Line

The first question to ask is whether the auditing complexity can ever be significantly smaller than the active or passive label complexities, and whether a different algorithm is required to achieve this improvement. The following simple case answers both questions in the affirmative. Consider the hypothesis class of thresholds on the line, defined over the domain X = [0, 1]. A hypothesis with threshold a is ha(x) = I[x − a ≥ 0]. The hypothesis class is H⊣ = {ha | a ∈ [0, 1]}. Consider the pool setting in the realizable case. The optimal active label complexity of Θ(log₂ m) can be achieved by binary search on the pool. The auditing complexity of this algorithm can also be as large as Θ(log₂ m). However, auditing allows us to beat this barrier. This case exemplifies an interesting contrast between auditing and active learning. Due to information-theoretic considerations, any algorithm which learns an unlabeled pool S has an active label complexity of at least log₂|H|S| (Kulkarni et al., 1993), where H|S is the set of restrictions of functions in H to the domain S. For H⊣, log₂|H⊣|S| = Ω(log₂ m). However, the same considerations are invalid for auditing.

We showed that in the realizable case, the auditing complexity for H⊣ is a constant. We now provide a more complex algorithm that guarantees this for (α, δ)-learning in the agnostic case.
The intuition behind our approach is that to find an optimal threshold in a pool with at most k errors, we can query from highest to lowest until observing k + 1 negative points, and then find the minimal-error threshold on the labeled points.

Lemma 3.1. Let S be a pool of size m in [0, 1], and assume that err(S, H⊣) ≤ k/m. Then the procedure above finds ĥ such that err(S, ĥ) = err(S, H⊣), with an auditing complexity of k + 1.

Proof. Denote the last queried point by x0, and let ha* ∈ argmin_{h∈H⊣} err(S, h). Since err(S, ha*) ≤ k/m, we have a* > x0: any threshold a ≤ x0 misclassifies the k + 1 queried negative points, all of which are at least x0. Denote by S′ ⊆ S the set of points queried by the procedure. For any a > x0, the number of errors of ha on S equals its number of errors on S′ plus |{(x, y) ∈ S | x < x0, y = 1}|, and the latter term does not depend on a. Therefore, minimizing the error on S′ results in a hypothesis that minimizes the error on S.

To learn from a distribution, one can draw a random sample and use it as the pool in the procedure above. However, the sample size required for passive (α, δ)-learning of thresholds is Ω(ln(1/η)/η). Thus, the number of errors in the pool would be k = η · Ω(ln(1/η)/η) = Ω(ln(1/η)), which depends on η. To avoid this dependence, the auditing algorithm we propose uses Alg. 1 below to select a subset of the random sample which still represents the distribution well, but whose size is only O(1/η).

Lemma 3.2. Let δ, ηmax ∈ (0, 1). Let S be a pool such that err(S, H⊣) ≤ ηmax. Let Sq be the output of Alg. 1 with inputs S, ηmax, δ, and let ĥ = argmin_{h∈H⊣} err(Sq, h). Then with probability 1 − δ,

err(Sq, ĥ) ≤ 6ηmax and err(S, ĥ) ≤ 17ηmax.

The algorithm for auditing thresholds on the line in the agnostic case is listed in Alg. 2.
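As an illustration, the pool procedure behind Lemma 3.1 can be sketched in Python; the function name, the `query_label` oracle interface, and the candidate-threshold ERM step are conventions of this sketch rather than the paper's. With k = 0 it reduces to the realizable right-to-left scan that pays for a single negative label; Alg. 2 below wraps a procedure of this kind with additional sampling steps.

```python
def audit_thresholds_pool(points, query_label, k):
    """Query a pool of reals from highest to lowest, stopping after
    k + 1 negative answers, then return a threshold minimizing the
    empirical error on the queried suffix.

    `query_label(x)` reveals the hidden label in {-1, +1} of point x;
    each -1 answer costs one audit, so at most k + 1 audits are paid.
    """
    labeled = []          # (x, y) pairs, highest first
    negatives = 0
    for x in sorted(points, reverse=True):
        y = query_label(x)
        labeled.append((x, y))
        negatives += 1 if y == -1 else 0
        if negatives == k + 1:
            break
    # ERM over thresholds h_a(x) = +1 iff x >= a: it suffices to try
    # each queried point as a, plus +inf for the all-negative labeling.
    def errors(a):
        return sum(1 for (x, y) in labeled
                   if (1 if x >= a else -1) != y)
    candidates = [x for (x, _) in labeled] + [float("inf")]
    return min(candidates, key=errors)
```

On a realizable pool (k = 0), the scan stops at the very first negative answer, so exactly one audit is paid regardless of the pool size.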
Alg. 2 first achieves (C, δ)-learning of H⊣ for a fixed constant C (at step 7, based on Lemmas 3.1 and 3.2), and then improves its accuracy to achieve (α, δ)-learning for α > 0, by additional passive sampling in a restricted region. The following theorem provides the guarantees for Alg. 2.

Algorithm 1: Representative Subset Selection
1: Input: pool S = (x1, ..., xm) (with hidden labels), xi ∈ [0, 1], ηmax ∈ (0, 1], δ ∈ (0, 1).
2: T ← max{⌊1/(3ηmax)⌋, 1}.
3: Let U be the multiset containing T copies of each point in S.
4: Sort and rename the points in U such that x′1 ≤ x′2 ≤ ... ≤ x′Tm.
5: Let Sq be an empty multiset.
6: for t = 1 to T do
7:   S(t) ← {x′(t−1)m+1, ..., x′tm}.
8:   Draw 14 ln(8/δ) points from S(t) independently and uniformly at random and add them to Sq (with duplications).
9: end for
10: Return Sq (with the corresponding hidden labels).

Algorithm 2: Auditing for Thresholds with a constant α
1: Input: ηmax, δ, α ∈ (0, 1), access to a distribution D such that err(D, H⊣) ≤ ηmax.
2: ν ← α/5.
3: Draw a random labeled pool (with hidden labels) S0 of size m_ν(ηmax, δ/2, 1) from D.
4: Draw a random sample S of size m_ag((1 + ν)ηmax, δ/2, 1) uniformly from S0.
5: Get a subset Sq using Alg. 1 with inputs S, 2(1 + ν)ηmax, δ/2.
6: Query the points in Sq from highest to lowest. Stop after ⌈12|Sq|(1 + ν)ηmax⌉ + 1 negatives.
7: Find â such that hâ minimizes the error on the labeled part of Sq.
8: Let S1 be the set of the 36(1 + ν)ηmax|S0| closest points to â in S from each side of â.
9: Draw S2 of size m_ag(ν/72, δ/2, 1) from S1 (see the definition of m_ag in the notation section).
10: Query all points in S2, and return ĥ that minimizes the error on S2.

Theorem 3.3. Let ηmax, δ, α ∈ (0, 1). Let D be a distribution with error err(D, H⊣) ≤ ηmax. Alg. 2 with inputs ηmax, δ, α has an auditing complexity of O(ln(1/δ)/α²), and returns ĥ such that with probability 1 − δ, err(D, ĥ) ≤ (1 + α)ηmax.

It immediately follows that if η = err(D, H) is known, (α, δ)-learning is achievable with an auditing complexity that does not depend on η. This is formulated in the following corollary.

Corollary 3.4 ((α, δ)-learning for H⊣). Let η, α, δ ∈ (0, 1]. For any distribution D with error err(D, H⊣) = η, Alg. 2 with inputs ηmax = η, α, δ (α, δ)-learns D with respect to H⊣ with an auditing complexity of O(ln(1/δ)/α²).

A similar result holds if the error is known up to a multiplicative constant. But what if no bound on η is known? The following lower bound shows that in this case, the best auditing complexity for thresholds is similar to the best active label complexity.

Theorem 3.5 (Lower bound on auditing H⊣ without ηmax). Consider any constant α ≥ 0.
For any δ ∈ (0, 1), if an auditing algorithm (α, δ)-learns every distribution D such that err(D, H⊣) ≥ ηmin, then the algorithm's auditing complexity is Ω(ln((1 − δ)/δ) · ln(1/ηmin)).

In the next section we show that there are classes with a significant gap between the active and auditing complexities even without an upper bound on the error.

4 Axis-Aligned Rectangles

A natural extension of thresholds to higher dimensions is the class of axis-aligned rectangles, in which the labels are determined by a d-dimensional hyperrectangle. This hypothesis class, first introduced in Blumer et al. (1989), has been studied extensively in different regimes (Kearns, 1998; Long and Tan, 1998), including active learning (Hanneke, 2007b). An axis-aligned-rectangle hypothesis is a disjunction of 2d thresholds. For simplicity of presentation, we consider here the slightly simpler class of disjunctions of d thresholds over the positive orthant Rd+. It is easy to reduce learning of an axis-aligned rectangle in Rd to learning of a disjunction of thresholds in R2d by mapping each point x ∈ Rd to a point x̃ ∈ R2d such that for i ∈ [d], x̃[i] = max(x[i], 0) and x̃[i + d] = max(0, −x[i]). Thus learning the class of disjunctions is equivalent, up to a factor of two in the dimensionality, to learning rectangles.¹ Because auditing costs are asymmetric, we consider two possibilities for label assignment. For a vector a = (a[1], ..., a[d]) ∈ Rd+, define the hypotheses ha and h−a by

ha(x) = 2I[∃i ∈ [d], x[i] ≥ a[i]] − 1, and h−a(x) = −ha(x).

Define H2 = {ha | a ∈ Rd+} and H−2 = {h−a | a ∈ Rd+}. In H2 the positive points are outside the rectangle, and in H−2 the negatives are outside. Both classes have VC dimension d. All of our results for these classes can easily be extended to the corresponding classes of general axis-aligned rectangles on Rd, with at most a factor-of-two penalty on the auditing complexity.

¹This reduction suffices if the origin is known to be in the rectangle. Our algorithms and results can all be extended to the case where rectangles are not required to include the origin. To keep the algorithm and analysis as simple as possible, we state the results for this special case.

4.1 The Realizable Case

We first consider the pool setting in the realizable case, and show a sharp contrast between the auditing complexity and the active label complexity for H2 and H−2. Assume a pool of size m. While the active label complexity for H2 and H−2 can be as large as m, the auditing complexities of the two classes are quite different. For H−2, the auditing complexity can be as large as m, but for H2 it is at most d. We start by showing the upper bound for H2.

Theorem 4.1 (Pool auditing upper bound for H2). The auditing complexity of any unlabeled pool Su of size m with respect to H2 is at most d.

Proof. The method is a generalization of the approach to auditing for thresholds. Let h* ∈ H2 be such that err(S, h*) = 0. For each i ∈ [d], order the points x in S by the values of their i-th coordinate x[i]. Query the points sequentially from the largest value to the smallest (breaking ties arbitrarily) and stop when the first negative label is returned, for some point xi. Set a[i] ← xi[i], and note that h* labels all points in {x | x[i] > a[i]} positive. Return the hypothesis ĥ = ha. This procedure clearly queries at most d negative points and agrees with the labeling of h*.

It is easy to see that a similar approach yields an auditing complexity of 2d for full axis-aligned rectangles.
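As an illustration, the coordinate-wise scan from the proof of Theorem 4.1 can be sketched in Python; the function names and the query-oracle interface are conventions of this sketch. Labels are cached so that a negative point is paid for at most once across the d scans, and, following the proof's observation, the returned predictor labels a point positive iff some coordinate strictly exceeds the learned bound.

```python
import math

def audit_rectangle_pool(points, query_label):
    """Realizable pool auditing for disjunctions of d thresholds (class H2).

    For each coordinate i, points (hashable tuples) are queried from the
    largest i-th coordinate downward until the first negative answer, which
    fixes a[i]; caching keeps the total number of audits at most d.
    Returns a predictor consistent with the hidden pool labeling.
    """
    d = len(points[0])
    cache = {}
    def label(x):
        if x not in cache:
            cache[x] = query_label(x)
        return cache[x]
    a = [-math.inf] * d   # -inf: no negative seen in this coordinate's scan
    for i in range(d):
        for x in sorted(points, key=lambda p: p[i], reverse=True):
            if label(x) == -1:
                a[i] = x[i]   # largest i-th coordinate among negatives
                break
    # Positive iff some coordinate strictly exceeds a[i].
    return lambda x: 1 if any(x[i] > a[i] for i in range(d)) else -1
```

Since a[i] ends up as the largest i-th coordinate over negative points, every negative point is (weakly) dominated coordinate-wise and predicted negative, while every positive point strictly exceeds some a[i].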
We now provide a lower bound for the auditing complexity of H−2, which immediately implies the same lower bound for the active label complexity of H−2 and H2.

Theorem 4.2 (Pool auditing lower bound for H−2). For any m and any d ≥ 2, there is a pool Su ⊆ Rd+ of size m such that its auditing complexity with respect to H−2 is m.

Proof. The construction is a simple adaptation of a construction due to Dasgupta (2005), originally showing an active learning lower bound for the class of hyperplanes. Let the pool be composed of m distinct points on the intersection of the unit circle and the positive orthant: Su = {(cos θj, sin θj)} for distinct θj ∈ [0, π/2]. Any labeling which labels all the points in Su negative except any one point is realizable for H−2, and so is the all-negative labeling. Thus, any algorithm that distinguishes between these different labelings with probability 1 must query all the negative labels.

Corollary 4.3 (Realizable active label complexity of H2 and H−2). For each of H2 and H−2, there is a pool of size m whose active label complexity is m.

4.2 The Agnostic Case

We now consider H2 in the agnostic case, where η > 0. The best known algorithm for active learning of rectangles (2, δ)-learns a very restricted class of distributions (continuous product distributions which are sufficiently balanced in all directions) with an active label complexity of Õ(d³ p(ln(1/η)) p(ln(1/δ))), where p(·) is a polynomial (Hanneke, 2007b). However, for a general distribution, the active label complexity cannot be significantly better than the passive label complexity. This is formalized in the following theorem.

Theorem 4.4 (Agnostic active label complexity of H2). Let α, η > 0, δ ∈ (0, 1/2).
Any learning algorithm that (α, δ)-learns, with respect to H2, all distributions D such that err(D, H2) = η has an active label complexity of Ω(d/η).

In contrast, the auditing complexity of H2 can be much smaller, as we show for Alg. 3 below.

Theorem 4.5 (Auditing complexity of H2). For ηmin, α, δ ∈ (0, 1), there is an algorithm that (α, δ)-learns all distributions with η ≥ ηmin with respect to H2, with an auditing complexity of O((d² ln(1/(αδ))/α²) ln²(1/ηmin)).

If ηmin is polynomially close to the true η, we get an auditing complexity of O(d² ln²(1/η)) for constant α and δ, compared to the active label complexity of Ω(d/η): an exponential improvement in the dependence on η. It is an open question whether the quadratic dependence on d is necessary here.

Alg. 3 implements a 'low-confidence' version of the realizable algorithm. It sequentially queries points in each direction, until enough negative points have been observed to make sure the threshold in this direction has been overstepped. To bound the number of negative labels, the algorithm iteratively refines lower bounds on the locations of the best thresholds, and an upper bound on the negative error, defined as the probability that a point from D with a negative label is classified as positive by a minimal-error classifier. The algorithm uses queries that mostly result in positive labels, and stops when the upper bound on the negative error cannot be refined. The idea of iteratively refining a set of possible hypotheses has been used in a long line of active learning works (Cohn et al., 1994; Balcan et al., 2006; Hanneke, 2007a; Dasgupta et al., 2008). Here we refine in a particular way that uses the structure of H2 and allows bounding the number of negative examples we observe.

We use the following notation in Alg. 3.
The negative error of a hypothesis is errneg(D, h) = P(X,Y)∼D[h(X) = 1 and Y = −1]. It is easy to see that the convergence guarantees that hold for err(·,·) using a sample of size m_ν(ε, δ, d) hold also for the negative error errneg(·,·) (see Sabato et al., 2013). For a labeled set of points S, an ε ∈ (0, 1), and a hypothesis class H, denote V_ν(S, ε, H) = {h ∈ H | err(S, h) ≤ err(S, H) + (2ν + ν²) · max(err(S, H), ε)}. For a vector b ∈ Rd+, define H2[b] = {ha ∈ H2 | a ≥ b}.

Algorithm 3: Auditing for H2
1: Input: ηmin > 0, α ∈ (0, 1], δ ∈ (0, 1), access to a distribution D over Rd+ × {−1, +1}.
2: ν ← α/25.
3: for t = 0 to ⌊log₂(1/ηmin)⌋ do
4:   ηt ← 2^−t.
5:   Draw a sample St of size m_ν(ηt, δ/log₂(1/ηmin), 10d) with hidden labels.
6:   for i = 1 to d do
7:     j ← 0
8:     while j ≤ ⌈(1 + ν)ηt|St|⌉ + 1 do
9:       If unqueried points exist, query the unqueried point with the highest i-th coordinate;
10:      If the query returned −1, j ← j + 1.
11:    end while
12:    bt[i] ← the i-th coordinate of the last queried point, or 0 if all points were queried.
13:  end for
14:  Set Sbt to St, with unqueried labels set to −1.
15:  Vt ← V_ν(Sbt, ηt, H2[bt]).
16:  η̂t ← max_{h∈Vt} errneg(Sbt, h).
17:  if η̂t > ηt/4 then
18:    Skip to step 21.
19:  end if
20: end for
21: Return ĥ ∈ argmin_{h∈H2[bt]} err(Sbt, h).

Theorem 4.5 is proven in Sabato et al. (2013). The proof idea is to show that at each round t, Vt includes every h* ∈ argmin_{h∈H} err(D, h), and η̂t is an upper bound on errneg(D, h*). Further, at any given point, minimizing the error on Sbt is equivalent to minimizing the error on the entire (unlabeled) sample.
We conclude that the algorithm obtains a good approximation of the total error. Its auditing complexity is bounded because it queries a bounded number of negative points in each round.

5 Outcome-Dependent Costs for a General Hypothesis Class

In this section we return to the realizable pool setting and consider finite hypothesis classes H. We address general outcome-dependent costs and a general label space Y, so that H ⊆ Y^X. Let S ⊆ X be an unlabeled pool, and let cost : S × H → R+ denote the cost of a query: for x ∈ S and h ∈ H, cost(x, h) is the cost of querying the label of x given that h is the true (unknown) hypothesis. In the auditing setting, Y = {−1, +1} and cost(x, h) = I[h(x) = −1]. For active learning, cost ≡ 1. Note that under this definition of the cost function, the algorithm may not know the cost of a query until the true hypothesis is revealed.

Define OPTcost(S) to be the minimal cost of an algorithm that, for any labeling of S which is consistent with some h ∈ H, produces a hypothesis ĥ such that err(S, ĥ) = 0. In the active learning setting, where cost ≡ 1, it is NP-hard to obtain OPTcost(S) for general H and S. This can be shown by a reduction from set cover (Hyafil and Rivest, 1976). A simple adaptation of the reduction for the auditing complexity, which we defer to the full version of this work, shows that it is also NP-hard to obtain OPTcost(S) in the auditing setting.

For active learning, and for query costs that do not depend on the true hypothesis (that is, cost(x, h) ≡ cost(x)), Golovin and Krause (2011) showed an efficient greedy strategy that achieves a cost of O(OPTcost(S) · ln(|H|)) for any S. This approach has also been shown to provide considerable performance gains in practical settings (Gonen et al., 2013).
The greedy strategy consists of iteratively selecting a point whose label splits the set of possible hypotheses as evenly as possible, normalized by the cost of each query.

We now show that for outcome-dependent costs, another greedy strategy provides similar approximation guarantees for OPTcost(S). The algorithm is defined as follows: suppose that so far the algorithm requested labels for x1, . . . , xt and received the corresponding labels y1, . . . , yt. Letting St = {(x1, y1), . . . , (xt, yt)}, denote the current version space by V(St) = {h ∈ H|S | ∀(x, y) ∈ St, h(x) = y}. The next query selected by the algorithm is

    x ∈ argmax_{x∈S} min_{h∈H} |V(St) \ V(St ∪ {(x, h(x))})| / cost(x, h).

That is, the algorithm selects the query that, in the worst case over the possible hypotheses, would remove the most hypotheses from the version space, normalized by the outcome-dependent cost of the query. The algorithm terminates when |V(St)| = 1, and returns the single hypothesis in the version space.

Theorem 5.1. For any cost function cost, hypothesis class H, pool S, and true hypothesis h ∈ H, the cost of the proposed algorithm is at most (ln(|H|S| − 1) + 1) · OPTcost(S).

If cost is the auditing cost, the proposed algorithm corresponds to the following intuitive strategy: at every round, select a query such that, if its result is a negative label, the number of hypotheses removed from the version space is largest. This strategy is consistent with a simple principle based on a partial ordering of the points: for points x, x′ in the pool, define x′ ⪯ x if {h ∈ H | h(x′) = −1} ⊇ {h ∈ H | h(x) = −1}, so that if x′ has a negative label, so does x. In the auditing setting, it is always preferable to query x before querying x′.
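The greedy selection rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's exact algorithm: for simplicity the worst case is taken over hypotheses still in the version space rather than all of H, and zero-cost outcomes (such as positive labels under auditing) are handled with a small denominator floor; all function names are ours.

```python
def version_space(H, labeled):
    """Hypotheses consistent with all labeled examples seen so far."""
    return [h for h in H if all(h(x) == y for x, y in labeled)]

def greedy_query(H, pool, labeled, cost):
    """One step of the greedy rule of Section 5 (illustrative sketch).

    Picks the unqueried point that, in the worst case over hypotheses still
    in the version space, removes the most hypotheses per unit of query cost.
    """
    V = version_space(H, labeled)
    queried = set(x for x, _ in labeled)

    def worst_case_gain(x):
        # For each candidate truth h, count hypotheses eliminated if h(x) is
        # the answer, divided by the cost of that outcome; take the worst case.
        return min(
            sum(1 for g in V if g(x) != h(x)) / max(cost(x, h), 1e-9)
            for h in V
        )

    return max((x for x in pool if x not in queried), key=worst_case_gain)
```

With the auditing cost on threshold functions, this rule selects the point whose negative answer would eliminate the most hypotheses, matching the intuitive strategy described above.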
Therefore, for any realizable auditing problem, there exists an optimal algorithm that adheres to this principle. It is thus encouraging that our greedy algorithm is also consistent with it.

An O(ln(|H|S|)) approximation factor for auditing is less appealing than the same factor for active learning. By information-theoretic arguments, the active label complexity is at least log2(|H|S|) (and hence the approximation at most squares the cost), but this does not hold for auditing. Nonetheless, hardness-of-approximation results for set cover (Feige, 1998), in conjunction with the reduction from set cover of Hyafil and Rivest (1976) mentioned above, imply that such an approximation factor cannot be avoided by a general auditing algorithm.

6 Conclusion and Future Directions

As summarized in Section 2, we show that in the auditing setting, suitable algorithms can achieve improved costs for thresholds on the line and axis-aligned rectangles. There are many open questions suggested by our work. First, it is known that for some hypothesis classes, active learning cannot improve over passive learning for certain distributions (Dasgupta, 2005), and the same is true for auditing. However, exponential speedups are possible for active learning on certain classes of distributions (Balcan et al., 2006; Dasgupta et al., 2008), in particular ones with a small disagreement coefficient (Hanneke, 2007a). It is an open question whether a similar property of the distribution can guarantee an improvement with auditing over active or passive learning. This might be especially relevant to important hypothesis classes such as decision trees or halfspaces. An interesting generalization of the auditing problem is a multiclass setting with a different cost for each label. Finally, one may attempt to optimize other performance measures for auditing, as described in the introduction.
These measures are different from those studied in active learning, and may lead to new algorithmic insights.

References

M. F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 65–72, 2006.

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 49–56. ACM, 2009.

J. Blocki, N. Christin, A. Dutta, and A. Sinha. Regret minimizing audits: A learning-theoretic basis for privacy protection. In Proceedings of the 24th IEEE Computer Security Foundations Symposium, 2011.

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, Oct. 1989.

D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15:201–221, 1994.

S. Dasgupta. Analysis of a greedy active learning strategy. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 337–344. MIT Press, Cambridge, MA, 2005.

S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 353–360. MIT Press, Cambridge, MA, 2008.

L. Devroye and G. Lugosi. Lower bounds in pattern recognition and learning. Pattern Recognition, 28(7):1011–1018, 1995.

U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM (JACM), 45(4):634–652, 1998.

D. Golovin and A. Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011.

A. Gonen, S. Sabato, and S. Shalev-Shwartz. Efficient active learning of halfspaces: an aggressive approach. In The 30th International Conference on Machine Learning (ICML), 2013.

S. Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning (ICML), pages 353–360. ACM, 2007a.

S. Hanneke. Teaching dimension and the complexity of active learning. In Learning Theory, pages 66–81. Springer, 2007b.

S. Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.

L. Hyafil and R. L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17, May 1976.

A. Kapoor, E. Horvitz, and S. Basu. Selective supervision: Guiding supervised learning with decision-theoretic active learning. In Proceedings of IJCAI, 2007.

M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.

S. R. Kulkarni, S. K. Mitter, and J. N. Tsitsiklis. Active learning using arbitrary binary valued queries. Machine Learning, 11(1):23–35, 1993.

P. M. Long and L. Tan. PAC learning axis-aligned rectangles with respect to product distributions from multiple-instance examples. Machine Learning, 30(1):7–21, 1998.

D. D. Margineantu. Active cost-sensitive learning. In Proceedings of IJCAI, 2007.

S. Sabato, A. D. Sarwate, and N. Srebro. Auditing: Active learning with outcome-dependent query costs. arXiv preprint arXiv:1306.2347, 2013.

B. Settles, M. Craven, and L. Friedland. Active learning with real annotation costs. In Proceedings of the NIPS Workshop on Cost-Sensitive Learning, 2008.

V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, XVI(2):264–280, 1971.