{"title": "Noise-Tolerant Interactive Learning Using Pairwise Comparisons", "book": "Advances in Neural Information Processing Systems", "page_first": 2431, "page_last": 2440, "abstract": "We study the problem of interactively learning a binary classifier using noisy labeling and pairwise comparison oracles, where the comparison oracle answers which one in the given two instances is more likely to be positive. Learning from such oracles has multiple applications where obtaining direct labels is harder but pairwise comparisons are easier, and the algorithm can leverage both types of oracles. In this paper, we attempt to characterize how the access to an easier comparison oracle helps in improving the label and total query complexity. We show that the comparison oracle reduces the learning problem to that of learning a threshold function. We then present an algorithm that interactively queries the label and comparison oracles and we characterize its query complexity under Tsybakov and adversarial noise conditions for the comparison and labeling oracles. Our lower bounds show that our label and total query complexity is almost optimal.", "full_text": "Noise-Tolerant Interactive Learning Using\n\nPairwise Comparisons\n\nYichong Xu*, Hongyang Zhang*, Kyle Miller\u2020, Aarti Singh*, and Artur Dubrawski\u2020\n\n*Machine Learning Department, Carnegie Mellon University, USA\n\n\u2020Auton Lab, Carnegie Mellon University, USA\n\n{yichongx, hongyanz, aarti, awd}@cs.cmu.edu,\n\nmille856@andrew.cmu.edu\n\nAbstract\n\nWe study the problem of interactively learning a binary classi\ufb01er using noisy\nlabeling and pairwise comparison oracles, where the comparison oracle answers\nwhich one in the given two instances is more likely to be positive. Learning from\nsuch oracles has multiple applications where obtaining direct labels is harder but\npairwise comparisons are easier, and the algorithm can leverage both types of\noracles. 
In this paper, we attempt to characterize how the access to an easier\ncomparison oracle helps in improving the label and total query complexity. We\nshow that the comparison oracle reduces the learning problem to that of learning a\nthreshold function. We then present an algorithm that interactively queries the label\nand comparison oracles and we characterize its query complexity under Tsybakov\nand adversarial noise conditions for the comparison and labeling oracles. Our lower\nbounds show that our label and total query complexity is almost optimal.\n\nIntroduction\n\n1\nGiven high costs of obtaining labels for big datasets, interactive learning is gaining popularity in\nboth practice and theory of machine learning. On the practical side, there has been an increasing\ninterest in designing algorithms capable of engaging domain experts in two-way queries to facilitate\nmore accurate and more effort-ef\ufb01cient learning systems (c.f. [26, 31]). On the theoretical side, study\nof interactive learning has led to signi\ufb01cant advances such as exponential improvement of query\ncomplexity over passive learning under certain conditions (c.f. [5, 6, 7, 15, 19, 27]). While most of\nthese approaches to interactive learning \ufb01x the form of an oracle, e.g., the labeling oracle, and explore\nthe best way of querying, recent work allows for multiple diverse forms of oracles [12, 13, 16, 33].\nThe focus of this paper is on this latter setting, also known as active dual supervision [4]. 
We investigate how to recover a hypothesis h that is a good approximation of the optimal classifier h*, in terms of the expected 0/1 error Pr_X[h(X) ≠ h*(X)], given limited access to labels on individual instances X ∈ X and to pairwise comparisons about which one of two given instances is more likely to belong to the +1/−1 class.
Our study is motivated by important applications where comparisons are easier to obtain than labels, and the algorithm can leverage both types of oracles to improve label and total query complexity. For example, in material design, synthesizing materials for specific conditions requires expensive experimentation, but with an appropriate algorithm we can leverage the expertise of material scientists, for whom it may be hard to accurately assess the resulting material properties, but who can quickly compare different input conditions and suggest which ones are more promising. Similarly, in clinical settings, precise assessment of each individual patient's health status can be difficult, expensive, and/or risky (e.g., it may require application of invasive sensors or diagnostic surgeries), but comparing the relative statuses of two patients at a time may be relatively easy and accurate. In both these scenarios we may have access to a modest amount of individually labeled data, but the bulk of more accessible training information is available via pairwise comparisons. There are many other examples where

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: Workflow of ADGAC-based algorithms. Left: Procedure of typical active learning algorithms.
Right: Procedure of our proposed ADGAC-based interactive learning algorithm, which has access to both pairwise comparison and labeling oracles.

Table 1: Comparison of various methods for learning a generic hypothesis class (omitting log(1/ε) factors).

Label Noise            | Work | # Label             | # Query                 | Tolcomp
Tsybakov (κ)           | [18] | Õ((1/ε)^{2κ−2} dθ)  | Õ((1/ε)^{2κ−2} dθ)      | N/A
Tsybakov (κ)           | Ours | Õ((1/ε)^{2κ−2})     | Õ((1/ε)^{2κ−2} θ + dθ)  | O(ε^{2κ})
Adversarial (ν = O(ε)) | [19] | Õ(dθ)               | Õ(dθ)                   | N/A
Adversarial (ν = O(ε)) | Ours | Õ(1)                | Õ(dθ)                   | O(ε²)

humans find it easier to perform pairwise comparisons than to provide direct labels, including content search [17], image retrieval [31], ranking [21], etc.
Despite many successful applications of comparison oracles, many fundamental questions remain. One of them is how to design noise-tolerant, cost-efficient algorithms that can approximate the unknown target hypothesis to arbitrary accuracy while having access to pairwise comparisons. On one hand, while there is theoretical analysis of pairwise comparisons concerning the tasks of learning to rank [3, 22], estimating ordinal measurement models [28], and learning combinatorial functions [11], much remains unknown about how to extend these results to more generic hypothesis classes. On the other hand, although we have seen great progress on using single or multiple oracles with the same form of interaction [9, 16], classification using both comparison and labeling queries remains an interesting open problem. Independently of our work, Kane et al.
[23] concurrently analyzed a similar setting of learning to classify using both label and comparison queries. However, their algorithms work only in the noise-free setting.
Our Contributions: Our work addresses the aforementioned issues by presenting a new algorithm, Active Data Generation with Adversarial Comparisons (ADGAC), which learns a classifier with both noisy labeling and noisy comparison oracles.
• We analyze ADGAC under Tsybakov (TNC) [30] and adversarial noise conditions for the labeling oracle, along with the adversarial noise condition for the comparison oracle. Our general framework can augment any active learning algorithm by replacing the batch sampling in these algorithms with ADGAC. Figure 1 presents the workflow of our framework.
• We propose the A2-ADGAC algorithm, which can learn an arbitrary hypothesis class. The label complexity of the algorithm is as small as that of learning a threshold function under both TNC and the adversarial noise condition, independently of the structure of the hypothesis class. The total query complexity improves over the previous best-known results under TNC, which can only access the labeling oracle.
• We derive Margin-ADGAC to learn the class of halfspaces. This algorithm has the same label and total query complexity as A2-ADGAC, but is computationally efficient.
• We present lower bounds on the total query complexity for any algorithm that can access both labeling and comparison oracles, and a noise-tolerance lower bound for our algorithms. These lower bounds demonstrate that our analysis is nearly optimal.
An important quantity governing the performance of our algorithms is the adversarial noise level of comparisons: denote by Tolcomp(ε, δ, A) the adversarial noise tolerance level of comparisons that guarantees an algorithm A achieves an error of ε with probability at least 1 − δ.
Table 1 compares our results with previous work in terms of label complexity, total query complexity, and Tolcomp for a generic hypothesis class C with error ε. We see that our results significantly improve over prior work with the extra comparison oracle. Denote by d the VC dimension of C and by θ the disagreement coefficient. We also compare the results in Table 2 for learning halfspaces under isotropic log-concave distributions. In both cases, our algorithms enjoy small label complexity that is independent of θ and d. This is helpful when labels are very expensive to obtain. Our algorithms also enjoy better total query complexity under both TNC and the adversarial noise condition for efficiently learning halfspaces.

Table 2: Comparison of various methods for learning of halfspaces (omitting log(1/ε) factors).

Label Noise            | Work | # Label            | # Query             | Tolcomp   | Efficient?
Massart                | [8]  | Õ(d)               | Õ(d)                | N/A       | No
Massart                | [5]  | poly(d)            | poly(d)             | N/A       | Yes
Massart                | Ours | Õ(1)               | Õ(d)                | O(ε²)     | Yes
Tsybakov (κ)           | [19] | Õ((1/ε)^{2κ−2} dθ) | Õ((1/ε)^{2κ−2} dθ)  | N/A       | No
Tsybakov (κ)           | Ours | Õ((1/ε)^{2κ−2})    | Õ((1/ε)^{2κ−2} + d) | O(ε^{2κ}) | Yes
Adversarial (ν = O(ε)) | [34] | Õ(d)               | Õ(d)                | N/A       | No
Adversarial (ν = O(ε)) | [6]  | Õ(d²)              | Õ(d²)               | N/A       | Yes
Adversarial (ν = O(ε)) | Ours | Õ(1)               | Õ(d)                | O(ε²)     | Yes

2 Preliminaries
Notations: We study the problem of learning a classifier h : X → Y = {−1, 1}, where X and Y are the instance space and label space, respectively. Denote by PXY the distribution over X × Y and let PX be the marginal distribution over X. A hypothesis class C is a set of functions h : X → Y. For any function h, define the error of h under a distribution D over X × Y as errD(h) = Pr_{(X,Y)∼D}[h(X) ≠ Y]. Let err(h) = err_{PXY}(h). Suppose that h* ∈ C satisfies err(h*) = inf_{h∈C} err(h); for simplicity, we assume that such an h* exists in C.
We apply the concept of disagreement coefficient from Hanneke [18] for a generic hypothesis class. In particular, for any set V ⊆ C, denote by DIS(V) = {x ∈ X : ∃h1, h2 ∈ V, h1(x) ≠ h2(x)}. The disagreement coefficient is defined as θ = sup_{r>0} Pr_{X∼PX}[X ∈ DIS(B(h*, r))]/r, where B(h*, r) = {h ∈ C : Pr_{X∼PX}[h(X) ≠ h*(X)] ≤ r}.
Problem Setup: We analyze two kinds of noise conditions for the labeling oracle, namely, the adversarial noise condition and the Tsybakov noise condition (TNC). We formally define them as follows.
Condition 1 (Adversarial Noise Condition for Labeling Oracle).
Distribution PXY satisfies the adversarial noise condition for the labeling oracle with parameter ν ≥ 0 if ν = Pr_{(X,Y)∼PXY}[Y ≠ h*(X)].
Condition 2 (Tsybakov Noise Condition for Labeling Oracle). Distribution PXY satisfies the Tsybakov noise condition for the labeling oracle with parameters κ ≥ 1, μ ≥ 0 if for all h : X → {−1, 1}, err(h) − err(h*) ≥ μ Pr_{X∼PX}[h(X) ≠ h*(X)]^κ. Also, h* is the Bayes optimal classifier, i.e., h*(x) = sign(η(x) − 1/2),¹ where η(x) = Pr[Y = 1|X = x]. The special case of κ = 1 is also called the Massart noise condition.
In the classic active learning scenario, the algorithm has access to an unlabeled pool drawn from PX. The algorithm can then query the labeling oracle for any instance from the pool. The goal is to find an h ∈ C such that the error Pr[h(X) ≠ h*(X)] ≤ ε.² The labeling oracle takes as input x ∈ X and outputs y ∈ {−1, 1} according to PXY. In our setting, however, an extra comparison oracle is available. This oracle takes as input a pair of instances (x, x′) ∈ X × X and returns a variable Z(x, x′) ∈ {−1, 1}, where Z(x, x′) = 1 indicates that x is more likely to be positive, and Z(x, x′) = −1 otherwise. In this paper, we discuss an adversarial noise condition for the comparison oracle; handling TNC for the comparison oracle is discussed in the appendix.
Condition 3 (Adversarial Noise Condition for Comparison Oracle). Distribution PXXZ satisfies adversarial noise with parameter ν′ ≥ 0 if ν′ = Pr[Z(X, X′)(h*(X) − h*(X′)) < 0].

¹The assumption that h* is the Bayes optimal classifier can be relaxed if the approximation error of h* can be quantified under assumptions on the decision boundary (c.f. [15]).
²Note that we use the disagreement Pr[h(X) ≠ h*(X)] instead of the excess error err(h) − err(h*) used elsewhere in the literature. The two conditions can be linked by assuming a two-sided version of Tsybakov noise (see, e.g., Audibert 2004).

Table 3: Summary of notations.

Notation | Meaning                       | Notation | Meaning
C        | Hypothesis class              | κ        | Tsybakov noise level (labeling)
X, X     | Instance & instance space     | ν        | Adversarial noise level (labeling)
Y, Y     | Label & label space           | ν′       | Adversarial noise level (comparison)
Z, Z     | Comparison & comparison space | errD(h)  | Error of h on distribution D
d        | VC dimension of C             | SClabel  | Label complexity
θ        | Disagreement coefficient      | SCcomp   | Comparison complexity
h*       | Optimal classifier in C       | Tollabel | Noise tolerance (labeling)
g*       | Optimal scoring function      | Tolcomp  | Noise tolerance (comparison)

Note that we do not make any assumptions on the randomness of Z: Z(X, X′) can be either random or deterministic as long as the joint distribution PXXZ satisfies Condition 3.
For an interactive learning algorithm A, given error ε and failure probability δ, let SCcomp(ε, δ, A) and SClabel(ε, δ, A) be the comparison and label complexity, respectively. The query complexity of A is defined as the sum of the label and comparison complexity. Similar to the definition of Tolcomp(ε, δ, A), define Tollabel(ε, δ, A) as the maximum ν such that algorithm A achieves an error of at most ε with probability 1 − δ.
As a summary, A learns an h such that Pr[h(X) ≠ h*(X)] ≤ ε with probability 1 − δ using SCcomp(ε, δ, A) comparisons and SClabel(ε, δ, A) labels, provided ν ≤ Tollabel(ε, δ, A) and ν′ ≤ Tolcomp(ε, δ, A). We omit the parameters of SCcomp, SClabel, Tolcomp, and Tollabel when they are clear from context. We use O(·) to express sample complexity and noise tolerance, and Õ(·) to suppress log factors. Table 3 summarizes the main notations used throughout the paper.
3 Active Data Generation with Adversarial Comparisons (ADGAC)
The hardness of learning from pairwise comparisons stems from the error of the comparison oracle: the comparisons are noisy, and can be asymmetric and intransitive, meaning that a human might give contradictory preferences like x1 ≻ x2 ≻ x1 or x1 ≻ x2 ≻ x3 ≻ x1 (here ≻ denotes some preference). This makes traditional methods, e.g., defining a function class {h : h(x) = Z(x, x̂), x̂ ∈ X}, fail, because such a class may have infinite VC dimension.
In this section, we propose a novel algorithm, ADGAC, to address this issue. Having access to both comparison and labeling oracles, ADGAC generates a labeled dataset by techniques inspired by group-based binary search. We show that ADGAC can be combined with any active learning procedure to obtain interactive algorithms that can utilize both labeling and comparison oracles. We provide theoretical guarantees for ADGAC.
3.1 Algorithm Description
To illustrate ADGAC, we start with a general active learning framework in Algorithm 1. Many active learning algorithms can be adapted to this framework, such as A2 [7] and margin-based active algorithms [6, 5].
Here U represents the querying space/disagreement region of the algorithm (i.e., we reject an instance x if x ∉ U), and V represents a version space consisting of potential classifiers. For example, the A2 algorithm can be adapted to Algorithm 1 straightforwardly by keeping U as the sample space and V as the version space. More concretely, the A2 algorithm [7] for adversarial noise can be characterized by

U0 = X, V0 = C, fV(U, V, W, i) = {h : |W| errW(h) ≤ ni εi}, fU(U, V, W, i) = DIS(V),

where εi and ni are parameters of the A2 algorithm, and DIS(V) = {x ∈ X : ∃h1, h2 ∈ V, h1(x) ≠ h2(x)} is the disagreement region of V. Margin-based active learning [6] can also be fitted into Algorithm 1 by taking V as the halfspace that (approximately) minimizes the hinge loss, and U as the region within the margin of that halfspace.
To efficiently apply the comparison oracle, we propose to replace step 4 in Algorithm 1 with a subroutine, ADGAC, that has access to both comparison and labeling oracles. Subroutine 2 describes ADGAC. It takes as input a dataset S and a sampling number k. ADGAC first runs the Quicksort algorithm on S using feedback from the comparison oracle, which is of the form Z(x, x′). Given that the comparison oracle Z(·,·) might be asymmetric w.r.t. its two arguments, i.e., Z(x, x′) may not equal Z(x′, x), for each pair (xi, xj) we randomly choose (xi, xj) or (xj, xi) as the input to Z(·,·). After Quicksort, the algorithm divides the data into groups of size αm and does group-based binary search by sampling k labels from each group and determining the label of each group by majority vote.

Algorithm 1 Active Learning Framework
Input: ε, δ, a sequence of ni, functions fU, fV.
1: Initialize U ← U0 ⊆ X, V ← V0 ⊆ C.
2: for i = 1, 2, ..., log(1/ε) do
3:   Sample unlabeled dataset S̃ of size ni. Let S ← {x : x ∈ S̃, x ∈ U}.
4:   Request the labels of x ∈ S and obtain W ← {(xi, yi) : xi ∈ S}.
5:   Update V ← fV(U, V, W, i), U ← fU(U, V, W, i).
Output: Any classifier ĥ ∈ V.

Subroutine 2 Active Data Generation with Adversarial Comparison (ADGAC)
Input: Dataset S with |S| = m, n, ε, k.
1: α ← εn/(2m).
2: Define a preference relation on S according to Z. Run Quicksort on S to rank elements in increasing order. Obtain a sorted list S = (x1, x2, ..., xm).
3: Divide S into groups of size αm: Si = {x_{(i−1)αm+1}, ..., x_{iαm}}, i = 1, 2, ..., 1/α.
4: tmin ← 1, tmax ← 1/α.
5: while tmin < tmax do    ▷ Do binary search
6:   t ← ⌊(tmin + tmax)/2⌋.
7:   Sample k points uniformly without replacement from St and obtain the labels Y = {y1, ..., yk}.
8:   If Σ_{i=1}^k yi ≥ 0, then tmax ← t; else tmin ← t + 1.
9: For t′ > t and xi ∈ St′, let ŷi ← 1.
10: For t′ < t and xi ∈ St′, let ŷi ← −1.
11: For xi ∈ St, let ŷi be the majority label of the labeled points in St.
Output: Predicted labels ŷ1, ŷ2, ..., ŷm.

For an active learning algorithm A, let A-ADGAC be the algorithm obtained by replacing step 4 with ADGAC using parameters (Si, ni, εi, ki), where εi, ki are chosen as additional parameters of the algorithm. We establish results for specific A, namely A2 and margin-based active learning, in Sections 4 and 5, respectively.
3.2 Theoretical Analysis of ADGAC
Before we combine ADGAC with active learning algorithms, we provide theoretical results for ADGAC.
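To make the flow of Subroutine 2 concrete, here is a minimal Python sketch. The oracle callables `compare` and `label`, and the way constants are folded into the group size and vote count `k`, are illustrative assumptions; the actual choices are governed by the analysis in Section 3.2.

```python
import random

def adgac(S, n, eps, k, compare, label):
    """Sketch of ADGAC: rank S with the (noisy, possibly asymmetric)
    comparison oracle, then do group-based binary search with
    majority-vote label queries."""
    def quicksort(items):
        # Quicksort driven by the comparison oracle; argument order is
        # randomized since Z(x, x') may not equal -Z(x', x).
        if len(items) <= 1:
            return items
        pivot, rest = items[0], items[1:]
        left, right = [], []
        for x in rest:
            if random.random() < 0.5:
                x_more_positive = compare(x, pivot) == 1
            else:
                x_more_positive = compare(pivot, x) != 1
            (right if x_more_positive else left).append(x)
        return quicksort(left) + [pivot] + quicksort(right)

    ranked = quicksort(list(S))
    m = len(ranked)
    g = max(1, round(eps * n / 2))  # group size alpha*m with alpha = eps*n/(2m)
    groups = [ranked[i:i + g] for i in range(0, m, g)]

    lo, hi = 0, len(groups) - 1
    while lo < hi:                  # binary search for the -1/+1 boundary group
        t = (lo + hi) // 2
        votes = [label(x) for x in random.sample(groups[t], min(k, len(groups[t])))]
        if sum(votes) >= 0:
            hi = t                  # majority positive: boundary here or earlier
        else:
            lo = t + 1

    labels = {}
    for i, grp in enumerate(groups):
        if i != lo:
            val = -1 if i < lo else 1
        else:                       # boundary group: majority vote of k labels
            votes = [label(x) for x in random.sample(grp, min(k, len(grp)))]
            val = 1 if sum(votes) >= 0 else -1
        for x in grp:
            labels[x] = val
    return labels
```

With noiseless oracles and a true threshold, the output labeling is exact; under noise, the theorems below bound the number of mislabeled points by εn.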
By the algorithmic procedure, ADGAC reduces the problem of labeling the whole dataset S to binary searching for a threshold on the sorted list S. One can show that there cannot be too many conflicting instances within each group Si, and thus binary search performs well in our algorithm. We also use results from [3] to give an error estimate for Quicksort. We have the following result based on the above arguments.
Theorem 4. Suppose that Conditions 2 and 3 hold for κ ≥ 1, ν′ ≥ 0, and n = Ω((1/ε)^{2κ−1} log(1/δ)). Assume a set S̃ with |S̃| = n is sampled i.i.d. from PX and S ⊆ S̃ is an arbitrary subset of S̃ with |S| = m. There exist absolute constants C1, C2, C3 such that if we run Subroutine 2 with ε < C1, ν′ ≤ C2 ε^{2κ} δ, and k = k(1)(ε, δ) := C3 log(log(1/ε)/δ) (1/ε)^{2κ−2}, it will output a labeling of S such that |{xi ∈ S : ŷi ≠ h*(xi)}| ≤ εn, with probability at least 1 − δ. The expected number of comparisons required is O(m log m), and the number of sample-label pairs required is SClabel(ε, δ) = Õ(log(m/(εn)) (1/ε)^{2κ−2}).
Similarly, we analyze ADGAC under the adversarial noise condition w.r.t. the labeling oracle with ν = O(ε).
Theorem 5. Suppose that Conditions 1 and 3 hold for ν, ν′ ≥ 0, and n = Ω((1/ε) log(1/δ)). Assume a set S̃ with |S̃| = n is sampled i.i.d. from PX and S ⊆ S̃ is an arbitrary subset of S̃ with |S| = m. There exist absolute constants C1, C2, C3, C4 such that if we run Subroutine 2 with ε < C1, ν′ ≤ C2 ε² δ, k = k(2)(ε, δ) := C3 log(log(1/ε)/δ), and ν ≤ C4 ε, it will output a labeling of S such that |{xi ∈ S : ŷi ≠ h*(xi)}| ≤ εn, with probability at least 1 − δ. The expected number of comparisons required is O(m log m), and the number of sample-label pairs required is SClabel(ε, δ) = O(log(m/(εn)) log(log(1/ε)/δ)).
Proof Sketch. We call a pair (xi, xj) an inverse pair if Z(xi, xj) = −1, h*(xi) = 1, h*(xj) = −1, and an anti-sort pair if h*(xi) = 1, h*(xj) = −1, and i < j. We show that the expected number of inverse pairs is at most n(n − 1)ν′. By the results in [3], the numbers of inverse pairs and anti-sort pairs have the same expectation, and the actual number of anti-sort pairs can be bounded by Markov's inequality. Then we show that the majority label of the groups must be −1 at the beginning of the list and change to 1 at some point of the list. With a careful choice of k, we obtain the true majority of each queried group under Tsybakov noise, and thus end up at the turning point of the list. The error is then bounded by the size of the groups. See the appendix for the complete proof.
Theorems 4 and 5 show that ADGAC produces a labeling of the dataset with arbitrarily small error using label complexity independent of the data size. Moreover, ADGAC is computationally efficient since it only involves binary search.
These nice properties of ADGAC lead to improved query complexity when we combine ADGAC with other active learning algorithms.
4 A2-ADGAC: Learning of Generic Hypothesis Classes
In this section, we combine ADGAC with the A2 algorithm to learn a generic hypothesis class. We use the framework in Algorithm 1: let A2-ADGAC be the algorithm that replaces step 4 in Algorithm 1 with ADGAC with parameters (S, ni, εi, ki), where ni, εi, ki are parameters to be specified later. Under TNC, we have the following result.
Theorem 6. Suppose that Conditions 2 and 3 hold, and h*(x) = sign(η(x) − 1/2). There exist global constants C1, C2 such that if we run A2-ADGAC with ε < C1, δ, ν′ ≤ Tolcomp(ε, δ) = C2 ε^{2κ} δ, εi = 2^{−(i+2)}, ni = Ω((1/εi)^{2κ−1}(d log(1/εi) + log(4 log(1/ε)/δ))), and ki = k(1)(εi, δ/(4 log(1/ε))) with k(1) specified in Theorem 4, then with probability at least 1 − δ the algorithm will return a classifier ĥ with Pr[ĥ(X) ≠ h*(X)] ≤ ε, with comparison and label complexity
E[SCcomp] = Õ(θ log²(1/ε)(d log(dθ) + (1/ε)^{2κ−2}) log(1/δ)),
SClabel = Õ((1/ε)^{2κ−2} log(1/ε) log(min{1/ε, θ}) log(1/δ)).
The dependence on log²(1/ε) in SCcomp can be reduced to log(1/ε) under Massart noise.
We can prove a similar result for the adversarial noise condition.
Theorem 7. Suppose that Conditions 1 and 3 hold. There exist global constants C1, C2, C3 such that if we run A2-ADGAC with ε < C1, δ, ν′ ≤ Tolcomp(ε, δ) = C2 ε² δ, ν ≤ Tollabel(ε, δ) = C3 ε, εi = 2^{−(i+2)}, ni = Ω̃((1/εi)(d log(1/εi) + log(4 log(1/ε)/δ))), and ki = k(2)(εi, δ/(4 log(1/ε))) with k(2) specified in Theorem 5, then with probability at least 1 − δ the algorithm will return a classifier ĥ with Pr[ĥ(X) ≠ h*(X)] ≤ ε, with comparison and label complexity
E[SCcomp] = Õ(θ d log(θd) log(1/ε) log(1/δ)),
SClabel = Õ(log(1/ε) log(min{1/ε, θ}) log(1/δ)).
The proofs of Theorems 6 and 7 use Theorems 4 and 5 together with standard manipulations from VC theory. Theorems 6 and 7 show that access to even a biased comparison oracle can reduce the problem of learning a classifier in a high-dimensional space to that of learning a threshold classifier in one-dimensional space, as the label complexity matches that of actively learning a threshold classifier. Given that comparisons are usually easier to obtain, A2-ADGAC can save substantial labeling effort in practice. More importantly, we improve the total query complexity under TNC by separating the dependence on d and ε: the query complexity is now the sum of the two terms instead of their product. This observation shows the power of pairwise comparisons for learning classifiers.
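To see why the reduction to a one-dimensional threshold is so cheap in labels, consider a minimal sketch (not the paper's algorithm itself) of actively learning a threshold over a ranked list by binary search; the repetition parameter `k` is a stand-in for the majority-vote strategy the analysis formalizes.

```python
def learn_threshold(points, label, k=1):
    """Actively learn a 1-D threshold over a list already sorted by score:
    binary search for the first +1 point, taking a majority over k repeated
    label queries per probed point to tolerate bounded label noise."""
    lo, hi = 0, len(points)          # invariant: first +1 index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        votes = sum(label(points[mid]) for _ in range(k))
        if votes >= 0:               # looks positive: boundary at or before mid
            hi = mid
        else:
            lo = mid + 1
    return lo                        # index of the first point labeled +1
```

This uses only about k⌈log₂ n⌉ label queries regardless of the ambient dimension, which is the behavior the comparison oracle buys for a generic hypothesis class.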
Such small label/query complexity is impossible without access to a comparison oracle, since the query complexity with only a labeling oracle is at least Ω(d (1/ε)^{2κ−2}) and Ω(d log(1/ε)) under TNC and the adversarial noise condition, respectively [19]. Our results also match the lower bound for learning with labeling and comparison oracles up to log factors (see Section 6).
We note that Theorems 6 and 7 require a rather small Tolcomp, equal to O(ε^{2κ} δ) and O(ε² δ), respectively. We will show in Section 6.3 that it is necessary to require Tolcomp = O(ε²) in order to obtain a classifier of error ε if we restrict the use of the labeling oracle to only learning a threshold function. Such a restriction suffices to reach the near-optimal label complexity specified in Theorems 6 and 7.
5 Margin-ADGAC: Learning of Halfspaces
In this section, we combine ADGAC with margin-based active learning [6] to efficiently learn the class of halfspaces. Before proceeding, we first mention a naive idea for utilizing comparisons: we can sample pairs (x1, x2) i.i.d. from PX × PX and use Z(x1, x2) as the label of x1 − x2, where Z is the feedback from the comparison oracle. However, this method cannot work well in our setting without additional assumptions on the noise condition for the labeling Z(x1, x2).
We assume that PX is isotropic log-concave on R^d, i.e., PX has mean 0 and covariance I, and the logarithm of its density function is concave [5, 6]. The hypothesis class of halfspaces can be represented as C = {h : h(x) = sign(w · x), w ∈ R^d}. Denote h*(x) = sign(w* · x) for some w* ∈ R^d. Define l_τ(w, x, y) = max(1 − y(w · x)/τ, 0) and l_τ(w, W) = (1/|W|) Σ_{(x,y)∈W} l_τ(w, x, y) as the hinge loss.
The expected hinge loss of w is L_τ(w, D) = E_{x∼D}[l_τ(w, x, sign(w* · x))].

Margin-based active learning [6] is a concrete instance of Algorithm 1, obtained by taking V as a singleton set containing the hinge loss minimizer, and taking U as the margin region around that minimizer. More concretely, take U_0 = X and V_0 = {w_0} for some w_0 such that θ(w_0, w*) ≤ π/2. The algorithm works with constants M ≥ 2, κ < 1/2 and a set of parameters r_i, τ_i, b_i, z_i that equal Θ(M^{−i}) (see the proof in the Appendix for formal definitions of these parameters). V always contains a single hypothesis. Suppose V = {w_{i−1}} in iteration i − 1, and let v_i satisfy l_{τ_i}(v_i, W) ≤ min_{v: ‖v−w_{i−1}‖_2 ≤ r_i, ‖v‖_2 ≤ 1} l_{τ_i}(v, W) + κ/8, where w_i is the content of V in iteration i. We then have f_V(V, W, i) = {w_i} = {v_i/‖v_i‖_2} and f_U(U, V, W, i) = {x : |w_i · x| ≤ b_i}.

Let Margin-ADGAC be the algorithm obtained by replacing the sampling step in margin-based active learning with ADGAC using parameters (S, n_i, ε_i, k_i), where n_i, ε_i, k_i are additional parameters to be specified later. We have the following results under TNC and adversarial noise conditions, respectively.

Theorem 8. Suppose that Conditions 2 and 3 hold, and h*(x) = sign(w* · x) = sign(η(x) − 1/2). There are settings of M, κ, r_i, τ_i, b_i, ε_i, k_i, and constants C_1, C_2, such that for all ε ≤ C_1 and ν′ ≤ Tol_comp(ε, δ) = C_2 ε^{2κ} δ, if we run Margin-ADGAC with w_0 such that θ(w_0, w*) ≤ π/2 and n_i = Õ((1/ε_i)(d log³(dk/δ) + (1/ε)^{2κ−1} log(1/δ))), it finds ŵ such that Pr[sign(ŵ · X) ≠ sign(w* · X)] ≤ ε with probability at least 1 − δ. The comparison and label complexity are

E[SC_comp] = Õ( log²(1/ε) ( d log⁴(d/δ) + (1/ε)^{2κ−2} log(1/δ) ) ),

SC_label = Õ( log(1/ε) log(1/δ) (1/ε)^{2κ−2} ).

The dependence on log²(1/ε) in SC_comp can be reduced to log(1/ε) under Massart noise.

Theorem 9. Suppose that Conditions 1 and 3 hold. There are settings of M, κ, r_i, τ_i, b_i, ε_i, k_i, and constants C_1, C_2, C_3, such that for all ε ≤ C_1, ν′ ≤ Tol_comp(ε, δ) = C_2 ε² δ, and ν ≤ C_3 ε, if we run Margin-ADGAC with n_i = Õ((1/ε_i) d log³(dk/δ)) and w_0 such that θ(w_0, w*) ≤ π/2, it finds ŵ such that Pr[sign(ŵ · X) ≠ sign(w* · X)] ≤ ε with probability at least 1 − δ. The comparison and label complexity are

E[SC_comp] = Õ( log(1/ε) d log⁴(d/δ) ), SC_label = Õ( log(1/ε) log(1/δ) ).

The proofs of Theorems 8 and 9 differ from the conventional analysis of margin-based active learning in two aspects: (a) since we use labels generated by ADGAC, which are not sampled independently from the distribution P_XY, we require new techniques that can deal with adaptive noise; (b) we improve the results of [6] in the dependence on d via a new Rademacher analysis. Theorems 8 and 9 enjoy better label and query complexity than previous results (see Table 2).
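To make the update step concrete, here is a small NumPy sketch of one iteration of the constrained hinge-loss minimization described above. It is a hypothetical illustration, not the paper's implementation: we approximate the constrained minimizer v_i by projected subgradient descent over the ball {v : ‖v − w_{i−1}‖₂ ≤ r_i, ‖v‖₂ ≤ 1}, normalize to obtain w_i, and form the margin region f_U. The step size and iteration count are arbitrary choices.

```python
import numpy as np

def margin_step(w_prev, X, y, tau, r, n_steps=500, lr=0.05):
    """Approximate one margin-based iteration: minimize the tau-hinge loss over
    {v : ||v - w_prev||_2 <= r, ||v||_2 <= 1} by projected subgradient descent,
    then return w_i = v_i / ||v_i||_2 (the new content of V)."""
    v = w_prev.astype(float).copy()
    for _ in range(n_steps):
        margins = y * (X @ v) / tau
        active = margins < 1.0                      # points with nonzero hinge loss
        grad = -(X[active] * y[active, None]).sum(axis=0) / (tau * len(X))
        v = v - lr * grad
        d = v - w_prev                              # project onto the ball around w_prev ...
        if np.linalg.norm(d) > r:
            v = w_prev + r * d / np.linalg.norm(d)
        if np.linalg.norm(v) > 1.0:                 # ... then onto the unit ball
            v = v / np.linalg.norm(v)
    return v / np.linalg.norm(v)

def margin_region(X, w, b):
    """f_U: boolean mask of instances in the margin region {x : |w . x| <= b}."""
    return np.abs(X @ w) <= b
```

On linearly separable data the iterate rotates toward w*, and the next round of (comparison-denoised) labels is requested only inside the band |w_i · x| ≤ b_i.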
We mention that while Yan and Zhang [32] proposed a perceptron-like algorithm with label complexity as small as Õ(d log(1/ε)) under Massart and adversarial noise conditions, their algorithm works only under uniform distributions over the instance space. In contrast, our algorithm Margin-ADGAC works under the broad class of log-concave distributions. The label and total query complexity of Margin-ADGAC improve over those of traditional active learning, and the lower bounds in Section 6 show the optimality of our complexity.

6 Lower Bounds

In this section, we give lower bounds on learning using labeling and pairwise comparison oracles. In Section 6.1, we give a lower bound on the optimal label complexity SC_label. In Section 6.2 we use this result to give a lower bound on the total query complexity, i.e., the sum of comparison and label complexity. Our two methods match these lower bounds up to log factors. In Section 6.3, we additionally give an information-theoretic bound on Tol_comp, which matches our algorithms in the case of Massart and adversarial noise.

Following [19, 20], we assume that there is an underlying score function g* such that h*(x) = sign(g*(x)). Note that g* need not be related to η(x); we only require that g*(x) represent how likely a given x is to be positive. For instance, in digit recognition, g*(x) represents how much an image looks like a 7 (or a 9); in the clinical setting, g*(x) measures the health condition of a patient. Suppose that the distribution of g*(X) is continuous, i.e., its probability density function exists and for every t ∈ R, Pr[g*(X) = t] = 0.

6.1 Lower Bound on Label Complexity

The definition of g* naturally induces a comparison oracle Z with Z(x, x′) = sign(g*(x) − g*(x′)). We note that this oracle is invariant to shifts of g*, i.e., g* and g* + t lead to the same comparison oracle.
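The shift-invariance can be checked numerically, and the same sketch shows why the labeling oracle is still needed: comparisons only sort the instances by score, after which a binary search with O(log n) label queries locates the threshold. The score function g_star and the noiseless oracles below are hypothetical stand-ins for g* and Z, not constructions from the paper.

```python
import numpy as np
from functools import cmp_to_key

def comparison_oracle(g):
    """Oracle induced by a score function g: Z(x, x') = sign(g(x) - g(x'))."""
    return lambda x, xp: int(np.sign(g(x) - g(xp)))

g_star = lambda x: 2.0 * x - 1.0                   # hypothetical score function
Z = comparison_oracle(g_star)
Z_shifted = comparison_oracle(lambda x: g_star(x) + 5.0)

xs = list(np.linspace(-3.0, 3.0, 25))
# g* and g* + t answer every pairwise query identically ...
assert all(Z(a, b) == Z_shifted(a, b) for a in xs for b in xs)
# ... even though sign(g*) and sign(g* + t) disagree on many instances.

def first_positive(points, Z, label):
    """Sort by comparisons, then binary-search the boundary with O(log n) labels."""
    pts = sorted(points, key=cmp_to_key(Z))
    lo, hi = 0, len(pts)
    while lo < hi:
        mid = (lo + hi) // 2
        if label(pts[mid]) > 0:
            hi = mid                               # boundary at mid or earlier
        else:
            lo = mid + 1
    return pts[lo] if lo < len(pts) else None
```

With noiseless labels, first_positive recovers the smallest x with g*(x) > 0 using only about ⌈log₂ 25⌉ = 5 label queries on these 25 points; making this sort-then-search reduction tolerant to noisy comparisons and labels is exactly what ADGAC addresses.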
As a result, we cannot distinguish g* from g* + t without labels. In other words, pairwise comparisons do not help in improving label complexity when we are learning a threshold function on R, where all instances are already in their natural order. The label complexity of any algorithm is therefore lower bounded by that of learning a threshold classifier, which we formalize in the following theorem.

Theorem 10. For any algorithm A that can access both labeling and comparison oracles, sufficiently small ε, δ, and any score function g that takes at least two values on X, there exists a distribution P_XY satisfying Condition 2 such that the optimal function is of the form h*(x) = sign(g(x) + t) for some t ∈ R and

SC_label(ε, δ, A) = Ω( (1/ε)^{2κ−2} log(1/δ) ).    (1)

If P_XY satisfies Condition 1 with ν = O(ε), SC_label satisfies (1) with κ = 1.

The lower bound in Theorem 10 matches the label complexity of A²-ADGAC and Margin-ADGAC up to a log factor, so our algorithms are near-optimal.

6.2 Lower Bound on Total Query Complexity

We use Theorem 10 to give lower bounds on the total query complexity of any algorithm that can access both comparison and labeling oracles.

Theorem 11. For any algorithm A that can access both labeling and comparison oracles, and sufficiently small ε, δ, there exists a distribution P_XY satisfying Condition 2 such that

SC_comp(ε, δ, A) + SC_label(ε, δ, A) = Ω( (1/ε)^{2κ−2} log(1/δ) + d log(1/ε) ).    (2)

If P_XY satisfies Condition 1 with ν = O(ε), SC_comp + SC_label satisfies (2) with κ = 1.

The first term of (2) follows from Theorem 10, whereas the second term follows from transforming a lower bound of active learning with access to only the labeling oracle. The lower bounds in Theorem 11 match the performance of A²-ADGAC and Margin-ADGAC up to log factors.

6.3 Adversarial Noise Tolerance of Comparisons

Note that label queries are typically expensive in practice. Thus it is natural to ask the following question: what is the minimal requirement on ν′, given that we are only allowed the minimal label complexity of Theorem 10? We study this problem in this section. More concretely, we study the requirement on ν′ when we learn a threshold function using labels. Suppose that the comparison oracle gives feedback using a scoring function ĝ, i.e., Z(x, x′) = sign(ĝ(x) − ĝ(x′)), and has error ν′. We give a sharp minimax bound on the risk of the optimal classifier of the form h(x) = sign(ĝ(x) − t) for some t ∈ R below.

Theorem 12. Suppose that min{Pr[h*(X) = 1], Pr[h*(X) = −1]} ≥ √ν′ and both ĝ(X) and g*(X) have probability density functions. If ĝ(X) induces an oracle with error ν′, then min_t max_{ĝ,g*} Pr[sign(ĝ(X) − t) ≠ h*(X)] = √ν′.

The proof is technical and omitted. By Theorem 12, we see that the condition ν′ = O(ε²) is necessary if labels from g* are used only to learn a threshold on ĝ. This matches our choice of ν′ under the Massart and adversarial noise conditions for the labeling oracle (up to a factor of δ).

7 Conclusion

We presented a general algorithmic framework, ADGAC, for learning with both comparison and labeling oracles. We proposed two variants of the base algorithm, A²-ADGAC and Margin-ADGAC, to facilitate low query complexity under Tsybakov and adversarial noise conditions.
The performance of our algorithms matches the lower bounds for learning with both oracles. Our analysis is relevant to a wide range of practical applications where it is easier, less expensive, and/or less risky to obtain pairwise comparisons than labels.

There are multiple directions for future work. One improvement over our work would be to show complexity bounds for the excess risk err(h) − err(h*) instead of Pr[h ≠ h*]. Also, our bound on comparison complexity holds in expectation due to the use of quicksort; deriving concentration inequalities on the comparison complexity would be helpful. An adaptive algorithm that adjusts to different levels of noise in labels and comparisons would also be interesting, i.e., one that uses labels when comparisons are noisy and comparisons when labels are noisy. Other directions include using comparisons (or, more broadly, rankings) for other machine learning tasks such as regression or matrix completion.

Acknowledgments

This research is supported in part by AFRL grant FA8750-17-2-0212. We thank Chicheng Zhang for insightful ideas on improving the results of [6] using Rademacher complexity.

References

[1] S. Agarwal and P. Niyogi. Stability and generalization of bipartite ranking algorithms. In Annual Conference on Learning Theory, pages 32–47, 2005.

[2] S. Agarwal and P. Niyogi. Generalization bounds for ranking algorithms via algorithmic stability. Journal of Machine Learning Research, 10:441–474, 2009.

[3] N. Ailon and M. Mohri. An efficient reduction of ranking to classification. arXiv preprint arXiv:0710.2889, 2007.

[4] J. Attenberg, P. Melville, and F. Provost. A unified approach to active dual supervision for labeling features and examples. In Machine Learning and Knowledge Discovery in Databases, pages 40–55. Springer, 2010.

[5] P. Awasthi, M.-F. Balcan, N. Haghtalab, and H. Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Annual Conference on Learning Theory, pages 152–192, 2016.
[6] P. Awasthi, M.-F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. Journal of the ACM, 63(6):50, 2017.

[7] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 65–72. ACM, 2006.

[8] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Annual Conference on Learning Theory, pages 35–50, 2007.

[9] M.-F. Balcan and S. Hanneke. Robust interactive learning. In COLT, 2012.

[10] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. In Annual Conference on Learning Theory, pages 288–316, 2013.

[11] M.-F. Balcan, E. Vitercik, and C. White. Learning combinatorial functions from pairwise comparisons. arXiv preprint arXiv:1605.09227, 2016.

[12] M.-F. Balcan and H. Zhang. Noise-tolerant life-long matrix completion via adaptive sampling. In Advances in Neural Information Processing Systems, pages 2955–2963, 2016.

[13] A. Beygelzimer, D. J. Hsu, J. Langford, and C. Zhang. Search improves label for active learning. In Advances in Neural Information Processing Systems, pages 3342–3350, 2016.

[14] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.

[15] R. M. Castro and R. D. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353, 2008.

[16] O. Dekel, C. Gentile, and K. Sridharan. Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research, 13:2655–2697, 2012.

[17] J. Fürnkranz and E. Hüllermeier. Preference learning and ranking by pairwise comparison. In Preference Learning, pages 65–82. Springer, 2010.

[18] S. Hanneke. Adaptive rates of convergence in active learning. In COLT, 2009.

[19] S. Hanneke. Theory of active learning, 2014.

[20] S. Hanneke and L. Yang. Surrogate losses in passive and active learning. arXiv preprint arXiv:1207.3772, 2012.

[21] R. Heckel, N. B. Shah, K. Ramchandran, and M. J. Wainwright. Active ranking from pairwise comparisons and the futility of parametric assumptions. arXiv preprint arXiv:1606.08842, 2016.

[22] K. G. Jamieson and R. Nowak. Active ranking using pairwise comparisons. In Advances in Neural Information Processing Systems, pages 2240–2248, 2011.

[23] D. M. Kane, S. Lovett, S. Moran, and J. Zhang. Active classification with comparison queries. arXiv preprint arXiv:1704.03564, 2017.

[24] A. Krishnamurthy. Interactive Algorithms for Unsupervised Machine Learning. PhD thesis, Carnegie Mellon University, 2015.

[25] L. Lovász and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007.

[26] S. Maji and G. Shakhnarovich. Part and attribute discovery from relative annotations. International Journal of Computer Vision, 108(1-2):82–96, 2014.

[27] S. Sabato and T. Hess. Interactive algorithms: from pool to stream. In Annual Conference on Learning Theory, pages 1419–1439, 2016.

[28] N. B. Shah, S. Balakrishnan, J. Bradley, A. Parekh, K. Ramchandran, and M. Wainwright. When is it better to compare than to score? arXiv preprint arXiv:1406.6618, 2014.

[29] N. Stewart, G. D. Brown, and N. Chater. Absolute identification by relative judgment. Psychological Review, 112(4):881, 2005.

[30] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, pages 135–166, 2004.

[31] C. Wah, G. Van Horn, S. Branson, S. Maji, P. Perona, and S. Belongie. Similarity comparisons for interactive fine-grained categorization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 859–866, 2014.

[32] S. Yan and C. Zhang. Revisiting perceptron: Efficient and label-optimal active learning of halfspaces. arXiv preprint arXiv:1702.05581, 2017.

[33] L. Yang and J. G. Carbonell. Cost complexity of proactive learning via a reduction to realizable active learning. Technical Report CMU-ML-09-113, 2009.

[34] C. Zhang and K. Chaudhuri. Beyond disagreement-based agnostic active learning. In Advances in Neural Information Processing Systems, pages 442–450, 2014.