{"title": "Interactive Structure Learning with Structural Query-by-Committee", "book": "Advances in Neural Information Processing Systems", "page_first": 1121, "page_last": 1131, "abstract": "In this work, we introduce interactive structure learning, a framework that unifies many different interactive learning tasks. We present a generalization of the query-by-committee active learning algorithm for this setting, and we study its consistency and rate of convergence, both theoretically and empirically, with and without noise.", "full_text": "Interactive Structure Learning with Structural\n\nQuery-by-Committee\n\nChristopher Tosh\nColumbia University\n\nc.tosh@columbia.edu\n\nAbstract\n\nSanjoy Dasgupta\n\nUC San Diego\n\ndasgupta@cs.ucsd.edu\n\nIn this work, we introduce interactive structure learning, a framework that uni\ufb01es\nmany different interactive learning tasks. We present a generalization of the query-\nby-committee active learning algorithm for this setting, and we study its consistency\nand rate of convergence, both theoretically and empirically, with and without noise.\n\n1\n\nIntroduction\n\nWe introduce interactive structure learning, an abstract problem that encompasses many interactive\nlearning tasks that have traditionally been studied in isolation, including active learning of binary\nclassi\ufb01ers, interactive clustering, interactive embedding, and active learning of structured output\npredictors. These problems include variants of both supervised and unsupervised tasks, and allow\nmany different types of feedback, from binary labels to must-link/cannot-link constraints to similarity\nassessments to structured outputs. 
Despite these surface differences, they conform to a common\ntemplate that allows them to be fruitfully unified.\nIn interactive structure learning, there is a space of items X (for instance, an input space on which a\nclassifier is to be learned, or points to cluster, or points to embed in a metric space), and the goal\nis to learn a structure on X, chosen from a family G. This set G could consist, for example, of all\nlinear classifiers on X, or all hierarchical clusterings of X, or all knowledge graphs on X. There is a\ntarget structure g* \u2208 G and the hope is to get close to this target. This is achieved by combining a\nloss function or prior on G with interactive feedback from an expert.\nWe allow this interaction to be fairly general. In most interactive learning work, the dominant\nparadigm has been question-answering: the learner asks a question (like \u201cwhat is the label of this\npoint x?\u201d) and the expert provides the answer. We allow a more flexible protocol in which the learner\nprovides a constant-sized snapshot of its current structure and asks whether it is correct (\u201cdoes the\nclustering, restricted to these ten points, look right?\u201d). If the snapshot is correct, the expert accepts it;\notherwise, the expert fixes some part of it. This type of feedback, first studied in generality by [15],\ncan be called partial correction. It is a strict generalization of question-answering, and as we explain\nin more detail below, it allows more intuitive interactions in many scenarios.\nIn Section 3, we present structural query-by-committee, a simple algorithm that can be used for\nany instance of interactive structure learning. It is a generalization of the well-known\nquery-by-committee (QBC) algorithm [33, 16], and operates, roughly, by maintaining a posterior distribution\nover structures and soliciting feedback on snapshots on which there is high uncertainty. 
We also\nintroduce an adaptation of the algorithm that allows convex loss functions to handle the noise. This\nreduces the computational cost in some practical settings, most notably when G consists of linear\nfunctions, and also makes it possible to efficiently kernelize structural QBC.\nIn Section 4, we show that structural QBC is guaranteed to converge to the target g*, even when the\nexpert\u2019s feedback is noisy. In the appendix, we give rates of convergence in terms of a shrinkage\ncoefficient, present experiments on a variety of interactive learning tasks, and give an overview of\nrelated work.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n2 Interactive structure learning\n\nThe space of possible interactive learning schemes is large and mostly unexplored. We can get a sense\nof its diversity from a few examples. In active learning [32], a machine is given a pool of unlabeled\ndata and adaptively queries the labels of certain data points. By focusing on informative points, the\nmachine may learn a good classifier using fewer labels than would be needed in a passive setting.\nSometimes, the labels are complex structured objects, such as parse trees for sentences or\nsegmentations of images. In such cases, providing an entire label is time-consuming and it is easier if the\nmachine simply suggests a label (such as a tree) and lets the expert either accept it or correct some\nparticularly glaring fault in it. This is interaction with partial correction. It is more general than the\nquestion-answering usually assumed in active learning, and more convenient in many settings.\nInteraction has also been used to augment unsupervised learning. Despite great improvements in\nalgorithms for clustering, topic modeling, and so on, the outputs of these procedures are rarely\nperfectly aligned with the user\u2019s needs. 
The problem is one of underspecification: there are many\nlegitimate ways to organize complex high-dimensional data, and no algorithm can magically guess\nwhich one a user has in mind. However, a modest amount of interaction may help overcome this issue.\nFor instance, the user can iteratively provide must-link and cannot-link constraints [37] to edit a\nflat clustering, or triplet constraints to edit a hierarchy [36].\nThese are just a few examples of interactive learning that have been investigated. The true scope\nof the settings in which interaction can be integrated is immense, ranging from structured output\nprediction to metric learning and beyond. In what follows, we aim to provide a unifying framework\nto address this profusion of learning problems.\n\n2.1 The space of structures\nLet X be a set of data points. This could be a pool of unlabeled data to be used for active learning, or\na set of points to be clustered, or an entire instance space on which a metric will be learned.\nWe wish to learn a structure on X, chosen from a class G. This could, for instance, be the set of all\nlabelings of X consistent with a function class F of classifiers (binary, multiclass, or with complex\nstructured labels), or the set of all partitions of X, or the set of all metrics on X. Of these, there is\nsome target g* \u2208 G that we wish to attain.\nAlthough interaction will help choose a structure, it is unreasonable to expect that interaction alone\ncould be an adequate basis for this choice. For instance, pinpointing a particular clustering over n\npoints requires \u03a9(n) must-link/cannot-link constraints, which is an excessive amount of interaction\nwhen n is large.\nTo bridge this gap, we need a prior or a loss function over structures. For instance, if G consists\nof flat k-clusterings, then we may prefer clusterings with low k-means cost. If G consists of linear\nseparators, then we may prefer functions with small norm \u2016g\u2016. 
In the absence of interaction, the\nmachine would simply pick the structure that optimizes the prior or cost function. In this paper, we\nassume that this preference is encoded as a prior distribution \u03c0 over G.\nWe emphasize that although we have adopted a Bayesian formulation, there is no assumption that the\ntarget structure g* is actually drawn from the prior.\n\n2.2 Feedback\n\nWe consider schemes in which each individual round of interaction is not expected to take too long.\nThis means, for instance, that the expert cannot be shown an entire clustering, of unrestricted size,\nand asked to comment upon it. Instead, he or she can only be given a small snapshot of the clustering,\nsuch as its restriction to 10 elements. The feedback on this snapshot will be either to accept it, or\nto provide some constraint that fixes part of it.\nIn order for this approach to work, it is essential that structures be locally checkable: that is, g\ncorresponds to the target g* if and only if every snapshot of g is satisfactory.\nWhen g is a clustering, for instance, the snapshots could be restrictions of g to subsets S \u2286 X of some\nfixed size s. Technically, it is enough to take s = 2, which corresponds to asking the user questions of\nthe form \u201cDo you agree with having zebra and giraffe in the same cluster?\u201d From the viewpoint of\nhuman-computer interaction, it might be preferable to use larger subsets (like s = 5 or s = 10), with\nquestions such as \u201cDo you agree with the clustering {zebra, giraffe, dolphin}, {whale, seal}?\u201d\nLarger substructures provide more context and are more likely to contain glaring faults that the user\ncan easily fix (dolphin and whale must go together). 
In general, we can only expect the user to\nprovide partial feedback in these cases, rather than fully correcting the substructure.\n\n2.3 Snapshots\n\nPerhaps the simplest type of snapshot of a structure g is the restriction of g to a small number of\npoints. We start by discussing this case, and later present a generalization.\n\n2.3.1 Projections\nFor any g \u2208 G and any subset S \u2286 X of size s = O(1), let g|_S be a suitable notion of the restriction\nof g to S, which we will sometimes call the projection of g onto S. For instance:\n\n\u2022 G is a set of classifiers on X. We can take s = 1 and let g|_x be (x, g(x)) for any x \u2208 X.\n\u2022 G is a set of partitions (flat clusterings) of X. For a set S \u2286 X of size s \u2265 2, let g|_S be the\ninduced partition on just the points S.\n\nAs discussed earlier, it will often be helpful to pick projections of size larger than the minimal possible\ns. For clusterings, for instance, any s \u2265 2 satisfies local checkability, but human feedback might be\nmore effective when s = 10 than when s = 2. Thus, in general, the queries made to the expert will\nconsist of snapshots (projections of size s = 10, say) that can in turn be decomposed further into\natomic units (projections of size 2).\n\n2.3.2 Atomic decompositions of structures\nNow we generalize the notion of projection to other types of snapshots and their atomic units.\nWe will take a functional view of the space of structures G, in which each structure g is specified\nby its \u201canswers\u201d to a set of atomic questions A. For instance, if G is the set of partitions of X, then\nA = (X choose 2), the set of pairs of points, with g({x, x'}) = 1 if g places x, x' in the same cluster and 0 otherwise.\nThe queries made during interaction can, in general, be composed of multiple atomic units, and\nfeedback will be received on at least one of these atoms. Formally, let Q be the space of queries. In\nthe partitioning example, this might be (X choose 10), the set of 10-point subsets. 
The relationship between Q and A is captured by the\nfollowing requirements:\n\u2022 Each q \u2208 Q can be decomposed as a set of atomic questions A(q) \u2286 A, and we write\ng(q) = {(a, g(a)) : a \u2208 A(q)}. In the partitioning example, A(q) is the set of all pairs in q.\n\u2022 The user accepts g(q) if and only if g satisfactorily answers every atomic question in q, that\nis, if and only if g(a) = g*(a) for all a \u2208 A(q).\n\n2.4 Summary of framework\n\nTo summarize, interactive structure learning has two key components:\n\n\u2022 A reduction to multiclass classifier learning. We view each structure g \u2208 G as a function\non atomic questions A. Thus, learning a good structure is equivalent to picking one whose\nlabels g(a) are correct.\n\u2022 Feedback by partial correction. For practical reasons we consider broad queries, from a\nset Q, where each query can be decomposed into atomic questions, allowing for partial\ncorrections. This decomposition is given by the function A : Q \u2192 2^A.\n\nAlgorithm 1 STRUCTURAL QBC\n\nInput: Distribution\u00b9 \u03bd over query space Q and initial prior distribution \u03c0_o over G\nOutput: Posterior distribution \u03c0_t over G\nfor t = 1, 2, . . . do\n  Draw g_t \u223c \u03c0_{t-1}\n  while next query q_t has not been chosen do\n    Draw q \u223c \u03bd and g, g' \u223c \u03c0_{t-1}\n    With probability d(g, g'; q): take q_t = q\n  end while\n  Show user q_t and g_t(q_t) and receive feedback in form of pairs (a_t, y_t)\n  Update posterior: \u03c0_t(g) \u221d \u03c0_{t-1}(g) exp(-\u03b2 \u00b7 1(g(a_t) \u2260 y_t))\nend for\n\nThe reduction to multiclass classification immediately suggests algorithms that can be used in the\ninteractive setting. We are particularly interested in adaptive querying, with the aim of finding a good\nstructure with minimal interaction. Of the many schemes available for binary classifiers, one that\nappears to work well in practice and has good statistical properties is query-by-committee [33, 16]. 
It\nis thus a natural candidate to generalize to the broader problem of structure learning.\n\n3 Structural QBC\n\nQuery-by-committee, as originally analyzed by [16], is an active learning algorithm for binary\nclassification in the noiseless setting. It uses a prior probability distribution \u03c0 over its classifiers and\nkeeps track of the current version space, i.e. the classifiers consistent with the labeled data seen so far.\nAt any given time, the next query is chosen as follows:\n\n\u2022 Repeat:\n  \u2013 Pick x \u2208 X at random\n  \u2013 Pick classifiers h, h' at random from \u03c0 restricted to the current version space\n  \u2013 If h(x) \u2260 h'(x): halt with x as the query\n\nIn our setting, the feedback at time t is the answer y_t to some atomic question a_t \u2208 A, and we\ncan define the resulting version space to be {g \u2208 G : g(a_{t'}) = y_{t'} for all t' \u2264 t}. The immediate\ngeneralization of QBC would involve picking a query q \u2208 Q at random (or more generally, drawn\nfrom some query distribution \u03bd), and then choosing it if g, g' sampled from \u03c0 restricted to our version\nspace happen to disagree on it. But this is unlikely to work well, because the answers to queries are\nno longer binary labels but mini-structures. As a result, g, g' are likely to disagree on minor details\neven when the version space is quite small, leading to excessive querying. To address this, we will\nuse a more refined notion of the difference between g(q) and g'(q):\n\nd(g, g'; q) = (1/|A(q)|) \u2211_{a \u2208 A(q)} 1[g(a) \u2260 g'(a)].\n\nIn words, this is the fraction of atomic subquestions of q on which g and g' disagree. It is a value\nbetween 0 and 1, where higher values mean that g(q) differs significantly from g'(q). Then we will\nquery q with probability d(g, g'; q).\n\n3.1 Accommodating noisy feedback\nWe are interested in the noisy setting, where the user\u2019s feedback may occasionally be inconsistent\nwith the target structure. 
In this case, the notion of a version space is less clear-cut. Our modification\nis very simple: the feedback at time t, say (a_t, y_t), causes the posterior to be updated as follows:\n\n\u03c0_t(g) \u221d \u03c0_{t-1}(g) exp(-\u03b2 \u00b7 1[g(a_t) \u2260 y_t]).    (1)\n\nHere \u03b2 > 0 is a constant that controls how aggressively errors are punished. In the noiseless setting,\nwe can take \u03b2 = \u221e and recover the original QBC update. Even with noise, however, this posterior\nupdate still enjoys nice theoretical properties. The full algorithm is shown in Algorithm 1.\n\n\u00b9 In the setting where Q is finite, a reasonable choice of \u03bd would be uniform over Q.\n\n3.2 Uncertainty and informative queries\nWhat kinds of queries will structural QBC make? To answer this, we first quantify the uncertainty\nin the current posterior about a particular query or atom. Define the uncertainty of atom a \u2208 A\nunder distribution \u03c0 as u(a; \u03c0) = Pr_{g,g' \u223c \u03c0}(g(a) \u2260 g'(a)) and u(q; \u03c0) as the average\nuncertainty of its atoms A(q). These values lie in the range [0, 1].\nThe probability that a particular query q \u2208 Q is chosen in round t by structural QBC is proportional\nto \u03bd(q) u(q; \u03c0_{t-1}). Thus, queries with higher uncertainty under the current posterior are more likely\nto be chosen. As the following lemma demonstrates, getting feedback on uncertain atoms eliminates,\nor down-weights in the case of noisy feedback, many structures inconsistent with g*.\nLemma 1. For any distribution \u03c0 over G, we have \u03c0({g : g(a) \u2260 y}) \u2265 u(a; \u03c0)/2.\nThe proof of Lemma 1 is deferred to the appendix. This gives some intuition for the query selection\ncriterion of structural QBC, and will later be used in the proof of consistency.\n\n3.3 General loss functions\nThe update rule for structural QBC, equation (1), results in a posterior of the form \u03c0_t(g) \u221d\n\u03c0(g) exp(-\u03b2 \u00b7 #(mistakes made by g)), which may be difficult to sample from. 
Thus, we consider\na broader class of updates,\n\n\u03c0_t(g) \u221d \u03c0_{t-1}(g) exp(-\u03b2 \u00b7 \u2113(g(a_t), y_t)),    (2)\n\nwhere \u2113(\u00b7,\u00b7) is a general loss function. In the special case where G consists of linear functions and \u2113\nis convex, \u03c0_t will be a log-concave distribution, which allows for efficient sampling [28]. We will\nshow that this update also enjoys nice theoretical properties, albeit under different noise conditions.\nTo formally specify this setting, let Y be the space of answers to atomic questions A, and suppose that\nstructures in G generate values in some prediction space Z \u2286 R^d. That is, each g \u2208 G is a function\ng : A \u2192 Z, and any output z \u2208 Z gets translated to some prediction in Y. The loss associated with\npredicting z when the true answer is y is denoted \u2113(z, y). Here are some examples:\n\n\u2022 0-1 loss. Z = Y and \u2113(z, y) = 1(y \u2260 z).\n\u2022 Squared loss. Y = {-1, 1}, Z = [-B, B], and \u2113(z, y) = (y - z)^2.\n\u2022 Logistic loss. Y = {-1, 1}, Z = [-B, B] for some B > 0, and \u2113(z, y) = ln(1 + e^{-yz}).\n\nWhen moving from a discrete to a continuous prediction space, it becomes quite likely that the\npredictions, on a particular atom, of two randomly chosen structures will be close but not perfectly\naligned. Thus, instead of checking strict equality of these predictions, we need to modify our querying\nstrategy to take into account the distance between them. To this end, we will use the normalized\naverage squared Euclidean distance:\n\nd^2(g, g'; q) = (1/|A(q)|) \u2211_{a \u2208 A(q)} \u2016g(a) - g'(a)\u2016^2 / D,\n\nwhere D = max_{a \u2208 A} max_{g,g' \u2208 G} \u2016g(a) - g'(a)\u2016^2. Note that d^2(g, g'; q) is a value between 0 and 1.\nWe treat it as a probability, in exactly the same way we used d(g, g'; q) in the 0-1 loss setting.\nIn the 0-1 loss setting, structural QBC chooses queries with probability proportional to their uncertainty. What queries\nwill structural QBC make in the general loss setting? 
Define the variance of a \u2208 A under \u03c0 as\n\nvar(a; \u03c0) = (1/2) \u2211_{g,g' \u2208 G} \u03c0(g) \u03c0(g') \u2016g(a) - g'(a)\u2016^2\n\nand var(q; \u03c0) as the average variance of its atoms A(q). Then the probability that structural QBC\nchooses q \u2208 Q at step t is proportional to \u03bd(q) var(q; \u03c0_{t-1}) in the general loss setting.\n\nAlgorithm 2 ROBUST QUERY SELECTION\n\nInput: Fixed set of queries q_1, . . . , q_m \u2208 Q, current distribution \u03c0 over G\nOutput: Query q_i\nInitial shrinkage estimate: \u00fb_o = 1/2\nfor t = 0, 1, 2, . . . do\n  Draw g_1, g'_1, . . . , g_{n_t}, g'_{n_t} \u223c \u03c0\n  If there exists q_j such that (1/n_t) \u2211_{i=1}^{n_t} d(g_i, g'_i; q_j) \u2265 \u00fb_t then we halt and query q_j\n  Otherwise, let \u00fb_{t+1} = \u00fb_t / 2.\nend for\n\n3.4 Kernelizing structural QBC\nConsider the special case where G consists of linear functions, i.e. G = {g_w(x) = \u27e8x, w\u27e9 : w \u2208 R^d}.\nAs mentioned above, when our loss function is convex, the posteriors we encounter are log-concave,\nand thus efficiently samplable. But what if we want a more expressive class than linear functions? To\naddress this, we will resort to kernels.\nGilad-Bachrach et al. [17] investigated the use of kernels in QBC. In particular, they observed that\nQBC does not actually need samples from the prior restricted to the current version space. Rather,\ngiven a candidate query x, it is enough to be able to sample from the distribution the posterior induces\nover the labelings of x. Although their work was in the realizable binary setting, this observation still\napplies to our setting.\nGiven a feature mapping \u03c6 : X \u2192 R^d, our posterior update becomes \u03c0_t(g_w) \u221d\n\u03c0_{t-1}(g_w) exp(-\u03b2 \u2113(\u27e8\u03c6(x_t), w\u27e9, y_t)). As the following lemma shows, when \u2113(\u00b7,\u00b7) is the squared-loss\nand our prior is Gaussian, the predictions of the posterior have a univariate normal distribution.\nLemma 2. Suppose \u03c0 = N(0, \u03c3_o^2 I_d), \u2113(\u00b7,\u00b7) is the squared-loss, and we have observed\n(x_1, y_1), . . . , (x_t, y_t). 
If g_w \u223c \u03c0_t, then \u27e8w, \u03c6(x)\u27e9 \u223c N(\u03bc, \u03c3^2) where\n\n\u03bc = 2\u03b2\u03c3_o^2 \u03ba^T (I_t - \u03a3_o K) y   and   \u03c3^2 = \u03c3_o^2 (\u03c6(x)^T \u03c6(x) - \u03ba^T \u03a3_o \u03ba)\n\nfor K_ij = \u27e8\u03c6(x_i), \u03c6(x_j)\u27e9, \u03ba_i = \u27e8\u03c6(x_i), \u03c6(x)\u27e9, and \u03a3_o = ((1/(2\u03b2\u03c3_o^2)) I_t + K)^{-1}.\n\nThe proof is deferred to the appendix. The important observation here is that all the quantities\ninvolving the feature mapping in Lemma 2 are inner products. Thus we never need to explicitly\nconstruct any feature vectors.\n\n3.5 Reducing the randomness in structural QBC\n\nIt is easy to see that the query selection procedure of structural QBC is a rejection sampler where\neach query q is chosen with probability proportional to \u03bd(q) u(q; \u03c0_t) (in the case of the 0-1 loss) or\n\u03bd(q) var(q; \u03c0_t) (for general losses). However, it is possible for the rejection rate to be quite high,\neven when there are many queries that have much higher uncertainty or variance than the rest. To\ncircumvent this issue, we introduce a \u2018robust\u2019 version of structural QBC, wherein many candidate\nqueries are sampled, and the query that has the highest uncertainty or variance is chosen.\nIn the 0-1 loss case, we can estimate the uncertainty of a candidate query q by drawing many pairs\ng_1, g'_1, . . . , g_n, g'_n \u223c \u03c0_t and using the unbiased estimator \u00fb(q; \u03c0_t) := (1/n) \u2211_{i=1}^{n} d(g_i, g'_i; q).\nUnfortunately, the number of structures we need to sample in order to identify the most uncertain\nquery depends on its uncertainty, which we do not know a priori. To circumvent this difficulty, we\ncan use the halving procedure shown in Algorithm 2. If the appropriate number of structures is\nsampled at each round t, on the order of O((1/\u00fb_t^2) log(m log(1/u_o))) for some crude lower bound\nu_o on the highest uncertainty, then with high probability this procedure terminates with a candidate\nquery whose uncertainty is within a constant factor of the highest uncertainty [35].\n\n4 Consistency of structural QBC\n\nIn this section, we look at a typical setting in which there is a finite but possibly very large pool of\ncandidate questions Q, and thus the space of structures G is effectively finite. Let g* \u2208 G be the target\nstructure, as before. Our goal in this setting is to demonstrate the consistency of structural QBC,\nmeaning that lim_{t \u2192 \u221e} \u03c0_t(g*) = 1 almost surely. To do so, we formalize our setting. Note that the\nrandom outcomes during time step t of structural QBC consist of the query q_t, the atomic question a_t\nthat the expert chooses to answer (pick one at random if the expert answers several of them), and the\nresponse y_t to a_t. Let F_t denote the sigma-field of all outcomes up to, and including, time t.\n\n4.1 Consistency under 0-1 loss\n\nIn order to prove consistency, we will have to make some assumptions about the feedback we receive\nfrom a user. For query q \u2208 Q and atomic question a \u2208 A(q), let \u03b7(y|a, q) denote the conditional\nprobability that the user answers y to atomic question a, in the context of query q. Our first assumption\nis that the single most likely answer is g*(a).\nAssumption 1. There exists 0 < \u03bb \u2264 1 such that \u03b7(g*(a)|a, q) - \u03b7(y|a, q) \u2265 \u03bb for all q \u2208 Q and\na \u2208 A(q) and all y \u2260 g*(a).\n(We will use the convention \u03bb = 1 for the noiseless setting.) 
In the learning literature, Assumption 1\nis known as Massart\u2019s bounded noise condition [2].\nThe following lemma, whose proof is deferred to the appendix, demonstrates that under Assumption 1,\nthe posterior probability of g* increases in expectation with each query, as long as the parameter \u03b2 of\nthe update rule in equation (1) is small enough relative to \u03bb.\nLemma 3. Fix any t and suppose the expert provides an answer to atomic question a_t \u2208 A(q_t) at\ntime t. Let \u03b3_t = \u03c0_{t-1}({g \u2208 G : g(a_t) = g*(a_t)}). Define \u0394_t by:\n\nE[1/\u03c0_t(g*) | F_{t-1}, q_t, a_t] = (1 - \u0394_t) \u00b7 1/\u03c0_{t-1}(g*).\n\nUnder Assumption 1, \u0394_t can be lower-bounded as follows:\n\n(a) If \u03bb = 1 (noiseless setting), \u0394_t \u2265 (1 - \u03b3_t)(1 - e^{-\u03b2}).\n(b) For \u03bb \u2208 (0, 1), if \u03b2 \u2264 \u03bb/2, then \u0394_t \u2265 \u03b2\u03bb(1 - \u03b3_t)/2.\n\nTo understand the requirement \u03b2 = O(\u03bb), consider an atomic question on which there are just two\npossible labels, 1 and 2, and the expert chooses these with probabilities p_1 and p_2, respectively. If the\ncorrect answer according to g* is 1, then p_1 \u2265 p_2 + \u03bb under Assumption 1. Let G_2 denote structures\nthat answer 2.\n\n\u2022 With probability p_1, the expert answers 1, and the posterior mass of G_2 is effectively\nmultiplied by e^{-\u03b2}.\n\u2022 With probability p_2, the expert answers 2, and the posterior mass of G_2 is effectively\nmultiplied by e^{\u03b2}.\n\nThe second outcome is clearly undesirable. In order for it to be counteracted, in expectation, by the\nfirst, \u03b2 must be small relative to p_1/p_2. The condition \u03b2 \u2264 \u03bb/2 ensures this.\nLemma 3 does not, in itself, imply consistency. It is quite possible for 1/\u03c0_t(g*) to keep shrinking\nbut not converge to 1. Imagine, for instance, that the input space has two parts to it, and we keep\nimproving on one of them but not the other. What we need is, first, to ensure that the queries q_t\ncapture some portion of the uncertainty in the current posterior, and second, that the user chooses\natoms that are at least slightly informative. The first condition is assured by the structural QBC querying\nstrategy. For the second, we need an assumption.\nAssumption 2. There is some minimum probability p_o > 0 for which the following holds. If the user\nis presented with a query q and a structure g \u2208 G such that g(q) \u2260 g*(q), then with probability at\nleast p_o the user will provide feedback on some a \u2208 A(q) such that g(a) \u2260 g*(a).\nAssumption 2 is one way of avoiding scenarios in which a user never provides feedback on a particular\natom a. In such a pathological case, we might not be able to recover g*(a), and thus our posterior\nwill always put some probability mass on structures that disagree with g* on a.\nThe following lemma gives lower bounds on 1 - \u03b3_t under Assumption 2.\nLemma 4. Suppose that G is finite and the user\u2019s feedback obeys Assumption 2. Then there exists a\nconstant c > 0 such that for every round t\n\nE[1 - \u03b3_t | F_{t-1}] \u2265 c \u03c0_{t-1}(g*)^2 (1 - \u03c0_{t-1}(g*))^2\n\nwhere \u03b3_t = \u03c0_{t-1}({g \u2208 G : g(a_t) = g*(a_t)}) and a_t is the atom the user provides feedback on.\nTogether, Lemmas 3 and 4 show that the sequence 1/\u03c0_t(g*) is a positive supermartingale that decreases\nin expectation at each round by an amount that depends on \u03c0_t(g*). The following lemma tells us\nexactly when such stochastic processes can be guaranteed to converge.\nLemma 5. Let f : [0, 1] \u2192 R_{\u22650} be a continuous function such that f(1) = 0 and f(x) > 0 for all\nx \u2208 (0, 1). If\n\nE[1/\u03c0_t(g*) | F_{t-1}] \u2264 1/\u03c0_{t-1}(g*) - f(\u03c0_{t-1}(g*))\n\nfor each t \u2208 N, then \u03c0_t(g*) \u2192 1 almost surely.\nAs a corollary, we see that structural QBC is consistent.\nTheorem 6. Suppose that G is finite, and Assumptions 1 and 2 hold. If \u03c0(g*) > 0, then \u03c0_t(g*) \u2192 
1\nalmost surely under structural QBC\u2019s query strategy.\n\nWe provide a proof of Theorem 6 in the appendix, where we also provide rates of convergence.\n\n4.2 Consistency under general losses\n\nWe now turn to analyzing structural QBC with general losses. As before, we will need to make some\nassumptions. The first is that the loss function is well-behaved.\nAssumption 3. The loss function is bounded, 0 \u2264 \u2113(z, y) \u2264 B, and Lipschitz in its first argument,\ni.e. \u2113(z, y) - \u2113(z', y) \u2264 C\u2016z - z'\u2016, for some constants B, C > 0.\nIt is easily checked that this assumption holds for the three loss functions we mentioned earlier.\nIn the case of 0-1 loss, we assumed that for any atomic question a, the correct answer g*(a) would be\ngiven with higher probability than any incorrect answer. We now formulate an analogous assumption\nfor the case of more general loss functions. Recall that \u03b7(\u00b7|a) is the conditional probability distribution\nover the user\u2019s answers to a \u2208 A (we can allow \u03b7 to also depend upon the context q, as we did\nbefore; here we drop the dependence for notational convenience). The expected loss incurred by\nz \u2208 Z on this atom is thus\n\nL(z, a) = \u2211_y \u03b7(y|a) \u2113(z, y).\n\nWe will require that for any atomic question a, this expected loss is minimized when z = g*(a), and\npredicting any other z results in expected loss that grows with the distance between z and g*(a).\nAssumption 4. There exists a constant \u03bb > 0 such that L(z, a) - L(g*(a), a) \u2265 \u03bb\u2016z - g*(a)\u2016^2\nfor any atomic question a \u2208 A and any z \u2208 Z.\nLet\u2019s look at some concrete settings:\n\n\u2022 0-1 loss with Y = Z = {0, 1}. Assumption 4 is equivalent to Assumption 1.\n\u2022 Squared loss with Y = {-1, 1} and Z \u2282 R. Assumption 4 is satisfied when g*(a) = E[y|a]\nand \u03bb = 1.\n\u2022 Logistic loss with Y = {-1, 1} and Z = [-B, B]. For a \u2208 A, let p = \u03b7(1|a). Assumption 4\nis satisfied when g*(a) = ln(p/(1 - p)) and \u03bb = 2e^{2B}/(1 + e^B)^4.\n\nFrom these examples, it is clear that requiring g*(a) to be the minimizer of L(z, a) is plausible if\nZ is a discrete space but much less so if Z is continuous. In general, we can only hope that this\nholds approximately. With this caveat in mind, we stick with Assumption 4 as a useful but idealized\nmathematical abstraction.\nWith these assumptions in place, the following theorem guarantees the consistency of structural QBC\nunder general losses. Its proof is deferred to the appendix.\nTheorem 7. Suppose we are in the general loss setting, G is finite, and the user\u2019s feedback satisfies\nAssumptions 2, 3, and 4. If \u03c0(g*) > 0, then \u03c0_t(g*) \u2192 1 almost surely.\n\n5 Conclusion\n\nIn this work, we introduced interactive structure learning, a generic framework for learning structures\nunder partial correction feedback. This framework can be applied to any structure learning problem\nin which structures are in one-to-one correspondence with their answers to atomic questions. Thus,\ninteractive structure learning may be viewed as a generalization of active learning, interactive\nclustering with pairwise constraints, interactive hierarchical clustering with triplet constraints, and\ninteractive ordinal embeddings with quadruplet constraints.\nOn the algorithmic side, we introduced structural QBC, a generalization of the classical QBC\nalgorithm to the interactive structure learning setting. We demonstrated that this algorithm is\nconsistent, even in the presence of noise, provided that we can sample from a certain natural posterior.\nIn the appendix, we also provided rates of convergence. Because this posterior is often intractable\nto sample from, we also considered an alternative posterior based on convex loss functions that\nsometimes allows for efficient sampling. 
We showed that structural QBC remains consistent in this\nsetting, albeit under different noise conditions.\nIn the appendix, we provide experiments on both interactive clustering and active learning tasks. On\nthe interactive clustering side, these experiments demonstrate that even when the prior distribution\nplaces relatively low mass on the target clustering, structural QBC is capable of recovering a low-error\nclustering with relatively few rounds of interaction. In contrast, these experiments also show\nthat random corrections are not quite as useful. On the active learning side, there are experiments\ndemonstrating the good empirical performance of structural QBC using linear classifiers with the\nsquared-loss posterior update, with and without kernelization.\n\nAcknowledgments\n\nThe authors are grateful to the reviewers for their feedback and to the NSF for support under grant\nCCF-1813160. Part of this work was done at the Simons Institute for Theoretical Computer Science,\nBerkeley, during the \u201cFoundations of Machine Learning\u201d program. CT also thanks Stefanos Poulis\nand Sharad Vikram for helpful discussions and feedback.\n\nReferences\n[1] H. Ashtiani, S. Kushagra, and S. Ben-David. Clustering with same-cluster queries. In Advances in Neural Information Processing Systems, pages 3216\u20133224, 2016.\n[2] P. Awasthi, M.-F. Balcan, N. Haghtalab, and R. Urner. Efficient learning of linear separators under bounded noise. In Proceedings of the 28th Annual Conference on Learning Theory, pages 167\u2013190, 2015.\n[3] P. Awasthi, M.-F. Balcan, and K. Voevodski. Local algorithms for interactive clustering. In Proceedings of the 31st International Conference on Machine Learning, 2014.\n[4] P. Awasthi and R.B. Zadeh. Supervised clustering. In Advances in Neural Information Processing Systems, 2010.\n[5] K. Azuma. Weighted sums of certain dependent random variables. 
Tohoku Mathematical Journal, Second Series, 19(3):357\u2013367, 1967.\n[6] M.-F. Balcan and A. Blum. Clustering with interactive feedback. In Algorithmic Learning Theory (volume 5254 of the series Lecture Notes in Computer Science), pages 316\u2013328, 2008.\n[7] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In Proceedings of the 26th International Conference on Machine Learning, 2009.\n[8] R. Castro and R. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339\u20132353, 2008.\n[9] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201\u2013221, 1994.\n[10] G. Dasarathy, R. Nowak, and X. Zhu. S2: An efficient graph based active learning algorithm with application to nonparametric classification. In 28th Annual Conference on Learning Theory, pages 503\u2013522, 2015.\n[11] S. Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, 2004.\n[12] S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems, 2005.\n[13] S. Dasgupta and D.J. Hsu. Hierarchical sampling for active learning. In Proceedings of the 25th International Conference on Machine Learning, 2008.\n[14] S. Dasgupta, D.J. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems, 2007.\n[15] S. Dasgupta and M. Luby. Learning from partial correction. ArXiv e-prints, 2017.\n[16] Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2):133\u2013168, 1997.\n[17] R. Gilad-Bachrach, A. Navot, and N. Tishby. Query by committee made real. In Advances in Neural Information Processing Systems, 2005.\n[18] A. Gonen, S. Sabato, and S. Shalev-Shwartz. 
Efficient active learning of halfspaces: an aggressive approach. Journal of Machine Learning Research, 14(1):2583\u20132615, 2013.\n[19] A. Guillory and J. Bilmes. Average-case active learning with costs. In Conference on Algorithmic Learning Theory, pages 141\u2013155, 2009.\n[20] S. Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning, 2007.\n[21] N. J. Higham. Accuracy and stability of numerical algorithms. SIAM, 2002.\n[22] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13\u201330, 1963.\n[23] T.-K. Huang, A. Agarwal, D.J. Hsu, J. Langford, and R.E. Schapire. Efficient and parsimonious agnostic active learning. In Advances in Neural Information Processing Systems, 2015.\n[24] D.M. Kane, S. Lovett, S. Moran, and J. Zhang. Active classification with comparison queries. In IEEE Symposium on Foundations of Computer Science, pages 355\u2013366, 2017.\n[25] S. Kpotufe, R. Urner, and S. Ben-David. Hierarchical label queries with data-dependent partitions. In Proceedings of the 28th Annual Conference on Learning Theory, 2015.\n[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n[27] M. Lichman. UCI machine learning repository, 2013.\n[28] L. Lovasz and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures and Algorithms, 30:307\u2013358, 2007.\n[29] N. Cesa-Bianchi, C. Gentile, and F. Vitale. Learning unknown graphs. In Conference on Algorithmic Learning Theory, pages 110\u2013125, 2009.\n[30] R. Nowak. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893\u20137906, 2011.\n[31] S. Poulis and S. Dasgupta. 
Learning with feature feedback: from theory to practice. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 1104\u20131113, 2017.\n[32] B. Settles. Active learning. Morgan & Claypool, 2012.\n[33] H.S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the 5th Annual Workshop on Computational Learning Theory, pages 287\u2013294, 1992.\n[34] C. Tosh and S. Dasgupta. Lower bounds for the Gibbs sampler on mixtures of Gaussians. In Thirty-First International Conference on Machine Learning, 2014.\n[35] C. Tosh and S. Dasgupta. Diameter-based active learning. In Proceedings of the 34th International Conference on Machine Learning, pages 3444\u20133452, 2017.\n[36] S. Vikram and S. Dasgupta. Interactive Bayesian hierarchical clustering. In Proceedings of the 33rd International Conference on Machine Learning, 2016.\n[37] K. Wagstaff and C. Cardie. Clustering with instance-level constraints. In Proceedings of the 17th International Conference on Machine Learning, 2000.\n[38] Y. Xu, H. Zhang, K. Miller, A. Singh, and A. Dubrawski. Noise-tolerant interactive learning using pairwise comparisons. In Advances in Neural Information Processing Systems, pages 2431\u20132440, 2017.\n[39] X. Zhu, J. Lafferty, and Z. Ghahramani. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In ICML Workshop on the Continuum from Labeled to Unlabeled Data, 2003.\n", "award": [], "sourceid": 597, "authors": [{"given_name": "Christopher", "family_name": "Tosh", "institution": "Columbia University"}, {"given_name": "Sanjoy", "family_name": "Dasgupta", "institution": "UC San Diego"}]}