{"title": "Active Learning and Best-Response Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 2222, "page_last": 2230, "abstract": "We consider a setting in which low-power distributed sensors are each making highly noisy measurements of some unknown target function. A center wants to accurately learn this function by querying a small number of sensors, which ordinarily would be impossible due to the high noise rate. The question we address is whether local communication among sensors, together with natural best-response dynamics in an appropriately-de\ufb01ned game, can denoise the system without destroying the true signal and allow the center to succeed from only a small number of active queries. We prove positive (and negative) results on the denoising power of several natural dynamics, and also show experimentally that when combined with recent agnostic active learning algorithms, this process can achieve low error from very few queries, performing substantially better than active or passive learning without these denoising dynamics as well as passive learning with denoising.", "full_text": "Active Learning and Best-Response Dynamics\n\nMaria-Florina Balcan\n\nCarnegie Mellon\n\nninamf@cs.cmu.edu\n\nChristopher Berlind\n\nGeorgia Tech\n\ncberlind@gatech.edu\n\nAvrim Blum\n\nCarnegie Mellon\n\navrim@cs.cmu.edu\n\nEmma Cohen\nGeorgia Tech\n\necohen@gatech.edu\n\nKaushik Patnaik\n\nGeorgia Tech\n\nkpatnaik3@gatech.edu\n\nLe Song\n\nGeorgia Tech\n\nlsong@cc.gatech.edu\n\nAbstract\n\nWe examine an important setting for engineered systems in which low-power dis-\ntributed sensors are each making highly noisy measurements of some unknown\ntarget function. A center wants to accurately learn this function by querying a\nsmall number of sensors, which ordinarily would be impossible due to the high\nnoise rate. 
The question we address is whether local communication among sensors, together with natural best-response dynamics in an appropriately-defined game, can denoise the system without destroying the true signal and allow the center to succeed from only a small number of active queries. By using techniques from game theory and empirical processes, we prove positive (and negative) results on the denoising power of several natural dynamics. We then show experimentally that when combined with recent agnostic active learning algorithms, this process can achieve low error from very few queries, performing substantially better than active or passive learning without these denoising dynamics as well as passive learning with denoising.

1 Introduction

Active learning has been the subject of significant theoretical and experimental study in machine learning, due to its potential to greatly reduce the amount of labeling effort needed to learn a given target function. However, to date, such work has focused only on the single-agent low-noise setting, with a learning algorithm obtaining labels from a single, nearly-perfect labeling entity. In large part this is because the effectiveness of active learning is known to quickly degrade as noise rates become high [5]. In this work, we introduce and analyze a novel setting where label information is held by highly-noisy low-power agents (such as sensors or micro-robots). We show how by first using simple game-theoretic dynamics among the agents we can quickly approximately denoise the system. This allows us to exploit the power of active learning (especially, recent advances in agnostic active learning), leading to efficient learning from only a small number of expensive queries.

We specifically examine an important setting relevant to many engineered systems where we have a large number of low-power agents (e.g., sensors). 
These agents are each measuring some quantity, such as whether there is a high or low concentration of a dangerous chemical at their location, but they are assumed to be highly noisy. We also have a center, far away from the region being monitored, which has the ability to query these agents to determine their state. Viewing the agents as examples, and their states as noisy labels, the goal of the center is to learn a good approximation to the true target function (e.g., the true boundary of the high-concentration region for the chemical being monitored) from a small number of label queries. However, because of the high noise rate, learning this function directly would require a very large number of queries to be made (for noise rate η, one would necessarily require Ω(1/(1/2 − η)²) queries [4]). The question we address in this paper is to what extent this difficulty can be alleviated by providing the agents the ability to engage in a small amount of local communication among themselves.

What we show is that by using local communication and applying simple robust state-changing rules such as following natural game-theoretic dynamics, randomly distributed agents can modify their state in a way that greatly de-noises the system without destroying the true target boundary. This then nicely meshes with recent advances in agnostic active learning [1], allowing the center to learn a good approximation to the target function from a small number of queries to the agents. In particular, in addition to proving theoretical guarantees on the denoising power of game-theoretic agent dynamics, we also show experimentally that a version of the agnostic active learning algorithm of [1], when combined with these dynamics, indeed is able to achieve low error from a small number of queries, outperforming active and passive learning algorithms without the best-response denoising step, as well as outperforming passive learning 
algorithms with denoising. More broadly, engineered systems such as sensor networks are especially well-suited to active learning because components may be able to communicate among themselves to reduce noise, and the designer has some control over how they are distributed, so assumptions such as a uniform or other "nice" distribution on data are reasonable. We focus in this work primarily on the natural case of linear separator decision boundaries, but many of our results extend directly to more general decision boundaries as well.

1.1 Related Work

There has been significant work in active learning (e.g., see [11, 15]), including active learning in the presence of noise [9, 4, 1], yet it is known that active learning can provide significant benefits only in low-noise scenarios [5]. There has also been extensive work analyzing the performance of simple dynamics in consensus games [6, 8, 14, 13, 3, 2]. However, this work has focused on reaching some equilibrium or states of low social cost, while we are primarily interested in getting near a specific desired configuration, which as we show below is an approximate equilibrium.

2 Setup

We assume we have a large number N of agents (e.g., sensors) distributed uniformly at random in a geometric region, which for concreteness we consider to be the unit ball in R^d. There is an unknown linear separator such that in the initial state, each sensor on the positive side of this separator is positive independently with probability ≥ 1 − η, and each on the negative side is negative independently with probability ≥ 1 − η. 
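To make this sampling model concrete, the following sketch (our own illustration, not code from the paper; `sample_sensors` is a hypothetical helper and the parameter values are arbitrary) draws sensor locations uniformly from the unit ball and flips each label independently with probability η:

```python
import numpy as np

def sample_sensors(N, d, eta, rng):
    """Sample N sensor locations uniformly at random in the unit ball in R^d,
    with labels from a random linear separator flipped independently w.p. eta."""
    # Uniform in the unit ball: random direction times radius ~ U(0,1)^(1/d).
    X = rng.standard_normal((N, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    X *= rng.uniform(0.0, 1.0, size=(N, 1)) ** (1.0 / d)
    w_star = rng.standard_normal(d)              # unknown target separator
    w_star /= np.linalg.norm(w_star)
    true_labels = np.where(X @ w_star >= 0, 1, -1)
    flips = rng.uniform(size=N) < eta            # independent noise per sensor
    observed = np.where(flips, -true_labels, true_labels)
    return X, w_star, true_labels, observed
```

The setting of the paper is slightly more general (each sensor is merely correct with probability at least 1 − η, not exactly 1 − η); the uniform-flip version above is the first of the two noise models used in the experiments of Section 5.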
The quantity η < 1/2 is the noise rate.

2.1 The basic sensor consensus game

The sensors will denoise themselves by viewing themselves as players in a certain consensus game, and performing a simple dynamics in this game leading towards a specific ε-equilibrium. Specifically, the game is defined as follows, and is parameterized by a communication radius r, which should be thought of as small. Consider a graph where the sensors are vertices, and any two sensors within distance r are connected by an edge. Each sensor is in one of two states, positive or negative. The payoff a sensor receives is its correlation with its neighbors: the fraction of neighbors in the same state as it, minus the fraction in the opposite state. So, if a sensor is in the same state as all its neighbors then its payoff is 1, if it is in the opposite state of all its neighbors then its payoff is −1, and if sensors are in uniformly random states then the expected payoff is 0. Note that the states of highest social welfare (highest sum of utilities) are the all-positive and all-negative states, which are not what we are looking for. Instead, we want sensors to approach a different near-equilibrium state in which (most of) those on the positive side of the target separator are positive and (most of) those on the negative side of the target separator are negative. 
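The payoff just defined is straightforward to compute from sensor positions and states; the sketch below (our own illustration, with `payoffs` a hypothetical helper, not from the paper) evaluates it on a toy one-dimensional configuration:

```python
import numpy as np

def payoffs(X, labels, r):
    """Correlation payoff of each sensor: the fraction of neighbors within
    distance r in the same state minus the fraction in the opposite state."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    out = np.zeros(n)
    for i in range(n):
        nbrs = np.flatnonzero((D[i] <= r) & (np.arange(n) != i))
        if len(nbrs) == 0:
            continue                  # isolated sensor: payoff left at 0
        same = np.mean(labels[nbrs] == labels[i])
        out[i] = same - (1.0 - same)  # ranges over [-1, 1]
    return out
```

A sensor agreeing with every neighbor receives payoff 1 and one disagreeing with every neighbor receives −1, matching the description above.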
For this reason, we need to be particularly careful with the specific dynamics followed by the sensors.

We begin with a simple lemma that for sufficiently large N, the target function (i.e., all sensors on the positive side of the target separator in the positive state and the rest in the negative state) is an ε-equilibrium, in that no sensor has more than ε incentive to deviate.

Lemma 1 For any ε, δ > 0, for sufficiently large N, with probability 1 − δ the target function is an ε-equilibrium.

PROOF SKETCH: The target function fails to be an ε-equilibrium iff there exists a sensor for which more than a 1/2 + ε/2 fraction of its neighbors lie on the opposite side of the separator. Fix one sensor x and consider the probability this occurs to x, over the random placement of the N − 1 other sensors. Since the probability mass of the r-ball around x is at least (r/2)^d (see discussion in the proof of Theorem 2), so long as N − 1 ≥ (2/r)^d · max[8, 4/ε²] ln(2N/δ), with probability 1 − δ/2N point x will have m_x ≥ (2/ε²) ln(2N/δ) neighbors (by Chernoff bounds), each of which is at least as likely to be on x's side of the target as on the other side. Thus, by Hoeffding bounds, the probability that more than a 1/2 + ε/2 fraction of them lie on the wrong side is at most δ/2N. Adding the two failure probabilities gives δ/2N + δ/2N = δ/N, and the result then follows by union bound over all N sensors. For a bit tighter argument and a concrete bound on N, see the proof of Theorem 2, which essentially has this as a special case.

Lemma 1 motivates the use of best-response dynamics for denoising. Specifically, we consider a dynamics in which each sensor switches to the majority vote of all the other sensors in its neighborhood. We analyze below the denoising power of this dynamics under both synchronous and asynchronous update models. In supplementary material, we also consider more robust (though less practical) dynamics in which sensors perform more involved computations over their neighborhoods.

3 Analysis of the denoising dynamics

3.1 Simultaneous-move dynamics

We start by providing a positive theoretical guarantee for one-round simultaneous-move dynamics. We will use the following standard concentration bound:

Theorem 1 (Bernstein, 1924) Let X = Σ_{i=1}^N X_i be a sum of independent random variables such that |X_i − E[X_i]| ≤ M for all i. Then for any t > 0, P[X − E[X] > t] ≤ exp(−t² / (2(Var[X] + Mt/3))).

Theorem 2 If N ≥ (2 / ((r/2)^d (1/2 − η)²)) ln(1 / ((r/2)^d (1/2 − η)² δ)) + 1 then, with probability ≥ 1 − δ, after one synchronous consensus update every sensor at distance ≥ r from the separator has the correct label.

Note that since a band of width 2r about a linear separator has probability mass O(r√d), Theorem 2 implies that with high probability one synchronous update denoises all but an O(r√d) fraction of the sensors. In fact, Theorem 2 does not require the separator to be linear, and so this conclusion applies to any decision boundary with similar surface area, such as an intersection of a constant number of halfspaces or a decision surface of bounded curvature.

Proof (Theorem 2): Fix a point x in the sample at distance ≥ r from the separator and consider the ball of radius r centered at x. Let n+ be the number of correctly labeled points within the ball and n− be the number of incorrectly labeled points within the ball. Now consider the random variable Δ = n− − n+. Denoising x can give it the incorrect label only if Δ ≥ 0, so we would like to bound the probability that this happens. 
We can express Δ as the sum of N − 1 independent random variables Δ_i taking on value 0 for points outside the ball around x, 1 for incorrectly labeled points inside the ball, or −1 for correct labels inside the ball. Let V be the measure of the ball centered at x (which may be less than r^d if x is near the boundary of the unit ball). Then since the ball lies entirely on one side of the separator we have

E[Δ_i] = (1 − V) · 0 + V η − V (1 − η) = −V (1 − 2η).

Since |Δ_i| ≤ 1 we can take M = 2 in Bernstein's theorem. We can also calculate that Var[Δ_i] ≤ E[Δ_i²] = V. Thus the probability that the point x is updated incorrectly is

P[Σ_{i=1}^{N−1} Δ_i ≥ 0] = P[Σ_{i=1}^{N−1} Δ_i − E[Σ_{i=1}^{N−1} Δ_i] ≥ (N − 1)V(1 − 2η)]
  ≤ exp(−(N − 1)²V²(1 − 2η)² / (2((N − 1)V + 2(N − 1)V(1 − 2η)/3)))
  = exp(−(N − 1)V(1 − 2η)² / (2 + 4(1 − 2η)/3))
  ≤ exp(−(N − 1)V(1/2 − η)²)
  ≤ exp(−(N − 1)(r/2)^d (1/2 − η)²),

where in the last step we lower bound the measure V of the ball around x by the measure of the sphere of radius r/2 inscribed in its intersection with the unit ball. Taking a union bound over all N points, it suffices to have exp(−(N − 1)(r/2)^d (1/2 − η)²) ≤ δ/N, or equivalently

N − 1 ≥ (1 / ((r/2)^d (1/2 − η)²)) (ln N + ln(1/δ)).

Using the fact that ln x ≤ αx − ln α − 1 for all x, α > 0 yields the claimed bound on N.

We can now combine this result with the efficient agnostic active learning algorithm of [1]. In particular, applying the most recent analysis of [10, 16] of the algorithm of [1], we get the following bound on the number of queries needed to efficiently learn to accuracy 1 − ε with probability 1 − δ.

Corollary 1 There exists a constant c1 > 0 such that for r ≤ ε/(c1√d), and N satisfying the bound of Theorem 2, if sensors are each initially in agreement with the target linear separator independently with probability at least 1 − η, then one round of best-response dynamics is sufficient such that the agnostic active learning algorithm of [1] will efficiently learn to error ε using only O(d log 1/ε) queries to sensors.

In Section 5 we implement this algorithm and show that experimentally it learns a low-error decision rule even in cases where the initial value of η is quite high.

3.2 A negative result for arbitrary-order asynchronous dynamics

We contrast the above positive result with a negative result for arbitrary-order asynchronous moves. In particular, we show that for any d ≥ 1, for sufficiently large N, with high probability there exists an update order that will cause all sensors to become negative.

Theorem 3 For some absolute constant c > 0, if r ≤ 1/2 and sensors begin with noise rate η, and

N ≥ (16 / ((cr)^d φ²)) (ln(8 / ((cr)^d φ²)) + ln(1/δ)),

where φ = φ(η) = min(η, 1/2 − η), then with probability at least 1 − δ there exists an ordering of the agents so that asynchronous updates in this order cause all points to have the same label.

PROOF SKETCH: Consider the case d = 1 and a target function x > 0. Each subinterval of [−1, 1] of width r has probability mass r/2, and let m = rN/2 be the expected number of points within such an interval. The given value of N is sufficiently large that with high probability, all such intervals in the initial state have both a positive count and a negative count that are within ±(φ/4)m of their expectations. This implies that if sensors update left-to-right, initially all sensors will (correctly) flip to negative, because their neighborhoods have more negative points than positive points. But then when the "wave" of sensors reaches the positive region, they will continue (incorrectly) flipping to negative, because the at least m(1 − φ/2) negative points in the left half of their neighborhood will outweigh the at most (1 − η + φ/4)m positive points in the right half of their neighborhood. For a detailed proof and the case of general d > 1, see supplementary material.

3.3 Random order dynamics

While Theorem 3 shows that there exist bad orderings for asynchronous dynamics, we now show that we can get positive theoretical guarantees for random-order best-response dynamics. The high-level idea of the analysis is to partition the sensors into three sets: those that are within distance r of the target separator, those at distance between r and 2r from the target separator, and then all the rest. For those at distance < r from the separator we will make no guarantees: they might update incorrectly when it is their turn to move due to their neighbors on the other side of the target. 
Those at distance between r and 2r from the separator might also update incorrectly (due to "corruption" from neighbors at distance < r from the separator that had earlier updated incorrectly), but we will show that with high probability this only happens in the last 1/4 of the ordering. I.e., within the first 3N/4 updates, with high probability there are no incorrect updates by sensors at distance between r and 2r from the target. Finally, we show that with high probability, those at distance greater than 2r never update incorrectly. This last part of the argument follows from two facts: (1) with high probability all such points begin with more correctly-labeled neighbors than incorrectly-labeled neighbors (so they will update correctly so long as no neighbors have previously updated incorrectly), and (2) after 3N/4 total updates have been made, with high probability more than half of the neighbors of each such point have already (correctly) updated, and so those points will now update correctly no matter what their remaining neighbors do. Our argument for the sensors at distance in [r, 2r] requires r to be small compared to (1/2 − η)/√d, so the conclusion is that we have total error less than ε for r < c · min[1/2 − η, ε]/√d for some absolute constant c.

We begin with a key lemma. For any given sensor, define its inside-neighbors to be its neighbors in the direction of the target separator and its outside-neighbors to be its neighbors away from the target separator. Also, let γ = 1/2 − η.

Lemma 2 For any c1, c2 > 0 there exist c3, c4 > 0 such that for r ≤ γ/(c3√d) and N ≥ (c4/((r/2)^d γ²)) ln(1/(r^d γ δ)), with probability 1 − δ, each sensor x at distance between r and 2r from the target separator has m_x ≥ (c1/γ²) ln(4N/δ) neighbors, and furthermore the number of inside-neighbors of x that move before x is within ±(γ/c2)m_x of the number of outside-neighbors of x that move before x.

Proof: First, the guarantee on m_x follows immediately from the fact that the probability mass of the ball around each sensor x is at least (r/2)^d, so for appropriate c4 the expected value of m_x is at least max[8, 2c1/γ²] ln(4N/δ), and then applying Hoeffding bounds [12, 7] and the union bound. Now, fix some sensor x and let us first assume the ball of radius r about x does not cross the unit sphere. Because this is random-order dynamics, if x is the kth sensor to move within its neighborhood, the k − 1 sensors that move earlier are each equally likely to be an inside-neighbor or an outside-neighbor. So the question reduces to: if we flip k − 1 ≤ m_x fair coins, what is the probability that the number of heads differs from the number of tails by more than (γ/c2)m_x? For m_x ≥ 2(c2/γ)² ln(4N/δ), this is at most δ/(2N) by Hoeffding bounds. Now, if the ball of radius r about x does cross the unit sphere, then a random neighbor is slightly more likely to be an inside-neighbor than an outside-neighbor. However, because x has distance at most 2r from the target separator, this difference in probabilities is only O(r√d), which is at most γ/(2c2) for appropriate choice of the constant c3.¹ So, the result follows by applying Hoeffding bounds to the γ/(2c2) gap that remains.

Theorem 4 For some absolute constants c3, c4, for r ≤ γ/(c3√d) and N ≥ (c4/((r/2)^d γ²)) ln(1/(r^d γ δ)), in random-order dynamics, with probability 1 − δ all sensors at distance greater than 2r from the target separator update correctly.

PROOF SKETCH: We begin by using Lemma 2 to argue that with high probability, no points at distance between r and 2r from the separator update incorrectly within the first 3N/4 updates (which immediately implies that all points at distance greater than 2r update correctly as well, since by Theorem 2, with high probability they begin with more correctly-labeled neighbors than incorrectly-labeled neighbors and their neighborhood only becomes more favorable). In particular, for any given such point, the concern is that some of its inside-neighbors may have previously updated incorrectly. However, we use two facts: (1) by Lemma 2, we can set c4 so that with high probability the total contribution of neighbors that have already updated is at most (γ/8)m_x in the incorrect direction (since the outside-neighbors will have updated correctly, by induction), and (2) by standard concentration inequalities [12, 7], with high probability at least (1/8)m_x neighbors of x have not yet updated. These (1/8)m_x un-updated neighbors together have in expectation a (γ/4)m_x bias in the correct direction, and so with high probability have greater than a (γ/8)m_x correct bias for sufficiently large m_x (sufficiently large c1 in Lemma 2). So, with high probability this overcomes the at most (γ/8)m_x incorrect bias of neighbors that have already updated, and so the points will indeed update correctly as desired. Finally, we consider the points at distance ≥ 2r. Within the first 3N/4 updates, with high probability they will all update correctly as argued above. Now consider time 3N/4. For each such point, in expectation 3/4 of its neighbors have already updated, and with high probability, for all such points the fraction of neighbors that have updated is more than half. Since all neighbors have updated correctly so far, this means these points will have more correct neighbors than incorrect neighbors no matter what the remaining neighbors do, and so they will update correctly themselves.

¹ We can analyze the difference in probabilities as follows. First, in the worst case, x is at distance exactly 2r from the separator and is right on the edge of the unit ball. So we can define our coordinate system to view x as being at location (2r, √(1 − 4r²), 0, . . . , 0). Now, consider adding to x a random offset y in the r-ball. We want to look at the probability that x + y has Euclidean length less than 1 conditioned on the first coordinate of y being negative, compared to this probability conditioned on the first coordinate of y being positive. Notice that because the second coordinate of x is nearly 1, if y2 ≤ −cr² for appropriate c then x + y has length less than 1 no matter what the other coordinates of y are (the worst case is y1 = r, but even that adds at most O(r²) to the squared length). On the other hand, if y2 ≥ cr² then x + y has length greater than 1, also no matter what the other coordinates of y are. So, it is only in between that the value of y1 matters. But notice that the distribution over y2 has maximum density O(√d/r). So, with probability nearly 1/2 the point is inside the unit ball for sure, with probability nearly 1/2 the point is outside the unit ball for sure, and only with probability O(r² · √d/r) = O(r√d) does the y1 coordinate make any difference at all.

Figure 1: The margin-based active learning algorithm after iteration k. The algorithm samples points within margin b_k of the current weight vector w_k and then minimizes the hinge loss over this sample subject to the constraint that the new weight vector w_{k+1} is within distance r_k from w_k.

4 Query-efficient polynomial-time active learning algorithm

Recently, Awasthi et al. [1] gave the first polynomial-time active learning algorithm able to learn linear separators to error ε over the uniform distribution in the presence of agnostic noise of rate O(ε). Moreover, the algorithm does so with an optimal query complexity of O(d log 1/ε). This algorithm is ideally suited to our setting because (a) the sensors are uniformly distributed, and (b) the result of best-response dynamics is noise that is low but potentially highly coupled (hence fitting the low-noise agnostic model). In our experiments (Section 5) we show that indeed this algorithm, when combined with best-response dynamics, achieves low error from a small number of queries, outperforming active and passive learning algorithms without the best-response denoising step, as well as outperforming passive learning algorithms with denoising.

Here, we briefly describe the algorithm of [1] and the intuition behind it. At a high level, the algorithm proceeds through several rounds, in each performing the following operations (see also Figure 1):

Instance space localization: Request labels for a random sample of points within a band of width b_k = O(2^{−k}) around the boundary of the previous hypothesis w_k.

Concept space localization: Solve for hypothesis vector w_{k+1} by minimizing hinge loss subject to the constraint that w_{k+1} lie within a radius r_k from w_k; that is, ||w_{k+1} − w_k|| ≤ r_k.

[1, 10, 16] show that by setting the parameters appropriately (in particular, b_k = Θ(1/2^k) and r_k = Θ(1/2^k)), the algorithm will achieve error ε using only k = O(log 1/ε) rounds, with O(d) label requests per round. 
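A rough sketch of one such round follows (our own simplified illustration, not the authors' implementation: the constrained hinge-loss minimization is approximated here by projected subgradient steps, and `query` is a hypothetical labeling oracle):

```python
import numpy as np

def margin_round(w_k, X, query, b_k, r_k, steps=200, lr=0.05):
    """One round of the margin-based scheme: query labels for points within
    margin b_k of the current separator w_k, then approximately minimize
    hinge loss over that sample subject to ||w - w_k|| <= r_k."""
    in_band = np.abs(X @ w_k) <= b_k            # instance space localization
    S = X[in_band]
    y = np.array([query(x) for x in S])         # label requests (the costly step)
    w = np.asarray(w_k, dtype=float).copy()
    for _ in range(steps):
        margins = y * (S @ w)
        viol = margins < 1.0                    # points with nonzero hinge loss
        if viol.any():
            # subgradient of the hinge loss, averaged over violating points
            grad = -(y[viol, None] * S[viol]).mean(axis=0)
            w = w - lr * grad
        # concept space localization: project back into the r_k-ball around w_k
        diff = w - w_k
        dist = np.linalg.norm(diff)
        if dist > r_k:
            w = w_k + diff * (r_k / dist)
    return w
```

Shrinking b_k and r_k geometrically across rounds, as in the parameter settings above, is what drives the O(log 1/ε) round count.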
In particular, a key idea of their analysis is to decompose, in round k, the error of a candidate classifier w as its error outside margin b_k of the current separator plus its error inside margin b_k, and to prove that for these parameters, a small constant error inside the margin suffices to reduce overall error by a constant factor. A second key part is that by constraining the search for w_{k+1} to vectors within a ball of radius r_k about w_k, they show that hinge loss acts as a sufficiently faithful proxy for 0-1 loss.

5 Experiments

In our experiments we seek to determine whether our overall algorithm of best-response dynamics combined with active learning is effective at denoising the sensors and learning the target boundary. The experiments were run on synthetic data, and compared active and passive learning (with Support Vector Machines) both pre- and post-denoising.

Synthetic data. The N sensor locations were generated from a uniform distribution over the unit ball in R², and the target boundary was fixed as a randomly chosen linear separator through the origin. To simulate noisy scenarios, we corrupted the true sensor labels using two different methods: 1) flipping the sensor labels with probability η, and 2) flipping randomly chosen sensor labels and all their neighbors, to create pockets of noise, with an η fraction of total sensors corrupted.

Denoising via best-response dynamics. In the denoising phase of the experiments, the sensors applied the basic majority consensus dynamic. That is, each sensor was made to update its label to the majority label of its neighbors within distance r from its location.² We used radius values r ∈ {0.025, 0.05, 0.1, 0.2}. 
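The basic majority dynamic just described can be sketched as follows (a minimal synchronous version of our own, not the authors' experiment code; ties keep the current label, one of several reasonable conventions):

```python
import numpy as np

def synchronous_majority_round(X, labels, r):
    """One synchronous round of the basic majority dynamic: every sensor
    simultaneously switches to the majority label (+1/-1) among the other
    sensors within distance r of its location; ties keep the current label."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    new = labels.copy()
    n = len(X)
    for i in range(n):
        nbrs = (D[i] <= r) & (np.arange(n) != i)
        vote = labels[nbrs].sum()     # sign of sum of +1/-1 labels = majority
        if vote != 0:
            new[i] = 1 if vote > 0 else -1
    return new
```

The asynchronous variant instead applies the same rule to one randomly chosen sensor at a time, reading off the current (partially updated) labels.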
Updates of sensor labels were carried out both through simultaneous updates to all the sensors in each iteration (synchronous updates) and by updating one randomly chosen sensor in each iteration (asynchronous updates).

Learning the target boundary. After denoising the dataset, we employ the agnostic active learning algorithm of Awasthi et al. [1] described in Section 4 to decide which sensors to query and obtain a linear separator. We also extend the algorithm to the case of non-linear boundaries by implementing a kernelized version (see supplementary material for more details). Here we compare the resulting error (as measured against the "true" labels given by the target separator) against that obtained by training an SVM on a randomly selected labeled sample of the sensors of the same size as the number of queries used by the active algorithm. We also compare these post-denoising errors with those of the active algorithm and SVM trained on the sensors before denoising. For the active algorithm, we used parameters asymptotically matching those given in Awasthi et al. [1] for a uniform distribution. For SVM, we chose for each experiment the regularization parameter that resulted in the best performance.

5.1 Results

Here we report the results for N = 10000 and r = 0.1. Results for experiments with other values of the parameters are included in the supplementary material. Every value reported is an average over 50 independent trials.

Denoising effectiveness. Figure 2 (left side) shows, for various initial noise rates, the fraction of sensors with incorrect labels after applying 100 rounds of synchronous denoising updates. In the random noise case, the final noise rate remains very small even for relatively high levels of initial noise. Pockets of noise appear to be more difficult to denoise. 
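The pockets-of-noise corruption described under Synthetic data can be sketched as follows (our own interpretation of the description "flipping randomly chosen sensor labels and all their neighbors"; `pockets_of_noise` is a hypothetical helper, not the paper's code):

```python
import numpy as np

def pockets_of_noise(X, true_labels, eta, r, rng):
    """Corrupt roughly an eta fraction of sensors in spatial pockets: repeatedly
    pick a random sensor and flip it together with all sensors within r of it."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    labels = true_labels.copy()
    corrupted = np.zeros(len(X), dtype=bool)
    while corrupted.mean() < eta:
        i = rng.integers(len(X))
        pocket = D[i] <= r                       # the chosen sensor and its neighbors
        labels[pocket] = -true_labels[pocket]    # set the whole pocket to wrong labels
        corrupted |= pocket
    return labels
```

Because each flipped region is spatially contiguous, a corrupted sensor's neighbors tend to be corrupted too, which is exactly what makes this model harder for majority dynamics than independent flips.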
In this case, the final noise rate increases with the initial noise rate, but is still nearly always smaller than the initial level of noise.

Synchronous vs. asynchronous updates. To compare synchronous and asynchronous updates we plot the noise rate as a function of the number of rounds of updates in Figure 2 (right side). As our theory suggests, both simultaneous updates and asynchronous updates can quickly converge to a low level of noise in the random noise setting (in fact, convergence happens quickly nearly every time). Neither update strategy achieves the same level of performance in the case of pockets of noise.

² We also tested distance-weighted majority and randomized majority dynamics and experimentally observed similar results to those of the basic majority dynamic.

Figure 2: Initial vs. final noise rates for synchronous updates (left) and comparison of synchronous and asynchronous dynamics (right). One synchronous round updates every sensor once simultaneously, while one asynchronous round consists of N random updates.

Generalization error: pre- vs. post-denoising and active vs. passive. We trained both active and passive learning algorithms on both pre- and post-denoised sensors at various label budgets, and measured the resulting generalization error (determined by the angle between the target and the learned separator). The results of these experiments are shown in Figure 3. Notice that, as expected, denoising helps significantly, and on the denoised dataset the active algorithm achieves better generalization error than support vector machines at low label budgets. For example, at a label budget of 30, active learning achieves generalization error approximately 33% lower than the generalization error of SVMs. 
Similar observations were also obtained upon comparing the kernelized versions of the two algorithms (see the supplementary material).

Figure 3: Generalization error of the two learning methods with random noise at rate η = 0.35 (left) and pockets of noise at rate η = 0.15 (right).

6 Discussion

We demonstrate through theoretical analysis as well as experiments on synthetic data that local best-response dynamics can significantly denoise a highly noisy sensor network without destroying the underlying signal, allowing for fast learning from a small number of label queries. Our positive theoretical guarantees apply to both synchronous and random-order asynchronous updates, and this is borne out in the experiments as well. Our negative result in Section 3.2 for adversarial-order dynamics, in which a left-to-right update order can cause the entire system to switch to a single label, raises the question of whether an alternative dynamics could be robust to adversarial update orders. In the supplementary material we present an alternative dynamics that we prove is indeed robust to arbitrary update orders, but it is less practical because it requires substantially more computational power on the part of the sensors. It is an interesting question whether such general robustness can be achieved by a simple, practical update rule.
Another open question is whether an alternative dynamics can achieve better denoising in the region near the decision boundary.

Acknowledgments

This work was supported in part by NSF grants CCF-0953192, CCF-1101283, CCF-1116892, IIS-1065251, IIS-1116886, NSF/NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, AFOSR grant FA9550-09-1-0538, ONR grant N00014-09-1-0751, and a Raytheon Faculty Fellowship.

References

[1] P. Awasthi, M.-F. Balcan, and P. Long. The power of localization for efficiently learning linear separators with noise. In STOC, 2014.
[2] M.-F. Balcan, A. Blum, and Y. Mansour. The price of uncertainty. In EC, 2009.
[3] M.-F. Balcan, A. Blum, and Y. Mansour. Circumventing the price of anarchy: Leading dynamics to good behavior. SICOMP, 2014.
[4] M.-F. Balcan and V. Feldman. Statistical active learning algorithms. In NIPS, 2013.
[5] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In ICML, 2009.
[6] L. Blume. The statistical mechanics of strategic interaction. Games and Economic Behavior, 5:387–424, 1993.
[7] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford, 2013.
[8] G. Ellison. Learning, local interaction, and coordination. Econometrica, 61:1047–1071, 1993.
[9] D. Golovin, A. Krause, and D. Ray. 
Near-optimal Bayesian active learning with noisy observations. In NIPS, 2010.
[10] S. Hanneke. Personal communication, 2013.
[11] S. Hanneke. A statistical theory of active learning. Foundations and Trends in Machine Learning, pages 1–212, 2013.
[12] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, March 1963.
[13] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In KDD, 2003.
[14] S. Morris. Contagion. The Review of Economic Studies, 67(1):57–78, 2000.
[15] B. Settles. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2012.
[16] L. Yang. Mathematical Theories of Interaction with Oracles. PhD thesis, CMU Dept. of Machine Learning, 2013.