{"title": "Co-Training and Expansion: Towards Bridging Theory and Practice", "book": "Advances in Neural Information Processing Systems", "page_first": 89, "page_last": 96, "abstract": null, "full_text": "Co-Training and Expansion: Towards Bridging\n\nTheory and Practice\n\nMaria-Florina Balcan\nComputer Science Dept.\nCarnegie Mellon Univ.\nPittsburgh, PA 15213\n\nninamf@cs.cmu.edu\n\nAvrim Blum\n\nComputer Science Dept.\nCarnegie Mellon Univ.\nPittsburgh, PA 15213\navrim@cs.cmu.edu\n\nAbstract\n\nKe Yang\n\nComputer Science Dept.\nCarnegie Mellon Univ.\nPittsburgh, PA 15213\n\nyangke@cs.cmu.edu\n\nCo-training is a method for combining labeled and unlabeled data when\nexamples can be thought of as containing two distinct sets of features. It\nhas had a number of practical successes, yet previous theoretical analyses\nhave needed very strong assumptions on the data that are unlikely to be\nsatis\ufb01ed in practice.\nIn this paper, we propose a much weaker \u201cexpansion\u201d assumption on the\nunderlying data distribution, that we prove is suf\ufb01cient for iterative co-\ntraining to succeed given appropriately strong PAC-learning algorithms\non each feature set, and that to some extent is necessary as well. This\nexpansion assumption in fact motivates the iterative nature of the origi-\nnal co-training algorithm, unlike stronger assumptions (such as indepen-\ndence given the label) that allow a simpler one-shot co-training to suc-\nceed. We also heuristically analyze the effect on performance of noise in\nthe data. Predicted behavior is qualitatively matched in synthetic experi-\nments on expander graphs.\n\nIntroduction\n\n1\nIn machine learning, it is often the case that unlabeled data is substantially cheaper and\nmore plentiful than labeled data, and as a result a number of methods have been developed\nfor using unlabeled data to try to improve performance, e.g., [15, 2, 6, 11, 16]. 
Co-training [2] is a method that has had substantial success in scenarios in which examples can be thought of as containing two distinct yet sufficient feature sets. Specifically, a labeled example takes the form (⟨x1, x2⟩, ℓ), where x1 ∈ X1 and x2 ∈ X2 are the two parts of the example, and ℓ is the label. One further assumes the existence of two functions c1, c2 over the respective feature sets such that c1(x1) = c2(x2) = ℓ. Intuitively, this means that each example contains two “views,” and each view contains sufficient information to determine the label of the example. This redundancy implies an underlying structure of the unlabeled data (since they need to be “consistent”), and this structure makes the unlabeled data informative. In particular, the idea of iterative co-training [2] is that one can use a small labeled sample to train initial classifiers h1, h2 over the respective views, and then iteratively bootstrap by taking unlabeled examples ⟨x1, x2⟩ for which one of the hi is confident but the other is not, and using the confident hi to label such examples for the learning algorithm on the other view, improving the other classifier.
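The bootstrap just described can be sketched in a few lines over discrete views, treating each “classifier” simply as a confident set of view values (this is our minimal illustration, not code from the paper; all names are ours):

```python
# Toy iterative co-training over discrete views (illustrative sketch).
# An example is a pair (x1, x2); the "classifier" for view i is just the
# set of view-i values it is confident are positive. Each round, whenever
# one view is confident about an unlabeled pair, the other view's value is
# added to the other confident set -- the bootstrap step described above.

def cotrain(pairs, s1, s2, rounds=10):
    s1, s2 = set(s1), set(s2)          # initial confident sets S1^0, S2^0
    for _ in range(rounds):
        new1 = {x1 for x1, x2 in pairs if x2 in s2} - s1
        new2 = {x2 for x1, x2 in pairs if x1 in s1} - s2
        if not (new1 or new2):         # nothing left to propagate
            break
        s1 |= new1
        s2 |= new2
    return s1, s2

# Starting from confidence on {'a'} in view 1 alone, confidence spreads
# through overlapping pairs until it covers both views.
pairs = [('a', 'p'), ('b', 'p'), ('b', 'q'), ('c', 'q'), ('c', 'r')]
s1, s2 = cotrain(pairs, {'a'}, set())
print(sorted(s1), sorted(s2))  # -> ['a', 'b', 'c'] ['p', 'q', 'r']
```

Note that progress here depends on confident sets sharing examples with not-yet-confident values, which is exactly the expansion property formalized later.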
As an example for webpage classification given in [2], webpages contain text (x1) and have hyperlinks pointing to them (x2). From a small labeled sample, we might learn a classifier h2 that says that if a link with the words “my advisor” points to a page, then that page is probably a positive example of faculty-member-home-page; so, if we find an unlabeled example with this property we can use h2 to label the page for the learning algorithm that uses the text on the page itself. This approach and its variants have been used for a variety of learning problems, including named entity classification [3], text classification [10, 5], natural language processing [13], large scale document classification [12], and visual detectors [8].
Co-training effectively requires two distinct properties of the underlying data distribution in order to work. The first is that there should at least in principle exist low error classifiers c1, c2 on each view. The second is that these two views should on the other hand not be too highly correlated; we need to have at least some examples where h1 is confident but h2 is not (or vice versa) for the co-training algorithm to actually do anything. Unfortunately, previous theoretical analyses have needed to make strong assumptions of this second type in order to prove their guarantees. These include “conditional independence given the label” used by [2] and [4], or the assumption of “weak rule dependence” used by [1]. The primary contribution of this paper is a theoretical analysis that substantially relaxes the strength of this second assumption to just a form of “expansion” of the underlying distribution (a natural analog of the graph-theoretic notions of expansion and conductance) that we show in some sense is a necessary condition for co-training to succeed as well.
However, we will need a fairly strong assumption on the learning algorithms: that the hi they produce are never “confident but wrong” (formally, the algorithms are able to learn from positive data only), though we give a heuristic analysis of the case when this does not hold.
One key feature of assuming only expansion on the data is that it specifically motivates the iterative nature of the co-training algorithm. Previous assumptions that had been analyzed imply such a strong form of expansion that even a “one-shot” version of co-training will succeed (see Section 2.2). In fact, the theoretical guarantees given in [2] are exactly of this type. However, distributions can easily satisfy our weaker condition without allowing one-shot learning to work as well, and we describe several natural situations of this form.
An additional property of our results is that they are algorithmic in nature. That is, if we have sufficiently strong efficient PAC-learning algorithms for the target function on each feature set, we can use them to achieve efficient PAC-style guarantees for co-training as well. However, as mentioned above, we need a stronger assumption on our base learning algorithms than used by [2] (see Section 2.1).
We begin by formally defining the expansion assumption we will use, connecting it to standard graph-theoretic notions of expansion and conductance. We then prove the statement that ε-expansion is sufficient for iterative co-training to succeed, given strong enough base learning algorithms over each view, proving bounds on the number of iterations needed to converge. In Section 4.1, we heuristically analyze the effect of imperfect feature sets on co-training accuracy.
Finally, in Section 4.2, we present experiments on synthetic expander graph data that qualitatively bear out our analyses.

2 Notations, Definitions, and Assumptions
We assume that examples are drawn from some distribution D over an instance space X = X1 × X2, where X1 and X2 correspond to two different “views” of an example. Let c denote the target function, and let X+ and X− denote the positive and negative regions of X respectively (for simplicity we assume we are doing binary classification). For most of this paper we assume that each view in itself is sufficient for correct classification; that is, c can be decomposed into functions c1, c2 over each view such that D has no probability mass on examples x such that c1(x1) ≠ c2(x2). For i ∈ {1, 2}, let Xi+ = {xi ∈ Xi : ci(xi) = 1}, so we can think of X+ as X1+ × X2+, and let Xi− = Xi − Xi+. Let D+ and D− denote the marginal distribution of D over X+ and X− respectively.
In order to discuss iterative co-training, we need to be able to talk about a hypothesis being confident or not confident on a given example. For convenience, we will identify “confident” with “confident about being positive”. This means we can think of a hypothesis hi as a subset of Xi, where xi ∈ hi means that hi is confident that xi is positive, and xi ∉ hi means that hi has no opinion.
As in [2], we will abstract away the initialization phase of co-training (how labeled data is used to generate an initial hypothesis) and assume we are given initial sets S1^0 ⊆ X1+ and S2^0 ⊆ X2+ such that Pr_{⟨x1,x2⟩∈D}(x1 ∈ S1^0 or x2 ∈ S2^0) ≥ ρ_init for some ρ_init > 0. The goal of co-training will be to bootstrap from these sets using unlabeled data.
Now, to prove guarantees for iterative co-training, we make two assumptions: that the learning algorithms used in each of the two views are able to learn from positive data only, and that the distribution D+ is expanding as defined in Section 2.2 below.
2.1 Assumption about the base learning algorithms on the two views
We assume that the learning algorithms on each view are able to PAC-learn from positive data only. Specifically, for any distribution Di+ over Xi+, and any given ε, δ > 0, given access to examples from Di+ the algorithm should be able to produce a hypothesis hi such that (a) hi ⊆ Xi+ (so hi only has one-sided error), and (b) with probability 1 − δ, the error of hi under Di+ is at most ε. Algorithms of this type can be naturally thought of as predicting either “positive with confidence” or “don't know”, fitting our framework. Examples of concept classes learnable from positive data only include conjunctions, k-CNF, and axis-parallel rectangles; see [7]. For instance, for the case of axis-parallel rectangles, a simple algorithm that achieves this guarantee is just to output the smallest rectangle enclosing the positive examples seen.
If we wanted to consider algorithms that could be confident in both directions (rather than just confident about being positive) we could instead use the notion of “reliable, useful” learning due to Rivest and Sloan [14]. However, fewer classes of functions are learnable in this manner. In addition, a nice feature of our assumption is that we will only need D+ to expand and not D−. This is especially natural if the positive class has a large amount of cohesion (e.g., it consists of all documents about some topic Y) but the negatives do not (e.g., all documents about all other topics).
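The smallest-enclosing-rectangle algorithm just mentioned is simple to state in code; a minimal sketch (our illustration, not code from the paper; function names are ours). Because the learned box is always contained in the true rectangle, the hypothesis can only err by answering “don't know”:

```python
# Learning axis-parallel rectangles from positive data only (sketch).
# The hypothesis is the smallest box enclosing the positive sample, so it
# is a subset of the true rectangle: it never says "positive" wrongly
# (one-sided error); it can only say "don't know" too often.

def fit_box(positives):
    # positives: nonempty list of d-dimensional points known to be positive
    d = len(positives[0])
    lo = [min(p[j] for p in positives) for j in range(d)]
    hi = [max(p[j] for p in positives) for j in range(d)]
    return lo, hi

def confident_positive(box, x):
    lo, hi = box
    return all(l <= v <= h for l, v, h in zip(lo, x, hi))

box = fit_box([(1, 2), (3, 1), (2, 4)])   # learned box is [1,3] x [1,4]
print(confident_positive(box, (2, 3)))    # inside  -> True
print(confident_positive(box, (5, 3)))    # outside -> "don't know" (False)
```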
Note that we are effectively assuming that our algorithms are correct when they are confident; we relax this in our heuristic analysis in Section 4.
2.2 The expansion assumption for the underlying distribution
For S1 ⊆ X1 and S2 ⊆ X2, let boldface Si (i = 1, 2) denote the event that an example ⟨x1, x2⟩ has xi ∈ Si. So, if we think of S1 and S2 as our confident sets in each view, then Pr(S1 ∧ S2) denotes the probability mass on examples for which we are confident about both views, and Pr(S1 ⊕ S2) denotes the probability mass on examples for which we are confident about just one. In this section, all probabilities are with respect to D+. We say:
Definition 1 D+ is ε-expanding if for any S1 ⊆ X1+, S2 ⊆ X2+, we have
Pr(S1 ⊕ S2) ≥ ε · min[Pr(S1 ∧ S2), Pr(¬S1 ∧ ¬S2)].
We say that D+ is ε-expanding with respect to hypothesis class H1 × H2 if the above holds for all S1 ∈ H1 ∩ X1+, S2 ∈ H2 ∩ X2+ (here we denote by Hi ∩ Xi+ the set {h ∩ Xi+ : h ∈ Hi} for i = 1, 2).
To get a feel for this definition, notice that ε-expansion is in some sense necessary for iterative co-training to succeed, because if S1 and S2 are our confident sets and do not expand, then we might never see examples for which one hypothesis could help the other.¹ In Section 3 we show that Definition 1 is in fact sufficient.
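On a finite sample of positive pairs, the quantities in Definition 1 are easy to estimate for a given pair of candidate confident sets; a minimal sketch (our illustration; the sample and the sets below are hypothetical):

```python
# Empirical check of the expansion inequality of Definition 1 on a finite
# sample of positive pairs (x1, x2), for one candidate pair of confident
# sets S1, S2 (illustrative only). The inequality is
#   Pr(S1 xor S2) >= eps * min[Pr(S1 and S2), Pr(not S1 and not S2)],
# so the largest eps this particular pair certifies is the ratio below.

def expansion_ratio(pairs, s1, s2):
    n = len(pairs)
    both = sum(1 for x1, x2 in pairs if x1 in s1 and x2 in s2) / n
    neither = sum(1 for x1, x2 in pairs if x1 not in s1 and x2 not in s2) / n
    one = sum(1 for x1, x2 in pairs if (x1 in s1) != (x2 in s2)) / n
    m = min(both, neither)
    return one / m if m > 0 else float('inf')

pairs = [('a', 'p'), ('a', 'q'), ('b', 'q'),
         ('b', 'r'), ('c', 'r'), ('c', 'p')]
print(expansion_ratio(pairs, {'a'}, {'p'}))  # -> 2.0
```

Verifying ε-expansion of a distribution requires the bound to hold for every pair of sets (in the hypothesis class), not just one; the sketch only evaluates a single candidate pair.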
To see how much weaker this definition is than previously-considered requirements, it is helpful to consider a slightly stronger kind of expansion that we call “left-right expansion”.
Definition 2 We say D+ is ε-right-expanding if for any S1 ⊆ X1+, S2 ⊆ X2+:
if Pr(S1) ≤ 1/2 and Pr(S2 | S1) ≥ 1 − ε, then Pr(S2) ≥ (1 + ε) · Pr(S1).
We say D+ is ε-left-expanding if the above holds with indices 1 and 2 reversed. Finally, D+ is ε-left-right-expanding if it has both properties.
It is not immediately obvious but left-right expansion in fact implies Definition 1 (see Appendix A), though the converse is not necessarily true. We introduce this notion, however, for two reasons. First, it is useful for intuition: if Si is our confident set in Xi+ and this set is small (Pr(Si) ≤ 1/2), and we train a classifier that learns from positive data on the conditional distribution that Si induces over X_{3-i} until it has error ≤ ε on that distribution, then the definition implies the confident set on X_{3-i} will have noticeably larger probability than Si; so it is clear why this is useful for co-training, at least in the initial stages. Secondly, this notion helps clarify how our assumptions are much less restrictive than those considered previously.
¹However, ε-expansion requires every pair to expand and so it is not strictly necessary. If there were occasional pairs (S1, S2) that did not expand, but such pairs were rare and unlikely to be encountered as confident sets in the co-training process, we might still be OK.
Specifically,
Independence given the label: Independence given the label implies that for any S1 ⊆ X1+ and S2 ⊆ X2+ we have Pr(S2 | S1) = Pr(S2). So, if Pr(S2 | S1) ≥ 1 − ε, then Pr(S2) ≥ 1 − ε as well, even if Pr(S1) is tiny. This means that not only does S1 expand by a (1 + ε) factor as in Def. 2, but in fact it expands to nearly all of X2+.
Weak dependence: Weak dependence [1] is a relaxation of conditional independence that requires only that for all S1 ⊆ X1+, S2 ⊆ X2+ we have Pr(S2 | S1) ≥ α Pr(S2) for some α > 0. This seems much less restrictive. However, notice that if Pr(S2 | S1) ≥ 1 − ε, then Pr(¬S2 | S1) ≤ ε, which implies by definition of weak dependence that Pr(¬S2) ≤ ε/α and therefore Pr(S2) ≥ 1 − ε/α. So, again (for sufficiently small ε), even if S1 is very small, it expands to nearly all of X2+. This means that, as with conditional independence, if one has an algorithm over X2 that PAC-learns from positive data only, and one trains it over the conditional distribution given by S1, then by driving down its error on this conditional distribution one can perform co-training in just one iteration.
2.2.1 Connections to standard graph-theoretic notions of expansion
Our definition of ε-expansion (Definition 1) is a natural analog of the standard graph-theoretic notion of edge-expansion or conductance. A Markov chain is said to have high conductance if under the stationary distribution, for any set of states S of probability at most 1/2, the probability mass on transitions exiting S is at least ε times the probability of S. E.g., see [9]. A graph has high edge-expansion if the random walk on the graph has high conductance.
Since the stationary distribution of this walk can be viewed as having equal probability on every edge, this is equivalent to saying that for any partition of the graph into two pieces (S, V − S), the number of edges crossing the partition should be at least an ε fraction of the number of edges in the smaller half. To connect this to Definition 1, think of S as S1 ∧ S2.
It is well-known that, for example, a random degree-3 bipartite graph with high probability is expanding, and this in fact motivates our synthetic data experiments of Section 4.2.
2.2.2 Examples
We now give two simple examples that satisfy ε-expansion but not weak dependence.
Example 1: Suppose X = R^d × R^d and the target function on each view is an axis-parallel rectangle. Suppose a random positive example from D+ looks like a pair ⟨x1, x2⟩ such that x1 and x2 are each uniformly distributed in their rectangles but in a highly-dependent way: specifically, x2 is identical to x1 except that a random coordinate has been “re-randomized” within the rectangle. This distribution does not satisfy weak dependence (for any sets S and T that are disjoint along all axes we have Pr(T | S) = 0) but it is not hard to verify that D+ is ε-expanding for ε = Ω(1/d).
Example 2: Imagine that we have a learning problem such that the data in X1 falls into n different clusters: the positive class is the union of some of these clusters and the negative class is the union of the others. Imagine that this likewise is true if we look at X2 and for simplicity suppose that every cluster has the same probability mass. Independence given the label would say that given that x1 is in some positive cluster Ci in X1, x2 is equally likely to be in any of the positive clusters Cj in X2.
But, suppose we have something much weaker: each Ci in X1 is associated with only 3 Cj's in X2 (i.e., given that x1 is in Ci, x2 will only be in one of these Cj's). This distribution clearly will not even have the weak dependence property. However, say we have a learning algorithm that assumes everything in the same cluster has the same label (so the hypothesis space H consists of all rules that do not split clusters). Then if the graph of which clusters are associated with which is an expander graph, then the distributions will be expanding with respect to H. In particular, given a labeled example x, the learning algorithm will generalize to x's entire cluster Ci, then this will be propagated over to nodes in the associated clusters Cj in X2, and so on.
3 The Main Result
We now present our main result. We assume that D+ is ε-expanding (ε > 0) with respect to hypothesis class H1 × H2, that we are given initial confident sets S1^0 ⊆ X1+, S2^0 ⊆ X2+ such that Pr(S1^0 ∨ S2^0) ≥ ρ_init, that the target function can be written as ⟨c1, c2⟩ with c1 ∈ H1, c2 ∈ H2, and that on each of the two views we have algorithms A1 and A2 for learning from positive data only.
The iterative co-training that we consider proceeds in rounds. Let S1^i ⊆ X1 and S2^i ⊆ X2 be the confident sets in each view at the start of round i. We construct S2^{i+1} by feeding into A2 examples according to D2 conditioned on S1^i ∨ S2^i. That is, we take unlabeled examples from D such that at least one of the current predictors is confident, and feed them into A2 as if they were positive. We run A2 with error and confidence parameters given in the theorem below.
We simultaneously do the same with A1, creating S1^{i+1}. After a pre-determined number of rounds N (specified in Theorem 1), the algorithm terminates and outputs the predictor that labels examples ⟨x1, x2⟩ as positive if x1 ∈ S1^{N+1} or x2 ∈ S2^{N+1} and negative otherwise.
We begin by stating two lemmas that will be useful in our analysis. For both of these lemmas, let S1, T1 ⊆ X1+, S2, T2 ⊆ X2+, where Sj, Tj ∈ Hj. All probabilities are with respect to D+.
Lemma 1 Suppose Pr(S1 ∧ S2) ≤ Pr(¬S1 ∧ ¬S2), Pr(T1 | S1 ∨ S2) ≥ 1 − ε/8 and Pr(T2 | S1 ∨ S2) ≥ 1 − ε/8. Then Pr(T1 ∧ T2) ≥ (1 + ε/2) Pr(S1 ∧ S2).
Proof: From Pr(T1 | S1 ∨ S2) ≥ 1 − ε/8 and Pr(T2 | S1 ∨ S2) ≥ 1 − ε/8 we get that Pr(T1 ∧ T2) ≥ (1 − ε/4) Pr(S1 ∨ S2). Since Pr(S1 ∧ S2) ≤ Pr(¬S1 ∧ ¬S2) it follows from the expansion property that
Pr(S1 ∨ S2) = Pr(S1 ⊕ S2) + Pr(S1 ∧ S2) ≥ (1 + ε) Pr(S1 ∧ S2).
Therefore, Pr(T1 ∧ T2) ≥ (1 − ε/4)(1 + ε) Pr(S1 ∧ S2), which implies that Pr(T1 ∧ T2) ≥ (1 + ε/2) Pr(S1 ∧ S2).
Lemma 2 Suppose Pr(S1 ∧ S2) > Pr(¬S1 ∧ ¬S2) and let γ = 1 − Pr(S1 ∧ S2). If Pr(T1 | S1 ∨ S2) ≥ 1 − γε/8 and Pr(T2 | S1 ∨ S2) ≥ 1 − γε/8, then Pr(T1 ∧ T2) ≥ (1 + γε/8) Pr(S1 ∧ S2).
Proof: From Pr(T1 | S1 ∨ S2) ≥ 1 − γε/8 and Pr(T2 | S1 ∨ S2) ≥ 1 − γε/8 we get that Pr(T1 ∧ T2) ≥ (1 − γε/4) Pr(S1 ∨ S2). Since Pr(S1 ∧ S2) > Pr(¬S1 ∧ ¬S2) it follows from the expansion property that Pr(S1 ⊕ S2) ≥ ε Pr(¬S1 ∧ ¬S2). Therefore
γ = Pr(S1 ⊕ S2) + Pr(¬S1 ∧ ¬S2) ≥ (1 + ε) Pr(¬S1 ∧ ¬S2) ≥ (1 + ε)(1 − Pr(S1 ∨ S2))
and so Pr(S1 ∨ S2) ≥ 1 − γ/(1 + ε). This implies Pr(T1 ∧ T2) ≥ (1 − γε/4)(1 − γ/(1 + ε)) ≥ (1 − γ)(1 + γε/8). So, we have Pr(T1 ∧ T2) ≥ (1 + γε/8) Pr(S1 ∧ S2).
Theorem 1 Let ε_fin and δ_fin be the (final) desired accuracy and confidence parameters. Then we can achieve error rate ε_fin with probability 1 − δ_fin by running co-training for N = O((1/ε) log(1/ε_fin) + (1/ε) log(1/ρ_init)) rounds, each time running A1 and A2 with accuracy and confidence parameters set to ε·ε_fin/8 and δ_fin/2N respectively.
Proof Sketch: Assume that, for i ≥ 1, S1^i ⊆ X1+ and S2^i ⊆ X2+ are the confident sets in each view after step i − 1 of co-training. Define p_i = Pr(S1^i ∧ S2^i), q_i = Pr(¬S1^i ∧ ¬S2^i), and γ_i = 1 − p_i, with all probabilities with respect to D+. We are interested in bounding Pr(S1^N ∨ S2^N), but since technically it is easier to bound Pr(S1^N ∧ S2^N), we will instead show that p_N ≥ 1 − ε_fin with probability 1 − δ_fin, which obviously implies that Pr(S1^N ∨ S2^N) is at least as good.
By the guarantees on A1 and A2, after each round we get that with probability 1 − δ_fin/N, we have Pr(S1^{i+1} | S1^i ∨ S2^i) ≥ 1 − ε_fin·ε/8 and Pr(S2^{i+1} | S1^i ∨ S2^i) ≥ 1 − ε_fin·ε/8. In particular, this implies that with probability 1 − δ_fin/N, we have p_1 = Pr(S1^1 ∧ S2^1) ≥ (1 − ε/4) · Pr(S1^0 ∨ S2^0) ≥ (1 − ε/4) ρ_init.
Consider now i ≥ 1. If p_i ≤ q_i, since with probability 1 − δ_fin/N we have Pr(S1^{i+1} | S1^i ∨ S2^i) ≥ 1 − ε/8 and Pr(S2^{i+1} | S1^i ∨ S2^i) ≥ 1 − ε/8, using Lemma 1 we obtain that with probability 1 − δ_fin/N, we have Pr(S1^{i+1} ∧ S2^{i+1}) ≥ (1 + ε/2) Pr(S1^i ∧ S2^i). Similarly, by applying Lemma 2, we obtain that if p_i > q_i and γ_i ≥ ε_fin then with probability 1 − δ_fin/N we have Pr(S1^{i+1} ∧ S2^{i+1}) ≥ (1 + γ_i ε/8) Pr(S1^i ∧ S2^i). Assume now that it is the case that the learning algorithms A1 and A2 were successful on all the N rounds; note that this happens with probability at least 1 − δ_fin.
The above observations imply that so long as p_i ≤ 1/2 (so γ_i ≥ 1/2) we have p_{i+1} ≥ (1 + ε/16)^i (1 − ε/4) ρ_init. This means that after N1 = O((1/ε) log(1/ρ_init)) iterations of co-training we get to a situation where p_{N1} > 1/2. At this point, notice that every 8/ε rounds, γ drops by at least a factor of 2; that is, if γ_i ≤ 1/2^k then γ_{i + 8/ε} ≤ 1/2^{k+1}. So, after a total of O((1/ε) log(1/ε_fin) + (1/ε) log(1/ρ_init)) rounds, we have a predictor of the desired accuracy with the desired confidence.
4 Heuristic Analysis of Error propagation and Experiments
So far, we have assumed the existence of perfect classifiers on each view: there are no examples ⟨x1, x2⟩ with x1 ∈ X1+ and x2 ∈ X2− or vice-versa. In addition, we have assumed that given correctly-labeled positive examples as input, our learning algorithms are able to generalize in a way that makes only 1-sided error (i.e., they are never “confident but wrong”). In this section we give a heuristic analysis of the case when these assumptions are relaxed, along with several synthetic experiments on expander graphs.
4.1 Heuristic Analysis of Error propagation
Given confident sets S1^i ⊆ X1 and S2^i ⊆ X2 at the ith iteration, let us define their purity (precision) as pur_i = Pr_D(c(x) = 1 | S1^i ∨ S2^i) and their coverage (recall) to be cov_i = Pr_D(S1^i ∨ S2^i | c(x) = 1). Let us also define their “opposite coverage” to be opp_i = Pr_D(S1^i ∨ S2^i | c(x) = 0).
Previously, we assumed opp_i = 0 and therefore pur_i = 1. However, if we imagine that there is an η fraction of examples on which the two views disagree, and that positive and negative regions expand uniformly at the same rate, then even if initially opp_0 = 0, it is natural to assume the following form of increase in cov and opp:

cov_{i+1} = min( cov_i (1 + ε(1 − cov_i)) + η · (opp_{i+1} − opp_i), 1 ),   (1)
opp_{i+1} = min( opp_i (1 + ε(1 − opp_i)) + η · (cov_{i+1} − cov_i), 1 ).   (2)

Figure 1: Co-training with noise rates 0.1, 0.01, and 0.001 respectively (n = 5000). Solid line indicates overall accuracy; green (dashed, increasing) curve is accuracy on positives (cov_i); red (dashed, decreasing) curve is accuracy on negatives (1 − opp_i).

That is, this corresponds to both the positive and negative parts of the confident region expanding in the way given in the proof of Theorem 1, with an η fraction of the new edges going to examples of the other label. By examining (1) and (2), we can make a few simple observations. First, initially when coverage is low, every O(1/ε) steps we get roughly cov → 2·cov and opp → 2·opp + η·cov. So, we expect coverage to increase exponentially and purity to drop linearly.
However, once coverage gets large and begins to saturate, if purity is still high at this time it will begin dropping rapidly as the exponential increase in opp_i causes opp_i to catch up with cov_i. In particular, a calculation (omitted) shows that if D is 50/50 positive and negative, then overall accuracy increases up to the point when cov_i + opp_i = 1, and then drops from then on. This qualitative behavior is borne out in our experiments below.
4.2 Experiments
We performed experiments on synthetic data along the lines of Example 2, with noise added as in Section 4.1. Specifically, we create a 2n-by-2n bipartite graph. Nodes 1 to n on each side represent positive clusters, and nodes n + 1 to 2n on each side represent negative clusters. We connect each node on the left to three nodes on the right: each neighbor is chosen with probability 1 − η to be a random node of the same class, and with probability η to be a random node of the opposite class. We begin with an initial confident set S1 ⊆ X1+ and then propagate confidence through rounds of co-training, monitoring the percentage of the positive class covered, the percent of the negative class mistakenly covered, and the overall accuracy. Plots of three experiments are shown in Figure 1, for different noise rates (0.1, 0.01, and 0.001). As can be seen, these qualitatively match what we expect: coverage increases exponentially, but accuracy on negatives (1 − opp_i) drops exponentially too, though somewhat delayed. At some point there is a crossover where cov_i = 1 − opp_i, which as predicted roughly corresponds to the point at which overall accuracy starts to drop.

5 Conclusions
Co-training is a method for using unlabeled data when examples can be partitioned into two views such that (a) each view in itself is at least roughly sufficient to achieve good classification, and yet (b) the views are not too highly correlated.
Previous theoretical work has required instantiating condition (b) in a very strong sense: as independence given the label, or a form of weak dependence. In this work, we argue that the “right” condition is something much weaker: an expansion property on the underlying distribution (over positive examples) that we show is sufficient and to some extent necessary as well.
The expansion property is especially interesting because it directly motivates the iterative nature of many of the practical co-training based algorithms, and our work is the first rigorous analysis of iterative co-training in a setting that demonstrates its advantages over one-shot versions.
Acknowledgements: This work was supported in part by NSF grants CCR-0105488, NSF-ITR CCR-0122581, and NSF-ITR IIS-0312814.

References
[1] S. Abney. Bootstrapping. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 360–367, 2002.
[2] A. Blum and T. M. Mitchell. Combining labeled and unlabeled data with co-training. In Proc. 11th Annual Conference on Computational Learning Theory, pages 92–100, 1998.
[3] M. Collins and Y. Singer. Unsupervised models for named entity classification. In SIGDAT Conf. Empirical Methods in NLP and Very Large Corpora, pages 189–196, 1999.
[4] S. Dasgupta, M. L. Littman, and D. McAllester. PAC generalization bounds for co-training. In Advances in Neural Information Processing Systems 14. MIT Press, 2001.
[5] R. Ghani. Combining labeled and unlabeled data for text classification with a large number of categories. In Proceedings of the IEEE International Conference on Data Mining, 2001.
[6] T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning, pages 200–209, 1999.
[7] M. Kearns, M. Li, and L. Valiant. Learning Boolean formulae.
JACM, 41(6):1298–1328, 1995.
[8] A. Levin, P. Viola, and Y. Freund. Unsupervised improvement of visual detectors using co-training. In Proc. 9th IEEE International Conf. on Computer Vision, pages 626–633, 2003.
[9] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[10] K. Nigam and R. Ghani. Analyzing the effectiveness and applicability of co-training. In Proc. ACM CIKM Int. Conf. on Information and Knowledge Management, pages 86–93, 2000.
[11] K. Nigam, A. McCallum, S. Thrun, and T. M. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134, 2000.
[12] S. Park and B. Zhang. Large scale unstructured document classification using unlabeled data and syntactic information. In PAKDD 2003, LNCS vol. 2637, pages 88–99. Springer, 2003.
[13] D. Pierce and C. Cardie. Limitations of Co-Training for natural language learning from large datasets. In Proc. Conference on Empirical Methods in NLP, pages 1–9, 2001.
[14] R. Rivest and R. Sloan. Learning complicated concepts reliably and usefully. In Proceedings of the 1988 Workshop on Computational Learning Theory, pages 69–79, 1988.
[15] D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Meeting of the Association for Computational Linguistics, pages 189–196, 1995.
[16] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proc. 20th International Conf. Machine Learning, pages 912–919, 2003.

A Relating the definitions

We show here how Definition 2 implies Definition 1.

Theorem 2 If D+ satisfies ε-left-right expansion (Definition 2), then it also satisfies ε′-expansion (Definition 1) for ε′ = ε/(1 + ε).
Proof: We will prove the contrapositive.
Suppose there exist S1 ⊆ X1+, S2 ⊆ X2+ such that Pr(S1 ⊕ S2) < ε′ · min[Pr(S1 ∧ S2), Pr(¬S1 ∧ ¬S2)]. Assume without loss of generality that Pr(S1 ∧ S2) ≤ Pr(¬S1 ∧ ¬S2). Since Pr(S1 ∧ S2) + Pr(¬S1 ∧ ¬S2) + Pr(S1 ⊕ S2) = 1 it follows that Pr(S1 ∧ S2) ≤ 1/2 − Pr(S1 ⊕ S2)/2. Assume Pr(S1) ≤ Pr(S2). This implies that Pr(S1) ≤ 1/2 since Pr(S1) + Pr(S2) = 2 Pr(S1 ∧ S2) + Pr(S1 ⊕ S2) and so Pr(S1) ≤ Pr(S1 ∧ S2) + Pr(S1 ⊕ S2)/2. Now notice that

Pr(S2 | S1) = Pr(S1 ∧ S2)/Pr(S1) ≥ Pr(S1 ∧ S2)/(Pr(S1 ∧ S2) + Pr(S1 ⊕ S2)) > 1/(1 + ε′) ≥ 1 − ε.

But

Pr(S2) ≤ Pr(S1 ∧ S2) + Pr(S1 ⊕ S2) < (1 + ε′) Pr(S1 ∧ S2) ≤ (1 + ε) Pr(S1)

and so Pr(S2) < (1 + ε) Pr(S1). Similarly if Pr(S2) ≤ Pr(S1) we get a failure of expansion in the other direction. This completes the proof.
", "award": [], "sourceid": 2578, "authors": [{"given_name": "Maria-florina", "family_name": "Balcan", "institution": null}, {"given_name": "Avrim", "family_name": "Blum", "institution": null}, {"given_name": "Ke", "family_name": "Yang", "institution": null}]}