{"title": "Semi-Supervised Learning with Adversarially Missing Label Information", "book": "Advances in Neural Information Processing Systems", "page_first": 2244, "page_last": 2252, "abstract": "We address the problem of semi-supervised learning in an adversarial setting. Instead of assuming that labels are missing at random, we analyze a less favorable scenario where the label information can be missing partially and arbitrarily, which is motivated by several practical examples. We present nearly matching upper and lower generalization bounds for learning in this setting under reasonable assumptions about available label information. Motivated by the analysis, we formulate a convex optimization problem for parameter estimation, derive an efficient algorithm, and analyze its convergence. We provide experimental results on several standard data sets showing the robustness of our algorithm to the pattern of missing label information, outperforming several strong baselines.", "full_text": "Semi-Supervised Learning with Adversarially\n\nMissing Label Information\n\nUmar Syed\n\nBen Taskar\n\nDepartment of Computer and Information Science\n\nUniversity of Pennsylvania\n\nPhiladelphia, PA 19104\n\n{usyed,taskar}@cis.upenn.edu\n\nAbstract\n\nWe address the problem of semi-supervised learning in an adversarial setting. In-\nstead of assuming that labels are missing at random, we analyze a less favor-\nable scenario where the label information can be missing partially and arbitrarily,\nwhich is motivated by several practical examples. We present nearly matching\nupper and lower generalization bounds for learning in this setting under reason-\nable assumptions about available label information. Motivated by the analysis, we\nformulate a convex optimization problem for parameter estimation, derive an ef\ufb01-\ncient algorithm, and analyze its convergence. 
We provide experimental results on several standard data sets showing the robustness of our algorithm to the pattern of missing label information, outperforming several strong baselines.

1 Introduction

Semi-supervised learning algorithms use both labeled and unlabeled examples. Most theoretical analyses of semi-supervised learning assume that m + n labeled examples are drawn i.i.d. from a distribution, and then a subset of size n is chosen uniformly at random and their labels are erased [1]. This missing-at-random assumption is best suited for a situation where the labels are acquired by annotating a random subset of all available data. But in many applications of semi-supervised learning, the partially-labeled data is "naturally occurring", and the learning algorithm has no control over which examples were labeled.

For example, pictures on popular websites like Facebook and Flickr are tagged by users at their discretion, and it is difficult to know how users decide which pictures to tag. A similar problem occurs when data is submitted to an online labor marketplace, such as Amazon Mechanical Turk, to be manually labeled. The workers who label the data are often poorly motivated, and may deliberately skip examples that are difficult to label correctly. In such a setting, a learning algorithm should not assume that the examples were labeled at random.

Additionally, in many semi-supervised learning settings, the partial label information is not provided on a per-example basis. For example, in multiple instance learning [2], examples are presented to a learning algorithm in sets, with either zero or one positive examples per set. In graph-based regularization [3], a learning algorithm is given information about which examples are likely to have the same label, but not necessarily the identity of that label. 
Recently, there has been much interest in algorithms that learn from labeled features [4]; in this setting, the learning algorithm is given information about the expected value of several features with respect to the true distribution on labeled examples.

To summarize, in a typical semi-supervised learning problem, label information is often missing in an arbitrary fashion, and even when present, does not always have a simple form, like one label per example. Our goal in this paper is to develop and analyze a learning algorithm that is explicitly designed for these types of problems. We derive our learning algorithm within a framework that is expressive enough to permit a very general notion of label information, allowing us to make minimal assumptions about which examples in a data set have been labeled, how they have been labeled, and why. We present both theoretical upper and lower bounds for learning in this framework, and motivated by these bounds, derive a simple yet provably optimal learning algorithm. We also provide experimental results on several standard data sets, which show that our algorithm is effective and robust when the label information has been provided by "lazy" or "unhelpful" labelers.

Related Work: Our learning framework is related to the malicious label noise setting, in which the labeler is allowed to mislabel a small fraction of the training set (this is a special case of the even more challenging malicious noise setting [5], where an adversary can inject a small number of arbitrary examples into the training set). Learning with this type of label noise is known to be quite difficult, and positive results often make quite restrictive assumptions about the underlying data distribution [6, 7]. 
By contrast, our results apply far more generally, at the expense of assuming\na more benign (but possibly more realistic) model of label noise, where the labeler can adversarially\nerase labels, but not change them. In other words, we assume that the labeler equivocates, but does\nnot lie. The difference in these assumptions shows up quite clearly in our analysis: As we point out\nin Section 3, our bounds become vacuous if the labeler is allowed to mislabel data.\n\nIn Section 2 we describe how our framework encodes label information in a label regularization\nfunction, which closely resembles the idea of a compatibility function introduced by Balcan & Blum\n[8]. However, they did not analyze a setting where this function is selected adversarially.\n\n2 Learning Framework\n\nLet X be the set of all possible examples, and Y the set of all possible labels, where |Y| = k. Let D\nbe an unknown distribution on X \u00d7 Y. We write x and y as abbreviations for (x1, . . . , xm) \u2208 X m\nand (y1, . . . , ym) \u2208 Y m, respectively. We write (x, y) \u223c Dm to denote that each (xi, yi) is drawn\ni.i.d. from the distribution D on X \u00d7 Y, and x \u223c Dm to denote that each xi is drawn i.i.d. from the\nmarginal distribution of D on X .\nLet (\u02c6x, \u02c6y) \u223c Dm be the m labeled training examples. In supervised learning, one assumes access to\nthe entire training set (\u02c6x, \u02c6y). In semi-supervised learning, one assumes access to only some of the\nlabels \u02c6y, and in most theoretical analyses, the missing components of \u02c6y are assumed to have been\nselected uniformly at random.\n\nWe make a much weaker assumption about what label information is available. We assume that,\nafter the labeled training set (\u02c6x, \u02c6y) has been drawn, the learning algorithm is only given access to\nthe examples \u02c6x and to a label regularization function R. 
The function R encodes some information about the labels ŷ of x̂, and is selected by a potentially adversarial labeler from a family R(x̂, ŷ). A label regularization function R maps each possible soft labeling q of the training examples x̂ to a real number R(q) (a soft labeling is a natural generalization of a labeling that we will define formally in a moment). Except for knowing that R belongs to R(x̂, ŷ), the learner can make no assumptions about how the labeler selects R. We give examples of label regularization functions in Section 2.1.

Let Δ denote the set of distributions on Y. A soft labeling q ∈ Δ^m of the training examples x̂ is a doubly-indexed vector, where q(i, y) is interpreted as the probability that example x̂_i has label y ∈ Y. The correct soft labeling has q(i, y) = 1{y = ŷ_i}, where the indicator function 1{·} is 1 when its argument is true and 0 otherwise; we overload notation and write ŷ to denote the correct soft labeling.

Although the labeler is possibly adversarial, the family R(x̂, ŷ) of label regularization functions restricts the choices the labeler can make. We are interested in designing learning algorithms that work well when each R ∈ R(x̂, ŷ) assigns a low value to the correct labeling ŷ. In the examples we describe in Section 2.1, the correct labeling ŷ will be near the minimum of R, but there will be many other minima and near-minima as well. 
This is the sense in which label information is "missing" — it is difficult for any learning algorithm to distinguish among these minima.

We emphasize that, while our algorithms work best when ŷ is close to the minimum of each R ∈ R(x̂, ŷ), nothing in our framework requires this to be true; in Section 3 we will see that our learning bounds degrade gracefully as this condition is violated.

We are interested in learning a parameterized model that predicts a label y given an example x. Let L(θ, x, y) be the loss of parameter θ ∈ R^d with respect to labeled example (x, y). While some of the development in this paper will apply to generic loss functions, two loss functions that will particularly interest us are the negative log-likelihood of a log-linear model

L_like(θ, x, y) = −log p_θ(y | x) = −log [ exp(θ^T φ(x, y)) / Σ_{y′} exp(θ^T φ(x, y′)) ]

where φ(x, y) ∈ R^d is the feature function, and the 0-1 loss of a linear classifier

L_{0,1}(θ, x, y) = 1{argmax_{y′∈Y} θ^T φ(x, y′) ≠ y}.

Given training examples x̂, label regularization function R, and loss function L, the goal of a learning algorithm is to find a parameter θ that minimizes the expected loss E_D[L(θ, x, y)], where E_D[·] denotes expectation with respect to (x, y) ∼ D.

Let E_{x̂,q}[f(x, y)] denote the expected value of f(x, y) when example x is chosen uniformly at random from the training examples x̂ and — supposing that this is example x̂_i — label y is chosen from the distribution q(i, ·). 
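As a concrete sketch of these definitions (our own illustration, not code from the paper; the helper names `nll_loss`, `zero_one_loss`, and `expected_loss`, and the array layout `phis[i][y] = φ(x_i, y)`, are assumptions), the two losses and the expectation E_{x̂,q}[·] can be computed as:

```python
import numpy as np

def nll_loss(theta, phi_x, y):
    """L_like: negative log-likelihood of a log-linear model.
    phi_x[y'] is the feature vector phi(x, y') for each label y'."""
    scores = phi_x @ theta                       # theta^T phi(x, y') for all y'
    log_z = np.log(np.sum(np.exp(scores - scores.max()))) + scores.max()
    return log_z - scores[y]                     # -log p_theta(y | x)

def zero_one_loss(theta, phi_x, y):
    """L_{0,1}: 0-1 loss of the classifier argmax_y' theta^T phi(x, y')."""
    return float(np.argmax(phi_x @ theta) != y)

def expected_loss(theta, phis, q, loss):
    """E_{x,q}[L]: example chosen uniformly, label drawn from q(i, .)."""
    m, k = q.shape
    return sum(q[i, y] * loss(theta, phis[i], y)
               for i in range(m) for y in range(k)) / m
```

With q equal to the one-hot encoding of ŷ, `expected_loss` reduces to the empirical loss E_{x̂,ŷ}[L].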
Accordingly, E_{x̂,ŷ}[f(x, y)] denotes the expected value of f(x, y) when labeled example (x, y) is chosen uniformly at random from the labeled training examples (x̂, ŷ).

2.1 Examples of Label Regularization Functions

To make the concept of a label regularization function more clear, we describe several well-known learning settings in which the information provided to the learning algorithm is less than the fully labeled training set. We show that, for each of these settings, there is a natural definition of R that captures the information that is provided to the learning algorithm, and thus each of these settings can be seen as a special case of our framework.

Before proceeding with the partially labeled cases, we explain how supervised learning can be expressed in our framework. In the supervised learning setting, the label of every example in the training set is revealed to the learner. In this setting, the label regularization function family R(x̂, ŷ) contains a single function R_ŷ such that R_ŷ(q) = 0 if q = ŷ, and R_ŷ(q) = ∞ otherwise.

In the semi-supervised learning setting, the labels of only some of the training examples are revealed. In this case, there is a function R_I ∈ R(x̂, ŷ) for each I ⊆ [m] such that R_I(q) = 0 if q(i, y) = 1{y = ŷ_i} for all i ∈ I and y ∈ Y, and R_I(q) = ∞ otherwise. In other words, R_I(q) is zero if and only if the soft labeling q agrees with ŷ on the examples in I. This implies that R_I(q) is independent of how q labels examples not in I — these are the examples whose labels are missing.

In the ambiguous learning setting [9, 10], which is a generalization of semi-supervised learning, the labeler reveals a label set Ŷ_i ⊆ Y for each training example x̂_i such that ŷ_i ∈ Ŷ_i. 
That is, for each training example, the learning algorithm is given a set of possible labels the example can have (semi-supervised learning is the special case where each label set has size 1 or k). Letting Ŷ = (Ŷ_1, . . . , Ŷ_m) be all the label sets revealed to the learner, there is a function R_Ŷ ∈ R(x̂, ŷ) for each possible Ŷ such that R_Ŷ(q) = 0 if supp(q_i) ⊆ Ŷ_i for all i ∈ [m], and R_Ŷ(q) = ∞ otherwise. Here q_i ≜ q(i, ·) and supp(q_i) is the support of label distribution q_i. In other words, R_Ŷ(q) is zero if and only if the soft labeling q is supported on the sets Ŷ_1, . . . , Ŷ_m.

The label regularization functions described above essentially give only local information; they specify, for each example in the training set, which labels are possible for that example. In some cases, we may want to allow the labeler to provide more global information about the correct labeling.

One example of providing global information is Laplacian regularization, a kind of graph-based regularization [3] that encodes information about which examples are likely to have the same labels. For any soft labeling q, let q[y] be the m-length vector whose ith component is q(i, y). The Laplacian regularizer is defined to be R_L(q) = Σ_{y∈Y} q[y]^T L(x̂) q[y], where L(x̂) is an m × m positive semi-definite matrix defined so that R_L(q) is large whenever examples in x̂ that are believed to have the same label are assigned different label distributions by q.

Another possibility is posterior regularization. Define a feature function f(x, y) ∈ R^ℓ; these features may or may not be related to the model features φ defined in Section 2. 
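The regularizers introduced so far can be written down directly. The following is a minimal sketch (our own illustration, not the paper's code), with soft labelings represented as m × k arrays and the value ∞ represented as `float("inf")`; the factory names are assumptions:

```python
import numpy as np

INF = float("inf")

def make_semi_supervised_R(I, y_hat):
    """R_I: zero iff q matches the revealed labels y_hat[i] for i in I."""
    def R(q):
        ok = all(q[i, y_hat[i]] == 1.0 for i in I)
        return 0.0 if ok else INF
    return R

def make_ambiguous_R(label_sets):
    """R_Yhat: zero iff each q(i, .) is supported on the revealed label set."""
    def R(q):
        for i, allowed in enumerate(label_sets):
            support = {y for y in range(q.shape[1]) if q[i, y] > 0}
            if not support <= allowed:
                return INF
        return 0.0
    return R

def make_laplacian_R(L):
    """R_L(q) = sum_y q[y]^T L q[y] for a PSD m-by-m matrix L."""
    def R(q):
        return sum(q[:, y] @ L @ q[:, y] for y in range(q.shape[1]))
    return R
```

Note that the first two regularizers take only the values 0 and ∞, while the Laplacian term is finite everywhere, matching the local/global distinction drawn above.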
As noted by several authors [4, 11, 12], it is often convenient for a labeler to provide information about the expected value of f(x, y) with respect to the true distribution. A typical posterior regularizer of this type will have the form R_{f,b}(q) = ‖E_{x̂,q}[f(x, y)] − b‖², where the vector b ∈ R^ℓ is the labeler's estimate of the expected value of f. This term penalizes soft labelings q which cause the expected value of f on the training set to deviate from b.

Label regularization functions can also be added together. So, for instance, ambiguous learning can be combined with a Laplacian regularizer, and in this case the learner is given a label regularization function of the form R_Ŷ(q) + R_L(q). We will experiment with these kinds of combined regularization functions in Section 5.

Note that, in all the examples described above, while the correct labeling ŷ is at or close to the minimum of each function R ∈ R(x̂, ŷ), there may be many labelings meeting this condition. Again, this is the sense in which label information is "missing".

It is also important to note that we have only specified what information the labeler can reveal to the learner (some function from the set R(x̂, ŷ)), but we have not specified how that information is chosen by the labeler (which function R ∈ R(x̂, ŷ)?). This will have a significant impact on our analysis of this framework. To see why, consider the example of semi-supervised learning. Using the notation defined above, most analyses of semi-supervised learning assume that R_I is chosen by selecting a random subset I of the training examples [13, 14]. By contrast, we make no assumptions about how R_I is chosen, because we are interested in settings where such assumptions are not realistic.

3 Upper and Lower Bounds

In this section, we state upper and lower bounds for learning in our framework. 
But first, we provide a definition of the well-known concept of uniform convergence.

Definition 1 (Uniform Convergence). Loss function L has ε-uniform convergence if with probability 1 − δ

sup_{θ∈Θ} | E_D[L(θ, x, y)] − E_{x̂,ŷ}[L(θ, x, y)] | ≤ ε(δ, m)

where (x̂, ŷ) ∼ D^m and ε(·, ·) is an expression bounding the rate of convergence.

For example, if ‖φ(x, y)‖ ≤ c for all (x, y) ∈ X × Y and Θ = {θ : ‖θ‖ ≤ 1} ⊆ R^d, then the loss function L_like has ε-uniform convergence with ε(δ, m) = O(c √((d log m + log(1/δ)) / m)), which follows from standard results about Rademacher complexity and covering numbers. Other commonly used loss functions, such as hinge loss and 0-1 loss, also have ε-uniform convergence under similar boundedness assumptions on φ and Θ.

We are now ready to state an upper bound for learning in our framework. The proof is contained in the supplement.

Theorem 1. Suppose loss function L has ε-uniform convergence. If (x̂, ŷ) ∼ D^m then with probability at least 1 − δ, for all parameters θ ∈ Θ and label regularization functions R ∈ R(x̂, ŷ),

E_D[L(θ, x, y)] ≤ max_{q∈Δ^m} (E_{x̂,q}[L(θ, x, y)] − R(q)) + R(ŷ) + ε(δ, m).

Theorem 2 below states a lower bound that nearly matches the upper bound in Theorem 1, in certain cases. As we will see, the existence of a matching lower bound depends strongly on the structure of the label regularization function family R. 
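For the semi-supervised regularizer R_I of Section 2.1, the adversarial term max_{q∈Δ^m}(E_{x̂,q}[L(θ, x, y)] − R(q)) in Theorem 1 has a simple closed form: E_{x̂,q}[L] is linear in q and R_I is 0 exactly where q agrees with ŷ on I, so the maximum decomposes per example, with revealed examples contributing their true-label loss and the remaining examples labeled adversarially. A sketch (our own illustration; the name `adversarial_risk` and the loss-matrix layout are assumptions):

```python
import numpy as np

def adversarial_risk(losses, I, y_hat):
    """Value of max_q (E_{x,q}[L] - R_I(q)) for the semi-supervised
    regularizer R_I.  losses[i, y] holds L(theta, x_i, y); examples in I
    pay their true-label loss, the rest pay their worst-label loss."""
    m = losses.shape[0]
    total = sum(losses[i, y_hat[i]] if i in I else losses[i].max()
                for i in range(m))
    return total / m
```

Since R_I(ŷ) = 0 in this setting, Theorem 1 then bounds E_D[L] by this quantity plus ε(δ, m).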
Note that, given a labeled training set (x, y), the set R(x, y) essentially constrains what information the labeler can reveal to the learning algorithm, thereby encoding our assumptions about how the labeler will behave. We make three such assumptions, described below. For the remainder of this section, we let the set of all possible examples X = {x̃_1, . . . , x̃_N} be finite.

Recall that all the label regularization functions described in Section 2.1 use the value ∞ to indicate which labelings of the training set are impossible. Our first assumption is that, for each R ∈ R(x, y), the set of possible labelings under R is separable over examples.

Assumption 1 (∞-Separability). For all labeled training sets (x, y) and R ∈ R(x, y) there exists a collection of label sets {Y_x̃ : x̃ ∈ X} and a real-valued function F such that R(q) = Σ_{i=1}^m χ{supp(q_i) ⊆ Y_{x_i}} + F(q), where the characteristic function χ{·} is 0 when its argument is true and ∞ otherwise, and F(q) < ∞ for all q ∈ Δ^m.

It is easy to verify that all the examples of label regularization function families given in Section 2.1 satisfy Assumption 1. Also note that Assumption 1 allows the finite part of R (denoted by F) to depend on the entire soft labeling q in a basically arbitrary manner.

Before describing our second assumption, we need a few additional definitions. We write h to denote a labeling function that maps examples X to labels Y. Also, for any labeling function h and unlabeled training set x ∈ X^m, we let h(x) ∈ Y^m denote the vector of labels whose ith component is h(x_i). 
Let p_x be an N-length vector that represents unlabeled training set x as a distribution on X, whose ith component is p_x(i) ≜ |{j : x_j = x̃_i}| / m.

Our second assumption is that the labeler's behavior is stable: if training sets (x, y) and (x′, y′) are "close" (by which we mean that they are consistently labeled and ‖p_x − p_x′‖_∞ is small), then the label regularization functions available to the labeler for each training set are the "same", in the sense that the sets of possible labelings under each of them are identical.

Assumption 2 (γ-Stability). For any labeling function h* and unlabeled training sets x, x′ such that ‖p_x − p_x′‖_∞ ≤ γ the following holds: for all R ∈ R(x, h*(x)) there exists R′ ∈ R(x′, h*(x′)) such that R(h(x)) < ∞ if and only if R′(h(x′)) < ∞, for all labeling functions h.

Our final assumption, which we call reciprocity, states that there is no way to deduce which of the possible labelings under R is the correct one by examining R alone.

Assumption 3 (Reciprocity). For all labeled training sets (x, y) and R ∈ R(x, y), if R(y′) < ∞ then R ∈ R(x, y′).

Of all our assumptions, reciprocity may seem the most unnatural and unmotivated. We argue that it is justified for two reasons: firstly, all the examples of label regularization function families given in Section 2.1 satisfy this assumption, and secondly, in Theorem 3 we show that lifting the reciprocity assumption makes the upper bound in Theorem 1 very loose.

We are nearly ready to state our lower bound. Let A be a (possibly randomized) learning algorithm that takes a set of unlabeled training examples x̂ and a label regularization function R as input, and outputs an estimated parameter θ̂. 
Also, if under distribution D each example x ∈ X is associated with exactly one label h*(x) ∈ Y, then we write D = D_X · h*, where the data distribution D_X is the marginal distribution of D on X. Theorem 2 proves the existence of a true labeling function h* such that a nearly tight lower bound holds for all learning algorithms A and all data distributions D_X whenever the training set is drawn from D_X · h*. The fact that our lower bound holds for all data distributions significantly complicates the analysis, but this generality is important: since D_X is typically easy to estimate, it is possible that the learning algorithm A has been tuned for D_X. The proof of Theorem 2 is contained in the supplement.

Theorem 2. Suppose Assumptions 1, 2 and 3 hold for label regularization function family R, the loss function L is 0-1 loss, and the set of all possible examples X is finite. For all learning algorithms A and data distributions D_X there exists a labeling function h* such that if (x̂, ŷ) ∼ D^m (where D = D_X · h*) and m ≥ O((1/γ²) log(|X|/δ)), then with probability at least 1/4 − 2δ

E_D[L(θ̂, x, y)] ≥ (1/4) ( max_{q∈Δ^m} (E_{x̂,q}[L(θ̂, x, y)] − R(q)) + min_{q∈Δ^m} R(q) ) − ε(δ, m)

for some R ∈ R(x̂, ŷ), where θ̂ is the parameter output by A, and γ is the constant from Assumption 2.

Obviously, Assumptions 1, 2 and 3 restrict the kinds of label regularization function families to which Theorem 2 can be applied. However, some restriction is necessary in order to prove a meaningful lower bound, as Theorem 3 below shows. 
This theorem states that if Assumption 3 does not hold, then it may happen that each family R(x, y) has a structure which a clever (but computationally infeasible) learning algorithm can exploit to perform much better than the upper bound given in Theorem 1. The proof of Theorem 3, which is contained in the supplement, constructs an example of such a family.

Theorem 3. Suppose the loss function L is 0-1 loss. There exists a label regularization function family R that satisfies Assumptions 1 and 2, but not Assumption 3, and a learning algorithm A such that for all distributions D, if (x̂, ŷ) ∼ D^m then with probability at least 1 − δ

E_D[L(θ̂, x, y)] ≤ max_{q∈Δ^m} (E_{x̂,q}[L(θ̂, x, y)] − R(q)) + min_{q∈Δ^m} R(q) + ε(δ, m) − 1

for some R ∈ R(x̂, ŷ), where θ̂ is the parameter output by A.

Whenever lim_{m→∞} ε(δ, m) = 0, the gap between the upper and lower bounds in Theorems 1 and 2 approaches R(ŷ) − min_q R(q) as m → ∞ (ignoring constant factors). Therefore, these bounds are asymptotically matching if the labeler always chooses a label regularization function R such that R(ŷ) = min_q R(q). We emphasize that this is true even if ŷ is a nonunique minimum of R. Several of the example learning settings described in Section 2.1, such as semi-supervised learning and ambiguous learning, meet this criterion. On the other hand, if R(ŷ) − min_q R(q) is large, then the gap is very large, and the utility of our analysis degrades. In the extreme case that R(ŷ) = ∞ (i.e., the correct labeling of the training set is not possible under R), our upper bound is vacuous. 
In this sense, our framework is best suited to settings in which the information provided by the labeler is equivocal, but not actually untruthful, as it is in the malicious label noise setting [6, 7]. Finally, note that if lim_{m→∞} ε(δ, m) = 0, then the upper bound in Theorem 3 is smaller than the lower bound in Theorem 2 for all sufficiently large m, which establishes the importance of Assumption 3.

4 Algorithm

Given the unlabeled training examples x̂ and label regularization function R, the bounds in Section 3 suggest an obvious learning algorithm: find a parameter θ* that realizes the minimum

min_θ max_{q∈Δ^m} (E_{x̂,q}[L(θ, x, y)] − R(q)) + α‖θ‖².    (1)

The objective (1) is simply the minimization of the upper bound in Theorem 1, with one difference: for algorithmic convenience, we do not minimize over the set Θ, but instead add the quantity α‖θ‖² to the objective and leave θ unconstrained (here, and in the rest of the paper, ‖·‖ denotes the L2 norm). If we assume that Θ = {θ : ‖θ‖ ≤ c} for some c > 0, then this modification is without loss of generality, since there exists a constant α_c for which this is an equivalent formulation.

In order to estimate θ*, throughout this section we make the following assumption about the loss function L and label regularization function R.

Assumption 4. The loss function L is convex in θ, and the label regularization function R is convex in q.

It is easy to verify that all of the loss functions and label regularization functions we gave as examples in Sections 2 and 2.1 satisfy Assumption 4. Instead of finding θ* directly, our approach will be to "swap" the min and max in (1), find the soft labeling q* that realizes the maximum, and then use q* to compute θ*. 
For convenience, we abbreviate the function that appears in the objective (1) as F(θ, q) ≜ E_{x̂,q}[L(θ, x, y)] − R(q) + α‖θ‖². A high-level version of our learning algorithm — called GAME due to the use of a game-theoretic minimax theorem in its proof of correctness — is given in Algorithm 1; the implementation details for each step are given below Theorem 4.

Algorithm 1 GAME: Game for Adversarially Missing Evidence
1: Given: Constants ε1, ε2 > 0.
2: Find q̃ such that min_θ F(θ, q̃) ≥ max_{q∈Δ^m} min_θ F(θ, q) − ε1
3: Find θ̃ such that F(θ̃, q̃) ≤ min_θ F(θ, q̃) + ε2
4: Return: Parameter estimate θ̃.

In the first step of Algorithm 1, we modify the objective (1) by swapping the min and max, and then find a soft labeling q̃ that approximately maximizes this modified objective. In the next step, we find a parameter θ̃ that approximately minimizes the original objective with respect to the fixed soft labeling q̃. The next theorem proves that Algorithm 1 produces a good estimate of θ*, the minimum of the objective (1). Its proof is in the supplement.

Theorem 4. The parameter θ̃ output by Algorithm 1 satisfies ‖θ̃ − θ*‖ ≤ √((8/α)(ε1 + ε2)).

We now briefly explain how the steps of Algorithm 1 can be implemented using off-the-shelf algorithms. For concreteness, we focus on an implementation for the loss function L = L_like, which is also the loss function we use in our experiments in Section 5.

The second step of Algorithm 1 is the easier one, so we explain it first. In this step, we need to minimize F(θ, q̃) over θ. 
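As a toy sketch of step 2 for L = L_like (our own illustration under assumed synthetic features, using plain full-batch gradient descent rather than the stochastic methods used in practice; since R(q̃) does not depend on θ, it is dropped from the minimization):

```python
import numpy as np

def nll_grad(theta, phis, q, alpha):
    """Gradient in theta of E_{x,q}[L_like] + alpha ||theta||^2.
    phis[i][y] is phi(x_i, y); q[i] is the soft label distribution q(i, .)."""
    m, k, d = phis.shape
    g = 2 * alpha * theta
    for i in range(m):
        scores = phis[i] @ theta
        p = np.exp(scores - scores.max())
        p /= p.sum()                       # model posterior p_theta(. | x_i)
        g += (p - q[i]) @ phis[i] / m      # model expectation minus q expectation
    return g

def fit_theta(phis, q_tilde, alpha=0.1, lr=0.5, iters=500):
    """Step 2 of Algorithm 1: approximately minimize F(., q_tilde)."""
    theta = np.zeros(phis.shape[2])
    for _ in range(iters):
        theta -= lr * nll_grad(theta, phis, q_tilde, alpha)
    return theta
```

Because the objective is α-strongly convex in θ, plain gradient descent with a small enough step size converges linearly here; any maximum-likelihood solver for log-linear models would do.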
Since q̃ is fixed in this minimization, we can ignore the R(q̃) term in the definition of F, and we see that this minimization amounts to maximizing the likelihood of a log-linear model. This is a very well-studied problem, and there are numerous efficient methods available for solving it, such as stochastic gradient descent.

The first step of Algorithm 1 is more complicated, as it requires finding the maximum of a max-min objective. Our approach is to first take the dual of the inner minimization; after doing this, the function to maximize becomes G(p, q) ≜ H(p) − (1/α)‖Δφ(p, q)‖² − R(q), where we let H(p) ≜ −Σ_{i,y} p(i, y) log p(i, y) and Δφ(p, q) ≜ E_{x̂,p}[φ(x, y)] − E_{x̂,q}[φ(x, y)]. By convex duality we have max_q min_θ F(θ, q) = max_{p,q} G(p, q). This dual has been previously derived by several authors; see [15] for more details. Note that G is a concave function, and we need to maximize it over simplex constraints. Exponentiated-gradient-style algorithms [16, 15] are well-suited for this kind of problem, as they "natively" maintain the simplex constraint, and converged quickly in the experiments described in Section 5.

5 Experiments

We tested our GAME algorithm (Algorithm 1) on several standard learning data sets. In all of our experiments, we labeled a fraction of the training examples in a non-random manner that was designed to simulate various types of difficult — even adversarial — labelers.

Our first set of experiments involved two binary classification data sets that belong to a benchmark suite1 accompanying a widely-used semi-supervised learning book [1]: the Columbia object image library (COIL) [17], and a data set of EEG scans of a human subject connected to a brain-computer interface (BCI) [18]. 
For each data set, a training set was formed by randomly sampling a subset of\nthe data in a way that produced a skewed class distribution. We de\ufb01ned the outlier score of a training\nexample to be the fraction of its nearest neighbors that belong to a different class. For several values\nof p \u2208 [0, 1] and for each training set, we labeled only the p-fraction of examples with the highest\noutlier score. In this way, we simulated an \u201cunhelpful\u201d labeler who only labels examples that are\nexceptions to the general rule, thinking (perhaps sincerely, but erroneously) that this is the most\neffective use of her effort.\n\nWe tested three algorithms on these data sets: GAME, where R(\u02c6x, \u02c6y) was chosen to match the\nsemi-supervised learning setting with a Laplacian regularizer (see Section 2.1); Laplacian SVM\n[3]; and Transductive SVM [19]. When constructing the Laplacian matrix and choosing values for\nhyperparameters, we adhered closely to the model-selection procedure described in [1, Sections\n21.2.1 and 21.2.5]. The results of our experiments are given in Figures 1(a) and 1(b).\n\nWe also tested the GAME algorithm on a multiclass data set, namely a subset of the Labeled Faces\nin the Wild data set [20], a standard corpus of face photographs. Our subset contained 500 faces\nof the top 10 characters from the corpus, but with a randomly skewed distribution, so that some\nfaces appeared more often than others. The feature representation for each photograph was PCA\non the pixel values (i.e., eigenfaces). We used an ambiguously-labeled version of this data set,\nwhere each face in the training set is associated with one or more labels, only one of which is correct\n(see Section 2.1 for a de\ufb01nition of ambiguous learning). 
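The "unhelpful" labeler protocol described above can be sketched as follows (our own illustration, not the authors' experiment code; the neighborhood size k and function names are assumptions):

```python
import numpy as np

def outlier_scores(X, y, k=5):
    """Outlier score of each example: the fraction of its k nearest
    neighbors (Euclidean distance) that belong to a different class."""
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)          # an example is not its own neighbor
    nn = np.argsort(dist, axis=1)[:, :k]
    return np.array([(y[nn[i]] != y[i]).mean() for i in range(len(y))])

def unhelpful_labeler(X, y, p, k=5):
    """Reveal labels only for the p-fraction with the highest outlier score."""
    scores = outlier_scores(X, y, k)
    n_label = int(p * len(y))
    return set(np.argsort(-scores)[:n_label])   # indices of revealed labels
```

An example sitting inside the opposite class's cluster gets score 1 and is labeled first, while typical examples are left unlabeled, which is exactly the pattern this experiment simulates.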
We labeled training examples to simulate a "lazy" labeler, in the following way: For each pair of labels (y, y′), we sorted the examples with true label y with respect to their distance, in feature space, from the centroid of the cluster of examples with true label y′. For several values of p ∈ [0, 1], we added the label y′ to the top p-fraction of this list. The net effect of this procedure is that examples on the "border" of the two clusters are given both labels y and y′ in the training set. The idea behind this labeling procedure is to mimic a (realistic, in our view) situation where a "lazy" labeler declines to commit to one label for those examples that are especially difficult to distinguish.

¹This benchmark suite contains several data sets; we selected these two because they contain a large number of examples that meet our definition of outliers.

[Figure 1: three plots of accuracy vs. fraction of training set labeled; panels (a) and (b) compare Transductive SVM, Laplacian SVM, and GAME, while panel (c) compares Uniform, EM, and GAME.]

Figure 1: (a) Accuracy vs. fraction of unlabeled data for BCI data set. (b) Accuracy vs. fraction of unlabeled data for COIL data set. (c) Accuracy vs. fraction of partially labeled data for Faces in the Wild data set. In all plots, error bars represent 1 standard deviation over 10 trials.

We tested the GAME algorithm on this data set, where R(x̂, ŷ) was chosen to match the ambiguous learning setting with a Laplacian regularizer (see Section 2.1).
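The "lazy" ambiguous-labeling procedure described above can be sketched as follows. This is a minimal sketch under our own assumptions: the function name is hypothetical, distances are Euclidean, and each example's ambiguous label set is represented as a Python set that always contains the true label.

```python
import numpy as np

def lazy_labeler(X, y, p):
    """For each ordered pair of labels (a, b), add b to the ambiguous
    label set of the p-fraction of class-a examples that lie closest,
    in feature space, to the centroid of class b."""
    labels = np.unique(y)
    centroids = {c: X[y == c].mean(axis=0) for c in labels}
    # Every example starts with its true label only.
    label_sets = [{c} for c in y]
    for a in labels:
        idx_a = np.flatnonzero(y == a)
        for b in labels:
            if b == a:
                continue
            # Sort class-a examples by distance to the class-b centroid.
            dists = np.linalg.norm(X[idx_a] - centroids[b], axis=1)
            top = idx_a[np.argsort(dists)[: int(round(p * len(idx_a)))]]
            for i in top:
                label_sets[i].add(b)    # "border" examples get both labels
    return label_sets
```

With two well-separated clusters, only the example of each class nearest to the other class's centroid receives both labels, matching the "border" effect described in the text.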
We compared with two algorithms from [9]: UNIFORM, which assumes each label in the ambiguous label set is equally likely, and learns a maximum likelihood log-linear model; and a discriminative EM algorithm that guesses the true labels, learns the most likely parameter, updates the guess, and repeats. The results of our experiments are given in Figure 1(c).

Perhaps the best way to characterize the difference between GAME and the algorithms we compared it to is that the other algorithms are "optimistic", by which we mean they assume that the missing labels most likely agree with the estimated parameter, while GAME is a "pessimistic" algorithm that, because it was designed for an adversarial setting, assumes exactly the opposite. The results of our experiments indicate that, for certain labeling styles, as the fraction of fully labeled examples decreases, the GAME algorithm's pessimistic approach is substantially more effective. Importantly, Figures 1(a)-(c) show that the GAME algorithm's performance advantage is most significant when the number of labeled examples is very small. Semi-supervised learning algorithms are often promoted as being able to learn from only a handful of labeled examples. Our results show that this ability may be quite sensitive to how these examples are labeled.

6 Future Work

Our framework lends itself to several natural extensions. For example, it can be straightforwardly extended to the structured prediction setting [21], in which both examples and labels have some internal structure, such as sequences or trees.
One can show that both steps of the GAME algorithm can be implemented efficiently even when the number of labels is combinatorial, provided that both the loss function and the label regularization function decompose appropriately over the structure. Another possibility is to interactively poll the labeler for label information, resulting in a sequence of successively more informative label regularization functions, with the aim of extracting the most useful label information from the labeler with a minimum of labeling effort. Also, it would be interesting to design Amazon Mechanical Turk experiments that test whether the "unhelpful" and "lazy" labeling styles described in Section 5 in fact occur in practice. Finally, of the three technical assumptions we introduced in Section 3 to aid our analysis, we only proved (in Theorem 3) that one of them is necessary. We would like to determine whether the other assumptions are necessary as well, or can be relaxed.

Acknowledgements

Umar Syed was partially supported by a DARPA CSSG 2009 Award. Ben Taskar was partially supported by a DARPA CSSG 2009 Award and an ONR 2010 Young Investigator Award.

References

[1] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.

[2] Thomas G. Dietterich, Richard H. Lathrop, and Tomás Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71, 1997.

[3] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.

[4] Gregory Druck, Gideon Mann, and Andrew McCallum. Learning from labeled features using generalized expectation criteria.
In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 595–602, 2008.

[5] Michael Kearns and Ming Li. Learning in the presence of malicious errors. In Proceedings of the 20th Annual ACM Symposium on Theory of Computing, pages 267–280, New York, NY, USA, 1988. ACM.

[6] Adam T. Kalai, Adam R. Klivans, Yishay Mansour, and Rocco A. Servedio. Agnostically learning halfspaces. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, pages 11–20, 2005.

[7] Adam R. Klivans, Philip M. Long, and Rocco A. Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10:2715–2740, 2009.

[8] Maria-Florina Balcan and Avrim Blum. A PAC-style model for learning from labeled and unlabeled data. In Proceedings of the 18th Annual Conference on Learning Theory, pages 111–126, 2005.

[9] Rong Jin and Zoubin Ghahramani. Learning with multiple labels. In Advances in Neural Information Processing Systems 16, 2003.

[10] Timothee Cour, Ben Sapp, Chris Jordan, and Ben Taskar. Learning from ambiguously labeled images. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.

[11] Kuzman Ganchev, João Graça, Jennifer Gillenwater, and Ben Taskar. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11:2001–2049, 2010.

[12] Percy Liang, Michael I. Jordan, and Dan Klein. Learning from measurements in exponential families. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 641–648, 2009.

[13] Rie Johnson and Tong Zhang. On the effectiveness of Laplacian normalization for graph semi-supervised learning. Journal of Machine Learning Research, 8:1489–1517, December 2007.

[14] Philippe Rigollet.
Generalization error bounds in semi-supervised classification under the cluster assumption. Journal of Machine Learning Research, 8:1369–1392, December 2007.

[15] Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter L. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research, 9:1775–1822, 2008.

[16] Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.

[17] Sameer A. Nene, Shree K. Nayar, and Hiroshi Murase. Columbia object image library (COIL-100). Technical Report CUCS-006-96, Columbia University, 1996.

[18] Thomas Navin Lal, Thilo Hinterberger, Guido Widman, Michael Schröder, N. Jeremy Hill, Wolfgang Rosenstiel, Christian Erich Elger, Bernhard Schölkopf, and Niels Birbaumer. Methods towards invasive human brain computer interfaces. In Advances in Neural Information Processing Systems 17, 2004.

[19] Thorsten Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning, pages 200–209, 1999.

[20] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.

[21] Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems 16, 2004.