{"title": "Data Programming: Creating Large Training Sets, Quickly", "book": "Advances in Neural Information Processing Systems", "page_first": 3567, "page_last": 3575, "abstract": "Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning. We therefore propose a paradigm for the programmatic creation of training sets called data programming in which users provide a set of labeling functions, which are programs that heuristically label subsets of the data, but that are noisy and may conflict. By viewing these labeling functions as implicitly describing a generative model for this noise, we show that we can recover the parameters of this model to \"denoise\" the generated training set, and establish theoretically that we can recover the parameters of these generative models in a handful of settings. We then show how to modify a discriminative loss function to make it noise-aware, and demonstrate our method over a range of discriminative models including logistic regression and LSTMs. Experimentally, on the 2014 TAC-KBP Slot Filling challenge, we show that data programming would have led to a new winning score, and also show that applying data programming to an LSTM model leads to a TAC-KBP score almost 6 F1 points over a state-of-the-art LSTM baseline (and into second place in the competition). 
Additionally, in initial user studies we observed that data programming may be an easier way for non-experts to create machine learning models when training data is limited or unavailable.", "full_text": "Data Programming:\n\nCreating Large Training Sets, Quickly\n\nAlexander Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, Christopher R\u00e9\n\nStanford University\n\n{ajratner,cdesa,senwu,dselsam,chrismre}@stanford.edu\n\nAbstract\n\nLarge labeled training sets are the critical building blocks of supervised learning\nmethods and are key enablers of deep learning techniques. For some applications,\ncreating labeled training sets is the most time-consuming and expensive part of\napplying machine learning. We therefore propose a paradigm for the programmatic\ncreation of training sets called data programming in which users express weak\nsupervision strategies or domain heuristics as labeling functions, which are pro-\ngrams that label subsets of the data, but that are noisy and may con\ufb02ict. We show\nthat by explicitly representing this training set labeling process as a generative\nmodel, we can \u201cdenoise\u201d the generated training set, and establish theoretically\nthat we can recover the parameters of these generative models in a handful of\nsettings. We then show how to modify a discriminative loss function to make it\nnoise-aware, and demonstrate our method over a range of discriminative models\nincluding logistic regression and LSTMs. Experimentally, on the 2014 TAC-KBP\nSlot Filling challenge, we show that data programming would have led to a new\nwinning score, and also show that applying data programming to an LSTM model\nleads to a TAC-KBP score almost 6 F1 points over a state-of-the-art LSTM baseline\n(and into second place in the competition). 
Additionally, in initial user studies we\nobserved that data programming may be an easier way for non-experts to create\nmachine learning models when training data is limited or unavailable.\n\n1\n\nIntroduction\n\nMany of the major machine learning breakthroughs of the last decade have been catalyzed by the\nrelease of a new labeled training dataset.1 Supervised learning approaches that use such datasets have\nincreasingly become key building blocks of applications throughout science and industry. This trend\nhas also been fueled by the recent empirical success of automated feature generation approaches,\nnotably deep learning methods such as long short term memory (LSTM) networks [14], which amelio-\nrate the burden of feature engineering given large enough labeled training sets. For many real-world\napplications, however, large hand-labeled training sets do not exist, and are prohibitively expen-\nsive to create due to requirements that labelers be experts in the application domain. Furthermore,\napplications\u2019 needs often change, necessitating new or modi\ufb01ed training sets.\nTo help reduce the cost of training set creation, we propose data programming, a paradigm for the\nprogrammatic creation and modeling of training datasets. Data programming provides a simple,\nunifying framework for weak supervision, in which training labels are noisy and may be from\nmultiple, potentially overlapping sources. In data programming, users encode this weak supervision\nin the form of labeling functions, which are user-de\ufb01ned programs that each provide a label for\nsome subset of the data, and collectively generate a large but potentially overlapping set of training\nlabels. 
Many different weak supervision approaches can be expressed as labeling functions, such as strategies which utilize existing knowledge bases (as in distant supervision [22]), model many individual annotators' labels (as in crowdsourcing), or leverage a combination of domain-specific patterns and dictionaries.\n\n1http://www.spacemachine.net/views/2016/3/datasets-over-algorithms\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nBecause of this, labeling functions may have widely varying error rates and may conflict on certain data points. To address this, we model the labeling functions as a generative process, which lets us automatically denoise the resulting training set by learning the accuracies of the labeling functions along with their correlation structure. In turn, we use this model of the training set to optimize a stochastic version of the loss function of the discriminative model that we desire to train. We show that, given certain conditions on the labeling functions, our method achieves the same asymptotic scaling as supervised learning methods, but that our scaling depends on the amount of unlabeled data, and uses only a fixed number of labeling functions.\nData programming is in part motivated by the challenges that users faced when applying prior programmatic supervision approaches, and is intended to be a new software engineering paradigm for the creation and management of training sets. For example, consider the scenario when two labeling functions of differing quality and scope overlap and possibly conflict on certain training examples; in prior approaches the user would have to decide which one to use, or how to somehow integrate the signal from both. In data programming, we accomplish this automatically by learning a model of the training set that includes both labeling functions. 
Additionally, users are often aware of, or able to\ninduce, dependencies between their labeling functions. In data programming, users can provide a\ndependency graph to indicate, for example, that two labeling functions are similar, or that one \u201c\ufb01xes\u201d\nor \u201creinforces\u201d another. We describe cases in which we can learn the strength of these dependencies,\nand for which our generalization is again asymptotically identical to the supervised case.\nOne further motivation for our method is driven by the observation that users often struggle with\nselecting features for their models, which is a traditional development bottleneck given \ufb01xed-size\ntraining sets. However, initial feedback from users suggests that writing labeling functions in the\nframework of data programming may be easier [12]. While the impact of a feature on end performance\nis dependent on the training set and on statistical characteristics of the model, a labeling function has\na simple and intuitive optimality criterion: that it labels data correctly. Motivated by this, we explore\nwhether we can \ufb02ip the traditional machine learning development process on its head, having users\ninstead focus on generating training sets large enough to support automatically-generated features.\n\nSummary of Contributions and Outline Our \ufb01rst contribution is the data programming frame-\nwork, in which users can implicitly describe a rich generative model for a training set in a more\n\ufb02exible and general way than in previous approaches. In Section 3, we \ufb01rst explore a simple model in\nwhich labeling functions are conditionally independent. We show here that under certain conditions,\nthe sample complexity is nearly the same as in the labeled case. In Section 4, we extend our results to\nmore sophisticated data programming models, generalizing related results in crowdsourcing [17]. 
In\nSection 5, we validate our approach experimentally on large real-world text relation extraction tasks\nin genomics, pharmacogenomics and news domains, where we show an average 2.34 point F1 score\nimprovement over a baseline distant supervision approach\u2014including what would have been a new\ncompetition-winning score for the 2014 TAC-KBP Slot Filling competition. Using LSTM-generated\nfeatures, we additionally would have placed second in this competition, achieving a 5.98 point F1\nscore gain over a state-of-the-art LSTM baseline [32]. Additionally, we describe promising feedback\nfrom a usability study with a group of bioinformatics users.\n\n2 Related Work\n\nOur work builds on many previous approaches in machine learning. Distant supervision is one\napproach for programmatically creating training sets. The canonical example is relation extraction\nfrom text, wherein a knowledge base of known relations is heuristically mapped to an input corpus [8,\n22]. Basic extensions group examples by surrounding textual patterns, and cast the problem as a\nmultiple instance learning one [15, 25]. Other extensions model the accuracy of these surrounding\ntextual patterns using a discriminative feature-based model [26], or generative models such as\nhierarchical topic models [1, 27, 31]. Like our approach, these latter methods model a generative\nprocess of training set creation, however in a proscribed way that is not based on user input as in\nour approach. There is also a wealth of examples where additional heuristic patterns used to label\ntraining data are collected from unlabeled data [7] or directly from users [21, 29], in a similar manner\nto our approach, but without any framework to deal with the fact that said labels are explicitly noisy.\n\n2\n\n\fCrowdsourcing is widely used for various machine learning tasks [13, 18]. 
Of particular relevance\nto our problem setting is the theoretical question of how to model the accuracy of various experts\nwithout ground truth available, classically raised in the context of crowdsourcing [10]. More recent\nresults provide formal guarantees even in the absence of labeled data using various approaches [4,\n9, 16, 17, 24, 33]. Our model can capture the basic model of the crowdsourcing setting, and can be\nconsidered equivalent in the independent case (Sec. 3). However, in addition to generalizing beyond\ngetting inputs solely from human annotators, we also model user-supplied dependencies between the\n\u201clabelers\u201d in our model, which is not natural within the context of crowdsourcing. Additionally, while\ncrowdsourcing results focus on the regime of a large number of labelers each labeling a small subset\nof the data, we consider a small set of labeling functions each labeling a large portion of the dataset.\nCo-training is a classic procedure for e\ufb00ectively utilizing both a small amount of labeled data and a\nlarge amount of unlabeled data by selecting two conditionally independent views of the data [5]. In\naddition to not needing a set of labeled data, and allowing for more than two views (labeling functions\nin our case), our approach allows explicit modeling of dependencies between views, for example\nallowing observed issues with dependencies between views to be explicitly modeled [19].\nBoosting is a well known procedure for combining the output of many \u201cweak\u201d classi\ufb01ers to create a\nstrong classi\ufb01er in a supervised setting [28]. Recently, boosting-like methods have been proposed\nwhich leverage unlabeled data in addition to labeled data, which is also used to set constraints on the\naccuracies of the individual classi\ufb01ers being ensembled [3]. 
This is similar in spirit to our approach, except that labeled data is not explicitly necessary in ours, and richer dependency structures between our "heuristic" classifiers (labeling functions) are supported.\nThe general case of learning with noisy labels is treated both in classical [20] and more recent contexts [23]. It has also been studied specifically in the context of label-noise robust logistic regression [6]. We consider the more general scenario where multiple noisy labeling functions can conflict and have dependencies.\n\n3 The Data Programming Paradigm\n\nIn many applications, we would like to use machine learning, but we face the following challenges: (i) hand-labeled training data is not available, and is prohibitively expensive to obtain in sufficient quantities as it requires expensive domain expert labelers; (ii) related external knowledge bases are either unavailable or insufficiently specific, precluding a traditional distant supervision or co-training approach; (iii) application specifications are in flux, changing the model we ultimately wish to learn. In such a setting, we would like a simple, scalable and adaptable approach for supervising a model applicable to our problem. More specifically, we would ideally like our approach to achieve $\varepsilon$ expected loss with high probability, given $O(1)$ inputs of some sort from a domain-expert user, rather than the traditional $\tilde{O}(\varepsilon^{-2})$ hand-labeled training examples required by most supervised methods (where $\tilde{O}$ notation hides logarithmic factors). To this end, we propose data programming, a paradigm for the programmatic creation of training sets, which enables domain experts to more rapidly train machine learning systems and has the potential for this type of scaling of expected loss. 
In data programming, rather than manually labeling each example, users instead describe the processes by which these points could be labeled by providing a set of heuristic rules called labeling functions.\nIn the remainder of this paper, we focus on a binary classification task in which we have a distribution $\pi$ over object and class pairs $(x, y) \in X \times \{-1, 1\}$, and we are concerned with minimizing the logistic loss under a linear model given some features,\n\n\[ l(w) = \mathbb{E}_{(x,y) \sim \pi}\left[ \log(1 + \exp(-w^T f(x) y)) \right], \]\n\nwhere without loss of generality, we assume that $\|f(x)\| \le 1$. Then, a labeling function $\lambda_i : X \mapsto \{-1, 0, 1\}$ is a user-defined function that encodes some domain heuristic, which provides a (non-zero) label for some subset of the objects. As part of a data programming specification, a user provides some $m$ labeling functions, which we denote in vectorized form as $\lambda : X \mapsto \{-1, 0, 1\}^m$.\nExample 3.1. To gain intuition about labeling functions, we describe a simple text relation extraction example. In Figure 1, we consider the task of classifying co-occurring gene and disease mentions as either expressing a causal relation or not. For example, given the sentence "Gene A causes disease B", the object $x = (A, B)$ has true class $y = 1$. To construct a training set, the user writes three labeling functions (Figure 1a):\n\ndef lambda_1(x):\n    return 1 if (x.gene, x.pheno) in KNOWN_RELATIONS_1 else 0\n\ndef lambda_2(x):\n    return -1 if re.match(r'.*not cause.*', x.text_between) else 0\n\ndef lambda_3(x):\n    return 1 if re.match(r'.*associated.*', x.text_between) \\\n        and (x.gene, x.pheno) in KNOWN_RELATIONS_2 else 0\n\n(a) An example set of three labeling functions written by a user. (b) The generative model of a training set defined by the user input (unary factors omitted): the class $Y$ connected to $\lambda_1$, $\lambda_2$, and $\lambda_3$.\n\nFigure 1: An example of extracting mentions of gene-disease relations from the scientific literature.\n\nIn $\lambda_1$, an external structured knowledge base is used to label a few objects with relatively high accuracy, and is equivalent to a traditional distant supervision rule (see Sec. 2). $\lambda_2$ uses a purely heuristic approach to label a much larger number of examples with lower accuracy. Finally, $\lambda_3$ is a "hybrid" labeling function, which leverages a knowledge base and a heuristic.\n\nA labeling function need not have perfect accuracy or recall; rather, it represents a pattern that the user wishes to impart to their model and that is easier to encode as a labeling function than as a set of hand-labeled examples. As illustrated in Ex. 3.1, labeling functions can be based on external knowledge bases, libraries or ontologies, can express heuristic patterns, or some hybrid of these types; we see evidence for the existence of such diversity in our experiments (Section 5). The use of labeling functions is also strictly more general than manual annotations, as a manual annotation can always be directly encoded by a labeling function. Importantly, labeling functions can overlap, conflict, and even have dependencies which users can provide as part of the data programming specification (see Section 4); our approach provides a simple framework for these inputs.\n\nIndependent Labeling Functions We first describe a model in which the labeling functions label independently, given the true label class. Under this model, each labeling function $\lambda_i$ has some probability $\beta_i$ of labeling an object and then some probability $\alpha_i$ of labeling the object correctly; for simplicity we also assume here that each class has probability 0.5. This model has distribution\n\n\[ \mu_{\alpha,\beta}(\Lambda, Y) = \frac{1}{2} \prod_{i=1}^{m} \left( \beta_i \alpha_i \mathbf{1}\{\Lambda_i = Y\} + \beta_i (1 - \alpha_i) \mathbf{1}\{\Lambda_i = -Y\} + (1 - \beta_i) \mathbf{1}\{\Lambda_i = 0\} \right), \quad (1) \]\n\nwhere $\Lambda \in \{-1, 0, 1\}^m$ contains the labels output by the labeling functions, and $Y \in \{-1, 1\}$ is the predicted class. If we allow the parameters $\alpha \in \mathbb{R}^m$ and $\beta \in \mathbb{R}^m$ to vary, (1) specifies a family of generative models. In order to expose the scaling of the expected loss as the size of the unlabeled dataset changes, we will assume here that $0.3 \le \beta_i \le 0.5$ and $0.8 \le \alpha_i \le 0.9$. We note that while these arbitrary constraints can be changed, they are roughly consistent with our applied experience, where users tend to write high-accuracy and high-coverage labeling functions.\nOur first goal will be to learn which parameters $(\alpha, \beta)$ are most consistent with our observations (our unlabeled training set) using maximum likelihood estimation. To do this for a particular training set $S \subset X$, we will solve the problem\n\n\[ (\hat{\alpha}, \hat{\beta}) = \arg\max_{\alpha,\beta} \sum_{x \in S} \log P_{(\Lambda,Y) \sim \mu_{\alpha,\beta}}(\Lambda = \lambda(x)) = \arg\max_{\alpha,\beta} \sum_{x \in S} \log \left( \sum_{y' \in \{-1,1\}} \mu_{\alpha,\beta}(\lambda(x), y') \right). \quad (2) \]\n\nIn other words, we are maximizing the probability that the observed labels produced on our training examples occur under the generative model in (1). 
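To make the independent model concrete, the density in (1) and the marginalized objective in (2) can be written out directly. This is a minimal illustrative sketch, not the paper's implementation (which fits (2) with stochastic gradient descent); the toy label matrix and all function names are assumptions of this example:

```python
import math

def mu(alpha, beta, Lam, Y):
    """Density of the independent model in Eq. (1): labeling function i
    labels with probability beta_i and is correct with probability
    alpha_i; the 0.5 factor encodes balanced classes."""
    p = 0.5
    for a, b, l in zip(alpha, beta, Lam):
        if l == Y:
            p *= b * a          # labeled, and agrees with Y
        elif l == -Y:
            p *= b * (1 - a)    # labeled, but disagrees with Y
        else:
            p *= 1 - b          # abstained (l == 0)
    return p

def marginal_log_likelihood(alpha, beta, label_matrix):
    """Objective of Eq. (2): log-probability of the observed label
    vectors, with the unobserved class Y marginalized out."""
    return sum(
        math.log(mu(alpha, beta, Lam, 1) + mu(alpha, beta, Lam, -1))
        for Lam in label_matrix
    )

# Toy label matrix from m = 2 labeling functions on 4 objects.
L = [(1, 1), (1, 0), (-1, -1), (1, -1)]
ll = marginal_log_likelihood([0.85, 0.8], [0.5, 0.4], L)
```

Each row of the label matrix is the vector λ(x) for one unlabeled example; maximizing `marginal_log_likelihood` over (α, β), e.g. by SGD as in the paper, yields the accuracy and coverage estimates.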
In our experiments, we use stochastic gradient descent to solve this problem; since this is a standard technique, we defer its analysis to the appendix.\n\nNoise-Aware Empirical Loss Given that our parameter learning phase has successfully found some $\hat{\alpha}$ and $\hat{\beta}$ that accurately describe the training set, we can now proceed to estimate the parameter $w$ which minimizes the expected risk of a linear model over our feature mapping $f$, given $\hat{\alpha}, \hat{\beta}$. To do so, we define the noise-aware empirical risk $L_{\hat{\alpha},\hat{\beta}}$ with regularization parameter $\rho$, and compute the noise-aware empirical risk minimizer\n\n\[ \hat{w} = \arg\min_{w} L_{\hat{\alpha},\hat{\beta}}(w; S) = \arg\min_{w} \frac{1}{|S|} \sum_{x \in S} \mathbb{E}_{(\Lambda,Y) \sim \mu_{\hat{\alpha},\hat{\beta}}}\left[ \log\left(1 + e^{-w^T f(x) Y}\right) \,\middle|\, \Lambda = \lambda(x) \right] + \rho \|w\|^2 \quad (3) \]\n\nThis is a logistic regression problem, so it can be solved using stochastic gradient descent as well.\nWe can in fact prove that stochastic gradient descent running on (2) and (3) is guaranteed to produce accurate estimates, under conditions which we describe now. First, the problem distribution $\pi$ needs to be accurately modeled by some distribution $\mu$ in the family that we are trying to learn. That is, for some $\alpha^*$ and $\beta^*$,\n\n\[ \forall \Lambda \in \{-1, 0, 1\}^m,\ Y \in \{-1, 1\}, \quad P_{(x,y) \sim \pi^*}(\lambda(x) = \Lambda, y = Y) = \mu_{\alpha^*,\beta^*}(\Lambda, Y). \quad (4) \]\n\nSecond, given an example $(x, y) \sim \pi^*$, the class label $y$ must be independent of the features $f(x)$ given the labels $\lambda(x)$. That is,\n\n\[ (x, y) \sim \pi^* \Rightarrow y \perp f(x) \mid \lambda(x). \quad (5) \]\n\nThis assumption encodes the idea that the labeling functions, while they may be arbitrarily dependent on the features, provide sufficient information to accurately identify the class. Third, we assume that the algorithm used to solve (3) has bounded generalization risk such that for some parameter $\chi$,\n\n\[ \mathbb{E}_{S, \hat{w}}\left[ L_{\hat{\alpha},\hat{\beta}}(\hat{w}; S) - \min_{w} L_{\hat{\alpha},\hat{\beta}}(w; S) \right] \le \chi. \quad (6) \]\n\nUnder these conditions, we make the following statement about the accuracy of our estimates, which is a simplified version of a theorem that is detailed in the appendix.\nTheorem 1. Suppose that we run data programming, solving the problems in (2) and (3) using stochastic gradient descent to produce $(\hat{\alpha}, \hat{\beta})$ and $\hat{w}$. Suppose further that our setup satisfies the conditions (4), (5), and (6), and suppose that $m \ge 2000$. Then for any $\varepsilon > 0$, if the number of labeling functions $m$ and the size of the input dataset $S$ are large enough that\n\n\[ |S| \ge \frac{356}{\varepsilon^2} \log\left( \frac{m}{3\varepsilon} \right), \]\n\nthen our expected parameter error and generalization risk can be bounded by\n\n\[ \mathbb{E}\left[ \|\hat{\alpha} - \alpha^*\|^2 \right] \le m\varepsilon^2, \quad \mathbb{E}\left[ \|\hat{\beta} - \beta^*\|^2 \right] \le m\varepsilon^2, \quad \mathbb{E}\left[ l(\hat{w}) - \min_{w} l(w) \right] \le \chi + \frac{\varepsilon}{27\rho}. \]\n\nWe select $m \ge 2000$ to simplify the statement of the theorem and give the reader a feel for how $\varepsilon$ scales with respect to $|S|$. The full theorem with scaling in each parameter (and for arbitrary $m$) is presented in the appendix. 
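Under the independent model, the noise-aware risk in (3) reduces to a posterior-weighted logistic loss: compute P(Y | Λ = λ(x)) from the learned parameters, then weight the loss for each possible label by that posterior. A hedged sketch (helper names are mine; the paper minimizes this with SGD):

```python
import math

def posterior(alpha, beta, Lam):
    """P(Y = 1 | Lambda) under the independent model of Eq. (1),
    assuming balanced classes."""
    def joint(Y):
        p = 0.5
        for a, b, l in zip(alpha, beta, Lam):
            if l == Y:
                p *= b * a
            elif l == -Y:
                p *= b * (1 - a)
            else:
                p *= 1 - b
        return p
    p1, pm1 = joint(1), joint(-1)
    return p1 / (p1 + pm1)

def noise_aware_loss(w, feats, label_matrix, alpha, beta, rho=0.01):
    """Eq. (3): expected logistic loss under the label posterior,
    plus an l2 penalty with regularization parameter rho."""
    def logistic(z):  # log(1 + e^{-z}), the loss for label +1 at margin z
        return math.log(1 + math.exp(-z))
    total = 0.0
    for f, Lam in zip(feats, label_matrix):
        z = sum(wi * fi for wi, fi in zip(w, f))
        p = posterior(alpha, beta, Lam)
        total += p * logistic(z) + (1 - p) * logistic(-z)
    return total / len(feats) + rho * sum(wi * wi for wi in w)
```

With all labeling functions abstaining the posterior is 0.5 and the example contributes a symmetric loss, so unlabeled points neither push nor pull the classifier.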
This result establishes that to achieve both expected loss and parameter estimate error $\varepsilon$, it suffices to have only $m = O(1)$ labeling functions and $|S| = \tilde{O}(\varepsilon^{-2})$ training examples, which is the same asymptotic scaling exhibited by methods that use labeled data. This means that data programming achieves the same learning rate as methods that use labeled data, while requiring asymptotically less work from its users, who need to specify $O(1)$ labeling functions rather than manually label $\tilde{O}(\varepsilon^{-2})$ examples. In contrast, in the crowdsourcing setting [17], the number of workers $m$ tends to infinity, while here it remains constant as the dataset grows. These results provide some explanation of why our experimental results suggest that a small number of rules with a large unlabeled training set can be effective at even complex natural language processing tasks.\n\n4 Handling Dependencies\n\nIn our experience with data programming, we have found that users often write labeling functions that have clear dependencies among them. As more labeling functions are added as the system is developed, an implicit dependency structure arises naturally amongst the labeling functions: modeling these dependencies can in some cases improve accuracy. We describe a method by which the user can specify this dependency knowledge as a dependency graph, and show how the system can use it to produce better parameter estimates.\n\nLabel Function Dependency Graph To support the injection of dependency information into the model, we augment the data programming specification with a label function dependency graph, $G \subset D \times \{1, \ldots, m\} \times \{1, \ldots, m\}$, which is a directed graph over the labeling functions, each of whose edges is associated with a dependency type from a class of dependencies $D$ appropriate to the domain. 
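Concretely, such a dependency graph is just a set of typed edges over labeling-function indices. One plausible encoding, with predicate names following Figure 2 (the data structure itself is an assumption of this sketch, not the paper's API):

```python
# A dependency graph G in D x {1..m} x {1..m} kept as a set of typed,
# directed edges over labeling-function indices. Purely illustrative.
SIMILAR, FIXES, REINFORCES, EXCLUDES = "similar", "fixes", "reinforces", "excludes"

dependency_graph = {
    (SIMILAR, 0, 1),   # lambda_0 and lambda_1 label near-identically
    (FIXES, 0, 2),     # lambda_2 corrects mistakes made by lambda_0
    (EXCLUDES, 1, 2),  # lambda_1 and lambda_2 should not both label
}

def edges_of_type(G, dep_type):
    """All (i, j) index pairs carrying a given dependency type."""
    return sorted((i, j) for d, i, j in G if d == dep_type)
```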
From our experience with practitioners, we identified four commonly-occurring types of dependencies as illustrative examples: similar, fixing, reinforcing, and exclusive (see Figure 2). For example, suppose that we have two functions $\lambda_1$ and $\lambda_2$, and $\lambda_2$ typically labels only when (i) $\lambda_1$ also labels, (ii) $\lambda_1$ and $\lambda_2$ disagree in their labeling, and (iii) $\lambda_2$ is actually correct. We call this a fixing dependency, since $\lambda_2$ fixes mistakes made by $\lambda_1$. If $\lambda_1$ and $\lambda_2$ were to typically agree rather than disagree, this would be a reinforcing dependency, since $\lambda_2$ reinforces a subset of the labels of $\lambda_1$.\n\nlambda_1(x) = f(x.word)\nlambda_2(x) = f(x.lemma)\nSimilar(lambda_1, lambda_2)\n\nlambda_1(x) = f('.*cause.*')\nlambda_2(x) = f('.*not cause.*')\nlambda_3(x) = f('.*cause.*')\nFixes(lambda_1, lambda_2)\nReinforces(lambda_1, lambda_3)\n\nlambda_1(x) = x in DISEASES_A\nlambda_2(x) = x in DISEASES_B\nExcludes(lambda_1, lambda_2)\n\nFigure 2: Examples of labeling function dependency predicates (each panel also depicts the corresponding factor graph over $Y$ and the labeling functions).\n\nModeling Dependencies The presence of dependency information means that we can no longer model our labels using the simple Bayesian network in (1). Instead, we model our distribution as a factor graph. 
This standard technique lets us describe the family of generative distributions in terms of a known factor function $h : \{-1, 0, 1\}^m \times \{-1, 1\} \mapsto \{-1, 0, 1\}^M$ (in which each entry $h_i$ represents a factor), and an unknown parameter $\theta \in \mathbb{R}^M$ as\n\n\[ \mu_\theta(\Lambda, Y) = Z_\theta^{-1} \exp(\theta^T h(\Lambda, Y)), \]\n\nwhere $Z_\theta$ is the partition function which ensures that $\mu$ is a distribution. Next, we will describe how we define $h$ using information from the dependency graph.\nTo construct $h$, we will start with some base factors, which we inherit from (1), and then augment them with additional factors representing dependencies. For all $i \in \{1, \ldots, m\}$, we let\n\n\[ h_0(\Lambda, Y) = Y, \quad h_i(\Lambda, Y) = \Lambda_i Y, \quad h_{m+i}(\Lambda, Y) = \Lambda_i, \quad h_{2m+i}(\Lambda, Y) = \Lambda_i^2 Y, \quad h_{3m+i}(\Lambda, Y) = \Lambda_i^2. \]\n\nThese factors alone are sufficient to describe any distribution for which the labels are mutually independent, given the class: this includes the independent family in (1).\nWe now proceed by adding additional factors to $h$, which model the dependencies encoded in $G$. For each dependency edge $(d, i, j)$, we add one or more factors to $h$ as follows. For a near-duplicate dependency on $(i, j)$, we add a single factor $h_\iota(\Lambda, Y) = \mathbf{1}\{\Lambda_i = \Lambda_j\}$, which increases our prior probability that the labels will agree. For a fixing dependency, we add two factors, $h_\iota(\Lambda, Y) = -\mathbf{1}\{\Lambda_i = 0 \wedge \Lambda_j \neq 0\}$ and $h_{\iota+1}(\Lambda, Y) = \mathbf{1}\{\Lambda_i = -Y \wedge \Lambda_j = Y\}$, which encode the idea that $\lambda_j$ labels only when $\lambda_i$ does, and that $\lambda_j$ fixes errors made by $\lambda_i$. The factors for a reinforcing dependency are the same, except that $h_{\iota+1}(\Lambda, Y) = \mathbf{1}\{\Lambda_i = Y \wedge \Lambda_j = Y\}$. Finally, for an exclusive dependency, we have a single factor $h_\iota(\Lambda, Y) = -\mathbf{1}\{\Lambda_i \neq 0 \wedge \Lambda_j \neq 0\}$.\n\nLearning with Dependencies We can again solve a maximum likelihood problem like (2) to learn the parameter $\hat{\theta}$. Using the results, we can continue on to find the noise-aware empirical loss minimizer by solving the problem in (3). In order to solve these problems in the dependent case, we typically invoke stochastic gradient descent, using Gibbs sampling to sample from the distributions used in the gradient update. Under conditions similar to those in Section 3, we can again provide a bound on the accuracy of these results. We define these conditions now. First, there must be some set $\Theta \subset \mathbb{R}^M$ that we know our parameter lies in. This is analogous to the assumptions on $\alpha_i$ and $\beta_i$ we made in Section 3, and we can state the following analogue of (4):\n\n\[ \exists \theta^* \in \Theta \text{ s.t. } \forall (\Lambda, Y) \in \{-1, 0, 1\}^m \times \{-1, 1\}, \quad P_{(x,y) \sim \pi^*}(\lambda(x) = \Lambda, y = Y) = \mu_{\theta^*}(\Lambda, Y). \quad (7) \]\n\nSecond, for any $\theta \in \Theta$, it must be possible to accurately learn $\theta$ from full (i.e. labeled) samples of $\mu_\theta$. More specifically, there exists an unbiased estimator $\hat{\theta}(T)$ that is a function of some dataset $T$ of independent samples from $\mu_\theta$ such that, for some $c > 0$ and for all $\theta \in \Theta$,\n\n\[ \mathrm{Cov}\left( \hat{\theta}(T) \right) \preceq (2c|T|)^{-1} I. \quad (8) \]\n\nThird, for any two feasible models $\theta_1$ and $\theta_2 \in \Theta$,\n\n\[ \mathbb{E}_{(\Lambda_1,Y_1) \sim \mu_{\theta_1}}\left[ \mathrm{Var}_{(\Lambda_2,Y_2) \sim \mu_{\theta_2}}(Y_2 \mid \Lambda_2 = \Lambda_1) \right] \le c M^{-1}. \quad (9) \]\n\nThat is, we'll usually be reasonably sure in our guess for the value of $Y$, even if we guess using distribution $\mu_{\theta_2}$ while the labeling functions were actually sampled from (the possibly totally different) $\mu_{\theta_1}$. 
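The factor construction above can be sketched directly: build the 4m + 1 base factors, append one or two factors per dependency edge, and normalize. This illustrative version computes the partition function by brute-force enumeration, which is only feasible for tiny m; as the text notes, the actual system uses Gibbs sampling inside SGD instead:

```python
import itertools
import math

def h(Lam, Y, deps):
    """Factor vector h(Lambda, Y): the base factors from Section 4 plus
    one or two factors per dependency edge (d, i, j)."""
    fac = [Y]                                  # h_0
    fac += [l * Y for l in Lam]                # h_i      = Lambda_i * Y
    fac += [l for l in Lam]                    # h_{m+i}  = Lambda_i
    fac += [l * l * Y for l in Lam]            # h_{2m+i} = Lambda_i^2 * Y
    fac += [l * l for l in Lam]                # h_{3m+i} = Lambda_i^2
    for d, i, j in deps:
        if d == "similar":
            fac.append(1 if Lam[i] == Lam[j] else 0)
        elif d in ("fixes", "reinforces"):
            fac.append(-1 if Lam[i] == 0 and Lam[j] != 0 else 0)
            target = -Y if d == "fixes" else Y  # fixing flips, reinforcing agrees
            fac.append(1 if Lam[i] == target and Lam[j] == Y else 0)
        elif d == "excludes":
            fac.append(-1 if Lam[i] != 0 and Lam[j] != 0 else 0)
    return fac

def mu_theta(theta, Lam, Y, m, deps):
    """mu_theta(Lambda, Y) = Z^{-1} exp(theta^T h(Lambda, Y)), with the
    partition function Z computed by exhaustive enumeration."""
    def score(L2, Y2):
        return math.exp(sum(t * f for t, f in zip(theta, h(L2, Y2, deps))))
    Z = sum(score(L2, Y2)
            for L2 in itertools.product((-1, 0, 1), repeat=m)
            for Y2 in (-1, 1))
    return score(Lam, Y) / Z
```

With θ = 0 every configuration scores equally, so the distribution is uniform over the 2·3^m states, which gives a quick sanity check on the normalization.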
We can now prove the following result about the accuracy of our estimates.\n\nFeatures     Method   KBP (News) P / R / F1     Genomics P / R / F1      Pharmacogenomics P / R / F1\nHand-tuned   ITR      51.15 / 26.72 / 35.10     83.76 / 41.67 / 55.65    68.16 / 49.32 / 57.23\nHand-tuned   DP       50.52 / 29.21 / 37.02     83.90 / 43.43 / 57.24    68.36 / 54.80 / 60.83\nLSTM         ITR      37.68 / 28.81 / 32.66     69.07 / 50.76 / 58.52    32.35 / 43.84 / 37.23\nLSTM         DP       47.47 / 27.88 / 35.78     75.48 / 48.48 / 58.99    37.63 / 47.95 / 42.17\n\nTable 1: Precision/Recall/F1 scores using data programming (DP), as compared to the distant supervision ITR approach, with both hand-tuned and LSTM-generated features.\n\nTheorem 2. Suppose that we run stochastic gradient descent to produce $\hat{\theta}$ and $\hat{w}$, and that our setup satisfies the conditions (5)-(9). Then for any $\varepsilon > 0$, if the input dataset $S$ is large enough that\n\n\[ |S| \ge \frac{2}{c^2 \varepsilon^2} \log\left( \frac{2\|\theta_0 - \theta^*\|^2}{\varepsilon} \right), \]\n\nthen our expected parameter error and generalization risk can be bounded by\n\n\[ \mathbb{E}\left[ \|\hat{\theta} - \theta^*\|^2 \right] \le M\varepsilon^2, \quad \mathbb{E}\left[ l(\hat{w}) - \min_{w} l(w) \right] \le \chi + \frac{c\varepsilon}{2\rho}. \]\n\nAs in the independent case, this shows that we need only $|S| = \tilde{O}(\varepsilon^{-2})$ unlabeled training examples to achieve error $O(\varepsilon)$, which is the same asymptotic scaling as supervised learning methods. This suggests that while we pay a computational penalty for richer dependency structures, we are no less statistically efficient. In the appendix, we provide more details, including an explicit description of the algorithm and the step size used to achieve this result.\n\n5 Experiments\n\nWe seek to experimentally validate three claims about our approach. 
Our \ufb01rst claim is that data\nprogramming can be an e\ufb00ective paradigm for building high quality machine learning systems,\nwhich we test across three real-world relation extraction applications. Our second claim is that data\nprogramming can be used successfully in conjunction with automatic feature generation methods,\nsuch as LSTM models. Finally, our third claim is that data programming is an intuitive and productive\nframework for domain-expert users, and we report on our initial user studies.\n\nRelation Mention Extraction Tasks\nIn the relation mention extraction task, our objects are rela-\ntion mention candidates x = (e1, e2), which are pairs of entity mentions e1, e2 in unstructured text,\nand our goal is to learn a model that classi\ufb01es each candidate as either a true textual assertion of the\nrelation R(e1, e2) or not. We examine a news application from the 2014 TAC-KBP Slot Filling chal-\nlenge2, where we extract relations between real-world entities from articles [2]; a clinical genomics\napplication, where we extract causal relations between genetic mutations and phenotypes from the\nscienti\ufb01c literature3; and a pharmacogenomics application where we extract interactions between\ngenes, also from the scienti\ufb01c literature [21]; further details are included in the Appendix.\nFor each application, we or our collaborators originally built a system where a training set was\nprogrammatically generated by ordering the labeling functions as a sequence of if-then-return\nstatements, and for each candidate, taking the \ufb01rst label emitted by this script as the training label.\nWe refer to this as the if-then-return (ITR) approach, and note that it often required signi\ufb01cant domain\nexpert development time to tune (weeks or more). For this set of experiments, we then used the same\nlabeling function sets within the framework of data programming. For all experiments, we evaluated\non a blind hand-labeled evaluation set. 
In Table 1, we see that we achieve consistent improvements: on average by 2.34 points in F1 score, including what would have been a winning score on the 2014 TAC-KBP challenge [30].

We observed these performance gains across applications with very different labeling function sets. We describe the labeling function summary statistics (coverage is the percentage of objects that had at least one label, overlap is the percentage of objects with more than one label, and conflict is the percentage of objects with conflicting labels) and see in Table 2 that even in scenarios where m is small and conflict and overlap are relatively less common, we still realize performance gains. Additionally, on a disease mention extraction task (see Usability Study), which was written from scratch within the data programming paradigm, allowing developers to supply dependencies of the basic types outlined in Sec. 4 led to a 2.3 point F1 score boost.

2 http://www.nist.gov/tac/2014/KBP/
3 https://github.com/HazyResearch/dd-genomics

Application        # of LFs   Coverage (%)   |S_{λ≠0}|   Overlap (%)   Conflict (%)   F1 Improv. (HT)   F1 Improv. (LSTM)
KBP (News)         40         29.39          2.03M       1.38          0.15           1.92              3.12
Genomics           146        53.61          256K        26.71         2.05           1.59              0.47
Pharmacogenomics   7          7.70           129K        0.35          0.32           3.60              4.94
Diseases           12         53.32          418K        31.81         0.98           N/A               N/A

Table 2: Labeling function (LF) summary statistics, sizes of the generated training sets S_{λ≠0} (counting only non-zero labels), and relative F1 score improvement over the baseline ITR method for hand-tuned (HT) and LSTM-generated (LSTM) feature sets.

Automatically-generated Features   We additionally compare both hand-tuned and automatically-generated features, where the latter are learned via an LSTM recurrent neural network (RNN) [14]. Conventional wisdom states that deep learning methods such as RNNs are prone to overfitting to the biases of the imperfect rules used for programmatic supervision. In our experiments, however, we find that using data programming to denoise the labels can mitigate this issue, and we report a 9.79 point boost to precision and a 3.12 point F1 score improvement on the benchmark 2014 TAC-KBP (News) task, over the baseline if-then-return approach. Additionally, for comparison, our approach is a 5.98 point F1 score improvement over a state-of-the-art LSTM approach [32].

Usability Study   One of our hopes is that a user without expertise in ML will be more productive iterating on labeling functions than on features. To test this, we arranged a hackathon involving a handful of bioinformatics researchers, using our open-source information extraction framework Snorkel4 (formerly DDLite). Their goal was to build a disease tagging system, which is a common and important challenge in the bioinformatics domain [11]. The hackathon participants did not have access to a labeled training set, nor did they perform any feature engineering.
The entire effort was restricted to iterative labeling function development and the setup of candidates to be classified. In under eight hours, they had created a training set that led to a model which scored within 10 F1 points of the supervised baseline; the gap was mainly due to a recall issue in the candidate extraction phase. This suggests data programming may be a promising way to build high-quality extractors, quickly.

6 Conclusion and Future Work

We introduced data programming, a new approach to generating large labeled training sets. We demonstrated that our approach can be used with automatic feature generation techniques to achieve high-quality results. We also provided anecdotal evidence that our methods may be easier for domain experts to use. We hope to explore the limits of our approach on other machine learning tasks that have been held back by the lack of high-quality supervised datasets, including those in other domains such as imaging and structured prediction.

Acknowledgements   Thanks to Theodoros Rekatsinas, Manas Joglekar, Henry Ehrenberg, Jason Fries, Percy Liang, the DeepDive and DDLite users and many others for their helpful conversations. The authors acknowledge the support of: DARPA FA8750-12-2-0335; NSF IIS-1247701; NSF CCF-1111943; DOE 108845; NSF CCF-1337375; DARPA FA8750-13-2-0039; NSF IIS-1353606; ONR N000141210041 and N000141310129; NIH U54EB020405; DARPA's SIMPLEX program; Oracle; NVIDIA; Huawei; SAP Labs; Sloan Research Fellowship; Moore Foundation; American Family Insurance; Google; and Toshiba. The views and conclusions expressed in this material are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, AFRL, NSF, ONR, NIH, or the U.S. Government.

4 snorkel.stanford.edu

References

[1] E. Alfonseca, K. Filippova, J.-Y. Delort, and G. Garrido.
Pattern learning for relation extraction with a hierarchical topic model. In Proceedings of the ACL.

[2] G. Angeli, S. Gupta, M. Jose, C. D. Manning, C. Ré, J. Tibshirani, J. Y. Wu, S. Wu, and C. Zhang. Stanford's 2014 slot filling systems. TAC KBP, 695, 2014.

[3] A. Balsubramani and Y. Freund. Scalable semi-supervised aggregation of classifiers. In Advances in Neural Information Processing Systems, pages 1351–1359, 2015.

[4] D. Berend and A. Kontorovich. Consistency of weighted majority votes. In NIPS, 2014.

[5] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM, 1998.

[6] J. Bootkrajang and A. Kabán. Label-noise robust logistic regression and its applications. In Machine Learning and Knowledge Discovery in Databases, pages 143–158. Springer, 2012.

[7] R. Bunescu and R. Mooney. Learning to extract relations from the web using minimal supervision. In Annual Meeting-Association for Computational Linguistics, volume 45, page 576, 2007.

[8] M. Craven, J. Kumlien, et al. Constructing biological knowledge bases by extracting information from text sources. In ISMB, volume 1999, pages 77–86, 1999.

[9] N. Dalvi, A. Dasgupta, R. Kumar, and V. Rastogi. Aggregating crowdsourced binary ratings. In Proceedings of the 22nd International Conference on World Wide Web, WWW '13, pages 285–294, 2013.

[10] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20–28, 1979.

[11] R. I. Doğan and Z. Lu. An improved corpus of disease mentions in PubMed citations. In Proceedings of the 2012 Workshop on Biomedical Natural Language Processing.

[12] H. R. Ehrenberg, J. Shin, A. J. Ratner, J. A. Fries, and C. Ré.
Data programming with ddlite: putting\n\nhumans in a di\ufb00erent part of the loop. In HILDA@ SIGMOD, page 13, 2016.\n\n[13] H. Gao, G. Barbier, R. Goolsby, and D. Zeng. Harnessing the crowdsourcing power of social media for\n\ndisaster relief. Technical report, DTIC Document, 2011.\n\n[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735\u20131780, 1997.\n[15] R. Ho\ufb00mann, C. Zhang, X. Ling, L. Zettlemoyer, and D. S. Weld. Knowledge-based weak supervision for\n\ninformation extraction of overlapping relations. In Proceedings of the ACL.\n\n[16] M. Joglekar, H. Garcia-Molina, and A. Parameswaran. Comprehensive and reliable crowd assessment\n\nalgorithms. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on.\n\n[17] D. R. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. In Advances in\n\nneural information processing systems, pages 1953\u20131961, 2011.\n\n[18] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma,\net al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv\npreprint arXiv:1602.07332, 2016.\n\n[19] M.-A. Krogel and T. Sche\ufb00er. Multi-relational learning, text mining, and semi-supervised learning for\n\nfunctional genomics. Machine Learning, 57(1-2):61\u201381, 2004.\n\n[20] G. Lugosi. Learning with an unreliable teacher. Pattern Recognition, 25(1):79 \u2013 87, 1992.\n[21] E. K. Mallory, C. Zhang, C. R\u00e9, and R. B. Altman. Large-scale extraction of gene interactions from\n\nfull-text literature using deepdive. Bioinformatics, 2015.\n\n[22] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled\n\ndata. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL, 2009.\n\n[23] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari. Learning with noisy labels. 
In Advances in\n\nNeural Information Processing Systems 26.\n\n[24] F. Parisi, F. Strino, B. Nadler, and Y. Kluger. Ranking and combining multiple predictors without labeled\n\ndata. Proceedings of the National Academy of Sciences, 111(4):1253\u20131258, 2014.\n\n[25] S. Riedel, L. Yao, and A. McCallum. Modeling relations and their mentions without labeled text. In\n\nMachine Learning and Knowledge Discovery in Databases, pages 148\u2013163. Springer, 2010.\n\n[26] B. Roth and D. Klakow. Feature-based models for improving the quality of noisy training data for relation\n\nextraction. In Proceedings of the 22nd ACM Conference on Knowledge management.\n\n[27] B. Roth and D. Klakow. Combining generative and discriminative model scores for distant supervision. In\n\nEMNLP, pages 24\u201329, 2013.\n\n[28] R. E. Schapire and Y. Freund. Boosting: Foundations and algorithms. MIT press, 2012.\n[29] J. Shin, S. Wu, F. Wang, C. De Sa, C. Zhang, and C. R\u00e9. Incremental knowledge base construction using\n\ndeepdive. Proceedings of the VLDB Endowment, 8(11):1310\u20131321, 2015.\n\n[30] M. Surdeanu and H. Ji. Overview of the english slot \ufb01lling track at the tac2014 knowledge base population\n\nevaluation. In Proc. Text Analysis Conference (TAC2014), 2014.\n\n[31] S. Takamatsu, I. Sato, and H. Nakagawa. Reducing wrong labels in distant supervision for relation\n\nextraction. In Proceedings of the ACL.\n\n[32] P. Verga, D. Belanger, E. Strubell, B. Roth, and A. McCallum. Multilingual relation extraction using\n\ncompositional universal schema. arXiv preprint arXiv:1511.06396, 2015.\n\n[33] Y. Zhang, X. Chen, D. Zhou, and M. I. Jordan. Spectral methods meet em: A provably optimal algorithm\nfor crowdsourcing. In Advances in Neural Information Processing Systems 27, pages 1260\u20131268. 
2014.