{"title": "Semi-Supervised Learning with Declaratively Specified Entropy Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 4425, "page_last": 4435, "abstract": "We propose a technique for declaratively specifying strategies for semi-supervised learning (SSL). SSL methods based on different assumptions perform differently on different tasks, which leads to difficulties applying them in practice. In this paper, we propose to use entropy to unify many types of constraints. Our method can be used to easily specify ensembles of semi-supervised learners, as well as agreement constraints and entropic regularization constraints between these learners, and can be used to model both well-known heuristics such as co-training, and novel domain-specific heuristics. Besides, our model is flexible as to the underlying learning mechanism. Compared to prior frameworks for specifying SSL techniques, our technique achieves consistent improvements on a suite of well-studied SSL benchmarks, and obtains a new state-of-the-art result on a difficult relation extraction task.", "full_text": "Semi-Supervised Learning with Declaratively\n\nSpeci\ufb01ed Entropy Constraints\n\nHaitian Sun\n\nMachine Learning Department\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nhaitians@cs.cmu.edu\n\nLidong Bing\u2217\n\nR&D Center Singapore\n\nMachine Intelligence Technology\n\nAlibaba DAMO Academy\nl.bing@alibaba-inc.com\n\nWilliam W. Cohen\n\nMachine Learning Department\nCarnegie Mellon University\n\nPittsburgh, PA 15213\nwcohen@cs.cmu.edu\n\nAbstract\n\nWe propose a technique for declaratively specifying strategies for semi-supervised\nlearning (SSL). SSL methods based on different assumptions perform differently\non different tasks, which leads to dif\ufb01culties applying them in practice. In this\npaper, we propose to use entropy to unify many types of constraints. 
Our method\ncan be used to easily specify ensembles of semi-supervised learners, as well\nas agreement constraints and entropic regularization constraints between these\nlearners, and can be used to model both well-known heuristics such as co-training,\nand novel domain-speci\ufb01c heuristics. Besides, our model is \ufb02exible as to the\nunderlying learning mechanism. Compared to prior frameworks for specifying\nSSL techniques, our technique achieves consistent improvements on a suite of\nwell-studied SSL benchmarks, and obtains a new state-of-the-art result on a dif\ufb01cult\nrelation extraction task.\n\n1\n\nIntroduction\n\nMany semi-supervised learning (SSL) methods are based on regularizers which impose \u201csoft con-\nstraints\u201d on how the learned classi\ufb01er will behave on unlabeled data. For example, logistic regression\nwith entropy regularization [11] and transductive SVMs [13] constrain the classi\ufb01er to make con\ufb01dent\npredictions at unlabeled points; the NELL system [7] imposes consistency constraints based on an\nontology of types and relations; and graph-based SSL approaches require that the instances associated\nwith the endpoints of an edge have similar labels [31, 1, 25] or embedded representations [28, 30, 14].\nCertain other weakly-supervised methods also can be viewed as constraining predictions made by a\nclassi\ufb01er: for instance, in distantly-supervised information extraction, a useful constraint requires\nthat the classi\ufb01er, when applied to the set S of mentions of an entity pair that is a member of relation\nr, classi\ufb01es at least one mention in S as a positive instance of r [12].\nUnfortunately, although many speci\ufb01c SSL constraints have been proposed, there is little consensus\nas to which constraint should be used in which setting, so determining which SSL approach to use\non a speci\ufb01c task remains an experimental question\u2014and one which requires substantial effort to\nanswer, since different 
SSL strategies are often embedded in different learning systems. To address\n\n\u2217 This work was mainly done when Lidong Bing was working at Tencent AI Lab.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Specifying supervised text classification declaratively with TensorLog.\n\nthis problem, Bing et al. [3] proposed a succinct declarative language for specifying semi-supervised learners, the D-Learner. The D-Learner allowed many constraints to be specified easily, and allowed all constraints to be easily combined and compared. The relative weights of different SSL heuristics can also be tuned collectively in the D-Learner using Bayesian optimization. The D-Learner was demonstrated by encoding a number of intuitive problem-specific SSL constraints, and was able to achieve significant improvements over state-of-the-art weakly supervised learners on two real-world relation extraction tasks.\nA disadvantage of the D-Learner is that it is limited to specifying constraints on, and ensembles of, a single learning system\u2014a particular variety of supervised learner based on the ProPPR probabilistic logic [27]. In this paper we introduce a variant of the D-Learner called the DCE-Learner (for Declaratively Constrained Entropy) that uses a more constrained specification language, but is paired with a more effective and more flexible underlying learning system. This leads to consistent improvements over the original D-Learner on five benchmark problems.\nIn the next section, we first introduce several SSL constraints in a uniform notation. Then, in Section 3, we experiment with some benchmark text categorization tasks, to illustrate the effectiveness of the constraints. 
Finally, in Section 4, we generalize our model to a difficult relation extraction task in the drug and disease domains, where we obtain a state-of-the-art result using this framework.\n\n2 Declaratively Specifying SSL Algorithms\n\n2.1 An Example: Supervised Classification\n\nA method for declaratively specifying constrained learners with the probabilistic logic ProPPR has been previously described in [27]. Here we present a modification of this approach for the probabilistic logic TensorLog [8]. TensorLog is a somewhat more restricted logic, but has the advantage that queries in TensorLog can be compiled to computation graphs in deep-learning platforms like TensorFlow. This leads to a number of advantages, in particular the ability to exploit GPU processors, well-tuned gradient-descent optimizers, and extensive libraries of learners.\nWe begin with the example of supervised learning for document classifiers shown in Figure 1. A TensorLog program consists of (1) a set of rules (function-free Horn clauses) of the form \u201cA \u2190 B1, . . . , Bk\u201d, where A and the Bi\u2019s are of the form r(X, Y ) and X, Y are variables; and (2) a set of weighted facts, where \u201cfacts\u201d are simply triples r(a, b) where r is a relation, and a, b are constants. We say that constant a is the \u201chead\u201d entity for r, and b is the \u201ctail\u201d entity.\nTo encode a text classification task, a document xi, which is normally stored as a set of weighted terms, will be encoded as weighted facts for the hasFeature relation. For instance, the document x1 containing the text \u201cParsing with LSTMs\u201d might be stoplisted, stemmed, and then encoded with the facts hasFeature(x1,pars) and hasFeature(x1,lstm), which have weights 0.6 and 0.4 respectively (representing importance in the original document). 
A classifier will be encoded by a set of triples which associate features and classes: for instance, indicates(pars,accept) with weight 0.2 might indicate a weak association between the term pars and the class accept. Finally, the classifier is the single rule\n\npredict(X,Y) \u2190 hasFeature(X,F), indicates(F,Y)\n\n\fIn TensorLog, capital letters are universally quantified variables, and the comma between the two predicates on the right-hand side of the rule stands for conjunction, so this can be read as asserting \u201cfor all X, F, Y , if X contains a feature F which indicates a class Y then we predict that X has the class Y .\u201d\nGiven rules and facts, a traditional non-probabilistic logic will answer a query such as predict(x1,Y) by finding all y such that predict(x1,Y) can be proved, where a proof is similar to a context-free grammar derivation: at each stage one can apply a rule\u2014e.g., to replace the query predict(x1,Y) with the conjunction hasFeature(x1,f1),indicates(f1,Y)\u2014or one can match against a fact. The middle part of Figure 1 shows two sample proofs, which will support predicting both the class \u201caccept\u201d and the class \u201creject\u201d.\nIn TensorLog, however, each proof is associated with a weight, which is simply the product of the weights of the facts used to support the proof: i.e., the weight for a proof is \u220f_{i=1}^{n} wi, where wi is the weight of the fact used at step i in the proof (or wi = 1, if a rule was used at step i). TensorLog will compute all proofs that support any answer to a query, and use the sum of those weights as a confidence in an answer. Figure 1 also shows a plate-like diagram indicating the set of proofs that would be explored for this task (with the box labeled with \u201c\u2212\u201d denoting a completed proof.) 
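The sum-of-products semantics just described can be sketched numerically for the running example of Figure 1. This is a minimal numpy sketch: the 0.6/0.4 feature weights and the 0.2 classifier weight come from the text, while the remaining indicates weights are invented purely for illustration.

```python
import numpy as np

# Hypothetical weighted facts for document x1 ("Parsing with LSTMs").
# 0.6/0.4 and the 0.2 indicates-weight are from the running example;
# the other indicates-weights are invented for illustration.
has_feature = {"pars": 0.6, "lstm": 0.4}                 # hasFeature(x1, F)
indicates = {("pars", "accept"): 0.2, ("pars", "reject"): 0.1,
             ("lstm", "accept"): 0.3, ("lstm", "reject"): 0.05}

classes = ["accept", "reject"]

# Each proof of predict(x1, Y) uses one hasFeature fact and one indicates
# fact, so its weight is the product of the two; the confidence in a label
# is the sum of the weights of all proofs supporting it.
proof_sums = np.array([
    sum(has_feature[f] * indicates[(f, y)] for f in has_feature)
    for y in classes
])

# The vector of weighted proof counts is passed through a softmax to
# obtain a distribution over labels.
dist = np.exp(proof_sums) / np.exp(proof_sums).sum()
```

Here predict(x1,accept) accumulates two proofs (0.6 · 0.2 and 0.4 · 0.3), so accept receives the higher confidence.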
TensorLog aggregates the weights using dynamic programming, speci\ufb01cally by unrolling the\nmessage-passing steps of belief propagation on a certain factor graph into a computational graph,\nwhich can then be compiled into TensorFlow. For more details on TensorLog\u2019s semantics, see [8].\nTensorLog learns to answer queries of the form qi(xi, Y ): that is, given such a query it learns to\n\ufb01nd all the constants yij\u2019s such that qi(xi, yij) is provable, and associates an appropriate con\ufb01dence\nscore with each such prediction. By default TensorLog assumes there is one correct label yi for each\nquery entity xi, and the vector of weighted proof counts for the provable yij\u2019s is passed through a\nsoftmax to obtain a distribution. Learning in TensorLog uses a set of training examples {(qi, xi, yi)}i\nand gradient descent to optimize cross-entropy loss on the data, allowing the weights of some user-\nspeci\ufb01ed set of facts to vary in the optimization. In this example, if only the weights of the indicates\nfacts vary, the learning algorithm shown in the \ufb01gure is closely related to logistic regression. During\ntraining, labeled examples {(xi, yi)}i become TensorLog examples of the form predict(xi,yi).\n\n2.2 Declaratively Describing Entropic Regularization\n\nA commonly used approach to SSL is entropy regularization, where one introduces a regularizer\nthat encourages the classi\ufb01er to predict some class con\ufb01dently on unlabeled examples. Entropic\nregularizers encode a common bias of SSL systems: the decision boundaries should be drawn in\nlow-probability regions of the space. 
For instance, transductive SVMs maximize the \u201cunlabeled data margin\u201d based on the low-density separation assumption that a good decision hyperplane lies in a sparse region of the feature space [13].\nTo implement entropic regularization, we extend TensorLog\u2019s interpreter to support a new predicate, entropy(Y,H), which is defined algorithmically, instead of by a set of database facts. It takes the distribution of the variable Y as input, and outputs a weighting over the two entities high and low such that high has weight SY and low has weight 1 \u2212 SY , where SY is the entropy of the distribution of values of Y . (In the experiments we actually use Tsallis entropy with q = 2 [18], a variant of the usual Boltzmann\u2013Gibbs entropy.) Using this extension, one can implement entropic regularization SSL by adding a single rule to the theory of Figure 1. These entropic regularization (ER) rules are shown in Figure 2.\nIn learning, each unlabeled xi is converted to a training example of the form predictionHasEntropy(xi,low), which thus encourages the classifier to have low entropy over the distribution of predicted labels for xi. During gradient descent, parameters are optimized to maximize the probability of low, and thus minimize the entropy of Y .\n\n\fFigure 2: Declaratively specified SSL rules\n\nHere we instead use an entropic constraint to encourage agreement. 
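A minimal sketch of this entropy predicate and the resulting ER loss, assuming the Tsallis q = 2 form mentioned above (the helper name er_loss is ours, purely for illustration, not part of TensorLog):

```python
import numpy as np

def entropy_predicate(p):
    """Sketch of entropy(Y, H): given a distribution p over the values of Y,
    return weights over the two entities {high, low}. S is the Tsallis
    entropy with q = 2, i.e. S = 1 - sum_i p_i**2, which lies in [0, 1)."""
    p = np.asarray(p, dtype=float)
    s = 1.0 - np.sum(p ** 2)
    return {"high": s, "low": 1.0 - s}

def er_loss(p):
    # Training on predictionHasEntropy(x, low) maximizes the probability
    # of the entity low, i.e. minimizes -log(1 - S).
    return -np.log(entropy_predicate(p)["low"])

confident = entropy_predicate([0.95, 0.05])   # low entropy: "low" dominates
uncertain = entropy_predicate([0.5, 0.5])     # high entropy: an even split
```

Minimizing er_loss over unlabeled points pushes the predicted distribution toward a confident, low-entropy one, which is the behavior described for the ER examples.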
Specifically, we construct a theory that says that if the predictions of the two classifiers are disjunctively combined (on an unlabeled example), then the resulting combination of predictions should have low entropy. These are shown as the co-training (CT) rules in Figure 2.\nThe same types of examples would be provided as above: the predict(xi,yi) examples for labeled (xi, yi) pairs would encourage both classifiers to classify the labeled data correctly, and the examples predictionHasEntropy(xj,low) for unlabeled xj would encourage agreement.\n\n2.4 Declaratively Describing Network Classifiers\n\nAnother common setting in which SSL is useful is when there is some sort of neighborhood structure that indicates pairs of examples that are likely to have the same class. An example of this setting is hypertext categorization, where two documents are considered to be a pair if one cites another. If we assume that a hyperlink from document x1 to x2 is indicated by the fact near(x1,x2) then we can encourage a classifier to make similar decisions for the neighbors of a document with the neighbor entropy regularization (NBER) rules in Figure 2. Here the unlabeled examples would be converted to TensorLog examples of the form neighborPredictionsHaveEntropy(xj,low).\nVariants of this SSL algorithm can be encoded by replacing the near conditions with alternative notions of similarity. For example, links between unlabeled examples are often used in another way in label propagation methods, such as harmonic fields [31] or modified absorption [25]. If the weights of near facts are less than one, we can define a variant of random-walk proximity in a graph with recursion, using the label-propagation entropy regularization (LPER) rule (plus the usual \u201cpredict\u201d rule), as shown in Figure 2. 
Note that in this SSL model, we are regularizing the feature-based classifier to behave similarly to a label propagation learner, so the learner is still inductive, not transductive.\nAnother variant of this model is formed by noting that in some situations, direct links may indicate different classes: e.g., label propagation using links for an \u201cX1 / X2 is the advisor of Z\u201d relationship may not be good at predicting the class labels \u201cprofessor\u201d vs \u201cstudent\u201d. In many of these cases, replacing the condition near(X1,X2) with a two-step similarity near(X1,Z),near(Z,X2) improves label propagation performance: in the case above, labels would be propagated through an \u201cX1 co-advises a student with X2\u201d relationship, and in the case of hyperlinks, the relationship would be a co-citation relationship. Below we will call these the co-linked label propagation entropy regularization (COLPER) rules, as shown in Figure 2.\n\n3 Experimental Results \u2013 Text Categorization\n\nFollowing [3], we consider SSL performance on three widely-used benchmark datasets for classification of hyperlinked text: Citeseer, Cora, and PubMed [21]. We apply one rule for entropic regularization (ER) and three rules for network classifiers (NBER, LPER, COLPER). For this task, the co-training heuristic is not applicable.\n\n3.1 The Task and Model Configuration\n\nMany datasets contain data that are interlinked. For example, in Citeseer, papers that cite each other are likely to have similar topics. Many algorithms have been proposed to exploit such link structure to\n\n\fimprove the prediction accuracy. In this experiment, the objective is to classify documents, given a bag of words as features, and citation relations as links between them.\nAs described above, we use hasFeature(di, wj) as a fact if word wj exists in document di and use this to declaratively specify a classifier. 
For unlabeled data, we add entropic regularization (ER) by adding a predictionHasEntropy(di, low) training example for all unlabeled documents di. To apply the rules for network classifiers in Section 2.4, we consider near(X1, X2) as a citation relation, i.e. if document di cites document dj, then near(di, dj) and near(dj, di). We also define every document to be \u201cnear\u201d itself, so near(di, di) is always true. Then, we can simply apply the NBER, LPER, and COLPER rules to every unlabeled document.\nDuring training, we have five losses: supervised classification, ER, NBER, LPER, and COLPER. These are combined with different weights of importance:\n\nltotal = lpredict + w1 \u00b7 lER + w2 \u00b7 lNBER + w3 \u00b7 lLPER + w4 \u00b7 lCOLPER\n\nwhere the wi\u2019s are hyper-parameters that will be tuned with Bayesian Optimization [22] 2.\nIn this experiment, our learning algorithm is closely related to logistic regression. However, since the SSL strategies are based on the predicate predict(X, Y), one could replace this learner with any other learning algorithm\u2013in particular, one could replace it with a non-declaratively specified prediction rule, using the same programming mechanism we used to define entropy. Exactly the same rules could be used to define the SSL strategies.\n\n3.2 Results\n\nWe take 20 labeled examples for each class as training data, and reserve 1,000 examples as test data. Other examples are treated as unlabeled. We compare our results with baseline models from the D-Learner paper [3]: a supervised SVM with linear kernel (SL-SVM), a supervised version of ProPPR (SL-ProPPR) [27], and the D-Learner. Our model variants show consistent improvement over baseline models, as shown in Table 1. More importantly, our full model, i.e. 
\u201cAll\u201d, performs the best, which shows that combining constraints can further improve performance.\n\nTable 1: Text categorization results in percentage (note that Tables 1a and 1b use different data splits)\n\n(a) Results of using different rules\n\nModel | CiteSeer | Cora | PubMed\nSL-SVM | 55.8 | 52.0 | 66.5\nSL-ProPPR | 52.8 | 55.1 | 68.8\nD-Learner | 55.1 | 58.1 | 69.9\nDCE (Ours): Supervised | 59.8 | 59.3 | 71.8\n+ ER | 60.3 | 59.9 | 72.7\n+ NBER | 61.4 | 60.3 | 72.5\n+ LPER | 61.3 | 60.5 | 73.1\n+ COLPER | 60.9 | 60.2 | 73.3\n+ All | 61.7 | 60.5 | 73.8\n\n(b) Results compared with other models (our model is inductive, not transductive)\n\nModel | CiteSeer | Cora | PubMed\nSL-Logit | 57.2 | 57.4 | 69.8\nSemiEmb | 59.6 | 59.0 | 71.1\nManiReg | 60.1 | 59.5 | 70.7\nGraphEmb | 43.2 | 67.2 | 65.3\nPlanetoid-I | 64.7 | 61.2 | 77.2\nDCE (Ours): Supervised | 63.6 | 60.7 | 72.7\n+ All | 65.7 | 61.5 | 74.4\nTransductive:\nTSVM* | 64.0 | 57.5 | 62.2\nPlanetoid-T* | 62.9 | 75.7 | 75.7\nGAT* | 72.5 | 83.0 | 79.0\n\nWe also compare our model with several other models3: supervised logistic regression (SL-Logit), semi-supervised embedding (SemiEmb) [28], manifold regularization (ManiReg) [1], TSVM [13], graph embeddings (GraphEmb) [17], and the inductive version of Planetoid (Planetoid-I) [29]. The DCE-Learner performs better on the smaller datasets, Citeseer and Cora, and Planetoid-I (with a more complex multi-layer classifier) performs better on PubMed.\nThe models compared here, like the DCE-Learner, are inductive: they do not use any graph information at classification time, nor do they access the test data at training time. Inductive learners are\n\n2Open source Bayesian optimization tool available at https://github.com/HIPS/Spearmint\n3Data splits available at https://github.com/kimiyoung/planetoid\n\n\fmuch more efficient at test time; however, it is worth noting that better accuracy can often be obtained by transductive methods. 
For reference we also give results for TSVM, the transductive version of Planetoid (Planetoid-T), and a recent variant of graph convolutional networks, Graph Attention Networks (GAT) [26], which to our knowledge is the current state-of-the-art on these tasks.\n\n4 Experimental Results \u2013 Relation Extraction\n\nAnother common task in NLP is relation extraction. In this experiment, we start with two distantly supervised information extraction pipelines, DIEL [2] and DIEJOB [4], which extract relation and type examples from entity-centric documents. Then, we train our classifier with several declaratively defined constraints, including one rule for the co-training heuristic (CT) and a few variants of constraints for network classifiers (NBER and COLPER). Experimental results show our model consistently improves performance on two datasets in the drug and disease domains.\n\n4.1 The Task and Data Preparation\n\nIn an entity-centric corpus, each document describes some aspects of a particular entity, called the subject entity. For example, the Wikipedia page about \u201cAspirin\u201d is an entity-centric document, which introduces its medical use, side effects, and other information. The goal of this task is to predict the relation of a noun phrase to its subject. For example, we would like to determine if \u201cheartburn\u201d is a \u201cside effect\u201d of \u201cAspirin\u201d. Since the subjects of documents are simply their titles in an entity-centric corpus, the task is reduced to classifying a noun phrase X into one of several pre-defined relations R, i.e. predictR(X,R).\nWe ran experiments on two datasets in the drug and disease domains, respectively: DailyMed with 28,590 articles and WikiDisease with 8,596 articles. These datasets are described in [5]. 
We directly employ the preprocessed corpora from [5] 4, which contain shallow features such as tokens from the sentence containing the noun phrase and unigrams/bigrams from a window around it, and also features derived from dependency parses. In the drug domain, we predict if a noun phrase describes one of three relations: \u201cside effect\u201d, \u201cused to treat\u201d, and \u201cconditions this may prevent\u201d. If none of these relations holds, a noun phrase should be classified as \u201cother\u201d. In the disease domain, we predict five relations: \u201chas treatment\u201d, \u201chas symptom\u201d, \u201chas risk factor\u201d, \u201chas cause\u201d, and \u201chas prevention factor\u201d.\nFollowing prior work, labels were produced using DIEJOB [4] to extract noun phrases that have non-trivial relations to their subjects. Since a noun phrase could have different meanings in different contexts, each mention is treated independently. DIEJOB employs a small but well-structured corpus to collect some confident examples with distant supervision as seeds, then propagates labels in a bipartite graph (links are between noun phrases and their features) to collect a larger set of training examples. Then, we use DIEL [2] to extract predicted types of noun phrases, where types indicate the ontological category of the referent, rather than its relationship to the subject entity. There are six pre-defined types: \u201crisk\u201d, \u201csymptom\u201d, \u201cprevention\u201d, \u201ccause\u201d, \u201ctreatment\u201d, and \u201cdisease\u201d. Types and relations are clearly related, so we will use this connection to build our co-training heuristics later. DIEL uses coordinate lists to build a bipartite graph, and propagates from type seeds to collect a larger set of type examples as training data.\n\n4.2 Model Configuration\n\nWe then introduce a domain-specific set of SSL constraints for this problem. 
The most confident outputs from DIEL and DIEJOB are selected as distantly labeled examples. However, some examples could be misclassified, since both models are based on label propagation with a limited number of seeds. To mitigate this issue, and to exploit unlabeled data, we design a variant of the co-training constraint (CT) to merge the information from relations and types. Relations and types are connected with a predicate hasType(R, T), which contains four facts in the drug domain, as shown in Figure 3. For example, the third fact says that the tail entity of the \u201cside effect\u201d relation should be of the \u201csymptom\u201d type. For the disease domain, we define such facts similarly.\n\n4Data available at http://www.cs.cmu.edu/~lbing/#data_aaai-2016_diebolds\n\n\fFigure 3: SSL rules for relation extraction\n\nWe can easily obtain two classifiers to predict the type and relation of a noun phrase: predictR(X,R) and predictT(X,T), which are simple classifiers as described in Section 2.1 using their own features. Then, hasType(R, T) converts its relation R to type T. Following the co-training rule (CT) in Section 2.3, predictionHasEntropy(xi, low) (softly) forces the two predictions to match for each unlabeled noun phrase xi, as shown in Figure 3.\nIn addition to the co-training constraint (CT), we construct another constraint with the assumption that a noun phrase mentioned multiple times in the same document should have the same relationship to the subject entity. For example, if \u201cheartburn\u201d appears twice in the \u201cAspirin\u201d document, both mentions should be labeled as \u201cside effect\u201d. Intuitively, \u201cheartburn\u201d cannot be a \u201csymptom\u201d and a \u201cside effect\u201d of a specific drug at the same time. This constraint can be implemented as a variant of neighbor entropy regularization (NBER). 
Let xi and xj be two mentions of a noun phrase, and let pxixj be a virtual entity that represents the pair (xi, xj). Conceptually, pxixj is a parent node, and xi and xj are its children. In analogy to near(X1,X2), we create a new predicate hasExample(P,X) and consider hasExample(pxixj , xi) and hasExample(pxixj , xj) as facts. Now, we are ready to create a training example pairPredictionsHaveEntropy(pxixj , low), which encourages xi and xj to be classified into the same category. This NBER constraint is shown in Figure 3.\nNote that xi and xj in the previous example are at distance two, i.e. xi and xj \u201cco-advise\u201d pxixj . Inspired by co-linked label propagation entropy regularization (COLPER), we can recursively force a set of examples to have the same prediction. hasExampleSet(P,X) is recursively defined as shown in Figure 3, where inPair(X,P) is the inverse of hasExample(P,X), i.e. inPair(xi, pxixj ) iff hasExample(pxixj , xi). Using hasExampleSet(P,X), we expand a pair of noun phrases (xi, xj) to a set {x1 \u00b7\u00b7\u00b7 xn} in which any pair of noun phrases is similar.\nAnother observation is that a noun phrase appearing in sections with the same name usually has the same relationship to its subject, even across different documents. For example, if \u201cheartburn\u201d appears in the \u201cAdverse reactions\u201d section of both \u201cAspirin\u201d and \u201cSingulair\u201d, both mentions should be labeled \u201cside effect\u201d, or else neither should be. Given a pair of mentions from sections with the same name, we also construct training examples for the NBER rule pairPredictionsHaveEntropy(E,H). Note that we use the same rules, but examples are prepared under different assumptions.\nSimilar to text categorization, for each unlabeled example xi, we also add an ER example predictionHasEntropy(xi, low). 
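The pair constraint can be sketched in the same style as the earlier entropy examples; the averaging used to combine the two mentions' predictions, and the example distributions, are our illustrative assumptions:

```python
import numpy as np

def tsallis_q2(p):
    # Tsallis entropy with q = 2, as used for the entropy(Y, H) predicate.
    return 1.0 - np.sum(np.asarray(p, dtype=float) ** 2)

def pair_entropy_loss(p_i, p_j):
    """Sketch of pairPredictionsHaveEntropy(p, low) for a pair entity p
    with children x_i and x_j: combine the two mentions' predicted relation
    distributions and penalize the entropy of the combination."""
    combined = (np.asarray(p_i) + np.asarray(p_j)) / 2.0
    return -np.log(1.0 - tsallis_q2(combined))  # maximize the weight of low

# Confident, agreeing mentions incur a small loss; disagreeing ones a large one.
agree = pair_entropy_loss([0.9, 0.1, 0.0], [0.8, 0.2, 0.0])
disagree = pair_entropy_loss([0.9, 0.1, 0.0], [0.1, 0.9, 0.0])
```

This loss is minimized when both mentions place their mass on the same relation, which is exactly the agreement the pair examples encourage.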
Rules are combined with a weighted sum of losses, and weights are tuned with Bayesian Optimization.\n\n4.3 Results\n\nIn the drug domain, we take 500 relation and type examples for each class as labeled data and randomly select 2,000 unlabeled examples for each constraint. In the disease domain, we take 2,000 labeled relation and type examples for each class and 4,000 unlabeled examples for each constraint. The evaluation dataset was originally prepared in DIEJOB, and contains 436 and 320 examples for the disease and drug domains, respectively. The model is evaluated from an information retrieval perspective. We predict the relation of all noun phrases in test documents and drop those that are predicted as \u201cother\u201d. Then, we compare the retrieved examples with the ground truth to calculate the precision, recall and F1 score. There is also a tuning dataset; please refer to [4] for more details on the evaluation data, tuning data, and evaluation protocols.\nBesides the D-Learner, we compare our results with the following baseline models5. 
The first four are supervised learning baselines via distant supervision: DS-SVM and DS-ProPPR directly employ the distantly labeled examples as training data, and use SVM or ProPPR as the learner, while DS-Dist-SVM and DS-Dist-ProPPR first employ DIEJOB to distill the distant examples, as done for the D-Learner and our model. The second group of comparisons includes three existing methods: MultiR [12], which models each mention separately and aggregates their labels using a deterministic OR; Mintz++ [23], which improves the original model [16] by training multiple classifiers, and allowing multiple labels per entity pair; and MIML-RE [23], which has a similar structure to MultiR, but uses a classifier to aggregate mention-level predictions into an entity-pair prediction. We use the public code from the authors for the experiments6.\n\n5Refer to [3] for the full details.\n\n\fTable 2: Relation extraction results in F1\n\n(a) Results compared to baseline models\n\nModel | Disease | Drug\nDS-SVM | 17.8 | 27.1\nDS-ProPPR | 17.2 | 21.9\nDS-Dist-SVM | 30.0 | 27.5\nDS-Dist-ProPPR | 25.3 | 24.3\nMultiR | 14.6 | 24.9\nMintz++ | 17.8 | 24.9\nMIML-RE | 16.3 | 26.6\nD-Learner | 37.8 | 31.6\nDCE (Ours) | 50.1 | 33.3\n\n(b) Results of using different rules\n\nModel | Disease | Drug\nDCE (Supervised) | 46.6 | 30.4\n+ ER-Relation | 48.5 | 31.4\n+ ER-Type | 48.3 | 31.2\n+ NBER-Doc | 48.2 | 32.5\n+ NBER-Sec | 47.9 | 32.4\n+ COLPER-Doc | 48.4 | 32.4\n+ COLPER-Sec | 48.7 | 31.9\n+ CT | 49.8 | 31.2\n+ All | 50.1 | 33.3\n\nResults in Table 2 show that entropic regularizations (ER, NBER, and COLPER) and co-training heuristics (CT) do improve the performance of the model. Our full model achieves a new state-of-the-art result. We observe that different rules yield varying degrees of improvement. 
Furthermore, a weighted combination of rules can improve the overall performance, which again shows that combining constraints is an effective approach.\n\n5 Related Work\n\nThis work builds on Bing et al. [3], who proposed a declarative language for specifying semi-supervised learners, the D-Learner. The DCE-Learner explored here has similar goals, but is paired with a more effective and more flexible underlying learning system, and experimentally improves over the original D-Learner on all our benchmark tasks, even though its constraint language is somewhat less expressive.\nMany distinct heuristics for SSL have been proposed in the past (some of which we discussed above, e.g., making confident predictions at unlabeled points [13]; imposing consistency constraints based on an ontology of types and relations [7]; instances associated with the endpoints of an edge having similar labels [31, 1, 25] or embedded representations [28, 30, 14]). Some heuristics have been formulated in a fairly general way, notably the information regularization of [24], where arbitrary coupling constraints can be used. However, information regularization does not allow multiple coupling types to be combined, nor does it have an obvious extension to support label-propagation-type regularizers, as the DCE-Learner does.\nThe regularizers introduced for relation extraction are also related to well-studied SSL methods; for instance, the rules encouraging agreement between predictions based on the type and relationship are inspired by constraints used in NELL [7, 19], and the remaining relation-extraction constraints are specific variants of the \u201cone sense per discourse\u201d constraint of [10]. There is also a long tradition of incorporating constraints or heuristic biases in distant-supervision systems for relation extraction, such as DIEBOLD [2], DIEJOB [5], MultiR [12] and MIML [23]. 
The main contribution of the DCE-Learner over this prior work is to provide a convenient framework for combining such heuristics and for exploring new variants of existing heuristics.

6Code available at http://aiweb.cs.washington.edu/ai/raphaelh/mr/ and http://nlp.stanford.edu/software/mimlre.shtml

The Snorkel system [20] focuses on declarative specification of sources of weakly labeled data; in contrast, we specify entropic constraints over declaratively specified groups of examples. Previous work has also considered declarative specifications of agreement constraints among classifiers (e.g., [19]), or strategies for classification learning [9] and recommendation (e.g., [15]). However, these prior systems have not considered declarative specification of SSL constraints. We note that constraints are most useful when labeled data is limited, and SSL is often applicable in such situations.

6 Conclusions

SSL is a powerful method for language tasks, for which unlabeled data is usually plentiful. However, it is usually difficult to predict which SSL methods will work well on a new task, and it is also inconvenient to test alternative SSL methods, combine methods, or develop novel ones. To address this problem we propose the DCE-Learner, which allows one to declaratively state many SSL heuristics, and to combine and ensemble SSL heuristics using Bayesian optimization. We show consistent improvements over an earlier system with similar technical goals, and a new state-of-the-art result on a difficult relation extraction task.

References

[1] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.

[2] Lidong Bing, Sneha Chaudhari, Richard C. Wang, and William W. Cohen.
Improving distant supervision for information extraction using label propagation through lists. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 524–529, 2015.

[3] Lidong Bing, William W. Cohen, and Bhuwan Dhingra. Using graphs of classifiers to impose declarative constraints on semi-supervised learning. In Carles Sierra, editor, Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pages 1454–1460, 2017.

[4] Lidong Bing, Bhuwan Dhingra, Kathryn Mazaitis, Jong Hyuk Park, and William W. Cohen. Bootstrapping distantly supervised IE using joint learning and small well-structured corpora. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 3408–3414. AAAI Press, 2017.

[5] Lidong Bing, Mingyang Ling, Richard C. Wang, and William W. Cohen. Distant IE by bootstrapping using lists and document structure. In Dale Schuurmans and Michael P. Wellman, editors, Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2899–2905, 2016.

[6] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT '98, pages 92–100, New York, NY, USA, 1998. ACM.

[7] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, volume 5, page 3. Atlanta, 2010.

[8] William W. Cohen, Fan Yang, and Kathryn Mazaitis. TensorLog: Deep learning meets probabilistic DBs. CoRR, abs/1707.05390, 2017.

[9] Michelangelo Diligenti, Marco Gori, and Claudio Saccà. Semantic-based regularization for learning and inference. Artificial Intelligence, 244:143–165, 2017.
Combining Constraint Solving with Mining and Learning.

[10] William A. Gale, Kenneth W. Church, and David Yarowsky. One sense per discourse. In Proceedings of the Workshop on Speech and Natural Language, pages 233–237. Association for Computational Linguistics, 1992.

[11] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pages 529–536, 2005.

[12] Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 541–550. Association for Computational Linguistics, 2011.

[13] Thorsten Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML '99, pages 200–209, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.

[14] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[15] Pigi Kouki, Shobeir Fakhraei, James Foulds, Magdalini Eirinaki, and Lise Getoor. HyPER: A flexible and extensible probabilistic framework for hybrid recommender systems. In Proceedings of the 9th ACM Conference on Recommender Systems, pages 99–106. ACM, 2015.

[16] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 1003–1011, 2009.

[17] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena.
DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.

[18] A. R. Plastino and A. Plastino. Tsallis' entropy, Ehrenfest theorem and information theory. Physics Letters A, 177(3):177–179, 1993.

[19] Jay Pujara, Hui Miao, Lise Getoor, and William Cohen. Knowledge graph identification. In International Semantic Web Conference, pages 542–557. Springer, 2013.

[20] Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. Snorkel: Rapid training data creation with weak supervision. arXiv preprint arXiv:1711.10160, 2017.

[21] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93, 2008.

[22] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

[23] Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 455–465, 2012.

[24] Martin Szummer and Tommi S. Jaakkola. Information regularization with partially labeled data. In Advances in Neural Information Processing Systems, pages 1049–1056, 2003.

[25] Partha Pratim Talukdar and Koby Crammer. New regularized algorithms for transductive learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 442–457.
Springer, 2009.

[26] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

[27] William Yang Wang, Kathryn Mazaitis, and William W. Cohen. Programming with personalized PageRank: A locally groundable first-order probabilistic logic. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2129–2138. ACM, 2013.

[28] Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.

[29] Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861, 2016.

[30] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16, pages 321–328, 2003.

[31] Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 912–919, 2003.