{"title": "Learning with Relaxed Supervision", "book": "Advances in Neural Information Processing Systems", "page_first": 2827, "page_last": 2835, "abstract": "For weakly-supervised problems with deterministic constraints between the latent variables and observed output, learning necessitates performing inference over latent variables conditioned on the output, which can be intractable no matter how simple the model family is. Even finding a single latent variable setting that satisfies the constraints could be difficult; for instance, the observed output may be the result of a latent database query or graphics program which must be inferred. Here, the difficulty lies not in the model but in the supervision, and poor approximations at this stage could lead to following the wrong learning signal entirely. In this paper, we develop a rigorous approach to relaxing the supervision, which yields asymptotically consistent parameter estimates despite altering the supervision. Our approach parameterizes a family of increasingly accurate relaxations, and jointly optimizes both the model and relaxation parameters, while formulating constraints between these parameters to ensure efficient inference. These efficiency constraints allow us to learn in otherwise intractable settings, while asymptotic consistency ensures that we always follow a valid learning signal.", "full_text": "Learning with Relaxed Supervision

Jacob Steinhardt
Stanford University
jsteinhardt@cs.stanford.edu

Percy Liang
Stanford University
pliang@cs.stanford.edu

Abstract

For weakly-supervised problems with deterministic constraints between the latent variables and observed output, learning necessitates performing inference over latent variables conditioned on the output, which can be intractable no matter how simple the model family is. 
Even finding a single latent variable setting that satisfies the constraints could be difficult; for instance, the observed output may be the result of a latent database query or graphics program which must be inferred. Here, the difficulty lies not in the model but in the supervision, and poor approximations at this stage could lead to following the wrong learning signal entirely. In this paper, we develop a rigorous approach to relaxing the supervision, which yields asymptotically consistent parameter estimates despite altering the supervision. Our approach parameterizes a family of increasingly accurate relaxations, and jointly optimizes both the model and relaxation parameters, while formulating constraints between these parameters to ensure efficient inference. These efficiency constraints allow us to learn in otherwise intractable settings, while asymptotic consistency ensures that we always follow a valid learning signal.

1 Introduction

We are interested in the problem of learning from intractable supervision. For example, for a question answering application, we might want to learn a semantic parser that maps a question x (e.g., "Which president is from Arkansas?") to a logical form z (e.g., USPresident(e) ∧ PlaceOfBirth(e, Arkansas)) that executes to the answer y (e.g., BillClinton). If we are only given (x, y) pairs as training data [1, 2, 3], then even if the model pθ(z | x) is tractable, it is still intractable to incorporate the hard supervision constraint [S(z, y) = 1], since z and y live in a large space and S(z, y) can be complex (e.g., S(z, y) = 1 iff z executes to y on a database). In addition to semantic parsing, intractable supervision also shows up in inverse graphics [4, 5, 6], relation extraction [7, 8], program induction [9], and planning tasks with complex, long-term goals [10]. 
As we scale to weaker supervision and richer output spaces, such intractabilities will become the norm.

One can handle the intractable constraints in various ways: by relaxing them [11], by applying them in expectation [12], or by using approximate inference [8]. However, as these constraints are part of the supervision rather than the model, altering them can fundamentally change the learning process; this raises the question of when such approximations are faithful enough to learn a good model.

In this paper, we propose a framework that addresses these questions formally, by constructing a relaxed supervision function with well-characterized statistical and computational properties. Our approach is sketched in Figure 1: we start with an intractable supervision function q∞(y | z) (given by the constraint S), together with a model family pθ(z | x). We then replace q∞ by a family of functions qβ(y | z) which contains q∞, giving rise to a joint model pθ,β(y, z | x). We ensure tractability of inference by constraining pθ(z | x) and pθ,β(z | x, y) to stay close together, so that the supervision y is never too surprising to the model. Finally, we optimize θ and β subject to this tractability constraint; when qβ(y | z) is properly normalized, there is always pressure to use the true supervision q∞, and we can prove that the global optimum of pθ,β is an asymptotically consistent estimate of the true model.

[Figure 1: a diagram of the (θ, β) plane, with θ running from less accurate to more accurate and β from less exact to more exact; an intractable region lies above a tractable region, and the learning trajectory curves through the tractable region.]

Figure 1: Sketch of our approach; we define a family of relaxations qβ of the supervision, and then jointly optimize both θ and β. If the supervision qβ is too harsh relative to the accuracy of the current model pθ, inference becomes intractable. In Section 4, we formulate constraints to avoid this intractable region and learn within the tractable region.

Section 2 introduces the relaxed supervision model qβ(y | z) ∝ exp(β⊤ψ(z, y)), where ψ(z, y) = 0 iff the constraint S(z, y) is satisfied (the original supervision is then obtained when β = ∞). Section 3 studies the statistical properties of this relaxation, establishing asymptotic consistency as well as characterizing the properties for any fixed β: we show roughly that both the loss and statistical efficiency degrade by a factor of βmin^{-1}, the inverse of the smallest coordinate of β. In Section 4, we introduce novel tractability constraints, show that inference is efficient if the constraints are satisfied, and present an EM-like algorithm for constrained optimization of the likelihood. Finally, in Section 5, we explore the empirical properties of this algorithm on two illustrative examples.

2 Framework

We assume that we are given a partially supervised problem x → z → y where (x, y) ∈ X × Y are observed and z ∈ Z is unobserved. We model z given x as an exponential family pθ(z | x) = exp(θ⊤φ(x, z) − A(θ; x)), and assume that y = f(z) is a known deterministic function of z. Hence:

pθ(y | x) = Σ_z S(z, y) exp(θ⊤φ(x, z) − A(θ; x)),   (1)

where S(z, y) ∈ {0, 1} encodes the constraint [f(z) = y]. In general, f could have complicated structure, rendering inference (i.e., computing pθ(z | x, y), which is needed for learning) intractable. To alleviate this, we consider projections πj mapping Y to some smaller set Yj; we then obtain the (hopefully simpler) constraint that f(z) and y match under πj: Sj(z, y) := [πj(f(z)) = πj(y)]. We assume π1 × ··· × πk is injective, which implies that S(z, y) equals the conjunction ⋀_{j=1}^k Sj(z, y).

We also assume that some part of S (call it T(z, y)) can be imposed tractably. We can always take T ≡ 1, but it is better to include as much of S as possible, because T will be handled exactly while S will be approximated. We record our assumptions below:

Definition 2.1. Let S(z, y) encode the constraint f(z) = y. We say that (T, π1, ..., πk) logically decomposes S if (1) S implies T and (2) π1 × ··· × πk is injective.

Before continuing, we give three examples to illustrate the definitions above.

Example 2.2 (Translation from unordered supervision). Suppose that given an input sentence x, each word is passed through the same unknown 1-to-1 substitution cipher to obtain an enciphered sentence z, and then ordering is removed to obtain an output y = multiset(z). For example, we might have x = abaa, z = dcdd, and y = {c : 1, d : 3}. Suppose the vocabulary is {1, ..., V}. Our constraint is S(z, y) = [y = multiset(z)], which logically decomposes as

[y = multiset(z)]  ⟺  [zi ∈ y for all i]  ∧  ⋀_{j=1}^V [count(z, j) = count(y, j)],   (2)

where the left-hand side is S(z, y) (here f(z) = multiset(z)), the first conjunct is T(z, y), and the j-th count constraint is Sj(z, y) (here πj(y) = count(y, j)); count(·, j) counts the number of occurrences of the word j. 
The constraint T is useful because it lets us restrict attention to words in y (rather than all of {1, ..., V}), which dramatically reduces the search space. If each sentence has length L, then Yj = πj(Y) = {0, ..., L}.

Example 2.3 (Conjunctive semantic parsing). Suppose again that x is an input sentence, that each input word xi ∈ {1, ..., V} maps to a predicate (set) zi ∈ {Q1, ..., Qm}, and that the meaning y of the sentence is the intersection of the predicates. For instance, if the sentence x is "brown dog", and Q6 is the set of all brown objects and Q11 is the set of all dogs, then z1 = Q6, z2 = Q11, and y = Q6 ∩ Q11 is the set of all brown dogs. In general, we define y = ⟦z⟧ := z1 ∩ ··· ∩ zl. This is a simplified form of learning semantic parsers from denotations [2].

We let Y be every set that is obtainable as an intersection of predicates Q, and define πj(y) = [y ⊆ Qj] for j = 1, ..., m (so Yj = {0, 1}). Note that for all y ∈ Y, we have y = ∩_{j : πj(y)=1} Qj, so π1 × ··· × πm is injective. We then have the following logical decomposition:

[y = ⟦z⟧]  ⟺  [zi ⊇ y for all i]  ∧  ⋀_{j=1}^m ([⟦z⟧ ⊆ Qj] = [y ⊆ Qj]),   (3)

where the left-hand side is S(z, y), the first conjunct is T(z, y), and the j-th conjunct is Sj(z, y) (here πj(y) = [y ⊆ Qj]). The first constraint T factors across i, so it can be handled tractably.

Example 2.4 (Predicate abstraction). Next, we consider a program induction task; here the input x might be "smallest square divisible by six larger than 1000", z would be argmin{i1 | mod(i1,6) = 0 and i1 = i2*i2 and i1 > 1000}, and y would be 1296; hence S(z, y) = 1 if z evaluates to y. 
Suppose that we have a collection of predicates πj, such as π1(y) = mod(y, 6), π2(y) = isPrime(y), etc. These predicates are useful for giving partial credit; for instance, it is easier to satisfy mod(y, 6) = 0 than y = 1296, but many programs that satisfy the former will have pieces that are also in the correct z. Using the πj to decompose S will therefore provide a more tractable learning signal that still yields useful information.

Relaxing the supervision. Returning to the general framework, let us now use Sj and T to relax S, and thus also pθ(y | x). First, define penalty features ψj(z, y) = Sj(z, y) − 1, and also define qβ(y | z) ∝ T(z, y) exp(β⊤ψ(z, y)) for any vector β ≥ 0. Then, −log qβ(y | z) measures how far S(z, y) is from being satisfied: for each violated Sj, we incur a penalty βj (or infinite penalty if T is violated). Note that the original q∞(y | z) = S(z, y) corresponds to β1 = ··· = βk = +∞.

Normalization constant. The log-normalization constant A(β; z) for qβ is equal to log(Σ_{y∈Y} T(z, y) exp(β⊤ψ(z, y))); this is in general difficult to compute, since ψ could have arbitrary structure. Fortunately, we can uniformly upper-bound A(β; z) by a tractable quantity A(β):

Proposition 2.5. For any z, we have the following bound:

A(β; z) ≤ Σ_{j=1}^k log(1 + (|Yj| − 1) exp(−βj)) := A(β).   (4)

See the supplement for proof; the intuition is that, by injectivity of π1 × ··· × πk, we can bound Y by the product set Π_{j=1}^k Yj. 
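To illustrate Proposition 2.5, the bound (4) can be checked numerically in the simple case of binary projections (|Yj| = 2) with T ≡ 1; the penalty weights β and the projected values of f(z) below are arbitrary illustrative choices.

```python
import itertools
import math

beta = [0.5, 1.0, 1.5, 2.0]   # hypothetical penalties beta_j
fz = (1, 0, 1, 1)             # pi_j(f(z)) for j = 1..k, with binary projections
k = len(beta)

def A_exact(Y):
    """A(beta; z) = log sum_{y in Y} exp(beta^T psi(z, y)) with T == 1, where
    psi_j(z, y) = [pi_j(f(z)) = pi_j(y)] - 1, i.e. a -beta_j term per mismatch."""
    return math.log(sum(
        math.exp(sum(-b for b, a, c in zip(beta, fz, y) if a != c)) for y in Y))

# Right-hand side of (4): each |Y_j| = 2 here.
A_bound = sum(math.log(1 + math.exp(-b)) for b in beta)

full = list(itertools.product([0, 1], repeat=k))  # Y equal to the full product set
assert abs(A_exact(full) - A_bound) < 1e-9  # bound is tight for the product set
assert A_exact(full[:5]) < A_bound          # strictly below for a proper subset
```

When Y is the full product set the sum factorizes across coordinates and the bound holds with equality; restricting Y to a proper subset only removes non-negative terms, so A(β; z) can only decrease.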
We now define our joint model, which is a relaxation of (1):

qβ(y | z) = T(z, y) exp(β⊤ψ(z, y) − A(β)),   (5)
pθ,β(y | x) = Σ_z T(z, y) exp(θ⊤φ(x, z) + β⊤ψ(z, y) − A(θ; x) − A(β)),   (6)
L(θ, β) = E_{x,y∼p∗}[−log pθ,β(y | x)], where p∗ is the true distribution.   (7)

The relaxation parameter β provides a trade-off between faithfulness to the original objective (large β) and tractability (small β). Importantly, pθ,β(y | x) produces valid probabilities which can be meaningfully compared across different β; this will be important later in allowing us to optimize β. (Note that while Σ_y pθ,β(y | x) < 1 if the bound (4) is not tight, this gap vanishes as β → ∞.)

3 Analysis

We now analyze the effects of relaxing supervision (i.e., taking β < ∞); proofs may be found in the supplement. We will analyze the following properties:

1. Effect on loss: How does the value of the relaxation parameter β affect the (unrelaxed) loss of the learned parameters θ (assuming we had infinite data and perfect optimization)?
2. Amount of data needed to learn: How does β affect the amount of data needed in order to identify the optimal parameters?
3. Optimizing β and consistency: What happens if we optimize β jointly with θ? Is there natural pressure to increase β, and do we eventually recover the unrelaxed solution?

Notation. Let Ep∗ denote the expectation under x, y ∼ p∗, and let L(θ, ∞) denote the unrelaxed loss (see (5)-(7)). Let L∗ = inf_θ L(θ, ∞) be the optimal unrelaxed loss and θ∗ be the minimizing argument. Finally, let Eθ and Covθ denote the expectation and covariance, respectively, under pθ(z | x). To simplify expressions, we will often omit the arguments from φ(x, z) and ψ(z, y), and use S and ¬S for the events [S(z, y) = 1] and [S(z, y) = 0]. For simplicity, assume that T(z, y) ≡ 1.

Effect on loss. Suppose we set β to some fixed value (β1, ..., βk) and let θ∗_β be the minimizer of L(θ, β). Since θ∗_β is optimized for L(·, β) rather than L(·, ∞), it is possible that L(θ∗_β, ∞) is very large; indeed, if pθ∗_β(y | x) is zero for even a single outlier (x, y), then L(θ∗_β, ∞) will be infinite. However, we can bound θ∗_β under an alternative loss that is less sensitive to outliers:

Proposition 3.1. Let βmin = min_{j=1}^k βj. Then, Ep∗[1 − pθ∗_β(y | x)] ≤ L∗ / (1 − exp(−βmin)).

The key idea in the proof is that replacing S with exp(β⊤ψ) in pθ,β does not change the loss too much, in the sense that S ≤ exp(β⊤ψ) ≤ exp(−βmin) + (1 − exp(−βmin))S. When βmin ≪ 1, L∗ / (1 − exp(−βmin)) ≈ L∗ / βmin. Hence, the error increases roughly linearly with βmin^{-1}. If βmin is large and the original loss L∗ is small, then L(·, β) is a good surrogate. Of particular interest is the case L∗ = 0 (perfect predictions); in this case, the relaxed loss L(·, β) also yields a perfect predictor for any β > 0. Note conversely that Proposition 3.1 is vacuous when L∗ ≥ 1.

We show in the supplement that Proposition 3.1 is essentially tight:

Lemma 3.2. For any 0 < βmin < L∗, there exists a model with loss L∗ and a relaxation parameter β = (βmin, ∞, ..., ∞), such that Ep∗[pθ∗_β(y | x)] = 0.

Amount of data needed to learn. To estimate how much data is needed to learn, we compute the Fisher information I_β := ∇²_θ L(θ∗_β, β), which measures the statistical efficiency of the maximum likelihood estimator [13]. All of the equations below follow from standard properties of exponential families [14], with calculations in the supplement. For the unrelaxed loss, the Fisher information is:

I_∞ = Ep∗[Pθ∗[¬S] (Eθ∗[φ ⊗ φ | ¬S] − Eθ∗[φ ⊗ φ | S])].   (8)

Hence θ∗ is easy to estimate if the features have high variance when S = 0 and low variance when S = 1. This should be true if all z with S(z, y) = 1 have similar feature values while the z with S(z, y) = 0 have varying feature values.

In the relaxed case, the Fisher information can be written to first order as

I_β = Ep∗[Covθ∗_β[φ(x, z) ⊗ φ(x, z), −β⊤ψ(z, y)]] + O(β²).   (9)

In other words, I_β, to first order, is the covariance of the penalty −β⊤ψ with the second-order statistics of φ. To interpret this, we will make the simplifying assumptions that (1) βj = βmin for all j, and (2) the events ¬Sj are all disjoint. In this case, −β⊤ψ = βmin ¬S, and the covariance in (9) simplifies to

Covθ∗_β[φ ⊗ φ, −β⊤ψ] = βmin Pθ∗_β[S] Pθ∗_β[¬S] (Eθ∗_β[φ ⊗ φ | ¬S] − Eθ∗_β[φ ⊗ φ | S]).   (10)

Relative to (8), we pick up a βmin Pθ∗_β[S] factor. 
If we further assume that Pθ∗_β[S] ≈ 1, we see that the amount of data required to learn under the relaxation increases by a factor of roughly βmin^{-1}.

Optimizing β. We now study the effects of optimizing both θ and β jointly. Importantly, joint optimization recovers the true distribution pθ∗ in the infinite data limit:

Proposition 3.3. Suppose the model is well-specified: p∗(y | x) = pθ∗(y | x) for all x, y. Then, all global optima of L(θ, β) satisfy pθ,β(y | x) = p∗(y | x); one such optimum is θ = θ∗, β = ∞.

There is thus always pressure to send β to ∞ and θ to θ∗. The key fact in the proof is that the log-loss L(θ, β) is never smaller than the conditional entropy Hp∗(y | x), with equality iff pθ,β = p∗.

Summary. Based on our analyses above, we can conclude that relaxation has the following impact:
• Loss: The loss increases by a factor of βmin^{-1} in the worst case.
• Amount of data: In at least one regime, the amount of data needed to learn is βmin^{-1} times larger.

The general theme is that the larger β is, the better the statistical properties of the maximum-likelihood estimator. However, larger β also makes the distribution pθ,β less tractable, as qβ(y | z) becomes concentrated on a smaller set of y's. This creates a trade-off between computational efficiency (small β) and statistical accuracy (large β). We explore this trade-off in more detail in the next section, and show that in some cases we can get the best of both worlds.

4 Constraints for Efficient Inference

In light of the previous section, we would like to make β as large as possible; on the other hand, if β is too large, we are back to imposing S exactly and inference becomes intractable. We would therefore like to optimize β subject to a tractability constraint ensuring that we can still perform efficient inference, as sketched earlier in Figure 1. We will use rejection sampling as the inference procedure, with the acceptance rate as a measure of tractability.

To formalize our approach, we assume that the model pθ(z | x) and the constraint T(z, y) are jointly tractable, so that we can efficiently draw exact samples from

pθ,T(z | x, y) := T(z, y) exp(θ⊤φ(x, z) − A_T(θ; x, y)),   (11)

where A_T(θ; x, y) = log(Σ_z T(z, y) exp(θ⊤φ(x, z))). Most learning algorithms require the conditional expectations of φ and ψ given x and y; we therefore need to sample the distribution

pθ,β(z | x, y) = T(z, y) exp(θ⊤φ(x, z) + β⊤ψ(z, y) − A(θ, β; x, y)), where   (12)
A(θ, β; x, y) := log(Σ_z T(z, y) exp(θ⊤φ(x, z) + β⊤ψ(z, y))).   (13)

Since β⊤ψ ≤ 0, we can draw exact samples from pθ,β using rejection sampling: (1) sample z from pθ,T(· | x, y), and (2) accept with probability exp(β⊤ψ(z, y)). If the acceptance rate is high, this algorithm lets us tractably sample from (12). Intuitively, when θ is far from the optimum, the model pθ and constraints Sj will clash, necessitating a small value of β to stay tractable. 
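A minimal sketch of this rejection sampler follows; the uniform proposal and the single "z is even" constraint are illustrative stand-ins, not the paper's implementation.

```python
import math
import random

def rejection_sample(propose, psi, beta, max_tries=100_000):
    """Sample z ~ p_{theta,beta}(z | x, y): draw z from the tractable proposal
    p_{theta,T} and accept with probability exp(beta^T psi(z, y)), which lies
    in [0, 1] because psi <= 0. Returns (z, number of proposals used)."""
    for t in range(1, max_tries + 1):
        z = propose()
        if random.random() < math.exp(sum(b * p for b, p in zip(beta, psi(z)))):
            return z, t
    raise RuntimeError("acceptance rate too low; decrease beta")

# Toy instance: proposal uniform on {0, 1, 2, 3}; one constraint "z is even",
# so psi(z) = (-1,) when violated and (0,) when satisfied.
propose = lambda: random.randrange(4)
psi = lambda z: (-(z % 2),)
z, tries = rejection_sample(propose, psi, beta=[1.0])
```

In this toy case the acceptance probability is 0.5 + 0.5·exp(−1), so by (14) the expected number of proposals is its inverse, roughly 1.46; odd z remain possible but are down-weighted by exp(−1), and sending β to ∞ recovers the hard constraint.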
As θ improves, more of the constraints Sj will be satisfied automatically under pθ, allowing us to increase β.

Formally, the expected number of samples is the inverse of the acceptance probability and can be expressed as (see the supplement for details)

(Σ_z pθ,T(z | x, y) exp(β⊤ψ(z, y)))^{-1} = exp(A_T(θ; x, y) − A(θ, β; x, y)).   (14)

We can then minimize the loss L(θ, β) = A(θ; x) + A(β) − A(θ, β; x, y) (see (6)-(7) and (13)) subject to the tractability constraint E_{x,y}[exp(A_T(θ; x, y) − A(θ, β; x, y))] ≤ τ, where τ is our computational budget. While one might have initially worried that rejection sampling will perform poorly, this constraint guarantees that it will perform well by bounding the number of rejections.

Implementation details. To minimize L subject to a constraint on (14), we will develop an EM-like algorithm; the algorithm maintains an inner approximation to the constraint set as well as an upper bound on the loss, both of which will be updated with each iteration of the algorithm. 
These bounds are obtained by linearizing A(θ, β; x, y); more precisely, for any (θ̃, β̃) we have by convexity:

A(θ, β; x, y) ≥ Ã(θ, β; x, y) := A(θ̃, β̃; x, y) + (θ − θ̃)⊤φ̃ + (β − β̃)⊤ψ̃,   (15)

where φ̃ := Σ_z pθ̃,β̃(z | x, y) φ(x, z) and ψ̃ := Σ_z pθ̃,β̃(z | x, y) ψ(z, y).

We thus obtain a bound L̃ on the loss L, as well as a tractability constraint C1, which are both convex:

minimize Ep∗[A(θ; x) + A(β) − Ã(θ, β; x, y)]   (L̃)
subject to Ep∗[exp(A_T(θ; x, y) − Ã(θ, β; x, y))] ≤ τ.   (C1)

We will iteratively solve the above minimization, and then update L̃ and C1 using the minimizing (θ, β) from the previous step. Note that the minimization itself can be done without inference; we only need to do inference when updating φ̃ and ψ̃. Since inference is tractable at (θ̃, β̃) by design, we can obtain unbiased estimates of φ̃ and ψ̃ using the rejection sampler described earlier. We can also estimate A(θ̃, β̃; x, y) at the same time by using samples from pθ̃,T and the relation (14).

A practical issue is that C1 becomes overly stringent when (θ, β) is far away from (θ̃, β̃). It is therefore difficult to make large moves in parameter space, which is especially bad for getting started initially. We can solve this using the trivial constraint

exp(Σ_{j=1}^k βj) ≤ τ,   (C0)

which will also ensure tractability. We use (C0) for several initial iterations, then optimize the rest of the way using (C1). 
To avoid degeneracies at β = 0, we also constrain β ≥ ε in all iterations. We will typically take ε = 1/k, which is feasible for (C0) assuming τ ≥ exp(1).¹

To summarize, we have obtained an iterative algorithm for jointly minimizing L(θ, β), such that pθ,β(z | x, y) always admits efficient rejection sampling. Pseudocode is provided in Algorithm 1; note that all population expectations Ep∗ should now be replaced with sample averages.

Algorithm 1 Minimizing L(θ, β) while guaranteeing tractable inference.
  Input: training data (x(i), y(i)) for i = 1, ..., n.
  Initialize θ̃ = 0, β̃j = ε for j = 1, ..., k.
  while not converged do
    Estimate φ̃(i), ψ̃(i), and A(θ̃, β̃; x(i), y(i)) for i = 1, ..., n by sampling pθ̃,β̃(z | x(i), y(i)).
    Estimate the functions Ã(θ, β; x(i), y(i)) using the output from the preceding step.
    Let (θ̂, β̂) be the solution to
      minimize over (θ, β): (1/n) Σ_{i=1}^n (A(θ; x(i)) + A(β) − Ã(θ, β; x(i), y(i)))
      subject to (C0), βj ≥ ε for j = 1, ..., k.
    Update (θ̃, β̃) ← (θ̂, β̂).
  end while
  Repeat the same loop as above, with the constraint (C0) replaced by (C1).
  Output (θ̃, β̃).

5 Experiments

We now empirically explore our method's behavior. All of our code, data, and experiments may be found on the CodaLab worksheet for this paper at https://www.codalab.org/worksheets/0xc9db508bb80446d2b66cbc8e2c74c052/, which also contains more detailed plots beyond those shown here. 
We would like to answer the following questions:

• Fixed β: For a fixed β, how does the relaxation parameter affect the learned parameters? What is the trade-off between accuracy and computation as we vary β?
• Adapting β: Does optimizing β affect performance? Is the per-coordinate adaptivity of our relaxation advantageous, or can we set all coordinates of β to be equal? How does the computational budget τ (from C0 and C1) impact the optimization?

To answer these questions, we considered using a fixed β (FIXED(β)), optimizing β with a computational constraint τ (ADAPTFULL(τ)), and performing the same optimization with all coordinates of β constrained to be equal (ADAPTTIED(τ)). For optimization, we used Algorithm 1, using S = 50 samples to approximate each φ̃(i) and ψ̃(i), and using the solver SNOPT [15] for the inner optimization. We ran Algorithm 1 for 50 iterations; when β is not fixed, we apply the constraint (C0) for the first 10 iterations and (C1) for the remaining 40 iterations; when it is fixed, we do not apply any constraint.

Figure 2: (a) Accuracy versus computation (measured by number of samples drawn by the rejection sampler) for the unordered translation task. (b) Corresponding plot for the conjunctive semantic parsing task. For both tasks, the FIXED method needs an order of magnitude more samples to achieve comparable accuracy to either adaptive method.

¹If only some of the constraints Sj are active for each y (e.g., for translation we only have to worry about the words that actually appear in the output sentence), then we need only include those βj in the sum for (C0). This can lead to substantial gains, since now k is effectively the sentence length rather than the vocabulary size.

Unordered translation. 
We first consider the translation task from Example 2.2. Recall that we are given a vocabulary [V] := {1, ..., V}, and wish to recover an unknown 1-to-1 substitution cipher c : [V] → [V]. Given an input sentence x_{1:L}, the latent z is the result of applying c, where zi is c(xi) with probability 1 − δ and uniform over [V] with probability δ. To model this, we define a feature φ_{u,v}(x, z) that counts the number of times that xi = u and zi = v; hence, pθ(z | x) ∝ exp(Σ_{i=1}^L θ_{xi,zi}). Recall also that the output y = multiset(z).

In our experiments, we generated n = 100 sentences of length L = 20 with vocabulary size V = 102. For each pair of adjacent words (x_{2i−1}, x_{2i}), we set x_{2i−1} = 3j + 1 with j drawn from a power law distribution on {0, ..., V/3 − 1} with exponent r ≥ 0; we then set x_{2i} to 3j + 2 or 3j + 3 with equal probability. This ensures that there are pairs of words that co-occur often (without which the constraint T would already solve the problem).

We set r = 1.2 and δ = 0.1, which produces a moderate range of word frequencies as well as a moderate noise level (we also considered setting either r or δ to 0, but omitted these results because essentially all methods achieved ceiling accuracy; the interested reader may find them in our CodaLab worksheet). We set the computational budget τ = 50 for the constraints C0 and C1, and ε = 1/L as the lower bound on β. To measure accuracy, we look at the fraction of words whose modal prediction under the model corresponds to the correct mapping.

We plot accuracy versus computation (i.e., the cumulative number of samples drawn by the rejection sampler up through the current iteration) in Figure 2a; note that the number of samples is plotted on a log-scale. 
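The data-generation scheme described above can be sketched as follows; the power-law form (j + 1)^{-r} and the identity default cipher are our own assumptions for illustration, as the paper does not pin these details down.

```python
import random
from collections import Counter

def sample_sentence(L=20, V=102, r=1.2, delta=0.1, cipher=None, rng=random):
    """Generate one (x, z, y) training triple: adjacent word pairs share a
    power-law-distributed group j, z applies a noisy 1-to-1 cipher, and the
    observed output y = multiset(z) discards word order."""
    if cipher is None:
        cipher = list(range(1, V + 1))  # identity cipher as an illustrative stand-in
    groups = V // 3
    weights = [(j + 1) ** (-r) for j in range(groups)]  # assumed power-law form
    x = []
    for _ in range(L // 2):
        j = rng.choices(range(groups), weights=weights)[0]
        x += [3 * j + 1, 3 * j + rng.choice([2, 3])]    # frequently co-occurring pair
    z = [cipher[xi - 1] if rng.random() > delta else rng.randrange(1, V + 1)
         for xi in x]                                   # noisy substitution
    return x, z, Counter(z)
```

Each sampled triple supplies only (x, y) to the learner; recovering the cipher from the unordered y is exactly the weak supervision problem studied above.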
For the FIXED methods, there is a clear trade-off between computation and accuracy, with multiplicative increases in computation needed to obtain additive increases in accuracy. The adaptive methods completely surpass this trade-off curve, achieving higher accuracy than FIXED(0.8) while using an order of magnitude less computation. The ADAPTFULL and ADAPTTIED methods achieve similar results to each other; in both cases, all coordinates of β eventually obtained their maximum value of 5.0, which we set as a cap for numerical reasons, and which corresponds closely to imposing the exact supervision signal.

Conjunctive semantic parsing. We also ran experiments on the semantic parsing task from Example 2.3. We used vocabulary size V = 150, and represented each predicate Q as a subset of [U], where U = 300. The five most common words in [V] mapped to the empty predicate Q = [U], and the remaining words mapped to a random subset of 85% of [U]. We used n = 100 and sentence length L = 25. Each word in the input was drawn independently from a power law with r = 0.8. A word was mapped to its correct predicate with probability 1 − δ and to a uniformly random predicate with probability δ, with δ = 0.1. We constrained the denotation y = ⟦z⟧ to have non-zero size by regenerating each example until this constraint held. We used the same model pθ(z | x) as before, and again measured accuracy based on the fraction of the vocabulary for which the modal prediction was correct. We set τ = 50, 100, 200 to compare the effect of different computational budgets.

Results are shown in Figure 2b. 
Once again, the adaptive methods substantially outperform the FIXED methods. We also see that the accuracy of the algorithm is relatively invariant to the computational budget τ: indeed, for all of the adaptive methods, all coordinates of β eventually reached their maximum value, meaning that we were always using the exact supervision signal by the end of the optimization. These results are broadly similar to those for the translation task, suggesting that our method generalizes across tasks.

6 Related Work and Discussion

For a fixed relaxation β, our loss L(θ, β) is similar to the Jensen risk bound defined by Gimpel and Smith [16]. For varying β, our framework is similar in spirit to annealing, where the entire objective is relaxed by exponentiation, and the relaxation is reduced over time. An advantage of our method is that we do not have to pick a fixed annealing schedule; the schedule falls out of learning, and moreover, each constraint can be annealed at its own pace.
Under model well-specification, optimizing the relaxed likelihood recovers the same distribution as optimizing the original likelihood. In this sense, our approach is similar in spirit to approaches such as pseudolikelihood [17, 18] and, more distantly, reward shaping in reinforcement learning [19].
There has been considerable past interest in specifying and learning under constraints on model predictions, leading to a family of ideas including constraint-driven learning [11], generalized expectation criteria [20, 21], Bayesian measurements [22], and posterior regularization [23]. These ideas are nicely summarized in Section 4 of [23], and involve relaxing the constraint either by using a variational approximation or by applying the constraint in expectation rather than pointwise (e.g., replacing the constraint h(x, z, y) ≥ 1 with E[h(x, z, y)] ≥ 1).
This leads to tractable inference when the function h can be tractably incorporated as a factor in the model, which is the case for many problems of interest (including the translation task in this paper). In general, however, inference will be intractable even under the relaxation, or the relaxation could lead to different learned parameters; this motivates our framework, which handles a more general class of problems and has asymptotic consistency of the learned parameters.
The idea of learning with explicit constraints on computation appears in the context of prioritized search [24], MCMC [25, 26], and dynamic feature selection [27, 28, 29]. These methods focus on keeping the model tractable; in contrast, we assume a tractable model and focus on the supervision. While the parameters of the model can be informed by the supervision, relaxing the supervision as we do could fundamentally alter the learning process, and requires careful analysis to ensure that we stay grounded to the data. As an analogy, consider driving a car with a damaged steering wheel (approximate model) versus not being able to see the road (approximate supervision); intuitively, the latter appears to pose a more fundamental challenge.
Intractable supervision is a key bottleneck in many applications, and will only become more so as we incorporate more sophisticated logical constraints into our statistical models. While we have laid down a framework that grapples with this issue, there is much to be explored, e.g., deriving stochastic updates for optimization, as well as tractability constraints for more sophisticated inference methods.
Acknowledgments. The first author was supported by a Fannie & John Hertz Fellowship and an NSF Graduate Research Fellowship. The second author was supported by a Microsoft Research Faculty Fellowship. We are also grateful to the referees for their valuable comments.

References
[1] J. Clarke, D. Goldwasser, M.
Chang, and D. Roth. Driving semantic parsing from the world\u2019s response.\n\nIn Computational Natural Language Learning (CoNLL), pages 18\u201327, 2010.\n\n[2] P. Liang, M. I. Jordan, and D. Klein. Learning dependency-based compositional semantics. In Association\n\nfor Computational Linguistics (ACL), pages 590\u2013599, 2011.\n\n[3] Y. Artzi and L. Zettlemoyer. Weakly supervised learning of semantic parsers for mapping instructions to\n\nactions. Transactions of the Association for Computational Linguistics (TACL), 1:49\u201362, 2013.\n\n[4] M. Fisher, D. Ritchie, M. Savva, T. Funkhouser, and P. Hanrahan. Example-based synthesis of 3D object\n\narrangements. ACM SIGGRAPH Asia, 12, 2012.\n\n[5] V. Mansinghka, T. D. Kulkarni, Y. N. Perov, and J. Tenenbaum. Approximate Bayesian image interpre-\ntation using generative probabilistic graphics programs. In Advances in Neural Information Processing\nSystems (NIPS), pages 1520\u20131528, 2013.\n\n[6] A. X. Chang, M. Savva, and C. D. Manning. Learning spatial knowledge for text to 3D scene generation.\n\nIn Empirical Methods in Natural Language Processing (EMNLP), 2014.\n\n[7] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled\n\ndata. In Association for Computational Linguistics (ACL), pages 1003\u20131011, 2009.\n\n[8] S. Riedel, L. Yao, and A. McCallum. Modeling relations and their mentions without labeled text. In\n\nMachine Learning and Knowledge Discovery in Databases (ECML PKDD), pages 148\u2013163, 2010.\n\n[9] S. Gulwani. Automating string processing in spreadsheets using input-output examples. ACM SIGPLAN\n\nNotices, 46(1):317\u2013330, 2011.\n\n[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller,\nA. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature,\n518(7540):529\u2013533, 2015.\n\n[11] M. Chang, L. Ratinov, and D. Roth. 
Guiding semi-supervision with constraint-driven learning. In Association for Computational Linguistics (ACL), pages 280–287, 2007.
[12] J. Graça, K. Ganchev, and B. Taskar. Expectation maximization and posterior constraints. In NIPS, 2008.
[13] A. W. van der Vaart. Asymptotic statistics. Cambridge University Press, 1998.
[14] F. Nielsen and V. Garcia. Statistical exponential families: A digest with flash cards. arXiv preprint arXiv:0911.4863, 2009.
[15] P. E. Gill, W. Murray, and M. A. Saunders. SNOPT: An SQP algorithm for large-scale constrained optimization. SIAM Journal on Optimization, 12(4):979–1006, 2002.
[16] K. Gimpel and N. A. Smith. Softmax-margin CRFs: Training log-linear models with cost functions. In North American Association for Computational Linguistics (NAACL), pages 733–736, 2010.
[17] J. Besag. The analysis of non-lattice data. The Statistician, 24:179–195, 1975.
[18] P. Liang and M. I. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In International Conference on Machine Learning (ICML), pages 584–591, 2008.
[19] A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning (ICML), 1999.
[20] G. Mann and A. McCallum. Generalized expectation criteria for semi-supervised learning of conditional random fields. In HLT/ACL, pages 870–878, 2008.
[21] G. Druck, G. Mann, and A. McCallum. Learning from labeled features using generalized expectation criteria. In ACM Special Interest Group on Information Retrieval (SIGIR), pages 595–602, 2008.
[22] P. Liang, M. I. Jordan, and D. Klein. Learning from measurements in exponential families. In International Conference on Machine Learning (ICML), 2009.
[23] K. Ganchev, J. Graça, J. Gillenwater, and B. Taskar.
Posterior regularization for structured latent variable models. Journal of Machine Learning Research (JMLR), 11:2001–2049, 2010.
[24] J. Jiang, A. Teichert, J. Eisner, and H. Daume. Learned prioritization for trading off accuracy and speed. In Advances in Neural Information Processing Systems (NIPS), 2012.
[25] T. Shi, J. Steinhardt, and P. Liang. Learning where to sample in structured prediction. In AISTATS, 2015.
[26] J. Steinhardt and P. Liang. Learning fast-mixing models for structured prediction. In ICML, 2015.
[27] H. He, H. Daume, and J. Eisner. Cost-sensitive dynamic feature selection. In ICML Inferning Workshop, 2012.
[28] H. He, H. Daume, and J. Eisner. Dynamic feature selection for dependency parsing. In EMNLP, 2013.
[29] D. J. Weiss and B. Taskar. Learning adaptive value of information for structured prediction. In Advances in Neural Information Processing Systems (NIPS), pages 953–961, 2013.