{"title": "Efficient Gradient Computation for Structured Output Learning with Rational and Tropical Losses", "book": "Advances in Neural Information Processing Systems", "page_first": 6810, "page_last": 6821, "abstract": "Many structured prediction problems admit a natural loss function for evaluation such as the edit-distance or $n$-gram loss. However, existing learning algorithms are typically designed to optimize alternative objectives such as the cross-entropy. This is because a na\\\"{i}ve implementation of the natural loss functions often results in intractable gradient computations. In this paper, we design efficient gradient computation algorithms for two broad families of structured prediction loss functions: rational and tropical losses. These families include as special cases the $n$-gram loss, the edit-distance loss, and many other loss functions commonly used in natural language processing and computational biology tasks that are based on sequence similarity measures. Our algorithms make use of weighted automata and graph operations over appropriate semirings to design efficient solutions. They facilitate efficient gradient computation and hence enable one to train learning models such as neural networks with complex structured losses.", "full_text": "Ef\ufb01cient Gradient Computation for Structured\n\nOutput Learning with Rational and Tropical Losses\n\nCorinna Cortes\nGoogle Research\n\nNew York, NY 10011\ncorinna@google.com\n\nVitaly Kuznetsov\nGoogle Research\n\nNew York, NY 10011\nvitalyk@google.com\n\nDmitry Storcheus\n\nCourant Institute and Google Research\n\nNew York, NY 10012\n\ndstorcheus@google.com\n\nAbstract\n\nMehryar Mohri\n\nCourant Institute and Google Research\n\nNew York, NY 10012\nmohri@cims.nyu.edu\n\nScott Yang\u2217\n\nD. E. Shaw and Co.\nNew York, NY 10036\nyangs@cims.nyu.edu\n\nMany structured prediction problems admit a natural loss function for evaluation\nsuch as the edit-distance or n-gram loss. 
However, existing learning algorithms\nare typically designed to optimize alternative objectives such as the cross-entropy.\nThis is because a na\u00efve implementation of the natural loss functions often results in\nintractable gradient computations. In this paper, we design ef\ufb01cient gradient com-\nputation algorithms for two broad families of structured prediction loss functions:\nrational and tropical losses. These families include as special cases the n-gram loss,\nthe edit-distance loss, and many other loss functions commonly used in natural\nlanguage processing and computational biology tasks that are based on sequence\nsimilarity measures. Our algorithms make use of weighted automata and graph\noperations over appropriate semirings to design ef\ufb01cient solutions. They facilitate\nef\ufb01cient gradient computation and hence enable one to train learning models such\nas neural networks with complex structured losses.\n\n1\n\nIntroduction\n\nMany important machine learning tasks are instances of structured prediction problems. These are\nlearning problems where the output labels admit some structure that is important to take into account\nboth for statistical and computational reasons. Structured prediction problems include most natural\nlanguage processing tasks, such as pronunciation modeling, part-of-speech tagging, context-free\nparsing, dependency parsing, machine translation, speech recognition, where the output labels are\nsequences of phonemes, part-of-speech tags, words, parse trees, or acyclic graphs, as well as other\nsequence modeling tasks in computational biology. 
They also include a variety of problems in\ncomputer vision such as image segmentation, feature detection, object recognition, motion estimation,\ncomputational photography and many others.\nSeveral algorithms have been designed in the past for structured prediction tasks, including Con-\nditional Random Fields (CRFs) (Lafferty et al., 2001; Gimpel and Smith, 2010), StructSVMs\n(Tsochantaridis et al., 2005), Maximum-Margin Markov Networks (M3N) (Taskar et al., 2003),\nkernel-regression-based algorithms (Cortes et al., 2007), and search-based methods (Daum\u00e9 III et al.,\n2009; Doppa et al., 2014; Lam et al., 2015; Chang et al., 2015; Ross et al., 2011). More recently, deep\nlearning techniques have been designed for many structured prediction tasks, including part-of-speech\n\n\u2217Work done at the Courant Institute of Mathematical Sciences.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\ftagging (Jurafsky and Martin, 2009; Vinyals et al., 2015a), named-entity recognition (Nadeau and\nSekine, 2007), machine translation (Zhang et al., 2008; Wu et al., 2016), image segmentation (Lucchi\net al., 2013), and image annotation (Vinyals et al., 2015b).\nMany of these algorithms have been successfully used with speci\ufb01c loss functions such as the\nHamming loss. Their use has been also extended to multivariate performance measures such as\nPrecision/Recall or F1-score (Joachims, 2005), which depend on predictions on all training points.\nHowever, the natural loss function relevant to a structured prediction task, which may be the n-gram\nloss, the edit-distance loss, or some sequence similarity-based loss, is otherwise often ignored. Instead,\nan alternative measure such as the cross-entropy is used. 
This is typically due to computational\nef\ufb01ciency reasons: a key subroutine within the main optimization such as one requiring to determine\nthe most violating constraint may be computationally intractable, the gradient may not admit a closed-\nform or may seem dif\ufb01cult to compute, as it may involve sums over a number of terms exponential in\nthe size of the input alphabet, with each term in itself being a large non-trivial computational task.\nSeveral techniques have been suggested in the past to address this issue. They include Minimum\nRisk Training (MRT) (Och, 2003; Shen et al., 2016), which seeks to optimize the natural objective\ndirectly but relies on sampling or focusing on only the top-n structured outputs to make the problem\ncomputationally tractable. REINFORCE-based methods (Ranzato et al., 2015; Wu et al., 2016)\nalso seek to optimize the natural loss function by de\ufb01ning an unbiased stochastic estimate of the\nobjective, thereby making the problem computationally tractable. While these publications have\ndemonstrated that training directly with the natural loss function yields better results than using a\nna\u00efve loss function, their solutions naturally suffer from issues such as high variance in the gradient\nestimate, in the case of sampling, or bias in the case of top-n. Moreover, REINFORCE methods often\nhave to feed the ground-truth at training time, which is inconsistent with the underlying theory.\nAnother technique has consisted of designing computationally more tractable surrogate loss functions\ncloser to the natural loss function (Ranjbar et al., 2013; Eban et al., 2017). These publications\nalso report improved performance using an objective closer to the natural loss, while admitting the\ninherent issue of not optimizing the desired metric. McAllester et al. (2010) propose a perceptron-like\nupdate in the special case of linear models in structured prediction problems, which avoids the use of\nsurrogate losses. 
However, while they show that direct loss minimization admits some asymptotic\nstatistical bene\ufb01ts, each update in their work requires solving an argmax problem for which the\nauthors do not give an algorithm and that is known to be computationally hard in general, particularly\nfor non-additive losses.\nThis paper is strongly motivated by much of this previous work, which reports empirical bene\ufb01ts for\nusing the natural loss associated to the task. We present ef\ufb01cient gradient computation algorithms\nfor two broad families of structured prediction loss functions: rational and tropical losses. These\nfamilies include as special cases the n-gram loss, the edit-distance loss, and many other loss functions\ncommonly used in natural language processing and computational biology tasks that are based on\nsequence similarity measures. Our algorithms make use of weighted automata and graph operations\nover appropriate semirings to design ef\ufb01cient solutions that circumvent the na\u00efve computation of\nexponentially sized sums in gradient formula.\nOur algorithms enable one to train learning models such as neural networks with complex structured\nlosses. When combined with the recent developments in automatic differentiation, e.g. CNTK (Seide\nand Agarwal, 2016), MXNet (Chen et al., 2015), PyTorch (Paszke et al., 2017), and TensorFlow\n(Abadi et al., 2016), they can be used to train structured prediction models such as neural networks\nwith the natural loss of the task. In particular, the use of our techniques for the top layer of neural\nnetwork models can further accelerate progress in end-to-end training (Amodei et al., 2016; Graves\nand Jaitly, 2014; Wu et al., 2016).\nFor problems with limited data, e.g. uncommon languages or some biological problems, our work\novercomes the computational bottleneck, uses the exact loss function, and renders the amount of data\navailable the next hurdle for improved performance. 
For extremely large-scale problems with more data than can be processed, we further present an approximate truncated shortest-path algorithm that can be used for fast approximate gradient computations of the edit-distance.
The rest of the paper is organized as follows. In Section 2, we briefly describe structured prediction problems and algorithms, discuss their learning objectives, and point out the challenge of gradient computation. Section 3 defines several weighted automata and transducer operations that we use to design efficient algorithms for gradient-based learning. In Sections 4 and 5, we give general algorithms for computing the gradient of rational and tropical loss functions, respectively. In Section 6, we report the results of experiments verifying the improvement due to using our efficient methods compared to a naïve implementation. Further details regarding weighted automata and transducer operations and training recurrent neural networks with the structured objective are presented in Appendix A and Appendix B.

2 Gradient computation in structured prediction

In this section, we introduce the structured prediction learning problem. We start by defining the learning scenario, including the relevant loss functions and features. We then discuss the hypothesis sets and the forms of the objective function used by many structured prediction algorithms, which leads us to describe the problem of computing their gradients.

2.1 Structured prediction learning scenario

We consider the supervised learning setting, in which the learner receives a labeled sample S = {(x1, y1), . . . , (xm, ym)} drawn i.i.d. from some unknown distribution over X × Y, where X denotes the input space and Y the output space. In structured prediction, we assume that elements of the output space Y can be decomposed into possibly overlapping substructures y = (y^1, . . . , y^l). 
We further assume that the loss function L : Y × Y → R+ can similarly be decomposed along these substructures. Some key examples of loss functions relevant to our work are the Hamming loss, the n-gram loss, and the edit-distance loss.
The Hamming loss is defined for all y = (y^1, . . . , y^l) and y′ = (y′^1, . . . , y′^l) by L(y, y′) = Σ_{k=1}^{l} 1_{y^k ≠ y′^k}, with y^k, y′^k ∈ Y_k. The edit-distance loss is commonly used in natural language processing (NLP) applications where Y is a set of sequences defined over a finite alphabet, and the loss function between two sequences y and y′ is defined as the minimum cost of a sequence of edit operations, typically insertions, deletions, and substitutions, that transform y into y′. The n-gram loss is defined as the negative inner product (or its logarithm) of the vectors of n-gram counts of two sequences. This can serve as an approximation to the BLEU score, which is commonly used in machine translation.
We assume that the learner has access to a feature mapping Ψ : X × Y → R^N. This mapping can be either a vector of manually designed features, as in the application of the CRF algorithm, or the differentiable output of the penultimate layer of an artificial neural network. In practice, feature mappings that correspond to the inherent structure of the input space X combined with the structure of Y can be exploited to derive effective and efficient algorithms. 
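The Hamming and n-gram losses above admit direct implementations. The following minimal sketch (function names and the raw, non-logarithmic inner product are illustrative choices; sequences are represented as Python strings) computes both:

```python
from collections import Counter

def hamming_loss(y, y_prime):
    """Number of positions at which two equal-length sequences differ."""
    assert len(y) == len(y_prime)
    return sum(a != b for a, b in zip(y, y_prime))

def ngram_loss(y, y_prime, n=2):
    """Negative inner product of the vectors of (overlapping) n-gram
    counts of the two sequences."""
    counts = Counter(y[i:i + n] for i in range(len(y) - n + 1))
    counts_p = Counter(y_prime[i:i + n] for i in range(len(y_prime) - n + 1))
    return -sum(counts[g] * counts_p[g] for g in counts)
```

For example, hamming_loss("abab", "abba") is 2, and ngram_loss("abab", "ab") is −2, since the bigram "ab" occurs twice in the first string and once in the second.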
As mentioned previously, a common case in structured prediction is when Y is a set of sequences of length l over a finite alphabet Δ. This is the setting that we will consider, as other structured prediction problems can often be treated similarly.
We further assume that Ψ admits a Markovian property of order q, that is, for any (x, y) ∈ X × Y, Ψ(x, y) can be decomposed as Ψ(x, y) = Σ_{s=1}^{l} ψ(x, y_{s−q+1:s}, s), for some position-dependent feature vector function ψ defined over X × Δ^q × [l], where the shorthand y_{s:s′} = (y^s, . . . , y^{s′}) stands for the substring of y starting at index s and ending at s′. For convenience, for s ≤ 0, we define y^s to be the empty string ε. This Markovian assumption is commonly adopted in structured prediction problems such as NLP (Manning and Schütze, 1999). In particular, it holds for feature mappings that are frequently used in conjunction with the CRF, as well as for outputs of a recurrent neural network reset at the beginning of each new input (see Appendix B).

2.2 Objective function and gradient computation

The hypothesis set we consider is that of linear functions h : (x, y) ↦ w · Ψ(x, y) based on the feature mapping Ψ. The empirical loss R̂_S(h) = (1/m) Σ_{i=1}^{m} L(h(xi), yi) associated to a hypothesis h is often not differentiable in structured prediction since the loss function admits discrete values. Taking the expectation over the distribution induced by the log-linear model, as in (Gimpel and Smith, 2010, Equation 5), does not help resolve this issue, since the method does not result in an upper bound on the empirical loss and does not admit favorable generalization guarantees. Instead, as in the familiar binary classification scenario, one can resort to upper-bounding the loss with a differentiable (convex) surrogate. 
For instance, by Lemma 4 of (Cortes et al., 2016), R̂_S(h) can be upper-bounded by the following objective function:

F(w) = (1/m) Σ_{i=1}^{m} log [ Σ_{y∈Y} e^{L(y, yi) − w·(Ψ(xi, yi) − Ψ(xi, y))} ],   (1)

which, modulo a regularization term, coincides with the objective function of the CRF. Note that this expression has also been presented as the softmax margin (Gimpel and Smith, 2010) and the reward-augmented maximum likelihood (Norouzi et al., 2016). Both of these references demonstrate strong empirical evidence for this choice of objective function (in addition to the theoretical results presented in (Cortes et al., 2016)).
Our focus in this work is on an efficient computation of the gradient of this objective function. Since the computation of the subgradient of the regularization term often does not pose any issues, we will only consider the unregularized part of the objective. For any w and i ∈ [m], let Fi(w) denote the contribution of the i-th training point to the objective function F. A standard gradient descent-based method would sum up all or a subset (mini-batch) of the gradients ∇Fi(w). As shown in Lemma 15 of (Cortes et al., 2016), the gradient ∇Fi(w) can be expressed as follows at any w:

∇Fi(w) = (1/m) Σ_{s=1}^{l} Σ_{z∈Δ^q} Qw(z, s) ψ(xi, z, s) − Ψ(xi, yi)/m,

where, for all z ∈ Δ^q and s ∈ [l], Qw(z, s) is defined by

Qw(z, s) = (1/Zw) Σ_{y : y_{s−q+1:s} = z} e^{L(y, yi) + w·Ψ(xi, y)}   and   Zw = Σ_{y∈Y} e^{L(y, yi) + w·Ψ(xi, y)}.

The bottleneck in the gradient computation is the evaluation of Qw(z, s), for all z ∈ Δ^q and s ∈ [l]. There are l|Δ|^q such terms and each term Qw(z, s) is defined by a sum over the |Δ|^{l−q} sequences y of length l with a fixed substring z of length q. 
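For intuition, Qw(z, s) can be evaluated naively by direct enumeration. In the sketch below, `loss` and `score` are hypothetical stand-ins for L(·, yi) and w · Ψ(xi, ·), and a padding symbol plays the role of the empty string for positions s ≤ 0; the enumeration makes the exponential blow-up explicit:

```python
import math
from itertools import product

def brute_force_Q(alphabet, l, q, loss, score, z, s, pad="_"):
    """Naive evaluation of Q_w(z, s) by enumerating all |alphabet|^l
    sequences of length l -- exact, but exponential in l.
    `loss(y)` and `score(y)` stand in for L(y, y_i) and w . Psi(x_i, y)."""
    numerator, Z = 0.0, 0.0
    for chars in product(alphabet, repeat=l):
        y = "".join(chars)
        weight = math.exp(loss(y) + score(y))
        Z += weight
        padded = pad * (q - 1) + y
        if padded[s - 1:s - 1 + q] == z:   # does y_{s-q+1:s} equal z?
            numerator += weight
    return numerator / Z
```

Already for l = 10 and |Δ| = 26 this enumerates 26^10 sequences, whereas the automata-based algorithms of Sections 4 and 5 work with objects of size O(l|Δ|^q).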
A straightforward computation of these terms following\ntheir de\ufb01nition would therefore be computationally expensive. To avoid that computational cost,\nmany existing learning algorithms for structured prediction, including most of those mentioned in the\nintroduction, resort to further approximations and omit the loss L from the de\ufb01nition of Qw(z, s).\nCombining that with the Markovian structure of \u03a8 can then lead to ef\ufb01cient gradient computations.\nOf course, the caveat of this approach is that it ignores the key component of the learning problem,\nnamely the loss function.\nIn what follows, we will present ef\ufb01cient algorithms for the exact computation of the terms Qw(z, s),\nwith their full de\ufb01nition, including the loss function. This leads to an ef\ufb01cient computation of the\ngradients \u2207Fi, which can be used as input to back-propagation algorithms that would enable us to\ntrain neural network models with structured prediction losses.\nThe gradient computation methods we present apply to the Hamming loss, n-gram loss, and edit-\ndistance loss, and more generally to two broad families of losses that can be represented by weighted\n\ufb01nite-state transducers (WFSTs). This covers many losses based on sequence similarity measures\nthat are used in NLP and computational biology applications (Cortes et al., 2004; Sch\u00f6lkopf et al.,\n2004).\nWe brie\ufb02y describe the WFST operations relevant to our solutions in the following section and\nprovide an example of how the edit-distance loss can be represented with a WFST in Section 5.\n\n3 Weighted automata and transducers\n\nWeighted \ufb01nite automata (WFA) and weighted \ufb01nite-state transducers (WFST) are fundamental\nconcepts and representations widely used in computer science (Mohri, 2009). 
We will use WFAs and\nWFSTs to devise algorithms that ef\ufb01ciently compute gradients of structured prediction objectives.\nThis section introduces some standard concepts and notation for WFAs and WFSTs. We provide\nadditional details in Appendix A. For a more comprehensive treatment of these topics, we refer the\nreader to (Mohri, 2009).\n\n4\n\n\fFigure 1: Bigram transducer Tbigram over the semiring (R+ \u222a {+\u221e}, +,\u00d7, 0, 1) for the alphabet \u2206 = {a, b}.\nThe weight of each transition (or that of a \ufb01nal state) is indicated after the slash separator. For example, for any\nstring y and bigram u, Tbigram(y, u) is equal to the number of occurrences of u in y (Cortes et al., 2015).\nDe\ufb01nition. A weighted \ufb01nite-state transducer T over a semiring (S,\u2295,\u2297, 0, 1) is an 8-tuple\n(\u03a3, \u2206, Q, I, F, E, \u03bb, \u03c1) where \u03a3 is a \ufb01nite input alphabet, \u2206 is a \ufb01nite output alphabet, Q is a\n\ufb01nite set of states, I \u2286 Q is the set of initial states, F \u2286 Q is the set of \ufb01nal states, E is a \ufb01nite\nmultiset of transitions, which are elements of Q \u00d7 (\u03a3 \u222a {\u0001}) \u00d7 (\u2206 \u222a {\u0001}) \u00d7 S \u00d7 Q, \u03bb : I \u2192 S is an\ninitial weight function, and \u03c1 : F \u2192 S is a \ufb01nal weight function. A weighted \ufb01nite automaton is a\nweighted \ufb01nite-state transducer where the input and output labels are the same. See Figures 1 and 3\nfor some examples.\nFor many operations to be well de\ufb01ned, the weights of a WFST must belong to a semiring\n(S,\u2295,\u2297, 0, 1). We provide a formal de\ufb01nition of a semiring in Appendix A. In this work, we\nconsider two semirings: the probability semiring (R+ \u222a {+\u221e}, +,\u00d7, 0, 1) and the tropical semiring\n(R \u222a {\u2212\u221e, +\u221e}, min, +, +\u221e, 0). The \u2297-operation is used to compute the weight of a path by\n\u2297-multiplying the weights of the transitions along that path. 
The \u2295-operation is used to compute the\nweight of a pair of input and output strings (x, y) by \u2295-summing the weights of the paths labeled\nwith (x, y). We denote this weight by T(x, y).\nAs shown in Sections 4 and 5, in many useful cases, we can reduce the computation of the loss\nfunction L(y, y(cid:48)) between two strings y and y(cid:48), along with the gradient of the corresponding objective\ndescribed in (1), to that of the \u2295-sum of the weights of all paths labeled by y:y(cid:48) in a suitably de\ufb01ned\ntransducer over either the probability or tropical semiring. We will use the following standard WFST\noperations to construct these transducers: inverse (T\u22121), projection (\u03a0(T)), composition (T1 \u25e6 T2),\nand determinization (Det(A)). The de\ufb01nitions of these operations are given in Appendix A.\n\n4 An ef\ufb01cient algorithm for the gradient computation of rational losses\n\nAs discussed in Section 2, computing Qw(z, s) is the main bottleneck in the gradient computation.\nIn this section, we give an ef\ufb01cient algorithm for computing Qw(z, s) that works for an arbitrary\nrational loss, which includes as a special case the n-gram loss and other sequence similarity-based\nlosses. We \ufb01rst present the de\ufb01nition of a rational loss and show how the n-gram loss can be encoded\nas a speci\ufb01c rational loss. Then, we present our gradient computation algorithm.\nLet (R+ \u222a {+\u221e}, +,\u00d7, 0, 1) be the probability semiring and let U be a WFST over the probability\nsemiring admitting \u2206 as both the input and output alphabet. Then, following (Cortes et al., 2015),\nthe rational loss associated to U is the function LU : \u2206\u2217 \u00d7 \u2206\u2217 \u2192 R \u222a {\u2212\u221e, +\u221e} de\ufb01ned for all\nnegative logarithm of the inner product of the vectors of n-gram counts of y and y(cid:48). 
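The role of the two semirings can be made concrete with a generic single-source shortest-distance computation. The sketch below (names are illustrative; states are assumed numbered in topological order) computes the ⊕-sum over all paths of the ⊗-product of transition weights, so the same routine yields a total path weight under the probability semiring and a shortest path under the tropical semiring:

```python
def path_weight_sum(n_states, edges, source, target, plus, times, zero, one):
    """Generic shortest-distance over a semiring (plus, times, zero, one)
    on an acyclic graph whose states 0..n_states-1 are in topological
    order; edges maps a state to its outgoing (next_state, weight) pairs."""
    d = [zero] * n_states
    d[source] = one
    for p in range(n_states):
        for q, w in edges.get(p, []):
            d[q] = plus(d[q], times(d[p], w))
    return d[target]

# Two parallel paths: 0 -> 1 -> 3 with weights 2, 4, and 0 -> 2 -> 3
# with weights 3, 5.
edges = {0: [(1, 2.0), (2, 3.0)], 1: [(3, 4.0)], 2: [(3, 5.0)]}
# Probability semiring (+, x): total path weight 2*4 + 3*5 = 23.
total = path_weight_sum(4, edges, 0, 3, lambda a, b: a + b,
                        lambda a, b: a * b, 0.0, 1.0)
# Tropical semiring (min, +): shortest path min(2+4, 3+5) = 6.
shortest = path_weight_sum(4, edges, 0, 3, min,
                           lambda a, b: a + b, float("inf"), 0.0)
```

Only the semiring operations change between the two calls; the graph traversal is identical, which is precisely what makes the constructions of Sections 4 and 5 parallel.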
The WFST\nUn-gram of an n-gram loss is obtained by composing a weighted transducer Tn-gram giving the n-gram\ncounts with its inverse T\u22121\nn-gram, that is the transducer derived from Tn-gram by swapping input and\noutput labels for each transition. As an example, Figure 1 shows the WFST Tbigram for bigrams.\nTo compute Qw(z, s) for a rational loss, recall that\n\ny, y(cid:48) \u2208 \u2206\u2217 by LU(y, y(cid:48)) = \u2212 log(cid:0)U(y, y(cid:48))(cid:1). As an example, the n-gram loss of y and y(cid:48) is the\n\nQw(z, s) \u221d (cid:88)\n\ny : ys\u2212q+1:s=z\n\neLU(y,yi)+w\u00b7\u03a8(xi,y).\n\nThus, we will design two WFAs, A and B, such that A(y) = ew\u00b7\u03a8(xi,y), B(y) = eLU(y,yi), and their\ncomposition C(y) = (A \u25e6 B)(y) = eLU(y,yi)+w\u00b7\u03a8(xi,y). To compute Qw from C, we will need to\nsum up the weights of all paths labeled with some substring z, which we will achieve by treating this\nas a \ufb02ow computation problem.\nThe pseudocode of our algorithm for computing the key terms Qw(z, s) for a rational loss is given in\nFigure 2(a).\n\n5\n\n0 a:\u03b5/1b:\u03b5/1 1a:a/1b:b/12/1a:a/1b:b/1 a:\u03b5/1b:\u03b5/1 \fGRAD-RATIONAL(xi, yi, w)\n1 Y \u2190 WFA accepting any y \u2208 \u2206l.\n2 Yi \u2190 WFA accepting yi.\n3 M \u2190 \u03a01(Y \u25e6 U \u25e6 Yi)\n4 M \u2190 Det(M)\n5 B \u2190 INVERSEWEIGHTS(M)\n6 C \u2190 A \u25e6 B\n7 \u03b1 \u2190 DISTFROMINITIAL(C, (+,\u00d7))\n8 \u03b2 \u2190 DISTTOFINAL(C, (+,\u00d7))\n9 Zw \u2190 \u03b2(IC) (cid:46) IC initial state of C\n10\n\u03b1(e) \u00d7 \u03c9(e) \u00d7 \u03b2(e)\n\n11 Qw(z, s) \u2190(cid:88)\n\nfor (z, s) \u2208 \u2206q \u00d7 [l] do\n\n12 Qw(z, s) \u2190 Qw(z, s)/Zw\n\ne\u2208Ez,s\n\n(a)\n\n(b)\n\n(c)\n\n(d)\n\nl(cid:89)\n\nt=1\n\nA(y) = ew\u00b7\u03a8(xi,y) =\n\new\u00b7\u03c8(xi,yt\u2212q+1:t,t).\n\n(cid:110)\n\nFigure 2: (a) Ef\ufb01cient computation of the key terms of the structured gradient for the rational loss. 
For each\ntransition e \u2208 Ez,s, we denote its origin by e, destination by e and weight by \u03c9(e). (b) Illustration of the WFA\nY for \u2206 = {a, b} and l = 3. (c) Illustration of the WFA Yi representing string dac. (d) Illustration of WFA A\nfor q = 2, alphabet \u2206 = (a, b) and string length l = 2. For example, the transition from state (a, 1) to state\n(b, 2) has the label b and weight \u03c9(ab, 2) = ew\u00b7\u03c8(xi,ab,2).\nDesign of A. We want to design a determnistic WFA A such that\n\n(cid:111)\n\nTo accomplish this task, let A be a WFA with the following set of states QA =\n\n(yt\u2212q+1:t, t) : y \u2208\n, with IA = (\u03b5, 0) its single initial state, FA = {(yl\u2212q+1:l, l) : y \u2208 \u2206l} its set of\n\u2206l, t = 0, . . . , l\n\ufb01nal states, and with a transition from state (yt\u2212q+1:t\u22121, t \u2212 1) to state (yt\u2212q+2:t\u22121 b, t) with label b\nand weight \u03c9(yt\u2212q+1:t\u22121 b, t) = ew\u00b7\u03c8(xi,yt\u2212q+1:t\u22121b,t), that is, the following set of transitions:\n: y \u2208 \u2206l, b \u2208 \u2206, t \u2208 [l]\n\n(yt\u2212q+1:t\u22121, t \u2212 1), b, \u03c9(yt\u2212q+1:t\u22121 b, t), (yt\u2212q+2:t\u22121 b, t)\n\n(cid:110)(cid:16)\n\nEA =\n\n(cid:17)\n\n(cid:111)\n\n.\n\nFigure 2(d) illustrates this construction in the case q = 2. Note that the WFA A is deterministic by\nconstruction. Since the weight of a path in A is obtained by multiplying the transition weights along\nthe path, A(y) computes the desired quantity.\nDesign of B. We now design a deterministic WFA B which associates to each sequence y \u2208 \u2206l the\nexponential of the loss eLU(y,yi) = 1/U(y, yi). Let Y denote a WFA over the probability semiring\naccepting the set of all strings of length l with weight one and let Yi denote the WFA accepting only\nyi with weight one. 
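Before turning to B, note that since A is deterministic, A(y) can be evaluated by following the unique accepting path of y. A minimal sketch, in which `log_weight(window, t)` is a hypothetical stand-in for w · ψ(xi, y_{t−q+1:t}, t) and a padding symbol plays the role of the empty prefix:

```python
import math

def A_weight(y, q, log_weight, pad="_"):
    """Weight assigned to y by the deterministic WFA A: the product over
    positions t of exp(log_weight(y_{t-q+1:t}, t)), i.e. e^{w . Psi(xi, y)}
    under the order-q Markovian decomposition."""
    padded = pad * (q - 1) + y
    weight = 1.0
    for t in range(1, len(y) + 1):
        window = padded[t - 1:t - 1 + q]   # context y_{t-q+1:t}, left-padded
        weight *= math.exp(log_weight(window, t))
    return weight
```

With constant log-weights the result is exp(l × c) for a string of length l, matching the path-multiplicative semantics of A.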
Figures 2(b) and 2(c) illustrate the constructions of Y and Yi in some simple\ncases.2 We \ufb01rst use the composition operation for weighted automata and transducers. Then, we\nuse the projection operation on the input, which we denote by \u03a01, to compute the following WFA:\nM = \u03a01(Y \u25e6 U \u25e6 Yi). Recalling that Y(y) = Yi(yi) = 1 by construction and applying the de\ufb01nition\nof WFST composition, we observe that for any y \u2208 \u2206l\nM(y) = (Y\u25e6 U\u25e6 Yi)(y, yi) =\n\nY(z)U(z, z(cid:48))Yi(z(cid:48)) = Y(y)U(y, yi)Yi(yi) = U(y, yi). (2)\n\n(cid:88)\n\nz=y,z(cid:48)=yi\n\n2Note that we do not need to explicitly construct Y, which could be costly when the alphabet size \u2206 is large.\nInstead, we can create its transitions on-the-\ufb02y as demanded by the composition operation. Thus, for the rational\nkernels commonly used, at most the transitions labeled with the alphabet symbols appearing in Yi need to be\ncreated.\n\n6\n\n\fl(cid:89)\n\nt=1\n\nl(cid:89)\n\nt=1\n\n(a)\n\n(b)\n\nFigure 3: (a) Edit-distance transducer Uedit over the tropical semiring, in the case where the substitution cost\nis 1, the deletion cost 2, the insertion cost 3, and the alphabet \u2206 = {a, b}. (b) Smith-Waterman transducer\nUSmith-Waterman over the tropical semiring, in the case where the substitution, deletion and insertion costs are 1,\nand where the matching cost is \u22122, for the alphabet \u2206 = {a, b}.\n\nNext, we can apply weighted determinization (Mohri, 1997) to compute a deterministic WFA\nequivalent to M, denoted by Det(M). By (Cortes et al., 2015)[Theorem 3], Det(M) can be computed\nin polynomial time. Since Det(M) is deterministic and by construction accepts precisely the set\nof strings y \u2208 \u2206l, it admits a unique accepting path labeled with y whose weight is Det(M)(y) =\nM(y) = U(y, yi). The weight of that accepting path is obtained by multiplying the weights of its\ntransitions and that of the \ufb01nal state. 
Let B be the WFA derived from Det(M) by replacing each\nu. Then, by construction, for any y \u2208 \u2206l, we have\ntransition weight or \ufb01nal weight u by its inverse 1\nB(y) = 1\n\nU(y,yi).\n\nCombining A and B. Now consider the WFA C = A \u25e6 B, the composition of A and B. C is\ndeterministic since both A and B are deterministic. Moreover, C can be computed in time O(|A||B|).\nBy de\ufb01nition, for all y \u2208 \u2206l,\n\nC(y) = A(y) \u00d7 B(y) =\n\new\u00b7\u03c8(xi,yt\u2212q+1:t,t) \u00d7\n\n1\n\nU(y, yi)\n\n= eL(y,yi)\n\new\u00b7\u03c8(xi,yt\u2212q+1:t,t).\n\n(3)\n\nTo see how C can be used to compute Qw(z, s), we note \ufb01rst that the states of C can be identi\ufb01ed\nwith pairs (qA, qB) where qA is a state of A, qB is a state of B, and the transitions are obtained by\nmatching a transition in A with one in B. Thus, for any z \u2208 \u2206q and s \u2208 [l], let Ez,s be the set of\ntransitions of C constructed by pairing the transition in A ((z1:q\u22121, s \u2212 1), zq, \u03c9(z, s), (z2:q, s)) with\na transition in B:\n\n(cid:110)(cid:0)(qA, qB), zq, \u03c9, (q(cid:48)\n\nEz,s =\n\nB)(cid:1) \u2208 EC : qA = (z1:q\u22121, s \u2212 1)\n\nA, q(cid:48)\n\n(cid:111)\n\n.\n\n(4)\n\ne\u2208Ez,s\n\nQw(z, s) can be computed as(cid:80)\n\nNote that, since C is deterministic, there can be only one transition leaving a state labeled with zq.\nThus, to de\ufb01ne Ez,s, we only needed to specify the origin state of the transitions.\nFor each transition e \u2208 Ez,s, we denote its origin by e, destination by e and weight by \u03c9(e). Then,\n\u03b1(e) \u00d7 \u03c9(e) \u00d7 \u03b2(e), where \u03b1(e) is the sum of the weights\nof all paths from an initial state of C to e, and \u03b2(e) is the sum of the weights of all paths from e to\na \ufb01nal state of C. 
Since C is acyclic, α and β can be computed for all states in linear time in the size of C using a single-source shortest-distance algorithm over the (+, ×) semiring (Mohri, 2002) or the so-called forward-backward algorithm. We denote these subroutines by DistFromInitial and DistToFinal, respectively, in the pseudocode. Since C admits O(l|Δ|^q) transitions, we can compute all of the quantities Qw(z, s), for s ∈ [l] and z ∈ Δ^q, as well as Zw, in time O(l|Δ|^q).
Note that a natural alternative to the weighted transducer methods presented in this work is to consider junction-tree-type methods for graphical models. However, weighted transducer techniques typically result in more "compact" representations than graphical model methods, and the computational cost of the former can even be exponentially faster than the best one could achieve using the latter (Poon and Domingos, 2011).
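On an acyclic graph, DistFromInitial and DistToFinal reduce to the familiar forward-backward recurrences over the (+, ×) semiring. A compact sketch (names are illustrative; transitions are assumed listed in topological order of their origin, with origin < destination):

```python
def dist_from_initial(n, edges, init):
    """alpha[q]: sum of the weights of all paths from `init` to state q.
    `edges` is a list of (origin, weight, dest), topologically sorted."""
    alpha = [0.0] * n
    alpha[init] = 1.0
    for p, w, q in edges:
        alpha[q] += alpha[p] * w
    return alpha

def dist_to_final(n, edges, finals):
    """beta[p]: sum of the weights of all paths from p to a final state."""
    beta = [0.0] * n
    for f in finals:
        beta[f] = 1.0
    for p, w, q in reversed(edges):
        beta[p] += w * beta[q]
    return beta
```

Given α and β, the contribution of a transition e with weight ω(e) is α(origin of e) × ω(e) × β(destination of e), and summing these contributions over a group of transitions, normalized by Zw = β(initial state), yields the corresponding Qw term.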
\u25e6 U \u25e6 Yi)\n4 M \u2190 Det(M)\n5 B \u2190 EXPONENTIATEWEIGHTS(M)\n6 C \u2190 A \u25e6 B\n7 \u03b1 \u2190 DISTFROMINITIAL(C, (+,\u00d7))\n8 \u03b2 \u2190 DISTTOFINAL(C, (+,\u00d7))\n9 Zw \u2190 \u03b2(IC) (cid:46) IC initial state of C\n10 for (z, s) \u2208 \u2206q \u00d7 [l] do\n\u03b1(e) \u00d7 \u03c9(e) \u00d7 \u03b2(e)\n\n11 Qw(z, s) \u2190(cid:88)\n\n12 Qw(z, s) \u2190 Qw(z, s)/Zw\n\ne\u2208Ez,s\n\n(a)\n\n(b)\n\nFigure 4: (a) Ef\ufb01cient computation of the key terms of the structured gradient for the tropical loss. (b) Factoring\nof the edit-distance transducer. The leftmost \ufb01gure is the edit-distance weighted transducer Uedit over alphabet\n\u03a3 = {a, b}, the center \ufb01gure is a weighted transducer T1, and the rightmost \ufb01gure is a weighted transducer T2\nsuch that Uedit = T1 \u25e6 T2.\n5 An ef\ufb01cient algorithm for the gradient computation of tropical losses\n\nFollowing the treatment in (Cortes et al., 2015), the tropical loss associated to a weighted transducer\nU over the tropical semiring is de\ufb01ned as the function LU : \u2206\u2217 \u00d7 \u2206\u2217 \u2192 R coinciding with U; thus,\nfor all y, y(cid:48) \u2208 \u2206\u2217, LU(y, y(cid:48)) = U(y, y(cid:48)).\nFor examples of weighted transducers over the tropical semiring, see Figures 3(a) and (b).\nOur algorithm for computing Qw(z, s) for a tropical loss, illustrated in Figure 4(a), is similar to\nour algorithm for a rational loss, with the primary difference being that we exponentiate weights\ninstead of invert them in the WFA B. Speci\ufb01cally, we design A just as in Section 4, and we design a\ndeterministic WFA B by \ufb01rst designing Det(M) as in Section 4 and then deriving B from Det(M)\nby replacing each transition weight or \ufb01nal weight u in Det(M) by eu. Then by construction, for any\ny \u2208 \u2206l, B(y) = eU(y,yi). 
Moreover, the composition of A with B yields a WFA C = A ◦ B such that, for all y ∈ ∆^l,

    C(y) = A(y) × B(y) = [∏_{t=1}^{l} e^{w·ψ(x_i, y_{t−q+1:t}, t)}] × e^{U(y, y_i)} = e^{L(y, y_i)} ∏_{t=1}^{l} e^{w·ψ(x_i, y_{t−q+1:t}, t)}.    (5)

As an example, the general edit-distance of two sequences y and y' can, as already described, be computed using U_edit in time O(|y||y'|) (Mohri, 2003). Note that, for further computational optimization, U_edit and U_Smith-Waterman can be computed on-the-fly as demanded by the composition operation, thereby creating only transitions with alphabet symbols appearing in the strings compared. In order to achieve an optimal dependence on the size of the input alphabet, we can also apply factoring to the edit-distance transducer. Figure 4(b) illustrates factoring of the edit-distance transducer over the alphabet Σ = {a, b}, where s is the substitution and deletion symbol and i is the insertion symbol. Note that both T_1 and T_2 are linear in the size of Σ, while U_edit is quadratic in |Σ|. Furthermore, using on-the-fly composition, for any Y_1 and Y_2, we can first compute Y_1 ◦ T_1 and T_2 ◦ Y_2 and then compose the results, achieving time and space complexity in O(|Y_1||Y_2|).

6 Experiments

In this section, we present experiments validating both the computational efficiency of our gradient computation methods and the learning benefits of training with natural loss functions. The experiments in this section should be treated as a proof of concept.
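As a concrete reference for the edit-distance loss used in the experiments below, the value computed by the transducer U_edit with unit-cost insertions, deletions, and substitutions coincides with the classical dynamic program, sketched here (our illustrative helper, not the transducer-based implementation) with the O(|y||y'|) cost quoted above:

```python
def edit_distance(y1, y2):
    """Edit distance with unit-cost insertions, deletions, and
    substitutions -- the value computed by U_edit -- via the standard
    O(|y1| * |y2|) dynamic program."""
    m, n = len(y1), len(y2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                # delete the first i symbols of y1
    for j in range(n + 1):
        d[0][j] = j                # insert the first j symbols of y2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if y1[i - 1] == y2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[m][n]
```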
We defer an extensive study of training structured prediction models on large-scale datasets to future work.

Figure 5: Runtime comparison of efficient versus naïve gradient computation methods for the edit-distance (a), Smith-Waterman (b), and bigram (c) loss functions. The naïve line refers to the average runtime of Grad-Naïve; the efficient line refers to Grad-Tropical for the edit-distance (a) and Smith-Waterman (b) losses, and to Grad-Rational for the bigram (c) loss. Naïve computations are shown only up to string length l = 8.

For the runtime comparison, we randomly generate an input and output data pair (x_i, y_i), both of a given fixed length, as well as a weight vector w, and we compute ∇F_i(w) using both the naïve and the outlined efficient methods. As shown in Section 2, the computationally demanding part of the ∇F_i(w) calculation is evaluating Q_w(z, s) for all s ∈ [l] and z ∈ ∆^q; the other terms are generally unproblematic to compute. We define a procedure Grad-Naïve (see Figure 6 in the appendix) and compare the average runtimes of Grad-Naïve with those of Grad-Efficient for both rational and tropical losses. The efficient algorithms proposed in this work improve upon the Grad-Naïve runtime by eliminating the explicit loop over y ∈ Y and using weighted automata and transducer operations instead. All the weighted automata and transducer computations required for Grad-Rational and Grad-Tropical are implemented using OpenFst (Allauzen et al., 2007). More specifically, we use an alphabet of size |∆| = 10 and features Ψ(x, y) given by the vector of counts of all 100 possible bigrams.
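The bigram-count feature map just described is straightforward to compute directly; the sketch below (the helper name `bigram_features` is ours, not from the experimental code) produces the 100-dimensional count vectors for a ten-symbol alphabet:

```python
from itertools import product

def bigram_features(y, alphabet):
    """Count-of-bigrams feature vector for a label sequence y: one
    coordinate per element of alphabet^2, i.e., 10^2 = 100 coordinates
    for the ten-symbol alphabet used in the runtime experiments."""
    index = {bg: k for k, bg in enumerate(product(alphabet, repeat=2))}
    phi = [0] * len(index)
    for t in range(1, len(y)):
        phi[index[(y[t - 1], y[t])]] += 1
    return phi
```

A sequence of length l contributes l − 1 bigram counts in total, which is why the naïve enumeration over all |∆|^l candidate outputs, rather than the featurization itself, dominates the gradient cost.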
For each string length l from 2 to 30, we draw input pairs (x_i, y_i) ∈ ∆^l × ∆^l uniformly at random and w ∈ R^100 according to a standard normal distribution. The average runtimes over 125 random trials are presented in Figure 5 for three loss functions: the edit-distance, the Smith-Waterman distance, and the bigram loss. The experiments demonstrate a number of crucial benefits of our efficient gradient computation framework. Note that the runtime of the Grad-Naïve procedure grows exponentially in l, while Grad-Tropical and Grad-Rational exhibit a linear dependency on the length of the input strings. In fact, using threshold pruning as part of determinization allows one to compute approximate gradients for arbitrarily long input strings. The computational improvement is even more evident for rational losses, in which case the determinization of M can be achieved in polynomial time (Cortes et al., 2015), so that pruning is not required.
We also provide preliminary learning experiments that illustrate the benefit of learning with a structured loss for a sequence alignment task, compared to training with the cross-entropy loss. The sequence alignment experiment replicates the artificial genome sequence data of (Joachims et al., 2006), where each example consists of native, homolog, and decoy sequences of length 50, and the task is to predict the sequence that is closest to the native one in terms of the Smith-Waterman alignment score. The experiment confirms that a model trained with the Smith-Waterman distance as the objective achieves a significantly higher average Smith-Waterman alignment score (and higher accuracy) on a test set than a model trained with the cross-entropy objective.
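For reference, the Smith-Waterman score used in this alignment task is the value of the best-scoring local alignment between two sequences, computable by a quadratic-time dynamic program. The sketch below is illustrative only: the match, mismatch, and gap parameters are placeholders, not the scoring used in (Joachims et al., 2006).

```python
def smith_waterman(y1, y2, match=2.0, mismatch=-1.0, gap=-1.0):
    """Smith-Waterman score: the best-scoring local alignment of y1 and
    y2 under linear gap penalties, via the classical O(|y1| * |y2|)
    dynamic program.  The scoring parameters here are illustrative."""
    m, n = len(y1), len(y2)
    h = [[0.0] * (n + 1) for _ in range(m + 1)]
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if y1[i - 1] == y2[j - 1] else mismatch
            h[i][j] = max(0.0,                  # start a new local alignment
                          h[i - 1][j - 1] + s,  # align y1[i-1] with y2[j-1]
                          h[i - 1][j] + gap,    # gap in y2
                          h[i][j - 1] + gap)    # gap in y1
            best = max(best, h[i][j])
    return best
```

Unlike the (global) edit-distance, the score is clipped at zero, so the optimal alignment may cover only a substring of each sequence.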
The cross-entropy model achieved a Smith-Waterman score of 42.73, while the augmented model achieved a score of 44.65 on the test set, with a standard deviation of 0.35 averaged over 10 random folds.

7 Conclusion

We presented efficient algorithms for computing the gradients of structured prediction models with rational and tropical losses, reporting experimental results that confirm both the runtime improvement over naïve implementations and the learning improvement over standard methods that settle for easier-to-optimize losses. We also showed how our approach can be incorporated into the top layer of a neural network, so that it can be used to train end-to-end models in domains including speech recognition, machine translation, and natural language processing. For future work, we plan to run large-scale experiments with neural networks to further demonstrate the benefit of working directly with rational or tropical losses using our efficient computational methods.

Acknowledgments

This work was partly funded by NSF CCF-1535987 and NSF IIS-1618662.

References

M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: a system for large-scale machine learning. In Proceedings of USENIX, 2016.

C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri. OpenFst: a general and efficient weighted finite-state transducer library. In Proceedings of CIAA. Springer, 2007.

D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M.
Chrzanowski, A. Coates, G. Diamos, E. Elsen, J. Engel, L. Fan, C. Fougner, A. Y. Hannun, B. Jun, T. Han, P. LeGresley, X. Li, L. Lin, S. Narang, A. Y. Ng, S. Ozair, R. Prenger, S. Qian, J. Raiman, S. Satheesh, D. Seetapun, S. Sengupta, C. Wang, Y. Wang, Z. Wang, B. Xiao, Y. Xie, D. Yogatama, J. Zhan, and Z. Zhu. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In Proceedings of ICML, 2016.

K. Chang, A. Krishnamurthy, A. Agarwal, H. Daumé III, and J. Langford. Learning to search better than your teacher. In Proceedings of ICML, 2015.

T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.

C. Cortes, P. Haffner, and M. Mohri. Rational kernels: Theory and algorithms. JMLR, 5:1035–1062, 2004.

C. Cortes, M. Mohri, and J. Weston. A general regression framework for learning string-to-string mappings. In Predicting Structured Data. MIT Press, 2007.

C. Cortes, V. Kuznetsov, M. Mohri, and M. K. Warmuth. On-line learning algorithms for path experts with non-additive losses. In Proceedings of COLT, 2015.

C. Cortes, V. Kuznetsov, M. Mohri, and S. Yang. Structured prediction theory based on factor graph complexity. In Proceedings of NIPS, 2016.

H. Daumé III, J. Langford, and D. Marcu. Search-based structured prediction. Machine Learning, 75(3):297–325, 2009.

J. R. Doppa, A. Fern, and P. Tadepalli. Structured prediction via output space search. JMLR, 15(1):1317–1350, 2014.

E. Eban, M. Schain, A. Mackey, A. Gordon, R. Rifkin, and G. Elidan. Scalable learning of non-decomposable objectives. In Artificial Intelligence and Statistics, pages 832–840, 2017.

K. Gimpel and N. A. Smith. Softmax-margin CRFs: Training log-linear models with cost functions. In Proceedings of ACL, 2010.

A. Graves and N.
Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of ICML, 2014.

T. Joachims. A support vector method for multivariate performance measures. In Proceedings of ICML, 2005.

T. Joachims, T. Galor, and R. Elber. Learning to align sequences: A maximum-margin approach. In New Algorithms for Macromolecular Simulation, pages 57–69. Springer, 2006.

D. Jurafsky and J. H. Martin. Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., 2009.

J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, 2001.

M. Lam, J. R. Doppa, S. Todorovic, and T. G. Dietterich. HC-Search for structured prediction in computer vision. In CVPR, 2015.

A. Lucchi, L. Yunpeng, and P. Fua. Learning for structured prediction using approximate subgradient descent with working sets. In Proceedings of CVPR, 2013.

C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts, 1999.

D. A. McAllester, T. Hazan, and J. Keshet. Direct loss minimization for structured prediction. In Proceedings of NIPS, 2010.

M. Mohri. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2):269–311, 1997.

M. Mohri. Semiring frameworks and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350, 2002.

M. Mohri. Edit-distance of weighted automata: General definitions and algorithms. International Journal of Foundations of Computer Science, 14(6):957–982, 2003.

M. Mohri. Weighted automata algorithms. In Handbook of Weighted Automata, pages 213–254. Springer, 2009.

D. Nadeau and S. Sekine. A survey of named entity recognition and classification.
Linguisticae Investigationes, 30(1):3–26, January 2007.

M. Norouzi, S. Bengio, N. Jaitly, M. Schuster, Y. Wu, and D. Schuurmans. Reward augmented maximum likelihood for neural structured prediction. In Proceedings of NIPS, 2016.

F. J. Och. Minimum error rate training in statistical machine translation. In Proceedings of ACL, volume 1, 2003.

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In Proceedings of NIPS, 2017.

H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In ICCV Workshops, pages 689–690, 2011.

R. Prabhavalkar, T. N. Sainath, Y. Wu, P. Nguyen, Z. Chen, C.-C. Chiu, and A. Kannan. Minimum word error rate training for attention-based sequence-to-sequence models. arXiv preprint arXiv:1712.01818, 2017.

M. Ranjbar, T. Lan, Y. Wang, S. N. Robinovitch, Z.-N. Li, and G. Mori. Optimizing nondecomposable loss functions in structured prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4):911–924, 2013.

M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.

S. Ross, G. J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of AISTATS, 2011.

B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT Press, Cambridge, Mass., 2004.

F. Seide and A. Agarwal. CNTK: Microsoft's open-source deep-learning toolkit. In Proceedings of KDD. ACM, 2016.

S. Shen, Y. Cheng, Z. He, W. He, H. Wu, M. Sun, and Y. Liu. Minimum risk training for neural machine translation. In Proceedings of ACL, volume 1, 2016.

I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Proceedings of NIPS, 2014.

B.
Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.

I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 6:1453–1484, Dec. 2005.

O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. Grammar as a foreign language. In Proceedings of NIPS, 2015a.

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of CVPR, 2015b.

Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Ł. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016. URL http://arxiv.org/abs/1609.08144.

D. Zhang, L. Sun, and W. Li. A structured prediction approach for statistical machine translation. In Proceedings of IJCNLP, 2008.