{"title": "Learning latent variable structured prediction models with Gaussian perturbations", "book": "Advances in Neural Information Processing Systems", "page_first": 3145, "page_last": 3155, "abstract": "The standard margin-based structured prediction commonly uses a maximum loss over all possible structured outputs. The large-margin formulation including latent variables not only results in a non-convex formulation but also increases the search space by a factor of the size of the latent space. Recent work has proposed the use of the maximum loss over random structured outputs sampled independently from some proposal distribution, with theoretical guarantees. We extend this work by including latent variables. We study a new family of loss functions under Gaussian perturbations and analyze the effect of the latent space on the generalization bounds. We show that the non-convexity of learning with latent variables originates naturally, as it relates to a tight upper bound of the Gibbs decoder distortion with respect to the latent space. Finally, we provide a formulation using random samples and relaxations that produces a tighter upper bound of the Gibbs decoder distortion up to a statistical accuracy, which enables a polynomial time evaluation of the objective function. We illustrate the method with synthetic experiments and a computer vision application.", "full_text": "Learning latent variable structured prediction models with Gaussian perturbations

Kevin Bello
Department of Computer Science
Purdue University
West Lafayette, IN, USA
kbellome@purdue.edu

Jean Honorio
Department of Computer Science
Purdue University
West Lafayette, IN, USA
jhonorio@purdue.edu

Abstract

The standard margin-based structured prediction commonly uses a maximum loss over all possible structured outputs [26, 1, 5, 25]. 
The large-margin formulation including latent variables [30, 21] not only results in a non-convex formulation but also increases the search space by a factor of the size of the latent space. Recent work [11] has proposed the use of the maximum loss over random structured outputs sampled independently from some proposal distribution, with theoretical guarantees. We extend this work by including latent variables. We study a new family of loss functions under Gaussian perturbations and analyze the effect of the latent space on the generalization bounds. We show that the non-convexity of learning with latent variables originates naturally, as it relates to a tight upper bound of the Gibbs decoder distortion with respect to the latent space. Finally, we provide a formulation using random samples and relaxations that produces a tighter upper bound of the Gibbs decoder distortion up to a statistical accuracy, which enables a polynomial time evaluation of the objective function. We illustrate the method with synthetic experiments and a computer vision application.

1 Introduction

Structured prediction is of high interest in many domains such as computer vision [19], natural language processing [32, 33], and computational biology [14]. Some standard methods for structured prediction are conditional random fields (CRFs) [13] and structured SVMs (SSVMs) [25, 26].
In many tasks it is crucial to take into account latent variables. For example, in machine translation, one is usually given a sentence x and its translation y, but not the linguistic structure h that connects them (e.g., alignments between words). Even if h is not observable, it is important to include this information in the model in order to obtain better prediction results. 
Examples also arise in computer vision; for instance, most images in indoor scene understanding [28] are cluttered by furniture and decorations, whose appearances vary drastically across scenes and can hardly be modeled (or even hand-labeled) consistently. In this application, the input x is an image, the structured output y is the layout of the faces (floor, ceiling, walls) and furniture, while the latent structure h assigns a binary label to each pixel (clutter or non-clutter).
In past years, there have been several solutions to the problem of latent variables in structured prediction. In the field of computer vision, hidden conditional random fields (HCRFs) [23, 29, 22] have been widely applied to object recognition and gesture detection. In natural language processing there is also work applying discriminative probabilistic latent variable models, for example the training of probabilistic context-free grammars with latent annotations in a discriminative manner [20]. The work of Yu and Joachims [30] extends the margin re-scaling SSVM in [26] by introducing latent variables (LSSVM) and obtains a formulation that is optimized using the concave-convex procedure (CCCP) [31].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

The work of Ping et al. [21] considers a smooth objective in LSSVM by incorporating marginal maximum a posteriori inference that "averages" over the latent space.
Some of the few works deriving generalization bounds for structured prediction include the work of McAllester [16], which provides PAC-Bayesian guarantees for arbitrary losses, and the work of Cortes et al. [7], which provides data-dependent margin guarantees for a general family of hypotheses with an arbitrary factor graph decomposition. However, with the exception of [11], the aforementioned works do not focus on producing computationally appealing methods. 
Moreover, prior generalization bounds have not focused on latent variables.

Contributions. We focus on the learning aspects of structured prediction problems using latent variables. We first extend the work of [16] by including latent variables, and show that the non-convex formulation using the slack re-scaling approach with latent variables is related to a tight upper bound of the Gibbs decoder distortion. This motivates the apparent need for non-convexity in different formulations using latent variables (e.g., [30, 10]). Second, we provide a tighter upper bound of the Gibbs decoder distortion by randomizing the search space of the optimization problem. That is, instead of having a formulation over all possible structures and latent variables (usually exponential in size), we propose a formulation that uses i.i.d. samples coming from some proposal distribution. This approach is also computationally appealing in cases where the margin can be computed in poly-time (for example, when the latent space is polynomial in size, or when a relaxation of the maximization over the latent space can be computed in poly-time), since it leads to a fully polynomial time evaluation of the formulation. The use of standard Rademacher arguments and the analysis of [11] would lead to a prohibitive upper bound that is proportional to the size of the latent space; we provide a way to obtain an upper bound that is logarithmic in the size of the latent space. Finally, we provide experimental results on synthetic data and on a computer vision application, where we obtain improvements in the average test error with respect to the values reported in [9].

2 Background

We denote the input space as X, the output space as Y, and the latent space as H. We assume a distribution D over the observable space X × Y. We further assume that we are given a training set S of n i.i.d. 
samples drawn from the distribution D, i.e., S ∼ D^n.
Let Yx ≠ ∅ denote the countable set of feasible outputs or decodings of x. In general, |Yx| is exponential with respect to the input size. Likewise, let Hx ≠ ∅ denote the countable set of feasible latent decodings of x.
We consider a fixed mapping Φ from triples to feature vectors to describe the relation among input x, output y, and latent variable h, i.e., for any triple (x, y, h), we have the feature vector Φ(x, y, h) ∈ R^k \ {0}. For a parameter w ∈ W ⊆ R^k \ {0}, we consider linear decoders of the form:

fw(x) = argmax_{(y,h)∈Yx×Hx} Φ(x, y, h) · w.   (1)

The problem of computing this argmax is typically referred to as the inference or prediction problem. In practice, very few cases of the above general inference problem are tractable, while most are NP-hard and also hard to approximate within a fixed factor. (For instance, see Section 6.1 in [11] for a thorough discussion.)
We denote by d : Y × Y × H → [0, 1] the distortion function, which measures the dissimilarity among two elements of the output space Y and one element of the latent space H. (Note that the distortion function is general in the sense that the latent element may not be used in some applications.) Therefore, the goal is to find a w ∈ W that minimizes the decoder distortion, that is:

min_{w∈W} E_{(x,y)∼D} [d(y, ⟨fw(x)⟩)].   (2)

In the above equation, the angle brackets indicate that we are inserting a pair (ŷ, ĥ) = fw(x) into the distortion function. From the computational point of view, the above optimization problem is intractable since d(y, ⟨fw(x)⟩) is discontinuous with respect to w. From the statistical viewpoint, eq.(2) requires access to the data distribution D and would require an infinite amount of data. In
In\npractice, one only has access to a \ufb01nite number of samples.\n\n2\n\n\fFurthermore, even if one were able to compute w using the objective in eq.(2), this parameter w,\nwhile achieving low distortion, could potentially be in a neighborhood of parameters with high\ndistortion. Therefore, we can optimize a more robust objective that takes into account perturbations.\nIn this paper we consider Gaussian perturbations. More formally, let \u03b1 > 0 and let Q(w) be a\nunit-variance Gaussian distribution centered at \u03b1w of parameters w(cid:48) \u2208 W. The Gibbs decoder\ndistortion of the perturbation distribution Q(w) and data distribution D, is de\ufb01ned as:\n\n(cid:34)\n\n(cid:2)d(y,(cid:104)fw(cid:48)(x)(cid:105))(cid:3)(cid:35)\n\nL(Q(w), D) = E\n\n(x,y)\u223cD\n\nE\n\nw(cid:48)\u223cQ(w)\n\n(3)\n\nThen, the optimization problem using the Gibbs decoder distortion can be written as:\n\nmin\nw\u2208W\n\nL(Q(w), D).\n\nWe de\ufb01ne the margin m(x, y, y(cid:48), h(cid:48), w) as follows:\n\nm(x, y, y(cid:48), h(cid:48), w) = max\nh\u2208Hx\n\n\u03a6(x, y, h) \u00b7 w \u2212 \u03a6(x, y(cid:48), h(cid:48)) \u00b7 w.\n\nNote that since we are considering latent variables, our de\ufb01nition of margin differs from [16, 11].\nLet h\u2217 = argmaxh\u2208Hx \u03a6(x, y, h) \u00b7 w. In this case h\u2217 can be interpreted as the latent variable that\nbest explains the pair (x, y). Then, for a \ufb01xed w, the margin computes the amount by which the pair\n(y, h\u2217) is preferred to the pair (y(cid:48), h(cid:48)).\nNext we introduce the concept of \u201cparts\u201d, also used in [16]. Let c(p, x, y, h) be a nonnegative integer\nthat gives the number of times that the part p \u2208 P appears in the triple (x, y, h). 
For a part p ∈ P, we define the feature Φp as follows:

Φp(x, y, h) ≡ c(p, x, y, h)

We let Px ≠ ∅ denote the set of p ∈ P such that there exists (y, h) ∈ Yx × Hx with c(p, x, y, h) > 0.

Structural SVMs with latent variables. [30] extend the margin re-scaling formulation given in [26] by incorporating latent variables. The motivation to extend this formulation is that it leads to a difference of two convex functions, which allows the use of CCCP [31]. The aforementioned formulation is:

min_w (1/2)‖w‖₂² + C Σ_{(x,y)∈S} max_{(ŷ,ĥ)∈Yx×Hx} [Φ(x, ŷ, ĥ) · w + d(y, ŷ, ĥ)] − C Σ_{(x,y)∈S} max_{h∈Hx} Φ(x, y, h) · w   (4)

In the case of standard SSVMs (without latent variables), [26] discuss two advantages of the slack re-scaling formulation over the margin re-scaling formulation: the slack re-scaling formulation is invariant to the scaling of the distortion function, and margin re-scaling potentially gives significant score to structures that are not even close to being confusable with the target structures. [1, 6, 25] proposed similar formulations to the slack re-scaling formulation. Despite its theoretical advantages, slack re-scaling has been less popular than the margin re-scaling approach due to computational requirements. In particular, both formulations require optimizing over the output space, but while margin re-scaling preserves the structure of the score and error functions, slack re-scaling does not. This results in harder inference problems during training. [11] also analyze the slack re-scaling approach and theoretically show that using random structures one can obtain a tighter upper bound of the Gibbs decoder distortion. 
However, these works do not take latent variables into account.
The following formulation corresponds to the slack re-scaling approach with latent variables:

min_w (1/n) Σ_{(x,y)∈S} max_{(ŷ,ĥ)∈Yx×Hx} d(y, ŷ, ĥ) 1[m(x, y, ŷ, ĥ, w) ≤ 1] + λ‖w‖₂²   (5)

We take into account the loss of structures whose margin is less than one (i.e., m(·) ≤ 1) instead of the Hamming distance as done in [11]. This is because the former gave better results in preliminary experiments; it is also more related to current practice (e.g., [30]). In order to obtain an SSVM-like formulation, the hinge loss is used instead of the discontinuous 0/1 loss in the above formulation. Note, however, that both eq.(4) and eq.(5) are non-convex problems with respect to the learning parameter w even if the hinge loss is used.

3 The maximum loss over all structured outputs and latent variables

In this section we extend the work of McAllester [16] by including latent variables. In the following theorem, we show that the slack re-scaling objective function (eq.(5)) is an upper bound of the Gibbs decoder distortion (eq.(3)) up to a statistical accuracy of O(√(log n / n)) for n training samples.
Theorem 1. Assume that there exists a finite integer value r such that |Yx × Hx| ≤ r for all (x, y) ∈ S. Assume also that ‖Φ(x, y, h)‖₂ ≤ γ for any triple (x, y, h). Fix δ ∈ (0, 1). 
With probability at least 1 − δ/2 over the choice of n training samples, simultaneously for all parameters w ∈ W and unit-variance Gaussian perturbation distributions Q(w) centered at wγ√(8 log(rn/‖w‖₂²)), we have:

L(Q(w), D) ≤ (1/n) Σ_{(x,y)∈S} max_{(ŷ,ĥ)∈Yx×Hx} d(y, ŷ, ĥ) 1[m(x, y, ŷ, ĥ, w) ≤ 1] + ‖w‖₂²/n + √( (4‖w‖₂² γ² log(rn/‖w‖₂²) + log(2n/δ)) / (2(n − 1)) )

(See Appendix A for detailed proofs.)
For the proof of the above we used the PAC-Bayes theorem and well-known Gaussian concentration inequalities. Note that the average sum on the right-hand side, i.e., the objective function, can be equivalently written as:

(1/n) Σ_{(x,y)∈S} max_{(ŷ,ĥ)∈Yx×Hx} min_{h∈Hx} d(y, ŷ, ĥ) 1[Φ(x, y, h) · w − Φ(x, ŷ, ĥ) · w ≤ 1].

Remark 1. It is clear that the above formulation is tight with respect to the latent space Hx due to the minimization. This is an interesting observation because it reinforces the idea that a non-convex formulation is required in models using latent variables, i.e., an attempt to "convexify" the formulation will result in looser upper bounds and consequently might produce worse predictions. Some other examples of non-convex formulations for latent-variable models are found in [30, 10].
Note also that the upper bound has a maximization over Yx × Hx (usually exponential in size) and a minimization over Hx (potentially exponential in size). We state two important observations in the following remark.
Remark 2. 
First, in the minimization, it is clear that the use of a subset of Hx would lead to a looser upper bound. However, using a superset H̃x ⊇ Hx would lead to a tighter upper bound. The latter relaxation not only can tighten the bound but also can allow the margin to be computed in polynomial time; see for instance some analyses of LP-relaxations in [12, 15, 17]. Second, in contrast, using a subset of Yx × Hx in the maximization would lead to a tighter upper bound.

From the first observation above, we will now introduce a new definition of margin, m̃, which performs a maximization over a superset H̃x ⊇ Hx:

m̃(x, y, y′, h′, w) = max_{h∈H̃x} Φ(x, y, h) · w − Φ(x, y′, h′) · w.

In several examples, m is NP-hard to compute for H (DAGs, trees or cardinality-constrained sets), but m̃ is poly-time computable for H̃ being a set of binary strings. That is, we can encode any DAG (in H) as a binary string (in H̃), but not all binary strings are DAGs. Later, in Section 6, we provide an empirical comparison of the use of m and m̃. We next present an upper bound similar to the one obtained in Theorem 1 but now using the margin m̃.
Theorem 2 (Relaxed margin bound). Assume that there exists a finite integer value r such that |Yx × Hx| ≤ r for all (x, y) ∈ S. Assume also that ‖Φ(x, y, h)‖₂ ≤ γ for any triple (x, y, h). Fix δ ∈ (0, 1). 
With probability at least 1 − δ/2 over the choice of n training samples, simultaneously for all parameters w ∈ W and unit-variance Gaussian perturbation distributions Q(w) centered at wγ√(8 log(rn/‖w‖₂²)), we have:

L(Q(w), D) ≤ (1/n) Σ_{(x,y)∈S} max_{(ŷ,ĥ)∈Yx×Hx} d(y, ŷ, ĥ) 1[m̃(x, y, ŷ, ĥ, w) ≤ 1] + ‖w‖₂²/n + √( (4‖w‖₂² γ² log(rn/‖w‖₂²) + log(2n/δ)) / (2(n − 1)) )

From the second observation in Remark 2, it is natural to ask what elements should constitute this subset in order to control the statistical accuracy with respect to the Gibbs decoder. Finally, if the number of elements is polynomial then we also have an efficient computation of the maximum. We provide answers to these questions in the next section.

4 The maximum loss over random structured outputs and latent variables

In this section, we show the relation between PAC-Bayes bounds and the maximum loss over random structured outputs and latent variables sampled i.i.d. from some proposal distribution.

A more efficient evaluation. Instead of using a maximization over Yx × Hx, we will perform a maximization over a set T(w, x) of random elements sampled i.i.d. from some proposal distribution R(w, x) with support on Yx × Hx. 
More explicitly, our new formulation is:

min_w (1/n) Σ_{(x,y)∈S} max_{(ŷ,ĥ)∈T(w,x)} d(y, ŷ, ĥ) 1[m̃(x, y, ŷ, ĥ, w) ≤ 1] + λ‖w‖₂².   (6)

We make use of the following two assumptions in order for |T(w, x)| to be polynomial, even when |Yx × Hx| is exponential with respect to the input size.
Assumption A (Maximal distortion, [11]). The proposal distribution R(w, x) fulfills the following condition. There exists a value β ∈ [0, 1) such that for all (x, y) ∈ S and w ∈ W:

P_{(y′,h′)∼R(w,x)} [d(y, y′, h′) = 1] ≥ 1 − β

Assumption B (Low norm). The proposal distribution R(w, x) fulfills the following condition for all (x, y) ∈ S and w ∈ W:¹

‖ E_{(y′,h′)∼R(w,x)} [Φ(x, y, h*) − Φ(x, y′, h′)] ‖₂ ≤ 1/(2√n) ≤ 1/(2‖w‖₂),

where h* = argmax_{h∈Hx} Φ(x, y, h) · w.
In Section 5 we provide examples for Assumptions A and B which allow us to obtain |T(w, x)| = O(1/log(1/(β + e^{−1/(γ²‖w‖₂²)}))). Note that β plays an important role in the number of samples that we need to draw from the proposal distribution R(w, x).

Statistical analysis. In this approach, randomness comes from two sources: the training data S and the random sets T(w, x). That is, in Theorem 1 randomness only stems from the training set S; now we need to produce generalization results that hold for all the sets T(w, x), and for all possible proposal distributions R(w, x). 
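The randomized objective in eq.(6) can be evaluated by drawing a small set T(w, x) per training example and taking the maximum over the drawn pairs only. A minimal sketch, assuming user-supplied distortion, relaxed margin, and proposal functions; all names and the toy instance are hypothetical:

```python
import numpy as np

def sampled_objective(S, w, distortion, margin_tilde, proposal, n_draws, lam):
    # Evaluates eq.(6): for each (x, y), the max over all of Yx x Hx is
    # replaced by a max over T(w, x), a set of i.i.d. proposal samples.
    total = 0.0
    for x, y in S:
        T = [proposal(w, x) for _ in range(n_draws)]  # T(w, x)
        total += max(distortion(y, yh, hh) * float(margin_tilde(x, y, yh, hh, w) <= 1)
                     for yh, hh in T)
    return total / len(S) + lam * float(w @ w)

# Toy instance: outputs and latents are small integers, proposal is uniform.
rng = np.random.default_rng(0)
phi = lambda x, y, h: np.array([y + h, x], dtype=float)
margin_tilde = lambda x, y, yh, hh, w: (max(phi(x, y, h) @ w for h in (0, 1))
                                        - phi(x, yh, hh) @ w)
distortion = lambda y, yh, hh: float(y != yh)
proposal = lambda w, x: (int(rng.integers(0, 3)), int(rng.integers(0, 2)))
S = [(1.0, 2), (0.0, 1)]
w = np.array([1.0, 0.2])
obj = sampled_objective(S, w, distortion, margin_tilde, proposal, n_draws=5, lam=0.1)
```

Since the distortion is in [0, 1] and the indicator is binary, the objective always lies between λ‖w‖₂² and 1 + λ‖w‖₂², regardless of which structures are drawn.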
The following assumption will allow us to upper-bound the number of possible proposal distributions R(w, x).
Assumption C (Linearly inducible ordering, [11]). The proposal distribution R(w, x) depends solely on the linear ordering induced by the parameter w ∈ W and the mapping Φ(x, ·, ·). More formally, let r(x) ≡ |Yx × Hx| and thus Yx × Hx ≡ {(y1, h1) . . . (y_r(x), h_r(x))}. Let w, w′ ∈ W be any two arbitrary parameters. Let π(x) = (π1 . . . π_r(x)) be a permutation of {1 . . . r(x)} such that Φ(x, y_π1, h_π1) · w < · · · < Φ(x, y_πr(x), h_πr(x)) · w. Let π′(x) = (π′1 . . . π′_r(x)) be a permutation of {1 . . . r(x)} such that Φ(x, y_π′1, h_π′1) · w′ < · · · < Φ(x, y_π′r(x), h_π′r(x)) · w′. For all w, w′ ∈ W and x ∈ X, if π(x) = π′(x) then KL(R(w, x) ‖ R(w′, x)) = 0. In this case, we say that the proposal distribution fulfills R(π(x), x) ≡ R(w, x).

¹The second inequality follows from an implicit assumption made in Theorem 1, i.e., ‖w‖₂²/n ≤ 1, since the distortion function d is at most 1.

In Assumption C, geometrically speaking, for a fixed x we first project the feature vectors Φ(x, y, h) of all (y, h) ∈ Yx × Hx onto the lines w and w′. Let π(x) and π′(x) be the resulting orderings of the structured outputs after projecting them onto w and w′ respectively. Two proposal distributions R(w, x) and R(w′, x) are the same provided that π(x) = π′(x). 
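For small candidate spaces, the ordering π(x) of Assumption C can be computed and compared directly: two parameters w, w′ induce the same proposal whenever they sort the candidate feature vectors identically. A small sketch under that assumption (names and feature vectors hypothetical):

```python
import numpy as np

def induced_ordering(feats, w):
    # pi(x): permutation sorting candidate structures (y, h) by their score
    # phi(x, y, h).w; under Assumption C the proposal R(w, x) may depend
    # on w only through this permutation.
    return tuple(np.argsort(feats @ w, kind="stable"))

# Feature vectors of four hypothetical (y, h) candidates for a fixed x.
feats = np.array([[0.0, 1.0],
                  [1.5, 0.0],
                  [1.0, 1.0],
                  [2.0, 2.0]])
w1 = np.array([1.0, 1.0])
w2 = np.array([2.0, 2.0])  # rescaling w preserves the induced ordering
same = induced_ordering(feats, w1) == induced_ordering(feats, w2)
```

In particular, any positive rescaling of w leaves π(x) unchanged, so under Assumption C it also leaves R(w, x) unchanged.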
That is, the specific values of Φ(x, y, h) · w and Φ(x, y, h) · w′ are irrelevant, and only their ordering matters.
In Section 5 we show an example that fulfills Assumption C, which corresponds to a generalization of Algorithm 2 proposed in [11] to any structure with computationally efficient local changes.
In the following theorem, we show that our new formulation in eq.(6) is related to an upper bound of the Gibbs decoder distortion up to a statistical accuracy of O(log² n/√n) for n training samples.
Theorem 3. Assume that there exist finite integer values r, r̃, ℓ, and γ such that |Yx × Hx| ≤ r and |H̃x| ≤ r̃ for all (x, y) ∈ S, |∪_{(x,y)∈S} Px| ≤ ℓ, and ‖Φ(x, y, h)‖₂ ≤ γ for any triple (x, y, h). Assume that the proposal distribution R(w, x) with support on Yx × Hx fulfills Assumption A with value β, as well as Assumptions B and C. Assume that ‖w‖₂² ≤ √(ℓ(r + 1)) + 1. Fix δ ∈ (0, 1) and an integer s such that 3 ≤ 2s + 1 ≤ 9n/(128γ² log(1/(1 − β))). With probability at least 1 − δ over the choice of both n training samples and n sets of random structured outputs and latent variables, simultaneously for all parameters w ∈ W with ‖w‖₀ ≤ s, unit-variance Gaussian perturbation distributions Q(w) centered at wγ√(8 log(rn/‖w‖₂²)), and sets of random structured outputs T(w, x) sampled i.i.d. 
from the proposal distribution R(w, x) for each training sample (x, y) ∈ S, such that |T(w, x)| = ⌈ log n / log(1/(β + e^{−1/(128γ²‖w‖₂²)})) ⌉, we have:

L(Q(w), D) ≤ (1/n) Σ_{(x,y)∈S} max_{(ŷ,ĥ)∈T(w,x)} d(y, ŷ, ĥ) 1[m̃(x, y, ŷ, ĥ, w) ≤ 1] + ‖w‖₂²/n + (1/2)√(1/n) + √( (4‖w‖₂² γ² log(rn/‖w‖₂²) + log(2n/δ)) / (2(n − 1)) ) + √( (s(log ℓ + 2 log(nr)) + log(4/δ)) / (n log(1/(β + e^{−1/(128γ²‖w‖₂²)}))) ) + 3√( (2s + 1) log(ℓ(nr̃ + 1) + 1) log³(n + 1) / n )

The proof of the above uses Theorem 2 as a starting point. In order to account for the computational aspect of requiring sets T(w, x) of polynomial size, we use Assumptions A and B for bounding a deterministic expectation. In order to account for the statistical aspects, we use Assumption C and Rademacher complexity arguments for bounding a stochastic quantity for all sets T(w, x) of random structured outputs and latent variables, and all possible proposal distributions R(w, x).
Remark 3. A straightforward application of Rademacher complexity in the analysis of [11] leads to a bound of O(|Hx|/√n). Technically speaking, a classical Rademacher complexity argument states the following: let F and G be two hypothesis classes, and let min(F, G) = {min(f, g) | f ∈ F, g ∈ G}. Then R(min(F, G)) ≤ R(F) + R(G). If we applied this, then Theorem 3 would contain an O(|Hx|/√n) term, or equivalently O(r/√n). 
This would be prohibitive since r is typically exponential in size, and one would require a very large number of samples n in order to have a useful bound, i.e., to make O(r/√n) close to zero. In the proof we provide a way to tighten the bound to O(√(log |Hx| / n)).

5 Examples

Here we provide several examples that fulfill the three main assumptions of our theoretical result.

Examples for Assumption A. First we argue that we can perform a change of measure between different proposal distributions. This allows us to focus on uniform proposals afterwards.
Claim i (Change of measure). Let R(w, x) and R′(w, x) be two proposal distributions, both with support on Yx × Hx. Assume that R(w, x) fulfills Assumption A with value β1. Let r_{w,x}(·) and r′_{w,x}(·) be the probability mass functions of R(w, x) and R′(w, x) respectively. Assume that the total variation distance between R(w, x) and R′(w, x) fulfills for all (x, y) ∈ S and w ∈ W:

TV(R(w, x) ‖ R′(w, x)) ≡ (1/2) Σ_{(y,h)} |r_{w,x}(y, h) − r′_{w,x}(y, h)| ≤ β2

Then R′(w, x) fulfills Assumption A with β = β1 + β2, provided that β1 + β2 ∈ [0, 1).
Next, we present a new result for permutations and for a distortion that returns the number of different positions. We later use this result for an image matching application in the experiments section.
Claim ii (Permutations). Let Yx be the set of all permutations of v elements, such that v > 1. Let yi be the i-th element in the permutation y. 
Let d(y, y′, h) = (1/v) Σ_{i=1..v} 1[yi ≠ y′i]. The uniform proposal distribution R(w, x) = R(x) with support on Yx × Hx fulfills Assumption A with β = 2/3.
The authors in [11] present several examples of distortion functions of the form d(y, y′): for directed spanning trees, directed acyclic graphs and cardinality-constrained sets, a distortion function that returns the number of different edges/elements; as well as binary distortion functions for any type of structured output. For our setting we can make use of these examples by defining d(y, y′, h) = d(y, y′). Note that even if we ignore the latent variable in the distortion function, we still use the latent variables in the feature vectors Φ(x, y, h) and thus in the calculation of the margin.

Examples for Assumption B. The claim below is for a particular instance of a sparse mapping and a uniform proposal distribution.
Claim iii (Sparse mapping). Let b > 0 be an arbitrary integer value. For all (x, y) ∈ S with h* = argmax_{h∈Hx} Φ(x, y, h) · w, let Υx = ∪_{p∈Px} Υpx, where the partition Υpx is defined as follows for all p ∈ Px:

Υpx ≡ {(y′, h′) | |Φp(x, y, h*) − Φp(x, y′, h′)| ≤ b and (∀q ≠ p) Φq(x, y, h*) = Φq(x, y′, h′)}

If n ≤ |Px|/(4b²) for all (x, y) ∈ S, then the uniform proposal distribution R(w, x) = R(x) with support on Yx × Hx fulfills Assumption B.
The claim below is for a particular instance of a dense mapping and an arbitrary proposal distribution.
Claim iv (Dense mapping). Let b > 0 be an arbitrary integer value. Let |Φp(x, y, h*) − Φp(x, y′, h′)| ≤ b|Px| for all (x, y) ∈ S with h* = argmax_{h∈Hx} Φ(x, y, h) · w, (y′, h′) ∈ Yx × Hx and p ∈ Px. 
If n ≤ |Px|/(4b²) for all (x, y) ∈ S, then any arbitrary proposal distribution R(w, x) fulfills Assumption B.

Examples for Assumption C. In the case of modeling without latent variables, [32, 33] presented an algorithm for directed spanning trees in the context of dependency parsing in natural language processing. Later, [11] extended the previous algorithm to any structure with computationally efficient local changes, which includes directed acyclic graphs (traversed in post-order) and cardinality-constrained sets. Next, we generalize Algorithm 2 in [11] by including latent variables.

Algorithm 1 Procedure for sampling a structured output (y′, h′) ∈ Yx × Hx from a greedy local proposal distribution R(w, x)
1: Input: parameter w ∈ W, observed input x ∈ X
2: Draw uniformly at random a structured output (ŷ, ĥ) ∈ Yx × Hx
3: repeat
4:   Make a local change to (ŷ, ĥ) in order to increase Φ(x, ŷ, ĥ) · w
5: until no refinement in last iteration
6: Output: structured output and latent variable (y′, h′) ← (ŷ, ĥ)

The above algorithm has the following property:
Claim v (Sampling for any type of structured output and latent variable). Algorithm 1 fulfills Assumption C.

Table 1: Average over 30 repetitions, and standard error at 95% confidence level. All (LSSVM) indicates the use of exact learning and exact inference. Rand and Rand/All indicate the use of random learning, with random and exact inference respectively. (S) indicates the use of the superset H̃ in the calculation of the margin. Rand/All obtains a similar or slightly better test performance than All in the different study cases. 
Note that the runtime for learning using the randomized approach is much lower than that of exact learning, while still achieving a good test performance.

Problem                        Method         Training runtime  Training distortion  Test runtime  Test distortion
Directed spanning trees        All (LSSVM)    1000 ± 15         8.4% ± 1.4%          18.9 ± 0.1    8.2% ± 1.3%
                               Rand (S)       44 ± 1            22% ± 2.2%           0.92 ± 0      22% ± 1.9%
                               Rand/All (S)   -                 -                    19 ± 0.1      8.2% ± 1.3%
                               Rand           126 ± 5           23% ± 3.0%           3 ± 0.4       24% ± 3.2%
                               Rand/All       -                 -                    17 ± 0.8      8.2% ± 1.4%
Directed acyclic graphs        All (LSSVM)    1000 ± 21         17% ± 1.7%           19 ± 0.2      21% ± 2.4%
                               Rand (S)       63 ± 0            24% ± 1.5%           1.5 ± 0       28% ± 1.9%
                               Rand/All (S)   -                 -                    19 ± 0.2      20% ± 1.9%
                               Rand           353 ± 5           21% ± 1.1%           8 ± 1         25% ± 1.4%
                               Rand/All       -                 -                    15 ± 0.2      19% ± 1.6%
Cardinality-constrained sets   All (LSSVM)    1000 ± 5          6.3% ± 1.0%          19.5 ± 0.1    6% ± 1.2%
                               Rand (S)       75 ± 0            18% ± 1.8%           1.7 ± 0       18% ± 1.8%
                               Rand/All (S)   -                 -                    19.5 ± 0.1    6% ± 1.3%
                               Rand           182 ± 3           15% ± 3.2%           3.1 ± 1       17% ± 1.2%
                               Rand/All       -                 -                    19.4 ± 0.1    6% ± 2.2%
A "-" indicates that the Rand/All row reuses the model trained in the corresponding Rand row, so no separate training is performed.

6 Experiments

In this section we illustrate the use of our approach by using the formulation in eq.(6). The goal of the synthetic experiments is to show the improvement in prediction results and runtime of our method, while the goal of the real-world experiment is to show the usability of our method in practice.

Synthetic experiments. We present experimental results for directed spanning trees, directed acyclic graphs and cardinality-constrained sets. We performed 30 repetitions of the following procedure. We generated a ground-truth parameter w∗ with independent zero-mean and unit-variance Gaussian entries. Then, we generated a training set of n = 100 samples. Our mapping Φ(x, y, h) is as follows.
For every pair of possible edges/elements i and j, we define Φij(x, y, h) = 1[(h_ij xor x_ij) and i ∈ y and j ∈ y]. In order to generate each training sample (x, y) ∈ S, we generated a random vector x with independent Bernoulli entries, each with equal probability of being 1 or 0. The latent space H is the set of binary strings with two entries being 1, where these two entries share a common edge or element, i.e., h_ij = h_ik = 1 for some i, j, k. To the best of our knowledge there is no efficient way to exactly compute the maximization in the margin m under this latent space. Thus, we define the relaxed set H̃ as the set of all binary strings with exactly two entries being 1. We can then efficiently compute the margin m̃ by a greedy approach, since our feature vector is constructed using linear operators. After generating x, we set (y, h) = fw∗(x). That is, we solved eq.(1) in order to produce the structured output y, and disregarded h. (More details of the experiment are given in Appendix B.3.)

We compared three training methods: the maximum loss over all possible structured outputs and latent variables with slack re-scaling as in eq.(5); the maximum loss over random structured outputs and latent variables using the original latent space; and the same random approach using the superset relaxation as in eq.(6). We considered directed spanning trees of 4 nodes, directed acyclic graphs of 4 nodes and 2 parents per node, and sets of 3 elements chosen from 9 possible elements. After training, for inference on an independent test set, we used eq.(1) for the maximum loss over all possible structured outputs and latent variables.
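To make the construction above concrete, the following is a minimal sketch of the randomized pipeline on a toy cardinality-constrained-set problem. The encodings (y as a fixed-size subset, x and h as dictionaries of binary pair indicators) and all function names are assumptions made here for illustration, not the paper's implementation; `greedy_sample` mirrors the random-start-plus-local-ascent procedure of Algorithm 1, and `random_inference` mirrors the approximate inference of eq.(7) over a sampled set T(w, x).

```python
import random

# Hypothetical sketch of the synthetic setup above: y is a fixed-size subset,
# h and x assign binary values to element pairs, and the feature map follows
# Phi_ij(x, y, h) = 1[(h_ij xor x_ij) and i in y and j in y].

PAIRS = [(i, j) for i in range(4) for j in range(i + 1, 4)]  # candidate pairs

def feature_vector(x, y, h):
    """Pairwise indicator features Phi(x, y, h) over all candidate pairs."""
    return [int((h[p] ^ x[p]) and (p[0] in y) and (p[1] in y)) for p in PAIRS]

def score(w, x, y, h):
    """Linear score Phi(x, y, h) . w."""
    return sum(wi * fi for wi, fi in zip(w, feature_vector(x, y, h)))

def hamming_distortion(y_bits, y_bits_prime):
    """Normalized Hamming distortion d(y, y') = (1/v) sum_i 1[y_i != y'_i]."""
    return sum(a != b for a, b in zip(y_bits, y_bits_prime)) / len(y_bits)

def local_changes(y, h, universe):
    """Neighbors of (y, h): swap one element of y, or flip one entry of h."""
    for out in y:
        for inn in universe - y:
            yield (y - {out}) | {inn}, h
    for p in PAIRS:
        h2 = dict(h)
        h2[p] ^= 1
        yield y, h2

def greedy_sample(w, x, universe, k, rng):
    """Algorithm-1-style sampler: uniform random start, then greedy ascent."""
    y = frozenset(rng.sample(sorted(universe), k))
    h = {p: rng.randint(0, 1) for p in PAIRS}
    improved = True
    while improved:
        improved = False
        for y2, h2 in local_changes(y, h, universe):
            if score(w, x, y2, h2) > score(w, x, y, h):
                y, h, improved = frozenset(y2), h2, True
                break
    return y, h

def random_inference(w, x, universe, k, n_samples=20, seed=0):
    """Approximate inference: argmax of the score over a sampled set T(w, x)."""
    rng = random.Random(seed)
    samples = [greedy_sample(w, x, universe, k, rng) for _ in range(n_samples)]
    return max(samples, key=lambda yh: score(w, x, yh[0], yh[1]))
```

With all-ones weights and x = 0, the score counts pairs that lie inside y and have h_ij = 1, so the greedy ascent quickly saturates at the best local configuration; the real experiments use the same scheme at larger scale.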
For the maximum loss over random structured outputs and latent variables, we used the following approximate inference approach:

f̃w(x) ≡ argmax_{(y,h) ∈ T(w,x)} Φ(x, y, h) · w.    (7)

Note that we used small structures and latent spaces in order to compare to exact learning, i.e., going through all possible structures as in eq.(5) and eq.(4). Bigger structures would result in an exponential number of structures, making exact methods intractable to compare against our method. For purposes of testing, we tried cardinality-constrained sets of 4 elements out of 100 (note that in this case |Y| ≈ 10^8 and |H| ≈ 10^16), and training only took 11 minutes under our approach.

Table 1 shows the runtime, the training distortion, and the test distortion on an independently generated set of 100 samples. In the different study cases, the maximum loss over random structured outputs and latent variables obtains test performance similar to that of the maximum loss over all possible structured outputs and latent variables. However, note that our method is considerably faster.

Image matching. We illustrate our approach for image matching on video frames from the Buffy Stickmen dataset (http://www.robots.ox.ac.uk/~vgg/data/stickmen/). The goal of the experiment is to match the keypoints representing different body parts between two images. Each frame contains 18 keypoints representing different parts of the body. From a total of 187 image pairs (from different episodes and people), we randomly selected 120 pairs for training and the remaining 67 pairs for testing. We performed 30 repetitions. Ground truth keypoint matching is provided in the dataset.

Following [9, 27], we represent the matching as a permutation of keypoints. Let x = (I, I′) be a pair of images, and let y be a permutation of {1 . . . 18}.
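As a sketch of this matching model, the snippet below scores a permutation matching under a latent 2×2 affine transform drawn from the finite grid used in this experiment. The helper names and the point encodings are assumptions made here for illustration; only the grid of latent values and the permutation representation come from the text.

```python
import math

# Illustrative sketch (not the paper's code) of the image-matching model:
# y is a permutation matching keypoints of image I to keypoints of image I',
# and the latent h is a 2x2 affine transform with diagonal entries in
# {0.8, 1, 1.2} and off-diagonal entries in {-0.2, 0, 0.2}.

H_GRID = [[[a, b], [c, d]]
          for a in (0.8, 1.0, 1.2) for d in (0.8, 1.0, 1.2)
          for b in (-0.2, 0.0, 0.2) for c in (-0.2, 0.0, 0.2)]

def apply_affine(h, point):
    """Apply the 2x2 transform h to a keypoint (x, y)."""
    px, py = point
    return (h[0][0] * px + h[0][1] * py, h[1][0] * px + h[1][1] * py)

def matching_cost(keypoints, keypoints_prime, y_perm, h):
    """Total distance between h-transformed keypoints of I and the keypoints
    of I' that the permutation y_perm matches them to."""
    return sum(math.dist(apply_affine(h, keypoints[i]), keypoints_prime[j])
               for i, j in enumerate(y_perm))

def best_latent(keypoints, keypoints_prime, y_perm):
    """Search the finite latent space: the affine h in the grid with the
    lowest matching cost for a fixed permutation y_perm."""
    return min(H_GRID, key=lambda h: matching_cost(keypoints, keypoints_prime,
                                                   y_perm, h))
```

Because the latent space is a small finite grid (81 transforms), the best h for a fixed permutation can be found by enumeration, while the permutation itself is handled by the randomized inference scheme.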
We model the latent variable h as a 2×2 real matrix representing an affine transformation of a keypoint, where h11, h22 ∈ {0.8, 1, 1.2} and h12, h21 ∈ {−0.2, 0, 0.2}. Our mapping Φ(x, y, h) uses SIFT features and the distance between coordinates after applying h. (Details in Appendix B.3.)

We used the distortion function and β = 2/3 as prescribed by Claim ii. After learning, for a given x from the test set, we performed 100 iterations of random inference as in eq.(7). We obtained an average error of 0.3878 (6.98 incorrectly matched keypoints) on the test set, which is an improvement over the values of 8.47 incorrectly matched keypoints for maximum-a-posteriori perturbations and 8.69 for max-margin, as reported in [9]. Finally, we show an example from the test set in Figure 1.

Figure 1: Image matching on the Buffy Stickmen dataset, predicted by our randomized approach with latent variables. The problem is challenging since the dataset contains different episodes and people.

7 Future directions

The randomization of the latent space in the calculation of the margin is of high interest. Despite leading to a looser upper bound of the Gibbs decoder distortion, if one could control the statistical accuracy under this approach then one could obtain a fully polynomial-time evaluation of the objective function, even if |H| is exponential. Therefore, whether this approach is feasible, and under what technical conditions, remains as future work. The analysis of other non-Gaussian perturbation models from the computational and statistical viewpoints is also of interest. Finally, it would be interesting to analyze approximate inference for prediction on an independent test set.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 1716609-IIS.

References

[1] Altun, Y. and Hofmann, T.
[2003], 'Large margin methods for label sequence learning', European Conference on Speech Communication and Technology, pp. 145–152.

[2] Bennett, J. [1956], 'Determination of the number of independent parameters of a score matrix from the examination of rank orders', Psychometrika 21(4), 383–393.

[3] Bennett, J. and Hays, W. [1960], 'Multidimensional unfolding: Determining the dimensionality of ranked preference data', Psychometrika 25(1), 27–43.

[4] Choi, H., Meshi, O. and Srebro, N. [2016], Fast and scalable structural SVM with slack rescaling, in 'Artificial Intelligence and Statistics', pp. 667–675.

[5] Collins, M. [2004], Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods, in 'New Developments in Parsing Technology', Vol. 23, Kluwer Academic, pp. 19–55.

[6] Collins, M. and Roark, B. [2004], 'Incremental parsing with the perceptron algorithm', Annual Meeting of the Association for Computational Linguistics, pp. 111–118.

[7] Cortes, C., Kuznetsov, V., Mohri, M. and Yang, S. [2016], Structured prediction theory based on factor graph complexity, in 'Advances in Neural Information Processing Systems', pp. 2514–2522.

[8] Cover, T. [1967], 'The number of linearly inducible orderings of points in d-space', SIAM Journal on Applied Mathematics 15(2), 434–439.

[9] Gane, A., Hazan, T. and Jaakkola, T. [2014], Learning with maximum a-posteriori perturbation models, in 'Artificial Intelligence and Statistics', pp. 247–256.

[10] Hinton, G. E. [2012], A practical guide to training restricted Boltzmann machines, in 'Neural Networks: Tricks of the Trade', Springer, pp. 599–619.

[11] Honorio, J. and Jaakkola, T. [2016], 'Structured prediction: from Gaussian perturbations to linear-time principled algorithms', Uncertainty in Artificial Intelligence.

[12] Kulesza, A. and Pereira, F. [2007], 'Structured learning with approximate inference', Neural Information Processing Systems 20, 785–792.

[13] Lafferty, J., McCallum, A. and Pereira, F. C. [2001], 'Conditional random fields: Probabilistic models for segmenting and labeling sequence data', International Conference on Machine Learning.

[14] Li, M.-H., Lin, L., Wang, X.-L. and Liu, T. [2007], 'Protein–protein interaction site prediction based on conditional random fields', Bioinformatics 23(5), 597–604.

[15] London, B., Meshi, O. and Weller, A. [2016], 'Bounding the integrality distance of LP relaxations for structured prediction', NIPS Workshop on Optimization for Machine Learning.

[16] McAllester, D. [2007], Generalization bounds and consistency, in 'Predicting Structured Data', MIT Press, pp. 247–261.

[17] Meshi, O., Mahdavi, M., Weller, A. and Sontag, D. [2016], 'Train and test tightness of LP relaxations in structured prediction', International Conference on Machine Learning.

[18] Neylon, T. [2006], Sparse Solutions for Linear Prediction Problems, PhD thesis, New York University.

[19] Nowozin, S., Lampert, C. H. et al. [2011], 'Structured learning and prediction in computer vision', Foundations and Trends® in Computer Graphics and Vision 6(3–4), 185–365.

[20] Petrov, S. and Klein, D. [2008], Discriminative log-linear grammars with latent variables, in 'Advances in Neural Information Processing Systems', pp. 1153–1160.

[21] Ping, W., Liu, Q. and Ihler, A. [2014], 'Marginal structured SVM with hidden variables', International Conference on Machine Learning, pp. 190–198.

[22] Quattoni, A., Collins, M. and Darrell, T. [2005], Conditional random fields for object recognition, in 'Advances in Neural Information Processing Systems', pp. 1097–1104.

[23] Quattoni, A., Wang, S., Morency, L.-P., Collins, M. and Darrell, T. [2007], 'Hidden conditional random fields', IEEE Transactions on Pattern Analysis and Machine Intelligence 29(10).

[24] Sarawagi, S. and Gupta, R. [2008], Accurate max-margin training for structured output spaces, in 'Proceedings of the 25th International Conference on Machine Learning', ACM, pp. 888–895.

[25] Taskar, B., Guestrin, C. and Koller, D. [2003], 'Max-margin Markov networks', Neural Information Processing Systems 16, 25–32.

[26] Tsochantaridis, I., Joachims, T., Hofmann, T. and Altun, Y. [2005], 'Large margin methods for structured and interdependent output variables', Journal of Machine Learning Research 6(Sep), 1453–1484.

[27] Volkovs, M. and Zemel, R. S. [2012], Efficient sampling for bipartite matching problems, in 'Advances in Neural Information Processing Systems', pp. 1313–1321.

[28] Wang, H., Gould, S. and Koller, D. [2013], 'Discriminative learning with latent variables for cluttered indoor scene understanding', Communications of the ACM 56(4), 92–99.

[29] Wang, S. B., Quattoni, A., Morency, L.-P., Demirdjian, D. and Darrell, T. [2006], Hidden conditional random fields for gesture recognition, in 'Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on', Vol. 2, IEEE, pp. 1521–1527.

[30] Yu, C. and Joachims, T. [2009], 'Learning structural SVMs with latent variables', International Conference on Machine Learning, pp. 1169–1176.

[31] Yuille, A. L. and Rangarajan, A. [2002], The concave-convex procedure (CCCP), in 'Advances in Neural Information Processing Systems', pp. 1033–1040.

[32] Zhang, Y., Lei, T., Barzilay, R. and Jaakkola, T. [2014], 'Greed is good if randomized: New inference for dependency parsing', Empirical Methods in Natural Language Processing, pp. 1013–1024.

[33] Zhang, Y., Li, C., Barzilay, R. and Darwish, K. [2015], 'Randomized greedy inference for joint segmentation, POS tagging and dependency parsing', North American Chapter of the Association for Computational Linguistics, pp. 42–52.