{"title": "Distributionally Robust Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 8344, "page_last": 8355, "abstract": "In many structured prediction problems, complex relationships between variables are compactly defined using graphical structures. The most prevalent graphical prediction methods---probabilistic graphical models and large margin methods---have their own distinct strengths but also possess significant drawbacks. Conditional random fields (CRFs) are Fisher consistent, but they do not permit integration of customized loss metrics into their learning process. Large-margin models, such as structured support vector machines (SSVMs), have the flexibility to incorporate customized loss metrics, but lack Fisher consistency guarantees. We present adversarial graphical models (AGM), a distributionally robust approach for constructing a predictor that performs robustly for a class of data distributions defined using a graphical structure. Our approach enjoys both the flexibility of incorporating customized loss metrics into its design as well as the statistical guarantee of Fisher consistency. We present exact learning and prediction algorithms for AGM with time complexity similar to existing graphical models and show the practical benefits of our approach with experiments.", "full_text": "Distributionally Robust Graphical Models\n\nRizal Fathony, Ashkan Rezaei, Mohammad Ali Bashiri, Xinhua Zhang, Brian D. Ziebart\n\nDepartment of Computer Science, University of Illinois at Chicago\n\n{rfatho2, arezae4, mbashi4, zhangx, bziebart}@uic.edu\n\nChicago, IL 60607\n\nAbstract\n\nIn many structured prediction problems, complex relationships between variables\nare compactly de\ufb01ned using graphical structures. The most prevalent graphical pre-\ndiction methods\u2014probabilistic graphical models and large margin methods\u2014have\ntheir own distinct strengths but also possess signi\ufb01cant drawbacks. 
Conditional\nrandom \ufb01elds (CRFs) are Fisher consistent, but they do not permit integration of\ncustomized loss metrics into their learning process. Large-margin models, such as\nstructured support vector machines (SSVMs), have the \ufb02exibility to incorporate\ncustomized loss metrics, but lack Fisher consistency guarantees. We present adver-\nsarial graphical models (AGM), a distributionally robust approach for constructing\na predictor that performs robustly for a class of data distributions de\ufb01ned using\na graphical structure. Our approach enjoys both the \ufb02exibility of incorporating\ncustomized loss metrics into its design as well as the statistical guarantee of Fisher\nconsistency. We present exact learning and prediction algorithms for AGM with\ntime complexity similar to existing graphical models and show the practical bene\ufb01ts\nof our approach with experiments.\n\n1\n\nIntroduction\n\nLearning algorithms must consider complex relationships between variables to provide useful pre-\ndictions in many structured prediction problems. These complex relationships are often represented\nusing graphs to convey the independence assumptions being employed. For example, chain structures\nare used when modeling sequences like words and sentences [1], tree structures are popular for\nnatural language processing tasks that involve prediction for entities in parse trees [2\u20134], and lattice\nstructures are often used for modeling images [5]. The most prevalent methods for learning with\ngraphical structure are probabilistic graphical models (e.g., conditional random \ufb01elds (CRFs) [6]) and\nlarge margin models (e.g., structured support vector machines (SSVMs) [7] and maximum margin\nMarkov networks (M3Ns) [8]). 
Both types of models have unique advantages and disadvantages. CRFs with sufficiently expressive feature representations are consistent estimators of the marginal probabilities of variables in cliques of the graph [9], but are oblivious to the evaluative loss metric during training. On the other hand, SSVMs directly incorporate the evaluative loss metric in the training optimization, but lack consistency guarantees for multiclass settings [10, 11].
To address these limitations, we propose adversarial graphical models (AGM), a distributionally robust framework for leveraging graphical structure among variables that provides both the flexibility to incorporate customized loss metrics during training as well as the statistical guarantee of Fisher consistency for a chosen loss metric. Our approach is based on a robust adversarial formulation [12–14] that seeks a predictor that minimizes a loss metric in the worst case given the statistical summaries of the empirical distribution. We replace the empirical training data for evaluating our predictor with an adversary that is free to choose an evaluation distribution from the set of distributions that match the statistical summaries of empirical training data via moment matching constraints, as defined by a graphical structure.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Our AGM framework accepts a variety of loss metrics. A notable example that connects our framework to previous models is the logarithmic loss metric. The conditional random field (CRF) model [6] can be viewed as the robust predictor that best minimizes the logarithmic loss metric in the worst case subject to moment matching constraints. In this paper, we focus on a family of loss metrics that additively decompose over each variable and are defined based only on the label values of the predictor and evaluator. 
For example, the additive zero-one (Hamming) loss, the ordinal regression (absolute) loss, and cost-sensitive metrics fall into this family of loss metrics. We propose efficient exact algorithms for learning and prediction for graphical structures with low treewidth. Finally, we experimentally demonstrate the benefits of our framework compared with previous models on structured prediction tasks.

2 Background and related works

2.1 Structured prediction, Fisher consistency, and graphical models

The structured prediction task is to simultaneously predict correlated label variables y ∈ Y—often given input variables x ∈ X—to minimize a loss metric (e.g., loss : Y × Y → R) with respect to the true label values ỹ. This is in contrast with classification methods that predict a single variable y. Given a distribution over the multivariate labels, P(y), Fisher consistency is a desirable characteristic that requires a learning method to produce predictions ŷ that minimize the expected loss under this distribution, $\hat{y}^* \in \operatorname{argmin}_{\hat{y}} \mathbb{E}_{Y \sim P}[\mathrm{loss}(\hat{y}, Y)]$, under ideal learning conditions (i.e., trained from the true data distribution using a fully expressive feature representation).
To reduce the complexity of the mappings from X to Y being learned, independence assumptions and more restrictive representations are employed. In probabilistic graphical models, such as Bayesian networks [15] and random fields [6], these assumptions are represented using a graph over the variables. 
For graphs with arbitrary structure, inference (i.e., computing posterior probabilities or maximal value assignments) requires exponential time in terms of the number of variables [16]. However, this run-time complexity reduces to be polynomial in the number of predicted variables for graphs with low treewidth (e.g., chains, trees, cycles).

2.2 Conditional random fields as robust multivariate log loss minimization

Following ideas from robust Bayes decision theory [12, 13] and distributional robustness [17], the conditional random field [6] can be derived as a robust minimizer of the logarithmic loss subject to moment-matching constraints:

$$\min_{\hat{P}(\cdot|x)} \max_{\check{P}(\cdot|x)} \mathbb{E}_{X \sim \tilde{P};\, \check{Y}|X \sim \check{P}}\left[-\log \hat{P}(\check{Y}|X)\right] \quad \text{such that: } \mathbb{E}_{X \sim \tilde{P};\, \check{Y}|X \sim \check{P}}\left[\Phi(X, \check{Y})\right] = \mathbb{E}_{X,Y \sim \tilde{P}}\left[\Phi(X, Y)\right], \quad (1)$$

where Φ : X × Y → R^k are feature functions that typically decompose additively over subsets of variables. Under this perspective, the predictor P̂ seeks the conditional distribution that minimizes log loss against an adversary P̌ seeking to choose an evaluation distribution that approximates training data statistics, while otherwise maximizing log loss. As a result, the predictor is robust not only to the training sample P̃, but to all distributions with matching moment statistics [13].
The saddle point for Eq. (1) is obtained by the parametric conditional distribution $\hat{P}_\theta(y|x) = e^{\theta \cdot \Phi(x,y)} / \sum_{y' \in \mathcal{Y}} e^{\theta \cdot \Phi(x,y')}$ with parameters θ chosen by maximizing the data likelihood: $\operatorname{argmax}_\theta \mathbb{E}_{X,Y \sim \tilde{P}}\left[\log \hat{P}_\theta(Y|X)\right]$. The decomposition of the feature function into additive clique features, $\Phi_i(x, y) = \sum_{c \in C_i} \phi_{c,i}(x_c, y_c)$, can be represented graphically by connecting the variables within cliques with undirected edges. Dynamic programming algorithms (e.g., junction tree) allow the exact likelihood to be computed in run time that is exponential in the treewidth of the resulting graph [18].
Predictions for a particular loss metric are then made using the Bayes optimal prediction for the estimated distribution: $y^* = \operatorname{argmin}_y \mathbb{E}_{\hat{Y}|x \sim \hat{P}_\theta}[\mathrm{loss}(y, \hat{Y})]$. This two-stage prediction approach can create inefficiencies when learning from limited amounts of data since optimization may focus on accurately estimating probabilities in portions of the input space that have no impact on the decision boundaries of the Bayes optimal prediction. Rather than separating the prediction task from the learning process, we incorporate the evaluation loss metric of interest into the robust minimization formulation of Eq. (1) in this work.

2.3 Structured support vector machines

Our approach is most similar to structured support vector machines (SSVMs) [19] and related maximum margin methods [8], which also directly incorporate the evaluation loss metric into the training process. 
This is accomplished by minimizing a hinge loss convex surrogate:

$$\mathrm{hinge}_\theta(\tilde{y}) = \max_{y} \left\{ \mathrm{loss}(y, \tilde{y}) - \theta \cdot \big(\Phi(x, \tilde{y}) - \Phi(x, y)\big) \right\}, \quad (2)$$

where θ represents the model parameters, ỹ is the ground truth label, and Φ(x, y) is a feature function that decomposes additively over subsets of variables.
Using a clique-based graphical representation of the potential function, and assuming the loss metric also additively decomposes into the same clique-based representation, SSVMs have a computational complexity similar to probabilistic graphical models. Specifically, finding the value assignment y that maximizes this loss-augmented potential can be accomplished using dynamic programming in run time that is exponential in the graph treewidth [18].
A key weakness of support vector machines in general is their lack of Fisher consistency; there are distributions for multiclass prediction tasks for which the SVM will not learn a Bayes optimal predictor, even when the models are given access to the true distribution and sufficiently expressive features, due to the disconnection between the Crammer-Singer hinge loss surrogate [20] and the evaluation loss metric (i.e., the 0-1 loss in this case) [11]. In practice, if the empirical data behaves similarly to those distributions (e.g., P(y|x) has no majority y for a specific input x), the inconsistent model may perform poorly. This inconsistency extends to the structured prediction setting except in limited special cases [21]. We overcome these theoretical deficiencies in our approach by using an adversarial formulation that more closely aligns the training objective with the evaluation loss metric, while maintaining convexity.

2.4 Other related works

Distributionally robust learning. 
There has been a recent surge of interest in the machine learning community in developing distributionally robust learning algorithms. The proposed learning algorithms differ in the uncertainty sets used to provide robustness. Previous robust learning algorithms have been proposed under F-divergence measures (which include the popular KL-divergence and χ-divergence) [22–24], the Wasserstein metric uncertainty set [25–27], and the moment matching uncertainty set [17, 28]. Our robust adversarial learning approach differs from the previous approaches by focusing on robustness in terms of the conditional distribution P(y|x) instead of the joint distribution P(x, y). Our approach seeks a predictor that is robust to the worst-case conditional label probability under the moment matching constraints. We do not impose any robustness with respect to the training examples x.
Robust adversarial formulation. There have been some investigations into applying robust adversarial formulations to specific types of structured prediction problems (e.g., sequence tagging [29] and graph cuts [30]). Our work differs from them in two key aspects: we provide a general framework for graphical structures and any additive loss metric, as opposed to the specific structures and loss metrics (additive zero-one loss) previously considered; and we also propose a learning algorithm with polynomial run-time guarantees, in contrast with previously employed algorithms that use double/single oracle constraint generation techniques [31] lacking polynomial run-time guarantees.
Consistent methods. Consistent methods for structured prediction tasks have also received notable research interest. Ciliberto et al. [32] proposed a consistent regularization approach that maps the original structured prediction problem into a kernel Hilbert space and employs a multivariate regression on the Hilbert space. Osokin et al. 
[33] proposed a consistent quadratic surrogate for any structured prediction loss metric and provided a polynomial sample complexity analysis for the additive zero-one loss metric surrogate. Our work differs from this line of work in its focus on structure. We focus on the graphical structures that model interaction between labels, whereas the previous works focus on the structure of the loss metric itself.

3 Approach

We propose adversarial graphical models (AGMs) to better align structured prediction with evaluation loss metrics in settings where the structured interaction between labels is represented in a graph.

3.1 Formulations

We construct a predictor that best minimizes a loss metric for the worst-case evaluation distribution that (approximately) matches the statistical summaries of empirical training data. Our predictor is allowed to make a probabilistic prediction over all possible label assignments (denoted as P̂(ŷ|x)). However, instead of evaluating the prediction with empirical data (as commonly performed by empirical risk minimization formulations [34]), the predictor is pitted against an adversary that also makes a probabilistic prediction (denoted as P̌(y̌|x)). The adversary is constrained to select its conditional distributions to match the statistical summaries of the empirical training distribution (denoted as P̃) via moment matching constraints on the feature functions Φ.
Definition 1. 
The adversarial prediction method for structured prediction problems with graphical interaction between labels is:

$$\min_{\hat{P}(\hat{y}|x)} \max_{\check{P}(\check{y}|x)} \mathbb{E}_{X \sim \tilde{P};\, \hat{Y}|X \sim \hat{P};\, \check{Y}|X \sim \check{P}}\left[\mathrm{loss}(\hat{Y}, \check{Y})\right] \quad \text{such that: } \mathbb{E}_{X \sim \tilde{P};\, \check{Y}|X \sim \check{P}}\left[\Phi(X, \check{Y})\right] = \tilde{\Phi}, \quad (3)$$

where the vector of feature moments, $\tilde{\Phi} = \mathbb{E}_{X,Y \sim \tilde{P}}[\Phi(X, Y)]$, is measured from sample training data. The feature function Φ(X, Y) contains features that are additively decomposed over cliques in the graph, e.g., $\Phi(x, y) = \sum_c \phi(x, y_c)$.
This follows recent research in developing adversarial prediction for cost-sensitive classification [14] and multivariate performance metrics [35], and, more generally, distributionally robust decision making under moment-matching constraints [17]. In this paper, we focus on pairwise graphical structures where the interactions between labels are defined over the edges (and nodes) of the graph. We also restrict the loss metric to a family of metrics that additively decompose over each $y_i$ variable, i.e., $\mathrm{loss}(\hat{y}, \check{y}) = \sum_{i=1}^n \mathrm{loss}(\hat{y}_i, \check{y}_i)$. Directly solving the optimization in Eq. (3) is impractical for reasonably-sized problems since P(y|x) grows exponentially with the number of predicted variables. Instead, we utilize the method of Lagrange multipliers and the marginal formulation of the distributions of predictor and adversary to formulate a simpler dual optimization problem as stated in Theorem 1.
Theorem 1. 
For the adversarial structured prediction with pairwise graphical structure and an additive loss metric, solving the optimization in Definition 1 is equivalent to solving the following expectation of maximin problems over the node and edge marginal distributions parameterized by Lagrange multipliers θ:

$$\min_{\theta_e, \theta_v} \mathbb{E}_{X,Y \sim \tilde{P}} \max_{\check{P}(\check{y}|x)} \min_{\hat{P}(\hat{y}|x)} \Big[ \sum_{i=1}^n \sum_{\hat{y}_i, \check{y}_i} \hat{P}(\hat{y}_i|x)\, \check{P}(\check{y}_i|x)\, \mathrm{loss}(\hat{y}_i, \check{y}_i) + \sum_{(i,j) \in E} \sum_{\check{y}_i, \check{y}_j} \check{P}(\check{y}_i, \check{y}_j|x) \big[\theta_e \cdot \phi(x, \check{y}_i, \check{y}_j)\big] - \sum_{(i,j) \in E} \theta_e \cdot \phi(x, y_i, y_j) + \sum_{i} \sum_{\check{y}_i} \check{P}(\check{y}_i|x) \big[\theta_v \cdot \phi(x, \check{y}_i)\big] - \sum_{i=1}^n \theta_v \cdot \phi(x, y_i) \Big], \quad (4)$$

where φ(x, y_i) is the node feature function for node i, φ(x, y_i, y_j) is the edge feature function for the edge connecting nodes i and j, E is the set of edges in the graphical structure, and θ_v and θ_e are the Lagrange dual variables for the moment matching constraints corresponding to the node and edge features, respectively. The optimization objective depends on the predictor's probabilistic prediction P̂(ŷ|x) only through its node marginal probabilities P̂(ŷ_i|x). Similarly, the objective depends on the adversary's probabilistic prediction P̌(y̌|x) only through its node and edge marginal probabilities, i.e., P̌(y̌_i|x) and P̌(y̌_i, y̌_j|x).

Proof sketch. From Eq. 
(3) we use the method of Lagrange multipliers to introduce dual variables θ_v and θ_e that represent the moment-matching constraints over node and edge features, respectively. Using the strong duality theorem for convex-concave saddle point problems [36, 37], we swap the optimization order of θ, P̂(ŷ|x), and P̌(y̌|x) as in Eq. (4). Then, using the additive property of the loss metric and feature functions, the optimization over P̂(ŷ|x) and P̌(y̌|x) can be transformed into an optimization over their respective node and edge marginal distributions.1

Figure 1: An example tree structure with five nodes and four edges with the corresponding marginal probabilities for predictor and adversary (a); and the matrix and vector notations of the probabilities (b). Note that we introduce a dummy variable Q_{0,1} to match the constraints in Eq. (5).

Note that the optimization in Eq. (4) over the node and edge marginal distributions resembles the optimization of CRFs [38]. In terms of computational complexity, this means that for a general graphical structure, the optimization above may be intractable. We focus on families of graphical structures in which the optimization is known to be tractable. In the next subsection, we begin with the case of tree-structured graphical models and then proceed with the case of graphical models with low treewidth. In both cases, we formulate the corresponding efficient learning algorithms.

3.2 Optimization

We first introduce our vector and matrix notations for AGM optimization. Without loss of generality, we assume the number of class labels k to be the same for all predicted variables y_i, ∀i ∈ {1, . . . , n}. Let p_i be a vector of length k whose a-th element contains P̂(ŷ_i = a|x), and let Q_{i,j} be a k-by-k matrix whose (a, b)-th cell stores P̌(y̌_i = a, y̌_j = b|x). 
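The vector and matrix notation just introduced can be illustrated with a small numerical sketch (a toy example; all probability values below are made up for illustration, and the variable names are our own):

```python
import numpy as np

k = 3  # number of classes per node (same k for every node, as in the text)

# Predictor's node marginal p_i: p_i[a] = P_hat(y_i = a | x)
p_i = np.array([0.2, 0.5, 0.3])

# Adversary's edge marginal Q_ij: Q_ij[a, b] = P_check(y_i = a, y_j = b | x)
Q_ij = np.array([[0.10, 0.05, 0.05],
                 [0.05, 0.30, 0.05],
                 [0.05, 0.05, 0.30]])

# Summing out one variable of an edge marginal recovers a node marginal
r_i = Q_ij.sum(axis=1)  # adversary's node marginal for node i
r_j = Q_ij.sum(axis=0)  # adversary's node marginal for node j

# Additive zero-one (Hamming) loss matrix: L[a, b] = loss(pred = a, true = b)
L = 1.0 - np.eye(k)

# One bilinear term of the objective: expected loss p_i^T L r_i
expected_loss = p_i @ L @ r_i
print(round(float(expected_loss), 2))  # 0.64
```

For the ordinal (absolute) loss of the experiments, one would instead use L[a, b] = |a - b|; only the loss matrix changes, not the marginal-based structure of the objective.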
We also use vector and matrix notation to represent the ground truth label by letting z_i be a one-hot vector whose a-th element $z_i^{(a)} = 1$ if y_i = a and 0 otherwise, and letting Z_{i,j} be a one-hot matrix whose (a, b)-th cell $Z_{i,j}^{(a,b)} = 1$ if y_i = a ∧ y_j = b and 0 otherwise. For each node feature φ_l(x, y_i), we denote w_{i,l} as a length-k vector whose a-th element contains the value of φ_l(x, y_i = a). Similarly, for each edge feature φ_l(x, y_i, y_j), we denote W_{i,j,l} as a k-by-k matrix whose (a, b)-th cell contains the value of φ_l(x, y_i = a, y_j = b). For a pairwise graphical model with tree structure, we rewrite Eq. (4) using our vector and matrix notation with local marginal consistency constraints as follows:

$$\min_{\theta_e, \theta_v} \mathbb{E}_{X,Y \sim \tilde{P}} \max_{Q \in \Delta} \min_{p \in \Delta} \sum_{i=1}^n \Big[ p_i^\top L_i (Q_{pt(i);i}^\top \mathbf{1}) + \big\langle Q_{pt(i);i} - Z_{pt(i);i},\; \textstyle\sum_l \theta_e^{(l)} W_{pt(i);i;l} \big\rangle + (Q_{pt(i);i}^\top \mathbf{1} - z_i)^\top \big(\textstyle\sum_l \theta_v^{(l)} w_{i;l}\big) \Big] \quad (5)$$
$$\text{subject to: } Q_{pt(pt(i));pt(i)}^\top \mathbf{1} = Q_{pt(i);i} \mathbf{1}, \; \forall i \in \{1, \ldots, n\},$$

where pt(i) indicates the parent of node i in the tree structure, L_i stores the loss matrix corresponding to the portion of the loss metric for node i, i.e., $L_i^{(a,b)} = \mathrm{loss}(\hat{y}_i = a, \check{y}_i = b)$, and ⟨·,·⟩ denotes the Frobenius inner product between two matrices, i.e., $\langle A, B \rangle = \sum_{i,j} A_{i,j} B_{i,j}$. Note that we also apply probability simplex constraints (Δ) to each Q_{pt(i);i} and p_i. Figure 1 shows an example tree structure with its marginal probabilities and the matrix notation of the probabilities.

1More detailed proofs for Theorems 1 and 2 are available in the supplementary material.

3.2.1 Learning algorithm

We first focus on solving the inner minimax optimization of Eq. (5). To simplify our notation, we denote the edge potentials $B_{pt(i);i} = \sum_l \theta_e^{(l)} W_{pt(i);i;l}$ and the node potentials $b_i = \sum_l \theta_v^{(l)} w_{i;l}$. We then rewrite the inner optimization of Eq. (5) as:

$$\max_{Q \in \Delta} \min_{p \in \Delta} \sum_{i=1}^n \Big[ p_i^\top L_i (Q_{pt(i);i}^\top \mathbf{1}) + \big\langle Q_{pt(i);i}, B_{pt(i);i} \big\rangle + (Q_{pt(i);i}^\top \mathbf{1})^\top b_i \Big] \quad \text{subject to: } Q_{pt(pt(i));pt(i)}^\top \mathbf{1} = Q_{pt(i);i} \mathbf{1}, \; \forall i \in \{1, \ldots, n\}. \quad (6)$$

To solve the optimization above, we use a dual decomposition technique [39, 40] that decomposes the dual version of the optimization problem into several sub-problems that can be solved independently. By introducing the Lagrange variable u for the local marginal consistency constraints, we formulate an equivalent dual unconstrained optimization problem as shown in Theorem 2.
Theorem 2. The constrained optimization in Eq. (6) is equivalent to an unconstrained Lagrange dual problem with an inner optimization that can be solved independently for each node as follows:

$$\min_{u} \sum_{i=1}^n \bigg[ \max_{Q_{pt(i);i} \in \Delta} \Big\langle Q_{pt(i);i},\; B_{pt(i);i} + \mathbf{1} b_i^\top - u_i \mathbf{1}^\top + \textstyle\sum_{k \in ch(i)} \mathbf{1} u_k^\top \Big\rangle + \min_{p_i \in \Delta} p_i^\top L_i (Q_{pt(i);i}^\top \mathbf{1}) \bigg], \quad (7)$$

where u_i is the Lagrange dual variable associated with the marginal constraint $Q_{pt(pt(i));pt(i)}^\top \mathbf{1} = Q_{pt(i);i} \mathbf{1}$, and ch(i) represents the children of node i.

Proof sketch. From Eq. (6) we use the method of Lagrange multipliers to introduce dual variables u. The resulting dual optimization admits dual decomposability, where we can rearrange the variables into independent optimizations for each node as in Eq. (7).

We denote the matrix $A_{pt(i);i} \triangleq B_{pt(i);i} + \mathbf{1} b_i^\top - u_i \mathbf{1}^\top + \sum_{k \in ch(i)} \mathbf{1} u_k^\top$ to simplify the inner optimization in Eq. (7). Let us define $r_i \triangleq Q_{pt(i);i}^\top \mathbf{1}$ and let a_i be the column-wise maximum of the matrix A_{pt(i);i}, i.e., $a_i^{(l)} = \max_k A_{pt(i);i}^{(k,l)}$. Given the value of u, each of the inner optimizations in Eq. 
(7) can be equivalently solved in terms of our newly defined variables r_i and a_i as follows:

$$\max_{r_i \in \Delta} \Big[ a_i^\top r_i + \min_{p_i \in \Delta} p_i^\top L_i r_i \Big]. \quad (8)$$

Note that this resembles the optimization in a standard adversarial multiclass classification problem [14] with L_i as the loss matrix and a_i as the class-based potential vector. For an arbitrary loss matrix, Eq. (8) can be solved as a linear program. However, this is somewhat slow, and more efficient algorithms have been studied for the zero-one and ordinal regression metrics [41, 42]. Given the solution of this inner optimization, we use a sub-gradient based optimization to find the optimal Lagrange dual variables u*.
To recover our original variables for the adversary's marginal distributions Q*_{pt(i);i} given the optimal dual variables u*, we use the following steps. First, we use u* and Eq. (8) to compute the value of the node marginal probability r*_i (i.e., the adversary's node probability). With this additional information, Eq. (6) can be solved independently for each Q_{pt(i);i} to obtain the optimal Q*_{pt(i);i} as follows:

$$Q_{pt(i);i}^* = \operatorname{argmax}_{Q_{pt(i);i} \in \Delta} \big\langle Q_{pt(i);i}, B_{pt(i);i} \big\rangle \quad \text{subject to: } Q_{pt(i);i}^\top \mathbf{1} = r_i^*, \;\; Q_{pt(i);i} \mathbf{1} = r_{pt(i)}^*. \quad (9)$$

Note that the optimization above resembles an optimal transport problem over two discrete distributions [43] with cost matrix −B_{pt(i);i}. This optimal transport problem can be solved using a linear program solver or a more sophisticated solver (e.g., using Sinkhorn distances [44]).
For our overall learning algorithm, we use the optimal adversary's marginal distributions Q*_{pt(i);i} to compute the sub-differential of the AGM formulation (Eq. (5)) with respect to θ_v and θ_e. 
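The optimal transport recovery of Q* in Eq. (9) can be sketched with an entropy-regularized Sinkhorn solver (a sketch only: the regularization strength eps, iteration count, and all names below are our own assumptions, not details from the paper):

```python
import numpy as np

def sinkhorn_recover_Q(B, r_child, r_parent, eps=0.05, n_iters=2000):
    # Approximately solve  max_Q <Q, B>  s.t.  Q^T 1 = r_child, Q 1 = r_parent.
    # Maximizing <Q, B> is optimal transport with cost matrix -B, so the
    # entropic Gibbs kernel is exp(B / eps); smaller eps approaches the LP.
    K = np.exp(B / eps)
    u = np.ones_like(r_parent)
    v = np.ones_like(r_child)
    for _ in range(n_iters):
        u = r_parent / (K @ v)   # scale rows to match the parent marginal
        v = r_child / (K.T @ u)  # scale columns to match the child marginal
    return u[:, None] * K * v[None, :]

# Toy edge potentials and target node marginals (made-up numbers)
B = np.array([[0.5, -0.2],
              [0.1, 0.3]])
r_parent = np.array([0.6, 0.4])  # row sums of Q
r_child = np.array([0.3, 0.7])   # column sums of Q

Q = sinkhorn_recover_Q(B, r_child, r_parent)
print(np.round(Q, 3))
```

The returned Q is nonnegative with the prescribed row and column sums, matching the feasible set of Eq. (9) up to the entropic approximation.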
The sub-differential for $\theta_v^{(l)}$ includes the expected node feature difference $\mathbb{E}_{X,Y \sim \tilde{P}} \sum_{i=1}^n (Q_{pt(i);i}^{*\top} \mathbf{1} - z_i)^\top w_{i;l}$, whereas the sub-differential for $\theta_e^{(l)}$ includes the expected edge feature difference $\mathbb{E}_{X,Y \sim \tilde{P}} \sum_{i=1}^n \big\langle Q_{pt(i);i}^* - Z_{pt(i);i}, W_{pt(i);i;l} \big\rangle$. Using this sub-differential information, we employ a stochastic sub-gradient based algorithm to obtain the optimal θ*_v and θ*_e.

3.2.2 Prediction algorithms

We propose two different prediction schemes: probabilistic and non-probabilistic prediction.
Probabilistic prediction. Our probabilistic prediction is based on the predictor's label probability distribution in the adversarial prediction formulation. Given fixed values of θ_v and θ_e, we solve a minimax optimization similar to Eq. (5) by flipping the order of the predictor and adversary distributions as follows:

$$\min_{p \in \Delta} \max_{Q \in \Delta} \sum_{i=1}^n \Big[ p_i^\top L_i (Q_{pt(i);i}^\top \mathbf{1}) + \big\langle Q_{pt(i);i}, \textstyle\sum_l \theta_e^{(l)} W_{pt(i);i;l} \big\rangle + (Q_{pt(i);i}^\top \mathbf{1})^\top \big(\textstyle\sum_l \theta_v^{(l)} w_{i;l}\big) \Big] \quad \text{subject to: } Q_{pt(pt(i));pt(i)}^\top \mathbf{1} = Q_{pt(i);i} \mathbf{1}, \; \forall i \in \{1, \ldots, n\}. \quad (10)$$

To solve the inner maximization over Q we use a similar technique as in MAP inference for CRFs. We then use a projected gradient optimization technique to solve the outer minimization over p, with a technique for projecting onto the probability simplex [45].
Non-probabilistic prediction. Our non-probabilistic prediction scheme is similar to SSVM's prediction algorithm. In this scheme, we find ŷ that maximizes the potential value, i.e., ŷ = argmax_y f(x, y), where f(x, y) = θᵀΦ(x, y). 
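For a chain structure, this arg-max decoding can be sketched with a standard Viterbi-style max-sum recursion (a sketch; the potential tables below are made-up toy numbers, not the paper's learned model):

```python
import numpy as np

def viterbi_decode(node_pot, edge_pot):
    # MAP assignment for a chain: maximize
    # sum_t node_pot[t, y_t] + sum_t edge_pot[y_{t-1}, y_t].
    n, k = node_pot.shape
    score = node_pot[0].copy()
    backptr = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + edge_pot + node_pot[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    y = np.zeros(n, dtype=int)
    y[-1] = int(score.argmax())
    for t in range(n - 1, 0, -1):
        y[t - 1] = backptr[t, y[t]]
    return y, float(score.max())

# Toy example: 3 nodes, 2 labels
node_pot = np.array([[0.0, 1.0],
                     [0.0, 0.0],
                     [1.0, 0.0]])
edge_pot = np.array([[0.6, 0.0],
                     [0.0, 0.4]])

y_map, best = viterbi_decode(node_pot, edge_pot)
print(y_map, best)  # [1 0 0] 2.6
```

Each step keeps, for every label of the current node, the best-scoring prefix; the backpointers then recover the maximizing assignment in a single backward pass, giving the O(nk^2) cost cited below.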
This prediction scheme is faster than the probabilistic scheme since we only need a single run of a Viterbi-like algorithm for tree structures.

3.2.3 Runtime analysis

Each stochastic update in our algorithm involves finding the optimal u and recovering the optimal Q to be used in a sub-gradient update. Each iteration of a sub-gradient based optimization to solve for u costs O(n · c(L)) time, where n is the number of nodes and c(L) is the cost of solving the optimization in Eq. (8) for the loss matrix L. Recovering all of the adversary's marginal distributions Q_{pt(i);i} using a fast Sinkhorn distance solver has an empirical complexity of O(nk^2), where k is the number of classes [44]. The total running time of our method depends on the loss metric we use. For example, if the loss metric is the additive zero-one loss, the total complexity of one stochastic gradient update is O(nlk log k + nk^2) time, where l is the number of iterations needed to obtain the optimal u and O(k log k) time is the cost of solving Eq. (8) for the zero-one loss [41]. In practice, we find the average value of l to be relatively small. This runtime complexity is competitive with the CRF, which requires O(nk^2) time to perform message-passing over a tree to compute the marginal distributions for each parameter update, and also with the structured SVM, where each iteration requires computing the most violated constraint, which also costs O(nk^2) time for running a Viterbi-like algorithm over a tree structure.

3.2.4 Learning algorithm for graphical structures with low treewidth

Our algorithm for tree-based graphs can be easily extended to the case of graphical structures with low treewidth. Similar to the case of the junction tree algorithm for probabilistic graphical models, we first construct a junction tree representation for the graphical structure. We then solve a similar optimization as in Eq. (5) on the junction tree. 
In this case, the time complexity of one stochastic gradient update of the algorithm is O(n l w k^{w+1} log k + n k^{2(w+1)}) time for the optimization with an additive zero-one loss metric, where n is the number of cliques in the junction tree, k is the number of classes, l is the number of iterations in the inner optimization, and w is the treewidth of the graph. This time complexity is competitive with the time complexities of the CRF and SSVM, which are also exponential in the treewidth of the graph.

3.3 Fisher consistency analysis

A key theoretical advantage of our approach over the structured SVM is that it provides Fisher consistency. This guarantees that under the true distribution P(x, y), the learning algorithm yields a Bayes optimal prediction with respect to the loss metric [10, 11]. In this setting, the learning algorithm is allowed to optimize over all measurable functions, or similarly, it has a feature representation of unlimited richness. We establish the Fisher consistency of our AGM approach in Theorem 3.
Theorem 3. The AGM approach is Fisher consistent for all additive loss metrics.

Proof. As established in Theorem 1, pairwise marginal probabilities are sufficient statistics of the adversary's distribution. Unlimited access to an arbitrarily rich feature representation constrains the adversary's distribution in Eq. (3) to match the marginal probabilities of the true distribution, making the optimization in Eq. (3) equivalent to $\min_{\hat{y}} \mathbb{E}_{X,Y \sim P}\left[\mathrm{loss}(\hat{y}, Y)\right]$, which is the Bayes optimal prediction for the loss metric.

4 Experimental evaluation

To evaluate our approach, we apply AGM to two different tasks: predicting emotion intensity from a sequence of images, and labeling entities in parse trees with semantic roles. 
We show the benefit of our method compared with a conditional random field (CRF) and a structured SVM (SSVM).

4.1 Facial emotion intensity prediction

Table 1: The average loss metrics for the emotion intensity prediction. Bold numbers indicate the best results or those not significantly worse than the best (Wilcoxon signed-rank test with α = 0.05).

Loss metrics            AGM    CRF    SSVM
zero-one, unweighted    0.34   0.32   0.37
absolute, unweighted    0.33   0.34   0.40
squared, unweighted     0.38   0.38   0.40
zero-one, weighted      0.28   0.32   0.29
absolute, weighted      0.29   0.36   0.29
squared, weighted       0.40   0.36   0.33
average                 0.33   0.35   0.35
# bold                  4      2      2

We evaluate our approach on the facial emotion intensity prediction task [46]. Given a sequence of facial images, the task is to predict the emotion intensity for each individual image. The emotion intensity labels are categorized into three ordinal categories, neutral < increasing < apex, reflecting the degree of intensity. The dataset contains 167 sequences collected from 100 subjects, covering six types of basic emotions (anger, disgust, fear, happiness, sadness, and surprise). For the features used in prediction, we follow an existing feature extraction procedure [46] that uses Haar-like features and the PCA algorithm to reduce the feature dimensionality.

In our experimental setup, we combine the data from all six different emotions and focus on predicting the ordinal category of emotion intensity. From the whole set of 167 sequences, we construct 20 different random splits of the training and testing datasets, with 120 training sequences and 47 testing sequences.
We use the training set of the first split to perform cross-validation to obtain the best regularization parameters, and then use the best parameters in the evaluation phase for all 20 different splits of the dataset.

In the evaluation, we use six different loss metrics. The first three metrics are the averages of the zero-one, absolute, and squared loss metrics over the nodes in the graph (where we assign the label values neutral = 1, increasing = 2, and apex = 3). The other three metrics are weighted versions of the zero-one, absolute, and squared loss metrics. These weighted variants shift the focus of the prediction task by emphasizing particular nodes in the graph. In this experiment, we set the weight of each node to its position in the sequence, so that we focus more on the later nodes in the sequences.

We compare our method with CRF and SSVM models. Both AGM and SSVM can incorporate the task's customized loss metric into the learning process. Prediction for AGM and SSVM is done by taking an arg-max over potential values, i.e., argmax_y f(x, y), where f(x, y) = θ · Φ(x, y). For the CRF, the training step aims to model the conditional probability P̂_θ(y|x). The CRF's predictions are computed using the Bayes optimal prediction with respect to the loss metric and the CRF's conditional probability, i.e., argmin_y E_{Ŷ|x ∼ P̂_θ}[loss(y, Ŷ)].

We report the loss metrics averaged over the dataset splits in Table 1. We highlight each result that is either the best result or not significantly worse than the best result (using a Wilcoxon signed-rank test with α = 0.05).
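The six metrics above can be read concretely as follows; this is a sketch of one plausible implementation, where the normalization by the total weight is our assumption (the paper only states that a node's weight is its position in the sequence):

```python
import numpy as np

def seq_loss(y_hat, y, base, weighted=False):
    """Average per-node loss over a sequence with ordinal labels
    neutral=1, increasing=2, apex=3. With weighted=True, node t
    (1-indexed) gets weight t, emphasizing later nodes."""
    y_hat, y = np.asarray(y_hat, float), np.asarray(y, float)
    if base == "zero-one":
        node = (y_hat != y).astype(float)
    elif base == "absolute":
        node = np.abs(y_hat - y)
    else:                                   # "squared"
        node = (y_hat - y) ** 2
    w = np.arange(1, len(y) + 1) if weighted else np.ones(len(y))
    return float(np.sum(w * node) / np.sum(w))

y_true = [1, 1, 2, 3]
y_pred = [1, 2, 2, 2]
print(seq_loss(y_pred, y_true, "zero-one"))        # 2 mistakes / 4 nodes = 0.5
print(seq_loss(y_pred, y_true, "absolute", True))  # (2*1 + 4*1) / 10 = 0.6
```

Note how the weighted variant penalizes the late mistake at position 4 more heavily than the early one at position 2.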
The results show that our method significantly outperforms the CRF in three cases (absolute, weighted zero-one, and weighted absolute losses) and statistically ties with the CRF in one case (squared loss), while being outperformed by the CRF in only one case (zero-one loss). AGM also outperforms the SSVM in three cases (absolute, squared, and weighted zero-one losses) and statistically ties with the SSVM in one case (weighted absolute loss), while being outperformed by the SSVM in only one case (weighted squared loss). Overall, AGM maintains advantages over CRFs and SSVMs in both the overall average loss and the number of "indistinguishably best" performances across all cases. These results may reflect the theoretical benefit that AGM has over the CRF and SSVM mentioned in Section 3 when learning from noisy labels.

4.2 Semantic role labeling

Figure 2: Example of a syntax tree with semantic role labels as bold superscripts. The dotted and dashed lines show the pruned edges from the tree. The original label AM-MOD is among class R in our experimental setup.

We evaluate the performance of our algorithm on the semantic role labeling task for the CoNLL 2005 dataset [47]. Given a sentence and its syntactic parse tree as input, the task is to recognize the semantic role of each constituent of the sentence as propositions expressed by some target verbs in the sentence. There are a total of 36 semantic roles, grouped into the following types: numbered arguments, adjuncts, references to numbered and adjunct arguments, continuations of each class type, and the verb. We prune the syntactic trees according to Xue and Palmer [48], i.e., we only include siblings of the nodes on the path from the verb (V) to the root, and also the immediate children of a node if it is a prepositional phrase (PP).
Following the setup used by Cohn and Blunsom [49], we extract the same syntactic and contextual features and label non-argument constituents and children nodes of arguments as "outside" (O). Additionally, in our experiment we simplify the prediction task by reducing the number of labels. Specifically, we choose the three most common labels in the WSJ test dataset, i.e., A0, A1, and A2, together with their references R-A0, R-A1, and R-A2, and we combine the rest of the classes into one separate class R. Thus, together with outside (O) and verb (V), we have a total of nine classes in our experiment.

In the evaluation, we use a cost-sensitive loss matrix that reflects the importance of each label. We use the same cost-sensitive loss matrix to evaluate the predictions of all nodes in the graph. The cost-sensitive loss matrix is constructed by picking a random order of the class labels and assigning an ordinal loss based on the order of the labels. We compare the average cost-sensitive loss of our method with the CRF and the SSVM in Table 2. As the table shows, our result is competitive with the SSVM while maintaining an advantage over the CRF.
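The loss-matrix construction above can be sketched as follows. The paper does not spell out the exact assignment, so this is one plausible reading: draw a random order of the labels and charge the absolute difference of ranks in that order; `ordinal_cost_matrix` and the `seed` parameter are our names, not the paper's.

```python
import numpy as np

def ordinal_cost_matrix(labels, seed=None):
    """Pick a random order of the class labels and charge
    |rank(y_hat) - rank(y)|: an ordinal loss in the drawn order,
    with zero cost on the diagonal."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))    # order[i] = label index with rank i
    rank = np.empty(len(labels), dtype=int)
    rank[order] = np.arange(len(labels))    # rank of each label in the drawn order
    return np.abs(rank[:, None] - rank[None, :]).astype(float)

labels = ["O", "V", "A0", "A1", "A2", "R-A0", "R-A1", "R-A2", "R"]
L = ordinal_cost_matrix(labels, seed=0)
print(L.shape, L.max())   # (9, 9) 8.0
```

By construction the matrix is symmetric with zero diagonal, so correct predictions incur no cost and confusions between labels far apart in the drawn order cost more.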
This experiment shows that incorporating customized losses into the training process of learning algorithms is important for some structured prediction tasks. Both AGM and the SSVM are designed to align their learning algorithms with the customized loss metric, whereas the CRF can only utilize the loss metric information in its prediction step.

Table 2: The average loss metrics for the semantic role labeling task.

Loss metrics           AGM    CRF    SSVM
cost-sensitive loss    0.14   0.19   0.14

5 Conclusion

In this paper, we introduced adversarial graphical models (AGM), a robust approach to structured prediction that possesses the main benefits of existing methods: (1) it guarantees the same Fisher consistency possessed by CRFs [6]; (2) it aligns the target loss metric with the learning objective, as in maximum margin methods [19, 8]; and (3) its computational run time is primarily shaped by the graph treewidth, similar to both families of graphical modeling approaches. Our experimental results demonstrate the benefits of this approach on structured prediction tasks with low treewidth.

For more complex graphical structures with high treewidth, our proposed algorithm may not be efficient. As with CRFs and SSVMs, approximation algorithms may be needed to solve the optimization in AGM formulations for these structures. In future work, we plan to investigate optimization techniques and applicable approximation algorithms for general graphical structures.

Acknowledgement. This work was supported, in part, by the National Science Foundation under Grant No. 1652530, and by the Future of Life Institute (futureoflife.org) FLI-RFP-AI1 program.

References

[1] Christopher D. Manning and Hinrich Schütze. Foundations of statistical natural language processing.
MIT Press, 1999.

[2] Trevor Cohn and Philip Blunsom. Semantic role labelling with tree conditional random fields. In Proceedings of the Ninth Conference on Computational Natural Language Learning, pages 169–172. Association for Computational Linguistics, 2005.

[3] Jun Hatori, Yusuke Miyao, and Jun'ichi Tsujii. Word sense disambiguation for all words using tree-structured conditional random fields. In Coling 2008: Companion Volume: Posters, pages 43–46, 2008.

[4] Ali Sadeghian, Laksshman Sundaram, D. Wang, W. Hamilton, Karl Branting, and Craig Pfeifer. Semantic edge labeling over legal citation graphs. In Proceedings of the Workshop on Legal Text, Document, and Corpus Analytics (LTDCA-2016), pages 70–75, 2016.

[5] Sebastian Nowozin, Christoph H. Lampert, et al. Structured learning and prediction in computer vision. Foundations and Trends® in Computer Graphics and Vision, 6(3–4):185–365, 2011.

[6] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, volume 951, pages 282–289, 2001.

[7] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.

[8] Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. Learning structured prediction models: A large margin approach. In Proceedings of the 22nd International Conference on Machine Learning, pages 896–903. ACM, 2005.

[9] Stan Z. Li. Markov random field modeling in image analysis. Springer Science & Business Media, 2009.

[10] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. The Journal of Machine Learning Research, 8:1007–1025, 2007.

[11] Yufeng Liu.
Fisher consistency of multicategory support vector machines. In International Conference on Artificial Intelligence and Statistics, pages 291–298, 2007.

[12] Flemming Topsøe. Information theoretical optimization techniques. Kybernetika, 15(1):8–27, 1979.

[13] Peter D. Grünwald and A. Phillip Dawid. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32:1367–1433, 2004.

[14] Kaiser Asif, Wei Xing, Sima Behpour, and Brian D. Ziebart. Adversarial cost-sensitive classification. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2015.

[15] Judea Pearl. Bayesian networks: A model of self-activated memory for evidential reasoning. In Proceedings of the 7th Conference of the Cognitive Science Society, 1985.

[16] Gregory F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2–3):393–405, 1990.

[17] Erick Delage and Yinyu Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010.

[18] Robert G. Cowell, Philip Dawid, Steffen L. Lauritzen, and David J. Spiegelhalter. Probabilistic networks and expert systems: Exact computational methods for Bayesian networks. Springer Science & Business Media, 2006.

[19] Thorsten Joachims. A support vector method for multivariate performance measures. In Proceedings of the International Conference on Machine Learning, pages 377–384, 2005.

[20] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265–292, 2002.

[21] Tong Zhang.
Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5(Oct):1225–1251, 2004.

[22] Hongseok Namkoong and John C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems, pages 2208–2216, 2016.

[23] Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971–2980, 2017.

[24] Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 1929–1938. PMLR, 2018.

[25] Soroosh Shafieezadeh-Abadeh, Peyman Mohajerin Esfahani, and Daniel Kuhn. Distributionally robust logistic regression. In Advances in Neural Information Processing Systems, pages 1576–1584, 2015.

[26] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1–2):115–166, 2018.

[27] Ruidi Chen and Ioannis Ch. Paschalidis. A robust learning approach for regression models based on distributionally robust optimization. Journal of Machine Learning Research, 19(13):1–48, 2018.

[28] Roi Livni, Koby Crammer, and Amir Globerson. A simple geometric interpretation of SVM using stochastic adversaries. In Artificial Intelligence and Statistics, pages 722–730, 2012.

[29] Jia Li, Kaiser Asif, Hong Wang, Brian D. Ziebart, and Tanya Y. Berger-Wolf. Adversarial sequence tagging. In IJCAI, pages 1690–1696, 2016.

[30] Sima Behpour, Wei Xing, and Brian Ziebart. ARC: Adversarial robust cuts for semi-supervised and multi-label classification.
In AAAI Conference on Artificial Intelligence, 2018.

[31] H. Brendan McMahan, Geoffrey J. Gordon, and Avrim Blum. Planning in the presence of cost functions controlled by an adversary. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 536–543, 2003.

[32] Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi. A consistent regularization approach for structured prediction. In Advances in Neural Information Processing Systems, pages 4412–4420, 2016.

[33] Anton Osokin, Francis Bach, and Simon Lacoste-Julien. On structured prediction theory with calibrated convex surrogate losses. In Advances in Neural Information Processing Systems, pages 301–312, 2017.

[34] Vladimir Naumovich Vapnik. Statistical learning theory, volume 1. Wiley, New York, 1998.

[35] Hong Wang, Wei Xing, Kaiser Asif, and Brian Ziebart. Adversarial prediction games for multivariate losses. In Advances in Neural Information Processing Systems, 2015.

[36] John von Neumann and Oskar Morgenstern. Theory of games and economic behavior. Bull. Amer. Math. Soc., 51(7):498–504, 1945.

[37] Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958.

[38] Charles Sutton, Andrew McCallum, et al. An introduction to conditional random fields. Foundations and Trends® in Machine Learning, 4(4):267–373, 2012.

[39] Stephen Boyd, Lin Xiao, Almir Mutapcic, and Jacob Mattingley. Notes on decomposition methods. Notes for EE364B, 2008.

[40] David Sontag, Amir Globerson, and Tommi Jaakkola. Introduction to dual decomposition for inference. In Optimization for Machine Learning. MIT Press, 2011.

[41] Rizal Fathony, Anqi Liu, Kaiser Asif, and Brian Ziebart. Adversarial multiclass classification: A risk minimization perspective.
In Advances in Neural Information Processing Systems 29 (NIPS), pages 559–567, 2016.

[42] Rizal Fathony, Mohammad Ali Bashiri, and Brian Ziebart. Adversarial surrogate losses for ordinal regression. In Advances in Neural Information Processing Systems 30 (NIPS), pages 563–573, 2017.

[43] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

[44] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

[45] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the International Conference on Machine Learning, pages 272–279. ACM, 2008.

[46] Minyoung Kim and Vladimir Pavlovic. Structured output ordinal regression for dynamic facial emotion intensity prediction. In European Conference on Computer Vision, pages 649–662. Springer, 2010.

[47] Xavier Carreras and Lluís Màrquez. Introduction to the CoNLL-2005 shared task: Semantic role labeling. In Proceedings of the Ninth Conference on Computational Natural Language Learning, CONLL '05, pages 152–164. Association for Computational Linguistics, 2005.

[48] Nianwen Xue and Martha Palmer. Calibrating features for semantic role labeling. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004.

[49] Trevor Cohn and Philip Blunsom. Semantic role labelling with tree conditional random fields. In Proceedings of the Ninth Conference on Computational Natural Language Learning, CONLL '05, pages 169–172.
Association for Computational Linguistics, 2005.