{"title": "Embedding Inference for Structured Multilabel Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 3555, "page_last": 3563, "abstract": "A key bottleneck in structured output prediction is the need for inference during training and testing, usually requiring some form of dynamic programming. Rather than using approximate inference or tailoring a specialized inference method for a particular structure---standard responses to the scaling challenge---we propose to embed prediction constraints directly into the learned representation. By eliminating the need for explicit inference a more scalable approach to structured output prediction can be achieved, particularly at test time. We demonstrate the idea for multi-label prediction under subsumption and mutual exclusion constraints, where a relationship to maximum margin structured output prediction can be established. Experiments demonstrate that the benefits of structured output training can still be realized even after inference has been eliminated.", "full_text": "Embedding Inference\n\nfor Structured Multilabel Prediction\n\nFarzaneh Mirzazadeh Siamak Ravanbakhsh\n\nUniversity of Alberta\n\n{mirzazad,mravanba}@ualberta.ca\n\nNan Ding\nGoogle\n\nDale Schuurmans\nUniversity of Alberta\n\ndingnan@google.com\n\ndaes@ualberta.ca\n\nAbstract\n\nA key bottleneck in structured output prediction is the need for inference dur-\ning training and testing, usually requiring some form of dynamic programming.\nRather than using approximate inference or tailoring a specialized inference\nmethod for a particular structure\u2014standard responses to the scaling challenge\u2014\nwe propose to embed prediction constraints directly into the learned representa-\ntion. By eliminating the need for explicit inference a more scalable approach to\nstructured output prediction can be achieved, particularly at test time. 
We demonstrate the idea for multi-label prediction under subsumption and mutual exclusion constraints, where a relationship to maximum margin structured output prediction can be established. Experiments demonstrate that the benefits of structured output training can still be realized even after inference has been eliminated.

1 Introduction

Structured output prediction has been an important topic in machine learning. Many prediction problems involve complex structures, such as predicting parse trees for sentences [28], predicting sequence labellings for language and genomic data [1], or predicting multilabel taggings for documents and images [7, 8, 13, 20]. Initial breakthroughs in this area arose from tractable discriminative training methods, namely conditional random fields [19, 27] and structured large margin training [26, 29], that compare complete output configurations against given target structures, rather than simply learning to predict each component in isolation. More recently, search-based approaches that exploit sequential prediction methods have also proved effective for structured prediction [4, 21]. Despite these improvements, the need to conduct inference or search over complex outputs during both the training and testing phases proves to be a significant bottleneck in practice.
In this paper we investigate an alternative approach that eliminates the need for inference or search at test time. 
The idea is to shift the burden of coordinating predictions to the training phase, by embedding constraints in the learned representation that ensure prediction relationships are satisfied. The primary benefit of this approach is that prediction cost can be significantly reduced without sacrificing the desired coordination of structured output components.
We demonstrate the proposed approach for the problem of multilabel classification with hierarchical and mutual exclusion constraints on output labels [8]. Multilabel classification is an important subfield of structured output prediction where multiple labels must be assigned that respect semantic relationships such as subsumption, mutual exclusion or weak forms of correlation. The problem is of growing importance as larger tag sets are being used to annotate images and documents on the Web.
Research in this area can be distinguished by whether the relationships between labels are assumed to be known beforehand or whether such relationships need to be inferred during training. In the latter case, many works have developed tailored training losses for multilabel prediction that penalize joint prediction behavior [6, 9, 30] without assuming any specific form of prior knowledge. More recently, several works have focused on coping with large label spaces by using low dimensional projections to label subspaces [3, 17, 22]. Other work has focused on exploiting weak forms of prior knowledge expressed as similarity information between labels that can be obtained from auxiliary sources [11]. Unfortunately, none of these approaches strictly enforce prior logical relationships between label predictions. By contrast, other research has sought to exploit known prior relationships between labels. The most prominent such approaches have been to exploit generative or conditional graphical model structures over the label set [5, 16]. 
Unfortunately, the graphical model structures are either limited to junction trees with small treewidth [5] or require approximation [12]. Other work, using output kernels, has also been shown able to model complex relationships between labels [15] but is hampered by an intractable pre-image problem at test time.
In this paper, we focus on tractable methods and consider the scenario where a set of logical label relationships is given a priori; in particular, implication and mutual exclusion relationships. These relationships have been the subject of extensive work on multilabel prediction, where it is known that if the implication/subsumption relationships form a tree [25] or a directed acyclic graph [2, 8] then efficient dynamic programming algorithms can be developed for tractable inference during training and testing, while for general pairwise models [32] approximate inference is required. Our main contribution is to show how these relationships can be enforced without the need for dynamic programming. The idea is to embed label relationships as constraints on the underlying score model during training so that a trivial labelling algorithm can be employed at test time, a process that can be viewed as pre-compiling inference during the training phase.
The literature on multivariate prediction has considered many other topics not addressed by this paper, including learning from incomplete labellings, exploiting hierarchies and embeddings for multiclass prediction [31], exploiting multimodal data, deriving generalization bounds for structured and multilabel prediction problems, and investigating the consistency of multilabel losses.

2 Background

We consider a standard prediction model where a score function $s : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ with parameters $\theta$ is used to determine the prediction for a given input $x$ via

$$\hat{y} = \arg\max_{y \in \mathcal{Y}} s(x, y). \quad (1)$$

Here $y$ is a configuration of assignments over a set of components (that might depend on $x$). Since $\mathcal{Y}$ is a combinatorial set, (1) cannot usually be solved by enumeration; some structure is required for efficient prediction. For example, $s$ might decompose as $s(x, y) = \sum_{c \in C} s(x, y_c)$ over a set of cliques $C$ that form a junction tree, where $y_c$ denotes the portion of $y$ covered by clique $c$. $\mathcal{Y}$ might also encode constraints to aid tractability, such as $y$ forming a consistent matching in a bipartite graph, or a consistent parse tree [28]. The key practical requirement is that $s$ and $\mathcal{Y}$ allow an efficient solution to (1). The operation of maximizing or summing over all $y \in \mathcal{Y}$ is referred to as inference, and usually involves a dynamic program tailored to the specific structure encoded by $s$ and $\mathcal{Y}$.
For supervised learning one attempts to infer a useful score function given a set of $t$ training pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_t, y_t)$ that specify the correct output associated with each input. Conditional random fields [19] and structured large margin training (below with margin scaling) [28, 29] can both be expressed as optimizations over the score model parameters $\theta$ respectively:

$$\min_{\theta \in \Theta} \; r(\theta) + \sum_{i=1}^{t} \log\Big(\sum_{y \in \mathcal{Y}} \exp(s_\theta(x_i, y))\Big) - s_\theta(x_i, y_i) \quad (2)$$

$$\min_{\theta \in \Theta} \; r(\theta) + \sum_{i=1}^{t} \Big(\max_{y \in \mathcal{Y}} \Delta(y, y_i) + s_\theta(x_i, y)\Big) - s_\theta(x_i, y_i), \quad (3)$$

where $r(\theta)$ is a regularizer over $\theta \in \Theta$. Equations (1), (2) and (3) suggest that inference over $y \in \mathcal{Y}$ is required at each stage of training and testing; however, we show this is not necessarily the case.

Multilabel Prediction To demonstrate how inference might be avoided, consider the special case of multilabel prediction with label constraints. 
Multilabel prediction specializes the previous setup by assuming $y$ is a boolean assignment to a fixed set of variables, where $y = (y_1, y_2, \ldots, y_\ell)$ and $y_i \in \{0, 1\}$, i.e. each label is assigned 1 (true) or 0 (false). As noted, an extensive literature has investigated various structural assumptions on the score function to enable tractable prediction. For simplicity we adopt the factored form that has been reconsidered in recent work [8, 11] (and originally [13]): $s(x, y) = \sum_k s(x, y_k)$. This form allows (1) to be simplified to

$$\hat{y} = \arg\max_{y \in \mathcal{Y}} \sum_k s(x, y_k) = \arg\max_{y \in \mathcal{Y}} \sum_k y_k s_k(x) \quad (4)$$

where $s_k(x) := s(x, y_k = 1) - s(x, y_k = 0)$ gives the decision function associated with label $y_k \in \{0, 1\}$. That is, based on (4), if the constraints in $\mathcal{Y}$ were ignored, one would have the relationship $\hat{y}_k = 1 \Leftrightarrow s_k(x) \geq 0$. The constraints in $\mathcal{Y}$ play an important role however: it has been shown in [8] that imposing prior implications and mutual exclusions as constraints in $\mathcal{Y}$ yields state of the art accuracy results for image tagging on the ILSVRC corpus. This result was achieved in [8] by developing a novel and rather sophisticated dynamic program that can efficiently solve (4) under these constraints. Here we show how such a dynamic program can be eliminated.

3 Embedding Label Constraints

Consider the two common forms of logical relationships between labels: implication and mutual exclusion. For implication one would like to enforce relationships of the form $y_1 \Rightarrow y_2$, meaning that whenever the label $y_1$ is set to 1 (true) then the label $y_2$ must also be set to 1 (true). For mutual exclusion one would like to enforce relationships of the form $\neg y_1 \vee \neg y_2$, meaning that at least one of the labels $y_1$ and $y_2$ must be set to 0 (false) (i.e., not both can be simultaneously true). 
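Both constraint types can be stated as simple boolean predicates over a labelling, together with the greedy labelwise decision rule $\hat{y}_k = 1 \Leftrightarrow s_k(x) \geq 0$ from (4). A minimal sketch (the function names are ours, for illustration only, not from the paper's implementation):

```python
# Hypothetical sketch: greedy labelwise decisions from decision values s_k(x),
# plus a checker for implication and mutual exclusion constraints.

def greedy_labels(scores):
    """Labelwise decisions y_k = 1 iff s_k(x) >= 0, as in Eq. (4)."""
    return [1 if s >= 0 else 0 for s in scores]

def satisfies(labels, implications, exclusions):
    """Check y_i => y_j for each (i, j) in implications, and
    not (y_i and y_j) for each (i, j) in exclusions."""
    ok_imp = all(labels[j] == 1 for i, j in implications if labels[i] == 1)
    ok_exc = all(not (labels[i] and labels[j]) for i, j in exclusions)
    return ok_imp and ok_exc

# Label indices: 0 = "cat", 1 = "Siamese", 2 = "dog";
# "Siamese" implies "cat", while "cat" and "dog" are mutually exclusive.
y = greedy_labels([1.2, 0.3, -0.8])
print(y, satisfies(y, implications=[(1, 0)], exclusions=[(0, 2)]))  # [1, 1, 0] True
```

If the score functions obey the constraints developed below, the greedy labelling returned by `greedy_labels` passes this check for every input by construction.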
These constraints arise naturally in multilabel classification, where label sets are increasingly large and embody semantic relationships between categories [2, 8, 32]. For example, images can be tagged with labels "dog", "cat" and "Siamese" where "Siamese" implies "cat", while "dog" and "cat" are mutually exclusive (but an image could depict neither). These implication and mutual exclusion constraints constitute the "HEX" constraints considered in [8].
Our goal is to express the logical relationships between label assignments as constraints on the score function that hold universally over all $x \in \mathcal{X}$. In particular, using the decomposed representation (4), the desired label relationships correspond to the following constraints:

Implication $y_1 \Rightarrow y_2$: $\quad s_1(x) \geq -\delta \;\Rightarrow\; s_2(x) \geq \delta \quad \forall x \in \mathcal{X} \quad (5)$
Mutual exclusion $\neg y_1 \vee \neg y_2$: $\quad s_1(x) < -\delta \;\text{or}\; s_2(x) < -\delta \quad \forall x \in \mathcal{X} \quad (6)$

where we have introduced the additional margin quantity $\delta \geq 0$ for subsequent large margin training.

3.1 Score Model

The first key consideration is representing the score function in a manner that allows the desired relationships to be expressed. Unfortunately, the standard linear form $s(x, y) = \langle \theta, f(x, y) \rangle$ cannot allow the needed constraints to be enforced over all $x \in \mathcal{X}$ without further restricting the form of the feature representation $f$; a constraint we would like to avoid. More specifically, consider a standard setup where there is a mapping $f(x, y_k)$ that produces a feature representation for an input-label pair $(x, y_k)$. 
For clarity, we additionally make the standard assumption that the inputs and outputs each have independent feature representations [11], hence $f(x, y_k) = \phi(x) \otimes \psi_k$ for an input feature map $\phi$ and label feature representation $\psi_k$. In this case, a bi-linear score function has the form $s_k(x) = \phi(x)^\top A \psi_k + b^\top \phi(x) + c^\top \psi_k + d$ for parameters $\theta = (A, b, c, d)$. Unfortunately, such a score function does not allow $s_k(x) \geq \delta$ (e.g., in Condition (5)) to be expressed over all $x \in \mathcal{X}$ without either assuming $A = 0$ and $b = 0$, or special structure in $\phi$.
To overcome this restriction we consider a more general scoring model that extends the standard bi-linear form to a form that is linear in the parameters but quadratic in the feature representations:

$$-s_k(x) = \begin{bmatrix} \phi(x) \\ \psi_k \\ 1 \end{bmatrix}^\top \begin{bmatrix} P & A & b \\ A^\top & Q & c \\ b^\top & c^\top & r \end{bmatrix} \begin{bmatrix} \phi(x) \\ \psi_k \\ 1 \end{bmatrix} \quad \text{for} \quad \theta = \begin{bmatrix} P & A & b \\ A^\top & Q & c \\ b^\top & c^\top & r \end{bmatrix}. \quad (7)$$

Here $\theta = \theta^\top$ and $s_k$ is linear in $\theta$ for each $k$. The benefit of a quadratic form in the features is that it allows constraints over $x \in \mathcal{X}$ to be easily imposed on label scores via convex constraints on $\theta$.

Lemma 1 If $\theta \succeq 0$ then $-s_k(x) = \|U\phi(x) + u - V\psi_k\|^2$ for some $U$, $V$ and $u$.

Proof: First expand (7), obtaining $-s_k(x) = \phi(x)^\top P \phi(x) + 2\phi(x)^\top A \psi_k + 2b^\top \phi(x) + \psi_k^\top Q \psi_k + 2c^\top \psi_k + r$. Since $\theta \succeq 0$ there must exist $U$, $V$ and $u$ such that $\theta = [U^\top, -V^\top, u]^\top [U^\top, -V^\top, u]$, where $U^\top U = P$, $U^\top V = -A$, $U^\top u = b$, $V^\top V = Q$, $V^\top u = -c$, and $u^\top u = r$. 
A simple substitution and rearrangement shows the claim. $\blacksquare$
The representation (7) generalizes both standard bi-linear and distance-based models. The standard bi-linear model is achieved by $P = 0$ and $Q = 0$. By Lemma 1, the semidefinite assumption $\theta \succeq 0$ also yields a model that has a co-embedding [24] interpretation: the feature representations $\phi(x)$ and $\psi_k$ are both mapped (linearly) into a common Euclidean space where the score is determined by the squared distance between the embedded vectors (with an additional offset $u$). To aid the presentation below we simplify this model a bit further. Set $b = 0$ and observe that (7) reduces to

$$s_k(x) = \gamma_k - \begin{bmatrix} \phi(x) \\ \psi_k \end{bmatrix}^\top \begin{bmatrix} P & A \\ A^\top & Q \end{bmatrix} \begin{bmatrix} \phi(x) \\ \psi_k \end{bmatrix} \quad (8)$$

where $\gamma_k = -r - 2c^\top \psi_k$. In particular, we modify the parameterization to $\theta = \{\gamma_k\}_{k=1}^{\ell} \cup \{\theta_{PAQ}\}$ such that $\theta_{PAQ}$ denotes the matrix of parameters in (8). Importantly, (8) remains linear in the new parameterization. Lemma 1 can then be modified accordingly for a similar convex constraint on $\theta$.

Lemma 2 If $\theta_{PAQ} \succeq 0$ then there exist $U$ and $V$ such that for all labels $k$ and $l$

$$s_k(x) = \gamma_k - \|U\phi(x) - V\psi_k\|^2 \quad (9)$$
$$\psi_k^\top Q \psi_k - \psi_k^\top Q \psi_l - \psi_l^\top Q \psi_k + \psi_l^\top Q \psi_l = \|V\psi_k - V\psi_l\|^2. \quad (10)$$

Proof: Similar to Lemma 1, since $\theta_{PAQ} \succeq 0$, there exist $U$ and $V$ such that $\theta_{PAQ} = [U^\top, -V^\top]^\top [U^\top, -V^\top]$ where $U^\top U = P$, $V^\top V = Q$ and $U^\top V = -A$. Expanding (8) and substituting gives (9). 
For (10) note $\psi_k^\top Q \psi_k - \psi_k^\top Q \psi_l - \psi_l^\top Q \psi_k + \psi_l^\top Q \psi_l = (\psi_k - \psi_l)^\top Q (\psi_k - \psi_l)$. Expanding $Q$ gives $(\psi_k - \psi_l)^\top Q (\psi_k - \psi_l) = (\psi_k - \psi_l)^\top V^\top V (\psi_k - \psi_l) = \|V\psi_k - V\psi_l\|^2$. $\blacksquare$
This representation now allows us to embed the desired label relationships as simple convex constraints on the score model parameters $\theta$.

3.2 Embedding Implication Constraints

Theorem 3 Assume the quadratic-linear score model (8) and $\theta_{PAQ} \succeq 0$. Then for any $\delta \geq 0$ and $\alpha > 0$, the implication constraint in (5) is implied for all $x \in \mathcal{X}$ by:

$$\gamma_1 + \delta + (1 + \alpha)(\psi_1^\top Q \psi_1 - \psi_1^\top Q \psi_2 - \psi_2^\top Q \psi_1 + \psi_2^\top Q \psi_2) \leq \gamma_2 - \delta \quad (11)$$
$$\big(\tfrac{\alpha}{2}\big)^2 (\psi_1^\top Q \psi_1 - \psi_1^\top Q \psi_2 - \psi_2^\top Q \psi_1 + \psi_2^\top Q \psi_2) \geq \gamma_1 + \delta. \quad (12)$$

Proof: First, since $\theta_{PAQ} \succeq 0$ we have the relationship (10), which implies that there must exist vectors $\nu_1 = V\psi_1$ and $\nu_2 = V\psi_2$ such that $\psi_1^\top Q \psi_1 - \psi_1^\top Q \psi_2 - \psi_2^\top Q \psi_1 + \psi_2^\top Q \psi_2 = \|\nu_1 - \nu_2\|^2$. Therefore, the constraints (11) and (12) can be equivalently re-expressed as

$$\gamma_1 + \delta + (1 + \alpha)\|\nu_1 - \nu_2\|^2 \leq \gamma_2 - \delta \quad (13)$$
$$\big(\tfrac{\alpha}{2}\big)^2 \|\nu_1 - \nu_2\|^2 \geq \gamma_1 + \delta \quad (14)$$

with respect to these vectors. Next let $\mu(x) := U\phi(x)$ (which exists by (9)) and observe that

$$\|\mu(x) - \nu_2\|^2 = \|\mu(x) - \nu_1 + \nu_1 - \nu_2\|^2 = \|\mu(x) - \nu_1\|^2 + \|\nu_1 - \nu_2\|^2 + 2\langle \mu(x) - \nu_1, \nu_1 - \nu_2 \rangle. \quad (15)$$

Consider two cases.
Case 1: $2\langle \mu(x) - \nu_1, \nu_1 - \nu_2 \rangle > \alpha \|\nu_1 - \nu_2\|^2$. In this case, by the Cauchy-Schwarz inequality we have $2\|\mu(x) - \nu_1\| \|\nu_1 - \nu_2\| \geq 2\langle \mu(x) - \nu_1, \nu_1 - \nu_2 \rangle > \alpha \|\nu_1 - \nu_2\|^2$, which implies $\|\mu(x) - \nu_1\| > \tfrac{\alpha}{2} \|\nu_1 - \nu_2\|$, hence $\|\mu(x) - \nu_1\|^2 > \big(\tfrac{\alpha}{2}\big)^2 \|\nu_1 - \nu_2\|^2 \geq \gamma_1 + \delta$ by constraint (14). But this implies that $s_1(x) < -\delta$, therefore it does not matter what value $s_2(x)$ has.
Case 2: $2\langle \mu(x) - \nu_1, \nu_1 - \nu_2 \rangle \leq \alpha \|\nu_1 - \nu_2\|^2$. In this case, assume that $s_1(x) \geq -\delta$, i.e. $\|\mu(x) - \nu_1\|^2 \leq \gamma_1 + \delta$, otherwise it does not matter what value $s_2(x)$ has. Then from (15) it follows that $\|\mu(x) - \nu_2\|^2 \leq \|\mu(x) - \nu_1\|^2 + (1 + \alpha)\|\nu_1 - \nu_2\|^2 \leq \gamma_1 + \delta + (1 + \alpha)\|\nu_1 - \nu_2\|^2 \leq \gamma_2 - \delta$ by constraint (13). But this implies that $s_2(x) \geq \delta$, hence the implication is enforced. $\blacksquare$

3.3 Embedding Mutual Exclusion Constraints

Theorem 4 Assume the quadratic-linear score model (8) and $\theta_{PAQ} \succeq 0$. 
Then for any $\delta \geq 0$ the mutual exclusion constraint in (6) is implied for all $x \in \mathcal{X}$ by:

$$\tfrac{1}{2}(\psi_1^\top Q \psi_1 - \psi_1^\top Q \psi_2 - \psi_2^\top Q \psi_1 + \psi_2^\top Q \psi_2) > \gamma_1 + \gamma_2 + 2\delta. \quad (16)$$

Proof: As before, since $\theta_{PAQ} \succeq 0$ we have the relationship (10), which implies that there must exist vectors $\nu_1 = V\psi_1$ and $\nu_2 = V\psi_2$ such that $\psi_1^\top Q \psi_1 - \psi_1^\top Q \psi_2 - \psi_2^\top Q \psi_1 + \psi_2^\top Q \psi_2 = \|\nu_1 - \nu_2\|^2$. Observe that the constraint (16) can then be equivalently expressed as

$$\tfrac{1}{2}\|\nu_1 - \nu_2\|^2 > \gamma_1 + \gamma_2 + 2\delta, \quad (17)$$

and observe that

$$\|\nu_1 - \nu_2\|^2 = \|\nu_1 - \mu(x) + \mu(x) - \nu_2\|^2 = \|\nu_1 - \mu(x)\|^2 + \|\mu(x) - \nu_2\|^2 + 2\langle \nu_1 - \mu(x), \mu(x) - \nu_2 \rangle \quad (18)$$

using $\mu(x) := U\phi(x)$ as before (which exists by (9)). 
Therefore

$$\|\mu(x) - \nu_1\|^2 + \|\mu(x) - \nu_2\|^2 = \|\nu_1 - \nu_2\|^2 - 2\langle \nu_1 - \mu(x), \mu(x) - \nu_2 \rangle$$
$$= \|(\nu_1 - \mu(x)) + (\mu(x) - \nu_2)\|^2 - 2\langle \nu_1 - \mu(x), \mu(x) - \nu_2 \rangle \quad (19)$$
$$\geq \tfrac{1}{2}\|(\nu_1 - \mu(x)) + (\mu(x) - \nu_2)\|^2 \quad (20)$$
$$= \tfrac{1}{2}\|\nu_1 - \nu_2\|^2. \quad (21)$$

(To prove the inequality (20) observe that, since $0 \leq \tfrac{1}{2}\|a - b\|^2$, we must have $\langle a, b \rangle \leq \tfrac{1}{2}\|a\|^2 + \tfrac{1}{2}\|b\|^2$, hence $2\langle a, b \rangle \leq \tfrac{1}{2}\|a\|^2 + \tfrac{1}{2}\|b\|^2 + \langle a, b \rangle = \tfrac{1}{2}\|a + b\|^2$, which establishes $-2\langle a, b \rangle \geq -\tfrac{1}{2}\|a + b\|^2$. The inequality (20) then follows simply by setting $a = \nu_1 - \mu(x)$ and $b = \mu(x) - \nu_2$.)
Now combining (21) with the constraint (17) implies that $\|\mu(x) - \nu_1\|^2 + \|\mu(x) - \nu_2\|^2 \geq \tfrac{1}{2}\|\nu_1 - \nu_2\|^2 > \gamma_1 + \gamma_2 + 2\delta$, therefore one of $\|\mu(x) - \nu_1\|^2 > \gamma_1 + \delta$ or $\|\mu(x) - \nu_2\|^2 > \gamma_2 + \delta$ must hold, hence at least one of $s_1(x) < -\delta$ or $s_2(x) < -\delta$ must hold. 
Therefore, the mutual exclusion is enforced. $\blacksquare$

Importantly, once $\theta_{PAQ} \succeq 0$ is imposed, the other constraints in Theorems 3 and 4 are all linear in the parameters $Q$ and $\gamma$.

4 Properties

We now establish that the above constraints on the parameters in (8) achieve the desired properties. In particular, we show that given the constraints, inference can be removed both from the prediction problem (4) and from structured large margin training (3).

4.1 Prediction Equivalence

First note that the decision of whether a label $y_k$ is associated with $x$ can be determined by

$$s(x, y_k = 1) > s(x, y_k = 0) \;\Leftrightarrow\; \max_{y_k \in \{0,1\}} y_k s_k(x) > 0 \;\Leftrightarrow\; 1 = \arg\max_{y_k \in \{0,1\}} y_k s_k(x). \quad (22)$$

Consider joint assignments $y = (y_1, \ldots, y_l) \in \{0,1\}^l$ and let $\mathcal{Y}$ denote the set of joint assignments that are consistent with a set of implication and mutual exclusion constraints. (It is assumed the constraints are satisfiable; that is, $\mathcal{Y}$ is not the empty set.) Then the optimal joint assignment for a given $x$ can be specified by $\arg\max_{y \in \mathcal{Y}} \sum_{k=1}^{l} y_k s_k(x)$.

Proposition 5 If the constraint set $\mathcal{Y}$ imposes the constraints in (5) and (6) (and is nonempty), and the score function $s$ satisfies the corresponding constraints for some $\delta > 0$, then

$$\max_{y \in \mathcal{Y}} \sum_{k=1}^{l} y_k s_k(x) = \sum_{k=1}^{l} \max_{y_k} y_k s_k(x). \quad (23)$$

Proof: First observe that

$$\max_{y \in \mathcal{Y}} \sum_{k=1}^{l} y_k s_k(x) \leq \max_{y} \sum_{k=1}^{l} y_k s_k(x) = \sum_{k=1}^{l} \max_{y_k} y_k s_k(x), \quad (24)$$

so making local classifications for each label gives an upper bound. However, if the score function satisfies the constraints, then the concatenation of the local label decisions $y = (y_1, \ldots, y_l)$ must be jointly feasible; that is, $y \in \mathcal{Y}$. 
In particular, for the implication $y_1 \Rightarrow y_2$ the score constraint (5) ensures that if $s_1(x) > 0 \geq -\delta$ (implying $1 = \arg\max_{y_1} y_1 s_1(x)$) then it must follow that $s_2(x) \geq \delta$, hence $s_2(x) > 0$ (implying $1 = \arg\max_{y_2} y_2 s_2(x)$). Similarly, for the mutual exclusion $\neg y_1 \vee \neg y_2$ the score constraint (6) ensures $\min(s_1(x), s_2(x)) < -\delta \leq 0$, hence if $s_1(x) > 0 \geq -\delta$ (implying $1 = \arg\max_{y_1} y_1 s_1(x)$) then it must follow that $s_2(x) < -\delta \leq 0$ (implying $0 = \arg\max_{y_2} y_2 s_2(x)$), and vice versa. Therefore, since the maximizer $y$ of (24) is feasible, we actually have that the leftmost term in (24) is equal to the rightmost. $\blacksquare$

Since the feasible set $\mathcal{Y}$ embodies non-trivial constraints over assignment vectors in (23), interchanging maximization with summation is not normally justified. However, Proposition 5 establishes that, if the score model also satisfies its respective constraints (e.g., as established in the previous section), then maximization and summation can be interchanged, and inference over predicted labellings can be replaced by greedy componentwise labelling, while preserving equivalence.

4.2 Re-expressing Large Margin Structured Output Training

Given a target joint assignment over labels $t = (t_1, \ldots, t_l) \in \{0,1\}^l$, and using the score model (8), the standard structured output large margin training loss (3) can then be written as

$$\sum_{i} \max_{y \in \mathcal{Y}} \Delta(y, t_i) + \sum_{k=1}^{l} s(x_i, y_k) - s(x_i, t_{ik}) = \sum_{i} \max_{y \in \mathcal{Y}} \Delta(y, t_i) + \sum_{k=1}^{l} (y_k - t_{ik}) s_k(x_i), \quad (25)$$

using the simplified score function representation such that $t_{ik}$ denotes the $k$-th label of the $i$-th training example. 
If we furthermore make the standard assumption that $\Delta(y, t_i)$ decomposes as $\Delta(y, t_i) = \sum_{k=1}^{l} \delta_k(y_k, t_{ik})$, the loss can be simplified to

$$\sum_{i} \max_{y \in \mathcal{Y}} \sum_{k=1}^{l} \delta_k(y_k, t_{ik}) + (y_k - t_{ik}) s_k(x_i). \quad (26)$$

Note also that since $y_k \in \{0,1\}$ and $t_{ik} \in \{0,1\}$ the margin functions $\delta_k$ typically have the form $\delta_k(0,0) = \delta_k(1,1) = 0$ and $\delta_k(0,1) = \delta_{k01}$ and $\delta_k(1,0) = \delta_{k10}$ for constants $\delta_{k01}$ and $\delta_{k10}$, which for simplicity we will assume are equal, $\delta_{k01} = \delta_{k10} = \delta$ for all $k$ (although label specific margins might be possible). This is the same $\delta$ used in the constraints (5) and (6).
The difficulty in computing this loss is that it apparently requires an exponential search over $y$. When this exponential search can be avoided, it is normally avoided by developing a dynamic program. Instead, we can now see that the search over $y$ can be eliminated.

Proposition 6 If the score function $s$ satisfies the constraints in (5) and (6) for $\delta > 0$, then

$$\sum_{i} \max_{y \in \mathcal{Y}} \sum_{k=1}^{l} \delta(y_k, t_{ik}) + (y_k - t_{ik}) s_k(x_i) = \sum_{i} \sum_{k=1}^{l} \max_{y_k} \delta(y_k, t_{ik}) + (y_k - t_{ik}) s_k(x_i). \quad (27)$$

Proof: For a given $x$ and $t \in \mathcal{Y}$, let $f_k(y) = \delta(y, t_k) + (y - t_k) s_k(x)$, hence $y_k = \arg\max_{y \in \{0,1\}} f_k(y)$. It is easy to show that

$$1 \in \arg\max_{y \in \{0,1\}} f_k(y) \;\Longleftrightarrow\; s_k(x) \geq t_k \delta - (1 - t_k)\delta, \quad (28)$$

which can be verified by checking the two cases, $t_k = 0$ and $t_k = 1$. When $t_k = 0$ we have $f_k(0) = 0$ and $f_k(1) = \delta + s_k(x)$, therefore $1 = y_k \in \arg\max_{y \in \{0,1\}} f_k(y)$ iff $\delta + s_k(x) \geq 0$. Similarly, when $t_k = 1$ we have $f_k(0) = \delta - s_k(x)$ and $f_k(1) = 0$, therefore $1 = y_k \in \arg\max_{y \in \{0,1\}} f_k(y)$ iff $\delta - s_k(x) \leq 0$. 
Combining these two conditions yields (28).
Next, we verify that if the score constraints hold, then the logical constraints over $y$ are automatically satisfied even by locally assigning $y_k$, which implies the optimal joint assignment is feasible, i.e. $y \in \mathcal{Y}$, establishing the claim. In particular, for the implication $y_1 \Rightarrow y_2$, it is assumed that $t_1 \Rightarrow t_2$ in the target labeling and also that the score constraints hold, ensuring $s_1(x) \geq -\delta \Rightarrow s_2(x) \geq \delta$. Consider the cases over possible assignments to $t_1$ and $t_2$:
If $t_1 = 0$ and $t_2 = 0$ then $y_1 = 1 \Rightarrow f_1(1) \geq f_1(0) \Rightarrow \delta + s_1(x) \geq 0 \Rightarrow s_1(x) \geq -\delta \Rightarrow s_2(x) \geq \delta$ (by assumption) $\Rightarrow s_2(x) \geq -\delta \Rightarrow \delta + s_2(x) \geq 0 \Rightarrow f_2(1) \geq f_2(0) \Rightarrow y_2 = 1$.
If $t_1 = 0$ and $t_2 = 1$ then $y_1 = 1 \Rightarrow f_1(1) \geq f_1(0) \Rightarrow \delta + s_1(x) \geq 0 \Rightarrow s_1(x) \geq -\delta \Rightarrow s_2(x) \geq \delta$ (by assumption) $\Rightarrow 0 \geq \delta - s_2(x) \Rightarrow f_2(1) \geq f_2(0) \Rightarrow y_2 = 1$ (tight case).
The case $t_1 = 1$ and $t_2 = 0$ cannot happen by the assumption that $t \in \mathcal{Y}$.
If $t_1 = 1$ and $t_2 = 1$ then $y_1 = 1 \Rightarrow f_1(1) \geq f_1(0) \Rightarrow 0 \geq \delta - s_1(x) \Rightarrow s_1(x) \geq -\delta \Rightarrow s_2(x) \geq \delta$ (by assumption) $\Rightarrow 0 \geq \delta - s_2(x) \Rightarrow f_2(1) \geq f_2(0) \Rightarrow y_2 = 1$.
Similarly, for the mutual exclusion $\neg y_1 \vee \neg y_2$, it is assumed that $\neg t_1 \vee \neg t_2$ in the target labeling and also that the score constraints hold, ensuring $\min(s_1(x), s_2(x)) < -\delta$. 
Consider the cases over possible assignments to $t_1$ and $t_2$:
If $t_1 = 0$ and $t_2 = 0$ then $y_1 = 1$ and $y_2 = 1$ implies that $s_1(x) \geq -\delta$ and $s_2(x) \geq -\delta$, which contradicts the constraint that $\min(s_1(x), s_2(x)) < -\delta$ (tight case).
If $t_1 = 0$ and $t_2 = 1$ then $y_1 = 1$ and $y_2 = 1$ implies that $s_1(x) \geq -\delta$ and $s_2(x) \geq \delta$, which contradicts the same constraint.
If $t_1 = 1$ and $t_2 = 0$ then $y_1 = 1$ and $y_2 = 1$ implies that $s_1(x) \geq \delta$ and $s_2(x) \geq -\delta$, which again contradicts the same constraint.
The case $t_1 = 1$ and $t_2 = 1$ cannot happen by the assumption that $t \in \mathcal{Y}$.
Therefore, since the concatenation, $y$, of the independent maximizers of (27) is feasible, i.e. $y \in \mathcal{Y}$, we have that the rightmost term in (27) equals the leftmost. $\blacksquare$
Similar to Section 4.1, Proposition 6 demonstrates that if the constraints (5) and (6) are satisfied by the score model $s$, then structured large margin training (3) reduces to independent labelwise training under the standard hinge loss, while preserving equivalence.

5 Efficient Implementation

Even though Section 3 achieves the primary goal of demonstrating how desired label relationships can be embedded as convex constraints on score model parameters, the linear-quadratic representation (8) unfortunately does not allow convenient scaling: the number of parameters in $\theta_{PAQ}$ (8) is $\binom{n+\ell}{2}$ (accounting for symmetry), which is quadratic in the number of features, $n$, in $\phi$ and the number of labels, $\ell$. Such a large optimization variable is not practical for most applications, where $n$ and $\ell$ can be quite large. 
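To make the scaling issue concrete, a quick back-of-the-envelope count (ours, using the dataset sizes from Table 1; we count the full upper triangle including the diagonal, which the $\binom{n+\ell}{2}$ figure omits, a negligible difference at this scale):

```python
# Rough illustration (not from the paper): number of free entries in the
# symmetric (n + l) x (n + l) parameter matrix theta_PAQ for each dataset.

def n_params(n_features, n_labels):
    """Entries in the upper triangle (including diagonal) of a symmetric
    (n + l) x (n + l) matrix."""
    m = n_features + n_labels
    return m * (m + 1) // 2

for name, n, l in [("Enron", 1001, 57), ("WIPO", 74435, 183), ("Reuters", 47235, 103)]:
    print(f"{name}: {n_params(n, l):,} parameters")
```

For WIPO this is roughly 2.8 billion parameters, which makes a direct dense semidefinite formulation clearly impractical and motivates the low-rank refinement described next.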
The semidefinite constraint θPAQ ⪰ 0 can also be costly to enforce in practice. Therefore, to obtain scalable training we require some further refinement.
In our experiments below we obtained a scalable training procedure by exploiting trace norm regularization on θPAQ to reduce its rank. The key benefit of trace norm regularization is that efficient solution methods exist that work with a low-rank factorization of the matrix variable while automatically ensuring positive semidefiniteness and still guaranteeing global optimality [10, 14]. First, we conducted the main optimization in terms of a smaller matrix variable B such that BB⊤ = θPAQ. Second, to cope with the constraints, we employed an augmented Lagrangian method that increasingly penalizes constraint violations, but otherwise allows simple unconstrained optimization. All smooth optimization problems were solved using LBFGS, and nonsmooth problems were solved using a bundle method [23].

Table 1: Data set properties

Dataset   Features  Labels  Depth  # Training  # Testing  Reference
Enron         1001      57      4         988        660       [18]
WIPO         74435     183      5        1352        358       [25]
Reuters      47235     103      5        3000       3000       [20]

Table 2: (left) test set prediction error (percent); (right) test set prediction time (s)

% test error   Enron  WIPO  Reuters      test time (s)  Enron  WIPO  Reuters
unconstrained   27.1  21.0     12.4      unconstrained  0.054  0.60    0.070
constrained      4.0   2.6      9.8      constrained    0.054  0.60    0.070
inference       29.3   2.7      6.8      inference      0.481  5.20    0.389

6 Experimental Evaluation

To evaluate the proposed approach we conducted experiments on multilabel text classification data that has a natural hierarchy defined over the label set.
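Such a hierarchy induces exactly the two relation types considered in Section 3: each child label implies its parent (subsumption), and sibling labels can be marked mutually exclusive. The following is a minimal sketch of extracting these pairs from a hypothetical parent map (not the paper's preprocessing code, which additionally checks that an exclusion does not create a contradiction):

```python
from itertools import combinations

def hierarchy_relations(parent):
    """Given a map child -> parent over label names, return
    (implications, exclusions): child => parent implication pairs,
    and unordered sibling pairs treated as mutually exclusive."""
    implications = [(c, p) for c, p in parent.items()]
    # Group labels by their parent to find sibling sets.
    siblings = {}
    for c, p in parent.items():
        siblings.setdefault(p, []).append(c)
    exclusions = [pair for kids in siblings.values()
                  for pair in combinations(sorted(kids), 2)]
    return implications, exclusions

# Hypothetical 3-label fragment of a topic hierarchy.
imp, exc = hierarchy_relations({"grain": "commodity", "metals": "commodity"})
# imp contains ("grain", "commodity") and ("metals", "commodity");
# exc contains the sibling pair ("grain", "metals").
```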
In particular, we investigated three multilabel text classification data sets, Enron, WIPO and Reuters, obtained from https://sites.google.com/site/hrsvmproject/datasets-hier (see Table 1 for details). Some preprocessing was performed on the label relations to ensure consistency with our assumptions. In particular, all implied labels were added to each instance to ensure consistency with the hierarchy, while mutual exclusions were defined between siblings whenever this did not create a contradiction.
We conducted experiments to compare the effects of replacing inference with the constraints outlined in Section 3, using the score model (8). For comparison, we trained using the structured large margin formulation (3), and also trained under a multilabel prediction loss without inference, both with and without the constraints. For the multilabel training loss we used the smoothed calibrated separation ranking loss proposed in [24]. In each case, the regularization parameter was simply set to 1. For inference, we implemented the inference algorithm outlined in [8].
The results are given in Table 2, showing both the test set prediction error (using labelwise prediction error, i.e. Hamming loss) and the test prediction times. As expected, one can see benefits from incorporating known relationships between the labels when training a predictor. In each case, the addition of constraints leads to a significant improvement in test prediction error over training without any constraints or inference. Training with inference (i.e., classical structured large margin training) still proves to be an effective training method overall, improving on the constrained approach in one case but falling behind it in the other two. The key difference between the approach using constraints and that using inference is the time it takes to produce predictions on test examples.
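The source of this gap is that, once the constraints are embedded in the score model, test-time prediction is a single pass of labelwise score thresholding with no joint search, and Hamming error is a direct elementwise comparison. A minimal sketch with made-up scores (thresholding at zero for simplicity):

```python
import numpy as np

def labelwise_predict(scores):
    """Inference-free prediction: label k is switched on iff its score
    is nonnegative; no joint search over label configurations."""
    return (scores >= 0).astype(int)

def hamming_error(pred, target):
    """Labelwise prediction error: fraction of label slots that differ."""
    return float(np.mean(pred != target))

# Hypothetical scores for 2 test examples over 3 labels.
s = np.array([[ 0.7, -0.2,  1.1],
              [-0.5,  0.3, -0.9]])
t = np.array([[1, 0, 1],
              [0, 0, 0]])
pred = labelwise_predict(s)   # [[1, 0, 1], [0, 1, 0]]
err = hamming_error(pred, t)  # 1 wrong slot out of 6
```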
Using inference to make test set predictions clearly takes significantly longer than applying labelwise predictions from either a constrained or unconstrained model, as shown in the right subtable of Table 2.

7 Conclusion

We have demonstrated a novel approach to structured multilabel prediction where inference is replaced with constraints on the score model. On multilabel text classification data, the proposed approach appears able to achieve competitive generalization results, while reducing the time needed to make predictions at test time. In cases where logical relationships are known to hold between the labels, either performing inference or imposing constraints on the score model appears to yield benefits over generic training approaches that ignore this prior knowledge. For future work we are investigating extensions of the proposed approach to more general structured output settings, by combining the method with search-based prediction methods. Other interesting questions include exploiting learned label relations and coping with missing labels.

References
[1] G. Bakir, T. Hofmann, B. Schölkopf, A. Smola, B. Taskar, and S. Vishwanathan. Predicting Structured Data. MIT Press, 2007.
[2] W. Bi and J. Kwok. Mandatory leaf node prediction in hierarchical multilabel classification. In Neural Information Processing Systems (NIPS), 2012.
[3] M. Cissé, N. Usunier, T. Artières, and P. Gallinari. Robust bloom filters for large multilabel classification tasks. In Neural Information Processing Systems (NIPS), 2013.
[4] H. Daumé III and J. Langford. Search-based structured prediction. Machine Learning, 75:297–325, 2009.
[5] K. Dembczyński, W. Cheng, and E. Hüllermeier. Bayes optimal multilabel classification via probabilistic classifier chains. In International Conference on Machine Learning (ICML), 2010.
[6] K. Dembczyński, W. Waegeman, W. Cheng, and E.
Hüllermeier. On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1):5–45, 2012.
[7] J. Deng, A. Berg, K. Li, and F. Li. What does classifying more than 10,000 image categories tell us? In European Conference on Computer Vision (ECCV), 2010.
[8] J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam. Large-scale object classification using label relation graphs. In European Conference on Computer Vision (ECCV), 2014.
[9] Y. Guo and D. Schuurmans. Adaptive large margin training for multilabel classification. In AAAI Conference on Artificial Intelligence (AAAI), 2011.
[10] B. Haeffele, R. Vidal, and E. Young. Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing. In International Conference on Machine Learning (ICML), 2014.
[11] B. Hariharan, S. V. N. Vishwanathan, and M. Varma. Efficient max-margin multi-label classification with applications to zero-shot learning. Machine Learning, 88:127–155, 2012.
[12] J. Jancsary, S. Nowozin, and C. Rother. Learning convex QP relaxations for structured prediction. In International Conference on Machine Learning (ICML), 2013.
[13] T. Joachims. Transductive inference for text classification using support vector machines. In International Conference on Machine Learning (ICML), 1999.
[14] M. Journée, F. Bach, P. Absil, and R. Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010.
[15] H. Kadri, M. Ghavamzadeh, and P. Preux. A generalized kernel approach to structured output learning. In International Conference on Machine Learning (ICML), 2013.
[16] A. Kae, K. Sohn, H. Lee, and E. Learned-Miller. Augmenting CRFs with Boltzmann machine shape priors for image labeling. In Computer Vision and Pattern Recognition (CVPR), 2013.
[17] A. Kapoor, P. Jain, and R. Vishwanathan.
Multilabel classification using Bayesian compressed sensing. In Neural Information Processing Systems (NIPS), 2012.
[18] B. Klimt and Y. Yang. The Enron corpus: A new dataset for email classification. In European Conference on Machine Learning (ECML), 2004.
[19] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), 2001.
[20] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
[21] Q. Li, J. Wang, D. Wipf, and Z. Tu. Fixed-point model for structured prediction. In International Conference on Machine Learning (ICML), 2013.
[22] Z. Lin, G. Ding, M. Hu, and J. Wang. Multi-label classification via feature-aware implicit label space encoding. In International Conference on Machine Learning (ICML), 2014.
[23] M. Mäkelä. Multiobjective proximal bundle method for nonconvex nonsmooth optimization: Fortran subroutine MPBNGC 2.0. Technical report, University of Jyväskylä, 2003.
[24] F. Mirzazadeh, Y. Guo, and D. Schuurmans. Convex co-embedding. In AAAI Conference on Artificial Intelligence (AAAI), 2014.
[25] J. Rousu, C. Saunders, S. Szedmak, and J. Shawe-Taylor. Kernel-based learning of hierarchical multilabel classification models. Journal of Machine Learning Research, 7:1601–1626, 2006.
[26] V. Srikumar and C. Manning. Learning distributed representations for structured output prediction. In Neural Information Processing Systems (NIPS), 2014.
[27] X. Sun. Structure regularization for structured prediction. In Neural Information Processing Systems (NIPS), 2014.
[28] B. Taskar. Learning structured prediction models: A large margin approach. PhD thesis, Stanford University, 2004.
[29] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun.
Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.
[30] G. Tsoumakas, I. Katakis, and I. Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, 2nd edition. Springer, 2009.
[31] K. Weinberger and O. Chapelle. Large margin taxonomy embedding for document categorization. In Neural Information Processing Systems (NIPS), 2008.
[32] J. Weston, S. Bengio, and N. Usunier. WSABIE: Scaling up to large vocabulary image annotation. In International Joint Conference on Artificial Intelligence (IJCAI), 2011.