{"title": "Learning Distributed Representations for Structured Output Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 3266, "page_last": 3274, "abstract": "In recent years, distributed representations of inputs have led to performance gains in many applications by allowing statistical information to be shared across inputs. However, the predicted outputs (labels, and more generally structures) are still treated as discrete objects even though outputs are often not discrete units of meaning. In this paper, we present a new formulation for structured prediction where we represent individual labels in a structure as dense vectors and allow semantically similar labels to share parameters. We extend this representation to larger structures by defining compositionality using tensor products to give a natural generalization of standard structured prediction approaches. We define a learning objective for jointly learning the model parameters and the label vectors and propose an alternating minimization algorithm for learning. We show that our formulation outperforms structural SVM baselines in two tasks: multiclass document classification and part-of-speech tagging.", "full_text": "Learning Distributed Representations for Structured\n\nOutput Prediction\n\nVivek Srikumar\u2217\nUniversity of Utah\n\nsvivek@cs.utah.edu\n\nChristopher D. Manning\n\nStanford University\n\nmanning@cs.stanford.edu\n\nAbstract\n\nIn recent years, distributed representations of inputs have led to performance gains\nin many applications by allowing statistical information to be shared across in-\nputs. However, the predicted outputs (labels, and more generally structures) are\nstill treated as discrete objects even though outputs are often not discrete units\nof meaning. 
In this paper, we present a new formulation for structured predic-\ntion where we represent individual labels in a structure as dense vectors and allow\nsemantically similar labels to share parameters. We extend this representation\nto larger structures by de\ufb01ning compositionality using tensor products to give a\nnatural generalization of standard structured prediction approaches. We de\ufb01ne a\nlearning objective for jointly learning the model parameters and the label vectors\nand propose an alternating minimization algorithm for learning. We show that\nour formulation outperforms structural SVM baselines in two tasks: multiclass\ndocument classi\ufb01cation and part-of-speech tagging.\n\n1\n\nIntroduction\n\nIn recent years, many computer vision and natural language processing (NLP) tasks have bene\ufb01ted\nfrom the use of dense representations of inputs by allowing super\ufb01cially different inputs to be related\nto one another [26, 9, 7, 4]. For example, even though words are not discrete units of meaning, tradi-\ntional NLP models use indicator features for words. This forces learning algorithms to learn separate\nparameters for orthographically distinct but conceptually similar words. In contrast, dense vector\nrepresentations allow sharing of statistical signal across words, leading to better generalization.\nMany NLP and vision problems are structured prediction problems. The output may be an atomic\nlabel (tasks like document classi\ufb01cation) or a composition of atomic labels to form combinatorial\nobjects like sequences (e.g. part-of-speech tagging), labeled trees (e.g. parsing) or more complex\ngraphs (e.g.\nimage segmentation). Despite both the successes of distributed representations for\ninputs and the clear similarities over the output space, it is still usual to handle outputs as discrete\nobjects. 
But are structures, and the labels that constitute them, really discrete units of meaning? Consider, for example, the popular 20 Newsgroups dataset [13], which presents the multiclass classification problem of identifying a newsgroup label given the text of a posting. Labels include comp.os.mswindows.misc, sci.electronics, comp.sys.mac.hardware, rec.autos and rec.motorcycles. The usual strategy is to train a classifier that uses separate weights for each label. However, the labels themselves have meaning that is independent of the training data. From the label, we can see that comp.os.mswindows.misc, sci.electronics and comp.sys.mac.hardware are semantically closer to each other than the other two. A similar argument can be made not just for atomic labels but for their compositions too. For example, a part-of-speech tagging system trained as a sequence model might have to learn separate parameters for the JJ→NNS and JJR→NN transitions even though both encode a transition from an adjective to a noun. Here, the similarity of the transitions can be inferred from the similarity of their components.

*This work was done when the author was at Stanford University.

In this paper, we propose a new formulation for structured output learning called DISTRO (DIStributed STRuctured Output), which accounts for the fact that labels are not atomic units of meaning. We model label meaning by representing individual labels as real-valued vectors. Doing so allows us to capture similarities between labels. To allow for arbitrary structures, we define compositionality of labels as tensor products of the label vectors corresponding to its sub-structures. 
We show that\ndoing so gives us a natural extension of standard structured output learning approaches, which can\nbe seen as special cases with one-hot label vectors.\nWe de\ufb01ne a learning objective that seeks to jointly learn the model parameters along with the label\nrepresentations and propose an alternating algorithm for minimizing the objective for structured\nhinge loss. We evaluate our approach on two tasks which have semantically rich labels: multiclass\nclassi\ufb01cation on the newsgroup data and part-of-speech tagging for English and Basque. In all cases,\nwe show that DISTRO outperforms the structural SVM baselines.\n\n1.1 Related Work\n\nThis paper considers the problem of using distributed representations for arbitrary structures and is\nrelated to recent work in deep learning and structured learning. Recent unsupervised representation\nlearning research has focused on the problem of embedding inputs in vector spaces [26, 9, 16, 7].\nThere has been some work [22] on modeling semantic compositionality in NLP, but the models do\nnot easily generalize to arbitrary structures. In particular, it is not easy to extend these approaches\nto use advances in knowledge-driven learning and inference that standard structured learning and\nprediction algorithms enable.\nStandard learning approaches for structured output allow for modeling arbitrarily complex structures\n(subject to inference dif\ufb01culties) and structural SVMs [25] or conditional random \ufb01elds [12] are\ncommonly used. However, the output itself is treated as a discrete object and similarities between\noutputs are not modeled. For multiclass classi\ufb01cation, the idea of classifying to a label set that\nfollow a known hierarchy has been explored [6], but such a taxonomy is not always available.\nThe idea of distributed representations for outputs has been discussed in the connectionist literature\nsince the eighties [11, 21, 20]. 
In recent years, we have seen several lines of research that address the problem in the context of multiclass classification by framing feature learning as matrix factorization or sparse encoding [23, 1, 3]. As in this paper, the goal has often explicitly been to discover shared characteristics between the classes [2]. Indeed, the inference formulation we propose is very similar to inference in these lines of work. Also related is recent research in the NLP community that explores the use of tensor decompositions for higher order feature combinations [14]. The primary novelty in this paper is that, in addition to representing atomic labels in a distributed manner, we model their compositions in a natural fashion to generalize standard structured prediction.

2 Preliminaries and Notation

In this section, we give a very brief overview of structured prediction with the goal of introducing notation and terminology for the next sections. We represent inputs to the structured prediction problem (such as sentences, documents or images) by x ∈ X and output structures (such as labels or trees) by y ∈ Y. We define the feature function Φ : X × Y → ℝ^n that captures the relationship between the input x and the structure y as an n dimensional vector. A linear model scores the structure y with a weight vector w ∈ ℝ^n as w^T Φ(x, y). We predict the output for an input x as argmax_y w^T Φ(x, y). This problem of inference is a combinatorial optimization problem.

We will use the structures in Figure 1 as running examples. In the case of multiclass classification, the output y is one of a finite set of labels (Figure 1, left). For more complex structures, the feature vector is decomposed over the parts of the structure. 
For example, the usual representation of a first-order linear sequence model (Figure 1, middle) decomposes the sequence into emissions and transitions and the features decompose over these [8]. In this case, each emission is associated with one label and a transition is associated with an ordered pair of labels.

[Figure 1 panels: multiclass classification (an atomic part with label y_p = (y)); sequence labeling, where the emissions are atomic (label y_p = (y_0)) and the transitions are compositional (label y_p = (y_0, y_1)); and a purely compositional part (label y_p = (y_0, y_1, y_2)).]

Figure 1: Three examples of structures. In all cases, x represents the input and the y's denote the outputs to be predicted. Here, each square represents a part as defined in the text and circles represent random variables for inputs and outputs (as in factor graphs). The left figure shows multiclass classification, which has an atomic part associated with exactly one label. The middle figure shows a first-order sequence labeling task that has both atomic parts (emissions) and compositional ones (transitions). The right figure shows a purely compositional part where all outputs interact. The feature functions for these structures are shown at the end of Section 3.1.

In the general case, we denote the parts (or equivalently, factors in a factor graph) in the structure for input x by Γ_x. Each part p ∈ Γ_x is associated with a list of discrete labels, denoted by y_p = (y_p^0, y_p^1, ···). Note that the size of the list y_p is a modeling choice; for example, transition parts in the first-order Markov model correspond to two consecutive labels, as shown in Figure 1.

We denote the set of labels in the problem as L = {l_1, l_2, ···, l_M} (e.g. the set of part-of-speech tags). 
All the elements of the part labels y_p are members of this set. For notational convenience, we denote the first element of the list y_p by y_p (without boldface) and the rest by y_p^{1:}. In the rest of the paper, we will refer to a part associated with a single label as atomic and to all other parts, where y_p has more than one element, as compositional. In Figure 1, we see examples of a purely atomic structure (multiclass classification), a purely compositional structure (right) and a structure that is a mix of the two (first-order sequence, middle).

The decomposition of the structure decomposes the feature function over the parts as

    Φ(x, y) = Σ_{p ∈ Γ_x} Φ_p(x, y_p).    (1)

The scoring function w^T Φ(x, y) also decomposes along this sum. Standard definitions of structured prediction models leave the definition of the part-specific feature function Φ_p to be problem dependent. We will focus on this aspect in Section 3 to define our model.

With definitions of a scoring function and inference, we can state the learning objective. Given a collection of N training examples of the form (x_i, y_i), training is the following regularized risk minimization problem:

    min_{w ∈ ℝ^n}  (λ/2) w^T w + (1/N) Σ_i L(x_i, y_i; w).    (2)

Here, L represents a loss function such as the hinge loss (for structural SVMs) or the log loss (for conditional random fields) and penalizes model errors. The hyper-parameter λ trades off between generalization and accuracy.

3 Distributed Representations for Structured Output

As mentioned in Section 2, the choice of the feature function Φ_p for a part p is left to be problem specific. The objective is to capture the correlations between the relevant attributes of the input x and the output labels y_p. 
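To make the scoring and inference setup of Equations (1) and (2) concrete, here is a minimal NumPy sketch for the purely atomic (multiclass) case with the standard label-conjoined features; all data, dimensions, and function names are illustrative, not from the paper:

```python
import numpy as np

def conjoin_features(phi_x, y, num_labels):
    """Standard structured features for an atomic part: place the input
    features phi(x) in the block of the feature vector reserved for
    label y (equivalent to one-hot(y) tensored with phi(x))."""
    dim = len(phi_x)
    feat = np.zeros(num_labels * dim)
    feat[y * dim:(y + 1) * dim] = phi_x
    return feat

def predict(w, phi_x, num_labels):
    """Inference: argmax_y w^T Phi(x, y), enumerating the label set."""
    scores = [w @ conjoin_features(phi_x, y, num_labels)
              for y in range(num_labels)]
    return int(np.argmax(scores))

# Toy check with hypothetical dimensions and random data.
rng = np.random.default_rng(0)
num_labels, dim = 3, 4
phi_x = rng.normal(size=dim)
w = rng.normal(size=num_labels * dim)
y_hat = predict(w, phi_x, num_labels)
assert 0 <= y_hat < num_labels
```

Because the feature vector is block-structured, the same prediction can be computed as a matrix-vector product with `w` reshaped to `(num_labels, dim)`, which is how such models are usually implemented in practice.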
Typically, this is done by conjoining the labels yp with a user-de\ufb01ned\nfeature vector \u03c6p(x) that is dependent only on the input.\n\n3\n\n\fWhen applied to atomic parts (e.g. multiclass classi\ufb01cation), conjoining the label with the input fea-\ntures effectively allocates a different portion of the weight vector for each label. For compositional\nparts (e.g. transitions in sequence models), this ensures that each combination of labels is associated\nwith a different portion of the weight vector. The implicit assumption in this design is that labels and\nlabel combinations are distinct units of meaning and hence do not share any parameters across them.\nIn this paper, we posit that in most naturally occurring problems and their associated labels, this\nassumption is not true. In fact, labels often encode rich semantic information with varying degrees\nof similarities to each other. Because structures are composed of atomic labels, the same applies to\nstructures too.\nFrom Section 2, we see that for the purpose of inference, structures are completely de\ufb01ned by\ntheir feature vectors, which are decomposed along the atomic and compositional parts that form the\nstructure. Thus, our goal is to develop a feature representation for labeled parts that exploits label\nsimilarity. More explicitly, our desiderata are:\n\n1. First, we need to be able to represent labeled atomic parts using a feature representation\nthat accounts for relatedness of labels in such a way that statistical strength (i.e. weights)\ncan be shared across different labels.\n\n2. Second, we need an operator that can construct compositional parts to build larger struc-\n\ntures so that the above property can be extended to arbitrary structured output.\n\n3.1 The DISTRO model\n\nIn order to assign a notion of relatedness between labels, we associate a d dimensional unit vector\nal to each label l \u2208 L. 
We will refer to the d × M matrix comprising all the M label vectors as A, the label matrix.

We can define the feature vectors for parts, and thus entire structures, using these label vectors. To do so, we define the notion of a feature tensor function for a part p that has been labeled with a list of m labels y_p. The feature tensor function is a function Ψ_p that maps the input x and the label list y_p associated with the part to a tensor of order m + 1. The tensor captures the relationships between the input and all the m labels associated with it. We recursively define the feature tensor function using the label vectors as:

    Ψ_p(x, y_p, A) = a_{l_{y_p}} ⊗ φ_p(x),               if p is atomic,
                     a_{l_{y_p}} ⊗ Ψ_p(x, y_p^{1:}, A),  if p is compositional.    (3)

Here, the symbol ⊗ denotes the tensor product operation. Unrolling the recursion in this definition shows that the feature tensor function for a part is the tensor product of the vectors for all the labels associated with that part and the feature vector associated with the input for the part. For an input x and a structure y, we use the feature tensor function to define its feature representation as

    Φ_A(x, y) = Σ_{p ∈ Γ_x} vec(Ψ_p(x, y_p, A)).    (4)

Here, vec(·) denotes the vectorization operator that converts a tensor into a vector by stacking its elements. Figure 2 shows an example of the process of building the feature vector for a part that is labeled with two labels. With this definition of the feature vector, we can use the standard approach to score structures using a weight vector as w^T Φ_A(x, y).

In our running examples from Figure 1, we have the following definitions of feature functions for each of the cases:

1. Purely atomic part, multiclass classification (left): Denote the feature vector associated with x as φ. 
For an atomic part, the definition of the feature tensor function in Equation (3) effectively produces a d × |φ| matrix a_{l_y} φ^T. Thus the feature vector for the structure y is Φ_A(x, y) = vec(a_{l_y} φ^T). For this case, the score for an input x being assigned a label y can be written explicitly as the following summation:

    w^T Φ_A(x, y) = Σ_{i=0}^{d} Σ_{j=0}^{|φ|} w_{dj+i} a_{l_y,i} φ_j

[Figure 2 diagram: a_{l_1} ∈ ℝ^d ⊗ a_{l_2} ∈ ℝ^d ⊗ φ_p(x) ∈ ℝ^N gives a d × d × N feature tensor, whose vectorization is a feature vector in ℝ^{d²N}.]

Figure 2: This figure summarizes feature vector generation for a compositional part labeled with two labels l_1 and l_2. Each label is associated with a d dimensional label vector and the feature vector for the input is N dimensional. Vectorizing the feature tensor produces a final feature vector that is a d²N-dimensional vector.

2. Purely compositional part (right): For a compositional part, the feature tensor function produces a tensor whose elements effectively enumerate every possible combination of elements of the input vector φ_p(x) and the associated label vectors. So, the feature vector for the structure is Φ_A(x, y) = vec(a_{l_{y_0}} ⊗ a_{l_{y_1}} ⊗ a_{l_{y_2}} ⊗ φ_p(x)).

3. First-order sequence (middle): This structure presents a combination of atomic and compositional parts. Suppose we denote the input emission features by φ_{E,i} for the ith label and the input features corresponding to the transition¹ from y_i to y_{i+1} by φ_{T,i}. With this notation, we can define the feature vector for the structure as

    Φ_A(x, y) = Σ_i vec(a_{l_{y_i}} ⊗ φ_{E,i}) + Σ_i vec(a_{l_{y_i}} ⊗ a_{l_{y_{i+1}}} ⊗ φ_{T,i}).

3.2 Discussion

Connection to standard structured prediction. For a part p, a traditional structured model conjoins all its associated labels to the input feature vector to get the feature vector for that assignment of the labels. According to the definition of Equation (3), we propose that these label conjunctions should be replaced with a tensor product, which generalizes the standard method. Indeed, if the labels are represented via one-hot vectors, then we would recover standard structured prediction where each label (or group of labels) is associated with a separate section of the weight vector. For example, for multiclass classification, if each label is associated with a separate one-hot vector, then the feature tensor for a given label will be a matrix where exactly one column is the input feature vector φ_p(x) and all other entries are zero. This argument also extends to compositional parts.

Dimensionality of label vectors. If labels are represented by one-hot vectors, the dimensionality of the label vectors will be M, the number of labels in the problem. However, in DISTRO, in addition to letting the label vectors be any unit vector, we can also allow them to exist in a lower dimensional space. This presents us with a decision with regard to the dimensionality d.

The choice of d is important for two reasons. First, it determines the number of parameters in the model. If a part is associated with m labels, recall that the feature tensor function produces an (m + 1)-order tensor formed by taking the tensor product of the m label vectors and the input features. 
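As a sanity check on these definitions, the recursive feature tensor of Equation (3) and its vectorization (Equation (4)) can be sketched in a few lines of NumPy. The dimensions and vectors below are hypothetical, and the final assertion illustrates the one-hot special case discussed above:

```python
import numpy as np

def feature_tensor(label_vecs, phi_x):
    """Psi_p of Equation (3): the tensor product of a part's label
    vectors with its input features, built recursively."""
    if len(label_vecs) == 1:          # atomic part
        return np.multiply.outer(label_vecs[0], phi_x)
    # compositional part: first label vector tensored with the rest
    return np.multiply.outer(label_vecs[0],
                             feature_tensor(label_vecs[1:], phi_x))

rng = np.random.default_rng(0)
d, n = 3, 5                           # hypothetical label-vector / input sizes
a1, a2 = rng.normal(size=d), rng.normal(size=d)
phi_x = rng.normal(size=n)

t = feature_tensor([a1, a2], phi_x)   # order-3 tensor of shape (d, d, n)
feat = t.ravel()                      # vec(.): the d^2 * n feature vector
assert feat.shape == (d * d * n,)

# One-hot label vectors recover standard conjoined features:
e0 = np.eye(d)[0]
m = feature_tensor([e0], phi_x)       # d x n matrix
# exactly one slice holds phi(x); everything else is zero
assert np.allclose(m[0], phi_x) and np.allclose(m[1:], 0.0)
```

The sketch makes the parameter count visible: a part with two labels contributes a d²·n block of features, so the weight vector must grow accordingly.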
That is, the feature vector for the part is a d^m |φ_p(x)|-dimensional vector. (See Figure 2 for an illustration.) Smaller d thus leads to smaller weight vectors. Second, if the dimensionality of the label vectors is lower, it encourages more weights to be shared across labels. Indeed, for purely atomic and compositional parts, if the labels are represented by M dimensional vectors, we can show that for any weight vector that scores these labels via the feature representation defined in Equation (4), there is another weight vector that assigns the same scores using one-hot label vectors.

4 Learning Weights and Label Vectors

In this section, we will address the problem of learning the weight vectors w and the label vectors A from data. We are given a training set with N examples of the form (x_i, y_i). The goal of learning is to minimize regularized risk over the training set. This leads to a training objective similar to that of structural SVMs or conditional random fields (Equation (2)). However, there are two key differences. First, the feature vectors for structures are not fixed as in structural SVMs or CRFs but are functions of the label vectors. Second, the minimization is over not just the weight vectors, but also over the label vectors, which require regularization.

In order to encourage the labels to share weights, we propose to impose a rank penalty over the label matrix A in the learning objective. Since the rank minimization problem is known to be computationally intractable in general [27], we use the well known nuclear norm surrogate to replace the rank [10].

¹In a linear sequence model defined as a CRF or a structural SVM, these transition input features can simply be an indicator that selects a specific portion of the weight vector. 
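One reason the nuclear norm is a convenient surrogate is that its proximal operator has a closed form: soft-thresholding of the singular values [10, 18]. A minimal sketch, where the threshold τ and the matrix size are hypothetical:

```python
import numpy as np

def prox_nuclear(A, tau):
    """Proximal operator of tau * ||A||_* : soft-threshold the singular
    values of A (the step used inside proximal-gradient updates)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return U @ np.diag(s_shrunk) @ Vt

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 8))           # hypothetical d x M label matrix
A_new = prox_nuclear(A, tau=1.0)

# shrinking singular values can only lower (or keep) the rank
assert np.linalg.matrix_rank(A_new) <= np.linalg.matrix_rank(A)
```

Singular values that fall below τ are set exactly to zero, which is how the penalty drives the label matrix toward low rank.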
This gives us the learning objective, defined as f below:

    f(w, A) = (λ₁/2) w^T w + λ₂ ||A||_* + (1/N) Σ_i L(x_i, y_i; w, A).    (5)

Here, ||A||_* is the nuclear norm of A, defined as the sum of the singular values of the matrix. Compared to the objective in Equation (2), the loss function L is also dependent on the label matrix via the new definition of the features. In this paper, we instantiate the loss using the structured hinge loss [25]. That is, we define L to be

    L(x_i, y_i; w, A) = max_y ( w^T Φ_A(x_i, y) + Δ(y, y_i) − w^T Φ_A(x_i, y_i) ).    (6)

Here, Δ is the Hamming distance. This defines the DISTRO extension of the structural SVM.

The goal of learning is to minimize the objective function f in terms of both its parameters w and A, where each column of A is restricted to be a unit vector by definition. However, the objective is no longer jointly convex in both w and A because of the product terms in the definition of the feature tensor.

We use an alternating minimization algorithm for solving the optimization problem (Algorithm 1). If the label matrix A is fixed, then so are the feature representations of structures (from Equation (4)). Thus, for a fixed A (lines 2 and 5), the problem of minimizing f(w, A) with respect to only w is identical to the learning problem of structural SVMs. Since gradient computation and inference do not change from the usual setting, we can solve this minimization over w using stochastic sub-gradient descent (SGD). For fixed weight vectors (line 4), we implemented stochastic sub-gradient descent using the proximal gradient method [18] for solving for A. The supplementary material gives further details about the steps of the algorithm.

Algorithm 1: Learning algorithm by alternating minimization. The goal is to solve min_{w,A} f(w, A). 
The input to the problem is a training set of examples consisting of pairs of labeled inputs (x_i, y_i) and T, the number of iterations.

1: Initialize A_0 randomly
2: Initialize w_0 = min_w f(w, A_0)
3: for t = 1, ···, T do
4:   A_t ← min_A f(w_{t−1}, A)
5:   w_t ← min_w f(w, A_t)
6: end for
7: return (w_T, A_T)

Even though the objective function is not jointly convex in w and A, in our experiments (Section 5), we found that in all but one trial, the non-convexity of the objective did not affect performance. Because the feature functions are multilinear in w and A, multiple equivalent solutions can exist (from the perspective of the score assigned to structures) and the eventual point of convergence is dependent on the initialization.

For regularizing the label matrix, we also experimented with the Frobenius norm and found that not only does the nuclear norm have an intuitive explanation (rank minimization), but it also performed better. Furthermore, the proximal method itself does not add significantly to the training time because the label matrix is small. In practice, training time is affected by the density of the label vectors, and sparser vectors correspond to faster training because the sparsity can be used to speed up dot product computation. Prediction, however, is as fast as inference in standard models, because the only change is in feature computation via the vectorization operator, which can be performed efficiently.

5 Experiments

We demonstrate the effectiveness of DISTRO on two tasks – document classification (purely atomic structures) and part-of-speech (POS) tagging (both atomic and compositional structures). In both cases, we compare to structural SVMs – i.e. the case of one-hot label vectors – as the baseline. We selected the hyper-parameters for all experiments by cross validation. 
We ran the alternating algorithm for 5 epochs in all cases, with 5 epochs of SGD for both the weight and label vectors. We allowed the baseline to run for 25 epochs over the data. For the proposed method, we ran all the experiments five times with different random initializations for the label vectors and report the average accuracy. Even though the objective is not convex, we found that the learning algorithm converged quickly in almost all trials. When it did not, the objective value on the training set at the end of each alternating SGD step in the algorithm was a good indicator of ill-behaved initializations. This allowed us to discard bad initializations during training.

5.1 Atomic structures: Multiclass Classification

Our first application is the problem of document classification with the 20 Newsgroups dataset [13]. This dataset is a collection of about 20,000 newsgroup posts partitioned roughly evenly among 20 newsgroups. The task is to predict the newsgroup label given the post. As observed in Section 1, some newsgroups are more closely related to each other than others.

We used the 'bydate' version of the data with tokens as features. Table 1 reports the performance of the baseline and variants of DISTRO for newsgroup classification. The top part of the table compares the baseline to our method and we see that modeling the label semantics gives us a 2.6% increase in accuracy. In a second experiment (Table 1, bottom), we studied the effect of explicitly reducing the label vector dimensionality. We see that even with 15 dimensional vectors, we can outperform the baseline, and the performance of the baseline is almost matched with 10 dimensional vectors. Recall that the size of the weight vector increases with increasing label vector dimensionality (see Figure 2). 
This motivates a preference for smaller label vectors.

Algorithm        | Label Matrix Rank | Average accuracy (%)
Structured SVM   | 20                | 81.4
DISTRO           | 19                | 84.0
Reduced dimensionality setting:
DISTRO           | 15                | 83.1
DISTRO           | 10                | 80.9

Table 1: Results on 20 newsgroup classification. The top part of the table compares the baseline against the full DISTRO model. The bottom part shows the performance of two versions of DISTRO where the dimensionality of the label vectors is fixed. Even with 10-dimensional vectors, we can almost match the baseline.

5.2 Compositional Structures: Sequence classification

We evaluated DISTRO for English and Basque POS tagging using first-order sequence models.

English POS tagging has long been studied using the Penn Treebank data [15]. We used the standard train-test split [8, 24] – we trained on sections 0-18 of the Treebank and report performance on sections 22-24. The data is labeled with 45 POS labels. Some labels are semantically close to each other because they express variations of a base part-of-speech tag. For example, the labels NN, NNS, NNP and NNPS indicate singular and plural versions of common and proper nouns.

We used the Basque data from the CoNLL 2007 shared task [17] for training the Basque POS tagger. This data comes from the 3LB Treebank. There are 64 fine-grained parts of speech. Interestingly, the labels themselves have a structure. For example, the labels IZE and ADJ indicate a noun and an adjective respectively. However, Basque can take internal noun ellipsis inside noun-forms, which are represented with tags like IZE IZEELI and ADJ IZEELI to indicate nouns and adjectives with internal ellipses.

In both languages, many labels and transitions between labels are semantically close to each other. This observation has led, for example, to the development of the universal part-of-speech tag set [19]. 
Clearly, the labels should not be treated as independent units of meaning and the model should be allowed to take advantage of the dependencies between labels.

Language | Algorithm      | Label Matrix Rank | Average accuracy (%)
English  | Structured SVM | 45                | 96.2
English  | DISTRO         | 5                 | 95.1
English  | DISTRO         | 20                | 96.7
Basque   | Structured SVM | 64                | 91.5
Basque   | DISTRO         | 58                | 92.4

Table 2: Results on part-of-speech tagging. The top part of the table shows results on English, where we see a 0.5% gain in accuracy. The bottom part shows Basque results where we see a nearly 1% improvement.

For both languages, we extracted the following emission features: indicators for the words, their prefixes and suffixes of length 3, the previous and next words, and the word shape according to the Stanford NLP pipeline²,³. Table 2 presents the results for the two languages. We evaluate using the average accuracy over all tags. In the English case, we found that the performance plateaued for any label matrix with rank greater than 20, and we see an improvement of 0.5% accuracy. For Basque, we see an improvement of 0.9% over the baseline.

Note that unlike the atomic case, the learning objective for the first-order Markov model is not even bilinear in the weights and the label vectors. However, in practice, we found that this did not cause any problems. In all but one run, the test performance remained consistently higher than the baseline. Moreover, the outlier converged to a much higher objective value; it could easily be identified. As an analysis experiment, we initialized the model with one-hot vectors (i.e. the baseline) and found that this gives us similar improvements as reported in the table.

6 Conclusion

We have presented a new model for structured output prediction called Distributed Structured Output (DISTRO). Our model is motivated by two observations. 
First, distributed representations for inputs\nhave led to performance gains by uncovering shared characteristics across inputs. Second, often,\nstructures are composed of semantically rich labels and sub-structures. Just like inputs, similarities\nbetween components of structures can be exploited for better performance. To take advantage of\nsimilarities among structures, we have proposed to represent labels by real-valued vectors and model\ncompositionality using tensor products between the label vectors. This not only lets semantically\nsimilar labels share parameters, but also allows construction of complex structured output that can\ntake advantage of similarities across its component parts.\nWe have de\ufb01ned the objective function for learning with DISTRO and presented a learning algorithm\nthat jointly learns the label vectors along with the weights using alternating minimization. We\npresented an evaluation of our approach for two tasks \u2013 document classi\ufb01cation, which is an instance\nof multiclass classi\ufb01cation, and part-of-speech tagging for English and Basque, modeled as \ufb01rst-\norder sequence models. Our experiments show that allowing the labels to be represented by real-\nvalued vectors improves performance over the corresponding structural SVM baselines.\n\nAcknowledgments\n\nWe thank the anonymous reviewers for their valuable comments. Stanford University gratefully ac-\nknowledges the support of the Defense Advanced Research Projects Agency (DARPA) Deep Explo-\nration and Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) contract\nno. FA8750-13-2-0040. 
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, AFRL, or the US government.

2 http://nlp.stanford.edu/software/corenlp.shtml
3 Note that our POS systems are not state-of-the-art implementations, which typically use second-order Markov models with additional features and specialized handling of unknown words. However, surprisingly, for Basque, even the baseline gives better accuracy than the second-order TnT tagger [5, 19].

References

[1] J. Abernethy, F. Bach, T. Evgeniou, and J. Vert. Low-rank matrix factorization with attributes. arXiv preprint cs/0611124, 2006.

[2] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In International Conference on Machine Learning, 2007.

[3] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems, 2007.

[4] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

[5] T. Brants. TnT: a statistical part-of-speech tagger. In Conference on Applied Natural Language Processing, 2000.

[6] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Hierarchical classification: combining Bayes with SVM. In International Conference on Machine Learning, 2006.

[7] A. Coates, A. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics, 2011.

[8] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Conference on Empirical Methods in Natural Language Processing, 2002.

[9] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa.
Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2011.

[10] M. Fazel, H. Hindi, and S. Boyd. Rank minimization and applications in system theory. In Proceedings of the American Control Conference, volume 4, 2004.

[11] G. E. Hinton. Representing part-whole hierarchies in connectionist networks. In Annual Conference of the Cognitive Science Society, 1988.

[12] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning, 2001.

[13] K. Lang. Newsweeder: Learning to filter netnews. In International Conference on Machine Learning, 1995.

[14] T. Lei, Y. Xin, Y. Zhang, R. Barzilay, and T. Jaakkola. Low-rank tensors for scoring dependency structures. In Annual Meeting of the Association for Computational Linguistics, 2014.

[15] M. Marcus, G. Kim, M. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger. The Penn Treebank: Annotating predicate argument structure. In Workshop on Human Language Technology, 1994.

[16] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[17] J. Nivre, J. Hall, S. Kübler, R. McDonald, J. Nilsson, S. Riedel, and D. Yuret. The CoNLL 2007 shared task on dependency parsing. In CoNLL Shared Task Session of EMNLP-CoNLL, 2007.

[18] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3), 2013.

[19] S. Petrov, D. Das, and R. McDonald. A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086, 2011.

[20] T. A. Plate. Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3), 1995.

[21] P. Smolensky. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1), 1990.

[22] R.
Socher, B. Huval, C. Manning, and A. Ng. Semantic compositionality through recursive matrix-vector spaces. In Empirical Methods in Natural Language Processing, 2012.

[23] N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, 2004.

[24] K. Toutanova, D. Klein, C. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, 2003.

[25] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 2005.

[26] J. Turian, L. Ratinov, and Y. Bengio. Word representations: A simple and general method for semi-supervised learning. In Annual Meeting of the Association for Computational Linguistics, 2010.

[27] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review, 38(1), 1996.