{"title": "A Multiplicative Model for Learning Distributed Text-Based Attribute Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 2348, "page_last": 2356, "abstract": "In this paper we propose a general framework for learning distributed representations of attributes: characteristics of text whose representations can be jointly learned with word embeddings. Attributes can correspond to a wide variety of concepts, such as document indicators (to learn sentence vectors), language indicators (to learn distributed language representations), meta-data and side information (such as the age, gender and industry of a blogger) or representations of authors. We describe a third-order model where word context and attribute vectors interact multiplicatively to predict the next word in a sequence. This leads to the notion of conditional word similarity: how meanings of words change when conditioned on different attributes. We perform several experimental tasks including sentiment classification, cross-lingual document classification, and blog authorship attribution. We also qualitatively evaluate conditional word neighbours and attribute-conditioned text generation.", "full_text": "A Multiplicative Model for Learning Distributed Text-Based Attribute Representations\n\nRyan Kiros, Richard S. Zemel, Ruslan Salakhutdinov\n\nUniversity of Toronto\n\nCanadian Institute for Advanced Research\n\n{rkiros, zemel, rsalakhu}@cs.toronto.edu\n\nAbstract\n\nIn this paper we propose a general framework for learning distributed representations of attributes: characteristics of text whose representations can be jointly learned with word embeddings. 
Attributes can correspond to a wide variety of concepts, such as document indicators (to learn sentence vectors), language indicators (to learn distributed language representations), meta-data and side information (such as the age, gender and industry of a blogger) or representations of authors. We describe a third-order model where word context and attribute vectors interact multiplicatively to predict the next word in a sequence. This leads to the notion of conditional word similarity: how meanings of words change when conditioned on different attributes. We perform several experimental tasks including sentiment classification, cross-lingual document classification, and blog authorship attribution. We also qualitatively evaluate conditional word neighbours and attribute-conditioned text generation.\n\n1 Introduction\n\nDistributed word representations have enjoyed success in several NLP tasks [1, 2]. More recently, the use of distributed representations has been extended to model concepts beyond the word level, such as sentences, phrases and paragraphs [3, 4, 5, 6], entities and relationships [7, 8] and embeddings of semantic categories [9, 10].\nIn this paper we propose a general framework for learning distributed representations of attributes: characteristics of text whose representations can be jointly learned with word embeddings. The use of the word attribute in this context is general. Table 1 illustrates several of the experiments we perform along with the corresponding notion of attribute. For example, an attribute can represent an indicator of the current sentence or language being processed. This allows us to learn sentence and language vectors, similar to the proposed model of [6]. Attributes can also correspond to side information, or metadata associated with text. For instance, a collection of blogs may come with information about the age, gender or industry of the author. 
This allows us to learn vectors that can capture similarities across metadata based on the associated body of text. The goal of this work is to show that our notion of attribute vectors can achieve strong performance on a wide variety of NLP related tasks. In particular, we demonstrate strong quantitative performance on three highly diverse tasks: sentiment classification, cross-lingual document classification, and blog authorship attribution.\nTo capture these kinds of interactions between attributes and text, we propose to use a third-order model where attribute vectors act as gating units to a word embedding tensor. That is, words are represented as a tensor consisting of several prototype vectors. Given an attribute vector, a word embedding matrix can be computed as a linear combination of word prototypes weighted by the attribute representation. During training, attribute vectors reside in a separate lookup table which can be jointly learned along with word features and the model parameters. This type of three-way interaction can be embedded into a neural language model, where the three-way interaction consists of the previous context, the attribute and the score (or distribution) of the next word after the context.\n\nTable 1: Summary of tasks and attribute types used in our experiments. The first three are quantitative while the second three are qualitative.\n\nTask | Dataset | Attribute type\nSentiment Classification | Sentiment Treebank | Sentence Vector\nCross-Lingual Classification | RCV1/RCV2 | Language Vector\nAuthorship Attribution | Blog Corpus | Author Metadata\nConditional Text Generation | Gutenberg Corpus | Book Vector\nStructured Text Generation | Gutenberg Corpus | Part of Speech Tags\nConditional Word Similarity | Blogs & Europarl | Author Metadata / Language\n\nUsing a word embedding tensor gives rise to the notion of conditional word similarity. 
More specifically, the neighbours of word embeddings can change depending on which attribute is being conditioned on. For example, the word \u2018joy\u2019 when conditioned on an author with the industry attribute \u2018religion\u2019 appears near \u2018rapture\u2019 and \u2018god\u2019 but near \u2018delight\u2019 and \u2018comfort\u2019 when conditioned on an author with the industry attribute \u2018science\u2019. Another way of thinking of our model would be the language analogue of [11]. They used a factored conditional restricted Boltzmann machine for modelling motion style defined by real or continuous valued style variables. When our factorization is embedded into a neural language model, it allows us to generate text conditioned on different attributes in the same manner as [11] could generate motions from different styles. As we show in our experiments, if attributes are represented by different books, samples generated from the model learn to capture associated writing styles from the author. Furthermore, we demonstrate a strong performance gain for authorship attribution when conditional word representations are used.\nMultiplicative interactions have also been previously incorporated into neural language models. [12] introduced a multiplicative model where images are used for gating word representations. Our framework can be seen as a generalization of [12] and in the context of their work an attribute would correspond to a fixed representation of an image. [13] introduced a multiplicative recurrent neural network for generating text at the character level. In their model, the character at the current timestep is used to gate the network\u2019s recurrent matrix. This led to a substantial improvement in the ability to generate text at the character level as opposed to a non-multiplicative recurrent network.\n\n2 Methods\n\nIn this section we describe the proposed models. 
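Before the formal development, the core operation of this section, computing an attribute-gated word embedding matrix as a linear combination of tensor slices and, equivalently, through a factored decomposition, can be sketched in a few lines of numpy. All names, sizes and random arrays below are illustrative placeholders, not the paper's trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, D, F = 50, 8, 4, 16   # vocab size, word dim, attribute dim, factors (toy sizes)

# Factor matrices of the word tensor, following the Wfk / Wfd / Wfv decomposition.
Wfk = rng.normal(size=(F, K))
Wfd = rng.normal(size=(F, D))
Wfv = rng.normal(size=(F, V))

x = rng.normal(size=D)      # an attribute vector

# Factored form: T^x = (Wfv)^T . diag(Wfd x) . Wfk, a V x K embedding matrix.
Tx = Wfv.T @ np.diag(Wfd @ x) @ Wfk

# Unfactored check: build the full V x K x D tensor and contract with x,
# i.e. a linear combination of slices weighted by the components of x.
T = np.einsum('fv,fd,fk->vkd', Wfv, Wfd, Wfk)
Tx_unfactored = np.einsum('vkd,d->vk', T, x)

assert np.allclose(Tx, Tx_unfactored)
```

The factored form needs F(V + K + D) parameters instead of VKD, which is what makes this kind of gating practical at vocabulary scale.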
We first review the log-bilinear neural language model of [14] as it forms the basis for much of our work. Next, we describe a word embedding tensor and show how it can be factored and introduced into a multiplicative neural language model. This is concluded by detailing how our attribute vectors are learned.\n\n2.1 Log-bilinear neural language models\n\nThe log-bilinear language model (LBL) [14] is a deterministic model that may be viewed as a feed-forward neural network with a single linear hidden layer. Each word w in the vocabulary is represented as a K-dimensional real-valued vector rw \u2208 RK. Let R denote the V \u00d7 K matrix of word representation vectors where V is the vocabulary size. Let (w1, . . . , wn\u22121) be a tuple of n \u2212 1 words where n \u2212 1 is the context size. The LBL model makes a linear prediction of the next word representation as\n\n\u02c6r = \u2211_{i=1}^{n\u22121} C(i)rwi,   (1)\n\nwhere C(i), i = 1, . . . , n \u2212 1 are K \u00d7 K context parameter matrices. Thus, \u02c6r is the predicted representation of rwn. The conditional probability P (wn = i|w1:n\u22121) of wn given w1, . . . , wn\u22121 is\n\nP (wn = i|w1:n\u22121) = exp(\u02c6r\u22a4ri + bi) / \u2211_{j=1}^{V} exp(\u02c6r\u22a4rj + bj),   (2)\n\nwhere b \u2208 RV is a bias vector. Learning can be done using backpropagation.\n\n(a) NLM  (b) Multiplicative NLM  (c) Multiplicative NLM with language switch\n\nFigure 1: Three different formulations for predicting the next word in a neural language model. Left: A standard neural language model (NLM). Middle: The context and attribute vectors interact via a multiplicative interaction. 
Right: When words are unshared across attributes, a one-hot attribute vector gates the factors-to-vocabulary matrix.\n\n2.2 A word embedding tensor\n\nTraditionally, word representation matrices are represented as a matrix R \u2208 RV\u00d7K, such as in the case of the log-bilinear model. Throughout this work, we instead represent words as a tensor T \u2208 RV\u00d7K\u00d7D where D corresponds to the number of tensor slices. Given an attribute vector x \u2208 RD, we can compute attribute-gated word representations as T x = \u2211_{i=1}^{D} xi T (i), i.e. word representations with respect to x are computed as a linear combination of slices weighted by each component xi of x.\nIt is often unnecessary to use a fully unfactored tensor. Following [15, 16], we re-represent T in terms of three matrices Wf k \u2208 RF\u00d7K, Wf d \u2208 RF\u00d7D and Wf v \u2208 RF\u00d7V , such that\n\nT x = (Wf v)\u22a4 \u00b7 diag(Wf dx) \u00b7 Wf k,   (3)\n\nwhere diag(\u00b7) denotes the matrix with its argument on the diagonal. These matrices are parametrized by a pre-chosen number of factors F .\n\n2.3 Multiplicative neural language models\n\nWe now show how to embed our word representation tensor T into the log-bilinear neural language model. Let E = (Wf k)\u22a4Wf v denote a \u2018folded\u2019 K \u00d7 V matrix of word embeddings. Given the context w1, . . . , wn\u22121, the predicted next word representation \u02c6r is given by\n\n\u02c6r = \u2211_{i=1}^{n\u22121} C(i)E(:, wi),   (4)\n\nwhere E(:, wi) denotes the column of E for the word representation of wi and C(i), i = 1, . . . , n\u22121 are K \u00d7 K context matrices. Given a predicted next word representation \u02c6r, the factor outputs are\n\nf = (Wf k\u02c6r) \u2022 (Wf dx),   (5)\n\nwhere \u2022 is a component-wise product. The conditional probability P (wn = i|w1:n\u22121, x) of wn given w1, . . . 
, wn\u22121 and x can be written as\n\nP (wn = i|w1:n\u22121, x) = exp((Wf v(:, i))\u22a4f + bi) / \u2211_{j=1}^{V} exp((Wf v(:, j))\u22a4f + bj).\n\nHere, Wf v(:, i) denotes the column of Wf v corresponding to word i. In contrast to the log-bilinear model, the matrix of word representations R from before is replaced with the factored tensor T , as shown in Fig. 1.\n\n2.4 Unshared vocabularies across attributes\n\nOur formulation for T assumes that word representations are shared across all attributes. In some cases, words may only be specific to certain attributes and not others. An example of this is cross-lingual modelling, where it is necessary to have language specific vocabularies. As a running example, consider the case where each attribute corresponds to a language representation vector.\n\nTable 2: Samples generated from the model when conditioning on various attributes. For the last example, we condition on the average of the two vectors (symbol <#> corresponds to a number).\n\nAttribute: Bible\nSample: <#> : <#> for thus i enquired unto thee , saying , the lord had not come unto him . <#> : <#> when i see them shall see me greater am that under the name of the king on israel .\n\nAttribute: Caesar\nSample: to tell vs pindarus : shortly pray , now hence , a word . comes hither , and let vs exclaim once by him fear till loved against caesar . till you are now which have kept what proper deed there is an ant ? for caesar not wise cassi\n\nAttribute: 1/2 (Bible + Caesar)\nSample: let our spring tiger as with less ; for tucking great fellowes at ghosts of broth . industrious time with golden glory employments . <#> : <#> but are far in men soft from bones , assur too , set and blood of smelling , and there they cost , i learned : love no guile his word downe the mystery of possession\n\nLet x denote the attribute vector for language \u2113 and x\u2032 for language \u2113\u2032 (e.g. English and French). 
We can then compute language-specific word representations T \u2113 by breaking up our decomposition into language dependent and independent components (see Fig. 1c):\n\nT \u2113 = (Wf v\u2113)\u22a4 \u00b7 diag(Wf dx) \u00b7 Wf k,   (6)\n\nwhere (Wf v\u2113)\u22a4 is a V\u2113 \u00d7 F language specific matrix. The matrices Wf d and Wf k do not depend on the language or the vocabulary, whereas (Wf v\u2113)\u22a4 is language specific. Moreover, since each language may have a different sized vocabulary, we use V\u2113 to denote the vocabulary size of language \u2113. Observe that this model has an interesting property in that it allows us to share statistical strength across word representations of different languages. In particular, we show in our experiments how we can improve cross-lingual classification performance between English and German when a large amount of parallel data exists between English and French and only a small amount of parallel data exists between English and German.\n\n2.5 Learning attribute representations\n\nWe now discuss how to learn representation vectors x. Recall that when training neural language models, the word representations of w1, . . . , wn\u22121 are updated by backpropagating through the word embedding matrix. We can think of this as being a linear layer, where the input to this layer is a one-hot vector with the i-th position active for word wi. Then multiplying this vector by the embedding matrix results in the word vector for wi. Thus the columns of the word representations matrix consisting of words from w1, . . . , wn\u22121 will have non-zero gradients with respect to the loss. This allows us to consistently modify the word representations throughout training.\nWe construct attribute representations in a similar way. Suppose that L is an attribute lookup table, where x = f (L(:, x)) and f is an optional non-linearity. 
We often use a rectifier non-linearity in order to keep x sparse and positive, which we found made training much more stable. Initially, the entries of L are generated randomly. During training, we treat L in the same way as the word embedding matrix. This way of learning language representations allows us to measure how \u2018similar\u2019 attributes are as opposed to using a one-hot encoding of attributes for which no such similarity could be computed.\nIn some cases, attributes that are available during training may not also be available at test time. An example of this is when attributes are used as sentence indicators for learning representations of sentences. To accommodate for this, we use an inference step similar to that proposed by [6]. That is, at test time all the network parameters are fixed and stochastic gradient descent is used for inferring the representation of an unseen attribute vector.\n\n3 Experiments\n\nIn this section we describe our experimental evaluation and results. Throughout this section we refer to our model as Attribute Tensor Decomposition (ATD). All models are trained using stochastic gradient descent with an exponential learning rate decay and linear (per epoch) increase in momentum.\nWe first demonstrate initial qualitative results to get a sense of the tasks our model can perform. For these, we use the small project Gutenberg corpus which consists of 18 books, some of which have the same author. We first trained a multiplicative neural language model with a context size of 5,\n\nTable 3: A modified version of the game Mad Libs. 
Given an initialization, the model is to generate the next 5 words according to the part-of-speech sequence (note that these are not hard constraints).\n\n[DT, NN, IN, DT, JJ] the meaning of life is... | the cure of the bad | the truth of the good | a penny for the fourth | the globe of those modern | all man upon the same\n[TO, VB, VBD, JJS, NNS] my greatest accomplishment is... | to keep sold most wishes | to make manned most magnificent | to keep wounded best nations | to be allowed best arguments | to be mentioned most people\n[PRP, NN, \u2018,\u2019 , JJ, NN] i could not live without... | his regard , willing tenderness | her french , serious friend | her father , good voice | her heart , likely beauty | her sister , such character\n\nTable 4: Classification accuracies on various tasks. Left: Sentiment classification on the treebank dataset. Competing methods include the Neural Bag of words (NBoW) [5], Recursive Network (RNN) [17], Matrix-Vector Recursive Network (MV-RNN) [18], Recursive Tensor Network (RTNN) [3], Dynamic Convolutional Network (DCNN) [5] and Paragraph Vector (PV) [6]. Right: Cross-lingual classification on RCV2. Methods include statistical machine translation (SMT), I-Matrix [19], Bag-of-words autoencoders (BAE-*) [20] and BiCVM, BiCVM+ [21]. The use of \u2018+\u2019 on cross-lingual tasks indicates the use of a third language (French) for learning embeddings.\n\nMethod | Fine-grained | Positive / Negative\nSVM | 40.7% | 79.4%\nBiNB | 41.9% | 83.1%\nNBoW | 42.4% | 80.5%\nRNN | 43.2% | 82.4%\nMV-RNN | 44.4% | 82.9%\nRTNN | 45.7% | 85.4%\nDCNN | 48.5% | 86.8%\nPV | 48.7% | 87.8%\nATD | 45.9% | 83.3%\n\nMethod | EN \u2192 DE | DE \u2192 EN\nSMT | 68.1% | 67.4%\nI-Matrix | 77.6% | 71.1%\nBAE-cr | 78.2% | 63.6%\nBAE-tree | 80.2% | 68.2%\nBiCVM | 83.7% | 71.4%\nBiCVM+ | 86.2% | 76.9%\nBAE-corr | 91.8% | 72.8%\nATD | 80.8% | 71.8%\nATD+ | 83.4% | 72.9%\n\nwhere each attribute is represented as a book. 
This results in 18 learned attribute vectors, one for each book. After training, we can condition on a book vector and generate samples from the model. Table 2 illustrates some of the generated samples. Our model learns to capture the \u2018style\u2019 associated with different books. Furthermore, by conditioning on the average of book representations, the model can generate reasonable samples that represent a hybrid of both attributes, even though such attribute combinations were not observed during training.\nNext, we computed POS sequences from sentences that occur in the training corpus. We trained a multiplicative neural language model with a context size of 5 to predict the next word from its context, given knowledge of the POS tag for the next word. That is, we model P (wn = i|w1:n\u22121, x) where x denotes the POS tag for word wn. After training, we gave the model an initial input and a POS sequence and proceeded to generate samples. Table 3 shows some results for this task. Interestingly, the model can generate rather funny and poetic completions to the initial context.\n\n3.1 Sentiment classification\n\nOur first quantitative experiments are performed on the sentiment treebank of [3]. A common challenge for sentiment classification tasks is that the global sentiment of a sentence need not correspond to local sentiments exhibited in sub-phrases of the sentence. 
To address this issue, [3] collected annotations from the movie reviews corpus of [22] of all subphrases extracted from a sentence parser. By incorporating local sentiment into their recursive architectures, [3] was able to obtain significant performance gains with recursive networks over bag of words baselines.\nWe follow the same experimental procedure proposed by [3] for which evaluation is reported on two tasks: fine-grained classification of categories {very negative, negative, neutral, positive, very positive} and binary classification {positive, negative}. We extracted all subphrases of sentences that occur in the training set and used these to train a multiplicative neural language model. Here, each attribute is represented as a sentence vector, as in [6]. In order to compute subphrases for unseen sentences, we apply an inference procedure similar to [6], where the weights of the network are frozen and gradient descent is used to infer representations for each unseen vector. We trained a logistic regression classifier using all training subphrases in the training set. At test time, we infer a representation for a new sentence which is used for making a review prediction. We used a context size of 8, 100 dimensional word vectors initialized from [2] and 100 dimensional sentence vectors initialized by averaging vectors of words from the corresponding sentence.\nTable 4, left panel, illustrates our results on this task in comparison to all other proposed approaches. Our results are on par with the highest performing recursive network on the fine-grained task and outperform all bag-of-words baselines and recursive networks with the exception of the RTNN on the binary task. 
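The inference procedure used for unseen sentences (all trained parameters frozen, gradient descent only on the new attribute vector) can be sketched as follows. The frozen parameters are random stand-ins, the context words are simply averaged rather than combined with context matrices, and numerical differentiation replaces backpropagation, so this is only an illustration of the procedure, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(1)
V, K, D, F = 30, 6, 5, 12   # toy sizes

# Frozen "pretrained" parameters (random here, purely for illustration).
Wfk = 0.1 * rng.normal(size=(F, K))
Wfd = 0.1 * rng.normal(size=(F, D))
Wfv = 0.1 * rng.normal(size=(F, V))
E = Wfk.T @ Wfv              # folded K x V embedding matrix
b = np.zeros(V)

def neg_log_prob(x, context, target):
    """-log P(target | context, x) under a multiplicative scoring."""
    r_hat = E[:, context].mean(axis=1)   # crude stand-in for the context combination
    f = (Wfk @ r_hat) * (Wfd @ x)        # factor outputs (component-wise product)
    s = Wfv.T @ f + b                    # one score per vocabulary word
    s = s - s.max()
    return np.log(np.exp(s).sum()) - s[target]

# Infer a vector for an unseen attribute by descending on x alone.
context, target = [3, 7, 12, 20], 9
loss = lambda x: neg_log_prob(x, context, target)
x, eps, lr = np.zeros(D), 1e-5, 5.0
for _ in range(300):
    g = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
                  for e in np.eye(D)])
    x = x - lr * g
```

Because the scores are linear in x, the negative log-probability is convex in x, so this per-example fit is well behaved even though the rest of the network stays fixed.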
Our method is outperformed by the two recently proposed approaches of [5] (a convolutional network trained on sentences) and Paragraph Vector [6].\n\n3.2 Cross-lingual document classification\n\nWe follow the experimental procedure of [19], for which several existing baselines are available to compare our results. The experiment proceeds as follows. We first use the Europarl corpus [23] for inducing word representations across languages. Let S be a sentence with words w in language \u2113 and let x be the corresponding language vector. Let\n\nv\u2113(S) = \u2211_{w\u2208S} T \u2113(:, w) = \u2211_{w\u2208S} (Wf v\u2113(:, w))\u22a4 \u00b7 diag(Wf dx) \u00b7 Wf k   (7)\n\ndenote the sentence representation of S, defined as the sum of language conditioned word representations for each w \u2208 S. Equivalently we define a sentence representation for the translation S\u2032 of S denoted as v\u2113\u2032(S\u2032). We then optimize the following ranking objective:\n\nminimize_\u03b8 \u2211_S \u2211_k max{0, \u03b1 + \u2016v\u2113(S) \u2212 v\u2113\u2032(S\u2032)\u2016\u2082\u00b2 \u2212 \u2016v\u2113(S) \u2212 v\u2113\u2032(Ck)\u2016\u2082\u00b2} + \u03bb\u2016\u03b8\u2016\u2082\u00b2\n\nsubject to the constraints that each sentence vector has unit norm. Each Ck is a contrastive (non-translation) sentence of S and \u03b8 denotes all model parameters. This type of cross-language ranking loss was first used by [21] but without the norm constraint which we found significantly improved the stability of training. The Europarl corpus contains roughly 2 million parallel sentence pairs between English and German as well as English and French, for which we induce 40 dimensional word representations. Evaluation is then performed on English and German sections of the Reuters RCV1/RCV2 corpora. 
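The ranking objective above can be sketched directly. The sentence vectors here are random unit-norm stand-ins for the learned representations v\u2113(S) (hypothetical, for illustration), with a small perturbation of the source vector playing the role of the true translation.

```python
import numpy as np

def hinge_rank_loss(v_s, v_t, contrastive, alpha=1.0):
    """Sum of margin violations: the true translation v_t should be closer
    to v_s than every contrastive (non-translation) sentence vector."""
    pos = np.sum((v_s - v_t) ** 2)
    return sum(max(0.0, alpha + pos - np.sum((v_s - c) ** 2))
               for c in contrastive)

rng = np.random.default_rng(2)
unit = lambda v: v / np.linalg.norm(v)              # the unit-norm constraint

v_en = unit(rng.normal(size=40))                    # 40-dimensional, as in the paper
v_de_good = unit(v_en + 0.05 * rng.normal(size=40))     # stand-in translation
v_de_bad = [unit(rng.normal(size=40)) for _ in range(5)]  # 5 contrastive terms

loss_matched = hinge_rank_loss(v_en, v_de_good, v_de_bad)
loss_mismatched = hinge_rank_loss(v_en, v_de_bad[0], v_de_bad[1:])
```

A matched pair incurs little or no loss, while treating a random sentence as the translation violates the margin for most contrastive terms, which is what drives translations together during training.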
Note that these documents are not parallel. The Reuters dataset contains multiple labels for each document. Following [19], we only consider documents which have been assigned to one of the top 4 categories in the label hierarchy. These are CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social) and MCAT (Markets). There are a total of 34,000 English documents and 42,753 German documents with vocabulary sizes of 43,614 English words and 50,110 German words. We consider both training on English and evaluating on German and vice versa. To represent a document, we sum over the word representations of words in that document followed by a unit-ball projection. Following [19] we use an averaged perceptron classifier. Classification accuracy is then evaluated on a held-out test set in the other language. We used a monolingual validation set for tuning the margin \u03b1, which was set to \u03b1 = 1. Five contrastive terms were used per example which were randomly assigned per epoch.\nTable 4, right panel, shows our results compared to all proposed methods thus far. We are competitive with the current state-of-the-art approaches, being outperformed only by BiCVM+ [21] and BAE-corr [20] on EN \u2192 DE. The BAE-corr method combines both a reconstruction term and a correlation regularizer to match sentences, while our method does not consider reconstruction. We also performed experimentation on a low resource task, where we assume the same conditions as above with the exception that we only use 10,000 parallel sentence pairs between English and German while still incorporating all English and French parallel sentences. For this task, we compare against a separation baseline, which is the same as our model but with no parameter sharing across languages (and thus resembles [21]). Here we achieve 74.7% and 69.7% accuracies (EN\u2192DE and DE\u2192EN) while the separation baseline obtains 63.8% and 67.1%. 
This indicates that parameter sharing across languages can be useful when only a small amount of parallel data is available. Figure 2 further shows t-SNE embeddings of English-German word pairs.1\nAnother interesting consideration is whether or not the learned language vectors can capture any interesting properties of various languages. To look into this, we trained a multiplicative neural language model simultaneously on 5 languages: English, French, German, Czech and Slovak. To our knowledge, this is the largest number of languages for which word representations have been jointly learned. We computed a correlation matrix from the language vectors, illustrated in Fig. 3a. Interestingly, we observe high correlation between Czech and Slovak representations, indicating that the model may have learned some notion of lexical similarity. That being said, additional experimentation in future work is necessary to better understand the similarities exhibited through language vectors.\n\n1 We note that Germany and Deutschland are nearest neighbours in the original space.\n\n(a) Months  (b) Countries\n\nFigure 2: t-SNE embeddings of English-German word pairs learned from Europarl.\n\n(a) Correlation matrix  (b) Effect of conditional embeddings  (c) Effect of inferring attribute vectors\n\nFigure 3: Results on the Blog classification corpus. For the middle and right plots, each pair of same coloured bars corresponds to the non-inclusion or inclusion of inferred attribute vectors, respectively.\n\n3.3 Blog authorship attribution\n\nFor our final task, we use the Blog corpus of [24] which contains 681,288 blog posts from 19,320 authors. For our experiments, we break the corpus into two separate datasets: one containing the 1000 most prolific authors (most blog posts) and the other containing all the rest. 
Each author comes with an attribute tag corresponding to a tuple (age, gender, industry) indicating the age range of the author (10s, 20s or 30s), whether the author is male or female, and what industry the author works in. Note that industry does not necessarily correspond to the topic of blog posts. We use the dataset of non-prolific authors to train a multiplicative language model conditioned on an attribute tuple, of which there are 234 unique tuples in total. We used 100 dimensional word vectors initialized from [2], 100 dimensional attribute vectors with random initialization and a context size of 5. A 1000-way classification task is then performed on the prolific author subset and evaluation is done using 10-fold cross-validation. Our initial experimentation with baselines found that tf-idf performs well on this dataset (45.9% accuracy). Thus, we consider how much we can improve on the tf-idf baseline by augmenting word and attribute features.\nFor the first experiment, we determine the effect conditional word embeddings have on classification performance, assuming attributes are available at test time. For this, we compute two embedding matrices from a trained ATD model, one without and with attribute knowledge:\n\nunconditioned ATD : (Wf v)\u22a4Wf k   (8)\nconditioned ATD : (Wf v)\u22a4 \u00b7 diag(Wf dx) \u00b7 Wf k.   (9)\n\nWe represent a blog post as the sum of word vectors projected to unit norm and augment these with tf-idf features. As an additional baseline we include a log-bilinear language model [14]. 
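The two embedding variants compared here can be computed as below. The sizes, random parameters and word ids are illustrative only, and the non-negative attribute vector mirrors the rectifier noted in Section 2.5.

```python
import numpy as np

rng = np.random.default_rng(3)
V, K, D, F = 40, 10, 6, 20   # toy sizes
Wfk = rng.normal(size=(F, K))
Wfd = rng.normal(size=(F, D))
Wfv = rng.normal(size=(F, V))
x = np.abs(rng.normal(size=D))   # attribute vector, kept positive

E_uncond = Wfv.T @ Wfk                      # Eq. (8): no attribute knowledge
E_cond = Wfv.T @ np.diag(Wfd @ x) @ Wfk     # Eq. (9): gated by attribute x

def blog_post_feature(word_ids, E):
    """Sum of word vectors projected to unit norm, ready to augment tf-idf."""
    v = E[word_ids].sum(axis=0)
    return v / np.linalg.norm(v)

feat = blog_post_feature([1, 5, 5, 17], E_cond)
```

Swapping E_cond for E_uncond in the feature function is the only difference between the two settings being compared.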
Figure 3b illustrates the results, from which we observe that conditioned word embeddings are significantly more discriminative than word embeddings computed without knowledge of attribute vectors (here, unconditioned and conditioned embeddings refer to (Wf v)\u22a4Wf k and (Wf v)\u22a4 \u00b7 diag(Wf dx) \u00b7 Wf k, respectively).\n\n2 The log-bilinear model has no concept of attributes.\n\n[Figure 3b plots improvement over the initial model, and Figure 3c the inferred-attributes difference, both against the number of documents (thousands); see the Figure 3 caption.]\n\nTable 5: Results from a conditional word similarity task using Blog attributes (left) and language vectors (right).\n\nQuery (A, B) | Common | Unique to A | Unique to B\nschool (A: f/10/student, B: m/20/tech) | work, church, college | choir, prom, skool | therapy, tech, job\njournal (A: f/10/student, B: m/30/adv.) | diary, blog, webpage | project, book, yearbook | zine, app, referral\ncreate (A: f/30/arts, B: f/30/internet) | build, develop, maintain | provide, acquire, generate | compile, follow, analyse\njoy (A: m/30/religion, B: m/20/science) | happiness, sadness, pain | rapture, god, heartbreak | delight, comfort, soul\ncool (A: m/10/student, B: f/10/student) | nice, funny, awesome | beautiful, amazing, neat | sexy, hott, lame\n\nEnglish | French | German\njanuary, june, october | janvier, decembre, juin | januar, dezember, juni\nmarket, markets, internal | marche, marches, interne | markt, binnenmarktes, marktes\nwar, weapons, global | guerre, terrorisme, mondaile | krieg, globale, krieges\nsaid, stated, told | dit, disait, declare | sagte, gesagt, sagten\ntwo, two-thirds, both | deux, deuxieme, seconde | zwei, beiden, zweier\n\nFor the second experiment, we determine the effect of inferring attribute vectors at test time if they are not assumed to be available. To do this, we train a logistic regression classifier within each fold for predicting attributes. 
We compute an inferred vector by averaging each of the attribute vectors weighted by the log-probabilities of the classifier. In Fig. 3c we plot the difference in performance when an inferred vector is augmented vs. when it is not. These results show consistent, albeit small, improvement gains when attribute vectors are inferred at test time.\nTo get a better sense of the attribute features learned from the model, the supplementary material contains a t-SNE embedding of the learned attribute vectors. Interestingly, the model learns features which largely isolate the vectors of all teenage bloggers independent of gender and topic.\n\n3.4 Conditional word similarity\n\nOne of the key properties of our tensor formulation is the notion of conditional word similarity, namely how neighbours of word representations change depending on the attributes that are conditioned on. In order to explore the effects of this, we performed two qualitative comparisons: one using blog attribute vectors and the other with language vectors. These results are illustrated in Table 5. For the first comparison on the left, we chose two attributes from the blog corpus and a query word. We identify each of these attribute pairs as A and B. Next, we computed a ranked list of the nearest neighbours (by cosine similarity) of words conditioned on each attribute and identified the top 15 words in each. Out of these 15 words, we display the top 3 words which are common to both ranked lists, as well as 3 words that are unique to a specific attribute. Our results illustrate that the model can capture distinctive notions of word similarities depending on which attributes are being conditioned on. On the right of Table 5, we chose a query word in English (italicized) and computed the nearest neighbours when conditioned on each language vector. This results in neighbours that are either direct translations of the query word or words that are semantically similar. 
The supplementary material includes additional examples with nearest neighbours of collocations.

4 Conclusion

There are several directions in which this work can be extended. One application of interest is learning representations of authors from the papers they choose to review, as a way of improving automated reviewer-paper matching [25]. Since authors contribute to different research topics, it might be more useful to instead consider a mixture of attribute vectors, which can allow for distinctive representations of the same author across research areas. Another interesting application is learning representations of graphs. Recently, [26] proposed an approach for learning embeddings of nodes in social networks. Introducing network indicator vectors could potentially allow us to learn representations of full graphs. Finally, it would be interesting to train a multiplicative neural language model simultaneously across dozens of languages.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments and suggestions. This work was supported by NSERC, Google, Samsung, and ONR Grant N00014-14-1-0232.

References

[1] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, pages 160-167, 2008.

[2] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: a simple and general method for semi-supervised learning. In ACL, pages 384-394, 2010.

[3] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pages 1631-1642, 2013.

[4] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality.
In NIPS, pages 3111-3119, 2013.

[5] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. In ACL, 2014.

[6] Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. ICML, 2014.

[7] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, pages 2787-2795, 2013.

[8] Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning with neural tensor networks for knowledge base completion. In NIPS, pages 926-934, 2013.

[9] Yann N Dauphin, Gokhan Tur, Dilek Hakkani-Tur, and Larry Heck. Zero-shot learning for semantic utterance classification. ICLR, 2014.

[10] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. NIPS, 2013.

[11] Graham W Taylor and Geoffrey E Hinton. Factored conditional restricted Boltzmann machines for modeling motion style. In ICML, pages 1025-1032, 2009.

[12] Ryan Kiros, Richard S Zemel, and Ruslan Salakhutdinov. Multimodal neural language models. ICML, 2014.

[13] Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In ICML, pages 1017-1024, 2011.

[14] Andriy Mnih and Geoffrey Hinton. Three new graphical models for statistical language modelling. In ICML, pages 641-648, 2007.

[15] Roland Memisevic and Geoffrey Hinton. Unsupervised learning of image transformations. In CVPR, pages 1-8, 2007.

[16] Marc'Aurelio Ranzato, Alex Krizhevsky, and Geoffrey E Hinton. Factored 3-way restricted Boltzmann machines for modeling natural images. In AISTATS, pages 621-628, 2010.

[17] Richard Socher, Jeffrey Pennington, Eric H Huang, Andrew Y Ng, and Christopher D Manning.
Semi-supervised recursive autoencoders for predicting sentiment distributions. In EMNLP, pages 151-161, 2011.

[18] Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. Semantic compositionality through recursive matrix-vector spaces. In EMNLP, pages 1201-1211, 2012.

[19] Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. Inducing crosslingual distributed representations of words. In COLING, pages 1459-1474, 2012.

[20] Sarath Chandar A P, Stanislas Lauly, Hugo Larochelle, Mitesh M Khapra, Balaraman Ravindran, Vikas Raykar, and Amrita Saha. An autoencoder approach to learning bilingual word representations. NIPS, 2014.

[21] Karl Moritz Hermann and Phil Blunsom. Multilingual distributed representations without word alignment. ICLR, 2014.

[22] Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, pages 115-124, 2005.

[23] Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79-86, 2005.

[24] Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James W Pennebaker. Effects of age and gender on blogging. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, volume 6, pages 199-205, 2006.

[25] Laurent Charlin, Richard S Zemel, and Craig Boutilier. A framework for optimizing paper matching. UAI, 2011.

[26] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. KDD, 2014.
", "award": [], "sourceid": 1233, "authors": [{"given_name": "Ryan", "family_name": "Kiros", "institution": "University of Toronto"}, {"given_name": "Richard", "family_name": "Zemel", "institution": "University of Toronto"}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": "University of Toronto"}]}