{"title": "Learning semantic similarity in a continuous space", "book": "Advances in Neural Information Processing Systems", "page_first": 986, "page_last": 997, "abstract": "We address the problem of learning semantic representation of questions to measure similarity between pairs as a continuous distance metric. Our work naturally extends Word Mover\u2019s Distance (WMD) [1] by representing text documents as normal distributions instead of bags of embedded words. Our learned metric measures the dissimilarity between two questions as the minimum amount of distance the intent (hidden representation) of one question needs to \"travel\" to match the intent of another question. We first learn to repeat, reformulate questions to infer intents as normal distributions with a deep generative model [2] (variational auto encoder). Semantic similarity between pairs is then learned discriminatively as an optimal transport distance metric (Wasserstein 2) with our novel variational siamese framework. Among known models that can read sentences individually, our proposed framework achieves competitive results on Quora duplicate questions dataset. Our work sheds light on how deep generative models can approximate distributions (semantic representations) to effectively measure semantic similarity with meaningful distance metrics from Information Theory.", "full_text": "Learning semantic similarity in a continuous space\n\nMichel Deudon\n\nmichel.deudon@polytechnique.edu\n\nEcole Polytechnique\n\nPalaiseau, France\n\nAbstract\n\nWe address the problem of learning semantic representation of questions to measure\nsimilarity between pairs as a continuous distance metric. Our work naturally\nextends Word Mover\u2019s Distance (WMD) [1] by representing text documents as\nnormal distributions instead of bags of embedded words. 
Our learned metric measures the dissimilarity between two questions as the minimum amount of distance the intent (hidden representation) of one question needs to "travel" to match the intent of another question. We first learn to repeat, reformulate questions to infer intents as normal distributions with a deep generative model [2] (variational auto encoder). Semantic similarity between pairs is then learned discriminatively as an optimal transport distance metric (Wasserstein 2) with our novel variational siamese framework. Among known models that can read sentences individually, our proposed framework achieves competitive results on the Quora duplicate questions dataset. Our work sheds light on how deep generative models can approximate distributions (semantic representations) to effectively measure semantic similarity with meaningful distance metrics from Information Theory.\n\n1 Introduction\n\nSemantics is the study of meaning in language, used for understanding human expressions. It deals with the relationship between signifiers and what they stand for, their denotation. Measuring semantic similarity between pairs of sentences is an important problem in Natural Language Processing (NLP), for conversation systems (chatbots, FAQ), knowledge deduplication [3] or image captioning evaluation metrics [4] for example. In this paper, we consider the problem of determining the degree of similarity between pairs of sentences. Without loss of generality, we consider the case where the prediction is binary (duplicate pair or not), a problem known as paraphrase detection or semantic question matching in NLP.\nA major breakthrough in the field of semantics was the work done in [5] and [6]. The authors proposed unsupervised methods to learn semantic representations of words (Word2vec [5], GloVe [6]).
These representations come in the form of vectors (typically of dimension 50-300) that capture relationships between words in a geometric space. For instance, let νm ∈ Rd denote the learned representation for word m. The following property holds: νking − νman + νwoman ≈ νqueen (linear substructure).\nIn [1], the authors proposed a novel similarity metric for text documents: Word Mover's Distance (WMD). WMD [1] measures the dissimilarity between two bags-of-vectors (embedded words) with an optimal transport distance metric, Wasserstein 1, also known as Earth Mover's Distance [7] [8] [9]. The proposed method is conceptually appealing and accurate for news categorization and clustering tasks. Yet, WMD experimentally struggles to finely capture the semantics of shorter pairs (sentences, questions) because of the bag-of-word assumption. Indeed, "why do cats like mice ?" and "why do mice like cats ?" are mapped to the same bag-of-vectors representation, and thus classified as duplicate. WMD does not capture the order in which words appear. Furthermore, because a computationally expensive discrete transportation problem must be solved for WMD, its use in conversation systems and scalable online knowledge platforms is limited.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nA solution to the computational cost of WMD is to map a sentence to a single vector (instead of a bag) and then measure similarity between pairs as a Euclidean distance or cosine similarity in the mapped vectorial space. This way, there is no need for a transportation scheme to transform one document into another. A simple approach to obtain a sentence representation from word embeddings is to compute the barycenter of the embedded words. Variants consider weighted sums (TF-IDF [10], Okapi BM-25 [11], SIF [12]) and top-k principal components removal [12] [13].
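A minimal sketch of the barycenter baseline just described, with made-up toy vectors (not the paper's GloVe embeddings), showing why bag-of-word averaging conflates word orders:

```python
# Illustrative sketch (not the paper's model): sentence representation as the
# barycenter of word embeddings, compared with cosine similarity.
# The 3-d "toy" vectors below are invented for the example.
from math import sqrt

def sentence_vector(words, embeddings):
    """Barycenter (component-wise mean) of the words' embedding vectors."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    return [sum(component) / len(vecs) for component in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

toy = {"why": [0.25, 0.25, 1.0], "do": [0.0, 0.25, 0.0],
       "cats": [1.0, 0.0, 0.25], "like": [0.0, 1.0, 0.0],
       "mice": [0.75, 0.25, 0.25]}

s1 = sentence_vector("why do cats like mice".split(), toy)
s2 = sentence_vector("why do mice like cats".split(), toy)
# Both word orders collapse to the same barycenter: cosine similarity ~ 1.0,
# so the two questions are indistinguishable under the bag-of-word assumption.
```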
However, none of these models finely capture semantics, again because of the bag-of-word assumption.\nConvolutional Neural Networks (CNN) [14] [15], Recurrent Neural Networks (RNN) [16] [17] and Self-Attentive Neural Networks [18] [19] have been successfully used to model sequences in a variety of NLP tasks (sentiment analysis, question answering and machine translation, to name a few). For semantic question matching (on the Quora dataset) and natural language inference (on the Stanford and Multi-Genre NLI datasets [20] [21]), the state-of-the-art approach, the Densely Interactive Inference Network (DIIN) [22], is a neural network that encodes sentences jointly rather than separately. Reading sentences separately makes the task of predicting similarity or entailment more difficult, but comes with a lower computational cost and higher throughput, since individual representations can be precomputed, for fast information retrieval for instance. Models that can read sentences separately are conceptually appealing since they come with individual representations of sentences for visualization (PCA, t-SNE, MDS) and transfer to downstream tasks. Siamese networks are a type of neural network that appeared in vision (face recognition [23]) and have recently been extensively studied to learn representations of sentences and predict similarity or entailment between pairs as an end-to-end differentiable task [3] [4] [24] [25].\nIn our work, we replace the discrete combinatorial problem of Word Mover's Distance (Wasserstein 1) [1] by a continuous one between normal distributions (intents), as illustrated in Figure 1. We first learn to repeat, reformulate with a deep generative model [2] (variational auto encoder) to learn and infer intents (step 1).
We then learn and measure semantic similarity between question pairs with our novel variational siamese framework (step 2), which differs from the original one ([3] [4] [23] [24] [25]) as our hidden representations consist of two Gaussian distributions instead of two vectors. For intents with diagonal covariance matrices, the expression of Wasserstein 2 is explicit and computationally cheap. Our approach is evaluated on the Quora duplicate questions dataset and performs strongly.\n\nFigure 1: Illustration of our learned latent space. Sentences are mapped to Gaussian distributions N(μ, σ) (intents). The plot was obtained with PCA on the sentences' mean vectors μ after training of our variational siamese network. The covariance matrices σ2 are diagonal.\n\nIn Section 2 we will provide background and context. Then, in Section 3 we will present our method to learn semantic representations of sentences. In Section 4 we will discuss how we measure similarity in our latent space. Finally, we will describe experiments and present results in Section 5 before concluding.\n\n2 Related work\n\n2.1 Embedding words\n\nAn important principle used to compute semantic representations of sentences is the principle of compositionality, which states that the meaning of a phrase is uniquely determined by the meaning of its parts and the rules that connect those parts. Word2Vec [5] and GloVe [6] are semantic embeddings of words based on their context and frequency of co-occurrence in a text corpus such as Wikipedia, Google News or Common-Crawl. The key idea behind Word2Vec and GloVe is that similar words appear in similar contexts. Word2Vec [5] learns semantic representations of words by learning to predict a word given its surrounding words (CBOW model) or vice versa (skip-gram model).
GloVe [6] operates on a global word-word co-occurrence matrix, learning a low-rank approximation of it. Word2Vec and GloVe vectors effectively capture semantic similarity at the word level. [26] and [27] proposed embedding words in hyperbolic spaces to account for hierarchies in natural language (Zipf's law). [28] proposed Word2Gauss to model uncertainty in words and naturally express asymmetries for similarity/entailment tasks, using KL divergence.\n\n2.2 Learning sentences' representations\n\nComposing word vectors allows us to encode sentences into fixed-size representations. The following presents how unsupervised neural autoencoders and variants can learn meaningful sentence representations, given a sequence of embedded words.\n\nAuto encoders and variants Neural autoencoders are models whose output units are identical to their input units [29]. An input s is compressed with a neural encoder into a representation z = qφ(s), which is then used to reconstruct s. The Sequential Denoising Autoencoder (SDAE) [30] introduces noise in the input to predict the original source sentence s given a corrupted version of it s'. The noise function independently deletes input words and swaps pairs with some probability. The model then uses a recurrent neural network (LSTM-based architecture [31]) to recover the original sentence from the corrupted version. The model can then be used to encode new sentences into vector representations. SDAEs are the top performer on paraphrase identification among unsupervised models [30]. Another approach is Skip Thought (ST) vectors [32], which adapt the skip-gram model for words to the sentence level by encoding a sentence to predict the sentences around it. Consequently, ST vectors require a sizeable training corpus of ordered sentences with a coherent narrative.
On the SICK sentence relatedness benchmark [33], FastSent, a bag-of-word variant of Skip Thought, performs best among unsupervised models [30].\n\nVariational auto encoders Rather than representing sentences with fixed points μ ∈ Rd in a d-dimensional space, an alternative is to represent them with normal distributions μ, σ ∈ Rd, as shown in Figure 1. Intuitively, the variance term accounts for uncertainty or ambiguity, a desirable property for modeling language. Think of the sentences "He fed her cat food." or "Look at the dog with one eye." for example. Furthermore, thanks to the variance term, the learned representations smoothly "fill" the semantic space and generalize better. Normal representations of sentences can be learned with Variational Auto Encoders (VAE), a class of deep generative models first proposed in [34] [35]. Similar to autoencoders, VAEs learn to reconstruct their original inputs from a latent code z. However, instead of computing deterministic representations z = qφ(s), the VAE learns, for any input s, a posterior distribution qφ(z|s) over latent codes z. In terms of semantics, the latent distributions grant the ability to sum over all the possibilities (different meanings, different interpretations). Formally, the VAE consists of an encoder φ and a decoder θ. The VAE's encoder parameterizes the posterior distribution qφ(z|s) over the latent code z ∈ Rh, given an input s. The posterior qφ(z|s) is usually assumed to be a Gaussian distribution:\n\nz ∼ N(μ(s), σ(s))    (1)\n\nwhere the functions μ(s), σ(s) are nonlinear transformations of the input s. The VAE's decoder parameterizes another distribution pθ(s|z) that takes as input a random latent code z and produces an observation s.
The parameters defining the VAE are learned by maximizing the following lower bound on the model evidence p(s|θ, φ):\n\nLθ;φ(s) = Eqφ(z|s)[log pθ(s|z)] − KL(qφ(z|s)||p(z))    (2)\n\nwhere p(z) is a prior on the latent codes, typically a standard normal distribution N(0, I).\nTo encode sentences into fixed-size representations, [36] proposed a VAE conditioned on a bag-of-word description of the text input. [2] employed long short-term memory (LSTM) networks [31] to read (encoder φ) and generate (decoder θ) sentences sequentially. The authors [2] proposed KL annealing and dropout of the decoder's inputs during training to circumvent problems encountered when using the standard LSTM-VAE for the task of modeling text data.\n\n2.3 Learning similarity metrics\n\nSiamese networks can learn similarity metrics discriminatively, as suggested in [23]. They consist of two encoders (sharing the same weights) that read pairs of inputs separately into fixed-size representations (vectors). The concatenation [u, v] [4] and/or the Hadamard product and squared difference of the hidden vectors [uv, |u − v|2] [24] is then used as the input layer of a Multi-Layer Perceptron that learns and predicts similarity or entailment for the corresponding pair. Siamese LSTM [3] and Siamese GRU networks [25] operate on sentences and achieve good results on paraphrase detection and question matching. Our proposed variational siamese network extends standard siamese networks to handle Gaussian hidden representations. It replaces the discrete transportation problem of Word Mover's Distance [1] (Wasserstein 1) by a continuous one (learning Wasserstein 2).\n\n3 Learning to repeat, reformulate\n\n3.1 Inferring and learning intents\n\nOur intuition is that behind semantically equivalent sentences, the intent is identical.
We consider that two duplicate questions or paraphrases were generated from the same latent intent (hidden variable). We thus take the VAE approach to learn and infer semantic representations of sentences, but instead of only recovering sentence s from s (repeat), we also let our model learn to recover semantically equivalent sentences s' from s (reformulate), as illustrated in Figure 2. This first task (repeat, reformulate) provides a sound (meaningful) initialization for our encoder's parameters φ (see Figure 2). We then learn semantic similarity discriminatively as an optimal transport metric in the intent's latent space (covered in Section 4). We did not consider training our model jointly on both tasks (generative and discriminative), as studied in [37]. We suspect the normal prior N(0, I) on the hidden intents in equations (2) and (4) to conflict with our final objective of predicting similarity or entailment (blurry boundaries). Our repeat, reformulate framework is in the spirit of the work proposed in [38], where the authors employ a RNN-VAE and a dataset of sequence-outcome pairs to continuously revise discrete sequences, for instance to improve sentence positivity. [39] proposed learning paraphrastic sentence embeddings based on supervision from the Paraphrase Database [40] and a margin-based loss with negative samples. Our approach differs from [39] as our hidden intents are Gaussians.\n\n3.2 Neural architecture\n\nLet s, s' be a pair of semantically equivalent sentences (same content), represented as two sequences of word vectors, s = w1, w2, ...w|s|, wi ∈ Rd. Our goal is to predict s' given s. A sentence s is always semantically equivalent to itself. This means we can include any sentence s in the learning process (semi-supervised learning). Following [2], our encoder φ consists of a single-layer bi-LSTM [31] network that encodes s in a fixed-length vector c(s) ∈ R2h.
We linearly transform c(s) into a hidden mean vector μ(s) ∈ Rh and a hidden log diagonal covariance matrix log(σ(s)2) ∈ Rh that define a Gaussian distribution N(μ(s), σ(s)). Using the reparameterization trick (for backpropagation of the gradients), we sample intents z from the posterior:\n\nz ∼ μ(s) + σ(s)N(0, I).    (3)\n\nFigure 2: Our Bayesian framework and neural architecture to learn semantic representations of sentences, similar to [2].\n\nSimilar to [2], we then generate s' with a single-layer LSTM [31] decoder network θ conditioned on a sampled intent z ∈ Rh. The true words of s' are sequentially fed to the LSTM decoder during training (teacher forcing). As in [2], we employ word dropout for the decoder with a 60% keep rate. Our model learns semantic representations of sentences N(μ(s), σ(s)) by learning to paraphrase. The parameters of our encoder-decoder architecture are learned by minimizing the following regularized loss by stochastic gradient descent:\n\n−Lθ;φ(s, s') = −Eqφ(z|s)[log pθ(s'|z)] + κ KL(qφ(z|s)||N(0, I))    (4)\n\nwhere κ = sigmoid(0.002(step − 2500)) is a sigmoid annealing scheme, as suggested in [2]. The first term encourages sampled intents to encode useful information to repeat or reformulate questions. The second term enforces the hidden distributions (intents) to match a prior (a standard normal) to fill the semantic space (σ > 0) and smoothly measure similarity as an optimal transport distance metric.\n\nFigure 3: Results from our generative pretraining: repeat (dark blue), repeat, reformulate (orange). Left: Per-word cross entropy.
Right: KL divergence KL(qφ(z|s)||N(0, I)).\n\n4 Variational siamese network\n\n4.1 Learning semantic similarity in a latent continuous space\n\nWe train our model successively on two tasks (generative and discriminative) and use the first step as a smooth initialization (N(0, I) prior and semi-supervised setting) for the second one, learning similarity. Our work differs from [3] [4] [23] [24] [25] as our encoded states (intents z) are multivariate Gaussians with diagonal covariance matrices N(μ, σ). This allows us to smoothly measure semantic similarity with different distance metrics or divergences from Information Theory, as in [28] [41]. The computation of Wasserstein 2 (W2²) and the Mahalanobis distance (DM) is efficient for two Gaussians with diagonal covariance matrices p1 = N(μ1, σ1) and p2 = N(μ2, σ2) on Rh:\n\nW2²(p1, p2) = Σ_{i=1..h} (μ1^i − μ2^i)² + (σ1^i − σ2^i)²    (5)\n\nDM²(p1, p2) = ½ Σ_{i=1..h} (1/(σ1^i)² + 1/(σ2^i)²)(μ1^i − μ2^i)²    (6)\n\nTo learn and measure semantic similarity with our variational siamese network, we express the previous metrics "element-wise" and feed the resulting tensor as input to a Multi-Layer Perceptron ψ that predicts the degree of similarity of the corresponding pair. By "element-wise" Wasserstein-2 tensor, we mean the tensor of dimension h whose ith element is computed as (μ1^i − μ2^i)² + (σ1^i − σ2^i)² ∈ R, for a pair of vectors (μ1, σ1), (μ2, σ2). The Wasserstein distance (scalar) is obtained by summing the components of this tensor.
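A minimal numerical sketch of equations (5) and (6) and of the element-wise Wasserstein-2 tensor (toy 2-dimensional intents; in the paper the encoder produces h = 1000 dimensions):

```python
# Wasserstein-2 and Mahalanobis distances between two diagonal Gaussians
# N(mu1, sigma1) and N(mu2, sigma2), per equations (5)-(6). The "element-wise"
# tensor holds the per-dimension terms; summing them gives the scalar distance.

def w2_tensor(mu1, sigma1, mu2, sigma2):
    return [(m1 - m2) ** 2 + (s1 - s2) ** 2
            for m1, s1, m2, s2 in zip(mu1, sigma1, mu2, sigma2)]

def wasserstein2_squared(mu1, sigma1, mu2, sigma2):
    return sum(w2_tensor(mu1, sigma1, mu2, sigma2))

def mahalanobis_squared(mu1, sigma1, mu2, sigma2):
    return 0.5 * sum((1.0 / s1 ** 2 + 1.0 / s2 ** 2) * (m1 - m2) ** 2
                     for m1, s1, m2, s2 in zip(mu1, sigma1, mu2, sigma2))

mu1, sigma1 = [0.0, 1.0], [1.0, 0.5]
mu2, sigma2 = [1.0, 1.0], [1.0, 1.0]
print(w2_tensor(mu1, sigma1, mu2, sigma2))             # [1.0, 0.25]
print(wasserstein2_squared(mu1, sigma1, mu2, sigma2))  # 1.25
print(mahalanobis_squared(mu1, sigma1, mu2, sigma2))   # 1.0
```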
Our neural architecture is illustrated in Figure 4.\n\nFigure 4: Our variational siamese network to measure and learn semantic similarity.\n\n4.2 Neural architecture\n\nLet s1, s2 be a pair of sentences and y a label indicating their degree of similarity (or entailment). Denote by (μ1, σ1), (μ2, σ2) the inferred pair of latent intents. Our goal is to predict y given (μ1, σ1), (μ2, σ2). Our pair of latent intents is used to compute the element-wise Wasserstein-2 tensor (μ1 − μ2)² + (σ1 − σ2)² ∈ Rh and the Hadamard product μ1μ2 ∈ Rh. The concatenation of the Wasserstein-2 and Hadamard tensors is fed to a two-layer Multi-Layer Perceptron ψ with ReLU activations (inner layer) and a softmax output layer for classification of the sentence pair (s1, s2) as duplicate or not, as shown in Figure 4. Our variational siamese network learns to detect paraphrases by minimizing the following regularized loss by stochastic gradient descent:\n\nLψ;φ(s1, s2) = −y log pψ(y|qφ(.|s1), qφ(.|s2)) + λ||ψ||1    (7)\n\nThe first term is the cross entropy of the prediction and the second term is an L1 regularization to encourage sparsity of our MLP's weights (λ = 0.00001). There is no N(0, I) prior on intents during the training of our variational siamese network (VAR-siam), as observed in Figure 3.\n\n5 Experiments and results\n\n5.1 Experimental details\n\nWe implemented our model using python 3.5.4, tensorflow 1.3.0 [42], gensim 3.0.1 [43] and nltk 3.2.4 [44].
We evaluated our proposed framework on the Quora question pairs dataset, which consists of 404k sentence pairs annotated with a binary value that indicates whether a pair is duplicate (same intent) or not.1\n\n1https://www.kaggle.com/quora/question-pairs-dataset\n\nData preprocessing We convert sentences to lower case, refit dashes for single words, and space punctuation, currencies, arithmetic operations and any other non-alphanumeric symbols. We tokenize sentences using the nltk tokenizer [44]. We remove from our vocabulary words that appear at most once, resulting in a total of 48096 non-unique words. Unknown words are replaced by the token 'UNK'. We pad our sentences to 40 tokens (words).\n\nNeural architecture and training We embed words in a 300-dimensional space using GloVe vectors [6] pretrained on Wikipedia 2014 and Gigaword 5 as initialization. Our encoder and decoder share the same embedding layer. Our variational space (μ, σ) is of dimension h = 1000. Our bi-LSTM encoder network consists of a single layer of 2h neurons and our LSTM [31] decoder has a single layer with 1000 neurons. Our MLP's inner layer has 1000 neurons. All weights were randomly initialized with the "Xavier" initializer [45].\nWe successively train our model on reformulating (for 5 epochs) and detecting paraphrases (3 epochs). The first task provides an initialization for the second one (embedding and encoding layers). For both tasks, we employ stochastic gradient descent with the ADAM optimizer [46] (lr = 0.001, β1 = 0.9, β2 = 0.999) and batches of size 256 and 128. Our learning rate is initialized for both tasks to 0.001, decayed every 5000 steps by a factor 0.96 with an exponential scheme.
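One plausible reading of the schedule above (a sketch, not the released training code) for the learning-rate decay and the sigmoid KL-annealing weight κ of equation (4):

```python
# Sketch of the training schedule: learning rate 0.001 decayed by a factor
# 0.96 every 5000 steps (continuous exponential scheme, one reading of the
# text), plus the KL-annealing weight kappa = sigmoid(0.002 * (step - 2500)).
import math

def learning_rate(step, base_lr=0.001, decay=0.96, every=5000):
    return base_lr * decay ** (step / every)

def kl_weight(step):
    return 1.0 / (1.0 + math.exp(-0.002 * (step - 2500)))

print(learning_rate(0))      # 0.001
print(learning_rate(5000))   # ~0.00096
print(kl_weight(2500))       # 0.5 (annealing is halfway at step 2500)
```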
We clip the L2 norm of our gradients to 1.0 to avoid exploding gradients in deep neural networks.\n\n5.2 Semantic question matching\n\nThe task of semantic question matching is to predict whether or not question pairs are duplicate. Table 1 compares different models for this task on the Quora dataset. Among models that represent the meaning of text pairs independently (siamese networks), our model performs strongly. More surprisingly, our model is competitive with state-of-the-art models that read sentences together before reducing them to vectors, such as Bilateral Multi-Perspective Matching (BiMPM) [47], pt-DECATT [48] and the Densely Interactive Inference Network (DIIN) [22].\n\nTable 1: Quora question pairs dataset results. The split considered is that of BiMPM [47]. Models are sorted by decreasing test accuracy. Results with † are reported in [22] and ‡ in [25]. SWEM [49] stands for Simple Word Embedding based Models.\n\n| Model | Reads pairs separately | Generative (pre)training | Dev accuracy | Test accuracy |\n| DIIN [22] † | False | False | 89.44 | 89.06 |\n| VAR-Siamese (with repeat/reformulate) | True | True | 89.05 | 88.86 |\n| pt-DECATTchar [48] † | False | False | 88.89 | 88.40 |\n| VAR-Siamese (with repeat) | True | True | 88.18 | 88.24 |\n| BiMPM [47] † | False | False | 88.69 | 88.17 |\n| pt-DECATTword [48] † | False | False | 88.44 | 87.54 |\n| L.D.C † | False | False | - | 85.55 |\n| Siamese-GRU Augmented [25] ‡ | True | False | - | 85.24 |\n| Multi-Perspective-LSTM † | False | False | - | 83.21 |\n| SWEM-concat [49] | True | False | - | 83.03 |\n| SWEM-aver [49] | True | False | - | 82.68 |\n| Siamese-LSTM † | True | False | - | 82.58 |\n| SWEM-max [49] | True | False | - | 82.20 |\n| Multi-Perspective CNN † | False | False | - | 81.38 |\n| DeConv-LVM [37] | True | True | - | 80.40 |\n| Siamese-CNN † | True | False | - | 79.60 |\n| VAR-Siamese (w/o pretraining) | True | False | 62.16 | 62.48 |\n| Bias in dataset (baseline) | - | - | 62.16 | 62.48 |\n\nWe considered two pretraining strategies: repeat (standard VAE) and repeat, reformulate
(cf. Section 3). We achieved strong results with both settings (as shown in Table 1). Without a generative pretraining task, our variational siamese quickly classifies all pairs as non-duplicate. Learning semantic similarity is a difficult task and our variational siamese network fails with randomly initialized intents. Generative pretraining helps our variational siamese network learn semantic representations and semantic similarity.\n\n5.3 Question retrieval\n\nSiamese networks can read a text in isolation and produce a vector representation for it. This allows us to encode once per query and compare the latent code to a set of references for question retrieval. For a given query, our model runs 3500+ comparisons per second on two Tesla K80.\nWhen we combine our variational siamese network with Latent Dirichlet Allocation [50] to filter out topics, our approach is relevant and scalable for question retrieval in databases with a million entries. Related questions are retrieved by filtering topics (different topic implies different meaning) and are ranked for relevance with our variational siamese network. We report some human queries asked to our model on the Quora question pairs dataset in Table 2.\n\nTable 2: Question retrieval (best match, confidence) with our proposed model on the Quora question pairs dataset (537 088 unique questions). All queries were retrieved in less than a second on two Tesla K80 GPUs.\n\n| Query | Retrieved question (1/537088) | MLP's output (confidence) |\n| What would happen if I travel with a speed of light? | Hi! If I run the speed of the light, what would the world look like? | 99.68% |\n| How can I make good gazpacho ? | How do you do a good gazpacho? | 99.52% |\n| What can I do to save our planet? | How can I save the planet? | 95.91% |\n| What is the difference between a cat? | What is the benefit of a cat? | 90.99% |\n| Can we trust data? | Can I trust google drive with personal data? | 57.13% |\n\n5.4 Discussion\n\nFigure 5: Empirical distribution of the Wasserstein 2 distance between pairs of intents for duplicate (green) and not duplicate (red) pairs, after variational siamese training (with repeat, reformulate pretraining). Left: Quora Dev set. Right: Quora Test set.\n\nAs shown in Figure 5, we measured the Wasserstein 2 distance on the Quora dev and test sets using different encodings (sentence representations) and report AUC scores in Table 3. Our baseline is Word Mover's Distance [1], which considers documents as bags-of-words. We noticed that the standard VAE (repeat) underperforms the bag-of-word model. The unsupervised task of repeating sentences allows the model to encode information, but not in a specific way (intent vs. style [38], e.g.). Our proposed VAE (repeat, reformulate) performs better than our baseline. Our best results (first 2 lines in Table 3) were obtained when further retraining a VAE discriminatively with our variational siamese network to learn a similarity metric in a latent space.\nIntuitively, our model learns, with the Wasserstein-2 tensor, how to optimally transform a continuous distribution (intent) into another one to measure semantic similarity.
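The AUC comparison just discussed can be reproduced in miniature: score each pair by the negative of its distance (smaller distance should mean duplicate) and compute the probability that a random duplicate pair outranks a random non-duplicate pair. All numbers below are toy values, not the paper's results:

```python
# Illustrative sketch (not the paper's evaluation code): AUC of a distance
# used as a duplicate detector, via the rank statistic
# AUC = P(score of a duplicate pair > score of a non-duplicate pair).

def auc(scores_pos, scores_neg):
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Toy Wasserstein-2 distances: duplicates tend to be closer, so we negate.
dup = [-d for d in [0.2, 0.4, 0.5]]   # negated distances of duplicate pairs
non = [-d for d in [0.9, 0.6, 0.3]]   # negated distances of non-duplicates
print(auc(dup, non))  # 7 of the 9 pairwise comparisons are won: 7/9
```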
We considered learning and measuring similarity with other metrics from Information Theory (Mahalanobis distance and Jensen-Shannon divergence) as the input layer for our Multi-Layer Perceptron. All models exceeded 87.5% dev accuracy on the Quora dataset (same split as in Table 1). As in [41], which addresses semantic image segmentation, we found Wasserstein-2 to be the most effective approach. Adding the Hadamard product to the MLP's input layer improves predictions by a marginal 1%. When trained with Wasserstein-2, the convergence is consistent across all tested similarity metrics (Mahalanobis, Jensen-Shannon, Wasserstein 2, Euclidean, Cosine). This gives us hints about the informative nature of our learned latent semantic space.\n\nTable 3: Discriminative power of Wasserstein 1 and 2 on the Quora dataset.\n\n| Encoding | Wasserstein | AUC (Dev set) | AUC (Test set) | Training labels |\n| VAR-siamese (with repeat) | W2 | 87.11 | 86.74 | Positive & Negative |\n| VAR-siamese (with repeat/reformulate) | W2 | 86.88 | 86.27 | Positive & Negative |\n| VAE (repeat/reformulate) | W2 | 77.70 | 77.44 | Positive |\n| Word Mover's Distance [1] (bag-of-word) | W1 | 73.47 | 73.10 | None |\n| VAE (repeat) | W2 | 70.67 | 69.91 | None |\n\n6 Conclusion\n\nInformation Theory provides a sound framework to study semantics. Instead of capturing the relationships among multiple words and phrases in a single vector, we decompose the representation of a sentence into a mean vector and a diagonal covariance matrix to account for uncertainty and ambiguity in language. By learning to repeat, reformulate with our generative framework, this factorization encodes semantic information in an explicit sentence representation for various downstream applications. Our novel approach to measure semantic similarity between pairs of sentences is based on these continuous probabilistic representations. Our variational siamese network extends Word Mover's Distance [1] to continuous representations of sentences.
Our approach performs strongly on the Quora question pairs dataset and experiments show its effectiveness for question retrieval in knowledge databases with a million entries. Our code is made publicly available on github.2\nIn future work, we plan to train and test our model on other datasets, such as PARANMT-50M [51], a dataset of more than 50M English paraphrase pairs, and further propose a similar framework (variational siamese) for the task of natural language inference [20] [21] to predict if a sentence entails or contradicts another one. A possible generative framework, similar to repeat, reformulate, would be repeat, drift. Learning Gaussian representations with non-diagonal covariance matrices is also a direction that could be investigated to capture complex patterns in natural language.\n\n2https://github.com/MichelDeudon/variational-siamese-network\n\nAcknowledgments\n\nWe would like to thank Ecole Polytechnique for financial support and Télécom Paris-Tech for GPU resources. We are grateful to Professor Chloé Clavel, Professor Gabriel Peyré, Constance Noziere and Paul Bertin for their critical reading of the paper. Special thanks go to Magdalena Fuentes for helping to run the code on Telecom's clusters. We also thank Professor Francis Bach and Professor Guillaume Obozinski for their insightful course on probabilistic graphical models at ENS Cachan, as well as Professor Michalis Vazirgiannis for his course on Text Mining and NLP at Ecole Polytechnique. We also thank the reviewers for their valuable comments and feedback.\n\nReferences\n\n[1] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances.
In International Conference on Machine Learning (ICML), pages 957–966, 2015.

[2] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL), pages 10–21, 2016.

[3] Elkhan Dadashov, Sukolsak Sakshuwong, and Katherine Yu. Quora question duplication. Stanford CS224n report.

[4] Adrian Sanborn and Jacek Skryzalin. Deep learning for semantic similarity. CS224d report, 2015.

[5] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In International Conference on Learning Representations (ICLR) Workshop, 2013.

[6] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[7] Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l'Académie Royale des Sciences de Paris, 1781.

[8] L. Kantorovich. On the transfer of masses (in Russian). In Doklady Akademii Nauk, volume 37, pages 227–229, 1942.

[9] Cédric Villani. Topics in Optimal Transportation (Graduate Studies in Mathematics, vol. 58). 2003.

[10] Karen Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.

[11] M. Beaulieu, M. Gatford, Xiangji Huang, S. Robertson, S. Walker, and P. Williams. Okapi at TREC-5. NIST Special Publication SP, pages 143–166, 1997.

[12] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations (ICLR), 2017.

[13] Jiaqi Mu, Suma Bhat, and Pramod Viswanath.
All-but-the-top: simple and effective postprocessing for word representations. In International Conference on Learning Representations (ICLR), 2018.

[14] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, 2014.

[15] Mingbo Ma, Liang Huang, Bing Xiang, and Bowen Zhou. Dependency-based convolutional neural networks for sentence embedding. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1106–1115, 2015.

[16] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112, 2014.

[17] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), 2015.

[18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pages 6000–6010, 2017.

[19] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. In International Conference on Learning Representations (ICLR), 2017.

[20] Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 632–642, 2015.

[21] Adina Williams, Nikita Nangia, and Samuel R Bowman.
A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT, 2018.

[22] Yichen Gong, Heng Luo, and Jian Zhang. Natural language inference over interaction space. In International Conference on Learning Representations (ICLR), 2018.

[23] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 539–546, 2005.

[24] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 670–680, 2017.

[25] Yushi Homma, Stuart Sy, and Christopher Yeh. Detecting duplicate questions with deep learning. Stanford CS224n report.

[26] Maximilian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems (NIPS), pages 6341–6350, 2017.

[27] Bhuwan Dhingra, Christopher J Shallue, Mohammad Norouzi, Andrew M Dai, and George E Dahl. Embedding text in hyperbolic spaces. In Proceedings of NAACL-HLT, page 59, 2018.

[28] Luke Vilnis and Andrew McCallum. Word representations via Gaussian embedding. In International Conference on Learning Representations (ICLR), 2015.

[29] Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1106–1115, 2015.

[30] Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data.
In Proceedings of NAACL-HLT, pages 1367–1377, 2016.

[31] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[32] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems (NIPS), pages 3294–3302, 2015.

[33] Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. A SICK cure for the evaluation of compositional distributional semantic models. In LREC, pages 216–223, 2014.

[34] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.

[35] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2), pages 1278–1286, 2014.

[36] Yishu Miao, Lei Yu, and Phil Blunsom. Neural variational inference for text processing. In International Conference on Machine Learning (ICML), pages 1727–1736, 2016.

[37] Dinghan Shen, Yizhe Zhang, Ricardo Henao, Qinliang Su, and Lawrence Carin. Deconvolutional latent-variable model for text sequence matching. In the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 2018.

[38] Jonas Mueller, David Gifford, and Tommi Jaakkola. Sequence to better sequence: continuous revision of combinatorial structures. In International Conference on Machine Learning (ICML), pages 2536–2544, 2017.

[39] John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. Towards universal paraphrastic sentence embeddings. In International Conference on Learning Representations (ICLR), 2016.

[40] Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch.
PPDB: The Paraphrase Database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 758–764, 2013.

[41] Jaume Vergés-Llahí and Alberto Sanfeliu. Evaluation of distances between color image segmentations. Pattern Recognition and Image Analysis, pages 13–25, 2005.

[42] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[43] Radim Řehůřek and Petr Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.

[44] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., 2009.

[45] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[46] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[47] Zhiguo Wang, Wael Hamza, and Radu Florian. Bilateral multi-perspective matching for natural language sentences. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), pages 4144–4150, 2017.

[48] Gaurav Singh Tomar, Thyago Duque, Oscar Täckström, Jakob Uszkoreit, and Dipanjan Das. Neural paraphrase identification of questions with noisy pretraining.
In Proceedings of the 1st Workshop on Subword and Character Level Models in NLP, pages 142–147, 2017.

[49] Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao, and Lawrence Carin. Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018.

[50] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[51] John Wieting and Kevin Gimpel. PARANMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 451–462, 2018.