{"title": "Convolutional Neural Network Architectures for Matching Natural Language Sentences", "book": "Advances in Neural Information Processing Systems", "page_first": 2042, "page_last": 2050, "abstract": "Semantic matching is of central importance to many natural language tasks [2, 28]. A successful matching algorithm needs to adequately model the internal structures of language objects and the interaction between them. As a step toward this goal, we propose convolutional neural network models for matching two sentences, by adapting the convolutional strategy in vision and speech. The proposed models not only nicely represent the hierarchical structures of sentences with their layer-by-layer composition and pooling, but also capture the rich matching patterns at different levels. Our models are rather generic, requiring no prior knowledge on language, and can hence be applied to matching tasks of different nature and in different languages. The empirical study on a variety of matching tasks demonstrates the efficacy of the proposed models and their superiority to competitor models.", "full_text": "Convolutional Neural Network Architectures for Matching Natural Language Sentences\n\nBaotian Hu\u00a7\u2217, Zhengdong Lu\u2020, Hang Li\u2020, Qingcai Chen\u00a7\n\n\u00a7Department of Computer Science & Technology, Harbin Institute of Technology Shenzhen Graduate School, Xili, China\nbaotianchina@gmail.com, qingcai.chen@hitsz.edu.cn\n\n\u2020Noah\u2019s Ark Lab, Huawei Technologies Co. Ltd., Sha Tin, Hong Kong\nlu.zhengdong@huawei.com, hangli.hl@huawei.com\n\nAbstract\n\nSemantic matching is of central importance to many natural language tasks [2, 28]. A successful matching algorithm needs to adequately model the internal structures of language objects and the interaction between them. 
As a step toward this goal, we propose convolutional neural network models for matching two sentences, by adapting the convolutional strategy in vision and speech. The proposed models not only nicely represent the hierarchical structures of sentences with their layer-by-layer composition and pooling, but also capture the rich matching patterns at different levels. Our models are rather generic, requiring no prior knowledge on language, and can hence be applied to matching tasks of different nature and in different languages. The empirical study on a variety of matching tasks demonstrates the efficacy of the proposed models and their superiority to competitor models.\n\n1 Introduction\n\nMatching two potentially heterogeneous language objects is central to many natural language applications [28, 2]. It generalizes the conventional notion of similarity (e.g., in paraphrase identification [19]) or relevance (e.g., in information retrieval [27]), since it aims to model the correspondence between \u201clinguistic objects\u201d of different nature at different levels of abstraction. Examples include top-k re-ranking in machine translation (e.g., comparing the meanings of a French sentence and an English sentence [5]) and dialogue (e.g., evaluating the appropriateness of a response to a given utterance [26]).\n\nNatural language sentences have complicated structures, both sequential and hierarchical, that are essential for understanding them. A successful sentence-matching algorithm therefore needs to capture not only the internal structures of sentences but also the rich patterns in their interactions. Towards this end, we propose deep neural network models, which adapt the convolutional strategy (proven successful on image [11] and speech [1]) to natural language. 
To further explore the relation between representing sentences and matching them, we devise a novel model that can naturally host both the hierarchical composition for sentences and the simple-to-comprehensive fusion of matching patterns with the same convolutional architecture. Our model is generic, requiring no prior knowledge of natural language (e.g., parse tree) and putting essentially no constraints on the matching tasks. This is part of our continuing effort\u00b9 in understanding natural language objects and the matching between them [13, 26].\n\n\u2217 The work was done when the first author worked as an intern at Noah\u2019s Ark Lab, Huawei Technologies.\n\u00b9 Our project page: http://www.noahlab.com.hk/technology/Learning2Match.html\n\nOur main contributions can be summarized as follows. First, we devise novel deep convolutional network architectures that can naturally combine 1) the hierarchical sentence modeling through layer-by-layer composition and pooling, and 2) the capturing of the rich matching patterns at different levels of abstraction. Second, we perform an extensive empirical study on tasks with different scales and characteristics, and demonstrate the superior power of the proposed architectures over competitor methods.\n\nRoadmap We start by introducing a convolution network in Section 2 as the basic architecture for sentence modeling, and how it is related to existing sentence models. Based on that, in Section 3, we propose two architectures for sentence matching, with a detailed discussion of their relation. In Section 4, we briefly discuss the learning of the proposed architectures. Then in Section 5, we report our empirical study, followed by a brief discussion of related work in Section 6.\n\n2 Convolutional Sentence Model\n\nWe start with proposing a new convolutional architecture for modeling sentences. 
As illustrated in Figure 1, it takes as input the embedding of words (often trained beforehand with unsupervised methods) in the sentence aligned sequentially, and summarizes the meaning of a sentence through layers of convolution and pooling, until reaching a fixed-length vectorial representation in the final layer. As in most convolutional models [11, 1], we use convolution units with a local \u201creceptive field\u201d and shared weights, but we design a large feature map to adequately model the rich structures in the composition of words.\n\nFigure 1: The overall architecture of the convolutional sentence model. A box with dashed lines indicates all-zero padding turned off by the gating function (see the Length Variability paragraph below).\n\nConvolution As shown in Figure 1, the convolution in Layer-1 operates on sliding windows of words (width $k_1$), and the convolutions in deeper layers are defined in a similar way. Generally, with sentence input $x$, the convolution unit for feature map of type-$f$ (among $F_\\ell$ of them) on Layer-$\\ell$ is\n\n$$z_i^{(\\ell,f)} \\stackrel{\\text{def}}{=} z_i^{(\\ell,f)}(x) = \\sigma(w^{(\\ell,f)} \\hat{z}_i^{(\\ell-1)} + b^{(\\ell,f)}), \\quad f = 1, 2, \\cdots, F_\\ell \\tag{1}$$\n\nand its matrix form is $z_i^{(\\ell)} \\stackrel{\\text{def}}{=} z_i^{(\\ell)}(x) = \\sigma(W^{(\\ell)} \\hat{z}_i^{(\\ell-1)} + b^{(\\ell)})$, where\n\n\u2022 $z_i^{(\\ell,f)}(x)$ gives the output of feature map of type-$f$ for location $i$ in Layer-$\\ell$;\n\u2022 $w^{(\\ell,f)}$ is the parameters for $f$ on Layer-$\\ell$, with matrix form $W^{(\\ell)} \\stackrel{\\text{def}}{=} [w^{(\\ell,1)}, \\cdots, w^{(\\ell,F_\\ell)}]$;\n\u2022 $\\sigma(\\cdot)$ is the activation function (e.g., Sigmoid or ReLU [7]);\n\u2022 $\\hat{z}_i^{(\\ell-1)}$ denotes the segment of Layer-$(\\ell-1)$ for the convolution at location $i$, while\n\n$$\\hat{z}_i^{(0)} = x_{i:i+k_1-1} \\stackrel{\\text{def}}{=} [x_i^\\top, x_{i+1}^\\top, \\cdots, x_{i+k_1-1}^\\top]^\\top$$\n\nconcatenates the vectors for $k_1$ (width of sliding window) words from sentence input $x$.\n\nMax-Pooling We take a max-pooling in every two-unit window for every $f$, after each convolution\n\n$$z_i^{(\\ell,f)} = \\max(z_{2i-1}^{(\\ell-1,f)}, z_{2i}^{(\\ell-1,f)}), \\quad \\ell = 2, 4, \\cdots.$$\n\nThe effects of pooling are two-fold: 1) it shrinks the size of the representation by half, thus quickly absorbing the differences in length for sentence representation, and 2) it filters out undesirable compositions of words (see Section 2.1 for some analysis).\n\nLength Variability The variable length of sentences in a fairly broad range can be readily handled with the convolution and pooling strategy. More specifically, we put all-zero padding vectors after the last word of the sentence until the maximum length. To eliminate the boundary effect caused by the great variability of sentence lengths, we add to the convolutional unit a gate which sets the output vectors to all-zeros if the input is all zeros. For any given sentence input $x$, the output of type-$f$ filter for location $i$ in the $\\ell$th layer is given by\n\n$$z_i^{(\\ell,f)} \\stackrel{\\text{def}}{=} z_i^{(\\ell,f)}(x) = g(\\hat{z}_i^{(\\ell-1)}) \\cdot \\sigma(w^{(\\ell,f)} \\hat{z}_i^{(\\ell-1)} + b^{(\\ell,f)}), \\tag{2}$$\n\nwhere $g(v) = 0$ if all the elements in vector $v$ equal 0, otherwise $g(v) = 1$. This gate, working with max-pooling and positive activation function (e.g., Sigmoid), keeps away the artifacts from padding in all layers. Actually it creates a natural hierarchy of all-zero padding (as illustrated in Figure 1), consisting of nodes in the neural net that would not contribute in the forward process (as in prediction) and backward propagation (as in learning).\n\n2.1 Some Analysis on the Convolutional Architecture\n\nThe convolutional unit, when combined with max-pooling, can act as the compositional operator with local selection mechanism as in the recursive autoencoder [21]. Figure 2 gives an example on what could happen on the first two layers with input sentence \u201cThe cat sat on the mat\u201d. Just for illustration purpose, we present a dramatic choice of parameters (by turning off some elements in $W^{(1)}$) to make the convolution units focus on different segments within a 3-word window. For example, some feature maps (group 2) give compositions for \u201cthe cat\u201d and \u201ccat sat\u201d, each being a vector. Different feature maps offer a variety of compositions, with confidence encoded in the values (color coded in output of convolution layer in Figure 2). The pooling then chooses, for each composition type, between two adjacent sliding windows, e.g., between \u201con the\u201d and \u201cthe mat\u201d for feature maps group 2 from the rightmost two sliding windows.\n\nFigure 2: The cat example, where in the convolution layer, gray color indicates less confidence in composition.\n\nRelation to Recursive Models Our convolutional model differs from Recurrent Neural Network (RNN, [15]) and Recursive Auto-Encoder (RAE, [21]) in several important ways. First, unlike RAE, it does not take a single path of word/phrase composition determined either by a separate gating function [21], an external parser [19], or just natural sequential order [20]. 
Instead, it takes multiple choices of composition via a large feature map (encoded in $w^{(\\ell,f)}$ for different $f$), and leaves the choices to the pooling afterwards to pick the more appropriate segments (in every adjacent two) for each composition. With any window width $k_\\ell \\geq 3$, the type of composition would be much richer than that of RAE. Second, our convolutional model can take supervised training and tune the parameters for a specific task, a property vital to our supervised learning-to-match framework. However, unlike recursive models [20, 21], the convolutional architecture has a fixed depth, which bounds the level of composition it could do. For tasks like matching, this limitation can be largely compensated with a network afterwards that can take a \u201cglobal\u201d synthesis on the learned sentence representation.\n\nRelation to \u201cShallow\u201d Convolutional Models The proposed convolutional sentence model takes simple architectures such as [18, 10] (essentially the same convolutional architecture as SENNA [6]), which consist of a convolution layer and a max-pooling over the entire sentence for each feature map, as special cases. This type of model, with local convolutions and a global pooling, essentially does a \u201csoft\u201d local template matching and is able to detect local features useful for a certain task. Since the sentence-level sequential order is inevitably lost in the global pooling, the model is incapable of modeling more complicated structures. 
It is not hard to see that our convolutional model degenerates to the SENNA-type architecture if we limit the number of layers to be two and set the pooling window infinitely large.\n\n3 Convolutional Matching Models\n\nBased on the discussion in Section 2, we propose two related convolutional architectures, namely ARC-I and ARC-II, for matching two sentences.\n\n3.1 Architecture-I (ARC-I)\n\nArchitecture-I (ARC-I), as illustrated in Figure 3, takes a conventional approach: It first finds the representation of each sentence, and then compares the representation for the two sentences with a multi-layer perceptron (MLP) [3]. It is essentially the Siamese architecture introduced in [2, 11], which has been applied to different tasks as a nonlinear similarity function [23]. Although ARC-I enjoys the flexibility brought by the convolutional sentence model, it suffers from a drawback inherited from the Siamese architecture: it defers the interaction between two sentences (in the final MLP) until their individual representations mature (in the convolution model), and therefore runs the risk of losing details (e.g., a city name) important for the matching task in representing the sentences. In other words, in the forward phase (prediction), the representation of each sentence is formed without knowledge of the other. This cannot be adequately circumvented in backward phase (learning), when the convolutional model learns to extract structures informative for matching on a population level.\n\nFigure 3: Architecture-I for matching two sentences.\n\n3.2 Architecture-II (ARC-II)\n\nIn view of the drawback of Architecture-I, we propose Architecture-II (ARC-II) that is built directly on the interaction space between two sentences. 
It has the desirable property of letting two sentences meet before their own high-level representations mature, while still retaining the space for the individual development of abstraction of each sentence. Basically, in Layer-1, we take sliding windows on both sentences, and model all the possible combinations of them through \u201cone-dimensional\u201d (1D) convolutions. For segment $i$ on $S_X$ and segment $j$ on $S_Y$, we have the feature map\n\n$$z_{i,j}^{(1,f)} \\stackrel{\\text{def}}{=} z_{i,j}^{(1,f)}(x, y) = g(\\hat{z}_{i,j}^{(0)}) \\cdot \\sigma(w^{(1,f)} \\hat{z}_{i,j}^{(0)} + b^{(1,f)}), \\tag{3}$$\n\nwhere $\\hat{z}_{i,j}^{(0)} \\in \\mathbb{R}^{2 k_1 D_e}$ simply concatenates the vectors for sentence segments for $S_X$ and $S_Y$:\n\n$$\\hat{z}_{i,j}^{(0)} = [x_{i:i+k_1-1}^\\top, y_{j:j+k_1-1}^\\top]^\\top.$$\n\nClearly the 1D convolution preserves the location information about both segments. After that in Layer-2, it performs a 2D max-pooling in non-overlapping $2 \\times 2$ windows (illustrated in Figure 5)\n\n$$z_{i,j}^{(2,f)} = \\max(\\{z_{2i-1,2j-1}^{(1,f)}, z_{2i-1,2j}^{(1,f)}, z_{2i,2j-1}^{(1,f)}, z_{2i,2j}^{(1,f)}\\}). \\tag{4}$$\n\nIn Layer-3, we perform a 2D convolution on $k_3 \\times k_3$ windows of output from Layer-2:\n\n$$z_{i,j}^{(3,f)} = g(\\hat{z}_{i,j}^{(2)}) \\cdot \\sigma(W^{(3,f)} \\hat{z}_{i,j}^{(2)} + b^{(3,f)}). \\tag{5}$$\n\nThis could go on for more layers of 2D convolution and 2D max-pooling, analogous to that of convolutional architecture for image input [11].\n\nThe 2D-Convolution After the first convolution, we obtain a low level representation of the interaction between the two sentences, and from then we obtain a high level representation $z_{i,j}^{(\\ell)}$ which encodes the information from both sentences. The general two-dimensional convolution is formulated as\n\n$$z_{i,j}^{(\\ell)} = g(\\hat{z}_{i,j}^{(\\ell-1)}) \\cdot \\sigma(W^{(\\ell)} \\hat{z}_{i,j}^{(\\ell-1)} + b^{(\\ell,f)}), \\quad \\ell = 3, 5, \\cdots \\tag{6}$$\n\nwhere $\\hat{z}_{i,j}^{(\\ell-1)}$ concatenates the corresponding vectors from its 2D receptive field in Layer-$(\\ell-1)$. The 2D pooling has a different mechanism from that in the 1D case, for it selects not only among compositions on different segments but also among different local matchings. This pooling strategy resembles the dynamic pooling in [19] in a similarity learning context, but with two distinctions: 1) it happens on a fixed architecture and 2) it has much richer structure than just similarity.\n\nFigure 4: Architecture-II (ARC-II) of convolutional matching model.\n\n3.3 Some Analysis on ARC-II\n\nOrder Preservation Both the convolution and pooling operations in Architecture-II have an order-preserving property. Generally, $z_{i,j}^{(\\ell)}$ contains information about the words in $S_X$ before those in $z_{i+1,j}^{(\\ell)}$, although they may be generated with slightly different segments in $S_Y$, due to the 2D pooling (illustrated in Figure 5). The order is however retained in a \u201cconditional\u201d sense. Our experiments show that when ARC-II is trained on the $(S_X, S_Y, \\tilde{S}_Y)$ triples where $\\tilde{S}_Y$ randomly shuffles the words in $S_Y$, it consistently gains some ability of finding the correct $S_Y$ in the usual contrastive negative sampling setting, which however does not happen with ARC-I.\n\nModel Generality It is not hard to show that ARC-II actually subsumes ARC-I as a special case. Indeed, in ARC-II if we choose (by turning off some parameters in $W^{(\\ell,\\cdot)}$) to keep the representations of the two sentences separated until the final MLP, ARC-II can actually act fully like ARC-I, as illustrated in Figure 6. 
More specifically, if we let the feature maps in the first convolution layer be either devoted to $S_X$ or devoted to $S_Y$ (instead of taking both as in the general case), the output of each segment-pair is naturally divided into two corresponding groups. As a result, the output for each filter $f$, denoted $z_{1:n,1:n}^{(1,f)}$ ($n$ is the number of sliding windows), will be of rank one, possessing essentially the same information as the result of the first convolution layer in ARC-I. Clearly the 2D pooling that follows will reduce to 1D pooling, with this separateness preserved. If we further limit the parameters in the second convolution units (more specifically $w^{(2,f)}$) to those for $S_X$ and $S_Y$, we can ensure the individual development of different levels of abstraction on each side, and fully recover the functionality of ARC-I.\n\nFigure 5: Order preserving in 2D-pooling.\n\nFigure 6: ARC-I as a special case of ARC-II. Better viewed in color.\n\nAs suggested by the order-preserving property and the generality of ARC-II, this architecture offers not only the capability but also the inductive bias for the individual development of internal abstraction on each sentence, despite the fact that it is built on the interaction between two sentences. As a result, ARC-II can naturally blend two seemingly diverging processes: 1) the successive composition within each sentence, and 2) the extraction and fusion of matching patterns between them, and hence is powerful for matching linguistic objects with rich structures. This intuition is verified by the superior performance of ARC-II in experiments (Section 5) on different matching tasks.\n\n4 Training\n\nWe employ a discriminative training strategy with a large margin objective. Suppose that we are given the following triples $(x, y^+, y^-)$ from the oracle, with $x$ matched with $y^+$ better than with $y^-$. 
We have the following ranking-based loss as objective:\n\n$$e(x, y^+, y^-; \\Theta) = \\max(0, 1 + s(x, y^-) - s(x, y^+)),$$\n\nwhere $s(x, y)$ is the predicted matching score for $(x, y)$, and $\\Theta$ includes the parameters for convolution layers and those for the MLP. The optimization is relatively straightforward for both architectures with the standard back-propagation. The gating function (see Section 2) can be easily incorporated into the gradient by discounting the contribution from convolution units that have been turned off by the gating function. We use stochastic gradient descent for the optimization of models. All the proposed models perform better with mini-batches (100\u223c200 in size), which can be easily parallelized on a single machine with multiple cores. For regularization, we find that for both architectures, early stopping [16] is enough for models with medium size and large training sets (with over 500K instances). For small datasets (less than 10K training instances) however, we have to combine early stopping and dropout [8] to deal with the serious overfitting problem.\n\nWe use 50-dimensional word embedding trained with Word2Vec [14]: the embedding for English words (Section 5.2 & 5.4) is learnt on Wikipedia (\u223c1B words), while that for Chinese words (Section 5.3) is learnt on Weibo data (\u223c300M words). Our other experiments (results omitted here) suggest that fine-tuning the word embedding can further improve the performances of all models, at the cost of longer training. We vary the maximum sentence length (in words) for different tasks to cope with the longest sentence in each. We use 3-word window throughout all experiments\u00b2, but test various numbers of feature maps (typically from 200 to 500), for optimal performance. 
ARC-II models for all tasks have eight layers (three for convolution, three for pooling, and two for MLP), while ARC-I performs better with fewer layers (two for convolution, two for pooling, and two for MLP) and more hidden nodes. We use ReLU [7] as the activation function for all models (convolution and MLP), which yields results comparable to or better than sigmoid-like functions, but converges faster.\n\n5 Experiments\n\nWe report the performance of the proposed models on three matching tasks of different nature, and compare it with that of other competitor models. Among them, the first two tasks (namely, Sentence Completion and Tweet-Response Matching) are about matching of language objects of heterogeneous natures, while the third one (paraphrase identification) is a natural example of matching homogeneous objects. Moreover, the three tasks involve two languages, different types of matching, and distinctive writing styles, demonstrating the broad applicability of the proposed models.\n\n5.1 Competitor Methods\n\n\u2022 WORDEMBED: We first represent each short-text as the sum of the embedding of the words it contains. 
The matching score of two short-texts is calculated with an MLP with the embedding of the two documents as input;\n\u2022 DEEPMATCH: We take the matching model in [13] and train it on our datasets with 3 hidden layers and 1,000 hidden nodes in the first hidden layer;\n\u2022 URAE+MLP: We use the Unfolding Recursive Autoencoder [19]\u00b3 to get a 100-dimensional vector representation of each sentence, and put an MLP on the top as in WORDEMBED;\n\u2022 SENNA+MLP/SIM: We use the SENNA-type sentence model for sentence representation;\n\u2022 SENMLP: We take the whole sentence as input (with word embedding aligned sequentially), and use an MLP to obtain the score of coherence.\n\nAll the competitor models are trained on the same training set as the proposed models, and we report the best test performance over different choices of models (e.g., the number and size of hidden layers in MLP).\n\n\u00b2 Our other experiments suggest that the performance can be further increased with wider windows.\n\u00b3 Code from: http://nlp.stanford.edu/~socherr/classifyParaphrases.zip\n\n5.2 Experiment I: Sentence Completion\n\nThis is an artificial task designed to elucidate how different matching models can capture the correspondence between two clauses within a sentence. Basically, we take a sentence from Reuters [12] with two \u201cbalanced\u201d clauses (with 8\u223c28 words) divided by one comma, and use the first clause as $S_X$ and the second as $S_Y$. The task is then to recover the original second clause for any given first clause. The matching here is considered heterogeneous since the relation between the two is nonsymmetrical on both lexical and semantic levels. We deliberately make the task harder by using negative second clauses similar to the original ones\u2074, both in training and testing. 
One representative example is given as follows:\n$S_X$: Although the state has only four votes in the Electoral College,\n$S_Y^+$: its loss would be a symbolic blow to republican presidential candidate Bob Dole.\n$S_Y^-$: but it failed to garner enough votes to override an expected veto by president Clinton.\n\nAll models are trained on 3 million triples (from 600K positive pairs), and tested on 50K positive pairs, each accompanied by four negatives, with results shown in Table 1. The two proposed models get nearly half of the cases right\u2075, with large margin over other sentence models and models without explicit sequence modeling. ARC-II outperforms ARC-I significantly, showing the power of joint modeling of matching and sentence meaning. As another convolutional model, SENNA+MLP performs fairly well on this task, although still running behind the proposed convolutional architectures since it is too shallow to adequately model the sentence. It is a bit surprising that URAE comes last on this task, which might be caused by the facts that 1) the representation model (including word-embedding) is not trained on Reuters, and 2) the split-sentence setting hurts the parsing, which is vital to the quality of learned sentence representation.\n\nTable 1: Sentence Completion.\nModel | P@1 (%)\nRandom Guess | 20.00\nDEEPMATCH | 32.50\nWORDEMBED | 37.63\nSENMLP | 36.14\nSENNA+MLP | 41.56\nURAE+MLP | 25.76\nARC-I | 47.51\nARC-II | 49.62\n\n5.3 Experiment II: Matching A Response to A Tweet\n\nWe trained our model with 4.5 million original (tweet, response) pairs collected from Weibo, a major Chinese microblog service [26]. Compared to Experiment I, the writing style is obviously more free and informal. For each positive pair, we find ten random responses as negative examples, rendering 45 million triples for training. 
One example (translated to English) is given below, with $S_X$ standing for the tweet, $S_Y^+$ the original response, and $S_Y^-$ the randomly selected response:\n$S_X$: Damn, I have to work overtime this weekend!\n$S_Y^+$: Try to have some rest buddy.\n$S_Y^-$: It is hard to find a job, better start polishing your resume.\n\nWe hold out 300K original (tweet, response) pairs and test the matching models on their ability to pick the original response from four random negatives, with results reported in Table 2. This task is slightly easier than Experiment I, with more training instances and purely random negatives. It demands less grammatical rigor but more detailed modeling of loose and local matching patterns (e.g., work-overtime \u21d4 rest). Again ARC-II beats other models with large margins, while two convolutional sentence models ARC-I and SENNA+MLP come next.\n\nTable 2: Tweet Matching.\nModel | P@1 (%)\nRandom Guess | 20.00\nDEEPMATCH | 49.85\nWORDEMBED | 54.31\nSENMLP | 52.22\nSENNA+MLP | 56.48\nARC-I | 59.18\nARC-II | 61.95\n\n\u2074 We select from a random set the clauses that have 0.7\u223c0.8 cosine similarity with the original. The dataset and more information can be found from http://www.noahlab.com.hk/technology/Learning2Match.html\n\u2075 Actually ARC-II can achieve 74+% accuracy with random negatives.\n\n5.4 Experiment III: Paraphrase Identification\n\nTable 3: The results on Paraphrase.\nModel | Acc. (%) | F1 (%)\nBaseline | 66.5 | 79.90\nRus et al. (2008) | 70.6 | 80.50\nWORDEMBED | 68.7 | 80.49\nSENNA+MLP | 68.4 | 79.70\nSENMLP | 68.4 | 79.50\nARC-I | 69.6 | 80.27\nARC-II | 69.9 | 80.91\n\nParaphrase identification aims to determine whether two sentences have the same meaning, a problem considered a touchstone of natural language understanding. This experiment is included to test our methods on matching homogeneous objects. 
Here we use the benchmark MSRP dataset [17], which contains 4,076 instances for training and 1,725 for test. We use all the training instances and report the test performance from early stopping. As stated earlier, our model is not specially tailored for modeling synonymy, and generally requires \u2265100K instances to work favorably. Nevertheless, our generic matching models still manage to perform reasonably well, achieving an accuracy and F1 score close to the best performer in 2008 based on hand-crafted features [17], but still significantly lower than the state-of-the-art (76.8%/83.6%), achieved with unfolding-RAE and other features designed for this task [19].\n\n5.5 Discussions\n\nARC-II outperforms others significantly when the training instances are relatively abundant (as in Experiment I & II). Its superiority over ARC-I, however, is less salient when the sentences have deep grammatical structures and the matching relies less on the local matching patterns, as in Experiment I. This therefore raises the interesting question about how to balance the representation of matching and the representations of objects, and whether we can guide the learning process through something like curriculum learning [4].\n\nAs another important observation, convolutional models (ARC-I & II, SENNA+MLP) perform favorably over bag-of-words models, indicating the importance of utilizing sequential structures in understanding and matching sentences. Quite interestingly, as shown by our other experiments, ARC-I and ARC-II trained purely with random negatives automatically gain some ability in telling whether the words in a given sentence are in right sequential order (with around 60% accuracy for both). 
It is therefore a bit surprising that an auxiliary task on identifying the correctness of word order in the response does not enhance the ability of the model on the original matching tasks.\n\nWe noticed that simple sum of embedding learned via Word2Vec [14] yields reasonably good results on all three tasks. We hypothesize that the Word2Vec embedding is trained in such a way that the vector summation can act as a simple composition, and hence retains a fair amount of meaning in the short text segment. This is in contrast with other bag-of-words models like DEEPMATCH [13].\n\n6 Related Work\n\nMatching structured objects rarely goes beyond estimating the similarity of objects in the same domain [23, 24, 19], with few exceptions like [2, 18]. When dealing with language objects, most methods still focus on seeking vectorial representations in a common latent space, and calculating the matching score with inner product [18, 25]. Little work has been done on building a deep architecture on the interaction space for text pairs, and what exists is largely based on a bag-of-words representation of text [13].\n\nOur models are related to the long thread of work on sentence representation. Aside from the models with recursive nature [15, 21, 19] (as discussed in Section 2.1), it is fairly common practice to use the sum of word-embedding to represent a short-text, mostly for classification [22]. There is very little work on convolutional modeling of language. In addition to [6, 18], there is a very recent model on sentence representation with dynamic convolutional neural network [9]. 
This work relies heavily on a carefully designed pooling strategy to handle the variable length of sentence with a relatively small feature map, tailored for classification problems with modest sizes.\n\n7 Conclusion\n\nWe propose deep convolutional architectures for matching natural language sentences, which can nicely combine the hierarchical modeling of individual sentences and the patterns of their matching. Empirical study shows our models can outperform competitors on a variety of matching tasks.\n\nAcknowledgments: B. Hu and Q. Chen are supported in part by National Natural Science Foundation of China 61173075. Z. Lu and H. Li are supported in part by China National 973 project 2014CB340301.\n\nReferences\n\n[1] O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn. Applying convolutional neural networks concepts to hybrid nn-hmm model for speech recognition. In Proceedings of ICASSP, 2012.\n[2] A. Bordes, X. Glorot, J. Weston, and Y. Bengio. A semantic matching energy function for learning with multi-relational data. Machine Learning, 94(2):233\u2013259, 2014.\n[3] Y. Bengio. Learning deep architectures for ai. Found. Trends Mach. Learn., 2(1):1\u2013127, 2009.\n[4] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of ICML, 2009.\n[5] P. F. Brown, S. A. D. Pietra, V. J. D. Pietra, and R. L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational linguistics, 19(2):263\u2013311, 1993.\n[6] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493\u20132537, 2011.\n[7] G. E. Dahl, T. N. Sainath, and G. E. Hinton. Improving deep neural networks for lvcsr using rectified linear units and dropout. In Proceedings of ICASSP, 2013.\n[8] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. 
Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.\n[9] N. Kalchbrenner, E. Grefenstette, and P. Blunsom. A convolutional neural network for modelling sentences. In Proceedings of ACL, Baltimore, USA, 2014.\n[10] Y. Kim. Convolutional neural networks for sentence classification. In Proceedings of EMNLP, 2014.\n[11] Y. LeCun and Y. Bengio. Convolutional networks for images, speech and time series. The Handbook of Brain Theory and Neural Networks, 3361, 1995.\n[12] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361\u2013397, 2004.\n[13] Z. Lu and H. Li. A deep architecture for matching short texts. In Advances in NIPS, 2013.\n[14] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.\n[15] T. Mikolov and M. Karafi\u00e1t. Recurrent neural network based language model. In Proceedings of INTERSPEECH, 2010.\n[16] R. Caruana, S. Lawrence, and L. Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in NIPS, 2000.\n[17] V. Rus, P. M. McCarthy, M. C. Lintean, D. S. McNamara, and A. C. Graesser. Paraphrase identification with lexico-syntactic graph subsumption. In Proceedings of FLAIRS Conference, 2008.\n[18] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. Learning semantic representations using convolutional neural networks for web search. In Proceedings of WWW, 2014.\n[19] R. Socher, E. H. Huang, and A. Y. Ng. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in NIPS, 2011.\n[20] R. Socher, C. C. Lin, A. Y. Ng, and C. D. Manning. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. In Proceedings of ICML, 2011.\n[21] R. Socher, J. 
Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of EMNLP, 2011.\n[22] Y. Song and D. Roth. On dataless hierarchical text classification. In Proceedings of AAAI, 2014.\n[23] Y. Sun, X. Wang, and X. Tang. Hybrid deep learning for face verification. In Proceedings of ICCV, 2013.\n[24] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt. Graph kernels. Journal of Machine Learning Research, 11:1201\u20131242, 2010.\n[25] B. Wang, X. Wang, C. Sun, B. Liu, and L. Sun. Modeling semantic relevance for question-answer pairs in web social communities. In Proceedings of ACL, 2010.\n[26] H. Wang, Z. Lu, H. Li, and E. Chen. A dataset for research on short-text conversations. In Proceedings of EMNLP, Seattle, Washington, USA, 2013.\n[27] W. Wu, Z. Lu, and H. Li. Learning bilinear model for matching queries and documents. Journal of Machine Learning Research, 14(1):2519\u20132548, 2013.\n[28] X. Xue, J. Jeon, and W. B. Croft. Retrieval models for question and answer archives. In Proceedings of SIGIR \u201908, New York, NY, USA, 2008.", "award": [], "sourceid": 1108, "authors": [{"given_name": "Baotian", "family_name": "Hu", "institution": "Harbin Institute of Technology Shenzhen Graduate School(HITSZ)"}, {"given_name": "Zhengdong", "family_name": "Lu", "institution": "Noah's Ark Lab, Huawei Technologies"}, {"given_name": "Hang", "family_name": "Li", "institution": "Noah's Ark Lab"}, {"given_name": "Qingcai", "family_name": "Chen", "institution": "Harbin Institute of Technology Shenzhen Graduate School"}]}