{"title": "A Deep Architecture for Matching Short Texts", "book": "Advances in Neural Information Processing Systems", "page_first": 1367, "page_last": 1375, "abstract": "Many machine learning problems can be interpreted as learning for matching two types of objects (e.g., images and captions, users and products, queries and documents). The matching level of two objects is usually measured as the inner product in a certain feature space, while the modeling effort focuses on mapping of objects from the original space to the feature space. This schema, although proven successful on a range of matching tasks, is insufficient for capturing the rich structure in the matching process of more complicated objects. In this paper, we propose a new deep architecture to more effectively model the complicated matching relations between two objects from heterogeneous domains. More specifically, we apply this model to matching tasks in natural language, e.g., finding sensible responses for a tweet, or relevant answers to a given question. This new architecture naturally combines the localness and hierarchy intrinsic to the natural language problems, and therefore greatly improves upon the state-of-the-art models.", "full_text": "A Deep Architecture for Matching Short Texts\n\nZhengdong Lu\nNoah\u2019s Ark Lab\n\nHuawei Technologies Co. Ltd.\n\nSha Tin, Hong Kong\n\nLu.Zhengdong@huawei.com\n\nHang Li\n\nNoah\u2019s Ark Lab\n\nHuawei Technologies Co. Ltd.\n\nSha Tin, Hong Kong\n\nHangLi.HL@huawei.com\n\nAbstract\n\nMany machine learning problems can be interpreted as learning for matching two\ntypes of objects (e.g., images and captions, users and products, queries and doc-\numents, etc.). The matching level of two objects is usually measured as the inner\nproduct in a certain feature space, while the modeling effort focuses on mapping of\nobjects from the original space to the feature space. 
This schema, although proven successful on a range of matching tasks, is insufficient for capturing the rich structure in the matching process of more complicated objects. In this paper, we propose a new deep architecture to more effectively model the complicated matching relations between two objects from heterogeneous domains. More specifically, we apply this model to matching tasks in natural language, e.g., finding sensible responses for a tweet, or relevant answers to a given question. This new architecture naturally combines the localness and hierarchy intrinsic to the natural language problems, and therefore greatly improves upon the state-of-the-art models.

1 Introduction

Many machine learning problems can be interpreted as matching two objects, e.g., images and captions in automatic captioning [11, 14], users and products in recommender systems, queries and retrieved documents in information retrieval. Matching is different from the usual notion of similarity, since it is usually defined between objects from two different domains (e.g., texts and images), and it is usually associated with a particular purpose. The degree of matching is typically modeled as an inner product of two feature vectors representing objects x and y in a Hilbert space H,

match(x, y) = ⟨Φ_X(x), Φ_Y(y)⟩_H,    (1)

while the modeling effort boils down to finding the mappings from the original inputs to the feature vectors. Linear models in this direction include Partial Least Square (PLS) [19, 20], Canonical Correlation Analysis (CCA) [7], and their large margin variants [1]. In addition, there is also limited effort on finding nonlinear mappings for that purpose [3, 18].
In this paper, we focus on the rather difficult task of matching a given short text and candidate responses. Examples include retrieving answers for a given question and automatically commenting on a given tweet. 
This inner-product based schema, although proven effective on tasks like information retrieval, is often incapable of modeling the matching between complicated objects. First, representing structured objects like text as compact and meaningful vectors can be difficult; second, an inner product cannot sufficiently take into account the complicated interaction between components within the objects, which is often of a rather nonlinear nature.
In this paper, we attack the problem of matching short texts from a brand new angle. Instead of representing the text objects in each domain as semantically meaningful vectors, we directly model object-object interactions with a deep architecture. This new architecture allows us to explicitly capture the natural nonlinearity and the hierarchical structure in matching two structured objects.

2 Model Overview

We start with the bilinear model. Assume we can represent objects in domains X and Y with vectors x ∈ R^{Dx} and y ∈ R^{Dy}. The bilinear matching model decides the score for any pair (x, y) as

match(x, y) = x^T A y = Σ_{n=1}^{Dx} Σ_{m=1}^{Dy} A_nm x_n y_m,    (2)

with a pre-determined A. From a different angle, each element product x_n y_m in the above sum can be viewed as a micro and local decision about the matching level of x and y. The outer-product matrix M = x y^T specifies the space of element-wise interaction between objects x and y. The final decision is made considering all the local decisions: in the bilinear case, match(x, y) = Σ_{nm} A_nm M_nm simply sums all the local decisions with weights specified by A, as illustrated in Figure 1.

Figure 1: Architecture for linear matching.

2.1 From Linear to Deep
This simple summarization strategy can be extended to a deep architecture to explore the nonlinearity and hierarchy in matching short texts. 
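As a toy illustration (not the paper's code), the bilinear score of Eq. (2) is just a weighted sum over all element-wise "local decisions":

```python
def bilinear_match(x, y, A):
    """Eq. (2): match(x, y) = x^T A y, i.e., a weighted sum of all
    element-wise local decisions x_n * y_m, with weights A[n][m]."""
    return sum(A[n][m] * x[n] * y[m]
               for n in range(len(x)) for m in range(len(y)))
```

For example, `bilinear_match([1, 2], [3], [[1.0], [0.5]])` sums the two local decisions 1·3 and 2·3 with weights 1.0 and 0.5.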
Unlike tasks like text classification, we need to work on a pair of text objects to be matched, which we refer to as parallel texts, a term borrowed from machine translation. This new architecture is mainly based on the following two intuitions:
Localness: there is a salient local structure in the semantic space of parallel text objects to be matched, which can be roughly captured via the co-occurrence pattern of words across the objects. This localness however should not prevent two "distant" components from correlating with each other on a higher level, which calls for the hierarchical characteristic of our model;
Hierarchy: the decision making for matching has different levels of abstraction. The local decisions, capturing the interaction between semantically close words, will be combined later layer-by-layer to form the final and global decision on matching.

2.2 Localness

The localness of the text matching problem can be best described using an analogy with patches in images, as illustrated in Figure 2.

Figure 2: Image patches vs. parallel-text patches.

Loosely speaking, a patch for parallel texts defines the set of interacting pairs of words from the two text objects. Like the coordinates of an image patch, we can use (Ω_{x,p}, Ω_{y,p}) to specify the range of the patch, with Ω_{x,p} and Ω_{y,p} each specifying a subset of terms in X and Y respectively. Like the patches of images, the patches defined here are meant to capture segments of rich inherent structure. But unlike the naturally formed rectangular patches of images, the patches defined here do not come with a pre-given spatial continuity. This is so since in texts, the nearness of words is not naturally given as the location of pixels in images, but instead needs to be discovered from the co-occurrence patterns of the matched texts. 
As shown later in Section 3, we actually do that with a method resembling bilingual topic modeling, which nicely captures the co-occurrence of the words within-domain and cross-domain simultaneously. The basic intuitions here are: 1) when words co-occur frequently across the domains (e.g., fever in one domain and antibiotics in the other), they are likely to have strong interaction in determining the matching score, and 2) when words co-occur frequently in the same domain (e.g., {Hawaii, vacation}), they are likely to collaborate in making the matching decision. For example, modeling the matching between the word "Hawaii" in a question (likely a travel-related question) and the word "RAM" in an answer (likely an answer to a computer-related question) is probably useless, judging from their co-occurrence pattern in Question-Answer pairs. In other words, our architecture models only "local" pairwise relations on a low level with patches, while describing the interaction between semantically distant terms on higher levels in the hierarchy.

2.3 Hierarchy
Once the local decisions on patches are made (most of them are NULL for a particular short text pair), they will be sent to the next layer, where the lower-level decisions are further combined to form more composite decisions, which in turn will be sent to still higher levels. This process runs until it reaches the final decision. Figure 3 gives an illustrative example of hierarchical decision making. As it shows, the local decisions on patches "SIGHTSEEING IN PARIS" and "SIGHTSEEING IN BERLIN" can be combined to form a higher level decision on the patch "SIGHTSEEING", which in turn can be combined with decisions on patches like "HOTEL" and "TRANSPORTATION" to form an even higher level decision on "TRAVEL". Note that one low-level topic does not exclusively belong to a higher-level one. 
For example, the "WEATHER" patch may belong to the higher level patches "TRAVEL" and "AGRICULTURE" at the same time.
Quite intuitively, this decision composition mechanism is also local and varies with the "locations". For example, when combining "SIGHTSEEING IN PARIS" and "SIGHTSEEING IN BERLIN", it is more like an OR logic, since it only takes one of them to be positive. A more complicated strategy is often needed in, for example, a decision on "TRAVELING", which often takes more than one element, like "SIGHTSEEING", "HOTEL", "TRANSPORTATION", or "WEATHER", but not necessarily all of them. The particular strategy taken by a local decision composition unit is fully encoded in the weights of the corresponding neuron through

s_p(x, y) = f(w_p^T Φ_p(x, y)),    (3)

where f is the activation function. As stated in [12], a simple nonlinear function (such as the sigmoid) with proper weights is capable of realizing basic logics such as AND and OR. Here we decide the hierarchical architecture of the decision making, but leave the exact mechanism for decision combination (encoded in the weights) to the learning algorithm later.

Figure 3: An example of decision hierarchy.

3 The Construction of Deep Architecture

The process for constructing the deep architecture for matching consists of two steps. First, we define parallel text patches with different resolutions using bilingual topic models. Second, we construct a layered directed acyclic graph (DAG) describing the hierarchy of the topics, based on which we further construct the topology of the deep neural network.

3.1 Topic Modeling for Parallel Texts
This step is to discover parallel text segments capturing meaningful co-occurrence patterns of words in both domains. 
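Looking back at Eq. (3) in Section 2.3, the claim from [12] that a sigmoid unit with proper weights can realize AND/OR logic is easy to check with a toy sketch. The weights below are illustrative hand-picked values, not weights learned by the model:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# A single sigmoid neuron on (near-)binary inputs a, b in {0, 1}:
def or_unit(a, b):
    return sigmoid(20 * a + 20 * b - 10)   # near 1 if either input is on

def and_unit(a, b):
    return sigmoid(20 * a + 20 * b - 30)   # near 1 only if both are on
```

In the model, of course, the bias and weights of each composition unit are left to the learning algorithm rather than fixed by hand.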
Although more sophisticated methods may exist for capturing this relationship, we take an approach similar to the multi-lingual pLSI proposed in [10], and simply put the words from parallel texts together into a joint document, while using a different virtual vocabulary for each domain to avoid any mixing-up. For example, the word hotel appearing in domain X is treated as a different word from hotel in domain Y. As the modeling tool, we use latent Dirichlet allocation (LDA) with Gibbs sampling [2] on all the training data. Notice that by using topic modeling, we allow overlapping sets of words, which is advantageous over a non-overlapping clustering of words, since we may expect some words (e.g., hotel and price) to appear in multiple segments. Table 1 gives two example parallel topics learned from a traveling-related Question-Answer corpus (see Section 5 for more details). As we can see, within the same topic, a word in domain X co-occurs frequently not only with words in the same domain, but also with those in domain Y. 
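The joint-document construction with virtual vocabularies can be sketched in a few lines. This is a minimal sketch: the `X:`/`Y:` prefixes are our own convention for keeping the two vocabularies disjoint, and the LDA fitting itself (with Gibbs sampling, as in the paper) is omitted:

```python
def make_joint_documents(pairs):
    """Merge each (x_text, y_text) pair into one joint document,
    tagging tokens with their domain so that, e.g., 'hotel' in domain X
    and 'hotel' in domain Y remain distinct vocabulary entries."""
    docs = []
    for x_text, y_text in pairs:
        doc = ["X:" + w for w in x_text.split()] + \
              ["Y:" + w for w in y_text.split()]
        docs.append(doc)
    return docs
```

Any standard LDA implementation can then be run on these joint documents; each learned topic induces one parallel patch.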
We fit the same corpus with L topic models of decreasing resolutions¹, with the series of learned topic sets denoted as H = {T_1, ···, T_ℓ, ···, T_L}, with ℓ indexing the topic resolution.

¹Topic resolution is controlled mainly by the number of topics, i.e., a topic model with 100 topics is considered to be of lower resolution (or more general) than one with 500 topics.

Topic Label     | Question                                   | Answer
SPECIAL PRODUCT | local delicacy, special product, snack     | tofu, speciality, aroma, duck, sweet, game,
                | food, quality, tasty, ···                  | cuisine, sticky rice, dumpling, mushroom, traditional, ···
TRANSPORTATION  | route, arrangement, location, arrive,      | distance, safety, spending, gateway, air ticket, pass,
                | train station, fare, ···                   | traffic control, highway, metropolis, tunnel, ···

Table 1: Examples of parallel topics. Originally in Chinese, translated into English by the authors.

3.2 Getting Matching Architecture
With the set of topics H, the architecture of the deep matching model can then be obtained in the following three steps. First, we trim the words (in both domains X and Y) with low probability for each topic in T_ℓ ∈ H, and the remaining words in each topic specify a patch p. With a slight abuse of symbols, we still use H to denote the patch sets with different resolutions. Second, based on the patches specified in H, we construct a layered DAG G by assigning each patch with resolution ℓ to a number of patches with resolution ℓ-1, based on the word overlap between patches, as illustrated in Figure 4 (left panel). If a patch p in layer ℓ-1 is assigned to patch p′ in layer ℓ, we denote this relation as p ≺ p′.² Third, based on G, we can construct the architecture of the patch-induced layers of the neural network. 
More specifically, each patch p in layer ℓ will be transformed into K_ℓ neurons in the (ℓ-1)th hidden layer of the neural network, and these K_ℓ neurons are connected to the neurons in the ℓth layer corresponding to patch p′ iff p ≺ p′. In other words, we determine the sparsity pattern of the weights, but leave the values of the weights to the later learning phase. Using the image analogy, the neurons corresponding to patch p are referred to as filters. Figure 4 illustrates the process of transforming patches in layer ℓ-1 (specific topics) and layer ℓ (general topics) into two layers of the neural network with K_ℓ = 2.

Figure 4: An illustration of constructing the deep architecture from hierarchical patches (left: patches; right: neural network).

The overall structure is illustrated in Figure 5. The input layer is a two-dimensional interaction space, which connects to the first patch-induced layer p-layerI, followed by the second patch-induced layer p-layerII. The connections to p-layerI and p-layerII have pre-specified sparsity patterns. Following p-layerII is a committee layer (c-layer), with full connections from p-layerII. With an input (x, y), we first get the local matching decisions on p-layerI, associated with patches in the interaction space. Those local decisions will be sent to the corresponding neurons in p-layerII for the first round of fusion. The outputs of p-layerII are then sent to c-layer for further decision composition. Finally, the logistic regression unit in the output layer summarizes the decisions on c-layer to get the final matching score s(x, y). 
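One layer of the patch DAG described in Section 3.2 can be sketched as follows. The paper specifies only that assignment is "based on word overlap" with at least m_ℓ children per patch (footnote 2), so the exact rule here (keep every overlapping finer patch, fall back to the top-overlap ones) is an assumption:

```python
def build_dag_layer(fine_patches, coarse_patches, m_min=1):
    """For each coarse patch (a set of words), collect the indices of
    finer-resolution patches that share words with it (p ≺ p').
    If nothing overlaps, fall back to the m_min highest-overlap patches."""
    children = {}
    for j, coarse in enumerate(coarse_patches):
        overlaps = sorted(((len(p & coarse), i)
                           for i, p in enumerate(fine_patches)), reverse=True)
        kept = [i for s, i in overlaps if s > 0] or [i for _, i in overlaps[:m_min]]
        children[j] = kept
    return children
```

Running this per adjacent pair of resolutions yields the layered DAG G, whose edges become the (sparse) connection pattern of the patch-induced layers.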
This architecture is referred to as DEEPMATCH in the remainder of the paper.

Figure 5: An illustration of the deep architecture for matching decisions.

²In the assignment, we make sure each patch in layer ℓ is assigned to at least m_ℓ patches in layer ℓ-1.

3.3 Sparsity
The final constructed neural network has two types of sparsity. The first type is enforced through the architecture, since most of the connections between neurons in adjacent layers are turned off in construction. In our experiments, only about 2% of the parameters are allowed to be nonzero. The second type of sparsity comes from the characteristics of the texts. For most object pairs in our experiments, only a small percentage of neurons in the lower layers are active (see Section 5 for more details). This is mainly due to two factors: 1) the input parallel texts are very short (usually < 100 words), and 2) the patches are well designed to give a compact and sparse representation of each of the texts, as described in Section 3.1.
To understand the second type of sparsity, let us start with the following definition:
Definition 3.1. An input pair (x, y) overlaps with patch p, iff x ∩ p_x ≠ ∅ and y ∩ p_y ≠ ∅, where p_x and p_y are respectively the word indices of patch p in domains X and Y.
We also define the indicator function overlap((x, y), p) := ‖p_x ∩ x‖_0 · ‖p_y ∩ y‖_0. The proposed architecture only allows neurons associated with patches overlapping the input to have nonzero output. More specifically, the output of the neuron associated with patch p is

s_p(x, y) = a_p(x, y) · overlap((x, y), p),    (4)

which ensures that s_p(x, y) is nonzero only when there is non-empty cross-talk between x and y within patch p; here a_p(x, y) is the activation of the neuron before this rule is enforced. 
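The gating of Eq. (4) is easy to sketch. One assumption here: since overlap is defined as a product of set sizes but used as an on/off gate, we binarize it so that the activation value itself is preserved when the patch is active:

```python
def overlap(x_words, y_words, patch):
    """Definition 3.1: nonzero iff the input pair shares words with the
    patch on *both* sides (product of intersection sizes)."""
    px, py = patch  # the patch's word sets in domains X and Y
    return len(px & set(x_words)) * len(py & set(y_words))

def gated_output(activation, x_words, y_words, patch):
    """Eq. (4): the neuron's output is zeroed out unless its patch
    overlaps the input pair."""
    return activation if overlap(x_words, y_words, patch) > 0 else 0.0
```

With short inputs, most patches fail this test, which is exactly the second type of sparsity described above.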
It is not hard to see that, for any input (x, y), when we track any upward path of decisions from the input to a higher level, there is no nonzero matching vote until we reach a patch that contains terms from both x and y. This view is particularly useful in parameter tuning with back-propagation: the supervision signal can only get down to a patch p when p overlaps with the input (x, y). It is easy to show from the definition that, once the supervision signal stops at one patch p, it will not get past p and propagate to p's children, even if those children have other ancestors. This indicates that when using stochastic gradient descent, the updating of weights usually involves only a very small number of neurons, and therefore can be very efficient.

3.4 Local Decision Models
In the hidden layers p-layerI, p-layerII, and c-layer, we allow two types of neurons, corresponding to two activation functions: 1) linear, f_lin(t) = t, and 2) sigmoid, f_sig(t) = (1 + e^{-t})^{-1}. In the first layer, each patch p for (x, y) takes the value of the interaction matrix M_p = x_p y_p^T, and the kth local decision on p is given by

a_p^(k)(x, y) = f_p^(k)( Σ_{n,m} A_{p,nm}^(k) M_{p,nm} + b_p^(k) ), k = 1, ···, K_1,

with weights given by A^(k) and activation function f_p^(k) ∈ {f_lin, f_sig}. With a low-rank constraint on A^(k) to reduce the complexity, we essentially have

a_p^(k)(x, y) = f_p^(k)( x_p^T L_{x,p}^(k) (L_{y,p}^(k))^T y_p + b_p^(k) ), k = 1, ···, K_1,    (5)

where L_{x,p}^(k) ∈ R^{|p_x|×D_p} and L_{y,p}^(k) ∈ R^{|p_y|×D_p}, with latent dimension D_p. As indicated in Figure 5, the two-dimensional structure is lost after leaving the input layer, while the local structure is kept in the second patch-induced layer p-layerII. Basically, a neuron in layer p-layerII processes the low-level decisions assigned to it made in layer p-layerI:

a_p^(k)(x, y) = f_p^(k)( w_{p,k}^T Φ_p(x, y) ), k = 1, ···, K_2,    (6)

where Φ_p(x, y) lists all the lower-level decisions assigned to unit p:

Φ_p(x, y) = [···, s_{p′}^(1)(x, y), s_{p′}^(2)(x, y), ···, s_{p′}^(K_1)(x, y), ···],  ∀p′ ≺ p, p′ ∈ T_1,

which contains all the decisions on patches in layer p-layerI subsumed by p. The local decision models in the committee layer c-layer are the same as in p-layerII, except that they are fully connected to the neurons in the previous layer.

4 Learning
We divide the parameters, denoted W, into three sets: 1) the low-rank bilinear models mapping from input patches to p-layerI, namely L_{x,p}^(k), L_{y,p}^(k), and offsets b_p^(k), for all p ∈ P and filter index 1 ≤ k ≤ K_1; 2) the parameters for connections between patch-induced neurons, i.e., the weights between p-layerI and p-layerII, denoted (w_p^(k), b_p^(k)) for associated patch p and filter index 1 ≤ k ≤ K_2; and 3) the weights for the committee layer (c-layer) and after, denoted w_c.
We employ a discriminative training strategy with a large margin objective. Suppose we are given triples (x, y+, y-) from the oracle, with x (∈ X) matched with y+ better than with y- (both ∈ Y). 
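The low-rank local decision of Eq. (5) can be sketched without any matrix library; this is a toy sketch with plain lists, not the paper's implementation:

```python
import math

def local_decision(x_p, y_p, Lx, Ly, b, f="sigmoid"):
    """Eq. (5): a_p(x, y) = f(x_p^T Lx Ly^T y_p + b), where Lx and Ly are
    the low-rank factors of the patch-level bilinear weights A = Lx Ly^T.
    Lx is |p_x| x D_p and Ly is |p_y| x D_p, stored as lists of rows."""
    D = len(Lx[0])  # latent dimension D_p
    # Project both sides into the D_p-dimensional latent space:
    u = [sum(Lx[n][d] * x_p[n] for n in range(len(x_p))) for d in range(D)]
    v = [sum(Ly[m][d] * y_p[m] for m in range(len(y_p))) for d in range(D)]
    pot = sum(ud * vd for ud, vd in zip(u, v)) + b  # the "potential" value
    return pot if f == "linear" else 1.0 / (1.0 + math.exp(-pot))
```

The `f` switch mirrors the two neuron types (linear and sigmoid) allowed in the hidden layers.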
We have the following ranking-based loss as objective:

L(W, D_trn) = Σ_{(x_i, y_i^+, y_i^-) ∈ D_trn} e_W(x_i, y_i^+, y_i^-) + R(W),    (7)

where R(W) is the regularization term, and e_W(x_i, y_i^+, y_i^-) is the error for the triple (x_i, y_i^+, y_i^-), given by the following large margin form:

e_i = e_W(x_i, y_i^+, y_i^-) = max(0, m + s(x_i, y_i^-) - s(x_i, y_i^+)),

with 0 < m < 1 controlling the margin in training. In the experiments, we use m = 0.1.

4.1 Back-Propagation
All three sets of parameters are updated through back-propagation (BP). The updating of the weights in hidden layers is almost the same as for a conventional multi-layer perceptron (MLP), with two slight differences: 1) we have a different input model and two types of activation function, and 2) we could gain some efficiency by leveraging the sparsity pattern of the neural network, but the advantage diminishes quickly after the first two layers.
This sparsity however greatly reduces the number of parameters in the first two layers, and hence the time spent updating them. From Equations (4)-(6), the sub-gradient of the empirical error e w.r.t. L_{x,p}^(k) is

∂e/∂L_{x,p}^(k) = Σ_i ( [∂e_i/∂s_p^(k)(x_i, y_i^+)] · [∂s_p^(k)(x_i, y_i^+)/∂pot_p^(k)(x_i, y_i^+)] · (x_{i,p}(y_{i,p}^+)^T L_{y,p}^(k)) · overlap((x_i, y_i^+), p)
              - [∂e_i/∂s_p^(k)(x_i, y_i^-)] · [∂s_p^(k)(x_i, y_i^-)/∂pot_p^(k)(x_i, y_i^-)] · (x_{i,p}(y_{i,p}^-)^T L_{y,p}^(k)) · overlap((x_i, y_i^-), p) ),    (8)

where i indexes the training instances, and pot_p^(k)(x, y) = x_p^T L_{x,p}^(k) (L_{y,p}^(k))^T y_p + b_p^(k) stands for the potential (pre-activation) value of s_p^(k). The gradient for L_{y,p}^(k) is given in a slightly different way. For the weights between p-layerI and p-layerII, the gradient can also benefit from the sparsity in activation.
We use stochastic sub-gradient descent with mini-batches [9], each of which consists of 50 randomly generated triples (x, y+, y-), where (x, y+) is an original pair and y- is a randomly selected response. With this type of optimization, most of the patches in p-layerI and p-layerII get zero inputs, and therefore remain inactive by definition during both prediction and updating. On the tasks we have tried, only about 2% of the parameters are allowed to be nonzero for the weights among the patch-induced layers. Moreover, during stochastic gradient descent, only about 5% of the neurons in p-layerI and p-layerII are active on average for each training instance, indicating that the designed architecture greatly reduces the essential capacity of the model.

5 Experiments
We compare our deep matching model to inner-product based models, ranging from variants of bilinear models to nonlinear mappings for Φ_X(·) and Φ_Y(·). For bilinear models, we consider only low-rank models with Φ_X(x) = P_x^T x and Φ_Y(y) = P_y^T y, which gives

match(x, y) = ⟨P_x^T x, P_y^T y⟩ = x^T P_x P_y^T y.

With different kinds of constraints on P_x and P_y, we get different models. More specifically, with 1) orthonormality constraints on P_x and P_y, we get Partial Least Square (PLS) [19], and with 2) ℓ2- and ℓ1-based constraints on rows or columns, we get Regularized Mapping to Latent Space (RMLS) [20]. 
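Before moving on to the comparison models, the training recipe of Section 4 (large-margin ranking error over randomly sampled triples) can be sketched as follows. `sample_minibatch` is our own hypothetical helper; the paper only states that each mini-batch holds 50 triples with randomly selected negatives:

```python
import random

def triplet_error(s_pos, s_neg, m=0.1):
    """Large-margin error for one triple (x, y+, y-):
    e = max(0, m + s(x, y-) - s(x, y+)), with margin m = 0.1 as in the paper."""
    return max(0.0, m + s_neg - s_pos)

def sample_minibatch(pairs, batch_size=50, seed=0):
    """One mini-batch of (x, y+, y-) triples: (x, y+) is an observed pair,
    y- is a response drawn at random from other pairs."""
    rng = random.Random(seed)
    batch = []
    while len(batch) < batch_size:
        x, y_pos = rng.choice(pairs)
        _, y_neg = rng.choice(pairs)
        if y_neg != y_pos:  # keep only genuine negatives
            batch.append((x, y_pos, y_neg))
    return batch
```

A triple contributes zero error (and hence, through Eq. (8), zero gradient) once the positive response beats the negative by more than the margin.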
For nonlinear models, we use a modified version of the Siamese architecture [3], which uses two different neural networks for mapping objects in the two domains into the same d-dimensional latent space, where the inner product can be used as a measure of matching; it is trained with a similar large margin objective. Different from the original model in [3], we allow different parameters for the two mappings, to handle the domain heterogeneity. Please note that we omit the nonlinear models for shared representations [13, 18, 17], since they are essentially also inner-product based models (when used for matching) and are not designed to deal with short texts with a large vocabulary.

5.1 Data Sets
We use the learned matching function for retrieving response texts y for a given query text x, ranked purely by matching score. We consider the following two data sets:
Question-Answer: This data set contains around 20,000 traveling-related (Question, Answer) pairs collected from Baidu Zhidao (zhidao.baidu.com) and Soso Wenwen (wenwen.soso.com), two famous Chinese community QA web sites. The vocabulary size is 52,315.
Weibo-Comments: This data set contains half a million (Weibo, comment) pairs collected from Sina Weibo (weibo.com), a Chinese Twitter-like microblog service. The task is to find appropriate responses (e.g., comments) to given Weibo posts. This task is significantly harder than the Question-Answer task, since the Weibo data are usually shorter, more informal, and harder to capture with bag-of-words. The vocabulary size is 48,724 for both posts and comments.
On both data sets, we generate (x, y+, y-) triples, with y- being randomly selected. The data are randomly split into training and testing sets, and the parameters of all models (including the learned patches for DEEPMATCH) are learned on the training data. 
The hyperparameters (e.g., the latent dimensions of the low-rank models and the regularization coefficients) are tuned on a validation set (held out from the training set). We use NDCG@1 and NDCG@6 [8] on a random pool of size 6 (one positive + five negatives) to measure the performance of the different matching models.

5.2 Performance Comparison
The retrieval performances of all four models are reported in Table 2. Of the two data sets, the Question-Answer data set is relatively easy, with all four matching models improving upon random guessing. As another observation, we get a significant performance gain by introducing nonlinearity in the mapping function, but all the inner-product based matching models are outperformed by the proposed DEEPMATCH by a large margin on this data set. The story is slightly different on the Weibo-Comments data set, which is significantly more challenging than the Q-A task in that it relies more on the content of texts and is harder to capture with bag-of-words. This difficulty can hardly be handled by inner-product based methods, even with the nonlinear mappings of the SIAMESE NETWORK. In contrast, DEEPMATCH still manages to perform significantly better than all the other models.
To further understand the performances of the different matching models, we also compare the generalization ability of the two nonlinear models. We find that the SIAMESE NETWORK can achieve over 90% correct pairwise comparisons on the training set with small regularization, but generalizes relatively poorly on the test set with all the configurations we tried. This is not surprising, since the SIAMESE NETWORK has the same level of parameters (varying with the number of hidden units) as DEEPMATCH. 
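The evaluation metric above is easy to sketch for the binary-relevance setting used here (one positive among six candidates). The implementation below is our own sketch of the standard NDCG definition [8], not code from the paper:

```python
import math

def ndcg_at_k(ranked_relevance, k):
    """NDCG@k for one ranked pool, with binary relevance labels
    (1 = correct response, 0 = random negative)."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevance[:k]))
    ideal = sorted(ranked_relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```

With one positive in a pool of six, a uniformly random ranking scores NDCG@1 = 1/6 ≈ 0.167 and NDCG@6 ≈ 0.55 in expectation, which matches the RANDOM GUESS row of Table 2.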
We argue that our model has better generalization properties than the Siamese architecture at a similar model complexity.

Model            | Question-Answer      | Weibo-Comments
                 | nDCG@1   nDCG@6      | nDCG@1   nDCG@6
RANDOM GUESS     | 0.167    0.550       | 0.167    0.550
PLS              | 0.285    0.662       | 0.171    0.587
RMLS             | 0.282    0.659       | 0.165    0.553
SIAMESE NETWORK  | 0.357    0.735       | 0.175    0.574
DEEPMATCH        | 0.723    0.856       | 0.336    0.665

Table 2: The retrieval performance of matching models on the Q-A and Weibo data sets.

5.3 Model Selection
We tested different variants of the current DEEPMATCH architecture, with results reported in Figure 6 (left panel: size of the patch-induced layers; middle panel: size of the committee layer(s); right panel: number of filters per patch). There are two ways to increase the depth of the proposed model: adding patch-induced layers and adding committee layers. As shown in Figure 6 (left and middle panels), the performance of DEEPMATCH stops increasing in either way when the overall depth goes beyond 6, while training gets significantly slower with each added hidden layer. The number of neurons associated with each patch (Figure 6, right panel) follows a similar story: the performance flattens out after the number of neurons per patch reaches 3, again with training time and memory increasing significantly. As another observation about the architecture, DEEPMATCH with both linear and sigmoid activation functions in the hidden layers yields slightly but consistently better performance than with the sigmoid function alone. Our conjecture is that linear neurons provide shortcuts from low-level matching decisions to high-level composition units, and therefore facilitate the informative low-level units in determining the final matching score.

Figure 6: Choices of architecture for DEEPMATCH. 
For the left and middle panels, the numbers in parentheses stand for the number of neurons in each layer.

6 Related Work
Our model is apparently a special case of the learning-to-match models, for which much effort has gone into designing a bilinear form [1, 19, 7]. As we discussed earlier, this kind of model cannot sufficiently capture the rich and nonlinear structure of matching complicated objects. In order to introduce more modeling flexibility, there has been some work on replacing Φ(·) in Equation (1) with a nonlinear mapping, e.g., with neural networks [3] or implicitly through kernelization [6]. Another similar thread of work is the recent advances of deep learning models on multi-modal input [13, 17]. These models essentially find a joint representation of inputs in two different domains, and hence can be used to predict one side from the other. Such deep learning models however do not give a direct matching function, and cannot handle short texts with a large vocabulary.
Our work is in a sense related to the sum-product network (SPN) [4, 5, 15], especially the work in [4] that learns the deep architecture by clustering in the feature space for the image completion task. However, it is difficult to determine a regular architecture like SPN for short texts, since the structure of the matching task for short texts is not as well-defined as that for images. We therefore adopt a more traditional MLP-like architecture in this paper.
Our work is conceptually close to the dynamic pooling algorithm recently proposed by Socher et al. [16] for paraphrase identification, which is essentially a special case of matching between two homogeneous domains. Similar to our model, their model also constructs a neural network on the interaction space of two objects (sentences in their case), and outputs a measure of semantic similarity between them. 
The major differences are three-fold: 1) their model relies on a predefined compact vectorial representation of short texts, and therefore the similarity metric is not much more than a sum over the local decisions; 2) the nature of dynamic pooling leaves no room for exploring more complicated structure in the interaction space; and 3) we do not exploit syntactic structure in the current model, although the proposed architecture has the flexibility for that.

7 Conclusion and Future Work
We proposed a novel deep architecture for matching problems, inspired partially by the long thread of work on deep learning. The proposed architecture can sufficiently explore the nonlinearity and hierarchy in the matching process, and has been empirically shown to be superior to various inner-product based matching models on real-world data sets.

References
[1] B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi, O. Chapelle, and K. Weinberger. Supervised semantic indexing. In CIKM'09, pages 187–196, 2009.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[3] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proc. of Computer Vision and Pattern Recognition Conference. IEEE Press, 2005.
[4] A. Dennis and D. Ventura. Learning the architecture of sum-product networks using clustering on variables. In Advances in Neural Information Processing Systems 25.
[5] R. Gens and P. Domingos. Discriminative learning of sum-product networks. In NIPS, pages 3248–3256, 2012.
[6] D. Grangier and S. Bengio. A discriminative kernel-based model to rank images from text queries. IEEE Transactions on PAMI, 30(8):1371–1384, 2008.
[7] D. Hardoon and J. Shawe-Taylor. KCCA for different level precision in content-based image retrieval.
In Proceedings of the Third International Workshop on Content-Based Multimedia Indexing, 2003.
[8] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In SIGIR, pages 41–48, 2000.
[9] Y. LeCun, L. Bottou, G. Orr, and K. Müller. Efficient backprop. In G. Orr and K. Müller, editors, Neural Networks: Tricks of the Trade. Springer, 1998.
[10] M. Littman, S. Dumais, and T. Landauer. Automatic cross-language information retrieval using latent semantic indexing. In Cross-Language Information Retrieval, chapter 5, pages 51–62, 1998.
[11] A. K. Menon and C. Elkan. Link prediction via matrix factorization. In Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II, ECML PKDD'11, pages 437–452, 2011.
[12] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1987.
[13] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In International Conference on Machine Learning (ICML), Bellevue, USA, June 2011.
[14] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2Text: Describing images using 1 million captioned photographs. In Neural Information Processing Systems (NIPS), 2011.
[15] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In UAI, pages 337–346, 2011.
[16] R. Socher, E. Huang, J. Pennington, A. Ng, and C. Manning. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in NIPS 24, 2011.
[17] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In NIPS, pages 2231–2239, 2012.
[18] B. Wang, X. Wang, C. Sun, B. Liu, and L. Sun. Modeling semantic relevance for question-answer pairs in web social communities. In ACL, pages 1230–1238, 2010.
[19] W. Wu, H. Li, and J. Xu.
Learning query and document similarities from click-through bipartite graph with metadata. In Proceedings of the Sixth ACM International Conference on WSDM, pages 687–696, 2013.
[20] W. Wu, Z. Lu, and H. Li. Regularized mapping to latent structures and its application to web search. Technical report.