{"title": "Recurrently Controlled Recurrent Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4731, "page_last": 4743, "abstract": "Recurrent neural networks (RNNs) such as long short-term memory and gated recurrent units are pivotal building blocks across a broad spectrum of sequence modeling problems. This paper proposes a recurrently controlled recurrent network (RCRN) for expressive and powerful sequence encoding. More concretely, the key idea behind our approach is to learn the recurrent gating functions using recurrent networks. Our architecture is split into two components - a controller cell and a listener cell whereby the recurrent controller actively influences the compositionality of the listener cell. We conduct extensive experiments on a myriad of tasks in the NLP domain such as sentiment analysis (SST, IMDb, Amazon reviews, etc.), question classification (TREC), entailment classification (SNLI, SciTail), answer selection (WikiQA, TrecQA) and reading comprehension (NarrativeQA). Across all 26 datasets, our results demonstrate that RCRN not only consistently outperforms BiLSTMs but also stacked BiLSTMs, suggesting that our controller architecture might be a suitable replacement for the widely adopted stacked architecture. Additionally, RCRN achieves state-of-the-art results on several well-established datasets.", "full_text": "Recurrently Controlled Recurrent Networks\n\nYi Tay1, Luu Anh Tuan2, and Siu Cheung Hui3\n\n1,3Nanyang Technological University\n\n2Institute for Infocomm Research\n\nytay017@e.ntu.edu.sg1\n\nat.luu@i2r.a-star.edu.sg2\n\nasschui@ntu.edu.sg3\n\nAbstract\n\nRecurrent neural networks (RNNs) such as long short-term memory and gated\nrecurrent units are pivotal building blocks across a broad spectrum of sequence\nmodeling problems. This paper proposes a recurrently controlled recurrent network\n(RCRN) for expressive and powerful sequence encoding. 
More concretely, the key\nidea behind our approach is to learn the recurrent gating functions using recurrent\nnetworks. Our architecture is split into two components - a controller cell and a\nlistener cell whereby the recurrent controller actively in\ufb02uences the compositional-\nity of the listener cell. We conduct extensive experiments on a myriad of tasks in\nthe NLP domain such as sentiment analysis (SST, IMDb, Amazon reviews, etc.),\nquestion classi\ufb01cation (TREC), entailment classi\ufb01cation (SNLI, SciTail), answer\nselection (WikiQA, TrecQA) and reading comprehension (NarrativeQA). Across all\n26 datasets, our results demonstrate that RCRN not only consistently outperforms\nBiLSTMs but also stacked BiLSTMs, suggesting that our controller architecture\nmight be a suitable replacement for the widely adopted stacked architecture.\n\n1\n\nIntroduction\n\nRecurrent neural networks (RNNs) live at the heart of many sequence modeling problems. In\nparticular, the incorporation of gated additive recurrent connections is extremely powerful, leading to\nthe pervasive adoption of models such as Gated Recurrent Units (GRU) [Cho et al., 2014] or Long\nShort-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997] across many NLP applications\n[Bahdanau et al., 2014; Xiong et al., 2016; Rockt\u00e4schel et al., 2015; McCann et al., 2017]. In these\nmodels, the key idea is that the gating functions control information \ufb02ow and compositionality over\ntime, deciding how much information to read/write across time steps. This not only serves as a\nprotection against vanishing/exploding gradients but also enables greater relative ease in modeling\nlong-range dependencies.\nThere are two common ways to increase the representation capability of RNNs. Firstly, the number\nof hidden dimensions could be increased. 
Secondly, recurrent layers could be stacked on top of each other in a hierarchical fashion [El Hihi and Bengio, 1996], with each layer's input being the output of the previous, enabling hierarchical features to be captured. Notably, the wide adoption of stacked architectures across many applications [Graves et al., 2013; Sutskever et al., 2014; Wang et al., 2017; Nie and Bansal, 2017] signifies the need for designing complex and expressive encoders. Unfortunately, both strategies face limitations: the former runs the risk of overfitting and/or hitting a wall in performance, while the latter faces the inherent difficulties of going deep, such as vanishing gradients or difficulty in feature propagation across deep RNN layers [Zhang et al., 2016b].
This paper proposes Recurrently Controlled Recurrent Networks (RCRN), a new recurrent architecture and a general-purpose neural building block for sequence modeling. RCRNs are characterized by their usage of two key components: a recurrent controller cell and a listener cell. The controller cell controls the information flow and compositionality of the listener RNN. The key motivation behind RCRN is to provide expressive and powerful sequence encoding. However, unlike stacked architectures, all RNN layers operate jointly on the same hierarchical level, effectively avoiding the need to go deeper. Therefore, RCRNs provide a new alternative way of utilizing multiple RNN layers in conjunction, by allowing one RNN to control another RNN.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
As such, our key aim in this work is to show that our proposed controller-listener architecture is a viable replacement for the widely adopted stacked recurrent architecture.
To demonstrate the effectiveness of our proposed RCRN model, we conduct extensive experiments on a plethora of diverse NLP tasks where sequence encoders such as LSTMs/GRUs are highly essential. These tasks include sentiment analysis (SST, IMDb, Amazon Reviews), question classification (TREC), entailment classification (SNLI, SciTail), answer selection (WikiQA, TrecQA) and reading comprehension (NarrativeQA). Experimental results show that RCRN outperforms BiLSTMs and multi-layered/stacked BiLSTMs on all 26 datasets, suggesting that RCRNs are viable replacements for the widely adopted stacked recurrent architectures. Additionally, RCRN achieves close to state-of-the-art performance on several datasets.

2 Related Work

RNN variants such as LSTMs and GRUs are ubiquitous and indispensable building blocks in many NLP applications such as question answering [Seo et al., 2016; Wang et al., 2017], machine translation [Bahdanau et al., 2014], entailment classification [Chen et al., 2017] and sentiment analysis [Longpre et al., 2016; Huang et al., 2017]. In recent years, many RNN variants have been proposed, ranging from multi-scale models [Koutnik et al., 2014; Chung et al., 2016; Chang et al., 2017] to tree-structured encoders [Tai et al., 2015; Choi et al., 2017]. Models that are targeted at improving the internals of the RNN cell have also been proposed [Xingjian et al., 2015; Danihelka et al., 2016]. Given the importance of sequence encoding in NLP, the design of effective RNN units for this purpose remains an active area of research.
Stacking RNN layers is the most common way to improve representation power. This has been used in many highly performant models ranging from speech recognition [Graves et al., 2013] to machine reading [Wang et al., 2017].
The BCN model [McCann et al., 2017] similarly uses multiple BiLSTM layers within its architecture. Models that use shortcut/residual connections in conjunction with stacked RNN layers are also notable [Zhang et al., 2016b; Longpre et al., 2016; Nie and Bansal, 2017; Ding et al., 2018].
Notably, a recent emerging trend is to model sequences without recurrence. This is primarily motivated by the fact that recurrence is an inherent prohibitor of parallelism. To this end, many works have explored the possibility of using attention as a replacement for recurrence. In particular, self-attention [Vaswani et al., 2017] has been a popular choice. This has sparked many innovations, including general purpose encoders such as DiSAN [Shen et al., 2017] and Block Bi-DiSAN [Shen et al., 2018]. The key idea in these works is to use multi-headed self-attention and positional encodings to model temporal information.
While attention-only models may come close in performance, some domains may still require complex and expressive recurrent encoders. Moreover, we note that in [Shen et al., 2017, 2018], the scores on multiple benchmarks (e.g., SST, TREC, SNLI, MultiNLI) do not outperform (or even approach) the state-of-the-art, most of which are models that still heavily rely on bidirectional LSTMs [Zhou et al., 2016; Choi et al., 2017; McCann et al., 2017; Nie and Bansal, 2017]. While self-attentive RNN-less encoders have recently been popular, our work moves in an orthogonal and possibly complementary direction, advocating a stronger RNN unit for sequence encoding instead. Nevertheless, we note that our RCRN model outperforms DiSAN in all our experiments.
Another line of work is concerned with eliminating recurrence. SRUs (Simple Recurrent Units) [Lei and Zhang, 2017] are recently proposed networks that remove the sequential dependencies in RNNs.
SRUs can be considered a special case of Quasi-RNNs [Bradbury et al., 2016], which perform incremental pooling using pre-learned convolutional gates. A recent work, Multi-range Reasoning Units (MRU) [Tay et al., 2018b], follows the same paradigm, trading convolutional gates for features learned via expressive multi-granular reasoning. Zhang et al. [2018] proposed sentence-state LSTMs (S-LSTM), which exchange incremental reading for a single global state.

Our work proposes a new way of enhancing the representation capability of RNNs without going deep. For the first time, we propose a controller-listener architecture that uses one recurrent unit to control another recurrent unit. Our proposed RCRN consistently outperforms stacked BiLSTMs and achieves state-of-the-art results on several datasets. We outperform the above-mentioned competitors such as DiSAN, SRUs, stacked BiLSTMs and sentence-state LSTMs.

3 Recurrently Controlled Recurrent Networks (RCRN)

This section formally introduces the RCRN architecture. Our model is split into two main components: a controller cell and a listener cell. Figure 1 illustrates the model architecture.

Figure 1: High level overview of our proposed RCRN architecture.

3.1 Controller Cell

The goal of the controller cell is to learn gating functions in order to influence the target cell. In order to control the target cell, the controller cell constructs a forget gate and an output gate which are then used to influence the information flow of the listener cell. For each gate (output and forget), we use a separate RNN cell. As such, the controller cell comprises two cell states and an additional set of parameters.
The equations of the controller cell are defined as follows:

i^1_t = \sigma_s(W^1_i x_t + U^1_i h^1_{t-1} + b^1_i)  and  i^2_t = \sigma_s(W^2_i x_t + U^2_i h^2_{t-1} + b^2_i)   (1)
f^1_t = \sigma_s(W^1_f x_t + U^1_f h^1_{t-1} + b^1_f)  and  f^2_t = \sigma_s(W^2_f x_t + U^2_f h^2_{t-1} + b^2_f)   (2)
o^1_t = \sigma_s(W^1_o x_t + U^1_o h^1_{t-1} + b^1_o)  and  o^2_t = \sigma_s(W^2_o x_t + U^2_o h^2_{t-1} + b^2_o)   (3)
c^1_t = f^1_t \odot c^1_{t-1} + i^1_t \odot \sigma(W^1_c x_t + U^1_c h^1_{t-1} + b^1_c)   (4)
c^2_t = f^2_t \odot c^2_{t-1} + i^2_t \odot \sigma(W^2_c x_t + U^2_c h^2_{t-1} + b^2_c)   (5)
h^1_t = o^1_t \odot \sigma(c^1_t)  and  h^2_t = o^2_t \odot \sigma(c^2_t)   (6)

where x_t is the input to the model at time step t. W^k_*, U^k_*, b^k_* are the parameters of the model where k = {1, 2} and * = {i, f, o}. \sigma_s is the sigmoid function and \sigma is the tanh nonlinearity. \odot is the Hadamard product. The controller RNN has two cell states denoted as c^1 and c^2 respectively. h^1_t, h^2_t are the outputs of the unidirectional controller cell at time step t. Next, we consider a bidirectional adaptation of the controller cell. Let Equations (1-6) be represented by the function CT(); the bidirectional adaptation is represented as:

\overrightarrow{h^1_t}, \overrightarrow{h^2_t} = \overrightarrow{CT}(h^1_{t-1}, h^2_{t-1}, x_t),   t = 1, ..., \ell   (7)
\overleftarrow{h^1_t}, \overleftarrow{h^2_t} = \overleftarrow{CT}(h^1_{t+1}, h^2_{t+1}, x_t),   t = M, ..., 1   (8)
h^1_t = [\overrightarrow{h^1_t}; \overleftarrow{h^1_t}]  and  h^2_t = [\overrightarrow{h^2_t}; \overleftarrow{h^2_t}]   (9)

The outputs of the bidirectional controller cell are h^1_t, h^2_t for time step t. These hidden outputs act as gates for the listener cell.

3.2 Listener Cell

The listener cell is another recurrent cell. The final output of the RCRN is generated by the listener cell, which is influenced by the controller cell. First, the listener cell uses a base recurrent model to process the sequence input. The equations of this base recurrent model are defined as follows:

i^3_t = \sigma_s(W^3_i x_t + U^3_i h^3_{t-1} + b^3_i)   (10)
f^3_t = \sigma_s(W^3_f x_t + U^3_f h^3_{t-1} + b^3_f)   (11)
o^3_t = \sigma_s(W^3_o x_t + U^3_o h^3_{t-1} + b^3_o)   (12)
c^3_t = f^3_t \odot c^3_{t-1} + i^3_t \odot \sigma(W^3_c x_t + U^3_c h^3_{t-1} + b^3_c)   (13)
h^3_t = o^3_t \odot \sigma(c^3_t)   (14)

Similarly, a bidirectional adaptation is used, obtaining h^3_t = [\overrightarrow{h^3_t}; \overleftarrow{h^3_t}]. Next, using h^1_t, h^2_t (outputs of the controller cell), we define another recurrent operation as follows:

c^4_t = \sigma_s(h^1_t) \odot c^4_{t-1} + (1 - \sigma_s(h^1_t)) \odot h^3_t   (15)
h^4_t = \sigma_s(h^2_t) \odot c^4_t   (16)

where c^j_t, h^j_t with j = {3, 4} are the cell and hidden states at time step t. W^3_*, U^3_* are the parameters of the listener cell where * = {i, f, o}. Note that h^1_t, h^2_t are the outputs of the controller cell. In this formulation, \sigma_s(h^1_t) acts as the forget gate for the listener cell. Likewise, \sigma_s(h^2_t) acts as the output gate for the listener.

3.3 Overall RCRN Architecture, Variants and Implementation

Intuitively, the overall architecture of the RCRN model can be explained as follows. Firstly, the controller cell can be thought of as two BiRNN models whose hidden states are used as the forget and output gates for another recurrent model, i.e., the listener. The listener uses a single BiRNN model for sequence encoding and then allows this representation to be altered by listening to the controller. An alternative interpretation of our model architecture is that it is essentially a 'recurrent-over-recurrent' model. Clearly, the formulation we have used above uses BiLSTMs as the atomic building block for RCRN. Hence, we note that it is also possible to have a simplified variant1 of RCRN that uses GRUs as the atomic block, which we found to perform slightly better on certain datasets.

Cuda-level Optimization  For efficiency purposes, we use the CUDNN optimized version of the base recurrent unit (LSTMs/GRUs). Additionally, note that the final recurrent cell (Equation (15)) can be subject to CUDA-level optimization2 following simple recurrent units (SRU) [Lei and Zhang, 2017]. The key idea is that this operation can be performed along the dimension axis, enabling greater parallelization on the GPU. For the sake of brevity, we refer interested readers to [Lei and Zhang, 2017]. Note that this form of cuda-level optimization was also performed in the Quasi-RNN model [Bradbury et al., 2016], which effectively subsumes the SRU model.

On Parameter Cost and Memory Efficiency  Note that a single RCRN model is equivalent to a stacked BiLSTM of 3 layers. This is clear when we consider how two controller BiRNNs are used to control a single listener BiRNN.
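As a rough sanity check on this parameter comparison, the count can be sketched in a few lines of Python (a sketch under the textbook LSTM parameterization; the dimension values below are illustrative, not the exact experimental settings):

```python
def lstm_params(input_dim, hidden_dim):
    # One direction of a standard LSTM: 4 gate blocks (i, f, o, candidate),
    # each with input weights, recurrent weights and a bias vector.
    return 4 * (hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim)

def bilstm_params(input_dim, hidden_dim):
    # Forward plus backward direction.
    return 2 * lstm_params(input_dim, hidden_dim)

# Illustrative sizes only.
x_dim, d = 300, 200
single = bilstm_params(x_dim, d)
# RCRN: two controller BiRNN cells plus one listener BiRNN, all reading x_t.
rcrn = 3 * bilstm_params(x_dim, d)
print(rcrn / single)  # -> 3.0
```

Under this counting, RCRN is exactly three single BiLSTMs because all three BiRNNs read the raw input; in a conventional stack, where upper layers read 2d-dimensional inputs, the counts agree only approximately.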
As such, for our experiments, when considering only the encoder\nand keeping all other components constant, 3L-BiLSTM has equal parameters to RCRN while RCRN\nand 3L-BiLSTM are approximately three times larger than BiLSTM.\n\n4 Experiments\n\nThis section discusses the overall empirical evaluation of our proposed RCRN model.\n\n1We omit technical descriptions due to the lack of space.\n2We adapt the CUDA kernel as a custom Tensor\ufb02ow op in our experiments. While the authors of SRU release\ntheir cuda-op at https://github.com/taolei87/sru, we use a third-party open-source Tensor\ufb02ow version\nwhich can be found at https://github.com/JonathanRaiman/tensorflow_qrnn.git.\n\n4\n\n\f4.1 Tasks and Datasets\n\nIn order to verify the effectiveness of our proposed RCRN architecture, we conduct extensive\nexperiments across several tasks3 in the NLP domain.\n\nSentiment Analysis Sentiment analysis is a text classi\ufb01cation problem in which the goal is to\ndetermine the polarity of a given sentence/document. We conduct experiments on both sentence and\ndocument level. More concretely, we use 16 Amazon review datasets from [Liu et al., 2017], the\nwell-established Stanford Sentiment TreeBank (SST-5/SST-2) [Socher et al., 2013] and the IMDb\nSentiment dataset [Maas et al., 2011]. All tasks are binary classi\ufb01cation tasks with the exception of\nSST-5. The metric is the accuracy score.\n\nQuestion Classi\ufb01cation The goal of this task is to classify questions into \ufb01ne-grained categories\nsuch as number or location. We use the TREC question classi\ufb01cation dataset [Voorhees et al., 1999].\nThe metric is the accuracy score.\n\nEntailment Classi\ufb01cation This is a well-established and popular task in the \ufb01eld of natural lan-\nguage understanding and inference. Given two sentences s1 and s2, the goal is to determine if s2\nentails or contradicts s1. 
We use two popular benchmark datasets: the Stanford Natural Language Inference (SNLI) corpus [Bowman et al., 2015] and the SciTail (Science Entailment) dataset [Khot et al., 2018]. This is a pairwise classification problem in which the metric is also the accuracy score.

Answer Selection  This is a standard problem in information retrieval and learning-to-rank. Given a question, the task at hand is to rank candidate answers. We use the popular WikiQA [Yang et al., 2015] and TrecQA [Wang et al., 2007] datasets. For TrecQA, we use the cleaned setting as denoted by Rao et al. [2016]. The evaluation metrics are the MAP (Mean Average Precision) and Mean Reciprocal Rank (MRR) ranking metrics.

Reading Comprehension  This task involves reading documents and answering questions about these documents. We use the recent NarrativeQA [Kočiský et al., 2017] dataset, which involves reasoning and answering questions over story summaries. We follow the original paper and report scores on BLEU-1, BLEU-4, Meteor and Rouge-L.

4.2 Task-Specific Model Architectures and Implementation Details

In this section, we describe the task-specific model architectures for each task.

Classification Model  This architecture is used for all text classification tasks (sentiment analysis and question classification datasets). We use 300D GloVe [Pennington et al., 2014] vectors with 600D CoVe [McCann et al., 2017] vectors as pretrained embedding vectors. An optional character-level word representation is also added (constructed with a standard BiGRU model). The output of the embedding layer is passed into the RCRN model directly without using any projection layer. Word embeddings are not updated during training. Given the hidden output states of the 200d RCRN cell, we take the concatenation of the max, mean and min pooling of all hidden states to form the final feature vector.
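For concreteness, the pooled feature construction just described can be sketched as follows (pure Python for illustration; hidden states are taken as a list of T vectors of size d, and the names are ours, not the exact implementation):

```python
def pooled_features(hidden_states):
    """Concatenate max, mean and min pooling over time.

    hidden_states: list of T vectors (lists), each of size d.
    Returns a single vector of size 3 * d.
    """
    T = len(hidden_states)
    d = len(hidden_states[0])
    max_pool = [max(h[i] for h in hidden_states) for i in range(d)]
    mean_pool = [sum(h[i] for h in hidden_states) / T for i in range(d)]
    min_pool = [min(h[i] for h in hidden_states) for i in range(d)]
    return max_pool + mean_pool + min_pool

# Toy example with T = 3 time steps and d = 2 dimensions.
feats = pooled_features([[1.0, -2.0], [3.0, 0.0], [-1.0, 2.0]])
# -> [3.0, 2.0, 1.0, 0.0, -1.0, -2.0]
```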
This feature vector is passed into a single dense layer with ReLU activations of 200d dimensions. The output of this layer is then passed into a softmax layer for classification. This model optimizes the cross entropy loss. We train this model using Adam [Kingma and Ba, 2014] and the learning rate is tuned amongst {0.001, 0.0003, 0.0004}.

Entailment Model  This architecture is used for entailment tasks. This is a pairwise classification model with two input sequences. Similar to the singleton classification model, we utilize the identical input encoder (GloVe, CoVe and character RNN) but include an additional part-of-speech (POS tag) embedding. We pass the input representation into a two-layer highway network [Srivastava et al., 2015] of 300 hidden dimensions before passing it into the RCRN encoder. The feature representation of s1 and s2 is the concatenation of the max and mean pooling of the RCRN hidden outputs. To compare s1 and s2, we pass [s1, s2, s1 ⊙ s2, s1 − s2] into a two-layer highway network. This output is then passed into a softmax layer for classification. We train this model using Adam and the learning rate is tuned amongst {0.001, 0.0003, 0.0004}. We mainly focus on the encoder-only setting which does not allow cross sentence attention. This is a commonly tested setting on the SNLI dataset.

3While we agree that other tasks such as language modeling or NMT would be interesting to investigate, we could not muster enough GPU resources to conduct any extra experiments. We leave this for future work.

Ranking Model  This architecture is used for the ranking tasks (i.e., answer selection). We use the model architecture from Attentive Pooling BiLSTMs (AP-BiLSTM) [dos Santos et al., 2016] as our base and swap the RNN encoder with our RCRN encoder. The dimensionality is set to 200.
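The training objective for this ranking model, cosine scoring with a pairwise hinge loss as described next, can be sketched as follows (an illustrative sketch with our own names; the 0.1 margin follows the setting reported in this section):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_hinge(question, positive, negatives, margin=0.1):
    # Each sampled negative answer should score at least `margin` below
    # the positive answer; violations contribute linearly to the loss.
    s_pos = cosine(question, positive)
    return sum(max(0.0, margin - s_pos + cosine(question, n)) for n in negatives)

# Toy encoded vectors; in the model these would be RCRN-pooled representations.
q = [1.0, 0.0]
loss = pairwise_hinge(q, positive=[1.0, 0.1], negatives=[[0.0, 1.0], [1.0, 0.0]])
```

In training, `negatives` would hold the n = 6 sampled negative answers per question described below.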
The similarity scoring function is the cosine similarity and the objective function is the pairwise hinge loss with a margin of 0.1. We use negative sampling of n = 6 to train our model. We train our model using Adadelta [Zeiler, 2012] with a learning rate of 0.2.

Reading Comprehension Model  We use R-NET [Wang et al., 2017] as the base model. Since R-NET uses three bidirectional GRU layers as the encoder, we replaced this stacked BiGRU layer with RCRN. For fairness, we use the GRU variant of RCRN instead. The dimensionality of the encoder is set to 75. We train both models using Adam with a learning rate of 0.001.
For all datasets, we include two additional ablative baselines, swapping the RCRN with (1) a standard BiLSTM model and (2) a stacked BiLSTM of 3 layers (3L-BiLSTM). This is to fairly observe the impact of different encoder models based on the same overall model framework.

4.3 Overall Results

This section discusses the overall results of our experiments.

Table 1: Results on the Amazon Reviews dataset.
† are models implemented by us.

Dataset/Model  BiLSTM  2L-BiLSTM  SLSTM  BiLSTM†  3L-BiLSTM†  RCRN
Camera  90.0  88.1  87.3  87.1  89.7  90.5
Video  86.8  85.2  87.5  84.7  87.8  88.5
Health  86.5  85.9  85.5  85.5  89.0  90.5
Music  82.0  80.5  83.5  78.7  85.7  86.0
Kitchen  84.5  83.8  81.7  82.2  84.5  86.0
DVD  85.5  84.8  84.0  83.7  86.0  86.8
Toys  85.3  85.8  87.5  85.7  90.5  90.8
Baby  86.3  85.5  85.0  84.5  88.5  89.0
Books  83.4  82.8  86.0  82.1  87.2  88.0
IMDB  87.2  86.6  86.5  86.0  88.0  89.8
MR  76.2  76.0  77.7  75.7  77.7  79.0
Apparel  85.8  86.4  88.0  86.1  89.2  90.5
Magazines  93.8  92.9  93.7  92.6  92.5  94.8
Electronics  83.3  82.3  83.5  82.5  87.0  89.0
Sports  85.8  84.8  85.5  84.0  86.5  88.0
Software  87.8  87.0  88.5  86.7  90.3  90.8
Macro Avg  85.6  84.9  85.7  84.3  87.5  88.6

Model/Reference  Acc
MVN [Guo et al., 2017]  51.5
DiSAN [Shen et al., 2017]  51.7
DMN [Kumar et al., 2016]  52.1
LSTM-CNN [Zhou et al., 2016]  52.4
NTI [Yu and Munkhdalai, 2017]  53.1
BCN [McCann et al., 2017]  53.7
BCN + ELMo [Peters et al., 2018]  54.7
BiLSTM  51.3
3L-BiLSTM  52.6
RCRN  54.3

Table 2: Results on Sentiment Analysis on SST-5.

Model/Reference  Acc
P-LSTM [Wieting et al., 2015]  89.2
CT-LSTM [Looks et al., 2017]  89.4
TE-LSTM [Huang et al., 2017]  89.6
NSE [Munkhdalai and Yu, 2016]  89.7
BCN [McCann et al., 2017]  90.3
BMLSTM [Radford et al., 2017]  91.8
BiLSTM  89.7
3L-BiLSTM  90.0
RCRN  90.6

Table 3: Results on Sentiment Analysis on SST-2.

Model/Reference  Acc
Res. BiLSTM [Longpre et al., 2016]  90.1
4L-QRNN [Bradbury et al., 2016]  91.4
BCN [McCann et al., 2017]  91.8
oh-LSTM [Johnson and Zhang, 2016]  91.9
TRNN [Dieng et al., 2016]  93.8
Virtual [Miyato et al., 2016]  94.1
BiLSTM  90.9
3L-BiLSTM  91.8
RCRN  92.8

Table 4: Results on IMDb binary sentiment classification.

Model/Reference  Acc
CNN-MC [Kim, 2014]  92.2
SRU [Lei and Zhang, 2017]  93.9
DSCNN [Zhang et al., 2016a]  95.4
DC-BiLSTM [Ding et al., 2018]  95.6
BCN [McCann et al., 2017]  95.8
LSTM-CNN [Zhou et al., 2016]  96.1
BiLSTM  95.8
3L-BiLSTM  95.4
RCRN  96.2

Table 5: Results on TREC question classification.

Model/Reference  Acc
Multi-head [Vaswani et al., 2017]  84.2
Att. Bi-SRU [Lei and Zhang, 2017]  84.8
DiSAN [Shen et al., 2017]  85.6
Shortcut [Nie and Bansal, 2017]  85.7
Gumbel LSTM [Choi et al., 2017]  86.0
Dynamic Meta Emb [Kiela et al., 2018]  86.7
BiLSTM  85.5
3L-BiLSTM  85.1
RCRN  85.8

Table 6: Results on SNLI dataset.

Model/Reference  Acc
ESIM [Chen et al., 2017]  70.6
DecompAtt [Parikh et al., 2016]  72.3
DGEM [Khot et al., 2018]  77.3
CAFE [Tay et al., 2017]  83.3
CSRAN [Tay et al., 2018a]  86.7
OpenAI GPT [Radford et al., 2018]  88.3
BiLSTM  80.1
3L-BiLSTM  79.6
RCRN  81.1

Table 7: Results on SciTail dataset.

Model  WikiQA MAP / MRR  TrecQA MAP / MRR
BiLSTM  68.5 / 69.8  72.4 / 82.5
3L-BiLSTM  69.3 / 71.3  73.0 / 83.6
RCRN  71.1 / 72.3  75.4 / 85.5
AP-BiLSTM  63.9 / 69.9  75.1 / 80.0
AP-3L-BiLSTM  69.8 / 71.3  73.3 / 83.4
AP-RCRN  72.4 / 73.7  77.9 / 88.2

Table 8: Results on Answer Retrieval (WikiQA and TrecQA).

Model  Bleu-1  Bleu-4  Meteor  Rouge-L
Seq2Seq  16.1  1.40  4.2  13.3
ASR  23.5  5.90  8.0  23.3
BiDAF  33.7  15.5  15.4  36.3
R-NET  36.7  20.3  18.0  34.9
RCRN  38.1  21.8  18.1  38.3

Table 9: Results on Reading Comprehension (NarrativeQA).

Sentiment Analysis  On the 16 review datasets (Table 1) from [Liu et al., 2017; Zhang et al., 2018], our proposed RCRN architecture achieves the highest score on all 16 datasets, outperforming the existing state-of-the-art model, sentence-state LSTMs (SLSTM) [Zhang et al., 2018]. The macro average performance gain over BiLSTMs (+4%) and Stacked (2 X BiLSTM) (+3.4%) is also notable.
On the same architecture, our RCRN outperforms the ablative baselines BiLSTM by +2.9% and 3L-BiLSTM by +1.1% on average across the 16 datasets.
Results on SST-5 (Table 2) and SST-2 (Table 3) are also promising. More concretely, our RCRN architecture achieves close to state-of-the-art results on SST-5 and SST-2. RCRN also outperforms many strong baselines such as DiSAN [Shen et al., 2017], a self-attentive model, and the Bi-Attentive Classification Network (BCN) [McCann et al., 2017], which also uses CoVe vectors. On SST-2, strong baselines such as Neural Semantic Encoders [Munkhdalai and Yu, 2016] and, similarly, the BCN model are also outperformed by our RCRN model.
Finally, on the IMDb sentiment classification dataset (Table 4), RCRN achieved 92.8% accuracy. Our proposed RCRN outperforms Residual BiLSTMs [Longpre et al., 2016], 4-layered Quasi-Recurrent Neural Networks (QRNN) [Bradbury et al., 2016] and the BCN model, which can be considered very competitive baselines. RCRN also outperforms the ablative baselines BiLSTM (+1.9%) and 3L-BiLSTM (+1%).

Question Classification  Our results on the TREC question classification dataset (Table 5) are also promising. RCRN achieved a state-of-the-art score of 96.2% on this dataset. A notable baseline is the Densely Connected BiLSTM [Ding et al., 2018], a deep residual stacked BiLSTM model which RCRN outperforms (+0.6%). Our model also outperforms BCN (+0.4%) and SRU (+2.3%). Our ablative BiLSTM baselines achieve reasonably high scores, possibly due to CoVe embeddings. However, our RCRN can still further increase the performance score.

Entailment Classification  Results on entailment classification are also promising. On SNLI (Table 6), RCRN achieves 85.8% accuracy, which is competitive with Gumbel LSTM. However, RCRN outperforms a wide range of baselines, including self-attention based models such as multi-head [Vaswani et al., 2017] and DiSAN [Shen et al., 2017].
There is also a performance gain of +1% over Bi-SRU even though our model does not use attention at all. RCRN also outperforms shortcut stacked encoders, which use a series of BiLSTMs connected by shortcut layers. Post review, as per reviewer request, we experimented with adding cross sentence attention, in particular adding the attention of Parikh et al. [2016] on top of 3L-BiLSTM and RCRN. We found that they performed comparably (both at ≈ 87.0). We did not have the resources to experiment further, even though intuitively incorporating different/newer variants of attention [Kim et al., 2018; Tay et al., 2018a; Chen et al., 2017] and/or ELMo [Peters et al., 2018] could raise the score further. However, we hypothesize that cross sentence attention forces less reliance on the encoder, which is why stacked BiLSTMs and RCRNs perform similarly in this setting.
The results on SciTail similarly show that RCRN is more effective than BiLSTM (+1%). Moreover, RCRN outperforms several baselines in [Khot et al., 2018], including models that use cross sentence attention such as DecompAtt [Parikh et al., 2016] and ESIM [Chen et al., 2017]. However, it still falls short of recent state-of-the-art models such as OpenAI's Generative Pretrained Transformer [Radford et al., 2018].

Answer Selection  Results on the answer selection task (Table 8) show that RCRN leads to considerable improvements on both the WikiQA and TrecQA datasets. We investigate two settings. In the first, we reimplement AP-BiLSTM and swap the BiLSTM for RCRN encoders. In the second, we completely remove all attention layers from both models to test the ability of the standalone encoder. Without attention, RCRN gives an improvement of ≈ +2% on both datasets. With attentive pooling, RCRN maintains a ≈ +2% improvement in terms of MAP score. However, the gains on MRR are greater (≈ +4-7%). Notably, the AP-RCRN model outperforms the official results reported in [dos Santos et al., 2016].
Overall, we observe that RCRN is much stronger than BiLSTMs and 3L-BiLSTMs on this\ntask.\n\nReading Comprehension Results (Table 9) show that enhancing R-NET with RCRN can lead\nto considerable improvements. This leads to an improvement of \u2248 1% \u2212 2% on all four metrics.\nNote that our model only uses a single layered RCRN while R-NET uses 3 layered BiGRUs. This\nempirical evidence might suggest that RCRN is a better way to utilize multiple recurrent layers.\n\nOverall Results Across all 26 datasets, RCRN outperforms not only standard BiLSTMs but also\n3L-BiLSTMs which have approximately equal parameterization. 3L-BiLSTMs were overall better\nthan BiLSTMs but lose out on a minority of datasets. RCRN outperforms a wide range of competitive\nbaselines such as DiSAN, Bi-SRUs, BCN and LSTM-CNN, etc. We achieve (close to) state-of-the-art\nperformance on SST, TREC question classi\ufb01cation and 16 Amazon review datasets.\n\n4.4 Runtime Analysis\n\nThis section aims to get a benchmark on model performance with respect to model ef\ufb01ciency. In order\nto do that, we benchmark RCRN along with BiLSTMs and 3 layered BiLSTMs (with and without\nCUDNN optimization) on different sequence lengths (i.e., 16, 32, 64, 128, 256). We use the IMDb\nsentiment task. We use the same standard hardware (a single Nvidia GTX1070 card) and an identical\noverarching model architecture. The dimensionality of the model is set to 200 with a \ufb01xed batch size\nof 32. Finally, we also benchmark a CUDA optimized adaptation of RCRN which has been described\nearlier (Section 3.3).\nTable 10 reports training/inference times of all benchmarked models. The fastest model is naturally\nthe 1 layer BiLSTM (CUDNN). Intuitively, the speed of RCRN should be roughly equivalent to using\n3 BiLSTMs. Surprisingly, we found that the CUDA optimized RCRN performs consistently slightly\nfaster than the 3 layer BiLSTM (CUDNN). 
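A per-epoch timing harness of the kind used to produce these numbers can be as simple as the following sketch (illustrative only; here `step_fn` stands in for one training or inference step of the benchmarked model on a batch):

```python
import time

def time_epoch(step_fn, num_batches):
    # Wall-clock seconds for one epoch of `num_batches` calls to `step_fn`.
    start = time.perf_counter()
    for _ in range(num_batches):
        step_fn()
    return time.perf_counter() - start

# Stand-in workload in place of a real model step on a batch of 32 sequences.
elapsed = time_epoch(lambda: sum(i * i for i in range(10000)), num_batches=10)
```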
At the very least, RCRN provides efficiency comparable to
stacked BiLSTMs, and empirically we show that there is nothing to lose in this respect. However,
we note that CUDA-level optimizations have to be performed to achieve this. Finally, the non-CUDNN
optimized BiLSTM and stacked BiLSTMs are also provided for reference.

                                   Training Time (seconds/epoch)    Inference (seconds/epoch)
Model                              16    32    64    128   256      16    32    64    128   256
3 layer BiLSTM                     29    50    113   244   503      12    20    38    72    150
BiLSTM                             18    30    63    131   272      9     15    28    52    104
1 layer BiLSTM (CUDNN)             5     6     9     14    26       2     3     4     6     10
3 layer BiLSTM (CUDNN)             10    14    23    42    80       4     5     9     16    32
RCRN (CUDNN)                       19    29    53    101   219      8     12    23    41    78
RCRN (CUDNN + CUDA optimized)      10    13    21    40    78       4     5     8     15    29

Table 10: Training and Inference times on the IMDb binary sentiment classification task with varying sequence
lengths.

5 Conclusion and Future Directions

We proposed Recurrently Controlled Recurrent Networks (RCRN), a new recurrent architecture
and encoder for a myriad of NLP tasks. RCRN operates in a novel controller-listener architecture
which uses RNNs to learn the gating functions of another RNN. We apply RCRN to a potpourri
of NLP tasks and achieve promising/highly competitive results on all tasks and 26 benchmark
datasets. Overall findings suggest that our controller-listener architecture is more effective than
stacking RNN layers. Moreover, RCRN remains equally (or slightly more) efficient compared to
stacked RNNs of approximately equal parameterization. There are several potentially interesting
directions for further investigating RCRNs: firstly, RCRNs controlling other RCRNs,
and secondly, RCRNs in other domains where recurrent models are also prevalent
for sequence modeling.
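To make the controller-listener idea concrete, the following is a minimal, self-contained NumPy sketch. It illustrates only the core mechanism (a listener RNN whose gates are produced, per time step, by controller RNNs) and is not the exact formulation from Section 3; the cell structure and all names here are our own simplifications.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SimpleRNNCell:
    """Plain tanh RNN cell, used here as a 'controller'."""
    def __init__(self, d_in, d_h, rng):
        s = 1.0 / np.sqrt(d_h)
        self.W = rng.uniform(-s, s, (d_h, d_in))
        self.U = rng.uniform(-s, s, (d_h, d_h))
        self.b = np.zeros(d_h)

    def step(self, x, h):
        return np.tanh(self.W @ x + self.U @ h + self.b)

def rcrn_sketch(xs, d_h, seed=0):
    """Illustrative controller-listener encoder (NOT the paper's exact cell).

    Two controller RNNs read the input sequence and emit, at each step,
    the forget and output gates that govern the listener's cell state,
    so the gating functions are themselves recurrent.
    """
    rng = np.random.default_rng(seed)
    d_in = xs[0].shape[0]
    ctrl_f = SimpleRNNCell(d_in, d_h, rng)      # controls the forget gate
    ctrl_o = SimpleRNNCell(d_in, d_h, rng)      # controls the output gate
    W_z = rng.uniform(-0.1, 0.1, (d_h, d_in))   # listener's candidate transform

    hf, ho, c = np.zeros(d_h), np.zeros(d_h), np.zeros(d_h)
    outputs = []
    for x in xs:
        hf = ctrl_f.step(x, hf)                 # controller states evolve over time
        ho = ctrl_o.step(x, ho)
        f = sigmoid(hf)                         # recurrently computed gate
        z = np.tanh(W_z @ x)                    # listener candidate state
        c = f * c + (1.0 - f) * z               # gated additive update
        outputs.append(sigmoid(ho) * np.tanh(c))
    return outputs
```

Because the gates depend on controller hidden states rather than on static projections of the current input, the listener's compositionality is actively modulated over time, which is the intuition behind preferring this design over simply stacking layers.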
The source code of our model can be found at https://github.com/
vanzytay/NIPS2018_RCRN.

6 Acknowledgements

We thank the anonymous reviewers and area chair from NeurIPS 2018 for their constructive and
high-quality feedback.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated
corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21,
2015, pages 632–642, 2015.

James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-recurrent neural
networks. CoRR, abs/1611.01576, 2016.

Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael
Witbrock, Mark A Hasegawa-Johnson, and Thomas S Huang. Dilated recurrent neural networks.
In Advances in Neural Information Processing Systems, pages 76–86, 2017.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced LSTM
for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long
Papers, pages 1657–1668, 2017.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for
statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

Jihun Choi, Kang Min Yoo, and Sang-goo Lee. Unsupervised learning of task-specific tree structures
with tree-lstms. arXiv preprint arXiv:1707.02786, 2017.

Junyoung Chung, Sungjin Ahn, and Yoshua Bengio.
Hierarchical multiscale recurrent neural networks.
arXiv preprint arXiv:1609.01704, 2016.

Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. Associative long
short-term memory. In Proceedings of the 33rd International Conference on Machine Learning,
ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1986–1994, 2016.

Adji B Dieng, Chong Wang, Jianfeng Gao, and John Paisley. Topicrnn: A recurrent neural network
with long-range semantic dependency. arXiv preprint arXiv:1611.01702, 2016.

Zixiang Ding, Rui Xia, Jianfei Yu, Xiang Li, and Jian Yang. Densely connected bidirectional lstm
with applications to sentence classification. arXiv preprint arXiv:1802.00889, 2018.

Cícero Nogueira dos Santos, Ming Tan, Bing Xiang, and Bowen Zhou. Attentive pooling networks.
CoRR, abs/1602.03609, 2016.

Salah El Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term dependencies.
In Advances in neural information processing systems, pages 493–499, 1996.

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent
neural networks. In Acoustics, speech and signal processing (icassp), 2013 ieee international
conference on, pages 6645–6649. IEEE, 2013.

Hongyu Guo, Colin Cherry, and Jiang Su. End-to-end multi-view networks for text classification.
arXiv preprint arXiv:1704.05907, 2017.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):
1735–1780, 1997.

Minlie Huang, Qiao Qian, and Xiaoyan Zhu. Encoding syntactic knowledge in neural networks for
sentiment classification. ACM Transactions on Information Systems (TOIS), 35(3):26, 2017.

Rie Johnson and Tong Zhang. Supervised and semi-supervised text categorization using lstm for
region embeddings. arXiv preprint arXiv:1602.02373, 2016.

Tushar Khot, Ashish Sabharwal, and Peter Clark.
Scitail: A textual entailment dataset from science
question answering. In AAAI, 2018.

Douwe Kiela, Changhan Wang, and Kyunghyun Cho. Context-attentive embeddings for improved
sentence representations. arXiv preprint arXiv:1804.07983, 2018.

Seonhoon Kim, Jin-Hyuk Hong, Inho Kang, and Nojun Kwak. Semantic sentence matching with
densely-connected recurrent and co-attentive information. arXiv preprint arXiv:1805.11360, 2018.

Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882,
2014.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR,
abs/1412.6980, 2014.

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis,
and Edward Grefenstette. The narrativeqa reading comprehension challenge. arXiv preprint
arXiv:1712.07040, 2017.

Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. A clockwork rnn. arXiv
preprint arXiv:1402.3511, 2014.

Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor
Zhong, Romain Paulus, and Richard Socher. Ask me anything: Dynamic memory networks for
natural language processing. In International Conference on Machine Learning, pages 1378–1387,
2016.

Tao Lei and Yu Zhang. Training rnns as fast as cnns. arXiv preprint arXiv:1709.02755, 2017.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Adversarial multi-task learning for text classification.
arXiv preprint arXiv:1704.05742, 2017.

Shayne Longpre, Sabeek Pradhan, Caiming Xiong, and Richard Socher. A way out of the odyssey:
Analyzing and combining recent insights for lstms. arXiv preprint arXiv:1611.05104, 2016.

Moshe Looks, Marcello Herreshoff, DeLesley Hutchins, and Peter Norvig. Deep learning with
dynamic computation graphs.
arXiv preprint arXiv:1702.02181, 2017.

Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher
Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting
of the association for computational linguistics: Human language technologies-volume 1, pages
142–150. Association for Computational Linguistics, 2011.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation:
Contextualized word vectors. In Advances in Neural Information Processing Systems, pages
6297–6308, 2017.

Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised
text classification. arXiv preprint arXiv:1605.07725, 2016.

Tsendsuren Munkhdalai and Hong Yu. Neural semantic encoders. CoRR, abs/1607.04315, 2016.

Yixin Nie and Mohit Bansal. Shortcut-stacked sentence encoders for multi-domain inference. In
Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP, RepE-
val@EMNLP 2017, Copenhagen, Denmark, September 8, 2017, pages 41–45, 2017.

Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention
model for natural language inference. In Proceedings of the 2016 Conference on Empirical
Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016,
pages 2249–2255, 2016.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word
representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special
Interest Group of the ACL, pages 1532–1543, 2014.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and
Luke Zettlemoyer. Deep contextualized word representations.
arXiv preprint arXiv:1802.05365,
2018.

Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to generate reviews and discovering
sentiment. arXiv preprint arXiv:1704.01444, 2017.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language under-
standing by generative pre-training. 2018.

Jinfeng Rao, Hua He, and Jimmy J. Lin. Noise-contrastive estimation for answer selection with deep
neural networks. In Proceedings of the 25th ACM International on Conference on Information
and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016, pages
1913–1916, 2016.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom.
Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664, 2015.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention
flow for machine comprehension. arXiv preprint arXiv:1611.01603, 2016.

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. Disan:
Directional self-attention network for rnn/cnn-free language understanding. arXiv preprint
arXiv:1709.04696, 2017.

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. Bi-directional block self-
attention for fast and memory-efficient sequence modeling. arXiv preprint arXiv:1804.00857,
2018.

Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng,
and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment
treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language
Processing, pages 1631–1642, 2013.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. CoRR,
abs/1505.00387, 2015.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le.
Sequence to sequence learning with neural networks.
In Advances in neural information processing systems, pages 3104–3112, 2014.

Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations
from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.

Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. A compare-propagate architecture with alignment
factorization for natural language inference. arXiv preprint arXiv:1801.00102, 2017.

Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. Co-stack residual affinity networks with multi-level
attention refinement for matching text sequences. arXiv preprint arXiv:1810.02938, 2018a.

Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. Multi-range reasoning for machine comprehension.
arXiv preprint arXiv:1803.09074, 2018b.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information
Processing Systems, pages 6000–6010, 2017.

Ellen M Voorhees et al. The trec-8 question answering track report. In Trec, volume 99, pages 77–82,
1999.

Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. What is the jeopardy model? A quasi-
synchronous grammar for QA. In EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference
on Empirical Methods in Natural Language Processing and Computational Natural Language
Learning, June 28-30, 2007, Prague, Czech Republic, pages 22–32, 2007.

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching networks
for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 189–198,
2017.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. Towards universal paraphrastic
sentence embeddings.
arXiv preprint arXiv:1511.08198, 2015.

SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo.
Convolutional lstm network: A machine learning approach for precipitation nowcasting. In
Advances in neural information processing systems, pages 802–810, 2015.

Caiming Xiong, Victor Zhong, and Richard Socher. Dynamic coattention networks for question
answering. CoRR, abs/1611.01604, 2016.

Yi Yang, Wen-tau Yih, and Christopher Meek. Wikiqa: A challenge dataset for open-domain question
answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language
Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 2013–2018, 2015.

Hong Yu and Tsendsuren Munkhdalai. Neural tree indexers for text understanding. In Proceedings of
the 15th Conference of the European Chapter of the Association for Computational Linguistics,
EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers, pages 11–21, 2017.

Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701,
2012.

Rui Zhang, Honglak Lee, and Dragomir Radev. Dependency sensitive convolutional neural networks
for modeling sentences and documents. arXiv preprint arXiv:1611.02361, 2016a.

Yu Zhang, Guoguo Chen, Dong Yu, Kaisheng Yaco, Sanjeev Khudanpur, and James Glass. Highway
long short-term memory rnns for distant speech recognition. In Acoustics, Speech and Signal
Processing (ICASSP), 2016 IEEE International Conference on, pages 5755–5759. IEEE, 2016b.

Yue Zhang, Qi Liu, and Linfeng Song. Sentence-state lstm for text representation. arXiv preprint
arXiv:1805.02474, 2018.

Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu. Text classification
improved by integrating bidirectional lstm with two-dimensional max pooling.
arXiv preprint arXiv:1611.06639, 2016.