{"title": "Deconvolutional Paragraph Representation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4169, "page_last": 4179, "abstract": "Learning latent representations from long text sequences is an important first step in many natural language processing applications. Recurrent Neural Networks (RNNs) have become a cornerstone for this challenging task. However, the quality of sentences during RNN-based decoding (reconstruction) decreases with the length of the text. We propose a sequence-to-sequence, purely convolutional and deconvolutional autoencoding framework that is free of the above issue, while also being computationally efficient. The proposed method is simple, easy to implement and can be leveraged as a building block for many applications. We show empirically that, compared to RNNs, our framework is better at reconstructing and correcting long paragraphs. Quantitative evaluation on semi-supervised text classification and summarization tasks demonstrates the potential for better utilization of long unlabeled text data.", "full_text": "Deconvolutional Paragraph Representation Learning\n\nYizhe Zhang\n\nDinghan Shen\n\nGuoyin Wang\n\nZhe Gan\n\nRicardo Henao\n\nDepartment of Electrical & Computer Engineering, Duke University\n\nLawrence Carin\n\nAbstract\n\nLearning latent representations from long text sequences is an important first step in many natural language processing applications. Recurrent Neural Networks (RNNs) have become a cornerstone for this challenging task. However, the quality of sentences during RNN-based decoding (reconstruction) decreases with the length of the text. We propose a sequence-to-sequence, purely convolutional and deconvolutional autoencoding framework that is free of the above issue, while also being computationally efficient. The proposed method is simple, easy to implement and can be leveraged as a building block for many applications. 
We show empirically that, compared to RNNs, our framework is better at reconstructing and correcting long paragraphs. Quantitative evaluation on semi-supervised text classification and summarization tasks demonstrates the potential for better utilization of long unlabeled text data.

1 Introduction

A central task in natural language processing is to learn representations (features) for sentences or multi-sentence paragraphs. These representations are typically a required first step toward more applied tasks, such as sentiment analysis [1, 2, 3, 4], machine translation [5, 6, 7], dialogue systems [8, 9, 10] and text summarization [11, 12, 13]. An approach for learning sentence representations from data is to leverage an encoder-decoder framework [14]. In a standard autoencoding setup, a vector representation is first encoded from an embedding of an input sequence, then decoded to the original domain to reconstruct the input sequence. Recent advances in Recurrent Neural Networks (RNNs) [15], especially Long Short-Term Memory (LSTM) [16] and variants [17], have achieved great success in numerous tasks that heavily rely on sentence-representation learning.

RNN-based methods typically model sentences recursively as a generative Markov process with hidden units, where the one-step-ahead word from an input sentence is generated by conditioning on previous words and hidden units, via emission and transition operators modeled as neural networks. In principle, the neural representations of input sequences aim to encapsulate sufficient information about their structure, to subsequently recover the original sentences via decoding. However, due to the recursive nature of the RNN, challenges exist for RNN-based strategies to fully encode a sentence into a vector representation. 
Typically, during training, the RNN generates words in sequence conditioning on previous ground-truth words, i.e., teacher forcing training [18], rather than decoding the whole sentence solely from the encoded representation vector. This teacher forcing strategy has proven important because it forces the output sequence of the RNN to stay close to the ground-truth sequence. However, allowing the decoder to access ground-truth information when reconstructing the sequence weakens the encoder's ability to produce self-contained representations that carry enough information to steer the decoder through the decoding process without additional guidance. Aiming to solve this problem, [19] proposed a scheduled sampling approach during training, which gradually shifts from learning via both latent representation and ground-truth signals to solely using the encoded latent representation. Unfortunately, [20] showed that scheduled sampling is a fundamentally inconsistent training strategy, in that it produces largely unstable results in practice. As a result, training may fail to converge on occasion.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

During inference, for which ground-truth sentences are not available, words ahead can only be generated by conditioning on previously generated words through the representation vector. Consequently, decoding error compounds proportionally to the length of the sequence. This means that generated sentences quickly deviate from the ground-truth once an error has been made, and increasingly so as the sentence progresses. This phenomenon was coined exposure bias in [19].

We propose a simple yet powerful purely convolutional framework for learning sentence representations. Conveniently, without RNNs in our framework, issues connected to teacher forcing training and exposure bias are not relevant. 
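The teacher-forcing/exposure-bias contrast can be illustrated with a toy sketch (not the paper's model): a hypothetical lookup-table "decoder" that predicts the next token from the previous one. Under teacher forcing it always conditions on the ground-truth history; at inference it must consume its own predictions, so a single early error compounds.

```python
# Toy illustration of teacher forcing vs. free-running decoding (not the
# paper's model). The bigram table below is a hypothetical "learned" model
# with one wrong entry ("cat" -> "lay" instead of "sat").
ground_truth = ["the", "cat", "sat", "on", "a", "mat"]
next_token = {"<s>": "the", "the": "cat", "cat": "lay", "sat": "on",
              "on": "a", "a": "mat", "lay": "down", "down": "low"}

def decode(teacher_forcing):
    prev, out = "<s>", []
    for t in range(len(ground_truth)):
        pred = next_token.get(prev, "<unk>")
        out.append(pred)
        # Teacher forcing feeds the ground-truth token back in;
        # free running feeds the model's own prediction.
        prev = ground_truth[t] if teacher_forcing else pred
    return out

tf_out = decode(teacher_forcing=True)    # one isolated error at position 2
fr_out = decode(teacher_forcing=False)   # errors propagate from position 2 on
```

With teacher forcing the single wrong table entry yields one isolated mistake; in free-running mode the same entry derails every subsequent prediction, which is the exposure bias described above.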
The proposed approach uses a Convolutional Neural Network (CNN) [21, 22, 23] as encoder and a deconvolutional (i.e., transposed convolutional) neural network [24, 25] as decoder. To the best of our knowledge, the proposed framework is the first to force the encoded latent representation to capture information from the entire sentence via a multi-layer CNN specification, to achieve high reconstruction quality without leveraging RNN-based decoders. Our multi-layer CNN allows representation vectors to abstract information from the entire sentence, irrespective of order or length, making it an appealing choice for tasks involving long sentences or paragraphs. Further, since our framework does not involve recursive encoding or decoding, it can be very efficiently parallelized using convolution-specific Graphics Processing Unit (GPU) primitives, yielding significant computational savings compared to RNN-based models.

2 Convolutional Auto-encoding for Text Modeling

2.1 Convolutional encoder

Let wt denote the t-th word in a given sentence. Each word wt is embedded into a k-dimensional word vector xt = We[wt], where We ∈ R^(k×V) is a (learned) word embedding matrix, V is the vocabulary size, and We[v] denotes the v-th column of We. All columns of We are normalized to have unit ℓ2-norm, i.e., ||We[v]||2 = 1, ∀v, by dividing each column by its ℓ2-norm. After embedding, a sentence of length T (padded where necessary) is represented as X ∈ R^(k×T), by concatenating its word embeddings, i.e., xt is the t-th column of X.

For sentence encoding, we use a CNN architecture similar to [26], though originally proposed for image data. The CNN consists of L layers (L − 1 convolutional, and the L-th fully-connected) that ultimately summarize an input sentence into a (fixed-length) latent representation vector, h. Layer l ∈ {1, . . . , L} consists of pl filters, learned from data. 
For the i-th filter in layer 1, a convolutional operation with stride length r^(1) applies filter W_c^(i,1) ∈ R^(k×h) to X, where h is the convolution filter size. This yields latent feature map c^(i,1) = γ(X ∗ W_c^(i,1) + b^(i,1)) ∈ R^((T−h)/r^(1)+1), where γ(·) is a nonlinear activation function, b^(i,1) ∈ R^((T−h)/r^(1)+1), and ∗ denotes the convolutional operator. In our experiments, γ(·) is represented by a Rectified Linear Unit (ReLU) [27]. Note that the original embedding dimension, k, changes after the first convolutional layer, as c^(i,1) ∈ R^((T−h)/r^(1)+1), for i = 1, . . . , p1. Concatenating the results from the p1 filters (for layer 1) results in feature map C^(1) = [c^(1,1) . . . c^(p1,1)] ∈ R^(p1×[(T−h)/r^(1)+1]).

After this first convolutional layer, we apply the convolution operation to the feature map, C^(1), using the same filter size, h, with this repeated in sequence for L − 1 layers. Each time, the length along the spatial coordinate is reduced to T^(l+1) = ⌊(T^(l) − h)/r^(l)⌋ + 1, where r^(l) is the stride length, T^(l) is the spatial length, l denotes the l-th layer and ⌊·⌋ is the floor function. For the final layer, L, the feature map C^(L−1) is fed into a fully-connected layer, to produce the latent representation h. Implementation-wise, we use a convolutional layer with filter size equal to T^(L−1) (regardless of h), which is equivalent to a fully-connected layer; this implementation trick has also been utilized in [26]. This last layer summarizes all remaining spatial coordinates, T^(L−1), into scalar features that encapsulate sentence sub-structures throughout the entire sentence, characterized by filters {W_c^(i,l)} for i = 1, . . . , pl and l = 1, . . . , L, where W_c^(i,l) denotes filter i for layer l. 
This also implies that the extracted feature is of fixed dimensionality, independent of the length of the input sentence.

Figure 1: Convolutional auto-encoding architecture. Encoder: the input sequence is first expanded to an embedding matrix, X, then fully compressed to a representation vector h, through a multi-layer convolutional encoder with stride. In the last layer, the spatial dimension is collapsed to remove the spatial dependency. Decoder: the latent vector h is fed through a multi-layer deconvolutional decoder with stride to reconstruct X as X̂, via a cosine-similarity cross-entropy loss.

Having pL filters on the last layer results in a pL-dimensional representation vector, h = C^(L), for the input sentence. For example, in Figure 1, the encoder consists of L = 3 layers, which for a sentence of length T = 60, embedding dimension k = 300, stride lengths {r^(1), r^(2), r^(3)} = {2, 2, 1}, filter sizes h = {5, 5, 12} and numbers of filters {p1, p2, p3} = {300, 600, 500}, results in intermediate feature maps C^(1) and C^(2) of sizes {28 × 300, 12 × 600}, respectively. The last feature map, of size 1 × 500, corresponds to the latent representation vector, h.

Conceptually, filters from the lower layers capture primitive sentence information (h-grams, analogous to edges in images), while higher-level filters capture more sophisticated linguistic features, such as semantic and syntactic structures (analogous to image elements). Such a bottom-up architecture models sentences by hierarchically stacking text segments (h-grams) as building blocks for the representation vector, h. 
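The layer sizes in the Figure 1 example can be checked against the downsampling rule from Section 2.1; a minimal sketch that only tracks spatial lengths (the actual encoder applies learned filters):

```python
# Sanity check of the Figure 1 encoder dimensions, using the rule
# T^(l+1) = floor((T^(l) - h)/r^(l)) + 1.
def conv_out_len(T, h, r):
    return (T - h) // r + 1

T = 60                              # padded sentence length
layers = [(5, 2), (5, 2), (12, 1)]  # (filter size h, stride r) per layer
sizes = []
for h, r in layers:
    T = conv_out_len(T, h, r)
    sizes.append(T)
# sizes == [28, 12, 1]: feature maps 28x300 and 12x600, then the 1x500 vector h
```

The last layer's filter size equals the remaining spatial length (12), which is what makes it equivalent to a fully-connected layer.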
This is similar in spirit to modeling linguistic grammar formalisms via concrete syntax trees [28]; however, we do not pre-specify a tree structure based on some syntactic structure (i.e., the English language), but rather abstract it from data via a multi-layer convolutional network.

2.2 Deconvolutional decoder

We apply deconvolution with stride (i.e., convolutional transpose), as the conjugate operation of convolution, to decode the latent representation, h, back to the source (discrete) text domain. As the deconvolution operation proceeds, the spatial resolution gradually increases, by mirroring the convolutional steps described above, as illustrated in Figure 1. The spatial dimension is first expanded to match the spatial dimension of the (L − 1)-th layer of convolution, then progressively expanded as T^(l+1) = (T^(l) − 1) ∗ r^(l) + h, for l = 1, . . . , up to the L-th deconvolutional layer (which corresponds to the input layer of the convolutional encoder). The output of the L-layer deconvolution operation aims to reconstruct the word embedding matrix, which we denote as X̂. In line with word embedding matrix We, the columns of X̂ are normalized to have unit ℓ2-norm.

Denoting ŵt as the t-th word in reconstructed sentence ŝ, the probability of ŵt being word v is specified as

p(ŵt = v) = exp[τ^(−1) Dcos(x̂t, We[v])] / Σ_{v′∈V} exp[τ^(−1) Dcos(x̂t, We[v′])] ,   (1)

where Dcos(x, y) is the cosine similarity, defined as ⟨x, y⟩/(||x|| ||y||), We[v] is the v-th column of We, x̂t is the t-th column of X̂, and τ is a positive number we denote as the temperature parameter [29]. This parameter is akin to the concentration parameter of a Dirichlet distribution, in that it controls the spread of the probability vector [p(ŵt = 1) . . . 
p(ŵt = V)], thus a large τ encourages uniformly distributed probabilities, whereas a small τ encourages sparse, concentrated probability values. In the experiments we set τ = 0.01. Note that in our setting, the cosine similarity can be obtained as an inner product, provided that the columns of We and X̂ have unit ℓ2-norm by specification. This deconvolutional module can also be leveraged as a building block in VAEs [30, 31] or GANs [32, 33].

2.3 Model learning

The objective of the convolutional autoencoder described above can be written as the word-wise log-likelihood for all sentences s ∈ D, i.e.,

Lae = Σ_{d∈D} Σ_t log p(ŵ_t^d = w_t^d) ,   (2)

where D denotes the set of observed sentences. The simple maximum-likelihood objective in (2) is optimized via stochastic gradient descent. Details of the implementation are provided in the experiments. Note that (2) differs from prior related work in two ways: i) [22, 34] use pooling and un-pooling operators, while we use convolution/deconvolution with stride; and ii) more importantly, [22, 34] do not use a cosine-similarity reconstruction as in (1), but an RNN-based decoder. A further discussion of related work is provided in Section 3. We could use pooling and un-pooling instead of striding (a particular case of deterministic pooling/un-pooling); however, in early experiments (not shown) we did not observe significant performance gains, while convolution/deconvolution operations with stride are considerably more efficient in terms of memory footprint. Compared to a
Compared to a\nstandard LSTM-based RNN sequence autoencoders with roughly the same number of parameters,\ncomputations in our case are considerably faster (see experiments) using single NVIDIA TITAN X\nGPU. This is due to the high parallelization ef\ufb01ciency of CNNs via cuDNN primitives [35].\n\nComparison between deconvolutional and RNN Decoders The proposed framework can be seen\nas a complementary building block for natural language modeling. Contrary to the standard LSTM-\nbased decoder, the deconvolutional decoder imposes in general a less strict sequence dependency\ncompared to RNN architectures. Speci\ufb01cally, generating a word from an RNN requires a vector of\nhidden units that recursively accumulate information from the entire sentence in an order-preserving\nmanner (long-term dependencies are heavily down-weighted), while for a deconvolutional decoder,\nthe generation only depends on a representation vector that encapsulates information from throughout\nthe sentence without a pre-speci\ufb01ed ordering structure. As a result, for language generation tasks, a\nRNN decoder will usually generate more coherent text, when compared to a deconvolutional decoder.\nOn the contrary, a deconvolutional decoder is better at accounting for distant dependencies in long\nsentences, which can be very bene\ufb01cial in feature extraction for classi\ufb01cation and text summarization\ntasks.\n\n2.4 Semi-supervised classi\ufb01cation and summarization\n\nIdentifying related topics or sentiments, and abstracting (short) summaries from user generated content\nsuch as blogs or product reviews, has recently received signi\ufb01cant interest [1, 3, 4, 36, 37, 13, 11]. In\nmany practical scenarios, unlabeled data are abundant, however, there are not many practical cases\nwhere the potential of such unlabeled data is fully realized. 
Motivated by this opportunity, here we seek to complement scarcer but more valuable labeled data, to improve the generalization ability of supervised models. By ingesting unlabeled data, the model can learn to abstract latent representations that capture the semantic meaning of all available sentences, irrespective of whether or not they are labeled. This can be done prior to the supervised model training, as a two-step process. Recently, RNN-based methods exploiting this idea have been widely utilized and have achieved state-of-the-art performance in many tasks [1, 3, 4, 36, 37]. Alternatively, one can learn the autoencoder and classifier jointly, by specifying a classification model whose input is the latent representation, h; see for instance [38, 31].

In the case of product reviews, for example, each review may contain hundreds of words. This poses challenges when training RNN-based sequence encoders, in the sense that the RNN has to abstract information on-the-fly as it moves through the sentence, which often leads to loss of information, particularly in long sentences [39]. Furthermore, the decoding process uses ground-truth information during training, thus the learned representation may not necessarily keep all the information from the input text that is necessary for proper reconstruction, summarization or classification.

We consider applying our convolutional autoencoding framework to semi-supervised learning from long sentences and paragraphs. Instead of pre-training a fully unsupervised model as in [1, 3], we cast the semi-supervised task as a multi-task learning problem similar to [40], i.e., we simultaneously train a sequence autoencoder and a supervised model. 
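This joint training strategy amounts to a weighted sum of a reconstruction loss on all data and a supervision loss on labeled data; a minimal sketch, where the linear annealing schedule is an assumption (only the endpoints, α annealed from 1 to αmin = 0.01, come from the text):

```python
# Sketch of a multi-task semi-supervised objective: alpha * L_ae + L_sup.
# The linear schedule is an assumed form; only the endpoints (1 -> 0.01)
# are specified in the text.
def alpha_schedule(step, total_steps, alpha_min=0.01):
    frac = min(step / total_steps, 1.0)
    return max(1.0 - frac * (1.0 - alpha_min), alpha_min)

def semi_supervised_loss(recon_loss, sup_loss, alpha):
    # sup_loss is 0 for unlabeled batches, so those only train the autoencoder
    return alpha * recon_loss + sup_loss
```

Early in training α ≈ 1 emphasizes reconstruction; later, α ≈ 0.01 lets the supervised loss dominate while still regularizing the representation.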
In principle, by using this joint training strategy, the learned paragraph embedding vector will preserve both reconstruction and classification ability. Specifically, we consider the following objective:

Lsemi = α Σ_{d∈Dl∪Du} Σ_t log p(ŵ_t^d = w_t^d) + Σ_{d∈Dl} Lsup(f(h_d), y_d) ,   (3)

where α > 0 is an annealing parameter balancing the relative importance of the unsupervised and supervised losses, and Dl and Du denote the sets of labeled and unlabeled data, respectively. The first term in (3) is the sequence autoencoder loss in (2) for the d-th sequence. Lsup(·) is the supervision loss for the d-th sequence (labeled only). The classifier function, f(·), that attempts to reconstruct y_d from h_d can be either a Multi-Layer Perceptron (MLP) in classification tasks, or a CNN/RNN in text summarization tasks. For the latter, we are interested in a purely convolutional specification; however, we also consider an RNN for comparison. For classification, we use a standard cross-entropy loss, and for text summarization we use either (2) for the CNN or the standard LSTM loss for the RNN.

In practice, we adopt a scheduled annealing strategy for α as in [41, 42], rather than fixing it a priori as in [1]. During training, (3) gradually transitions from focusing solely on the unsupervised sequence autoencoder to the supervised task, by annealing α from 1 to a small positive value αmin. We set αmin = 0.01 in the experiments. The motivation for this annealing strategy is to first focus on abstracting paragraph features, then to selectively refine the learned features that are most informative to the supervised task.

3 Related Work

Previous work has considered leveraging CNNs as encoders for various natural language processing tasks [22, 34, 21, 43, 44]. 
Typically, CNN-based encoder architectures apply a single convolution layer followed by a pooling layer, which essentially acts as a detector of specific classes of h-grams, given a convolution filter window of size h. The deep architecture in our framework will, in principle, enable the higher-level layers to capture more sophisticated language features. We use convolutions with stride rather than pooling operators, e.g., max-pooling, for spatial downsampling, following [26, 45], where it is argued that fully convolutional architectures are able to learn their own spatial downsampling. Further, [46] uses a 29-layer CNN for text classification. Our CNN encoder is considerably simpler in structure (convolutions with stride and no more than 4 layers) while still achieving good performance.

Language decoders other than RNNs are less well studied. Recently, [47] proposed a hybrid model coupling a convolutional-deconvolutional network with an RNN, where the RNN acts as decoder and the deconvolutional model as a bridge between the encoder (convolutional network) and decoder. Additionally, [42, 48, 49, 50] considered CNN variants, such as PixelCNN [51], for text generation. Nevertheless, to achieve good empirical results, these methods still require the sentences to be generated sequentially, conditioning on ground-truth historical information, akin to RNN-based decoders, and thus still suffer from exposure bias.

Other efforts have been made to improve embeddings of long paragraphs using unsupervised approaches [2, 52]. The paragraph vector [2] learns a fixed-length vector by concatenating it with a word2vec [53] embedding of the history sequence to predict future words. The hierarchical neural autoencoder [52] builds a hierarchical attentive RNN, then uses the paragraph-level hidden units of that RNN as the embedding. 
Our work differs from these approaches in that we force the sequence to be fully restored from the latent representation, without aid from any history information.

Previous methods have considered leveraging unlabeled data for semi-supervised sequence classification tasks. Typically, RNN-based methods consider either i) training a sequence-to-sequence RNN autoencoder, or an RNN classifier that is robust to adversarial perturbation, as initialization for the encoder in the supervised model [1, 4]; or ii) learning latent representations via a sequence-to-sequence RNN autoencoder, and then using them as inputs to a classifier that also takes features extracted from a CNN as inputs [3]. For summarization tasks, [54] has considered a semi-supervised approach based on support vector machines; however, so far, research on semi-supervised text summarization using deep models is scarce.

4 Experiments

Experimental setup For all the experiments, we use a 3-layer convolutional encoder followed by a 3-layer deconvolutional decoder (recall the implementation details for the top layer). Filter size, stride

Ground-truth: on every visit to nyc , the hotel beacon is the place we love to stay . so conveniently located to central park , lincoln center and great local restaurants . the rooms are lovely . beds so comfortable , a great little kitchen and new wizz bang coffee maker . the staff are so accommodating and just love walking across the street to the fairway supermarket with every imaginable goodies to eat .

Hier. LSTM [52]: every time in new york , lighthouse hotel is our favorite place to stay . very convenient , central park , lincoln center , and great restaurants . the room is wonderful , very comfortable bed , a kitchenette and a large explosion of coffee maker . 
the staff is so inclusive , just across the street to walk to the supermarket channel love with all kinds of what to eat .

Our LSTM-LSTM: on every visit to nyc , the hotel beacon is the place to relax and wanting to become conveniently located . hotel , in the evenings out good budget accommodations . the views are great and we were more than two couples . manny the doorman has a great big guy come and will definitly want to leave during my stay and enjoy a wonderfully relaxing wind break in having for 24 hour early rick's cafe . oh perfect ! easy easy walking distance to everything imaginable groceries . if you may want to watch yours !

Our CNN-DCNN: on every visit to nyc , the hotel beacon is the place we love to stay . so closely located to central park , lincoln center and great local restaurants . biggest rooms are lovely . beds so comfortable , a great little kitchen and new UNK suggestion coffee maker . the staff turned so accommodating and just love walking across the street to former fairway supermarket with every food taxes to eat .

Table 1: Reconstructed paragraph of the Hotel Reviews example used in [52].

and word embedding dimension are set to h = 5, r^(l) = 2 for l = 1, . . . , 3, and k = 300, respectively. The dimension of the latent representation vector varies per experiment, and is thus reported separately. For notational convenience, we denote our convolutional-deconvolutional autoencoder as CNN-DCNN. In most comparisons, we also considered two standard autoencoders as baselines: a) CNN-LSTM: CNN encoder coupled with LSTM decoder; and b) LSTM-LSTM: LSTM encoder with LSTM decoder. An LSTM-DCNN configuration is not included because it yields similar performance to CNN-DCNN while being more computationally expensive. The complete experimental setup and baseline details are provided in the Supplementary Material (SM). CNN-DCNN has the least number of parameters. 
For example, using 500 as the dimension of h results in about 9, 13 and 15 million total trainable parameters for CNN-DCNN, CNN-LSTM and LSTM-LSTM, respectively.

Model | BLEU | ROUGE-1 | ROUGE-2
LSTM-LSTM [52] | 24.1 | 57.1 | 30.2
Hier. LSTM-LSTM [52] | 26.7 | 59.0 | 33.0
Hier. + att. LSTM-LSTM [52] | 28.5 | 62.4 | 35.5
CNN-LSTM | 18.3 | 56.6 | 28.2
CNN-DCNN | 94.2 | 97.0 | 94.2

Table 2: Reconstruction evaluation results on the Hotel Reviews dataset.

Figure 2: BLEU score vs. sentence length for Hotel Review data.

Paragraph reconstruction We first investigate the performance of the proposed autoencoder in terms of learning representations that can preserve paragraph information. We adopt the evaluation criteria from [52], i.e., ROUGE score [55] and BLEU score [56], to measure the closeness of the reconstructed paragraph (model output) to the input paragraph. Briefly, ROUGE and BLEU scores measure the n-gram recall and precision between the model outputs and the (ground-truth) references. We use BLEU-4 and ROUGE-1, 2 in our evaluation, in alignment with [52]. In addition to the CNN-LSTM and LSTM-LSTM autoencoders, we also compare with the hierarchical LSTM autoencoder [52]. The comparison is performed on the Hotel Reviews dataset, following the experimental setup from [52], i.e., we only keep reviews with length ranging from 50 to 250 words, resulting in 348,544 training samples and 39,023 testing samples. For all comparisons, we set the dimension of the latent representation to h = 500.

From Table 1, we see that for long paragraphs, the LSTM decoders in CNN-LSTM and LSTM-LSTM suffer from heavy exposure bias. We further evaluate the performance of each model with different paragraph lengths. As shown in Figure 2 and Table 2, CNN-DCNN demonstrates a clear advantage on this task; moreover, as the length of the sentence increases, the comparative advantage becomes more substantial. 
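The n-gram recall and precision underlying these metrics can be sketched with a toy overlap count (real BLEU/ROUGE additionally use clipping, several n-gram orders and, for BLEU, a brevity penalty):

```python
from collections import Counter

# Toy n-gram overlap, illustrating the recall (ROUGE-style) and precision
# (BLEU-style) directions of the reconstruction metrics; not full BLEU/ROUGE.
def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    hit = sum((cand & ref).values())  # clipped matches via Counter intersection
    precision = hit / max(sum(cand.values()), 1)
    recall = hit / max(sum(ref.values()), 1)
    return precision, recall

ref = "the rooms are lovely".split()
hyp = "the rooms are small".split()
p, r = overlap(hyp, ref, 2)  # 2 of 3 bigrams match in each direction
```

A reconstruction that copies the input nearly verbatim, as CNN-DCNN does here, drives both directions toward 1, which is why its scores in Table 2 are so high.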
For LSTM-based methods, the quality of the reconstruction deteriorates quickly as sequences get longer. In contrast, the reconstruction quality of CNN-DCNN is stable and consistent regardless of sentence length. Furthermore, the computational cost, evaluated as wall-clock time, is significantly lower for CNN-DCNN. Roughly, CNN-LSTM is 3 times slower than CNN-DCNN, and LSTM-LSTM is 5 times slower, on a single GPU. Details are reported in the SM.

Character-level and word-level correction This task seeks to evaluate whether the deconvolutional decoder can overcome exposure bias, which severely limits LSTM-based decoders. We consider a denoising autoencoder where the input is tweaked slightly with certain modifications, while the model attempts to denoise (correct) the unknown modification, thus recovering the original sentence. For character-level correction, we consider the Yahoo! Answers dataset [57]. The dataset description and setup for word-level correction are provided in the SM. We follow the experimental setup in [58] for word-level and character-level spelling correction (see details in the SM). We consider substituting each word/character with a different one at random with probability η, with η = 0.30. For character-level analysis, we first map all characters into a 40-dimensional embedding vector, with the network structure for word- and character-level models kept the same.

Model | Yahoo (CER)
Actor-critic [58] | 0.2284
LSTM-LSTM | 0.2621
CNN-LSTM | 0.2035
CNN-DCNN | 0.1323

Model | ArXiv (WER)
LSTM-LSTM | 0.7250
CNN-LSTM | 0.3819
CNN-DCNN | 0.3067

Table 3: CER and WER comparison on Yahoo and ArXiv data.

Figure 4: Spelling error denoising comparison. Darker colors indicate higher uncertainty. 
Trained on modified sentences.

Figure 3: CER comparison. Black triangles indicate the end of an epoch.

We employ the Character Error Rate (CER) [58] and Word Error Rate (WER) [59] for evaluation. The WER/CER measure the ratio between the Levenshtein distance (a.k.a. edit distance) of the model predictions to the ground truth and the total length of the sequence. Conceptually, a lower WER/CER indicates better performance. We use LSTM-LSTM and CNN-LSTM denoising autoencoders for comparison. The architecture of the word-level baseline models is the same as in the previous experiment. For character-level correction, we set the dimension of h to 900. We also compare to actor-critic training [58], following their experimental guidelines (see details in the SM).

As shown in Figure 3 and Table 3, CNN-DCNN achieves both lower CER and faster convergence. Further, CNN-DCNN delivers stable denoising performance irrespective of the noise location within the sentence, as seen in Figure 4. For CNN-DCNN, even when an error is detected but not exactly corrected (darker colors in Figure 4 indicate higher uncertainty), denoising of future words is not affected, while for CNN-LSTM and LSTM-LSTM the error gradually accumulates over longer sequences, as expected.

For word-level correction, we consider word substitutions only, as well as mixed perturbations of three kinds: substitution, deletion and insertion. Generally, CNN-DCNN outperforms CNN-LSTM and LSTM-LSTM, and is faster. We provide experimental details and comparative results in the SM.

Semi-supervised sequence classification & summarization We investigate whether our CNN-DCNN framework can improve upon supervised natural language tasks that leverage features learned from paragraphs. In principle, a good unsupervised feature extractor will improve the generalization ability in a semi-supervised learning setting. 
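The CER/WER used in the correction experiments above reduce to an edit-distance ratio; a standard dynamic-programming Levenshtein sketch (character level; applying it to word lists gives WER):

```python
# Levenshtein (edit) distance via the classic two-row DP, and the CER ratio
# built on it, matching the metric definition above.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(prediction, reference):
    return levenshtein(prediction, reference) / max(len(reference), 1)
```

For example, `levenshtein("kitten", "sitting")` is 3 (two substitutions and one insertion), so the corresponding CER against the 7-character reference is 3/7.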
We evaluate our approach on three popular natural language tasks: sentiment analysis, paragraph topic prediction and text summarization. The first two tasks are essentially sequence classification, while summarization involves both language comprehension and language generation.
We consider three large-scale document classification datasets: DBPedia, Yahoo! Answers and Yelp Review Polarity [57]. The partition of training, validation and test sets for all datasets follows the settings from [57]. Detailed summary statistics for all datasets are given in the SM. To demonstrate the advantage of incorporating the reconstruction objective into the training of text classifiers, we further evaluate our model with different amounts of labeled data (0.1%, 0.15%, 0.25%, 1%, 10% and 100%, respectively), using the whole training set as unlabeled data.
For our purely supervised baseline model (supervised CNN), we use the same convolutional encoder architecture described above, with a 500-dimensional latent representation, followed by an MLP classifier with one hidden layer of 300 hidden units. The dropout rate is set to 50%. Word embeddings are initialized at random.
As shown in Table 4, the joint training strategy consistently and significantly outperforms the purely supervised strategy across datasets, even when all labels are available.
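The joint strategy referred to above optimizes the supervised classification loss together with the autoencoder reconstruction loss, with reconstruction emphasized early in training. As an illustration only (the mixing weight `alpha_max` and the linear annealing schedule below are assumptions, not our exact settings), such a joint objective can be sketched as:

```python
def joint_loss(sup_loss, recon_loss, step, total_steps, alpha_max=1.0):
    """Combine supervised and reconstruction losses.

    NOTE: alpha_max and the linear decay are illustrative assumptions, not the
    exact schedule used in the experiments. alpha starts at alpha_max (so the
    reconstruction term dominates early) and decays to 0 as training proceeds.
    """
    alpha = alpha_max * max(0.0, 1.0 - step / total_steps)
    return sup_loss + alpha * recon_loss

# Early in training the reconstruction term dominates the objective...
print(joint_loss(0.5, 2.0, step=0, total_steps=100))    # 2.5
# ...while at the end only the supervised loss remains.
print(joint_loss(0.5, 2.0, step=100, total_steps=100))  # 0.5
```

Under such a schedule, reconstruction shapes the latent features first, and supervision then selects the discriminative ones, consistent with the training behavior discussed below.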
[Figure 3 plot: Character Error Rate (0 to 1) vs. training time (hours) for CNN-DCNN, CNN-LSTM and LSTM-LSTM.]

[Figure 4 examples. Sentence 1, original: "can anyone suggest some good books ?"; modified: "cap anyonk wuggest xohe iord yooku ?"; Actor-critic: "can anyone withest to e ford you u ?"; LSTM-LSTM: "can anyone suggest joke food young ?"; CNN-LSTM: "can anyone guites some owe pooks ?"; CNN-DCNN: "can anyone suggest some wood books ?". Sentence 2, original: "what s your idea of a stepping stone to better things to come ?"; modified: "wuat s yogr idem of t stepukng jtzne ti better thingz tt coee ?"; Actor-critic: "what s your idem of t stepuang jokne ti better thing itt come ?"; LSTM-LSTM: "what s your idea of a speaking stand to better things to come ?"; CNN-LSTM: "what s your idem of a stepping start to better thing to come ?"; CNN-DCNN: "what s your idea of a stepping stone to better things to come ?".]

We hypothesize that during the early phase of training, when reconstruction is emphasized, features from text fragments can be readily
learned. As the training proceeds, the most discriminative text fragment features are selected. Further, the subset of features responsible for both reconstruction and discrimination presumably encapsulates longer-range dependency structure than the features learned under a purely supervised strategy. Figure 5 demonstrates the behavior of our model in a semi-supervised setting on the Yelp Review dataset. The results for Yahoo! Answer and DBpedia are provided in the SM.

Table 4: Test error rates of document classification (%). Results from other methods were obtained from [57].

Model                                 DBpedia   Yelp P.   Yahoo
ngrams TFIDF                          1.31      4.56      31.49
Large Word ConvNet                    1.72      4.89      29.06
Small Word ConvNet                    1.85      5.54      30.02
Large Char ConvNet                    1.73      5.89      29.55
Small Char ConvNet                    1.98      6.53      29.84
SA-LSTM (word-level)                  1.40      -         -
Deep ConvNet                          1.29      4.28      26.57
Ours (purely supervised)              1.76      4.62      27.42
Ours (joint training with CNN-LSTM)   1.36      4.21      26.32
Ours (joint training with CNN-DCNN)   1.17      3.96      25.82

Figure 5: Semi-supervised classification accuracy on Yelp review data (Supervised vs. Semi (CNN-DCNN) and Semi (CNN-LSTM)).

For summarization, we used a dataset composed of 58,000 abstract-title pairs from arXiv. Abstract-title pairs are selected if the lengths of the title and abstract do not exceed 50 and 500 words, respectively. We partitioned the data into training, validation and test sets of 55,000, 2,000 and 1,000 pairs, respectively. We train a sequence-to-sequence model to generate the title given the abstract, using a randomly selected subset of paired data with proportion σ = (5%, 10%, 50%, 100%). For every value of σ, we considered both purely supervised summarization, using just abstract-title pairs, and semi-supervised summarization, leveraging additional abstracts without titles.
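Generated titles are scored below with ROUGE-L, which is based on the longest common subsequence (LCS) shared by a candidate and its reference. A minimal sketch (using the balanced F1 variant for simplicity; the ROUGE-L of [55] weights recall more heavily):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    """ROUGE-L between whitespace-tokenized strings, as a balanced F1
    (a simplification; the standard metric uses a recall-weighted F-measure)."""
    c, r = candidate.split(), reference.split()
    m = lcs_len(c, r)
    if m == 0:
        return 0.0
    prec, rec = m / len(c), m / len(r)
    return 2 * prec * rec / (prec + rec)

# LCS = ["learning", "text"], so precision = 2/3, recall = 2/4, F1 = 4/7.
print(round(rouge_l("learning text models", "deep learning for text"), 4))  # 0.5714
```

Because LCS rewards in-order word overlap with the reference, more extractive titles tend to score higher under this metric.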
We compared the LSTM and the deconvolutional network as decoders for generating titles at σ = 100%.
Table 5 summarizes quantitative results using ROUGE-L (longest common subsequence) [55]. In general, the additional abstracts without titles improve the generalization ability on the test set. Interestingly, even when σ = 100% (all titles are observed), the joint training objective still yields better performance than using Lsup alone. Presumably, since the joint training objective requires the latent representation to be capable of reconstructing the input paragraph, in addition to generating a title, the learned representation may better capture the entire structure (meaning) of the paragraph. We also empirically observed that titles generated under the joint training objective are more likely to use words appearing in the corresponding paragraph (i.e., they are more extractive), while titles generated using the purely supervised objective Lsup tend to use wording more freely, and are thus more abstractive. One possible explanation is that, under the joint training strategy, since the reconstructed paragraph and the title are both generated from the latent representation h, the text fragments used for reconstructing the input paragraph are more likely to be leveraged when "building" the title, so the title bears more resemblance to the input paragraph.
As expected, the titles produced by a deconvolutional decoder are less coherent than those from an LSTM decoder. Presumably, since each paragraph can be summarized with multiple plausible titles, the deconvolutional decoder may have trouble positioning text segments. We provide discussions and titles generated under different setups in the SM. Designing a framework that takes the best of these two worlds, an LSTM for generation and a CNN for decoding, will be an interesting future direction.

Table 5: Summarization task on arXiv data, using the ROUGE-L metric. The first 4 columns are for the LSTM decoder, and the last column is for the deconvolutional decoder (100% observed).

Obs. proportion σ   5%      10%     50%     100%    DCNN dec.
Supervised          12.40   13.07   14.75   15.87   16.37
Semi-sup.           16.04   16.62   16.83   17.64   18.14

5 Conclusion

We proposed a general framework for text modeling using purely convolutional and deconvolutional operations. The proposed method is free of sequential conditional generation, avoiding the issues associated with exposure bias and teacher-forcing training. Our approach enables the model to fully encapsulate a paragraph into a latent representation vector, which can be decompressed to reconstruct the original input sequence. Empirically, the proposed approach achieves excellent long-paragraph reconstruction quality and outperforms existing algorithms on spelling correction and on semi-supervised sequence classification and summarization, at largely reduced computational cost.

Acknowledgements   This research was supported in part by ARO, DARPA, DOE, NGA and ONR.

References

[1] Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In NIPS, 2015.

[2] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, 2014.

[3] Rie Johnson and Tong Zhang. Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings. arXiv, February 2016.

[4] Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial Training Methods for Semi-Supervised Text Classification. In ICLR, May 2017.

[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR, 2015.

[6] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation.
In EMNLP, 2014.\n\n[7] Fandong Meng, Zhengdong Lu, Mingxuan Wang, Hang Li, Wenbin Jiang, and Qun Liu. Encoding source\n\nlanguage with convolutional neural network for machine translation. In ACL, 2015.\n\n[8] Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. Se-\nmantically conditioned lstm-based natural language generation for spoken dialogue systems. arXiv,\n2015.\n\n[9] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement\n\nlearning for dialogue generation. arXiv, 2016.\n\n[10] Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue\n\ngeneration. arXiv:1701.06547, 2017.\n\n[11] Ramesh Nallapati, Bowen Zhou, Cicero Nogueira dos santos, Caglar Gulcehre, and Bing Xiang. Abstractive\n\nText Summarization Using Sequence-to-Sequence RNNs and Beyond. In CoNLL, 2016.\n\n[12] Shashi Narayan, Nikos Papasarantopoulos, Mirella Lapata, and Shay B Cohen. Neural Extractive Summa-\n\nrization with Side Information. arXiv, April 2017.\n\n[13] Alexander M Rush, Sumit Chopra, and Jason Weston. A Neural Attention Model for Abstractive Sentence\n\nSummarization. In EMNLP, 2015.\n\n[14] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In\n\nNIPS, 2014.\n\n[15] Tomas Mikolov, Martin Kara\ufb01\u00e1t, Lukas Burget, Jan Cernock`y, and Sanjeev Khudanpur. Recurrent neural\n\nnetwork based language model. In INTERSPEECH, 2010.\n\n[16] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. In Neural computation, 1997.\n\n[17] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated\n\nrecurrent neural networks on sequence modeling. arXiv, 2014.\n\n[18] Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural\n\nnetworks. 
Neural computation, 1(2):270–280, 1989.

[19] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, 2015.

[20] Ferenc Huszár. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv, 2015.

[21] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. In ACL, 2014.

[22] Yoon Kim. Convolutional neural networks for sentence classification. In EMNLP, 2014.

[23] Zhe Gan, Yunchen Pu, Ricardo Henao, Chunyuan Li, Xiaodong He, and Lawrence Carin. Learning generic sentence representations using convolutional neural networks. In EMNLP, 2017.

[24] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. Pixelvae: A latent variable model for natural images. arXiv, 2016.

[25] Yunchen Pu, Xin Yuan, Andrew Stevens, Chunyuan Li, and Lawrence Carin. A deep generative deconvolutional image model. In Artificial Intelligence and Statistics, pages 741–750, 2016.

[26] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv, 2015.

[27] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, pages 807–814, 2010.

[28] Ian Chiswell and Wilfrid Hodges. Mathematical logic, volume 3. OUP Oxford, 2007.

[29] Emil Julius Gumbel and Julius Lieblein. Statistical theory of extreme values and some practical applications: a series of lectures. 1954.

[30] Yunchen Pu, Xin Yuan, and Lawrence Carin. A generative model for deep convolutional learning.
arXiv\n\npreprint arXiv:1504.04054, 2015.\n\n[31] Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin.\n\nVariational autoencoder for deep learning of images, labels and captions. In NIPS, 2016.\n\n[32] Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. Adversarial\n\nfeature matching for text generation. In ICML, 2017.\n\n[33] Zhe Gan, Liqun Chen, Weiyao Wang, Yunchen Pu, Yizhe Zhang, Hao Liu, Chunyuan Li, and Lawrence\n\nCarin. Triangle generative adversarial networks. arXiv preprint arXiv:1709.06548, 2017.\n\n[34] Ronan Collobert, Jason Weston, L\u00e9on Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa.\n\nNatural language processing (almost) from scratch. In JMLR, 2011.\n\n[35] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and\n\nEvan Shelhamer. cudnn: Ef\ufb01cient primitives for deep learning. arXiv, 2014.\n\n[36] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention\n\nnetworks for document classi\ufb01cation. In NAACL, 2016.\n\n[37] Adji B Dieng, Chong Wang, Jianfeng Gao, and John Paisley. TopicRNN: A Recurrent Neural Network\n\nwith Long-Range Semantic Dependency. In ICLR, 2016.\n\n[38] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised\n\nlearning with deep generative models. In NIPS, 2014.\n\n[39] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and J\u00fcrgen Schmidhuber. Gradient \ufb02ow in recurrent\n\nnets: the dif\ufb01culty of learning long-term dependencies, 2001.\n\n[40] Richard Socher, Jeffrey Pennington, Eric H Huang, Andrew Y Ng, and Christopher D Manning. Semi-\nsupervised recursive autoencoders for predicting sentiment distributions. In EMNLP. 
Association for Computational Linguistics, 2011.

[41] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv, 2015.

[42] Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. Improved Variational Autoencoders for Text Modeling using Dilated Convolutions. arXiv, February 2017.

[43] Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. Convolutional neural network architectures for matching natural language sentences. In NIPS, 2014.

[44] Rie Johnson and Tong Zhang. Effective use of word order for text categorization with convolutional neural networks. In NAACL HLT, 2015.

[45] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv, 2014.

[46] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[47] Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. A Hybrid Convolutional Variational Autoencoder for Text Generation. arXiv, February 2017.

[48] Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv, 2016.

[49] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language Modeling with Gated Convolutional Networks. arXiv, December 2016.

[50] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional Sequence to Sequence Learning. arXiv, May 2017.

[51] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In NIPS, pages 4790–4798, 2016.

[52] Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. A hierarchical neural autoencoder for paragraphs and documents.
In ACL, 2015.

[53] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.

[54] Kam-Fai Wong, Mingli Wu, and Wenjie Li. Extractive summarization using supervised and semi-supervised learning. In ICCL. Association for Computational Linguistics, 2008.

[55] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In ACL workshop, 2004.

[56] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL. Association for Computational Linguistics, 2002.

[57] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In NIPS, pages 649–657, 2015.

[58] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. arXiv, 2016.

[59] JP Woodard and JT Nelson. An information theoretic measure of speech recognition performance. In Workshop on standardisation for speech I/O, 1982.