{"title": "Character-level Convolutional Networks for Text Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 649, "page_last": 657, "abstract": "This article offers an empirical exploration on the use of character-level convolutional networks (ConvNets) for text classification. We constructed several large-scale datasets to show that character-level convolutional networks could achieve state-of-the-art or competitive results. Comparisons are offered against traditional models such as bag of words, n-grams and their TFIDF variants, and deep learning models such as word-based ConvNets and recurrent neural networks.", "full_text": "Character-level Convolutional Networks for Text\n\nClassi\ufb01cation\u2217\n\nXiang Zhang\n\nJunbo Zhao\n\nYann LeCun\n\nCourant Institute of Mathematical Sciences, New York University\n\n719 Broadway, 12th Floor, New York, NY 10003\n{xiang, junbo.zhao, yann}@cs.nyu.edu\n\nAbstract\n\nThis article offers an empirical exploration on the use of character-level convolu-\ntional networks (ConvNets) for text classi\ufb01cation. We constructed several large-\nscale datasets to show that character-level convolutional networks could achieve\nstate-of-the-art or competitive results. Comparisons are offered against traditional\nmodels such as bag of words, n-grams and their TFIDF variants, and deep learning\nmodels such as word-based ConvNets and recurrent neural networks.\n\n1\n\nIntroduction\n\nText classi\ufb01cation is a classic topic for natural language processing, in which one needs to assign\nprede\ufb01ned categories to free-text documents. The range of text classi\ufb01cation research goes from\ndesigning the best features to choosing the best possible machine learning classi\ufb01ers. 
To date, almost all techniques of text classification are based on words, in which simple statistics of some ordered word combinations (such as n-grams) usually perform the best [12].\nOn the other hand, many researchers have found that convolutional networks (ConvNets) [17] [18] are useful in extracting information from raw signals, ranging from computer vision applications to speech recognition and others. In particular, time-delay networks used in the early days of deep learning research are essentially convolutional networks that model sequential data [1] [31].\nIn this article we explore treating text as a kind of raw signal at the character level, and applying temporal (one-dimensional) ConvNets to it. For this article we only used a classification task as a way to exemplify ConvNets' ability to understand texts. Historically we know that ConvNets usually require large-scale datasets to work, therefore we also built several of them. An extensive set of comparisons is offered with traditional models and other deep learning models.\nApplying convolutional networks to text classification or natural language processing at large has been explored in the literature. It has been shown that ConvNets can be directly applied to distributed [6] [16] or discrete [13] embeddings of words, without any knowledge of the syntactic or semantic structures of a language. These approaches have been proven to be competitive with traditional models.\nThere are also related works that use character-level features for language processing. These include using character-level n-grams with linear classifiers [15], and incorporating character-level features into ConvNets [28] [29]. In particular, these ConvNet approaches use words as a basis, in which character-level features extracted at the word [28] or word n-gram [29] level form a distributed representation. 
Improvements for part-of-speech tagging and information retrieval were observed.\nThis article is the first to apply ConvNets only on characters. We show that when trained on large-scale datasets, deep ConvNets do not require the knowledge of words, in addition to the conclusion from previous research that ConvNets do not require the knowledge about the syntactic or semantic structure of a language. This simplification of engineering could be crucial for a single system that can work for different languages, since characters always constitute a necessary construct regardless of whether segmentation into words is possible. Working on only characters also has the advantage that abnormal character combinations such as misspellings and emoticons may be naturally learnt.\n\n\u2217An early version of this work entitled \u201cText Understanding from Scratch\u201d was posted in Feb 2015 as arXiv:1502.01710. The present paper has considerably more experimental results and a rewritten introduction.\n\n2 Character-level Convolutional Networks\n\nIn this section, we introduce the design of character-level ConvNets for text classification. The design is modular, where the gradients are obtained by back-propagation [27] to perform optimization.\n\n2.1 Key Modules\n\nThe main component is the temporal convolutional module, which simply computes a 1-D convolution. Suppose we have a discrete input function g(x) \u2208 [1, l] \u2192 R and a discrete kernel function f(x) \u2208 [1, k] \u2192 R. The convolution h(y) \u2208 [1, \u230a(l \u2212 k + 1)/d\u230b] \u2192 R between f(x) and g(x) with stride d is defined as\n\nh(y) = \u2211_{x=1}^{k} f(x) \u00b7 g(y \u00b7 d \u2212 x + c),\n\nwhere c = k \u2212 d + 1 is an offset constant. Just as in traditional convolutional networks in vision, the module is parameterized by a set of such kernel functions f_ij(x) (i = 1, 2, . . . , m and j = 1, 2, . . . , n) which we call weights, on a set of inputs g_i(x) and outputs h_j(y). We call each g_i (or h_j) an input (or output) feature, and m (or n) the input (or output) feature size. The output h_j(y) is obtained by a sum over i of the convolutions between g_i(x) and f_ij(x).\nOne key module that helped us to train deeper models is temporal max-pooling. It is the 1-D version of the max-pooling module used in computer vision [2]. Given a discrete input function g(x) \u2208 [1, l] \u2192 R, the max-pooling function h(y) \u2208 [1, \u230a(l \u2212 k + 1)/d\u230b] \u2192 R of g(x) is defined as\n\nh(y) = max_{x=1}^{k} g(y \u00b7 d \u2212 x + c),\n\nwhere c = k \u2212 d + 1 is an offset constant. This very pooling module enabled us to train ConvNets deeper than 6 layers, where all others fail. The analysis by [3] might shed some light on this.\nThe non-linearity used in our model is the rectifier or thresholding function h(x) = max{0, x}, which makes our convolutional layers similar to rectified linear units (ReLUs) [24]. The algorithm used is stochastic gradient descent (SGD) with a minibatch of size 128, using momentum [26] [30] 0.9 and an initial step size of 0.01 which is halved every 3 epochs, 10 times. Each epoch takes a fixed number of random training samples uniformly sampled across classes. This number will later be detailed for each dataset separately. The implementation is done using Torch 7 [4].\n\n2.2 Character quantization\n\nOur models accept a sequence of encoded characters as input. The encoding is done by prescribing an alphabet of size m for the input language, and then quantizing each character using 1-of-m encoding (or \u201cone-hot\u201d encoding). Then, the sequence of characters is transformed to a sequence of such m-sized vectors with fixed length l0. Any character exceeding length l0 is ignored, and any characters that are not in the alphabet, including blank characters, are quantized as all-zero vectors. 
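As a concrete check of the two module definitions in Section 2.1, here is a minimal sketch in plain Python (an illustrative transcription of the formulas only, not the paper's Torch 7 implementation; the 1-indexed discrete functions are represented as 0-indexed lists):

```python
def temporal_conv(g, f, d):
    """h(y) = sum over x in [1, k] of f(x) * g(y*d - x + c), with
    c = k - d + 1; g is the input, f the kernel, d the stride."""
    l, k = len(g), len(f)
    c = k - d + 1
    h = []
    for y in range(1, (l - k + 1) // d + 1):  # 1-indexed output positions
        h.append(sum(f[x - 1] * g[y * d - x + c - 1] for x in range(1, k + 1)))
    return h

def temporal_max_pool(g, k, d):
    """Temporal max-pooling: the same index arithmetic, with max in
    place of the weighted sum."""
    l = len(g)
    c = k - d + 1
    return [max(g[y * d - x + c - 1] for x in range(1, k + 1))
            for y in range(1, (l - k + 1) // d + 1)]
```

For intuition: with the layer configuration of table 1 (stride-1 convolutions of sizes 7, 7, 3, 3, 3, 3 and exact non-overlapping pooling of size 3, which divides the length by 3), an input of length l0 = 1014 shrinks to the frame length l6 = (1014 - 96)/27 = 34 quoted in Section 2.3.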
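The quantization scheme of this section can be sketched as follows (a minimal illustrative Python version; `quantize` and the toy alphabet in the test call are our own names, not the paper's code):

```python
def quantize(text, alphabet, l0):
    """1-of-m encode up to l0 characters of `text`, reading backward so
    the most recent characters sit at the beginning of the output.
    Characters outside the alphabet map to all-zero vectors, and short
    texts are padded with all-zero vectors up to length l0."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    m = len(alphabet)
    frames = []
    for ch in reversed(text[:l0]):           # backward reading order
        vec = [0] * m
        if ch in index:                      # unknown characters stay all-zero
            vec[index[ch]] = 1
        frames.append(vec)
    frames += [[0] * m for _ in range(l0 - len(frames))]
    return frames
```

In the paper m = 70 and l0 = 1014; a tiny alphabet is enough to see the shape of the output.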
The character quantization order is backward so that the latest reading on characters is always placed near the beginning of the output, making it easy for fully connected layers to associate weights with the latest reading.\nThe alphabet used in all of our models consists of 70 characters, including 26 English letters, 10 digits, 33 other characters and the new line character. The non-space characters are:\n\nabcdefghijklmnopqrstuvwxyz0123456789\n-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}\n\nLater we also compare with models that use a different alphabet in which we distinguish between upper-case and lower-case letters.\n\n2.3 Model Design\n\nWe designed 2 ConvNets \u2013 one large and one small. They are both 9 layers deep with 6 convolutional layers and 3 fully-connected layers. Figure 1 gives an illustration.\n\nFigure 1: Illustration of our model\n\nThe input has a number of features equal to 70 due to our character quantization method, and the input feature length is 1014. It seems that 1014 characters could already capture most of the texts of interest. We also insert 2 dropout [10] modules in between the 3 fully-connected layers to regularize. They have dropout probability of 0.5. Table 1 lists the configurations for convolutional layers, and table 2 lists the configurations for fully-connected (linear) layers.\n\nTable 1: Convolutional layers used in our experiments. The convolutional layers have stride 1 and pooling layers are all non-overlapping ones, so we omit the description of their strides.\n\nLayer  Large Feature  Small Feature  Kernel  Pool\n1      1024           256            7       3\n2      1024           256            7       3\n3      1024           256            3       N/A\n4      1024           256            3       N/A\n5      1024           256            3       N/A\n6      1024           256            3       3\n\nWe initialize the weights using a Gaussian distribution. 
The mean and standard deviation used for initializing the large model is (0, 0.02) and the small model (0, 0.05).\n\nTable 2: Fully-connected layers used in our experiments. The number of output units for the last layer is determined by the problem. For example, for a 10-class classification problem it will be 10.\n\nLayer  Output Units Large  Output Units Small\n7      2048                1024\n8      2048                1024\n9      Depends on the problem\n\nFor different problems the input lengths may be different (for example in our case l0 = 1014), and so are the frame lengths. From our model design, it is easy to see that given input length l0, the output frame length after the last convolutional layer (but before any of the fully-connected layers) is l6 = (l0 \u2212 96)/27. This number multiplied by the frame size at layer 6 will give the input dimension the first fully-connected layer accepts.\n\n2.4 Data Augmentation using Thesaurus\n\nMany researchers have found that appropriate data augmentation techniques are useful for controlling generalization error for deep learning models. These techniques usually work well when we could find appropriate invariance properties that the model should possess. In terms of texts, it is not reasonable to augment the data using signal transformations as done in image or speech recognition, because the exact order of characters may form rigorous syntactic and semantic meaning. Therefore, the best way to do data augmentation would have been using human rephrases of sentences, but this is unrealistic and expensive due to the large volume of samples in our datasets. As a result, the most natural choice in data augmentation for us is to replace words or phrases with their synonyms.\nWe experimented with data augmentation using an English thesaurus, which is obtained from the mytheas component used in the LibreOffice1 project. 
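As a sketch, this synonym-replacement augmentation can be written as below (illustrative Python; `sample_geometric`, `augment` and the synonym-dictionary format are our own hypothetical names, while the truncated geometric sampling of the replacement count r with parameter p and of the synonym index s with parameter q follows the scheme described in this section):

```python
import random

def sample_geometric(p, n):
    """Sample r in {0, ..., n} with P[r] proportional to p**r."""
    weights = [p ** r for r in range(n + 1)]
    u = random.random() * sum(weights)
    for r, w in enumerate(weights):
        u -= w
        if u <= 0:
            return r
    return n

def augment(words, synonyms, p=0.5, q=0.5):
    """Replace a geometrically sampled number r of replaceable words;
    for each, the synonym index is also sampled geometrically, so
    synonyms closer to the most frequent meaning are chosen more often."""
    replaceable = [i for i, w in enumerate(words) if w in synonyms]
    r = sample_geometric(p, len(replaceable))
    out = list(words)
    for i in random.sample(replaceable, r):
        syns = synonyms[out[i]]   # assumed sorted by semantic closeness
        out[i] = syns[sample_geometric(q, len(syns) - 1)]
    return out
```

Each call produces one augmented copy of a sample; the paper reports results with p = 0.5 and q = 0.5.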
That thesaurus in turn was obtained from WordNet [7], where every synonym to a word or phrase is ranked by the semantic closeness to the most frequently seen meaning. To decide on how many words to replace, we extract all replaceable words from the given text and randomly choose r of them to be replaced. The probability of number r is determined by a geometric distribution with parameter p in which P[r] \u223c p^r. The index s of the synonym chosen given a word is also determined by another geometric distribution in which P[s] \u223c q^s. This way, the probability of a synonym being chosen becomes smaller when it moves distant from the most frequently seen meaning. We will report the results using this new data augmentation technique with p = 0.5 and q = 0.5.\n\n3 Comparison Models\n\nTo offer fair comparisons to competitive models, we conducted a series of experiments with both traditional and deep learning methods. We tried our best to choose models that can provide comparable and competitive results, and the results are reported faithfully without any model selection.\n\n3.1 Traditional Methods\n\nWe refer to traditional methods as those that use a hand-crafted feature extractor and a linear classifier. The classifier used is multinomial logistic regression in all these models.\nBag-of-words and its TFIDF. For each dataset, the bag-of-words model is constructed by selecting the 50,000 most frequent words from the training subset. For the normal bag-of-words, we use the counts of each word as the features. For the TFIDF (term-frequency inverse-document-frequency) [14] version, we use the counts as the term-frequency. The inverse document frequency is the logarithm of the division between the total number of samples and the number of samples with the word in the training subset. The features are normalized by dividing by the largest feature value.\nBag-of-ngrams and its TFIDF. 
The bag-of-ngrams models are constructed by selecting the 500,000 most frequent n-grams (up to 5-grams) from the training subset for each dataset. The feature values are computed the same way as in the bag-of-words model.\nBag-of-means on word embedding. We also have an experimental model that uses k-means on word2vec [23] learnt from the training subset of each dataset, and then uses these learnt means as representatives of the clustered words. We take into consideration all the words that appeared more than 5 times in the training subset. The dimension of the embedding is 300. The bag-of-means features are computed the same way as in the bag-of-words model. The number of means is 5000.\n\n3.2 Deep Learning Methods\n\nRecently deep learning methods have started to be applied to text classification. We choose two simple and representative models for comparison, in which one is a word-based ConvNet and the other a simple long short-term memory (LSTM) [11] recurrent neural network model.\nWord-based ConvNets. Among the large number of recent works on word-based ConvNets for text classification, one of the differences is the choice of using pretrained or end-to-end learned word representations. We offer comparisons with both using the pretrained word2vec [23] embedding [16] and using lookup tables [5]. The embedding size is 300 in both cases, in the same way as our bag-of-means model. To ensure fair comparison, the models for each case are of the same size as our character-level ConvNets, in terms of both the number of layers and each layer's output size. Experiments using a thesaurus for data augmentation are also conducted.\n\n1http://www.libreoffice.org/\n\nLong short-term memory. We also offer a comparison with a recurrent neural network model, namely long short-term memory (LSTM) [11]. The LSTM model used in our case is word-based, using pretrained word2vec embedding of size 300 as in previous models. 
The model is formed by taking the mean of the outputs of all LSTM cells to form a feature vector, and then using multinomial logistic regression on this feature vector. The output dimension is 512. The variant of LSTM we used is the common \u201cvanilla\u201d architecture [8] [9]. We also used gradient clipping [25] in which the gradient norm is limited to 5. Figure 2 gives an illustration.\n\nFigure 2: Long short-term memory\n\n3.3 Choice of Alphabet\n\nFor the alphabet of English, one apparent choice is whether to distinguish between upper-case and lower-case letters. We report experiments on this choice and observed that it usually (but not always) gives worse results when such a distinction is made. One possible explanation might be that semantics do not change with different letter cases, therefore there is a benefit of regularization.\n\n4 Large-scale Datasets and Results\n\nPrevious research on ConvNets in different areas has shown that they usually work well with large-scale datasets, especially when the model takes in low-level raw features like characters in our case. However, most open datasets for text classification are quite small, and large-scale datasets are split with a significantly smaller training set than testing [21]. Therefore, instead of confusing our community more by using them, we built several large-scale datasets for our experiments, ranging from hundreds of thousands to several millions of samples. Table 3 is a summary.\n\nTable 3: Statistics of our large-scale datasets. Epoch size is the number of minibatches in one epoch\n\nDataset                 Classes  Train Samples  Test Samples  Epoch Size\nAG's News               4        120,000        7,600         5,000\nSogou News              5        450,000        60,000        5,000\nDBPedia                 14       560,000        70,000        5,000\nYelp Review Polarity    2        560,000        38,000        5,000\nYelp Review Full        5        650,000        50,000        5,000\nYahoo! Answers          10       1,400,000      60,000        10,000\nAmazon Review Full      5        3,000,000      650,000       30,000\nAmazon Review Polarity  2        3,600,000      400,000       30,000\n\nAG's news corpus. We obtained AG's corpus of news articles on the web2. It contains 496,835 categorized news articles from more than 2000 news sources. We chose the 4 largest classes from this corpus to construct our dataset, using only the title and description fields. The number of training samples for each class is 30,000 and testing 1,900.\nSogou news corpus. This dataset is a combination of the SogouCA and SogouCS news corpora [32], containing in total 2,909,551 news articles in various topic channels. We then labeled each piece of news using its URL, by manually classifying their domain names. This gives us a large corpus of news articles labeled with their categories. There are a large number of categories but most of them contain only a few articles. We chose 5 categories \u2013 \u201csports\u201d, \u201cfinance\u201d, \u201centertainment\u201d, \u201cautomobile\u201d and \u201ctechnology\u201d. The number of training samples selected for each class is 90,000 and testing 12,000. Although this is a dataset in Chinese, we used the pypinyin package combined with the jieba Chinese segmentation system to produce Pinyin \u2013 a phonetic romanization of Chinese. The models for English can then be applied to this dataset without change. The fields used are title and content.\n\n2http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html\n\nTable 4: Testing errors of all the models. Numbers are in percentage. \u201cLg\u201d stands for \u201clarge\u201d and \u201cSm\u201d stands for \u201csmall\u201d. 
\u201cw2v\u201d is an abbreviation for \u201cword2vec\u201d, and \u201cLk\u201d for \u201clookup table\u201d. \u201cTh\u201d stands for thesaurus. ConvNets labeled \u201cFull\u201d are those that distinguish between lower and upper letters\n\nModel               AG     Sogou  DBP.  Yelp P.  Yelp F.  Yah. A.  Amz. F.  Amz. P.\nBoW                 11.19  7.15   3.39  7.76     42.01    31.11    45.36    9.60\nBoW TFIDF           10.36  6.55   2.63  6.34     40.14    28.96    44.74    9.00\nngrams              7.96   2.92   1.37  4.36     43.74    31.53    45.73    7.98\nngrams TFIDF        7.64   2.81   1.31  4.56     45.20    31.49    47.56    8.46\nBag-of-means        16.91  10.79  9.55  12.67    47.46    39.45    55.87    18.39\nLSTM                13.94  4.82   1.45  5.26     41.83    29.16    40.57    6.10\nLg. w2v Conv.       9.92   4.39   1.42  4.60     40.16    31.97    44.40    5.88\nSm. w2v Conv.       11.35  4.54   1.71  5.56     42.13    31.50    42.59    6.00\nLg. w2v Conv. Th.   9.91   -      1.37  4.63     39.58    31.23    43.75    5.80\nSm. w2v Conv. Th.   10.88  -      1.53  5.36     41.09    29.86    42.50    5.63\nLg. Lk. Conv.       8.55   4.95   1.72  4.89     40.52    29.06    45.95    5.84\nSm. Lk. Conv.       10.87  4.93   1.85  5.54     41.41    30.02    43.66    5.85\nLg. Lk. Conv. Th.   8.93   -      1.58  5.03     40.52    28.84    42.39    5.52\nSm. Lk. Conv. Th.   9.12   -      1.77  5.37     41.17    28.92    43.19    5.51\nLg. Full Conv.      9.85   8.80   1.66  5.25     38.40    29.90    40.89    5.78\nSm. Full Conv.      11.59  8.95   1.89  5.67     38.82    30.01    40.88    5.78\nLg. Full Conv. Th.  9.51   -      1.55  4.88     38.04    29.58    40.54    5.51\nSm. Full Conv. Th.  10.89  -      1.69  5.42     37.95    29.90    40.53    5.66\nLg. Conv.           12.82  4.88   1.73  5.89     39.62    29.55    41.31    5.51\nSm. Conv.           15.65  8.65   1.98  6.53     40.84    29.84    40.53    5.50\nLg. Conv. Th.       13.39  -      1.60  5.82     39.30    28.80    40.45    4.93\nSm. Conv. Th.       14.80  -      1.85  6.49     40.16    29.84    40.43    5.67\n\nDBPedia ontology dataset. DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia [19]. 
The DBpedia ontology dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014. From each of these 14 ontology classes, we randomly choose 40,000 training samples and 5,000 testing samples. The fields we used for this dataset contain the title and abstract of each Wikipedia article.\nYelp reviews. The Yelp reviews dataset is obtained from the Yelp Dataset Challenge in 2015. This dataset contains 1,569,264 samples that have review texts. Two classification tasks are constructed from this dataset \u2013 one predicting the full number of stars the user has given, and the other predicting a polarity label by considering stars 1 and 2 negative, and 3 and 4 positive. The full dataset has 130,000 training samples and 10,000 testing samples in each star, and the polarity dataset has 280,000 training samples and 19,000 test samples in each polarity.\nYahoo! Answers dataset. We obtained the Yahoo! Answers Comprehensive Questions and Answers version 1.0 dataset through the Yahoo! Webscope program. The corpus contains 4,483,032 questions and their answers. We constructed a topic classification dataset from this corpus using the 10 largest main categories. Each class contains 140,000 training samples and 5,000 testing samples. The fields we used include question title, question content and best answer.\nAmazon reviews. We obtained an Amazon review dataset from the Stanford Network Analysis Project (SNAP), which spans 18 years with 34,686,770 reviews from 6,643,669 users on 2,441,053 products [22]. Similarly to the Yelp review dataset, we also constructed 2 datasets \u2013 one full score prediction and another polarity prediction. The full dataset contains 600,000 training samples and 130,000 testing samples in each class, whereas the polarity dataset contains 1,800,000 training samples and 200,000 testing samples in each polarity sentiment. 
The fields used are review title and review content.\nTable 4 lists all the testing errors we obtained from these datasets for all the applicable models. Note that since we do not have a Chinese thesaurus, the Sogou News dataset does not have any results using thesaurus augmentation. We labeled the best result in blue and the worst result in red.\n\n5 Discussion\n\nFigure 3: Relative errors with comparison models. (a) Bag-of-means; (b) n-grams TFIDF; (c) LSTM; (d) word2vec ConvNet; (e) Lookup table ConvNet; (f) Full alphabet ConvNet\n\nTo understand the results in table 4 further, we offer some empirical analysis in this section. To facilitate our analysis, we present the relative errors in figure 3 with respect to comparison models. Each of these plots is computed by taking the difference between the errors of the comparison model and our character-level ConvNet model, then divided by the comparison model error. All ConvNets in the figure are the large models with thesaurus augmentation.\nCharacter-level ConvNet is an effective method. The most important conclusion from our experiments is that character-level ConvNets could work for text classification without the need for words. This is a strong indication that language could also be thought of as a signal no different from any other kind. Figure 4 shows 12 random first-layer patches learnt by one of our character-level ConvNets for the DBPedia dataset.\n\nFigure 4: First layer weights. For each patch, height is the kernel size and width the alphabet size\n\nDataset size forms a dichotomy between traditional and ConvNets models. The most obvious trend coming from all the plots in figure 3 is that character-level ConvNets tend to do better as datasets grow larger. 
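For reference, the relative error plotted in figure 3 is simply the following quantity (a one-line illustrative computation; the numbers in the test call are made up, not taken from table 4):

```python
def relative_error(comparison_err, convnet_err):
    """(comparison - ours) / comparison, as used for the figure 3 plots;
    positive values mean the character-level ConvNet has lower error."""
    return (comparison_err - convnet_err) / comparison_err
```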
Traditional methods like n-grams TFIDF remain strong candidates for datasets of size up to several hundred thousand, and only when the dataset goes to the scale of several million do we observe that character-level ConvNets start to do better.\nConvNets may work well for user-generated data. User-generated data vary in the degree of how well the texts are curated. For example, in our million-scale datasets, Amazon reviews tend to be raw user inputs, whereas users might be extra careful in their writings on Yahoo! Answers. Plots comparing word-based deep models (figures 3c, 3d and 3e) show that character-level ConvNets work better for less curated user-generated texts. This property suggests that ConvNets may have better applicability to real-world scenarios. However, further analysis is needed to validate the hypothesis that ConvNets are truly good at identifying exotic character combinations such as misspellings and emoticons, as our experiments alone do not show any explicit evidence.\nChoice of alphabet makes a difference. Figure 3f shows that changing the alphabet by distinguishing between uppercase and lowercase letters could make a difference. For million-scale datasets, it seems that not making such a distinction usually works better. One possible explanation is that there is a regularization effect, but this is to be validated.\nSemantics of tasks may not matter. Our datasets consist of two kinds of tasks: sentiment analysis (Yelp and Amazon reviews) and topic classification (all others). 
This dichotomy in task semantics does not seem to play a role in deciding which method is better.\nBag-of-means is a misuse of word2vec [20]. One of the most obvious facts one could observe from table 4 and figure 3a is that the bag-of-means model performs worse in every case. Compared with traditional models, this suggests that such a simple use of a distributed word representation may not give us an advantage for text classification. However, our experiments do not speak to any other language processing tasks or uses of word2vec in any other way.\nThere is no free lunch. Our experiments once again verify that there is not a single machine learning model that can work for all kinds of datasets. The factors discussed in this section could all play a role in deciding which method is the best for some specific application.\n\n6 Conclusion and Outlook\n\nThis article offers an empirical study on character-level convolutional networks for text classification. We compared with a large number of traditional and deep learning models using several large-scale datasets. On one hand, analysis shows that character-level ConvNet is an effective method. On the other hand, how well our model performs in comparisons depends on many factors, such as dataset size, whether the texts are curated, and choice of alphabet.\nIn the future, we hope to apply character-level ConvNets to a broader range of language processing tasks, especially when structured outputs are needed.\n\nAcknowledgement\n\nWe gratefully acknowledge the support of NVIDIA Corporation with the donation of 2 Tesla K40 GPUs used for this research. We gratefully acknowledge the support of Amazon.com Inc for an AWS in Education Research grant used for this research.\n\nReferences\n\n[1] L. Bottou, F. Fogelman Souli\u00e9, P. Blanchet, and J. Lienard. Experiments with time delay networks and dynamic time warping for speaker independent isolated digit recognition. 
In Proceedings of EuroSpeech 89, volume 2, pages 537\u2013540, Paris, France, 1989.\n\n[2] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2559\u20132566. IEEE, 2010.\n\n[3] Y.-L. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in visual recognition. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 111\u2013118, 2010.\n\n[4] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.\n\n[5] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493\u20132537, Nov. 2011.\n\n[6] C. dos Santos and M. Gatti. Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 69\u201378, Dublin, Ireland, August 2014. Dublin City University and Association for Computational Linguistics.\n\n[7] C. Fellbaum. Wordnet and wordnets. In K. Brown, editor, Encyclopedia of Language and Linguistics, pages 665\u2013670, Oxford, 2005. Elsevier.\n\n[8] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602\u2013610, 2005.\n\n[9] K. Greff, R. K. Srivastava, J. Koutn\u00edk, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. CoRR, abs/1503.04069, 2015.\n\n[10] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.\n\n[11] S. Hochreiter and J. Schmidhuber. 
Long short-term memory. Neural Comput., 9(8):1735\u20131780, Nov. 1997.\n\n[12] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, pages 137\u2013142. Springer-Verlag, 1998.\n\n[13] R. Johnson and T. Zhang. Effective use of word order for text categorization with convolutional neural networks. CoRR, abs/1412.1058, 2014.\n\n[14] K. S. Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11\u201321, 1972.\n\n[15] I. Kanaris, K. Kanaris, I. Houvardas, and E. Stamatatos. Words versus character n-grams for anti-spam filtering. International Journal on Artificial Intelligence Tools, 16(06):1047\u20131067, 2007.\n\n[16] Y. Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746\u20131751, Doha, Qatar, October 2014. Association for Computational Linguistics.\n\n[17] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Back-propagation applied to handwritten zip code recognition. Neural Computation, 1(4):541\u2013551, Winter 1989.\n\n[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, November 1998.\n\n[19] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia \u2013 a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal, 2014.\n\n[20] G. Lev, B. Klein, and L. Wolf. In defense of word embedding for generic text representation. In C. Biemann, S. Handschuh, A. Freitas, F. Meziane, and E. 
M\u00e9tais, editors, Natural Language Processing and Information Systems, volume 9103 of Lecture Notes in Computer Science, pages 35\u201350. Springer International Publishing, 2015.\n\n[21] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5:361\u2013397, 2004.\n\n[22] J. McAuley and J. Leskovec. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys \u201913, pages 165\u2013172, New York, NY, USA, 2013. ACM.\n\n[23] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111\u20133119. 2013.\n\n[24] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807\u2013814, 2010.\n\n[25] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML 2013, volume 28 of JMLR Proceedings, pages 1310\u20131318. JMLR.org, 2013.\n\n[26] B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1\u201317, 1964.\n\n[27] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533\u2013536, 1986.\n\n[28] C. D. Santos and B. Zadrozny. Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1818\u20131826, 2014.\n\n[29] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. 
A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pages 101\u2013110. ACM, 2014.\n\n[30] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton. On the importance of initialization and momentum in deep learning. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1139\u20131147. JMLR Workshop and Conference Proceedings, May 2013.\n\n[31] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang. Phoneme recognition using time-delay neural networks. Acoustics, Speech and Signal Processing, IEEE Transactions on, 37(3):328\u2013339, 1989.\n\n[32] C. Wang, M. Zhang, S. Ma, and L. Ru. Automatic online news issue construction in web environment. In Proceedings of the 17th International Conference on World Wide Web, WWW \u201908, pages 457\u2013466, New York, NY, USA, 2008. ACM.\n", "award": [], "sourceid": 456, "authors": [{"given_name": "Xiang", "family_name": "Zhang", "institution": "New York University"}, {"given_name": "Junbo", "family_name": "Zhao", "institution": "New York University"}, {"given_name": "Yann", "family_name": "LeCun", "institution": "New York University"}]}