{"title": "Hash Embeddings for Efficient Word Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 4928, "page_last": 4936, "abstract": "We present hash embeddings, an efficient method for representing words in a continuous vector form. A hash embedding may be seen as an interpolation between a standard word embedding and a word embedding created using a random hash function (the hashing trick). In hash embeddings each token is represented by $k$ $d$-dimensional embedding vectors and one $k$-dimensional weight vector. The final $d$-dimensional representation of the token is the product of the two. Rather than fitting the embedding vectors for each token these are selected by the hashing trick from a shared pool of $B$ embedding vectors. Our experiments show that hash embeddings can easily deal with huge vocabularies consisting of millions of tokens. When using a hash embedding there is no need to create a dictionary before training nor to perform any kind of vocabulary pruning after training. We show that models trained using hash embeddings exhibit at least the same level of performance as models trained using regular embeddings across a wide range of tasks. Furthermore, the number of parameters needed by such an embedding is only a fraction of what is required by a regular embedding. 
Since standard embeddings and embeddings constructed using the hashing trick are actually just special cases of a hash embedding, hash embeddings can be considered an extension and improvement over the existing regular embedding types.", "full_text": "Hash Embeddings for Efficient Word Representations\n\nDan Svenstrup\n\nDepartment for Applied Mathematics and Computer Science\n\nTechnical University of Denmark (DTU)\n\n2800 Lyngby, Denmark\n\ndsve@dtu.dk\n\nJonas Meinertz Hansen\n\nFindZebra\n\nCopenhagen, Denmark\njonas@findzebra.com\n\nOle Winther\n\nDepartment for Applied Mathematics and Computer Science\n\nTechnical University of Denmark (DTU)\n\n2800 Lyngby, Denmark\n\nolwi@dtu.dk\n\nAbstract\n\nWe present hash embeddings, an efficient method for representing words in a continuous vector form. A hash embedding may be seen as an interpolation between a standard word embedding and a word embedding created using a random hash function (the hashing trick). In hash embeddings each token is represented by k d-dimensional embedding vectors and one k-dimensional weight vector. The final d-dimensional representation of the token is the product of the two. Rather than fitting the embedding vectors for each token these are selected by the hashing trick from a shared pool of B embedding vectors. Our experiments show that hash embeddings can easily deal with huge vocabularies consisting of millions of tokens. When using a hash embedding there is no need to create a dictionary before training nor to perform any kind of vocabulary pruning after training. We show that models trained using hash embeddings exhibit at least the same level of performance as models trained using regular embeddings across a wide range of tasks. Furthermore, the number of parameters needed by such an embedding is only a fraction of what is required by a regular embedding. 
Since standard embeddings and embeddings constructed using the hashing trick are actually just special cases of a hash embedding, hash embeddings can be considered an extension and improvement over the existing regular embedding types.\n\n1 Introduction\n\nContemporary neural networks rely on loss functions that are continuous in the model\u2019s parameters in order to be able to compute gradients for training. For this reason, any data that we wish to feed through the network, even data that is of a discrete nature in its original form, will be translated into a continuous form. For textual input it often makes sense to represent each distinct word or phrase with a dense real-valued vector in R^n. These word vectors are trained either jointly with the rest of the model, or pre-trained on a large corpus beforehand.\nFor large datasets the size of the vocabulary can easily be in the order of hundreds of thousands, adding millions or even billions of parameters to the model. This problem can be especially severe when n-grams are allowed as tokens in the vocabulary. For example, the pre-trained Word2Vec vectors from Google (Mih\u00e1ltz, 2016) have a vocabulary consisting of 3 million words and phrases. This means that even though the embedding size is moderately small (300 dimensions), the total number of parameters is close to one billion.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThe embedding size problem caused by a large vocabulary can be solved in several ways. Each of the methods has some advantages and some drawbacks:\n\n1. Ignore infrequent words. 
In many cases, the majority of a text is made up of a small subset of the vocabulary, and most words will only appear very few times (Zipf\u2019s law (Manning et al., 1999)). By ignoring anything but the most frequent words, and sometimes stop words as well, it is possible to preserve most of the text while drastically reducing the number of embedding vectors and parameters. However, for any given task, there is a risk of removing too much or too little. Many frequent words (besides stop words) are unimportant and sometimes even stop words can be of value for a particular task (e.g. a typical stop word such as \u201cand\u201d when training a model on a corpus of texts about logic). Conversely, for some problems (e.g. specialized domains such as medical search) rare words might be very important.\n\n2. Remove non-discriminative tokens after training. For some models it is possible to perform efficient feature pruning based on e.g. entropy (Stolcke, 2000) or by only retaining the K tokens with highest norm (Joulin et al., 2016a). This reduction in vocabulary size can lead to a decrease in performance, but in some cases it actually avoids some overfitting and increases performance (Stolcke, 2000). For many models, however, such pruning is not possible (e.g. for on-line training algorithms).\n\n3. Compress the embedding vectors. Lossy compression techniques can be employed to reduce the amount of memory needed to store embedding vectors. One such method is quantization, where each vector is replaced by an approximation which is constructed as a sum of vectors from a previously determined set of centroids (Joulin et al., 2016a; Jegou et al., 2011; Gray and Neuhoff, 1998).\n\nFor some problems, such as online learning, the need for creating a dictionary before training can be a nuisance. This is often solved with feature hashing, where a hash function is used to assign each token w \u2208 T to one of a fixed set of \u201cbuckets\u201d {1, 2, . 
. . , B}, each of which has its own embedding vector. Since the goal of hashing is to reduce the dimensionality of the token space T, we normally have that B \u226a |T|. This results in many tokens \u201ccolliding\u201d with each other because they are assigned to the same bucket. When multiple tokens collide, they will get the same vector representation which prevents the model from distinguishing between the tokens. Even though some information is lost when tokens collide, the method often works surprisingly well in practice (Weinberger et al., 2009).\nOne obvious improvement to the feature hashing method described above would be to learn an optimal hash function where important tokens do not collide. However, since a hash function has a discrete codomain, it is not easy to optimize using e.g. gradient based methods used for training neural networks (Kulis and Darrell, 2009).\nThe method proposed in this article is an extension of feature hashing where we use k hash functions instead of a single hash function, and then use k trainable parameters for each word in order to choose the \u201cbest\u201d hash function for the tokens (or actually the best combination of hash functions). We call the resulting embedding a hash embedding. As we explain in section 3, embeddings constructed by both feature hashing and standard embeddings can be considered special cases of hash embeddings.\nA hash embedding is an efficient hybrid between a standard embedding and an embedding created using feature hashing, i.e. 
a hash embedding has all of the advantages of the methods described above, but none of the disadvantages:\n\n\u2022 When using hash embeddings there is no need for creating a dictionary beforehand and the method can handle a dynamically expanding vocabulary.\n\u2022 A hash embedding has a mechanism capable of implicit vocabulary pruning.\n\u2022 Hash embeddings are based on hashing but have a trainable mechanism that can handle problematic collisions.\n\u2022 Hash embeddings perform something similar to product quantization. But instead of all of the tokens sharing a single small codebook, each token has access to a few elements in a very large codebook.\n\nUsing a hash embedding typically results in a reduction of parameters of several orders of magnitude. Since the bulk of the model parameters often resides in the embedding layer, this reduction of parameters opens up for e.g. a wider use of ensemble methods or a larger dimensionality of word vectors.\n\n[Figure 1 diagram columns: input token, hash functions, component vectors, importance parameters, hash vector.]\n\nFigure 1: Illustration of how to build the hash vector for the word \u201chorse\u201d. The optional step of concatenating the vector of importance parameters to \u02c6e_\u201chorse\u201d has been omitted. The size of component vectors in the illustration is d = 4.\n\n2 Related Work\n\nArgerich et al. (2016) proposed a type of embedding that is based on hashing and word co-occurrence and demonstrated that correlations between those embedding vectors correspond to the subjective judgement of word similarity by humans. Ultimately, it is a clever reduction in the embedding sizes of word co-occurrence based embeddings.\nReisinger and Mooney (2010) and since then Huang et al. 
(2012) have used multiple different word embeddings (prototypes) for the same words for representing different possible meanings of the same words. Conversely, Bai et al. (2009) have experimented with hashing and treating words that co-occur frequently as the same feature in order to reduce dimensionality.\nHuang et al. (2013) have used bags of either bi-grams or tri-grams of letters of input words to create feature vectors that are somewhat robust to new words and minor spelling differences.\nAnother approach employed by Zhang et al. (2015); Xiao and Cho (2016); Conneau et al. (2016) is to use inputs that represent sub-word units such as syllables or individual characters rather than words. This generally moves the task of finding meaningful representations of the text from the input embeddings into the model itself and increases the computational cost of running the models (Johnson and Zhang, 2016). Johansen et al. (2016) used a hierarchical encoding technique to do machine translation with character inputs while keeping computational costs low.\n\n3 Hash Embeddings\n\nIn the following we will go through the step by step construction of a vector representation for a token w \u2208 T using hash embeddings. The following steps are also illustrated in fig. 1:\n\n1. Use k different functions H_1, . . . , H_k to choose k component vectors for the token w from a predefined pool of B shared component vectors.\n\n2. Combine the chosen component vectors from step 1 as a weighted sum: \u02c6e_w = \u2211_{i=1}^{k} p_w^i H_i(w). p_w = (p_w^1, . . . , p_w^k)^\u22a4 \u2208 R^k are called the importance parameters for w.\n\n3. Optional: The vector of importance parameters for the token p_w can be concatenated with \u02c6e_w in order to construct the final hash vector e_w.\n\nThe full translation of a token to a hash vector can be written in vector notation (\u2295 denotes the concatenation operator):\n\nc_w = (H_1(w), H_2(w), . . . , H_k(w))^\u22a4\np_w = (p_w^1, . . . , p_w^k)^\u22a4\n\u02c6e_w = p_w^\u22a4 c_w\ne_w^\u22a4 = \u02c6e_w^\u22a4 \u2295 p_w^\u22a4 (optional)\n\nThe token to component vector functions H_i are implemented by H_i(w) = E_{D_2(D_1(w))}, where\n\n\u2022 D_1 : T \u2192 {1, . . . , K} is a token to id function.\n\u2022 D_2 : {1, . . . , K} \u2192 {1, . . . , B} is an id to bucket (hash) function.\n\u2022 E is a B \u00d7 d matrix.\n\nIf creating a dictionary beforehand is not a problem, we can use an enumeration (dictionary) of the tokens as D_1. If, on the other hand, it is inconvenient (or impossible) to use a dictionary because of the size of T, we can simply use a hash function D_1 : T \u2192 {1, . . . , K}.\nThe importance parameter vectors p_w are represented as rows in a K \u00d7 k matrix P, and the token to importance vector mapping is implemented by w \u2192 P_{\u02c6D(w)}. \u02c6D(w) can be either equal to D_1, or we can use a different hash function. In the rest of the article we will use \u02c6D = D_1, and leave the case where \u02c6D \u2260 D_1 to future work.\nBased on the description above we see that the construction of hash embeddings requires the following:\n\n1. A trainable embedding matrix E of size B \u00d7 d, where each of the B rows is a component vector of length d.\n\n2. A trainable matrix P of importance parameters of size K \u00d7 k where each of the K rows is a vector of k scalar importance parameters.\n\n3. k different hash functions H_1, . . . , H_k that each uniformly assigns one of the B component vectors to each token w \u2208 T.\n\nThe total number of trainable parameters in a hash embedding is thus equal to B \u00b7 d + K \u00b7 k, which should be compared to a standard embedding where the number of trainable parameters is K \u00b7 d. 
The number of hash functions k and buckets B can typically be chosen quite small without degrading performance, and this is what can give a huge reduction in the number of parameters (we typically use k = 2 and choose K and B s.t. K > 10 \u00b7 B).\nFrom the description above we also see that the computational overhead of using hash embeddings instead of standard embeddings is just a matrix multiplication of a 1 \u00d7 k matrix (importance parameters) with a k \u00d7 d matrix (component vectors). When using small values of k, the computational overhead is therefore negligible. In our experiments, hash embeddings were actually marginally faster to train than standard embedding types for large vocabulary problems (see footnote 1). However, since the embedding layer is responsible for only a negligible fraction of the computational complexity of most models, using hash embeddings instead of regular embeddings should not make any difference for most models. Furthermore, when using hash embeddings it is not necessary to create a dictionary before training nor to perform vocabulary pruning after training. This can also reduce the total training time.\nNote that in the special case where the number of hash functions is k = 1, and all importance parameters are fixed to p_w^1 = 1 for all tokens w \u2208 T, hash embeddings are equivalent to using the hashing trick. If furthermore the number of component vectors is set to B = |T| and the hash function h_1(w) is the identity function, hash embeddings are equivalent to standard embeddings.\n\nFootnote 1: The small performance difference was observed when using Keras with a TensorFlow backend on a GeForce GTX TITAN X with 12 GB of memory and a Nvidia GeForce GTX 660 with 2 GB of memory. The performance penalty when using standard embeddings for large vocabulary problems can possibly be avoided by using a custom embedding layer, but we have not pursued this further.\n\n4 Hashing theory\n\nTheorem 4.1. Let h : T \u2192 {0, . . . 
, K} be a hash function. Then the probability p_col that w_0 \u2208 T collides with one or more other tokens is given by\n\np_col = 1 \u2212 (1 \u2212 1/K)^{|T|\u22121}. (1)\n\nFor large K we have the approximation\n\np_col \u2248 1 \u2212 e^{\u2212|T|/K}. (2)\n\nThe expected number of tokens in collision C_tot is given by\n\nC_tot = |T| p_col. (3)\n\nProof. This is a simple variation of the \u201cbirthday problem\u201d.\n\nWhen using hashing for dimensionality reduction, collisions are unavoidable, which is the main disadvantage of feature hashing. This is counteracted by hash embeddings in two ways:\nFirst of all, for choosing the component vectors for a token w \u2208 T, hash embeddings use k independent uniform hash functions h_i : T \u2192 {1, . . . , B} for i = 1, . . . , k. The combination of multiple hash functions approximates a single hash function with much larger range h : T \u2192 {1, . . . , B^k}, which drastically reduces the risk of total collisions. With a vocabulary of |T| = 100M, B = 1M different component vectors and just k = 2 instead of 1, the chance of a given token colliding with at least one other token in the vocabulary is reduced from approximately 1 \u2212 exp(\u221210^8/10^6) \u2248 1 to approximately 1 \u2212 exp(\u221210^8/10^12) \u2248 0.0001. Using more hash functions will further reduce the number of collisions.\nSecond, only a small number of the tokens in the vocabulary are usually important for the task at hand. The purpose of the importance parameters is to implicitly prune unimportant words by setting their importance parameters close to 0. Applying equations 2 and 3 to the important tokens only, this would reduce the expected number of collisions to |T_imp| \u00b7 (1 \u2212 exp(\u2212|T_imp|/B)), where T_imp \u2282 T is the set of important words for the given task. 
The weighting of the component vectors will further be able to separate the colliding tokens in the k-dimensional subspace spanned by their k d-dimensional embedding vectors.\nNote that hash embeddings consist of two layers of hashing. In the first layer each token is simply translated to an integer in {1, . . . , K} by a dictionary or a hash function D_1. If D_1 is a dictionary, there will of course not be any collisions in the first layer. If D_1 is a random hash function then the expected number of tokens in collision will be given by equation 3. These collisions cannot be avoided, and the expected number of collisions can only be decreased by increasing K. Increasing the vocabulary size by 1 introduces d parameters in standard embeddings and only k in hash embeddings. The typical d ranges from 10 to 300, and k is in the range 1-3. This means that even when the embedding size is kept small, the parameter savings can be huge. In (Joulin et al., 2016b) for example, the embedding size is chosen to be as small as 10. In order to go from a bi-gram model to a general n-gram model the number of buckets is increased from K = 10^7 to K = 10^8. This increase of buckets requires an additional 900 million parameters when using standard embeddings, but less than 200 million when using hash embeddings with the default of k = 2 hash functions. I.e. even when the embedding size is kept extremely small, the parameter savings can be huge.\n\n5 Experiments\n\nWe benchmark hash embeddings with and without dictionaries on text classification tasks.\n\n5.1 Data and preprocessing\n\nWe evaluate hash embeddings on 7 different datasets in the form introduced by Zhang et al. (2015) for various text classification tasks including topic classification, sentiment analysis, and news categorization. All of the datasets are balanced so the samples are distributed evenly among the classes. An overview of the datasets can be seen in table 1. 
Significant previous results are listed in table 3. We use the same experimental protocol as in (Zhang et al., 2015).\nWe do not perform any preprocessing besides removing punctuation. The models are trained on snippets of text that are created by first converting each text to a sequence of n-grams, and from this list a training sample is created by randomly selecting between 4 and 100 consecutive n-grams as input. This may be seen as input drop-out and helps the model avoid overfitting. When testing we use the entire document as input. The snippet/document-level embedding is obtained by simply adding up the word-level embeddings.\n\nTable 1: Datasets used in the experiments. See (Zhang et al., 2015) for a complete description.\n\nDataset | #Train | #Test | #Classes | Task\nAG\u2019s news | 120k | 7.6k | 4 | English news categorization\nDBPedia | 450k | 70k | 14 | Ontology classification\nYelp Review Polarity | 560k | 38k | 2 | Sentiment analysis\nYelp Review Full | 560k | 50k | 5 | Sentiment analysis\nYahoo! Answers | 650k | 60k | 10 | Topic classification\nAmazon Review Full | 3000k | 650k | 5 | Sentiment analysis\nAmazon Review Polarity | 3600k | 400k | 2 | Sentiment analysis\n\n5.2 Training\n\nAll the models are trained by minimizing the cross entropy using the stochastic gradient descent-based Adam method (Kingma and Ba, 2014) with a learning rate set to \u03b1 = 0.001. We use early stopping with a patience of 10, and use 5% of the training data as validation data. All models were implemented using Keras with TensorFlow backend. The training was performed on a Nvidia GeForce GTX TITAN X with 12 GB of memory.\n\n5.3 Hash embeddings without a dictionary\n\nIn this experiment we compare the use of a standard hashing trick embedding with a hash embedding. The hash embeddings use K = 10M different importance parameter vectors, k = 2 hash functions, and B = 1M component vectors of dimension d = 20. This adds up to 40M parameters for the hash embeddings. 
For the standard hashing trick embeddings, we use an architecture almost identical to the one used in (Joulin et al., 2016b). As in (Joulin et al., 2016b) we only consider bi-grams. We use one layer of hashing with 10M buckets and an embedding size of 20. This requires 200M parameters. The document-level embedding input is passed through a single fully connected layer with softmax activation.\nThe performance of the model when using each of the two embedding types can be seen in the left side of table 2. We see that even though hash embeddings require 5 times fewer parameters compared to standard embeddings, they perform at least as well as standard embeddings across all of the datasets, except for DBPedia where standard embeddings perform a tiny bit better.\n\n5.4 Hash embeddings using a dictionary\n\nIn this experiment we limit the vocabulary to the 1M most frequent n-grams for n < 10. Most of the tokens are uni-grams and bi-grams, but also many tokens of higher order are present in the vocabulary. We use embedding vectors of size d = 200. The hash embeddings use k = 2 hash functions and the number of buckets B is chosen by cross-validation among [500, 10K, 50K, 100K, 150K]. The maximum number of words for the standard embeddings is chosen by cross-validation among [10K, 25K, 50K, 300K, 500K, 1M]. We use a more complex architecture than in the experiment above, consisting of an embedding layer (standard or hash) followed by three dense layers with 1000 hidden units and ReLU activations, ending in a softmax layer. We use batch normalization (Ioffe and Szegedy, 2015) as regularization between all of the layers.\nThe parameter savings for this problem are not as great as in the experiment without a dictionary, but the hash embeddings still use 3 times fewer parameters on average compared to a standard embedding.\nAs can be seen in table 2 the more complex models actually achieve a worse result than the simple model described above. 
This could be caused by either an insufficient number of words in the vocabulary or by overfitting. Note however, that the two models have access to the same vocabulary, and the vocabulary can therefore only explain the general drop in performance, not the performance difference between the two types of embedding. This seems to suggest that using hash embeddings has a regularizing effect on performance.\nWhen using a dictionary in the first layer of hashing, each vector of importance parameters will correspond directly to a unique phrase. In table 4 we see the phrases corresponding to the largest/smallest (absolute) importance values. As we would expect, large absolute values of the importance parameters correspond to important phrases. Also note that some of the n-grams contain information that e.g. the bi-gram model above would not be able to capture. For example, the bi-gram model would not be able to tell whether 4 or 5 stars had been given based on the sentence \u201cI gave it 4 stars instead of 5 stars\u201d, but the general n-gram model would.\n\n5.5 Ensemble of hash embeddings\n\nThe number of buckets for a hash embedding can be chosen quite small without severely affecting performance. B = 500 to 10,000 buckets is typically sufficient in order to obtain a performance almost on par with the best results. In the experiments using a dictionary only about 3M parameters are required in the layers on top of the embedding, while kK + Bd = 2M + B \u00d7 200 are required in the embedding itself. This means that we can choose to train an ensemble of models with small bucket sizes instead of a large model, while at the same time using the same amount of parameters (and the same training time since models can be trained in parallel). 
Using an ensemble is particularly useful for hash embeddings: even though collisions are handled effectively by the word importance parameters, there is still a possibility that a few of the important words have to use suboptimal embedding vectors. When using several models in an ensemble this can more or less be avoided since different hash functions can be chosen for each hash embedding in the ensemble.\nWe use an ensemble consisting of 10 models and combine the models using soft voting. Each model uses B = 50,000 and d = 200. The architecture is the same as in the previous section except that models with one to three hidden layers are used instead of just ten models with three hidden layers. This was done in order to diversify the models. The total number of parameters in the ensemble is approximately 150M. This should be compared to both the standard embedding model in section 5.3 and the standard embedding model in section 5.4 (when using the full vocabulary), both of which require \u2248 200M parameters.\n\nTable 2: Test accuracy (in %) for the selected datasets. The first two value columns are without a dictionary (shallow network, section 5.3); the last three are with a dictionary (deep network, section 5.4).\n\nDataset | Hash emb. (no dict.) | Std. emb. (no dict.) | Hash emb. (dict.) | Std. emb. (dict.) | Ensemble (dict.)\nAG | 92.4 | 92.0 | 91.5 | 91.7 | 92.0\nAmazon full | 60.0 | 58.3 | 59.4 | 58.5 | 60.5\nDbpedia | 98.5 | 98.6 | 98.7 | 98.6 | 98.8\nYahoo | 72.3 | 72.3 | 71.3 | 65.8 | 72.9\nYelp full | 63.8 | 62.6 | 62.6 | 61.4 | 62.9\nAmazon pol | 94.4 | 94.2 | 94.7 | 93.6 | 94.7\nYelp pol | 95.9 | 95.5 | 95.8 | 95.0 | 95.7\n\n6 Future Work\n\nHash embeddings are complementary to other state-of-the-art methods as they address the problem of large vocabularies. An attractive possibility is to use hash embeddings to create a word-level embedding to be used in a context sensitive model such as wordCNN.\nAs noted in section 3, we have used the same token to id function D_1 for both the component vectors and the importance parameters. 
This means that words that hash to the same bucket in the first layer get both identical component vectors and importance parameters. This effectively means that those words become indistinguishable to the model. If we instead use a different token to id function \u02c6D for the importance parameters, we severely reduce the chance of \"total collisions\". Our initial findings indicate that using a different hash function for the index of the importance parameters gives a small but consistent improvement compared to using the same hash function.\nIn this article we have represented word vectors using a weighted sum of component vectors. However, other aggregation methods are possible. One such method is simply to concatenate the (weighted) component vectors. The resulting kd-dimensional vector is then equivalent to a weighted sum of orthogonal vectors in R^{kd}.\nFinally, it might be interesting to experiment with pre-training lean, high-quality hash vectors that could be distributed as an alternative to word2vec vectors, which require around 3.5 GB of space for almost a billion parameters.\n\nTable 3: State-of-the-art test accuracy in %. The table is split between BOW embedding approaches (bottom) and more complex rnn/cnn approaches (top). The best result in each category for each dataset is bolded. A dash marks a result that is not reported.\n\nMethod | AG | DBP | Yelp P | Yelp F | Yah A | Amz F | Amz P\nchar-CNN (Zhang et al., 2015) | 87.2 | 98.3 | 94.7 | 62.0 | 71.2 | 59.5 | 94.5\nchar-CRNN (Xiao and Cho, 2016) | 91.4 | 98.6 | 94.5 | 61.8 | 71.7 | 59.2 | 94.1\nVDCNN (Conneau et al., 2016) | 91.3 | 98.7 | 95.7 | 64.7 | 73.4 | 63.0 | 95.7\nwordCNN (Johnson and Zhang, 2016) | 93.4 | 99.2 | 97.1 | 67.6 | 75.2 | 63.8 | 96.2\nDiscr. LSTM (Yogatama et al., 2017) | 92.1 | 98.7 | 92.6 | 59.6 | 73.7 | - | -\nVirt. adv. net. (Miyato et al., 2016) | - | 99.2 | - | - | - | - | -\n\nfastText (Joulin et al., 2016b) | 92.5 | 98.6 | 95.7 | 63.9 | 72.3 | 60.2 | 94.6\nBoW (Zhang et al., 2015) | 88.8 | 96.6 | 92.2 | 58.0 | 68.9 | 54.6 | 90.4\nn-grams (Zhang et al., 2015) | 92.0 | 98.6 | 95.6 | 56.3 | 68.5 | 54.3 | 92.0\nn-grams TFIDF (Zhang et al., 2015) | 92.4 | 98.7 | 95.4 | 54.8 | 68.5 | 52.4 | 91.5\nHash embeddings (no dict.) | 92.4 | 98.5 | 95.9 | 63.8 | 72.3 | 60.0 | 94.4\nHash embeddings (dict.) | 91.5 | 98.7 | 95.8 | 62.5 | 71.9 | 59.4 | 94.7\nHash embeddings (dict., ensemble) | 92.0 | 98.8 | 95.7 | 62.9 | 72.9 | 60.5 | 94.7\n\nTable 4: Words in the vocabulary with the highest/lowest importance parameters.\n\nYelp polarity, important tokens: What_a_joke, not_a_good_experience, Great_experience, wanted_to_love, and_lacking, Awful, by_far_the_worst\nYelp polarity, unimportant tokens: The_service_was, got_a_cinnamon, 15_you_can, while_touching, and_that_table, style_There_is\nAmazon full, important tokens: gave_it_4, it_two_stars_because, 4_stars_instead_of_5, 4_stars, four_stars, gave_it_two_stars\nAmazon full, unimportant tokens: that_my_wife_and_I, the_state_I, power_back_on, years_and_though, you_want_a_real_good\n\n7 Conclusion\n\nWe have described an extension and improvement to standard word embeddings and made an empirical comparison between hash embeddings and standard embeddings across a wide range of classification tasks. Our experiments show that the performance of hash embeddings is always on par with using standard embeddings, and in most cases better.\nWe have shown that hash embeddings can easily deal with huge vocabularies, and we have shown that hash embeddings can be used both with and without a dictionary. This is particularly useful for problems such as online learning where a dictionary cannot be constructed before training.\nOur experiments also suggest that hash embeddings have an inherent regularizing effect on performance. When using a standard method of regularization (such as L1 or L2 regularization), we start with the full parameter space and regularize parameters by pushing some of them closer to 0. 
This is in contrast to regularization using hash embeddings where the number of parameters (number of buckets) determines the degree of regularization. Thus parameters not needed by the model will not have to be added in the first place.\nThe hash embedding models used in this article achieve equal or better performance than previous bag-of-words models using standard embeddings. Furthermore, in 5 of 7 datasets, the performance of hash embeddings is in the top 3 of the state of the art.\n\nReferences\n\nArgerich, L., Zaffaroni, J. T., and Cano, M. J. (2016). Hash2vec, feature hashing for word embeddings. CoRR, abs/1608.08940.\n\nBai, B., Weston, J., Grangier, D., Collobert, R., Sadamasa, K., Qi, Y., Chapelle, O., and Weinberger, K. (2009). Supervised semantic indexing. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 187\u2013196. ACM.\n\nConneau, A., Schwenk, H., Barrault, L., and LeCun, Y. (2016). Very deep convolutional networks for natural language processing. CoRR, abs/1606.01781.\n\nGray, R. M. and Neuhoff, D. L. (1998). Quantization. IEEE Trans. Inf. Theor., 44(6):2325\u20132383.\n\nHuang, E. H., Socher, R., Manning, C. D., and Ng, A. Y. (2012). Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL \u201912, pages 873\u2013882, Stroudsburg, PA, USA. Association for Computational Linguistics.\n\nHuang, P.-S., He, X., Gao, J., Deng, L., Acero, A., and Heck, L. (2013). Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM), pages 2333\u20132338.\n\nIoffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. 
CoRR, abs/1502.03167.\n\nJegou, H., Douze, M., and Schmid, C. (2011). Product quantization for nearest neighbor search. IEEE Trans.\n\nPattern Anal. Mach. Intell., 33(1):117\u2013128.\n\nJohansen, A. R., Hansen, J. M., Obeid, E. K., S\u00f8nderby, C. K., and Winther, O. (2016). Neural machine\n\ntranslation with characters and hierarchical encoding. CoRR, abs/1610.06550.\n\nJohnson, R. and Zhang, T. (2016). Convolutional neural networks for text categorization: Shallow word-level vs.\n\ndeep character-level. CoRR, abs/1609.00718.\n\nJoulin, A., Grave, E., Bojanowski, P., Douze, M., J\u00e9gou, H., and Mikolov, T. (2016a). Fasttext.zip: Compressing\n\ntext classification models. CoRR, abs/1612.03651.\n\nJoulin, A., Grave, E., Bojanowski, P., and Mikolov, T. (2016b). Bag of tricks for efficient text classification.\n\nCoRR, abs/1607.01759.\n\nKingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.\nKulis, B. and Darrell, T. (2009). Learning to hash with binary reconstructive embeddings. In Bengio, Y.,\nSchuurmans, D., Lafferty, J. D., Williams, C. K. I., and Culotta, A., editors, Advances in Neural Information\nProcessing Systems 22, pages 1042\u20131050. Curran Associates, Inc.\n\nManning, C. D., Sch\u00fctze, H., et al. (1999). Foundations of statistical natural language processing, volume 999.\n\nMIT Press.\n\nMih\u00e1ltz, M. (2016). Google\u2019s trained word2vec model in python. https://github.com/mmihaltz/\n\nword2vec-GoogleNews-vectors. Accessed: 2017-02-08.\n\nMiyato, T., Dai, A. M., and Goodfellow, I. (2016). Virtual adversarial training for semi-supervised text\n\nclassification. stat, 1050:25.\n\nReisinger, J. and Mooney, R. J. (2010). Multi-prototype vector-space models of word meaning. In Human\nLanguage Technologies: The 2010 Annual Conference of the North American Chapter of the Association for\nComputational Linguistics, HLT \u201910, pages 109\u2013117, Stroudsburg, PA, USA. 
Association for Computational\nLinguistics.\n\nStolcke, A. (2000). Entropy-based pruning of backoff language models. CoRR, cs.CL/0006025.\nWeinberger, K. Q., Dasgupta, A., Attenberg, J., Langford, J., and Smola, A. J. (2009). Feature hashing for large\n\nscale multitask learning. CoRR, abs/0902.2206.\n\nXiao, Y. and Cho, K. (2016). Efficient character-level document classification by combining convolution and\n\nrecurrent layers. CoRR, abs/1602.00367.\n\nYogatama, D., Dyer, C., Ling, W., and Blunsom, P. (2017). Generative and discriminative text classification with\n\nrecurrent neural networks. arXiv preprint arXiv:1703.01898.\n\nZhang, X., Zhao, J. J., and LeCun, Y. (2015). Character-level convolutional networks for text classification.\n\nCoRR, abs/1509.01626.\n", "award": [], "sourceid": 2544, "authors": [{"given_name": "Dan", "family_name": "Tito Svenstrup", "institution": "DTU"}, {"given_name": "Jonas", "family_name": "Hansen", "institution": "Findzebra"}, {"given_name": "Ole", "family_name": "Winther", "institution": "Technical University of Denmark"}]}