{"title": "Hierarchical Distributed Representations for Statistical Language Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 185, "page_last": 192, "abstract": null, "full_text": " Hierarchical Distributed Representations for Statistical Language Modeling\n\n John Blitzer, Kilian Q. Weinberger, Lawrence K. Saul, and Fernando C. N. Pereira\n Department of Computer and Information Science, University of Pennsylvania\n Levine Hall, 3330 Walnut Street, Philadelphia, PA 19104\n {blitzer,kilianw,lsaul,pereira}@cis.upenn.edu\n\n Abstract\n\n Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. 
We also discuss extensions of our approach to longer multiword contexts.\n\n\n1 Introduction\n\nStatistical language models are essential components of natural language systems for human-computer interaction. They play a central role in automatic speech recognition [11], machine translation [5], statistical parsing [8], and information retrieval [15]. These models estimate the probability that a word will occur in a given context, where in general a context specifies a relationship to one or more words that have already been observed. The simplest, most studied case is that of n-gram language modeling, where each word is predicted from the preceding n-1 words. The main problem in building these models is that the vast majority of word combinations occur very infrequently, making it difficult to estimate accurate probabilities of words in most contexts.\n\nResearchers in statistical language modeling have developed a variety of smoothing techniques to alleviate this problem of data sparseness. Most smoothing methods are based on simple back-off formulas or interpolation schemes that discount the probability of observed events and assign the \"leftover\" probability mass to events unseen in training [7]. Unfortunately, these methods do not typically represent or take advantage of statistical regularities among contexts. One expects the probabilities of rare or unseen events in one context to be related to their probabilities in statistically similar contexts. Thus, it should be possible to estimate more accurate probabilities by exploiting these regularities.\n\nSeveral approaches have been suggested for sharing statistical information across contexts. The aggregate Markov model (AMM) of Saul and Pereira [13] (also discussed by Hofmann and Puzicha [10] as a special case of the aspect model) factors the conditional probability table of a word given its context by a latent variable representing context \"classes\". 
However, this latent variable approach is difficult to generalize to multiword contexts, as the size of the conditional probability table for class given context grows exponentially with the context length.\n\nThe neural probabilistic language model (NPLM) of Bengio et al. [2, 3] achieved significant improvements over state-of-the-art smoothed n-gram models [6]. The NPLM encodes contexts as low-dimensional continuous vectors. These are fed to a multilayer neural network that outputs a probability distribution over words. The low-dimensional vectors and the parameters of the network are trained simultaneously to minimize the perplexity of the language model. This model has no difficulty encoding multiword contexts, but its training and application are very costly because of the need to compute a separate normalization for the conditional probabilities associated with each context.\n\nIn this paper, we introduce and evaluate a statistical language model that combines the advantages of the AMM and NPLM. Like the NPLM, it can be used for multiword contexts, and like the AMM it avoids per-context normalization. In our model, contexts are represented as low-dimensional real vectors initialized by unsupervised algorithms for dimensionality reduction [14]. The probabilities of words given contexts are represented by a hierarchical mixture of experts (HME) [12], where each expert is a multinomial distribution over predicted words. This tree-structured mixture model allows a rich dependency on context without expensive per-context normalization. 
Proper initialization of the distributed representations is crucial; in particular, we find that initializations from the results of linear and nonlinear dimensionality reduction algorithms lead to better models (with significantly lower test perplexities) than random initialization.\n\nIn practice our model is several orders of magnitude faster to train and apply than the NPLM, enabling us to work with larger vocabularies and training corpora. We present results on a large-scale bigram modeling task, showing that our model also leads to significant improvements over comparable AMMs.\n\n\n2 Distributed representations of words\n\nNatural language has complex, multidimensional semantics. As a trivial example, consider the following four sentences:\n\n The vase broke. The vase contains water.\n The window broke. The window contains water.\n\nThe bottom right sentence is syntactically valid but semantically meaningless. As shown by the table, a two-bit distributed representation of the words \"vase\" and \"window\" suffices to express that a vase is both a container and breakable, while a window is breakable but cannot be a container. More generally, we expect low dimensional continuous representations of words to be even more effective at capturing semantic regularities.\n\nDistributed representations of words can be derived in several ways. In a given corpus of text, for example, consider the matrix of bigram counts whose element Cij records the number of times that word wj follows word wi. Further, let pij = Cij / Σk Cik denote the conditional frequencies derived from these counts, and let pi denote the V-dimensional frequency vector with elements pij, where V is the vocabulary size. Note that the vectors pi themselves provide a distributed representation of the words wi in the corpus. 
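As a concrete illustration of this construction, the sketch below builds the row-normalized bigram matrix pij = Cij / Σk Cik from a toy corpus; the corpus, vocabulary, and function name are hypothetical placeholders, not the NAB data used in section 4.

```python
import numpy as np

def bigram_frequency_vectors(corpus, vocab):
    """Row-normalized bigram counts: p[i, j] = C[i, j] / sum_k C[i, k]."""
    index = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    C = np.zeros((V, V))
    for sentence in corpus:
        for w, w_next in zip(sentence, sentence[1:]):
            C[index[w], index[w_next]] += 1  # count of w_next following w
    totals = C.sum(axis=1, keepdims=True)
    return C / np.where(totals > 0, totals, 1)  # avoid division by zero rows

# Toy example (hypothetical data): row i is the frequency vector p_i for word i.
corpus = [["the", "vase", "broke"], ["the", "window", "broke"]]
p = bigram_frequency_vectors(corpus, ["the", "vase", "window", "broke"])
```

Each nonzero row of `p` is a point on the simplex; these rows are the vectors pi that the dimensionality reduction methods below take as input.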
For large vocabularies and training corpora, however, this is an extremely unwieldy representation, tantamount to storing the full matrix of bigram counts. Thus, it is natural to seek a lower dimensional representation that captures the same information. To this end, we need to map each vector pi to some d-dimensional vector xi, with d << V. We consider two methods of dimensionality reduction for this problem. The results from these methods are then used to initialize the HME architecture in the next section.\n\n\n2.1 Linear dimensionality reduction\n\nThe simplest form of dimensionality reduction is principal component analysis (PCA). PCA computes a linear projection of the frequency vectors pi into the low dimensional subspace that maximizes their variance. The variance-maximizing subspace of dimensionality d is spanned by the top d eigenvectors of the frequency vector covariance matrix. The eigenvalues of the covariance matrix measure the variance captured by each axis of the subspace. The effect of PCA can also be understood as a translation and rotation of the frequency vectors pi, followed by a truncation that preserves only their first d elements.\n\n\n2.2 Nonlinear dimensionality reduction\n\nIntuitively, we would like to map the vectors pi into a low dimensional space where semantically similar words remain close together and semantically dissimilar words are far apart. Can we find a nonlinear mapping that does this better than PCA? Weinberger et al. recently proposed a new solution to this problem based on semidefinite programming [14].\n\nLet xi denote the image of pi under this mapping. The mapping is discovered by first learning the V × V matrix of squared Euclidean distances [1] given by Dij = |xi - xj|^2. This is done by balancing two competing goals: (i) to co-locate semantically similar words, and (ii) to separate semantically dissimilar words. 
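The linear reduction of section 2.1 can be sketched in a few lines of numpy; the random "frequency vectors" and the choice d = 2 below are illustrative only, standing in for the actual bigram statistics.

```python
import numpy as np

def pca_embed(p, d):
    """Project vectors onto the top-d eigenvectors of their covariance matrix."""
    centered = p - p.mean(axis=0)            # translation to zero mean
    cov = centered.T @ centered / len(p)     # covariance of the frequency vectors
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    top = eigvecs[:, ::-1][:, :d]            # top-d variance-maximizing axes
    return centered @ top, eigvals[::-1]     # embedding + full spectrum

rng = np.random.default_rng(0)
p = rng.random((100, 20))        # stand-in for 100 frequency vectors, V = 20
x, spectrum = pca_embed(p, d=2)  # x: 100 points in 2 dimensions
```

The variance of the first embedding coordinate equals the top eigenvalue, which is exactly the spectrum plotted for PCA in Figure 1.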
The first goal is achieved by fixing the distances between words with similar frequency vectors to their original values. In particular, if pj and pk lie within some small neighborhood of each other, then the corresponding element Djk in the distance matrix is fixed to the value |pj - pk|^2. The second goal is achieved by maximizing the sum of pairwise squared distances Σij Dij. Thus, we push the words in the vocabulary as far apart as possible subject to the constraint that the distances between semantically similar words do not change.\n\nThe only freedom in this optimization is the criterion for judging that two words are semantically similar. In practice, we adopt a simple criterion such as k-nearest neighbors in the space of frequency vectors pi and choose k as small as possible so that the resulting neighborhood graph is connected [14].\n\nThe optimization is performed over the space of squared Euclidean distance matrices [1]. Necessary and sufficient conditions for the matrix D to be interpretable as a squared Euclidean distance matrix are that D is symmetric and that the Gram matrix1 derived from it, G = -(1/2) HDH^T, is positive semidefinite, where H = I - (1/V) 11^T. The optimization can thus be formulated as the semidefinite programming problem:\n\n Maximize Σij Dij subject to: (i) D^T = D, (ii) -(1/2) HDH^T ⪰ 0, and\n (iii) Dij = |pi - pj|^2 for all neighboring vectors pi and pj.\n\n 1Assuming without loss of generality that the vectors xi are centered on the origin, the dot products Gij = xi · xj are related to the pairwise squared distances Dij = |xi - xj|^2 as stated above.\n\n[Figure 1: bar plots of the normalized eigenvalue spectra of PCA and SDE, on a scale from 0.0 to 1.0.]\n\nFigure 1: Eigenvalues from principal component analysis (PCA) and semidefinite embedding (SDE), applied to bigram distributions of the 2000 most frequently occurring words in the corpus. 
The eigenvalues, shown normalized by their sum, measure the relative variance captured by individual dimensions.\n\nThe optimization is convex, and its global maximum can be computed in polynomial time [4]. The optimization here differs slightly from the one used by Weinberger et al. [14] in that here we only preserve local distances, as opposed to local distances and angles.\n\nAfter computing the matrix Dij by semidefinite programming, a low dimensional embedding xi is obtained by metric multidimensional scaling [1, 9, 14]. The top eigenvalues of the Gram matrix measure the variance captured by the leading dimensions of this embedding. Thus, one can compare the eigenvalue spectra from this method and PCA to ascertain if the variance of the nonlinear embedding is concentrated in fewer dimensions. We refer to this method of nonlinear dimensionality reduction as semidefinite embedding (SDE). Fig. 1 compares the eigenvalue spectra of PCA and SDE applied to the 2000 most frequent words2 in the corpus described in section 4. The figure shows that the nonlinear embedding by SDE concentrates its variance in many fewer dimensions than the linear embedding by PCA. Indeed, Fig. 2 shows that even the first two dimensions of the nonlinear embedding preserve the neighboring relationships of many words that are semantically similar. 
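The metric MDS step just described, which recovers coordinates from the learned distance matrix via the Gram matrix G = -(1/2) HDH^T, can be sketched as follows; the input distances here are synthetic, standing in for the output of the semidefinite program.

```python
import numpy as np

def mds_embed(D, d):
    """Embed points from a squared-distance matrix via its Gram matrix."""
    V = len(D)
    H = np.eye(V) - np.ones((V, V)) / V              # centering: I - (1/V) 11^T
    G = -0.5 * H @ D @ H                              # Gram matrix of centered points
    eigvals, eigvecs = np.linalg.eigh(G)              # ascending eigenvalues
    top_vals = np.clip(eigvals[::-1][:d], 0, None)    # leading variances
    top_vecs = eigvecs[:, ::-1][:, :d]
    return top_vecs * np.sqrt(top_vals)               # one row of coordinates per point

# Synthetic check: squared distances of known 2-D points are reproduced exactly.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
D = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
x = mds_embed(D, d=2)
D_rec = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
```

Because classical MDS is exact for genuine squared Euclidean distance matrices, `D_rec` matches `D` up to a rotation or reflection of the recovered points.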
By contrast, the analogous plot generated by PCA (not shown) reveals no such structure.\n\n[Figure 2: a two-dimensional scatter of words in which semantically related words form tight clusters, including modal verbs (MAY, WOULD, COULD, SHOULD, MIGHT, MUST, CAN, CANNOT, COULDN'T, WON'T, WILL), days of the week (MONDAY through SUNDAY), months (JANUARY through DECEMBER), and number words (ONE through EIGHTEEN, ZERO, MILLION, BILLION).]\n\nFigure 2: Projection of the normalized bigram counts of the 2000 most frequent words onto the first two dimensions of the nonlinear embedding obtained by semidefinite programming. Note that semantically meaningful neighborhoods are preserved, despite the massive dimensionality reduction from V = 60000 to d = 2.\n\n 2Though convex, the optimization over distance matrices for SDE is prohibitively expensive for large matrices. For the results in this paper--on the corpus described in section 4--we solved the semidefinite program in this section to embed the 2000 most frequent words in the corpus, then used a greedy incremental solver to embed the remaining 58000 words in the vocabulary. Details of this incremental solver will be given elsewhere. Though not the main point of this paper, the nonlinear embedding of V = 60000 words is to our knowledge one of the largest applications of recently developed spectral methods for nonlinear dimensionality reduction [9, 14].\n\n3 Hierarchical mixture of experts\n\nThe model we use to compute the probability that word w' follows word w is known as a hierarchical mixture of experts (HME) [12]. HMEs are fully probabilistic models, making them ideally suited to the task of statistical language modeling. Furthermore, like multilayer neural networks they can parameterize complex, nonlinear functions of their input.\n\nFigure 3 depicts a simple, two-layer HME. 
HMEs are tree-structured mixture models in which the mixture components are \"experts\" that lie at the leaves of the tree. The interior nodes of the tree perform binary logistic regressions on the input vector to the HME, and the mixing weight for a leaf is computed by multiplying the probabilities of each branch (left or right) along the path to that leaf. In our model, the input vector x is a function of the context word w, and the expert at each leaf specifies a multinomial distribution over the predicted word w'. Letting π denote a path through the tree from root to leaf, the HME computes the probability of a word w' conditioned on a context word w as\n\n Pr(w'|w) = Σπ Pr(π|x(w)) Pr(w'|π). (1)\n\nWe can compute the maximum likelihood parameters for the HME using an Expectation-Maximization (EM) algorithm [12]. The E-step involves computing the posterior probability over paths Pr(π|w, w') for each observed bigram in the training corpus. This can be done by a recursive pass through the tree. In the M-step, we must maximize the EM auxiliary function with respect to the parameters of the logistic regressions and multinomial leaves as well as the input vectors x(w). The logistic regressions in the tree decouple and can be optimized separately by Newton's method, while the multinomial leaves have a simple closed-form update. Though the input vectors are shared across all logistic regressions in the tree, we can compute their gradients and Hessians in one recursive pass and update them by Newton's method as well.\n\nThe EM algorithm for HMEs converges to a local maximum of the log-likelihood, or equivalently, a local minimum of the training perplexity\n\n Ptrain = [ Πij Pr(wj|wi)^Cij ]^(-1/C), (2)\n\nwhere C = Σij Cij is the total number of observed bigrams in the training corpus. 
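To make Eq. (1) concrete, here is a sketch of the forward computation for a depth-2 HME; the gating weights, the uniform leaf multinomials, and V = 5 below are hand-set illustrations, not trained values.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hme_predict(x, gates, leaves):
    """Eq. (1): Pr(w'|w) = sum over root-to-leaf paths of Pr(path|x) Pr(w'|leaf).

    gates:  three weight vectors, one binary logistic regression per interior
            node (root, left child, right child).
    leaves: (4, V) array; row k is the multinomial expert at leaf k.
    """
    g_root, g_left, g_right = (sigmoid(w @ x) for w in gates)
    # Mixing weight of each leaf = product of branch probabilities on its path.
    path_probs = np.array([
        (1 - g_root) * (1 - g_left),   # leaf 0: left, left
        (1 - g_root) * g_left,         # leaf 1: left, right
        g_root * (1 - g_right),        # leaf 2: right, left
        g_root * g_right,              # leaf 3: right, right
    ])
    return path_probs @ leaves         # distribution over predicted words w'

x = np.array([0.5, -1.0])                            # input vector x(w), d = 2
gates = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 2.0])]
leaves = np.full((4, 5), 0.2)                        # uniform experts, V = 5
pr = hme_predict(x, gates, leaves)                   # valid distribution over 5 words
```

Because the four path probabilities sum to one and each leaf is a proper multinomial, the mixture is normalized by construction, with no per-context normalization step.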
The algorithm is sensitive to the choice of initialization; in particular, as we show in the next section, initialization of the input vectors by PCA or SDE leads to significantly better models than random initialization. We initialized the logistic regressions in the HME to split the input vectors recursively along their dimensions of greatest variance. The multinomial distributions at leaf nodes were initialized by uniform distributions.\n\n[Figure 3: a binary tree whose input is a word vector initialized by PCA or SDE, whose interior nodes are logistic regressions, and whose four leaves are multinomial distributions.]\n\nFigure 3: Two-layer HME for bigram modeling. Words are mapped to input vectors; probabilities of next words are computed by summing over paths through the tree. The mapping from words to input vectors is initialized by dimensionality reduction of bigram counts.\n\n           d=4   d=8   d=12  d=16\n random    468   407   378   373\n PCA       406   364   362   351\n SDE       385   361   360   355\n\n Table 1: Test perplexities of HMEs with different input dimensionalities and initializations.\n\n  m\\d      4     8     12    16\n   8      435   429   426   428\n  16      385   361   360   355\n  32      350   328   320   317\n  64      336   308   298   294\n\n Table 2: Test perplexities of HMEs with different input dimensionalities and numbers of leaves.\n\nFor an HME with m multinomial leaves and d-dimensional input vectors, the number of parameters scales as O(Vd + Vm + dm). The resulting model can therefore be much more compact than a full bigram model over V words.\n\n\n4 Results\n\nWe evaluated our models on the ARPA North American Business News (NAB) corpus. Our training set contained 78 million words from a 60,000 word vocabulary. In the interest of speed, we truncated the lowest-count bigrams from our training set. This left us with a training set consisting of 1.7 million unique bigrams. 
The test set, untruncated, had 13 million words resulting in 2.1 million unique bigrams.\n\n\n4.1 Empirical evaluation\n\nTable 1 reports the test perplexities of several HMEs whose input vectors were initialized in different ways. The number of mixture components (i.e., leaves of the HME) was fixed at m = 16. In all cases, the inputs initialized by PCA and SDE significantly outperformed random initialization. PCA and SDE initialization performed equally well for all but the lowest-dimensional inputs. Here SDE outperformed PCA, most likely because the first few eigenvectors of SDE capture more variance in the bigram counts than those of PCA (see Figure 1).\n\nTable 2 reports the test perplexities of several HMEs initialized by SDE, but with varying input dimensionality (d) and numbers of leaves (m). Perplexity decreases with increasing tree depth and input dimensionality, but increasing the dimensionality beyond d = 8 does not appear to give much gain.\n\n\n4.2 Comparison to a class-based bigram model\n\nWe obtained baseline results from an AMM [13] trained on the same corpus. The model (Figure 4) has the form\n\n Pr(w'|w) = Σz Pr(z|w) Pr(w'|z). (3)\n\n[Figure 4: belief network w -> z -> w'.]\n\nFigure 4: Belief network for AMM.\n\nThe number of estimated parameters in AMMs scales as 2|Z|V, where |Z| is the size of the latent variable (i.e., number of classes) and V is the number of words in the vocabulary.\n\n parameters (*1000)   Ptest(AMM)   Ptest(HME)   improvement\n   960                  456          429           6%\n  1440                  414          361          13%\n  2400                  353          328           7%\n  4320                  310          308           1%\n\n Table 3: Test perplexities of HMEs and AMMs with roughly equal parameter counts.\n\nTable 3 compares the test perplexities of several HMEs and AMMs with similar numbers of parameters. All these HMEs had d = 8 inputs initialized by SDE. In all cases, the HMEs match or outperform the AMMs. 
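The AMM baseline of Eq. (3) is simple enough to sketch directly; the class count |Z| = 2, V = 4, and the random conditional tables below are hypothetical stand-ins for a trained model.

```python
import numpy as np

def amm_predict(w, Pz_given_w, Pw_given_z):
    """Eq. (3): Pr(w'|w) = sum_z Pr(z|w) Pr(w'|z)."""
    return Pz_given_w[w] @ Pw_given_z

# Hypothetical trained tables: V = 4 words, |Z| = 2 latent classes.
rng = np.random.default_rng(1)
Pz_given_w = rng.dirichlet(np.ones(2), size=4)   # row w: Pr(z|w)
Pw_given_z = rng.dirichlet(np.ones(4), size=2)   # row z: Pr(w'|z)
pr = amm_predict(0, Pz_given_w, Pw_given_z)      # distribution over next words
```

Marginalizing over z keeps the output normalized for free, which is the same property the HME obtains from its tree of gates; the parameter count 2|Z|V comes directly from the two tables above.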
The performance is nearly equal for the larger models, which may be explained by the fact that most of the parameters of the larger HMEs come from the multinomial leaves, not from the distributed inputs.\n\n\n4.3 Comparison to NPLM\n\nThe most successful large-scale application of distributed representations to language modeling is the NPLM of Bengio et al. [2, 3], which in part inspired our work. We now compare the main aspects of the two models.\n\nThe NPLM uses softmax to compute the probability of a word w given its context, thus requiring a separate normalization for each context. Estimating the parameters of this softmax requires O(V) computation per observed context and accounts for almost all of the computational resources required by the model. Because of this, the NPLM vocabulary size was restricted to 18000 words, and even then it required more than 3 weeks using 40 CPUs to finish 5 epochs of training [2].\n\n  m\\d     4    8    12   16\n   8      1    1    1    1\n  16      2    2    2    2\n  32      4    4    4    4\n  64      9   10   10   10\n\n Table 4: Training times in hours for HMEs with m leaves.\n\nBy contrast, our HMEs require O(md) computation per observed bigram. As Table 4 shows, actual training times are rather insensitive to input dimensionality. This allowed us to use a 3.5× larger vocabulary and a larger training corpus than were used for the NPLM, and still complete training our largest models in a matter of hours. Note that the numbers in Table 4 do not include the time to compute the initial distributed representations by PCA (30 minutes) or SDE (3 days), but these computations do not need to be repeated for each trained model.\n\nThe second difference between our model and the NPLM is the choice of initialization. Bengio et al. [3] report negligible improvement from initializing the NPLM input vectors by singular value decomposition. 
By contrast, we found that initialization by PCA or SDE was essential for optimal performance of our models (Table 1).\n\nFinally, the NPLM was applied to multiword contexts. We have not done these experiments yet, but our model extends naturally to multiword contexts, as we explain in the next section.\n\n\n5 Discussion\n\nIn this paper, we have presented a statistical language model that exploits hierarchical distributed representations of word contexts. The model shares the advantages of the NPLM [2], but differs in its use of dimensionality reduction for effective parameter initialization and in the significant speedup provided by the HME architecture. We can consequently scale our models to larger training corpora and vocabularies. We have also demonstrated that our models consistently match or outperform a baseline class-based bigram model.\n\nThe class-based bigram model is nearly as effective as the HME, but it has the major drawback that there is no straightforward way to extend it to multiword contexts without exploding its parameter count. Like the NPLM, however, the HME can be easily extended. We can form an input vector for a multiword history (w1, w2) simply by concatenating the vectors x(w1) and x(w2). The parameters of the corresponding HME can be learned by an EM algorithm similar to the one in this paper. Initialization from dimensionality reduction is also straightforward: we can compute the low dimensional representation for each word separately. We are actively pursuing these ideas to train models with hierarchical distributed representations of multiword contexts.\n\n\nReferences\n\n [1] A. Y. Alfakih, A. Khandani, and H. Wolkowicz. Solving Euclidean distance matrix completion problems via semidefinite programming. Computational Optimization and Applications, 12(1-3):13-30, 1999.\n\n [2] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. 
A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155, 2003.\n\n [3] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13, Cambridge, MA, 2001. MIT Press.\n\n [4] B. Borchers. CSDP, a C library for semidefinite programming. Optimization Methods and Software, 11(1):613-623, 1999.\n\n [5] P. Brown, S. D. Pietra, V. D. Pietra, and R. Mercer. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263-311, 1993.\n\n [6] P. F. Brown, V. J. D. Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479, 1992.\n\n [7] S. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting of the ACL, pages 310-318, 1996.\n\n [8] M. Collins. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, 1997.\n\n [9] J. Ham, D. D. Lee, S. Mika, and B. Scholkopf. A kernel view of the dimensionality reduction of manifolds. In Proceedings of the Twenty First International Conference on Machine Learning (ICML-04), Banff, Canada, 2004.\n\n[10] T. Hofmann and J. Puzicha. Statistical models for co-occurrence and histogram data. In Proceedings of the International Conference on Pattern Recognition, pages 192-194, 1998.\n\n[11] F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, 1997.\n\n[12] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214, 1994.\n\n[13] L. K. Saul and F. C. N. Pereira. Aggregate and mixed-order Markov models for statistical language processing. In C. Cardie and R. 
Weischedel, editors, Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP-97), pages 81-89, Providence, RI, 1997.\n\n[14] K. Q. Weinberger, F. Sha, and L. K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the Twenty First International Conference on Machine Learning (ICML-04), Banff, Canada, 2004.\n\n[15] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2):179-214, 2004.\n", "award": [], "sourceid": 2691, "authors": [{"given_name": "John", "family_name": "Blitzer", "institution": null}, {"given_name": "Fernando", "family_name": "Pereira", "institution": null}, {"given_name": "Kilian", "family_name": "Weinberger", "institution": null}, {"given_name": "Lawrence", "family_name": "Saul", "institution": null}]}