{"title": "Context as Filtering", "book": "Advances in Neural Information Processing Systems", "page_first": 907, "page_last": 914, "abstract": null, "full_text": "Context as Filtering

Daichi Mochihashi
ATR, Spoken Language Communication Research Laboratories
Hikaridai 2-2-2, Keihanna Science City, Kyoto, Japan
daichi.mochihashi@atr.jp

Yuji Matsumoto
Graduate School of Information Science, Nara Institute of Science and Technology
Takayama 8916-5, Ikoma City, Nara, Japan
matsu@is.naist.jp

Abstract

Long-distance language modeling is important not only in speech recognition and machine translation, but also in high-dimensional discrete sequence modeling in general. However, the problem of context length has largely been neglected so far, and a naïve bag-of-words history has been employed in natural language processing. In contrast, in this paper we view topic shifts within a text as a latent stochastic process, giving an explicit probabilistic generative model that has partial exchangeability. We propose an online inference algorithm using particle filters that recognizes topic shifts so as to employ the most appropriate length of context automatically. Experiments on the BNC corpus showed consistent improvement over previous methods that ignore chronological order.

1 Introduction

Contextual effects play an essential role in the linguistic behavior of humans. We infer the context in which we are involved and make an adaptive linguistic response by selecting an appropriate model from that information. In natural language processing research, such models are called long-distance language models: models that incorporate the distant effects of previous words beyond the short-term dependencies between a few words captured by n-gram models. 
Besides their apparent applications in speech recognition and machine translation, we note that many problems of discrete data processing reduce to language modeling, such as information retrieval [1], Web navigation [2], human-machine interaction, and collaborative filtering and recommendation [3]. From the viewpoint of signal processing or control theory, context modeling is clearly a filtering problem: estimating the states of a system sequentially along time to predict the outputs according to them. However, for the problem of long-distance language modeling, natural language processing has so far only provided simple averaging over the set of all words from the beginning of a text, totally dropping chronological order and implicitly assuming that the text comes from a stationary information source [4, 5]. The inherent difficulties that have prevented filtering approaches to language modeling are its discreteness and high dimensionality, which preclude Kalman filters and their extensions, all of which are designed for vector spaces and distributions like Gaussians. As we note in the following, ordinary discrete HMMs are not powerful enough for this purpose because their true state is restricted to a single hidden component [6].

In contrast, this paper proposes to solve the high-dimensional discrete filtering problem directly using a Particle Filter. 
By combining a multinomial Particle Filter recently proposed in statistics for DNA sequence modeling [7] with the Bayesian text models LDA and DM, we introduce two models that can track multinomial stochastic processes of natural language or of the similar high-dimensional discrete data domains that we often encounter.

2 Mean Shift Model of Context

2.1 HMM for Multinomial Distributions

The long-distance language models mentioned in Section 1 assume a hidden multinomial distribution, such as a unigram distribution or a mixture distribution over latent topics, to predict the next word by updating its estimate according to the observations. Therefore, to track context shifts, we need a model that describes changes of multinomial distributions. One model for this purpose is a multinomial extension of the Mean Shift Model (MSM) recently proposed in the field of statistics [7]. This is a kind of HMM, but note that it is different from traditional discrete HMMs. In discrete HMMs, the true state is one of M components, and we estimate it stochastically as a multinomial over the M components. Here, in contrast, the true state is itself a multinomial over the components, and we estimate it stochastically as (possibly a mixture of) a Dirichlet distribution, a distribution over multinomial distributions on the (M-1)-simplex. This HMM has some similarity to the Factorial HMM [6] in that it has combinatorial representational power through a distributed state representation. However, because the true state here is a multinomial over the latent variables, there are dependencies between the states that are assumed independent in the FHMM. Below, we briefly introduce the multinomial Mean Shift Model following [7] and an associated solution using a Particle Filter.

2.2 Multinomial Mean Shift Model

The MSM is a generative model that describes intermittent changes of hidden states and the outputs according to them. 
Although a counterpart using the Normal distribution was introduced first [8, 9], here we concentrate on a multinomial extension of the MSM, following [7] for DNA sequence modeling. In a multinomial MSM, we assume time-dependent true multinomials θ_t that may change occasionally, and the following generative model for the discrete outputs y^t = y_1 y_2 ... y_t (y_t ∈ Σ; Σ is a set of symbols) according to θ_1 θ_2 ... θ_t:

  θ_t ~ Dir(α)   with probability ρ
  θ_t = θ_{t-1}  with probability (1-ρ),      (1)
  y_t ~ Mult(θ_t)

where Dir(α) and Mult(θ) are a Dirichlet and a multinomial distribution with parameters α and θ, respectively. Here we assume that the hyperparameter α is known and fixed, an assumption we will relax in Section 3.

This model first draws a multinomial θ from Dir(α) and samples outputs y according to θ for a certain interval. When a change point occurs with probability ρ, a new θ is sampled again from Dir(α) and subsequent y are sampled from the new θ. This process continues recursively; throughout, neither θ_t nor the change points are known to us: all we know is the output sequence y^t. However, if we know when the last change occurred, y can be predicted exactly. Let I_t be a binary variable that represents whether a change occurred at time t: that is, I_t = 1 means there was a change at t (θ_t ≠ θ_{t-1}), and I_t = 0 means there was no change (θ_t = θ_{t-1}).

  1. For particles i = 1 ... N,
     (a) Calculate f^(i)(t) and g^(i)(t) according to (6).
     (b) Sample I_t^(i) ~ Bernoulli(f^(i)(t) / (f^(i)(t) + g^(i)(t))), and update I^(i)_{t-1} to I^(i)_t.
     (c) Update the weight w_t^(i) = w_{t-1}^(i) · (f^(i)(t) + g^(i)(t)).
  2. Find a predictive distribution using w_t^(1) ... w_t^(N) and I_t^(1) ... I_t^(N):
       p(y_{t+1}|y^t) = Σ_{i=1}^N w_t^(i) p(y_{t+1}|y^t, I_t^(i)),      (4)
     where p(y_{t+1}|y^t, I_t^(i)) is given by (3).

Figure 1: Algorithm of the Multinomial Particle Filter. 
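As an aside, the generative process (1) is straightforward to simulate. The following sketch assumes a known, fixed α as stated above; the function and variable names are ours, not from the paper:

```python
import numpy as np

def sample_msm(T, alpha, rho, seed=None):
    """Sample T outputs from the multinomial Mean Shift Model (1).
    alpha: Dirichlet hyperparameter vector (one entry per symbol), assumed known.
    rho:   change-point probability per step.
    Returns outputs y_1..y_T and change indicators I_1..I_T."""
    rng = np.random.default_rng(seed)
    y, I = [], []
    theta = None
    for t in range(T):
        change = theta is None or rng.random() < rho
        if change:
            theta = rng.dirichlet(alpha)   # new regime: theta_t ~ Dir(alpha)
        I.append(int(change))
        y.append(int(rng.choice(len(alpha), p=theta)))  # y_t ~ Mult(theta_t)
    return y, I
```

The first step is always a draw from the prior, matching the description that the model "first draws a multinomial θ from Dir(α)".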
When the last change occurred at time c,

  p(y_{t+1} = y | y^t, I_c = 1, I_{c+1} = ... = I_t = 0)
    = ∫ p(y|θ) p(θ | y_c ... y_t) dθ      (2)
    = (α_y + n_y) / Σ_{y'} (α_{y'} + n_{y'}),      (3)

where α_y is the y'th element of α and n_y is the number of occurrences of y in y_c ... y_t. Therefore, the essence of this problem lies in how to detect a change point given the data up to time t: a change point problem in discrete space. This problem can be solved by an efficient Particle Filter algorithm [10], shown below.

2.3 Multinomial Particle Filter

The prediction problem above can be solved by the efficient Particle Filter algorithm shown in Figure 1 and displayed graphically in Figure 2 (excluding prior updates).

[Figure 2: The Multinomial Particle Filter at work. Each particle maintains its own segmentation of the history into pseudo documents d_1 ... d_c, a prior, and a weight, and predicts y_{t+1}.]

The main intricacy involved is as follows. Let us denote I^t = {I_1 ... I_t}. By Bayes' theorem,

  p(I_t | I^{t-1}, y^t) ∝ p(I_t, y_t | I^{t-1}, y^{t-1}) = p(y_t | y^{t-1}, I^{t-1}, I_t) p(I_t | I^{t-1}),      (5)

  f(t) := p(y_t | y^{t-1}, I^{t-1}, I_t = 1) p(I_t = 1 | I^{t-1})
  g(t) := p(y_t | y^{t-1}, I^{t-1}, I_t = 0) p(I_t = 0 | I^{t-1}),      (6)

leading to

  p(I_t = 1 | I^{t-1}, y^t) = f(t) / (f(t) + g(t)),   p(I_t = 0 | I^{t-1}, y^t) = g(t) / (f(t) + g(t)).      (7)

In expression (5), the first term is the likelihood of the observation y_t when I_t has been fixed, which can be obtained through (3). The second term is the prior probability of a change, which can tentatively be set to a constant ρ. However, when we endow ρ with a prior Beta distribution Be(α, β), a posterior estimate of ρ_t given the binary change point history I^{t-1} can be obtained from the number of 1's in I^{t-1}, n_{t-1}(1), following a standard Bayesian method:

  E[ρ_t | I^{t-1}] = (α + n_{t-1}(1)) / (α + β + t - 1).      (8)

This means that we can estimate a \"rate of topic shifts\" in a Bayesian fashion as time proceeds. Throughout the following experiments, we used this online estimate of ρ_t. The above algorithm runs for each observation y_t (t = 1 ... T). 
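The recursion can be sketched in code. The following is a minimal illustration, not the authors' implementation: it combines the Dirichlet predictive (3), the change-point sampling and weight updates of Figure 1, and the online change-rate estimate (8). All names are ours, the \"change\" likelihood f(t) uses the prior predictive α_y/Σα, and resampling is omitted:

```python
import numpy as np

def multinomial_particle_filter(y_seq, alpha, N=20, a=1.0, b=50.0, seed=None):
    """Sketch of the Multinomial Particle Filter (Figure 1) with the
    online change-rate estimate (8), using Beta(a, b) as the prior on rho."""
    rng = np.random.default_rng(seed)
    V = len(alpha)
    alpha0 = float(alpha.sum())
    counts = np.zeros((N, V))   # per-particle counts since the last change point
    n_chg = np.zeros(N)         # n_{t-1}(1): number of change points sampled so far
    logw = np.zeros(N)          # log particle weights
    preds = []
    for t, y in enumerate(y_seq):
        # Predictive mixture (4): weight each particle's Dirichlet posterior (3).
        w = np.exp(logw - logw.max()); w /= w.sum()
        seg = (alpha[None, :] + counts) / (alpha0 + counts.sum(axis=1, keepdims=True))
        preds.append(float(w @ seg[:, y]))
        rho = (a + n_chg) / (a + b + t)        # E[rho_t | I^{t-1}], eq. (8)
        f = (alpha[y] / alpha0) * rho          # change: predict y_t from the prior
        g = seg[:, y] * (1.0 - rho)            # no change: contextual posterior (3)
        change = rng.random(N) < f / (f + g)   # Bernoulli step 1(b)
        counts[change] = 0.0                   # a change point resets the segment
        n_chg += change
        counts[:, y] += 1
        logw += np.log(f + g)                  # weight update 1(c)
    return preds
```

The weights are kept in log space and normalized only when forming the predictive mixture, which avoids underflow over long sequences.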
If we observe a \"strange\" word that is more predictable from the prior than from the contextual distribution, (6) makes f(t) larger than g(t), which leads to a higher probability that I_t = 1 will be sampled in the Bernoulli trial of Algorithm step 1(b).

3 Mean Shift Model of Natural Language

Chen and Lai [7] recently proposed the above algorithm to analyze DNA sequences. However, when extending this approach to natural language, i.e. word sequences, we meet two serious problems. The first problem is that in a natural language the number of words is extremely large. As opposed to DNA, which has only the four letters A/T/G/C, a natural language usually contains a minimum of some tens of thousands of words, and there are strong correlations between them. For example, if \"nurse\" follows \"hospital\", we believe that there has been no context shift; however, if \"university\" follows \"hospital\", the context has probably shifted to a \"medical school\" subtopic, even though the two words are equally distinct from \"hospital\". Of course, this is due to the semantic relationship we can assume between these words. However, the original multinomial MSM cannot capture this relationship because it treats the words independently. To incorporate it, we require extensive prior knowledge of words in the form of a probabilistic model. The second problem is that in model equation (1), the hyperparameter α of the prior Dirichlet distribution over the latent multinomials is assumed to be known. In the case of natural language, this would mean we know beforehand what words or topics will be spoken in all the texts. Clearly, this is not a natural assumption: we also need an online estimation of α when we extend the MSM to natural language. To solve these problems, we extended the multinomial MSM using two probabilistic text models, LDA and DM. Below we introduce MSM-LDA and MSM-DM, in this order. 
3.1 MSM-LDA

Latent Dirichlet Allocation (LDA) [3] is a probabilistic text model that assumes a hidden multinomial topic distribution θ over M topics for a document d, and estimates it stochastically as a Dirichlet distribution p(θ|d). Context modeling using LDA [5] regards a history h = w_1 ... w_h as a pseudo document and estimates a variational approximation q(θ|h) of the topic distribution p(θ|h) through the variational Bayes EM algorithm on a document [3]. After obtaining the topic distribution q(θ|h), we can predict the next word as follows:

  p(y|h) = ∫ p(y|θ) q(θ|h) dθ = Σ_{i=1}^M p(y|z=i) ⟨θ_i⟩_{q(θ|h)}.      (9)

When we use this prediction with the associated VB-EM algorithm in place of the naïve Dirichlet model (3) of the MSM, we get MSM-LDA, which tracks a latent topic distribution θ instead of a word distribution. Since each particle computes a Dirichlet posterior of the topic distribution, the final topic distribution of MSM-LDA is a mixture of Dirichlet distributions, used for predicting the next word through (4) and (9), as shown in Figure 3(a). Note that MSM-LDA has an implicit generative model corresponding to (1) in topic space. However, here we use a conditional model in which the LDA parameters are already known, in order to estimate the context online.

In MSM-LDA, we can also update the hyperparameter α sequentially from the history. As seen in Figure 2, each particle has a history that has been segmented into pseudo \"documents\" d_1 ... d_c by the change points sampled so far. Since each pseudo \"document\" has a Dirichlet posterior q(θ|d_i) (i = 1 ... c), a common Dirichlet prior can be inferred by a linear-time Newton-Raphson algorithm [3]. Note that this computation needs to be run only when a change point has been sampled. For this purpose, only the sufficient statistics of q(θ|d_i) must be stored in each particle, keeping the whole procedure an online algorithm. Note in passing that MSM-LDA is a model that only tracks the mixing distribution of a mixture model. 
Therefore, in principle this model is also applicable to other mixture models, e.g. Gaussian mixtures, where the mixing distribution is not static but evolves according to (1).

[Figure 3: MSM-LDA and MSM-DM at work. (a) MSM-LDA: each particle segments the context by its sampled change points and maintains a Dirichlet distribution on the topic subsimplex; the weighted particles yield a mixture of Dirichlet distributions, whose expected topic mixture predicts the next word. (b) MSM-DM: each particle maintains a mixture of Dirichlet distributions directly on the word simplex; the weighted particles yield a mixture of such mixtures, whose expected unigram distribution predicts the next word.]

However, in terms of multinomial estimation, this generality has a drawback because it uses a lower-dimensional topic representation to predict the next word, which may cause a loss of information. In contrast, MSM-DM is a model that works directly on the word space to predict the next word with no loss of information.

3.2 MSM-DM

Dirichlet Mixtures (DM) [11] is a novel Bayesian text model that has the lowest perplexity reported so far in context modeling. DM uses no intermediate \"topic\" variables, but places a mixture of Dirichlet distributions directly on the word simplex to model word correlations. Specifically, DM assumes the following generative model for a document w = w_1 ... w_N:^1

  1. Draw m ~ Mult(λ).
  2. Draw p ~ Dir(α_m).
  3. For n = 1 ... N,
     (a) Draw w_n ~ Mult(p).

[Figure 4: Graphical models of (a) the Unigram Mixture (UM) and (b) Dirichlet Mixtures (DM).]

Here p is a V-dimensional unigram distribution over words, α_1 ... α_M are the parameters of the Dirichlet prior distributions of p, and λ is an M-dimensional prior mixing distribution over them. 
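The three generative steps above can be sketched as follows; this is a minimal illustration, with names of our own choosing:

```python
import numpy as np

def sample_dm_document(lam, alpha, N, seed=None):
    """Generate one document of N words from the DM generative model.
    lam:   (M,) mixing distribution lambda over Dirichlet components.
    alpha: (M, V) Dirichlet parameters alpha_1 ... alpha_M.
    Returns the sampled component m and the list of word indices."""
    rng = np.random.default_rng(seed)
    m = int(rng.choice(len(lam), p=lam))              # 1. m ~ Mult(lambda)
    p = rng.dirichlet(alpha[m])                       # 2. p ~ Dir(alpha_m)
    words = rng.choice(alpha.shape[1], size=N, p=p)   # 3. w_n ~ Mult(p)
    return m, words.tolist()
```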
This model can be considered a Bayesian extension of the Unigram Mixture [12]; its graphical model is shown in Figure 4. Given a set of documents D = {w_1, w_2, ..., w_D}, the parameters λ and α_1 ... α_M can be estimated iteratively by a combination of the EM algorithm and a modified Newton-Raphson method, shown in Figure 5, which is a straightforward extension of the estimation of a Polya mixture [13].^2

  E step:  p(m|w_i) ∝ λ_m · Γ(Σ_v α_mv) / Γ(Σ_v (α_mv + n_iv)) · Π_{v∈V} Γ(α_mv + n_iv) / Γ(α_mv)      (13)
  M step:  λ_m ∝ Σ_{i=1}^D p(m|w_i)      (14)
           α_mv ← α_mv · [Σ_i p(m|w_i) n_iv / (α_mv + n_iv - 1)] / [Σ_i p(m|w_i) Σ_v n_iv / (Σ_v α_mv + Σ_v n_iv - 1)]      (15)

Figure 5: EM-Newton algorithm of Dirichlet Mixtures.

Under DM, the predictive probability p(y|h) is (omitting the dependencies on λ and α_1 ... α_M):

  p(y|h) = Σ_{m=1}^M p(y|m, h) p(m|h) = Σ_{m=1}^M [∫ p(y|p) p(p|α_m, h) dp] p(m|h)
         = Σ_{m=1}^M C_m · (α_my + n_y) / Σ_{y'} (α_{my'} + n_{y'}),      (10)

where

  C_m ∝ λ_m · Γ(Σ_v α_mv) / Γ(Σ_v α_mv + |h|) · Π_{v∈V} Γ(α_mv + n_v) / Γ(α_mv)      (11)

and n_v is the number of occurrences of v in h. This prediction can also be considered an extension of Dirichlet smoothing [15] with multiple hyperparameters α_m, weighted accordingly by C_m.^3

When we replace the naïve Dirichlet model (3) by the DM prediction (10), we get MSM-DM, a flexible dynamic model that works directly on the word simplex. Since the original multinomial MSM places a Dirichlet prior in model (1), MSM-DM is a natural extension of the MSM that places a mixture of Dirichlet priors, rather than a single Dirichlet prior, on the multinomial unigram distribution.

^1 Step 1 of the generative model can in fact be replaced by a Dirichlet process prior. A full Bayesian treatment of DM through Dirichlet processes is under development.
^2 DM is an extension of the model for amino acids [14] to natural language with a huge number of parameters, which precludes the ordinary Newton-Raphson algorithm originally proposed in [14]. 
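The prediction (10) with the posterior component weights (11) can be computed stably in log space with log-Gamma functions. A minimal sketch under our own naming, without the sufficient-statistics caching the paper relies on:

```python
import numpy as np
from math import lgamma

def dm_predict(lam, alpha, n):
    """DM prediction (10): a C_m-weighted mixture of Dirichlet-smoothed
    unigram estimates, with C_m = p(m|h) computed as in (11) in log space.
    lam: (M,) mixing weights; alpha: (M, V) Dirichlet parameters;
    n:   (V,) word counts of the history h."""
    M, V = alpha.shape
    h = float(n.sum())
    logC = np.empty(M)
    for m in range(M):
        s = float(alpha[m].sum())
        logC[m] = (np.log(lam[m]) + lgamma(s) - lgamma(s + h)
                   + sum(lgamma(alpha[m, v] + n[v]) - lgamma(alpha[m, v])
                         for v in range(V)))
    C = np.exp(logC - logC.max())
    C /= C.sum()                       # normalized C_m = p(m|h)
    # Dirichlet-smoothed estimate (alpha_my + n_y) / sum_y'(alpha_my' + n_y')
    smoothed = (alpha + n) / (alpha.sum(axis=1, keepdims=True) + h)
    return C @ smoothed                # predictive p(y|h), eq. (10)
```

Subtracting the maximum log weight before exponentiating keeps the normalization of C_m stable even for long histories.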
Because each particle calculates a mixture of Dirichlet posteriors for the current context, the final MSM-DM estimate, a weighted combination over particles, is again a mixture of Dirichlet distributions, as shown in Figure 3(b). In this case, we can also update the mixture prior λ sequentially. Because each particle individually has pseudo \"documents\" w_1 ... w_c segmented by its change points, the posterior λ_m can be obtained similarly to (14):

  λ_m ∝ Σ_{i=1}^c p(m|w_i),      (12)

where p(m|w_i) is obtained from (13). In this case too, only the sufficient statistics p(m|w_i) (i = 1 ... c) must be stored to make MSM-DM a filtering algorithm.

4 Experiments

We conducted experiments using the standard British National Corpus (BNC). We randomly selected 100 files of BNC written texts as an evaluation set, and used the remaining 2,943 files as a training set for the parameter estimation of LDA and DM in advance.

4.1 Training and evaluation data

Since LDA and DM did not converge on long texts like those in the BNC, we divided the training texts into pseudo documents of a minimum of ten sentences for parameter estimation. Due to the huge size of the BNC, we randomly selected a maximum of 20 pseudo documents from each of the 2,943 files to produce a final corpus of 56,939 pseudo documents comprising 11,032,233 words. We used a lexicon of 52,846 words with a frequency ≥ 5. Note that this segmentation is optional and has only an indirect influence on the experiments: it affects only the clustering of LDA and DM. In fact, we could use another corpus, e.g. a newspaper corpus, to estimate the parameters without any preprocessing. Since the proposed method is an algorithm that simultaneously captures topic shifts and their rate in a text to predict the next word, we need evaluation texts that have different rates of topic shifts. 
For this purpose, we prepared four different text sets by sampling from the long BNC texts. Specifically, we conducted sentence-based random sampling as follows: (1) Select a first sentence randomly for each text. (2) Sample X contiguous sentences from that sentence. (3) Skip Y sentences. (4) Continue steps (2) and (3) until the desired length of text is obtained. In this procedure, X and Y are random variables with the uniform distributions given in Table 1. We sampled 100 sentences from each of the 100 files by this procedure to create the four evaluation text sets listed in the table.

^3 Therefore, MSM-DM can be considered an ingenious dynamic Dirichlet smoothing as well as a context model.

  Name      Property
  Raw       X = 100, Y = 0
  Slow      1 ≤ X ≤ 10, 1 ≤ Y ≤ 3
  Fast      1 ≤ X ≤ 10, 1 ≤ Y ≤ 10
  VeryFast  X = 1, 1 ≤ Y ≤ 10

Table 1: Types of Evaluation Texts.

4.2 Parameter settings

The numbers of latent classes in LDA and DM are set to 200 and 50, respectively.^4 The number of particles is set to N = 20, a relatively small number, because each particle executes an exact Bayesian prediction once the previous change points have been sampled. The Beta prior distribution of context change could be initialized as a uniform distribution, (α, β) = (1, 1). However, based on a preliminary experiment we set it to (α, β) = (1, 50): this means we initially assume a context change rate of once every 50 words on average, which is then updated adaptively.

4.3 Experimental results

Table 2 shows the unigram perplexity of contextual prediction for each type of evaluation set.

  Text   MSM-DM           DM       MSM-LDA  LDA
  Raw    870.06 (-6.02%)  925.83   1028.04  1037.42
  Slow   893.06 (-8.31%)  974.04   1047.08  1060.56
  Fast   898.34 (-9.10%)  988.26   1044.56  1061.01
  VFast  960.26 (-7.57%)  1038.89  1065.15  1050.83

Table 2: Contextual Unigram Perplexities for the Evaluation Texts.

Perplexity is the reciprocal of the geometric average of the contextual predictions; better predictions therefore yield lower perplexity. 
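The sentence-based sampling procedure of Section 4.1 above can be sketched as follows. This is our own illustration; it simply truncates if the source text is exhausted before n sentences are collected:

```python
import random

def sample_eval_text(sentences, x_range, y_range, n=100, seed=0):
    """Build one evaluation text: pick a random start sentence, take X
    contiguous sentences, skip Y, and repeat until n sentences are collected.
    x_range, y_range: inclusive (lo, hi) bounds of the uniform distributions,
    e.g. (1, 10) and (1, 3) for the 'Slow' set, (100, 100) and (0, 0) for 'Raw'."""
    rng = random.Random(seed)
    i = rng.randrange(len(sentences))       # (1) random first sentence
    out = []
    while len(out) < n and i < len(sentences):
        x = rng.randint(*x_range)           # (2) sample X contiguous sentences
        out.extend(sentences[i:i + x])
        i += x + rng.randint(*y_range)      # (3) skip Y sentences
    return out[:n]
```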
While MSM-LDA only slightly improves on LDA, limited by the topic space compression explained in Section 3.1, MSM-DM yields consistently better predictions, and its improvement is larger for texts whose subtopics change faster. Figure 6 shows a plot of the actual improvements relative to DM, PPL_MSM - PPL_DM. We can see that prediction improves for most documents by automatically selecting appropriate contexts. The maximum improvement was 365 in PPL for one of the evaluation texts. Finally, Figure 7 shows a sequential plot of the context change probabilities p^(i)(I_t = 1) (i = 1 ... N, t = 1 ... T) calculated by each particle for the first 1,000 words of one of the evaluation texts.

[Figure 6: Histogram of the perplexity reductions of MSM relative to DM over the evaluation documents.]

[Figure 7: Context change probabilities for a 1,000-word text, sampled by the particles.]

5 Conclusion and Future Work

In this paper, we extended the multinomial Particle Filter from a small number of symbols to natural language, which has an extremely large number of symbols. By combining the original filter with the Bayesian text models LDA and DM, we obtain two models, MSM-LDA and MSM-DM, that can incorporate semantic relationships between words and can update their hyperparameters sequentially. Under these models, prediction is made using a mixture of different context lengths sampled by the Monte Carlo particles.

^4 We deliberately chose a smaller number of mixtures in DM because DM is reported to perform better with a small number of mixtures, since it is essentially a unitopic model, in contrast to LDA. 
Although the proposed method is still at a fundamental stage, we plan to extend it to larger units of change points beyond words, and to use forward-backward MCMC or Expectation Propagation to model the semantic structure of text more precisely.

References

[1] Jay M. Ponte and W. Bruce Croft. A Language Modeling Approach to Information Retrieval. In Proc. of SIGIR '98, pages 275-281, 1998.
[2] David Cohn and Thomas Hofmann. The Missing Link: A Probabilistic Model of Document Content and Hypertext Connectivity. In Advances in Neural Information Processing Systems (NIPS), 2001.
[3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[4] Daniel Gildea and Thomas Hofmann. Topic-based Language Models Using EM. In Proc. of EUROSPEECH '99, pages 2167-2170, 1999.
[5] Takuya Mishina and Mikio Yamamoto. Context adaptation using variational Bayesian learning for ngram models based on probabilistic LSA. IEICE Trans. on Inf. and Sys., J87-D-II(7):1409-1417, 2004.
[6] Zoubin Ghahramani and Michael I. Jordan. Factorial Hidden Markov Models. In Advances in Neural Information Processing Systems (NIPS), volume 8, pages 472-478. MIT Press, 1995.
[7] Yuguo Chen and Tze Leung Lai. Sequential Monte Carlo Methods for Filtering and Smoothing in Hidden Markov Models. Discussion Paper 03-19, Institute of Statistics and Decision Sciences, Duke University, 2003.
[8] H. Chernoff and S. Zacks. Estimating the Current Mean of a Normal Distribution Which is Subject to Changes in Time. Annals of Mathematical Statistics, 35:999-1018, 1964.
[9] Yi-Ching Yao. Estimation of a noisy discrete-time step function: Bayes and empirical Bayes approaches. Annals of Statistics, 12:1434-1447, 1984.
[10] Arnaud Doucet, Nando de Freitas, and Neil Gordon. Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science. Springer-Verlag, 2001.
[11] Mikio Yamamoto and Kugatsu Sadamitsu. Dirichlet Mixtures in Text Modeling. 
CS Technical Report CS-TR-05-1, University of Tsukuba, 2005. http://www.mibel.cs.tsukuba.ac.jp/~myama/pdf/dm.pdf.
[12] Kamal Nigam, Andrew K. McCallum, Sebastian Thrun, and Tom M. Mitchell. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39(2/3):103-134, 2000.
[13] Thomas P. Minka. Estimating a Dirichlet distribution, 2000. http://research.microsoft.com/~minka/papers/dirichlet/.
[14] K. Sjölander, K. Karplus, M. P. Brown, R. Hughey, A. Krogh, I. S. Mian, and D. Haussler. Dirichlet Mixtures: A Method for Improved Detection of Weak but Significant Protein Sequence Homology. Computer Applications in the Biosciences, 12(4):327-345, 1996.
[15] D. J. C. MacKay and L. Peto. A Hierarchical Dirichlet Language Model. Natural Language Engineering, 1(3):1-19, 1994.
", "award": [], "sourceid": 2887, "authors": [{"given_name": "Daichi", "family_name": "Mochihashi", "institution": null}, {"given_name": "Yuji", "family_name": "Matsumoto", "institution": null}]}