{"title": "Correlated Bigram LSA for Unsupervised Language Model Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 1633, "page_last": 1640, "abstract": "We propose using correlated bigram LSA for unsupervised LM adaptation for automatic speech recognition. The model is trained using efficient variational EM and smoothed using the proposed fractional Kneser-Ney smoothing which handles fractional counts. Our approach can be scalable to large training corpora via bootstrapping of bigram LSA from unigram LSA. For LM adaptation, unigram and bigram LSA are integrated into the background N-gram LM via marginal adaptation and linear interpolation respectively. Experimental results show that applying unigram and bigram LSA together yields 6%--8% relative perplexity reduction and 0.6% absolute character error rates (CER) reduction compared to applying only unigram LSA on the Mandarin RT04 test set. Comparing with the unadapted baseline, our approach reduces the absolute CER by 1.2%.", "full_text": "Correlated Bigram LSA for Unsupervised Language\n\nModel Adaptation\n\nYik-Cheung Tam\u2217\n\nTanja Schultz\n\nInterACT, Language Technologies Institute\n\nInterACT, Language Technologies Institute\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\nyct@cs.cmu.edu\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\ntanja@cs.cmu.edu\n\nAbstract\n\nWe present a correlated bigram LSA approach for unsupervised LM adaptation for\nautomatic speech recognition. The model is trained using ef\ufb01cient variational EM\nand smoothed using the proposed fractional Kneser-Ney smoothing which handles\nfractional counts. We address the scalability issue to large training corpora via\nbootstrapping of bigram LSA from unigram LSA. For LM adaptation, unigram\nand bigram LSA are integrated into the background N-gram LM via marginal\nadaptation and linear interpolation respectively. 
Experimental results on the Mandarin RT04 test set show that applying unigram and bigram LSA together yields 6%–8% relative perplexity reduction and 2.5% relative character error rate reduction, which is statistically significant compared to applying only unigram LSA. On a large-scale Arabic evaluation, a statistically significant 3% relative word error rate reduction is also achieved.\n\n1 Introduction\n\nLanguage model (LM) adaptation is crucial to automatic speech recognition (ASR) as it enables higher-level contextual information to be effectively incorporated into a background LM, improving recognition performance. Exploiting topical context for LM adaptation has been shown to be effective for ASR using latent semantic analysis (LSA), such as LSA using singular value decomposition [1], Latent Dirichlet Allocation (LDA) [2, 3, 4] and HMM-LDA [5, 6]. One issue in LSA is the bag-of-word assumption, which ignores word ordering. For document classification, word ordering may not be important. From the LM perspective, however, word ordering is crucial since a trigram LM normally performs significantly better than a unigram LM for word prediction. In this paper, we investigate whether relaxing the bag-of-word assumption in LSA helps improve ASR performance via LM adaptation.\n\nWe employ bigram LSA [7], which is a natural extension of LDA that relaxes the bag-of-word assumption by connecting the adjacent words in a document together to form a Markov chain. There are two main challenges in bigram LSA which are not addressed properly in [7], especially for large-scale applications. Firstly, the model can be very sparse since it covers topical bigrams in O(V^2 · K) parameters, where V and K denote the vocabulary size and the number of topics. Therefore, model smoothing becomes critical. Secondly, model initialization is important for EM training, especially for bigram LSA due to the model sparsity. 
To tackle the first challenge, we represent bigram LSA as a set of K topic-dependent backoff LMs. We propose fractional Kneser-Ney smoothing(1), which supports fractional counts to smooth each backoff LM. We show that our formulation recovers the original Kneser-Ney smoothing [9], which supports only integral counts. To address the second challenge, we propose a bootstrapping approach for bigram LSA training using a well-trained unigram LSA as an initial model.\n\n*This work is partly supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-06-2-0001. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.\n\n(1)This method was briefly mentioned in [8] without detail. To the best of our knowledge, our formulation in this paper is considered new to the research community.\n\nFigure 1: Graphical representation of bigram LSA. The top node is the prior distribution over the topic mixture weights θ, the middle layer holds the latent topics z1...zN, and the bottom layer holds the observed words w1...wN. Adjacent words in a document are linked together to form a Markov chain from left to right.\n\nDuring unsupervised LM adaptation, word hypotheses from the first-pass decoding are used to estimate the topic mixture weight of each test audio to adapt both unigram and bigram LSA. The adapted unigram and bigram LSA are combined with the background LM in two stages. Firstly, marginal adaptation [10] is applied to integrate unigram LSA into the background LM. 
Then the intermediate adapted LM from the first stage is combined with bigram LSA via linear interpolation, with the interpolation weights estimated by minimizing the word perplexity on the word hypotheses. The final adapted LM is employed for re-decoding.\n\nRelated work includes topic mixtures [11], which perform document clustering and train a trigram LM for each document cluster as an initial model. Sentence-level topic mixtures are modeled so that the topic label is fixed within a sentence. The topical N-gram model [12] focuses on phrase discovery and information retrieval. We do not apply this model because the phrase-based LM does not appear to outperform the word-based LM.\n\nThe paper is organized as follows: In Section 2, we describe the bigram LSA training and the fractional Kneser-Ney smoothing algorithm. In Section 3, we present the LM adaptation approach based on marginal adaptation and linear interpolation. In Section 4, we report LM adaptation results on Mandarin and Arabic ASR, followed by conclusions and future work in Section 5.\n\n2 Correlated bigram LSA\n\nLatent semantic analysis such as LDA makes a bag-of-word assumption that each word in a document is generated irrespective of its position in the document. To relax this assumption, bigram LSA has been proposed [7] to modify the graphical structure of LDA by connecting adjacent words in a document together to form a Markov chain. Figure 1 shows the graphical representation of bigram LSA, where the top node represents the prior distribution over the topic mixture weights and the middle layer represents the latent topic label associated with each observed word at the bottom layer. The document generation procedure of bigram LSA is similar to LDA except that the previous word is taken into consideration for generating the current word:\n\n1. Sample θ from a prior distribution p(θ)\n2. 
For each word wi at the i-th position of a document:\n\n(a) Sample a topic label: zi ~ Multinomial(θ)\n(b) Sample wi given the previous word wi-1 and the topic label zi: wi ~ p(·|wi-1, zi)\n\nOur incremental contributions for bigram LSA are threefold: Firstly, we present a technique for topic correlation modeling using a Dirichlet-Tree prior in Section 2.1. Secondly, we propose an efficient algorithm for bigram LSA training via a variational Bayes approach and model bootstrapping, which is scalable to large settings, in Section 2.2. Thirdly, we formulate fractional Kneser-Ney smoothing to generalize the original Kneser-Ney smoothing, which supports only integral counts, in Section 2.3.\n\nFigure 2: Left: Dirichlet-Tree prior of depth two, with Dirichlet nodes j = 1...J over branches leading to the latent topics 1...K. Right: Variational E-step as bottom-up propagation and summation of fractional topic counts q(z = k).\n\n2.1 Topic correlation\n\nModeling topic correlations is motivated by the observation that documents such as newspaper articles are usually organized into a main-topic and sub-topic hierarchy for document browsing. From this perspective, a Dirichlet prior is not appropriate since it assumes topic independence. A Dirichlet-Tree prior [13, 14] is employed to capture topic correlations. Figure 2 (Left) illustrates a depth-two Dirichlet-Tree. A depth-one Dirichlet-Tree is equivalent to a Dirichlet prior in LDA. The sampling procedure for the topic mixture weight θ ~ p(θ) can be described as follows:\n\n1. Sample a vector of branch probabilities bj ~ Dirichlet(·; {αjc}) for each node j = 1...J, where {αjc} denotes the parameter of the Dirichlet distribution at node j, i.e. 
the pseudo-counts of the outgoing branch c at node j.\n\n2. Compute the topic mixture weight as θk = Π_jc bjc^δjc(k), where δjc(k) is an indicator function which is set to unity when the c-th branch of the j-th node leads to the leaf node of topic k, and zero otherwise. The k-th topic weight θk is computed as the product of sampled branch probabilities from the root node to the leaf node corresponding to topic k.\n\nThe structure and the number of outgoing branches of each Dirichlet node can be arbitrary. In this paper, we employ a balanced binary Dirichlet-Tree.\n\n2.2 Model training\n\nGibbs sampling was employed for bigram LSA training in [7]. Despite its simplicity, it can be slow and inefficient since it usually requires many sampling iterations for convergence. We present a variational Bayes approach for model training. The joint likelihood of a document w_1^N, the latent topic sequence z_1^N and θ under bigram LSA can be written as follows:\n\np(w_1^N, z_1^N, θ) = p(θ) · Π_{i=1..N} p(zi|θ) · p(wi|wi-1, zi)   (1)\n\nBy introducing a factorizable variational posterior distribution q(z_1^N, θ; Γ) = q(θ) · Π_{i=1..N} q(zi) over the latent variables and applying Jensen's inequality, the lower bound of the marginalized document likelihood can be derived as follows:\n\nlog p(w_1^N; Λ, Γ) = log ∫_θ Σ_{z1...zN} p(w_1^N, z_1^N, θ; Λ)   (2)\n≥ ∫_θ Σ_{z1...zN} q(z_1^N, θ; Γ) · log [p(w_1^N, z_1^N, θ; Λ) / q(z_1^N, θ; Γ)]   (By Jensen's inequality)   (3)\n= Eq[log p(θ)/q(θ)] + Σ_{i=1..N} Eq[log p(zi|θ)/q(zi)] + Σ_{i=1..N} Eq[log p(wi|wi-1, zi)]   (4)\n= Q(w_1^N; Λ, Γ)   (5)\n\nwhere the expectation is taken using the variational posterior q(z_1^N, θ). 
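The Dirichlet-Tree sampling procedure of Section 2.1 can be sketched in a few lines. The following is a minimal illustration with a toy balanced binary tree over K = 4 topics; the node names and pseudo-counts are assumed for the example, not taken from the paper:

```python
import numpy as np

# Toy balanced binary Dirichlet-Tree over K = 4 topics (Section 2.1).
# Node names and pseudo-counts {alpha_jc} are assumed for illustration.
rng = np.random.default_rng(0)
alpha = {
    "root": np.array([2.0, 1.0]),   # branches to nodes "left" and "right"
    "left": np.array([1.0, 1.0]),   # branches to topics 0 and 1
    "right": np.array([1.0, 3.0]),  # branches to topics 2 and 3
}

# Step 1: sample branch probabilities b_j ~ Dirichlet(.; {alpha_jc}) per node.
b = {j: rng.dirichlet(a) for j, a in alpha.items()}

# Step 2: theta_k = prod_jc b_jc^{delta_jc(k)}, i.e. the product of branch
# probabilities on the path from the root to the leaf of topic k.
theta = np.array([
    b["root"][0] * b["left"][0],   # topic 0
    b["root"][0] * b["left"][1],   # topic 1
    b["root"][1] * b["right"][0],  # topic 2
    b["root"][1] * b["right"][1],  # topic 3
])
```

Deeper or wider trees follow the same pattern: each θk multiplies the sampled branch probabilities along the root-to-leaf path of topic k, so the resulting vector is always a proper distribution over topics.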
For the E-step, we compute the partial derivative of the auxiliary function Q(·) with respect to q(zi) and the parameter γjc in the Dirichlet-Tree posterior q(θ). Setting the derivatives to zero yields:\n\nE-step:\n\nq(zi = k) ∝ p(wi|wi-1, k) · exp(Eq[log θk; {γjc}])  for k = 1..K   (6)\n\nγjc = αjc + Σ_{i=1..N} Eq[δjc(zi)] = αjc + Σ_{i=1..N} Σ_{k=1..K} q(zi = k) · δjc(k)   (7)\n\nwhere Eq[log θk] = Σ_jc δjc(k) · Eq[log bjc] = Σ_jc δjc(k) · (Ψ(γjc) - Ψ(Σ_{c'} γjc'))   (8)\n\nEqn 7 is motivated by the conjugate property that the Dirichlet-Tree posterior given the topic sequence z_1^N has the same form as the Dirichlet-Tree prior:\n\np(b_1^J | z_1^N) ∝ p(z_1^N | b_1^J) · p(b_1^J; {αjc}) ∝ (Π_{i=1..N} Π_jc bjc^{δjc(zi)}) · Π_jc bjc^{αjc - 1} = Π_jc bjc^{(αjc + Σ_{i=1..N} δjc(zi)) - 1} = Π_jc bjc^{γ'jc - 1} = Π_{j=1..J} Dirichlet(bj; {γ'jc})   (9)\n\nFigure 2 (Right) illustrates that Eqn 7 can be implemented as propagation of fractional topic counts in a bottom-up fashion with each branch as an accumulator for γjc. Eqn 6 and Eqn 7 are applied iteratively until convergence is reached. For the M-step, we compute the partial derivative of the auxiliary function Q(·) over all training documents d with respect to the topic bigram probability p(v|u, k) and set it to zero:\n\nM-step (unsmoothed):\n\np(v|u, k) ∝ Σ_d Σ_{i=1..Nd} q(zi = k|d) · δ(wi-1, u) δ(wi, v)   (10)\n= Σ_d Cd(u, v|k) / Σ_d Σ_{v'=1..V} Cd(u, v'|k)   (11)\n= C(u, v|k) / Σ_{v'=1..V} C(u, v'|k)   (12)\n\nwhere Nd denotes the number of words in document d and δ(wi, v) is a 0-1 Kronecker delta function testing whether the i-th word in document d is vocabulary item v. 
Cd(u, v|k) denotes the fractional count of bigram (u, v) belonging to topic k in document d. Intuitively, Eqn 12 simply computes the relative frequency of the bigram (u, v). However, this solution is not practical since bigram LSA assigns zero probability to unseen bigrams. Therefore, bigram LSA should be smoothed properly. One simple approach is Laplace smoothing, adding a small count δ to all bigrams. However, this approach can lead to worse performance since it biases the bigram probability towards a uniform distribution as the vocabulary size V gets large. Our approach is to represent p(v|u, k) as a standard backoff LM smoothed by fractional Kneser-Ney smoothing as described in Section 2.3.\n\nModel initialization is crucial for variational EM training. We employ a bootstrapping approach using a well-trained unigram LSA as an initial model for bigram LSA, so that p(wi|wi-1, k) is approximated by p(wi|k) in Eqn 6. This saves computation and avoids keeping the full initial bigram LSA in memory during EM training. To make the training procedure more practical, we apply bigram pruning during statistics accumulation in the M-step when the bigram count in a document is less than 0.1. This heuristic is reasonable since only a small number of topics are "active" for a bigram. With this sparsity, there is no need to store K copies of accumulators for each bigram, which reduces the memory requirement significantly. The pruned bigram counts are re-assigned to the most likely topic of the current document so that the counts are conserved. For a practical implementation, accumulators are saved to disk in batches for count merging. In the final step, each topic-dependent LM is smoothed individually using the merged count file.\n\n2.3 Fractional Kneser-Ney smoothing\n\nStandard backoff N-gram LMs are widely used in the ASR community. 
The state-of-the-art smoothing for backoff LMs is Kneser-Ney smoothing [9]. Its success is believed to be due to the preservation of marginal distributions. However, the original formulation only works for integral counts, which is not suitable for bigram LSA with fractional counts. Therefore, we propose fractional Kneser-Ney smoothing as a generalization of the original formulation. The interpolated form using absolute discounting can be expressed as follows:\n\npKN(v|u) = max{C(u, v) - D, 0}/C(u) + λ(u) · pKN(v)   (13)\n\nwhere D is a discounting factor. In the original formulation, D lies between 0 and 1, but in our formulation D can be any positive number. Intuitively, D controls the degree of smoothing. If D is set to zero, the model is unsmoothed; if D is too big, bigrams with counts smaller than D are pruned from the LM. λ(u) ensures that the bigram probability sums to unity. After summing over all possible v on both sides of Eqn 13 and re-arranging terms, λ(u) becomes:\n\n1 = Σ_v max{C(u, v) - D, 0}/C(u) + λ(u)   (14)\n=> λ(u) = 1 - Σ_v max{C(u, v) - D, 0}/C(u)   (15)\n= 1 - Σ_{v: C(u,v)>D} (C(u, v) - D)/C(u)   (16)\n= (C(u) - Σ_{v: C(u,v)>D} C(u, v) + D · Σ_{v: C(u,v)>D} 1) / C(u)   (17)\n= (Σ_{v: C(u,v)≤D} C(u, v) + D · Σ_{v: C(u,v)>D} 1) / C(u) = (C≤D(u, ·) + D · N>D(u, ·)) / C(u)   (18)\n\nwhere C≤D(u, ·) denotes the sum of bigram counts following u that are at most D. 
N>D(u, ·) denotes the number of word types following u with bigram counts bigger than D. In Kneser-Ney smoothing, the lower-order distribution pKN(v) is treated as a set of unknown parameters which can be estimated using the preservation of marginal distributions:\n\np̂(v) = Σ_u pKN(v|u) · p̂(u)   (19)\n\nwhere p̂(v) is the marginal distribution estimated from the background training data, so that p̂(v) = C(v)/Σ_{v'} C(v'). Substituting Eqn 13 into Eqn 19 gives:\n\nC(v) = Σ_u (max{C(u, v) - D, 0}/C(u) + λ(u) · pKN(v)) · C(u)   (20)\n= (Σ_u max{C(u, v) - D, 0}) + pKN(v) · Σ_u C(u) · λ(u)   (21)\n=> pKN(v) = (C(v) - Σ_u max{C(u, v) - D, 0}) / Σ_u C(u) · λ(u)   (22)\n= (C(v) - C>D(·, v) + D · N>D(·, v)) / Σ_u (C≤D(u, ·) + D · N>D(u, ·))   (using Eqn 18)   (23)\n= (C≤D(·, v) + D · N>D(·, v)) / Σ_u (C≤D(u, ·) + D · N>D(u, ·))   (24)\n= (C≤D(·, v) + D · N>D(·, v)) / Σ_{v'} (C≤D(·, v') + D · N>D(·, v'))   (25)\n\nEqn 25 generalizes Kneser-Ney smoothing to integral and fractional counts. In the original formulation, C≤D(u, ·) equals zero since each observed bigram count is at least one by definition and D is less than one. As a result, the D terms cancel out, yielding the original formulation, which simply counts the number of word types preceding v. Intuitively, the numerator in Eqn 25 measures the total discount of the observed bigrams ending at v. In other words, fractional Kneser-Ney smoothing estimates the lower-order probability distribution using relative frequencies over discounts instead of word counts. 
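A minimal sketch of fractional Kneser-Ney smoothing (the interpolated form of Eqn 13 with the lower-order distribution of Eqn 25), assuming toy fractional bigram counts; the experiments instead smooth each topic-dependent backoff LM from merged count files:

```python
from collections import defaultdict

def fractional_kneser_ney(bigram_counts, D):
    """Sketch of fractional Kneser-Ney smoothing (Eqns 13, 18 and 25).

    bigram_counts: dict mapping (u, v) -> fractional count C(u, v).
    D: discounting factor; any positive number for fractional counts.
    Returns p(v|u) as a nested dict over the observed vocabulary.
    """
    C_u = defaultdict(float)      # C(u) = sum_v C(u, v)
    low_num = defaultdict(float)  # C_{<=D}(., v) + D * N_{>D}(., v) per v
    vocab = set()
    for (u, v), c in bigram_counts.items():
        C_u[u] += c
        vocab.update((u, v))
        # Numerator of Eqn 25: discounted mass of observed bigrams ending at v.
        low_num[v] += c if c <= D else D

    low_denom = sum(low_num.values())  # denominator of Eqn 25
    p_kn_low = {v: low_num[v] / low_denom for v in vocab}

    # lambda(u) from Eqn 18: (C_{<=D}(u,.) + D * N_{>D}(u,.)) / C(u)
    lam = defaultdict(float)
    for (u, v), c in bigram_counts.items():
        lam[u] += (c if c <= D else D) / C_u[u]

    # Interpolated form of Eqn 13.
    p = {u: {} for u in C_u}
    for u in C_u:
        for v in vocab:
            c = bigram_counts.get((u, v), 0.0)
            p[u][v] = max(c - D, 0.0) / C_u[u] + lam[u] * p_kn_low.get(v, 0.0)
    return p

# Toy fractional counts, e.g. topic-posterior-weighted bigram statistics.
counts = {("a", "b"): 0.7, ("a", "c"): 0.3, ("b", "c"): 1.2, ("c", "a"): 0.5}
p = fractional_kneser_ney(counts, D=0.4)
for u in p:  # each conditional sums to one over the observed vocabulary
    assert abs(sum(p[u].values()) - 1.0) < 1e-9
```

The normalization works because the mass min{C(u, v), D} reserved from each observed bigram is exactly the backoff weight λ(u) of Eqn 18, and the lower-order distribution of Eqn 25 itself sums to one.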
With this approach, each topic-dependent LM in bigram LSA can be smoothed using our formulation.\n\n3 Unsupervised LM adaptation\n\nUnsupervised LM adaptation is performed by first inferring the topic distribution of each test audio show using the word hypotheses from the first-pass decoding via variational inference (Eqns 6-7). Relative frequency over the branch posterior counts γjc is taken at each Dirichlet node j. The MAP topic mixture weight θ̂ and the adapted unigram and bigram LSA are computed as follows:\n\nθ̂k ∝ Π_jc (γjc / Σ_{c'} γjc')^{δjc(k)}  for k = 1...K   (26)\n\npa(v) = Σ_{k=1..K} p(v|k) · θ̂k  and  pa(v|u) = Σ_{k=1..K} p(v|u, k) · θ̂k   (27)\n\nThe unigram LSA marginals are integrated into the background N-gram LM pbg(v|h) via marginal adaptation [10] as follows:\n\npa^(1)(v|h) ∝ (pa(v)/pbg(v))^β · pbg(v|h)   (28)\n\nMarginal adaptation has a close connection to maximum entropy modeling since the marginal constraints can be encoded as unigram features. Intuitively, bigram LSA could be integrated in the same fashion by introducing bigram marginal constraints. However, we found that integrating bigram features via marginal adaptation did not offer further improvement compared to integrating unigram features only. Since marginal adaptation integrates a unigram feature as a likelihood ratio between the adapted marginal pa(v) and the background marginal pbg(v) in Eqn 28, perhaps the unigram and bigram likelihood ratios are very similar, so the latter adds little extra information. Another explanation is that marginal adaptation corresponds to only one iteration of generalized iterative scaling (GIS). Due to the large number of bigram features (in the millions), one GIS iteration may not be sufficient for convergence. 
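The adaptation of Eqns 27-28 can be sketched as follows. This is a minimal sketch with randomly generated toy distributions standing in for the LSA and background models; the topic weights and β = 0.5 are illustrative stand-ins, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
K, V = 4, 6  # toy numbers of topics and vocabulary words

# Toy model parameters (assumed for illustration only).
p_v_given_k = rng.dirichlet(np.ones(V), size=K)  # unigram LSA p(v|k), shape (K, V)
p_bg_v = rng.dirichlet(np.ones(V))               # background unigram marginal p_bg(v)
p_bg_v_given_h = rng.dirichlet(np.ones(V))       # background p_bg(v|h) for one history h
theta_hat = rng.dirichlet(np.ones(K))            # stand-in for MAP weights of Eqn 26

# Eqn 27: adapted unigram marginals p_a(v) = sum_k p(v|k) * theta_hat_k.
p_a_v = theta_hat @ p_v_given_k

# Eqn 28: marginal adaptation of the background LM,
# p_a^(1)(v|h) proportional to (p_a(v) / p_bg(v))^beta * p_bg(v|h).
beta = 0.5
scaled = (p_a_v / p_bg_v) ** beta * p_bg_v_given_h
p1_v_given_h = scaled / scaled.sum()  # renormalize over the vocabulary
```

In the full system, θ̂ comes from Eqn 26 via the branch posterior counts γjc of the Dirichlet-Tree; the Dirichlet draw here merely stands in for that inference step.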
On the other hand, simple linear LM interpolation is found to be effective in our experiments. The final LM adaptation formula combines the results of Eqn 27 and Eqn 28 in a two-stage process:\n\npa^(2)(v|h) = λ · pa^(1)(v|h) + (1 - λ) · pa(v|u)   (29)\n\nwhere u is the last word of the history h and λ is tuned to optimize perplexity on the word hypotheses from the first-pass decoding on a per-audio basis.\n\n4 Experimental setup\n\nOur LM adaptation approach was evaluated using the RT04 Mandarin Broadcast News evaluation system. The system employed context-dependent Initial-Final acoustic models trained on 100 hours of broadcast news audio from the Mandarin HUB4 1997 training set and a subset of TDT4. 42-dimensional features were extracted after linear discriminant analysis projected from a window of MFCC and energy features. The system employed a two-pass decoding strategy using speaker-independent and speaker-adaptive acoustic models. For the second-pass decoding, we applied standard acoustic model adaptation such as vocal tract length normalization and maximum likelihood linear regression in the feature and model spaces. The training corpora include Xinhua News 2002 (January-September) containing 13M words and 64k documents. A background 4-gram LM was trained with modified Kneser-Ney smoothing using the SRILM toolkit [15]. The same training corpora were used for unigram and bigram LSA training with 200 topics. The vocabulary size is 108k words. The discounting factor D for fractional Kneser-Ney smoothing was set to 0.4.\n\nFirst-pass decoding was performed to obtain an automatic transcript for each audio show. Then unsupervised LM adaptation was applied using the automatic transcript to obtain an adapted LM for second-pass decoding using the approach described in Section 3. Word perplexity and character error rates (CER) were measured on the Mandarin RT04 test set. 
Matched pairs sentence-segment word error tests were performed for significance testing using the NIST scoring tool.\n\nTable 1: Correlated bigram topics extracted from bigram LSA. Bigrams are shown by their English glosses; the original Chinese characters were garbled in extraction and are omitted.\n\nTopic index | Top bigrams sorted by p(u, v|k)\n"topic-61" | 's student, 's education, education 's, school 's, youth class, quality of education\n"topic-62" | expert cultivation, university chancellor, famous, high-school, 's student\n"topic-63" | and social security, 's employment, unemployed officer, employment position\n"topic-64" | 's research, expert people, etc area, biological technology, research result\n"topic-65" | Human DNA sequence, 's DNA, biological technology, embryo stem cell\n\nTable 2: Character Error Rates (Word perplexity) on the RT04 test set. 
Bigram LSA was applied in addition to unigram LSA.\n\nLM (13M)                             | CCTV        | NTDTV       | RFA         | OVERALL\nbackground LM                        | 15.3% (748) | 21.8 (1718) | 39.5 (3655) | 24.9\n+unigram LSA                         | 14.4 (629)  | 21.5 (1547) | 38.9 (3015) | 24.3\n+bigram LSA (Kneser-Ney, 30 topics)  | 14.5 (604)  | 20.7 (1502) | 39.0 (2736) | 24.1\n+bigram LSA (Witten-Bell)            | 14.1 (594)  | 20.9 (1452) | 38.3 (2628) | 23.8\n+bigram LSA (Kneser-Ney)             | 14.0 (587)  | 20.8 (1448) | 38.2 (2586) | 23.7\n\n4.1 LM adaptation results\n\nTable 1 shows the correlated bigram topics sorted by the joint bigram probability p(v|u, k) · p(u|k). Most of the top bigrams appear either as phrases or as words attached to a stopword such as 的 ('s in English). Table 2 shows the LM adaptation results in CER and perplexity. Applying both unigram and bigram LSA yields consistent improvement over unigram LSA alone, in the range of 6.4%–8.5% relative reduction in perplexity and 2.5% relative reduction in the overall CER. The CER reduction is statistically significant at the 0.1% significance level. We compared our proposed fractional Kneser-Ney smoothing with Witten-Bell smoothing, which also supports fractional counts. The results show that Kneser-Ney smoothing performs slightly better than Witten-Bell smoothing. Increasing the number of topics in bigram LSA helps despite model sparsity. 
We applied extra EM iterations on top of the bootstrapped bigram LSA but no further performance improvement was observed.\n\n4.2 Large-scale evaluation\n\nWe evaluated our approach using the CMU-InterACT vowelized Arabic transcription system discriminatively trained on 1500 hours of transcribed audio using MMIE for the GALE Phase-3 evaluation. A large background 4-gram LM was trained on 962M-word text corpora with a 737k vocabulary. Unigram and bigram LSA were trained on the same corpora and applied to lattice rescoring on the Dev07 and unseen Dev08 test sets, with 2.6 hours and 3 hours of audio shows respectively, covering the broadcast news (BN) and broadcast conversation (BC) genres. Table 3 shows that bigram LSA rescoring reduces the overall word error rate by more than 3.0% relative compared to the unadapted baseline on both sets, which is statistically significant at the 0.1% significance level. However, degradation is observed using trigram LSA compared to bigram LSA, which may be due to data sparseness.\n\nTable 3: Lattice rescoring results in word error rate on Dev07 (unseen Dev08) using the CMU-InterACT Arabic transcription system for the GALE Phase-3 evaluation.\n\nGALE LM (962M)             | BN    | BC   | OVERALL\nbackground LM              | 11.6% | 19.4 | 14.3 (16.4)\n+unigram LSA               | 11.5  | 19.2 | 14.2 (16.3)\n+bigram LSA (Witten-Bell)  | 11.0  | 19.0 | 13.9 (15.9)\n+bigram LSA (Kneser-Ney)   | 11.0  | 18.9 | 13.8 (15.9)\n+trigram LSA (Kneser-Ney)  | 11.3  | 18.8 | 14.0 (-)\n\n5 Conclusion\n\nWe present a correlated bigram LSA approach for unsupervised LM adaptation for ASR. Our contributions include efficient variational EM for model training and the fractional Kneser-Ney approach for LM smoothing with fractional counts. Bigram LSA yields additional improvement in both perplexity and recognition performance on top of unigram LSA. Increasing the number of topics for bigram LSA helps despite the model sparsity. 
Bootstrapping bigram LSA from unigram LSA reduces computation and memory requirements during EM training. Our approach is scalable to large training corpora and works well on different languages. The improvement from bigram LSA is statistically significant compared to the unadapted baseline. Future work includes applying the proposed approach to statistical machine translation.\n\nAcknowledgement\n\nWe would like to thank Mark Fuhs for help parallelizing the bigram LSA training via Condor.\n\nReferences\n\n[1] J. R. Bellegarda, "Large Vocabulary Speech Recognition with Multispan Statistical Language Models," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 1, pp. 76-84, Jan 2000.\n\n[2] D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.\n\n[3] Y. C. Tam and T. Schultz, "Language model adaptation using variational Bayes inference," in Proceedings of Interspeech, 2005.\n\n[4] D. Mrva and P. C. Woodland, "Unsupervised language model adaptation for Mandarin broadcast conversation transcription," in Proceedings of Interspeech, 2006.\n\n[5] T. Griffiths, M. Steyvers, D. Blei, and J. Tenenbaum, "Integrating topics and syntax," in Advances in Neural Information Processing Systems, 2004.\n\n[6] B. J. Hsu and J. Glass, "Style and topic language model adaptation using HMM-LDA," in Proceedings of Empirical Methods in Natural Language Processing (EMNLP), 2006.\n\n[7] Hanna M. Wallach, "Topic Modeling: Beyond Bag-of-Words," in International Conference on Machine Learning, 2006.\n\n[8] P. Xu, A. Emami, and F. Jelinek, "Training connectionist models for the structured language model," in Proceedings of Empirical Methods in Natural Language Processing (EMNLP), 2003.\n\n[9] R. Kneser and H. 
Ney, "Improved backing-off for M-gram language modeling," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1995, vol. 1, pp. 181-184.\n\n[10] R. Kneser, J. Peters, and D. Klakow, "Language model adaptation using dynamic marginals," in Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), 1997, pp. 1971-1974.\n\n[11] R. Iyer and M. Ostendorf, "Modeling long distance dependence in language: Topic mixtures versus dynamic cache models," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 1, pp. 30-39, Jan 1999.\n\n[12] X. Wang, A. McCallum, and X. Wei, "Topical N-grams: Phrase and topic discovery, with an application to information retrieval," in IEEE International Conference on Data Mining, 2007.\n\n[13] T. Minka, "The Dirichlet-tree distribution," 1999.\n\n[14] Y. C. Tam and T. Schultz, "Correlated latent semantic model for unsupervised language model adaptation," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2007.\n\n[15] A. Stolcke, "SRILM - an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing (ICSLP), 2002.", "award": [], "sourceid": 298, "authors": [{"given_name": "Yik-cheung", "family_name": "Tam", "institution": null}, {"given_name": "Tanja", "family_name": "Schultz", "institution": null}]}