{"title": "Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 469, "page_last": 477, "abstract": "Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonal-covariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5\\%, which is superior to all published results on speaker-independent TIMIT to date.", "full_text": "Phone Recognition with the Mean-Covariance\n\nRestricted Boltzmann Machine\n\nGeorge E. Dahl, Marc\u2019Aurelio Ranzato, Abdel-rahman Mohamed, and Geoffrey Hinton\n\n{gdahl, ranzato, asamir, hinton}@cs.toronto.edu\n\nDepartment of Computer Science\n\nUniversity of Toronto\n\nAbstract\n\nStraightforward application of Deep Belief Nets (DBNs) to acoustic modeling\nproduces a rich distributed representation of speech data that is useful for recogni-\ntion and yields impressive results on the speaker-independent TIMIT phone recog-\nnition task. 
However, the \ufb01rst-layer Gaussian-Bernoulli Restricted Boltzmann\nMachine (GRBM) has an important limitation, shared with mixtures of diagonal-\ncovariance Gaussians: GRBMs treat different components of the acoustic input\nvector as conditionally independent given the hidden state. The mean-covariance\nrestricted Boltzmann machine (mcRBM), \ufb01rst introduced for modeling natural im-\nages, is a much more representationally ef\ufb01cient and powerful way of modeling\nthe covariance structure of speech data. Every con\ufb01guration of the precision units\nof the mcRBM speci\ufb01es a different precision matrix for the conditional distribu-\ntion over the acoustic space. In this work, we use the mcRBM to learn features\nof speech data that serve as input into a standard DBN. The mcRBM features\ncombined with DBNs allow us to achieve a phone error rate of 20.5%, which is\nsuperior to all published results on speaker-independent TIMIT to date.\n\n1\n\nIntroduction\n\nAcoustic modeling is a fundamental problem in automatic continuous speech recognition. Most\nstate of the art speech recognition systems perform acoustic modeling using the following approach\n[1]. The acoustic signal is represented as a sequence of feature vectors; these feature vectors typ-\nically hold a log spectral estimate on a perceptually warped frequency scale and are augmented\nwith the \ufb01rst and second (at least) temporal derivatives of this spectral information, computed using\nsmoothed differences of neighboring frames. 
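The first- and second-derivative ("delta") augmentation described above is typically computed as a smoothed regression over neighboring frames. The following NumPy sketch is our own illustration of that idea, under assumed details (the window size of 2 and the HTK-style regression formula are assumptions, not specified in the text):

```python
import numpy as np

def delta_features(frames, window=2):
    """Regression-based delta features (assumed HTK-style formula).

    frames: (T, D) array of static features (e.g. log filterbank outputs).
    Returns a (T, D) array of first temporal derivatives computed from
    smoothed differences of neighboring frames.
    """
    T = frames.shape[0]
    # Pad by repeating edge frames so every frame has a full context window.
    padded = np.pad(frames, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(th * th for th in range(1, window + 1))
    deltas = np.zeros_like(frames, dtype=float)
    for th in range(1, window + 1):
        deltas += th * (padded[window + th : window + th + T]
                        - padded[window - th : window - th + T])
    return deltas / denom

# Second derivatives ("delta-deltas") are just deltas of the deltas:
# accel = delta_features(delta_features(frames))
```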
Hidden Markov models (HMMs), with Gaussian mixture models (GMMs) for the emission distributions, are used to model the probability of the acoustic vector sequence given the (tri)phone sequence in the utterance to be recognized.1 Typically, all of the individual Gaussians in the mixtures are restricted to have diagonal covariance matrices, and a large hidden Markov model is constructed from sub-HMMs for each triphone to help deal with the effects of context-dependent variations. However, to mitigate the obvious data-sparsity and efficiency problems context dependence creates, modern systems perform sophisticated parameter tying, clustering the HMM states with carefully constructed decision trees.\n\n1We will refer to HMMs with GMM emission distributions as CDHMMs for continuous-density HMMs.\n\nAlthough systems of this sort have yielded many useful results, diagonal covariance CDHMM models have several potential weaknesses as models of speech data. On the face of things at least, feature vectors for overlapping frames are treated as independent, and feature vectors must be augmented with derivative information in order to enable successful modeling with mixtures of diagonal-covariance Gaussians (see [2, 3] for a more in-depth discussion of the exact consequences of the delta features). However, perhaps even more disturbing than the frame-independence assumption are the compromises required to deal with two competing pressures in Gaussian mixture model training: the need for expressive models capable of representing the variability present in real speech data and the need to combat the resulting data sparsity and statistical efficiency issues. 
These pres-\nsures of course exist for other models as well, but the tendency of GMMs to partition the input space\ninto regions where only one component of the mixture dominates is a weakness that inhibits ef\ufb01cient\nuse of a very large number of tunable parameters. The common decision to use diagonal covariance\nGaussians for the mixture components is an example of such a compromise of expressiveness that\nsuggests that it might be worthwhile to explore models in which each parameter is constrained by a\nlarge fraction of the training data. By contrast, models that use the simultaneous activation of a large\nnumber of hidden features to generate an observed input can use many more of their parameters to\nmodel each training example and hence have many more training examples to constrain each param-\neter. As a result, models that use non-linear distributed representations are harder to \ufb01t to data, but\nthey have much more representational power for the same number of parameters.\nThe diagonal covariance approximation typically employed for GMM-based acoustic models is\nsymptomatic of, but distinct from, the general representational inef\ufb01ciencies that tend to crop up\nin mixture models with massive numbers of highly specialized, distinctly parameterized mixture\ncomponents. Restricting mixture components to have diagonal covariance matrices introduces a\nconditional independence assumption between dimensions within a single frame. The delta-feature\naugmentation mitigates the severity of the approximation and thus makes outperforming diagonal\ncovariance Gaussian mixture models dif\ufb01cult. However, a variety of precision matrix modeling\ntechniques have emerged in the speech recognition literature. 
For example, [4] describes a basis\nsuperposition framework that includes many of these techniques.\nAlthough the recent work in [5] on using deep belief nets (DBNs) for phone recognition begins to\nattack the representational ef\ufb01ciency issues of GMMs, Gaussian-Bernoulli Restricted Boltzmann\nMachines (GRBMs) are used to deal with the real-valued input representation (in this case, mel-\nfrequency cepstral coef\ufb01cients). GRBMs model different dimensions of their input as conditionally\nindependent given the hidden unit activations, a weakness akin to restricting Gaussians in a GMM\nto have diagonal covariance. This conditional independence assumption is inappropriate for speech\ndata encoded as a sequence of overlapping frames of spectral information, especially when many\nframes are concatenated to form the input vector. Such data can exhibit local smoothness in both\nfrequency and time punctuated by bursts of energy that violate these local smoothness properties.\nPerforming a standard augmentation of the input with temporal derivative information, as [5] did,\nwill of course make it easier for GRBMs to deal with such data, but ideally one would use a model\ncapable of succinctly modeling these effects on its own.\nInspired by recent successes in modeling natural images, the primary contribution of this work is to\nbring the mean-covariance restricted Boltzmann machine (mcRBM) of [6] to bear on the problem\nof extracting useful features for phone recognition and to incorporate these features into a deep\narchitecture similar to one described in [5]. We demonstrate the ef\ufb01cacy of our approach by reporting\nresults on the speaker-independent TIMIT phone recognition task. 
TIMIT, as argued in [7], is an ideal dataset for testing new ideas in speech recognition before trying to scale them up to large vocabulary tasks because it is phonetically rich, has well-labeled transcriptions, and is small enough not to pose substantial computational challenges at test time. Our best system achieves a phone error rate on the TIMIT corpus of 20.5%, which is superior to all published results on speaker-independent TIMIT to date. We obtain these results without augmenting the input with temporal difference features, since a sensible model of speech data should be able to learn to extract its own useful features, making explicit inclusion of difference features unnecessary.\n\n2 Using Deep Belief Nets for Phone Recognition\n\nFollowing the approach of [5], we use deep belief networks (DBNs), trained via the unsupervised pretraining algorithm described in [8] combined with supervised fine-tuning using backpropagation, to model the posterior distribution over HMM states given a local window of the acoustic input. We construct training cases for the DBN by taking n adjacent frames of acoustic input and pairing them with the identity of the HMM state for the central frame. We obtain the labels from a forced alignment with a CDHMM baseline. During the supervised phase of learning, we optimize the cross-entropy loss for the individual HMM-state predictions, as a more convenient proxy for the number of mistakes (insertions, deletions, substitutions) in the phone sequence our system produces, which is what we are actually interested in. In order to compare with the results of [5], at test time, we use the posterior probability distribution over HMM states that the DBN produces in place of GMM likelihoods in an otherwise standard Viterbi decoder. 
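The construction of training cases just described, n adjacent frames paired with the forced-alignment HMM state of the central frame, can be sketched as follows (the function name and array layout are illustrative assumptions of ours, not the authors' code):

```python
import numpy as np

def make_training_cases(features, state_labels, n=15):
    """Pair each window of n adjacent frames with the HMM state of its
    central frame (the labels come from a forced alignment).

    features: (T, D) array of per-frame acoustic features.
    state_labels: length-T sequence of HMM-state ids.
    Returns (inputs, targets): inputs has shape (T - n + 1, n * D).
    """
    assert n % 2 == 1, "odd window so a central frame exists"
    T, D = features.shape
    half = n // 2
    inputs = np.stack([features[t - half : t + half + 1].ravel()
                       for t in range(half, T - half)])
    targets = np.asarray(state_labels[half : T - half])
    return inputs, targets
```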
Since the HMM de\ufb01nes a prior over states, it\nis better to divide the posterior probabilities of the DBN by the frequencies of the 183 labels in the\ntraining data [9], but in our experiments this did not noticeably change the results.\n\n3 The Mean-Covariance Restricted Boltzmann Machine\n\nThe previous work of [5] used a GRBM for the initial DBN layer. The GRBM associates each\ncon\ufb01guration of the visible units, v, and hidden units, h, with a probability density according to\n\nP (v, h) \u221d e\u2212E(v,h),\n\n(1)\n\n(2)\n\nwhere E(v, h) is given by\n\nE(v, h) =\n\n1\n2\n\n(v \u2212 b)T(v \u2212 b) \u2212 cTh \u2212 vTWh,\n\nand where W is the matrix of visible/hidden connection weights, b is a visible unit bias, and c is\na hidden unit bias. Equation 2 implicitly assumes that the visible units have a diagonal covariance\nGaussian noise model with a variance of 1 on each dimension.\nAnother option for learning to extract binary features from real-valued data that has enjoyed success\nin vision applications is the mean-covariance RBM (mcRBM), \ufb01rst introduced in [10] and [6]. The\nmcRBM has two groups of hidden units: mean units and precision units. Without the precision\nunits, the mcRBM would be identical to a GRBM. With only the precision units, we have what\nwe will call the \u201ccRBM\u201d, following the terminology in [6]. The precision units are designed to\nenforce smoothness constraints in the data, but when one of these constraints is seriously violated,\nit is removed by turning off the precision unit. The set of active precision units therefore speci\ufb01es\na sample-speci\ufb01c covariance matrix. In order for a visible vector to be assigned high probability\nby the precision units, it must only fail to satisfy a small number of the precision unit constraints,\nalthough each of these constraints could be egregiously violated.\nThe cRBM can be viewed as a particular type of factored third order Boltzmann machine. 
In other words, the RBM energy function is modified to have multiplicative interactions between triples of two visible units, vi and vj, and one hidden unit hk. Unrestricted 3-way connectivity causes a cubic growth in the number of parameters that is unacceptable if we wish to scale this sort of model to high dimensional data. Factoring the weights into a sum of 3-way outer products can reduce the growth rate of the number of parameters in the model to one that is comparable to a normal RBM. After factoring, we may write the cRBM energy function2 (with visible biases omitted) as:\n\nE(v, h) = \u2212dTh \u2212 (vTR)2Ph,    (3)\n\nwhere R is the visible-factor weight matrix, d denotes the hidden unit bias vector, and P is the factor-hidden, or \u201cpooling\u201d matrix. The squaring in equation 3 (and in other equations with this term) is performed elementwise. We force P to only have non-positive entries. We must constrain P in this way to avoid a model that assigns larger and larger probabilities (more negative energies) to larger and larger inputs.\nThe hidden units of the cRBM are still (just as in GRBMs) conditionally independent given the states of the visible units, so inference remains simple. However, the visible units are coupled in a Markov Random Field determined by the settings of the hidden units. The interaction weight between two arbitrary visible units vi and vj, which we shall denote \u02dcwi,j, depends on the states of all the hidden units according to:\n\n\u02dcwi,j = \u2211k \u2211f hk rif rjf pkf.\n\nThe conditional distribution of the hidden units (derived from equation 3) given the visible unit states v is:\n\nP(h|v) = \u03c3(d + ((vTR)2P)T),\n\n2In order to normalize the distribution implied by this energy function, we must restrict the visible units to a region of the input space that has finite extent. 
However, once we add the mean RBM this normalization issue vanishes.\n\nwhere \u03c3 denotes the elementwise logistic sigmoid, \u03c3(x) = (1+e\u2212x)\u22121. The conditional distribution of the visible units given the hidden unit states for the cRBM is given by:\n\nP(v|h) \u223c N(0, [R diag(\u2212PTh) RT]\u22121).    (4)\n\nThe cRBM always assigns highest probability to the all-zero visible vector. In order to allow the model to shift the mean, we add an additional set of binary hidden units whose vector of states we shall denote m. The product of the distributions defined by the cRBM and the GRBM forms the mcRBM. If EC(v, h) denotes the cRBM energy function (equation 3) and EM(v, m) denotes the GRBM energy function (equation 2), then the mcRBM energy function is:\n\nEMC(v, h, m) = EC(v, h) + EM(v, m).    (5)\n\nThe gradient of the EM term moves the minimum of EMC away from the zero vector, but how far it moves depends on the curvature of the precision matrix defined by EC. 
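As a concrete illustration of the cRBM conditionals above, here is a toy NumPy sketch of our own (the sizes D, F, K and the orientation of P as a factor-by-precision-unit matrix are assumptions); it shows how each configuration of the precision units h induces a different precision matrix for the conditional over the visibles:

```python
import numpy as np

rng = np.random.default_rng(0)
D, F, K = 8, 12, 6                        # toy sizes: visible dims, factors, precision units
R = rng.standard_normal((D, F))           # visible-to-factor weights
P = -np.abs(rng.standard_normal((F, K)))  # pooling matrix, constrained non-positive
d = np.zeros(K)                           # precision-unit biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def precision_hidden_probs(v):
    # P(h|v) = sigma(d + ((v^T R)^2) P), with the squaring done elementwise.
    return sigmoid(d + ((v @ R) ** 2) @ P)

def conditional_precision(h):
    # Precision matrix of P(v|h): R diag(-P h) R^T. Non-positive P and
    # binary h make the diagonal non-negative, so this is PSD.
    return R @ np.diag(-(P @ h)) @ R.T

v = rng.standard_normal(D)
p = precision_hidden_probs(v)   # each precision unit is Bernoulli given v
h = np.ones(K)                  # e.g. all smoothness constraints active
Prec = conditional_precision(h) # a different precision matrix per configuration of h
```

Turning a precision unit off (setting an entry of h to zero) removes its row of constraints from the precision matrix, which is how the model tolerates a few egregiously violated constraints.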
The resulting conditional distribution over the visible units, given the two sets of hidden units, is:\n\nP(v|h, m) \u223c N(\u03a3Wm, \u03a3), where \u03a3 = (R diag(\u2212PTh) RT)\u22121.\n\nThus the mcRBM can produce conditional distributions over the visible units, given the hidden units, that have non-zero means, unlike the cRBM.\nJust like other RBMs, the mcRBM can be trained using the following update rule, for some generic model parameter \u03b8:\n\n\u2206\u03b8 \u221d \u27e8\u2212\u2202E/\u2202\u03b8\u27e9data + \u27e8\u2202E/\u2202\u03b8\u27e9reconstruction.\n\nHowever, since the matrix inversion required to sample from P(v|h, m) can be expensive, we integrate out the hidden units and use Hybrid Monte Carlo (HMC) [11] on the mcRBM free energy to obtain the reconstructions.\nIt is important to emphasize that the mcRBM model of covariance structure is much more powerful than merely learning a covariance matrix in a GRBM. Learning the covariance matrix for a GRBM is equivalent to learning a single global linear transformation of the data, whereas the precision units of an mcRBM are capable of specifying exponentially many different covariance matrices and explaining different visible vectors with different distributions over these matrices.\n\n3.1 Practical details\n\nIn order to facilitate stable training, we make the precision unit term in the energy function insensitive to the scale of the input data by normalizing by the length of v. This makes the conditional P(v|h) clearly non-Gaussian. We constrain the columns of P to have unit L1 norm and to be sparse. We enforce one-dimensional locality and sparsity in P by setting entries beyond a distance of one from the main diagonal to zero after every update. Additionally, we constrain the columns of R to all have equal L2 norms and learn a single global scaling factor shared across all the factors. 
The non-positivity constraint on the entries of P is maintained by zeroing out, after each update, any entries that become positive.\n\n4 Deep Belief Nets\n\nLearning is difficult in densely connected, directed belief nets that have many hidden layers because it is difficult to infer the posterior distribution over the hidden variables, when given a data vector, due to the phenomenon of explaining away. Markov chain Monte Carlo methods [12] can be used to sample from the posterior, but they are typically very time-consuming. In [8] complementary priors were used to eliminate the explaining away effects, producing a training procedure which is equivalent to training a stack of restricted Boltzmann machines.\nThe stacking procedure works as follows. Once an RBM has been trained on data, we can infer the hidden unit activation probabilities given a data vector and re-represent the data vector as the vector of corresponding hidden activations. Since the RBM has been trained to reconstruct the data well, the hidden unit activations will retain much of the information present in the data and pick up (possibly higher-order) correlations between different data dimensions that exist in the training set. Once we have used one RBM as a feature extractor we can, if desired, train an additional RBM that treats the hidden activations of the first RBM as data to model. After training a sequence of RBMs, we can compose them to form a generative model whose top two layers are the final RBM in the stack and whose lower layers all have downward-directed connections that implement the p(hk\u22121|hk) learned by the kth RBM, where h0 = v.\n\nFigure 1: An mcRBM with two RBMs stacked on top\n\nThe weights obtained by the greedy layer-by-layer training procedure described for stacking RBMs, above, can be used to initialize the weights of a deep feed-forward neural network. 
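The greedy stacking procedure can be sketched with a minimal one-step contrastive divergence (CD-1) trainer for Bernoulli-Bernoulli RBMs. This is a simplified sketch under our own assumptions (no momentum, weight decay, or mini-batch schedule, all of which a practical system would add):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hidden, epochs=5, lr=0.01):
    """One-step contrastive divergence for a Bernoulli-Bernoulli RBM (sketch)."""
    n_vis = data.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    b, c = np.zeros(n_vis), np.zeros(n_hidden)
    for _ in range(epochs):
        h_prob = sigmoid(data @ W + c)                    # positive phase
        h_samp = (rng.random(h_prob.shape) < h_prob) * 1.0
        v_recon = sigmoid(h_samp @ W.T + b)               # reconstruction
        h_recon = sigmoid(v_recon @ W + c)                # negative phase
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
        b += lr * (data - v_recon).mean(axis=0)
        c += lr * (h_prob - h_recon).mean(axis=0)
    return W, b, c

def pretrain_stack(data, layer_sizes):
    """Greedy layer-wise pretraining: each RBM models the hidden
    activations of the RBM below it (h0 = v)."""
    weights, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm_cd1(layer_input, n_hidden)
        weights.append((W, c))  # (W, c) initialize the feed-forward layers
        layer_input = sigmoid(layer_input @ W + c)
    return weights
```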
Once we add an output layer to the pre-trained neural network, we can discriminatively fine-tune the weights of this neural net using any variant of backpropagation [13] we wish. Although options for fine-tuning exist other than backpropagation, such as the up-down algorithm used in [8], we restrict ourselves to backpropagation (updating the weights every 128 training cases) in this work for simplicity and because it is sufficient for obtaining excellent results.\nFigure 1 is a diagram of two RBMs stacked on top of an mcRBM. Note that the RBM immediately above the mcRBM uses both the mean unit activities and the precision unit activities together as visible data. Later, during backpropagation, after we have added the softmax output unit, we do not backpropagate through the mcRBM weights, so the mcRBM is a purely unsupervised feature extractor.\n\n5 Experimental Setup\n\n5.1 The TIMIT Dataset\n\nWe used the TIMIT corpus3 for all of our phone recognition experiments.\n\n3http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1.\n\nWe used the 462-speaker training set and removed all SA records (i.e., identical sentences for all speakers in the database), since they could potentially bias our results. A development set of 50 speakers was used for hand-tuning hyperparameters and automated decoder tuning. As is standard practice, results are reported using the 24-speaker core test set. We produced the training labels with a forced alignment of an HMM baseline. Since there are three HMM states per phone and 61 phones, all DBN architectures had a 183-way softmax output unit. Once the training labels have been created, the HMM baseline
We removed starting and ending silences before scoring in order\nto be as similar to [5] as possible. However, to produce a more informative comparison between our\nresults and results in the literature that do not remove starting and ending silences, we also present\nthe phone error rate of our best model using the more common scoring strategy. During decoding,\nwe used a simple bigram language model over phones. Our results would certainly improve with\na trigram language model. In order to be able to make useful comparisons between different DBN\narchitectures (and achieve the best results), we optimized the Viterbi decoder parameters (the word\ninsertion probability and the language model scale factor) on the development set and then used the\nbest performing setting to compute the phone error rate (PER) for the core test set.\n\n5.2 Preprocessing\n\nSince we have completely abandoned Gaussian mixture model emission distributions, we are no\nlonger forced to use temporal derivative features. For all experiments the acoustic signal was ana-\nlyzed using a 25-ms Hamming window with 10-ms between the left edges of successive frames. We\nuse the output from a mel scale \ufb01lterbank, extracting 39 \ufb01lterbank output log magnitudes and one\nlog energy per frame. Once groups of 15 frames have been concatenated, we perform PCA whiten-\ning and preserve the 384 most important principal components. Since we perform PCA whitening\nanyway, the discrete cosine transform used to compute mel frequency cepstral coef\ufb01cients (MFCCs)\nfrom the \ufb01lterbank output is not useful. Determining the number of frames of acoustic context to\ngive to the DBN is an important preprocessing decision; preliminary experiments revealed that mov-\ning to 15 frames of acoustic data, from the 11 used in [5], could provide improvements in PER when\ntraining a DBN on features from a mcRBM. It is possible that even larger acoustic contexts might\nbe bene\ufb01cial as well. 
Also, since the mcRBM is trained as a generative model, doubling the input\ndimensionality by using a 5-ms advance per frame is unlikely to cause serious over\ufb01tting and might\nwell improve performance.\n\n5.3 Computational Setup\n\nTraining DBNs of the sizes used in this paper can be computationally expensive. We accelerated\ntraining by exploiting graphics processors, in particular GPUs in a NVIDIA Tesla S1070 system,\nusing the wonderful library described in [15]. The wall time per epoch varied with the architecture.\nAn epoch of training of an mcRBM that had 1536 hidden units (1024 precision units and 512 mean\nunits) took 20 minutes. When each DBN layer had 2048 hidden units, each epoch of pre-training for\nthe \ufb01rst DBN layer took about three minutes and each epoch of pretraining for the \ufb01fth layer took\nseven to eight minutes, since we propagated through each earlier layer. Each epoch of \ufb01ne-tuning\nfor such a \ufb01ve-DBN-layer architecture took 12 minutes. We used 100 epochs to train the mcRBM,\n50 epochs to train each RBM in the stack and 14 epochs of discriminative \ufb01ne-tuning of the whole\nnetwork for a total of nearly 60 hours, about 34 of which were spent training the mcRBM.\n\n6 Experiments\n\nSince one goal of this work is to improve performance on TIMIT by using deep learning architec-\ntures, we explored varying the number of DBN layers in our architecture. In agreement with [5], we\nfound that in order to obtain the best results with DBNs on TIMIT, multiple layers were essential.\nFigure 2 plots phone error rate on both the development set and the core test set against the number\nof hidden layers in a mcRBM-DBN (we don\u2019t count the mcRBM as a hidden layer since we do not\nbackpropagate through it). The particular mcRBM-DBN shown had 1536 hidden units in each DBN\nhidden layer, 1024 precision units in the mcRBM, and 512 mean units in the mcRBM. 
As the number of DBN hidden layers increased, error on the development and test sets decreased and eventually leveled off. The improvements that deeper models can provide over shallower models were evident from results reported in [5]; the results for the mcRBM-DBN in this work are even more dramatic. In fact, an mcRBM-DBN with 8 hidden layers is what exhibits the best development set error, 20.17%, in these experiments. The same model gets 21.7% on the core test set (20.5% if starting and ending silences are included in scoring). Furthermore, at least 5 DBN hidden layers seem to be necessary to break a test set PER of 22%. Models of this depth (note also that an mcRBM-DBN with 8 DBN hidden layers is really a 9 layer model) have rarely been employed in the deep learning literature (cf. [8, 16], for example).\n\nFigure 2: Effect of increasing model depth\n\nTable 1: The effect of DBN layer size on Phone Error Rate for 5 layer mcRBM-DBN models\n\nModel        devset  testset\n512 units    21.4%   22.8%\n1024 units   20.9%   22.3%\n1536 units   20.4%   21.9%\n2048 units   20.4%   21.8%\n\nTable 1 demonstrates that once the hidden layers are sufficiently large, continuing to increase the size of the hidden layers did not seem to provide additional improvements. In general, we did not find our results to be very sensitive to the exact number of hidden units in each layer, as long as the hidden layers were relatively large.\nTo isolate the advantage of using an mcRBM instead of a GRBM, we need a clear comparison that is not confounded by the differences in preprocessing between our work and [5]. Table 2 provides such a comparison and confirms that the mcRBM feature extraction causes a noticeable improvement in PER. The architectures in table 2 use 1536-hidden-unit DBN layers.\n\nTable 2: mcRBM-DBN vs GRBM-DBN Phone Error Rate\n\nModel                 devset PER  testset PER\n5 layer GRBM-DBN      22.3%       23.7%\nmcRBM + 4 layer DBN   20.6%       22.3%\n\nTable 3 compares previously published results on the speaker-independent TIMIT phone recognition task to the best mcRBM-DBN architecture we investigated. Results marked with a * remove starting and ending silences at test time before scoring.\n\nTable 3: Reported (speaker independent) results on TIMIT core test set\n\nMethod                                                           PER\nStochastic Segmental Models [17]                                 36%\nConditional Random Field [18]                                    34.8%\nLarge-Margin GMM [19]                                            33%\nCD-HMM [20]                                                      27.3%\nAugmented conditional Random Fields [20]                         26.6%\nRecurrent Neural Nets [21]                                       26.1%\nBayesian Triphone HMM [22]                                       25.6%\nMonophone HTMs [23]                                              24.8%\nHeterogeneous Classifiers [24]                                   24.4%\nDeep Belief Networks (DBNs) [5]                                  23.0%*\nTriphone HMMs discriminatively trained w/ BMMI [7]               22.7%\nDeep Belief Networks with mcRBM feature extraction (this work)   21.7%*\nDeep Belief Networks with mcRBM feature extraction (this work)   20.5%\n\nOne should note that the work of [7] used triphone HMMs and a trigram language model whereas in this work we used only a bigram language model and monophone HMMs, so table 3 probably underestimates the error reduction our system provides over the best published GMM-based approach.\n\n7 Conclusions and Future Work\n\nWe have presented a new deep architecture for phone recognition that combines an mcRBM feature extraction module with a standard DBN. Our approach attacks both the representational inefficiency issues of GMMs and an important limitation of previous work applying DBNs to phone recognition. The incorporation of features extracted by an mcRBM into an approach similar to that of [5] produces results on speaker-independent TIMIT superior to those that have been reported to date. However, DBN-based acoustic modeling approaches are still in their infancy and many important research questions remain. 
During the fine-tuning, one could imagine backpropagating through the decoder itself and optimizing an objective function more closely related to the phone error rate. Since the pretraining procedure can make use of large quantities of completely unlabeled data, leveraging untranscribed speech data on a large scale might allow our approach to be even more robust to inter-speaker acoustic variations and would certainly be an interesting avenue of future work.\n\nReferences\n\n[1] S. Young, \u201cStatistical modeling in continuous speech recognition (CSR),\u201d in UAI \u201901: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, San Francisco, CA, USA, 2001, pp. 562\u2013571, Morgan Kaufmann Publishers Inc.\n[2] C. K. I. Williams, \u201cHow to pretend that correlated variables are independent by using difference observations,\u201d Neural Comput., vol. 17, no. 1, pp. 1\u20136, 2005.\n[3] J. S. Bridle, \u201cTowards better understanding of the model implied by the use of dynamic features in HMMs,\u201d in Proceedings of the International Conference on Spoken Language Processing, 2004, pp. 725\u2013728.\n[4] K. C. Sim and M. J. F. Gales, \u201cMinimum phone error training of precision matrix models,\u201d IEEE Transactions on Audio, Speech & Language Processing, vol. 14, no. 3, pp. 882\u2013889, 2006.\n[5] A. Mohamed, G. E. Dahl, and G. E. Hinton, \u201cDeep belief networks for phone recognition,\u201d in NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, 2009.\n[6] M. Ranzato and G. Hinton, \u201cModeling pixel means and covariances using factorized third-order Boltzmann machines,\u201d in Proc. of Computer Vision and Pattern Recognition Conference (CVPR 2010), 2010.\n[7] T. N. Sainath, B. Ramabhadran, and M. Picheny, \u201cAn exploration of large vocabulary tools for small vocabulary phonetic recognition,\u201d in IEEE Automatic Speech Recognition and Understanding Workshop, 2009.\n[8] G. E. Hinton, S. Osindero, and Y. Teh, \u201cA fast learning algorithm for deep belief nets,\u201d Neural Computation, vol. 18, pp. 1527\u20131554, 2006.\n[9] N. Morgan and H. Bourlard, \u201cContinuous speech recognition,\u201d Signal Processing Magazine, IEEE, vol. 12, no. 3, pp. 24\u201342, May 1995.\n[10] M. Ranzato, A. Krizhevsky, and G. Hinton, \u201cFactored 3-way restricted Boltzmann machines for modeling natural images,\u201d in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2010, vol. 13.\n[11] R. M. Neal, Bayesian Learning for Neural Networks, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1996.\n[12] R. M. Neal, \u201cConnectionist learning of belief networks,\u201d Artificial Intelligence, vol. 56, no. 1, pp. 71\u2013113, 1992.\n[13] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, \u201cLearning representations by back-propagating errors,\u201d Nature, vol. 323, no. 6088, pp. 533\u2013536, 1986.\n[14] K. F. Lee and H. W. Hon, \u201cSpeaker-independent phone recognition using hidden Markov models,\u201d IEEE Transactions on Audio, Speech & Language Processing, vol. 37, no. 11, pp. 1641\u20131648, 1989.\n[15] V. Mnih, \u201cCudamat: a CUDA-based matrix class for Python,\u201d Tech. Rep. UTML TR 2009-004, Department of Computer Science, University of Toronto, November 2009.\n[16] V. Nair and G. E. Hinton, \u201c3-D object recognition with deep belief nets,\u201d in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, Eds., 2009, pp. 1339\u20131347.\n[17] V. V. Digalakis, M. Ostendorf, and J. R. Rohlicek, \u201cFast algorithms for phone classification and recognition using segment-based models,\u201d IEEE Transactions on Signal Processing, vol. 40, pp. 2885\u20132896, 1992.\n[18] J. Morris and E. Fosler-Lussier, \u201cCombining phonetic attributes using conditional random fields,\u201d in Proc. Interspeech, 2006, pp. 597\u2013600.\n[19] F. Sha and L. Saul, \u201cLarge margin Gaussian mixture modeling for phonetic classification and recognition,\u201d in Proc. ICASSP, 2006, pp. 265\u2013268.\n[20] Y. Hifny and S. Renals, \u201cSpeech recognition using augmented conditional random fields,\u201d IEEE Transactions on Audio, Speech & Language Processing, vol. 17, no. 2, pp. 354\u2013365, 2009.\n[21] A. Robinson, \u201cAn application of recurrent nets to phone probability estimation,\u201d IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298\u2013305, 1994.\n[22] J. Ming and F. J. Smith, \u201cImproved phone recognition using Bayesian triphone models,\u201d in Proc. ICASSP, 1998, pp. 409\u2013412.\n[23] L. Deng and D. Yu, \u201cUse of differential cepstra as acoustic features in hidden trajectory modelling for phonetic recognition,\u201d in Proc. ICASSP, 2007, pp. 445\u2013448.\n[24] A. Halberstadt and J. Glass, \u201cHeterogeneous measurements and multiple classifiers for speech recognition,\u201d in Proc. ICSLP, 1998.\n", "award": [], "sourceid": 160, "authors": [{"given_name": "George", "family_name": "Dahl", "institution": null}, {"given_name": "Marc'aurelio", "family_name": "Ranzato", "institution": null}, {"given_name": "Abdel-rahman", "family_name": "Mohamed", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}