{"title": "Conditional Random Field Autoencoders for Unsupervised Structured Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 3311, "page_last": 3319, "abstract": "We introduce a framework for unsupervised learning of structured predictors with overlapping, global features. Each input's latent representation is predicted conditional on the observed data using a feature-rich conditional random field (CRF). Then a reconstruction of the input is (re)generated, conditional on the latent structure, using a generative model which factorizes similarly to the CRF. The autoencoder formulation enables efficient exact inference without resorting to unrealistic independence assumptions or restricting the kinds of features that can be used. We illustrate insightful connections to traditional autoencoders, posterior regularization and multi-view learning. Finally, we show competitive results with instantiations of the framework for two canonical tasks in natural language processing: part-of-speech induction and bitext word alignment, and show that training our model can be substantially more efficient than comparable feature-rich baselines.", "full_text": "Conditional Random Field Autoencoders\nfor Unsupervised Structured Prediction\n\nWaleed Ammar\n\nChris Dyer\n\nNoah A. Smith\n\nSchool of Computer Science\nCarnegie Mellon University\nPittsburgh, PA 15213, USA\n\n{wammar,cdyer,nasmith}@cs.cmu.edu\n\nAbstract\n\nWe introduce a framework for unsupervised learning of structured predictors with\noverlapping, global features. Each input\u2019s latent representation is predicted con-\nditional on the observed data using a feature-rich conditional random \ufb01eld (CRF).\nThen a reconstruction of the input is (re)generated, conditional on the latent struc-\nture, using a generative model which factorizes similarly to the CRF. 
The autoencoder formulation enables efficient exact inference without resorting to unrealistic independence assumptions or restricting the kinds of features that can be used. We illustrate connections to traditional autoencoders, posterior regularization, and multi-view learning. We then show competitive results with instantiations of the framework for two canonical tasks in natural language processing: part-of-speech induction and bitext word alignment, and show that training the proposed model can be substantially more efficient than a comparable feature-rich baseline.

1 Introduction

Conditional random fields [24] are used to model structure in numerous problem domains, including natural language processing (NLP), computational biology, and computer vision. They enable efficient inference while incorporating rich features that capture useful domain-specific insights. Despite their ubiquity in supervised settings, CRFs—and, crucially, the insights about effective feature sets obtained by developing them—play less of a role in unsupervised structure learning, a problem which traditionally requires jointly modeling observations and the latent structures of interest. For unsupervised structured prediction problems, less powerful models with stronger independence assumptions are standard.1 This state of affairs is suboptimal in at least three ways: (i) adhering to inconvenient independence assumptions when designing features is limiting—we contend that effective feature engineering is a crucial mechanism for incorporating inductive bias in unsupervised learning problems; (ii) features and their weights have different semantics in joint and conditional models (see §3.1); and (iii) modeling the generation of high-dimensional observable data with feature-rich models is computationally challenging, requiring expensive marginal inference in the inner loop of iterative parameter estimation
algorithms (see §3.1).

Our approach leverages the power and flexibility of CRFs in unsupervised learning without sacrificing their attractive computational properties or changing the semantics of well-understood feature sets. Our approach replaces the standard joint model of observed data and latent structure with a two-layer conditional random field autoencoder that first generates latent structure with a CRF (conditional on the observed data) and then (re)generates the observations conditional on just the predicted structure. For the reconstruction model, we use distributions which offer closed-form maximum

1For example, a first-order hidden Markov model requires that yi ⊥ xi+1 | yi+1 for a latent sequence y = ⟨y1, y2, . . .⟩ generating x = ⟨x1, x2, . . .⟩, while a first-order CRF allows yi to directly depend on xi+1.

\f\f\f

Extension: partial reconstruction. In our running POS example, the reconstruction model pθ(x̂i | yi) defines a distribution over words given tags. Because word distributions are heavy-tailed, estimating such a distribution reliably is quite challenging. Our solution is to define a function π : X → X̂ such that |X̂| ≪ |X|, and let x̂i = π(xi) be a deterministic transformation of the original structured observation. We can add indirect supervision by defining π such that it represents observed information relevant to the latent structure of interest. For example, we found reconstructing Brown clusters [5] of tokens instead of their surface forms to improve POS induction. Other possible reconstructions include word embeddings, and morphological and spelling features of words.

More general graphs. We presented the CRF autoencoder in terms of sequential Markovian assumptions for ease of exposition; however, this framework can be used to model arbitrary hidden structures.
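To make the sequential instantiation concrete, the following toy sketch computes the quantity at the heart of the model, the marginal reconstruction likelihood Σy pλ(y | x) pθ(x̂ | y), with two forward (dynamic-programming) passes. The sizes, random potentials, and feature-free scores are illustrative assumptions, not the paper's actual feature set:

```python
import numpy as np

# Toy sequential CRF autoencoder (illustrative: random potentials stand in
# for scores produced by hand-engineered features).
K, V, n = 3, 5, 4                       # latent labels, reconstruction vocab, length
rng = np.random.default_rng(0)
trans = rng.normal(size=(K, K))         # score of a "label bigram" <y_{i-1}, y_i>
node = rng.normal(size=(n, K))          # score of an <x, y_i> factor; such factors
                                        # may condition on the ENTIRE observation x
theta = rng.dirichlet(np.ones(V), K)    # categorical reconstruction p_theta(xhat | y)

def forward_logsum(extra):
    """log-sum over all label sequences of exp(CRF potentials + extra)."""
    a = node[0] + extra[0]
    for i in range(1, n):
        a = np.logaddexp.reduce(a[:, None] + trans, axis=0) + node[i] + extra[i]
    return np.logaddexp.reduce(a)

def log_marginal(xhat):
    """log p(xhat | x) = log sum_y p_lambda(y | x) * p_theta(xhat | y)."""
    log_Z = forward_logsum(np.zeros((n, K)))   # CRF partition function Z(x)
    log_recon = np.log(theta[:, xhat]).T       # (n, K): log p(xhat_i | y_i = k)
    return forward_logsum(log_recon) - log_Z

print(log_marginal([0, 2, 1, 4]))              # a finite negative log-probability
```

A convenient sanity check is that summing exp(log_marginal(x̂)) over all V^n possible reconstructions gives 1, since Σx̂ Σy pλ(y | x) pθ(x̂ | y) = Σy pλ(y | x) = 1.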
For example, instantiations of this model can be used for unsupervised learning of parse trees [21], semantic role labels [42], and coreference resolution [35] (in NLP), motif structures [1] in computational biology, and object recognition [46] in computer vision. The requirements for applying the CRF autoencoder model are:

• An encoding discriminative model defining pλ(y | x, φ). The encoder may be any model family where supervised learning from ⟨x, y⟩ pairs is efficient.

• A reconstruction model that defines pθ(x̂ | y, φ) such that inference over y given ⟨x, x̂⟩ is efficient.

• The independencies among y | x, x̂ are not strictly weaker than those among y | x.

2.1 Learning & Inference

Model parameters are selected to maximize the regularized conditional log likelihood of reconstructed observations x̂ given the structured observation x:

ℓ(λ, θ) = R1(λ) + R2(θ) + Σ(x,x̂)∈T log Σy pλ(y | x) × pθ(x̂ | y)    (2)

We apply block coordinate descent, alternating between maximizing with respect to the CRF parameters (λ-step) and the reconstruction parameters (θ-step). Each λ-step applies one or two iterations of a gradient-based convex optimizer.5 The θ-step applies one or two iterations of EM [10], with a closed-form solution in the M-step in each EM iteration. The independence assumptions among y make the marginal inference required in both steps straightforward; we omit details for space.

In the experiments below, we apply a squared L2 regularizer for the CRF parameters λ, and a symmetric Dirichlet prior for the categorical parameters θ.

The asymptotic runtime complexity of each block coordinate descent iteration, assuming the first-order Markov dependencies in Fig.
2 (right), is:

O(|θ| + |λ| + |T| × |x|max × |Y|max × (|Y|max × |Fyi−1,yi| + |Fx,yi|))    (3)

where Fyi−1,yi are the active “label bigram” features used in ⟨yi−1, yi⟩ factors, and Fx,yi are the active emission-like features used in ⟨x, yi⟩ factors. |x|max is the maximum length of an observation sequence. |Y|max is the maximum cardinality6 of the set of possible assignments of yi.

After learning the λ and θ parameters of the CRF autoencoder, test-time predictions are made using maximum a posteriori estimation, conditioning on both observations and reconstructions, i.e., ŷMAP = arg maxy pλ,θ(y | x, x̂).

3 Connections To Previous Work

This work relates to several strands of work in unsupervised learning. Two broad types of models have been explored that support unsupervised learning with flexible feature representations. Both are fully generative models that define joint distributions over x and y. We discuss these “undirected” and “directed” alternatives next, then turn to less closely related methods.

5We experimented with AdaGrad [12] and L-BFGS. When using AdaGrad, we accumulate the gradient vectors across block coordinate ascent iterations.

6In POS induction, |Y| is a constant, the number of syntactic classes, which we configure to 12 in our experiments. In word alignment, |Y| is the size of the source sentence plus one; therefore |Y|max is the maximum length of a source sentence in the bitext corpus.

3.1 Existing Alternatives for Unsupervised Learning with Features

Undirected models. A Markov random field (MRF) encodes the joint distribution through local potential functions parameterized using features. Such models “normalize globally,” requiring during training the calculation of a partition function summing over all possible inputs and outputs.
In our notation:

Z(λ) = Σx∈X* Σy∈Y^|x| exp λ⊤ḡ(x, y)    (4)

where ḡ collects all the local factorizations by cliques of the graph, for clarity. The key difficulty is in the summation over all possible observations. Approximations have been proposed, including contrastive estimation, which sums over subsets of X* [38, 43] (applied variously to POS learning by Haghighi and Klein [18] and word alignment by Dyer et al. [14]), and noise contrastive estimation [30].

Directed models. The directed alternative avoids the global partition function by factorizing the joint distribution in terms of locally normalized conditional probabilities, which are parameterized in terms of features. For unsupervised sequence labeling, the model was called a “feature HMM” by Berg-Kirkpatrick et al. [3]. The local emission probabilities p(xi | yi) in a first-order HMM for POS tagging are reparameterized as follows (again, using notation close to ours):

pλ(xi | yi) = exp λ⊤g(xi, yi) / Σx∈X exp λ⊤g(x, yi)    (5)

The features relating hidden to observed variables must be local within the factors implied by the directed graph. We show below that this locality restriction excludes features that are useful (§A.1).

Put in these terms, the proposed autoencoding model is a hybrid directed-undirected model.

Asymptotic Runtime Complexity of Inference. The models just described cannot condition on arbitrary amounts of x without increasing inference costs. Despite the strong independence assumptions of those models, the computational complexity of inference required for learning with CRF autoencoders is better (§2.1).

Consider learning the parameters of an undirected model by maximizing likelihood of the observed data.
Computing the gradient for a training instance x requires time

O(|λ| + |T| × |x| × |Y| × (|Y| × |Fyi−1,yi| + |X| × |Fxi,yi|)),

where Fxi,yi are the emission-like features used in an arbitrary assignment of xi and yi. When the multiplicative factor |X| is large, inference is slow compared to CRF autoencoders.

Inference in directed models is faster than in undirected models, but still slower than in CRF autoencoder models. In directed models [3], each iteration requires time

O(|λ| + |T| × |x| × |Y| × (|Y| × |Fyi−1,yi| + |Fxi,yi|) + |θ′| × max(|Fyi−1,yi|, |FX,yi|)),

where Fxi,yi are the active emission features used in an arbitrary assignment of xi and yi, FX,yi is the union of all emission features used with an arbitrary assignment of yi, and θ′ are the local emission and transition probabilities. When |X| is large, the last term |θ′| × max(|Fyi−1,yi|, |FX,yi|) can be prohibitively large.

3.2 Other Related Work

The proposed CRF autoencoder is more distantly related to several important ideas in less-than-supervised learning.

Autoencoders and other “predict self” methods. Our framework borrows its general structure, Fig. 2 (left), as well as its name, from neural network autoencoders. The goal of neural autoencoders has been to learn feature representations that improve generalization in otherwise supervised learning problems [44, 8, 39]. In contrast, the goal of CRF autoencoders is to learn specific interpretable regularities of interest.7 It is not clear how neural autoencoders could be used to learn the latent structures that CRF autoencoders learn without providing supervised training examples. Stoyanov et al.
[40] presented a related approach for discriminative graphical model learning, including features and latent variables, based on backpropagation, which could be used to instantiate the CRF autoencoder.

Daumé III [9] introduced a reduction of an unsupervised problem instance to a series of single-variable supervised classifications. The first series of these constructs a latent structure y given the entire x; the second series then reconstructs the input. The approach can make use of any supervised learner; if feature-based probabilistic models were used, an |X|-sized summation (akin to Eq. 5) would be required. On unsupervised POS induction, this approach performed on par with the undirected model of Smith and Eisner [38].

Minka [29] proposed cascading a generative model and a discriminative model, where class labels (to be predicted at test time) are marginalized out in the generative part first, and then (re)generated in the discriminative part. In CRF autoencoders, observations (available at test time) are conditioned on in the discriminative part first, and then (re)generated in the generative part.

Posterior regularization. Introduced by Ganchev et al. [16], posterior regularization is an effective method for specifying constraints on the posterior distributions of the latent variables of interest; a similar idea was proposed independently by Bellare et al. [2]. For example, in POS induction, every sentence might be expected to contain at least one verb. This is imposed as a soft constraint, i.e., a feature whose expected value under the model’s posterior is constrained. Such expectation constraints are specified directly by the domain-aware model designer.8 The approach was applied to unsupervised POS induction, word alignment, and parsing.
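As a toy numerical illustration of such an expectation constraint (using the at-least-one-verb example above; the posterior marginals below are made-up numbers, not the output of any model):

```python
import numpy as np

# Hypothetical posterior marginals q[i, k] = p(y_i = k | x) for a 3-token
# sentence and 3 tag classes; column VERB is the verb tag.
VERB = 2
q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6],
              [0.6, 0.3, 0.1]])

expected_verbs = q[:, VERB].sum()            # E_q[number of verb tokens]
penalty = max(0.0, 1.0 - expected_verbs)     # soft "at least one verb" constraint
print(round(expected_verbs, 3), round(penalty, 3))   # -> 0.8 0.2
```

The point of the sketch is that an expectation constraint of this kind is a linear function of posterior marginals, so it is cheap to evaluate whenever marginal inference is; in posterior regularization proper, such penalties enter the learning objective itself.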
Although posterior regularization was\napplied to directed feature-less generative models, the idea is orthogonal to the model family and\ncan be used to add more inductive bias for training CRF autoencoder models.\n\n4 Evaluation\n\nWe evaluate the effectiveness of CRF autoencoders for learning from unlabeled examples in POS\ninduction and word alignment. We defer the detailed experimental setup to Appendix A.\n\nPart-of-Speech Induction Results. Fig. 3 compares predictions of the CRF autoencoder model\nin seven languages to those of a featurized \ufb01rst-order HMM model [3] and a standard (feature-less)\n\ufb01rst-order HMM, using V-measure [37] (higher is better). First, note the large gap between both\nfeature-rich models on the one hand, and the feature-less HMM model on the other hand. Second,\nnote that CRF autoencoders outperform featurized HMMs in all languages, except Italian, with an\naverage relative improvement of 12%.\n\nThese results provide empirical evidence that feature engineering is an important source of inductive\nbias for unsupervised structured prediction problems.\nIn particular, we found that using Brown\ncluster reconstructions and specifying features which span multiple words signi\ufb01cantly improve the\nperformance. Refer to Appendix A for more analysis.\n\nBitext Word Alignment Results. First, we consider an intrinsic evaluation on a Czech-English\ndataset of manual alignments, measuring the alignment error rate (AER; [32]). We also perform an\n\n7This is possible in CRF autoencoders due to the interdependencies among variables in the hidden structure\nand the manually speci\ufb01ed feature templates which capture the relationship between observations and their\nhidden structures.\n\n8In a semi-supervised setting, when some labeled examples of the hidden structure are available, Druck\nand McCallum [11] used labeled examples to estimate desirable expected values. 
We leave semi-supervised applications of CRF autoencoders to future work; see also Suzuki and Isozaki [41].

[Figure 3 appears here: grouped bar chart of V-measure (y-axis, 0.0–0.6) for Arabic, Basque, Danish, Greek, Hungarian, Italian, Turkish, and Average; bars compare Standard HMM, Featurized HMM, and CRF autoencoder.]

Figure 3: V-measure [37] of induced parts of speech in seven languages. The CRF autoencoder with features spanning multiple words and with Brown cluster reconstructions achieves the best results in all languages but Italian, closely followed by the feature-rich HMM of Berg-Kirkpatrick et al. [3]. The standard multinomial HMM consistently ranks last.

direction   | fast align | model 4 | auto          pair  | fast align | model 4  | auto
forward     | 27.7       | 31.5    | 27.5          cs-en | 15.3±0.1   | 15.2±0.3 | 15.5±0.1
reverse     | 25.9       | 24.1    | 21.1          ur-en | 20.1±0.6   | 20.0±0.6 | 20.8±0.5
symmetric   | 25.2       | 22.2    | 19.5          zh-en | 56.9±1.6   | 56.7±1.6 | 56.1±1.7

Table 1: Left: AER results (%) for Czech-English word alignment; lower values are better. Right: Bleu translation quality scores (%) for Czech-English, Urdu-English, and Chinese-English; higher values are better.

extrinsic evaluation of translation quality in three language pairs, using case-insensitive Bleu [33] of a machine translation system (cdec9 [13]) built using the word alignment predictions of each model.

AER for variants of each model (forward, reverse, and symmetrized) are shown in Table 1 (left). Our model significantly outperforms both baselines. Bleu scores on the three language pairs are shown in Table 1 (right); alignments obtained with our CRF autoencoder model improve translation quality of the Czech-English and Urdu-English translation systems, but not of Chinese-English. This is un-
This is un-\nsurprising, given that Chinese orthography does not use letters, so that source-language spelling and\nmorphology features our model incorporates introduce only noise here. Better feature engineering,\nor more data, is called for.\n\nWe have argued that the feature-rich CRF autoencoder will scale better than its feature-rich alter-\nnatives. Fig. 5 (in Appendix A.2) shows the average per-sentence inference runtime for the CRF\nautoencoder compared to exact inference in an MRF [14] with a similar feature set, as a function of\nthe number of sentences in the corpus. For CRF autoencoders, the average inference runtime grows\nslightly due to the increased number of parameters, while it grows substantially with vocabulary size\nin MRF models [14].10\n\n5 Conclusion\n\nWe have presented a general and scalable framework to learn from unlabeled examples for structured\nprediction. The technique allows features with global scope in observed variables with favorable\nasymptotic inference runtime. We achieve this by embedding a CRF as the encoding model in the\n\n9http://www.cdec-decoder.org/\n10We only compare runtime, instead of alignment quality, because retraining the MRF model with exact\n\ninference was too expensive.\n\n7\n\n\finput layer of an autoencoder, and reconstructing a transformation of the input at the output layer\nusing simple categorical distributions. The key advantages of the proposed model are scalability and\nmodeling \ufb02exibility. We applied the model to POS induction and bitext word alignment, obtaining\nresults that are competitive with the state of the art on both tasks.\n\nAcknowledgments\n\nWe thank Brendan O\u2019Connor, Dani Yogatama, Jeffrey Flanigan, Manaal Faruqui, Nathan Schneider,\nPhil Blunsom and the anonymous reviewers for helpful suggestions. We also thank Taylor Berg-\nKirkpatrick for providing his implementation of the POS induction baseline, and Phil Blunsom for\nsharing POS induction evaluation scripts. 
This work was sponsored by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant number W911NF-10-1-0533. The statements made herein are solely the responsibility of the authors.

References

[1] T. L. Bailey and C. Elkan. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning, 1995.

[2] K. Bellare, G. Druck, and A. McCallum. Alternating projections for learning with expectation constraints. In Proc. of UAI, 2009.

[3] T. Berg-Kirkpatrick, A. Bouchard-Côté, J. DeNero, and D. Klein. Painless unsupervised learning with features. In Proc. of NAACL, 2010.

[4] P. Blunsom and T. Cohn. Discriminative word alignment with conditional random fields. In Proc. of ACL, 2006.

[5] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 1992.

[6] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 1993.

[7] S. Buchholz and E. Marsi. CoNLL-X shared task on multilingual dependency parsing. In CoNLL-X, 2006.

[8] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proc. of ICML, 2008.

[9] H. Daumé III. Unsupervised search-based structured prediction. In Proc. of ICML, 2009.

[10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38, 1977.

[11] G. Druck and A. McCallum. High-performance semi-supervised learning using discriminatively constrained generative models. In Proc. of ICML, 2010.

[12] J.
Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 2011.

[13] C. Dyer, A. Lopez, J. Ganitkevitch, J. Weese, F. Ture, P. Blunsom, H. Setiawan, V. Eidelman, and P. Resnik. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proc. of ACL, 2010.

[14] C. Dyer, J. Clark, A. Lavie, and N. A. Smith. Unsupervised word alignment with arbitrary features. In Proc. of ACL-HLT, 2011.

[15] C. Dyer, V. Chahuneau, and N. A. Smith. A simple, fast, and effective reparameterization of IBM Model 2. In Proc. of NAACL, 2013.

[16] K. Ganchev, J. Graça, J. Gillenwater, and B. Taskar. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11:2001–2049, 2010.

[17] Q. Gao and S. Vogel. Parallel implementations of word alignment tool. In Proc. of the ACL workshop, 2008.

[18] A. Haghighi and D. Klein. Prototype-driven learning for sequence models. In Proc. of NAACL-HLT, 2006.

[19] F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, 1997.

[20] M. Johnson. Why doesn’t EM find good HMM POS-taggers? In Proc. of EMNLP, 2007.

[21] D. Klein and C. D. Manning. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proc. of ACL, 2004.

[22] P. Koehn. Statistical Machine Translation. Cambridge University Press, 2010.

[23] P. Koehn, F. J. Och, and D. Marcu. Statistical phrase-based translation. In Proc. of NAACL, 2003.

[24] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML, 2001.

[25] P. Liang. Semi-supervised learning for natural language. Thesis, MIT, 2005.

[26] C.-C. Lin, W. Ammar, C. Dyer, and L. Levin.
The CMU submission for the shared task on language identification in code-switched data. In First Workshop on Computational Approaches to Code Switching at EMNLP, 2014.

[27] A. V. Lukashin and M. Borodovsky. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Research, 26(4):1107–1115, 1998.

[28] B. Merialdo. Tagging English text with a probabilistic model. Computational Linguistics, 1994.

[29] T. Minka. Discriminative models, not discriminative training. Technical Report MSR-TR-2005-144, Microsoft Research, 2005.

[30] A. Mnih and Y. W. Teh. A fast and simple algorithm for training neural probabilistic language models. In Proc. of ICML, 2012.

[31] J. Nivre, J. Hall, S. Kubler, R. McDonald, J. Nilsson, S. Riedel, and D. Yuret. The CoNLL 2007 shared task on dependency parsing. In Proc. of CoNLL, 2007.

[32] F. Och and H. Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 2003.

[33] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proc. of ACL, 2002.

[34] S. Petrov, D. Das, and R. McDonald. A universal part-of-speech tagset. In Proc. of LREC, 2012.

[35] H. Poon and P. Domingos. Joint unsupervised coreference resolution with Markov logic. In Proc. of EMNLP, 2008.

[36] S. Reddy and S. Waxmonsky. Substring-based transliteration with conditional random fields. In Proc. of the Named Entities Workshop, 2009.

[37] A. Rosenberg and J. Hirschberg. V-measure: A conditional entropy-based external cluster evaluation measure. In Proc. of EMNLP-CoNLL, 2007.

[38] N. A. Smith and J. Eisner. Contrastive estimation: Training log-linear models on unlabeled data. In Proc. of ACL, 2005.

[39] R. Socher, C. D. Manning, and A. Y. Ng. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In NIPS workshop, 2010.

[40] V. Stoyanov, A.
Ropson, and J. Eisner. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In Proc. of AISTATS, 2011.

[41] J. Suzuki and H. Isozaki. Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In Proc. of ACL, 2008.

[42] R. Swier and S. Stevenson. Unsupervised semantic role labelling. In Proc. of EMNLP, 2004.

[43] D. Vickrey, C. C. Lin, and D. Koller. Non-local contrastive objectives. In Proc. of ICML, 2010.

[44] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proc. of ICML, 2008.

[45] S. Vogel, H. Ney, and C. Tillmann. HMM-based word alignment in statistical translation. In Proc. of COLING, 1996.

[46] M. Weber, M. Welling, and P. Perona. Unsupervised learning of models for recognition. 2000.

[47] J. Yamato, J. Ohya, and K. Ishii. Recognizing human action in time-sequential images using hidden Markov model. In Proc. of CVPR, 1992.
", "award": [], "sourceid": 1702, "authors": [{"given_name": "Waleed", "family_name": "Ammar", "institution": "CMU"}, {"given_name": "Chris", "family_name": "Dyer", "institution": "Carnegie Mellon University"}, {"given_name": "Noah", "family_name": "Smith", "institution": "Carnegie Mellon University"}]}