{"title": "Multi-View Learning of Word Embeddings via CCA", "book": "Advances in Neural Information Processing Systems", "page_first": 199, "page_last": 207, "abstract": "Recently, there has been substantial interest in using large amounts of unlabeled data to learn word representations which can then be used as features in supervised classifiers for NLP tasks. However, most current approaches are slow to train, do not model context of the word, and lack theoretical grounding. In this paper, we present a new learning method, Low Rank Multi-View Learning (LR-MVL) which uses a fast spectral method to estimate low dimensional context-specific word representations from unlabeled data. These representation features can then be used with any supervised learner. LR-MVL is extremely fast, gives guaranteed convergence to a global optimum, is theoretically elegant, and achieves state-of-the-art performance on named entity recognition (NER) and chunking problems.", "full_text": "Multi-View Learning of Word Embeddings via CCA\n\nParamveer S. Dhillon\n\nComputer & Information Science\n\nDean Foster\n\nStatistics\n\n{dhillon|ungar}@cis.upenn.edu, foster@wharton.upenn.edu\n\nUniversity of Pennsylvania, Philadelphia, PA, U.S.A\n\nLyle Ungar\n\nComputer & Information Science\n\nAbstract\n\nRecently, there has been substantial interest in using large amounts of unlabeled\ndata to learn word representations which can then be used as features in supervised\nclassi\ufb01ers for NLP tasks. However, most current approaches are slow to train, do\nnot model the context of the word, and lack theoretical grounding. In this paper,\nwe present a new learning method, Low Rank Multi-View Learning (LR-MVL)\nwhich uses a fast spectral method to estimate low dimensional context-speci\ufb01c\nword representations from unlabeled data. These representation features can then\nbe used with any supervised learner. 
LR-MVL is extremely fast, gives guaranteed\nconvergence to a global optimum, is theoretically elegant, and achieves state-of-\nthe-art performance on named entity recognition (NER) and chunking problems.\n\n1\n\nIntroduction and Related Work\n\nOver the past decade there has been increased interest in using unlabeled data to supplement the\nlabeled data in semi-supervised learning settings to overcome the inherent data sparsity and get\nimproved generalization accuracies in high dimensional domains like NLP. Approaches like [1, 2]\nhave been empirically very successful and have achieved excellent accuracies on a variety of NLP\ntasks. However, it is often dif\ufb01cult to adapt these approaches to use in conjunction with an existing\nsupervised NLP system as these approaches enforce a particular choice of model.\nAn increasingly popular alternative is to learn representational embeddings for words from a large\ncollection of unlabeled data (typically using a generative model), and to use these embeddings to\naugment the feature set of a supervised learner. Embedding methods produce features in low di-\nmensional spaces or over a small vocabulary size, unlike the traditional approach of working in the\noriginal high dimensional vocabulary space with only one dimension \u201con\u201d at a given time. Broadly,\nthese embedding methods fall into two categories:\n\n1. Clustering based word representations: Clustering methods, often hierarchical, are used to\ngroup distributionally similar words based on their contexts. The two dominant approaches\nare Brown Clustering [3] and [4]. As recently shown, HMMs can also be used to induce a\nmultinomial distribution over possible clusters [5].\n\n2. Dense representations: These representations are dense, low dimensional and real-valued.\nEach dimension of these representations captures latent information about a combination\nof syntactic and semantic word properties. 
They can be induced either using neural networks, like the C&W embeddings [6] and the Hierarchical Log-Bilinear (HLBL) embeddings [7], or by eigen-decomposition of the word co-occurrence matrix, e.g. Latent Semantic Analysis/Latent Semantic Indexing (LSA/LSI) [8].\n\nUnfortunately, most of these representations 1) are slow to train, 2) are sensitive to the scaling of the embeddings (especially \u21132-based approaches like LSA/PCA), 3) can get stuck in local optima (like an EM-trained HMM), and 4) learn a single embedding for a given word type; i.e. all the occurrences of the word \u201cbank\u201d will have the same embedding, irrespective of whether the context of the word suggests it means \u201ca financial institution\u201d or \u201ca river bank\u201d.\nIn this paper, we propose a novel context-specific word embedding method called Low Rank Multi-View Learning (LR-MVL), which is fast to train and is guaranteed to converge to the optimal solution. As presented here, our LR-MVL embeddings are context-specific, but context-oblivious embeddings (like the ones used by [6, 7]) can be trivially obtained from our model. Furthermore, building on recent advances in spectral learning for sequence models like HMMs [9, 10, 11], we show that LR-MVL has strong theoretical grounding. In particular, we show that LR-MVL estimates low dimensional context-specific word embeddings which preserve all the information in the data if the data were generated by an HMM. Moreover, LR-MVL, being linear, does not face the danger of getting stuck in local optima, as is the case for an EM-trained HMM.\nLR-MVL falls into category (2) mentioned above; it learns real-valued context-specific word embeddings by performing Canonical Correlation Analysis (CCA) [12] between the past and future views of low rank approximations of the data. 
However, LR-MVL is more general than those methods, which work on bigram or trigram co-occurrence matrices, in that it uses longer word sequence information to estimate context-specific embeddings, and also for the reasons mentioned in the last paragraph.\nThe remainder of the paper is organized as follows. In the next section we give a brief overview of CCA, which forms the core of our method. Section 3 describes our proposed LR-MVL algorithm in detail and gives theory supporting its performance. Section 4 demonstrates the effectiveness of LR-MVL on the NLP tasks of Named Entity Recognition and Chunking. We conclude with a brief summary in Section 5.\n\n2 Brief Review: Canonical Correlation Analysis (CCA)\n\nCCA [12] is the analog to Principal Component Analysis (PCA) for pairs of matrices. PCA computes the directions of maximum covariance between elements in a single matrix, whereas CCA computes the directions of maximal correlation between a pair of matrices. Unlike PCA, CCA does not depend on how the observations are scaled. This invariance of CCA to linear data transformations allows proofs that keeping the dominant singular vectors (those with largest singular values) will faithfully capture any state information.\nMore specifically, given a set of n paired observation vectors {(l1, r1), ..., (ln, rn)}\u2013in our case the two matrices are the left (L) and right (R) context matrices of a word\u2013we would like to simultaneously find the directions \u03a6l and \u03a6r that maximize the correlation of the projections of L onto \u03a6l with the projections of R onto \u03a6r. This is expressed as\n\nmax_{\u03a6l, \u03a6r}  E[\u27e8L, \u03a6l\u27e9\u27e8R, \u03a6r\u27e9] / \u221a(E[\u27e8L, \u03a6l\u27e9\u00b2] E[\u27e8R, \u03a6r\u27e9\u00b2])    (1)\n\nwhere E denotes the empirical expectation. We use the notation Clr (Cll) to denote the cross (auto) covariance matrices between L and R (i.e. L\u2019R and L\u2019L respectively).\nThe left and right canonical correlates are the solutions \u27e8\u03a6l, \u03a6r\u27e9 of the following equations:\n\nCll^{-1} Clr Crr^{-1} Crl \u03a6l = \u03bb\u03a6l\nCrr^{-1} Crl Cll^{-1} Clr \u03a6r = \u03bb\u03a6r    (2)\n\n3 Low Rank Multi-View Learning (LR-MVL)\n\nIn LR-MVL, we compute the CCA between the past and future views of the data on a large unlabeled corpus to find the common latent structure, i.e., the hidden state associated with each token. These induced representations of the tokens can then be used as features in a supervised classifier (typically discriminative).\nThe context around a word, consisting of the h words to the right and left of it, sits in a high dimensional space, since for a vocabulary of size v, each of the h words in the context requires an indicator function of dimension v. The key move in LR-MVL is to project the v-dimensional word space down to a k dimensional state space. Thus, all eigenvector computations are done in a space that is v/k times smaller than the original space. Since a typical vocabulary contains at least 50,000 words, and we use state spaces of order k \u2248 50 dimensions, this gives a 1,000-fold reduction in the size of calculations that are needed.\nThe core of our LR-MVL algorithm is a fast spectral method for learning a v \u00d7 k matrix A which maps each of the v words in the vocabulary to a k-dimensional state vector. We call this matrix the \u201ceigenfeature dictionary\u201d.\nWe now describe the LR-MVL method, give a theorem that provides intuition into how it works, and formally present the LR-MVL algorithm. The Experiments section then shows that this low rank approximation allows us to achieve state-of-the-art performance on NLP tasks.\n\n3.1 The LR-MVL method\n\nGiven an unlabeled token sequence w = {w0, w1, . . ., wn} we want to learn a low (k-)dimensional state vector {z0, z1, . . . 
, zn} for each observed token. The key is to find a v \u00d7 k matrix A (Algorithm 1) that maps each of the v words in the vocabulary to a reduced rank k-dimensional state vector, which is later used to induce context specific embeddings for the tokens (Algorithm 2).\nFor supervised learning, these context specific embeddings are supplemented with other information about each token wt, such as its identity, orthographic features such as prefixes and suffixes, or membership in domain-specific lexicons, and used as features in a classifier.\nSection 3.4 gives the algorithm more formally, but the key steps in the algorithm are, in general terms:\n\n\u2022 Take the h words to the left and to the right of each target word wt (the \u201cLeft\u201d and \u201cRight\u201d contexts), and project them each down to k dimensions using A.\n\u2022 Take the CCA between the reduced rank left and right contexts, and use the resulting model to estimate a k dimensional state vector (the \u201chidden state\u201d) for each token.\n\u2022 Take the CCA between the hidden states and the tokens wt. The singular vectors associated with wt form a new estimate of the eigenfeature dictionary.\n\nLR-MVL can be viewed as a type of co-training [13]: the state of each token wt is similar to that of the tokens both before and after it, and it is also similar to the states of the other occurrences of the same word elsewhere in the document (used in the outer iteration). 
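The CCA that these steps repeatedly apply (Eq. 2 of Section 2) can be computed with standard linear algebra. Below is a minimal numpy sketch using the equivalent whitened-SVD formulation of CCA rather than the generalized eigenproblem; the function name and the small ridge term added for numerical stability are our own additions, not part of the paper's method.

```python
import numpy as np

def cca(L, R, k):
    """Top-k canonical directions between views L (n x dl) and R (n x dr).

    Whiten each view's covariance, then take the SVD of the whitened
    cross-covariance; the singular values are the canonical correlations.
    """
    n = L.shape[0]
    L = L - L.mean(axis=0)
    R = R - R.mean(axis=0)
    ridge = 1e-8  # small regularizer for invertibility (our addition)
    Cll = L.T @ L / n + ridge * np.eye(L.shape[1])
    Crr = R.T @ R / n + ridge * np.eye(R.shape[1])
    Clr = L.T @ R / n

    def inv_sqrt(C):
        # inverse matrix square root via eigendecomposition
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wl, Wr = inv_sqrt(Cll), inv_sqrt(Crr)
    U, s, Vt = np.linalg.svd(Wl @ Clr @ Wr)
    Phi_l = Wl @ U[:, :k]   # left canonical directions
    Phi_r = Wr @ Vt[:k].T   # right canonical directions
    return Phi_l, Phi_r, s[:k]
```

When the two views share a low dimensional hidden state, as assumed throughout this section, the top canonical correlations are close to 1 and the projections recover that state up to a linear transformation.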
LR-MVL takes advantage of these two different types of similarity by alternately estimating word state using CCA on the smooths of the states of the words before and after each target token, and using the average over the states associated with all other occurrences of that word.\n\n3.2 Theoretical Properties of LR-MVL\n\nWe now present the theory behind the LR-MVL algorithm; in particular, we show that the reduced rank matrix A allows a significant data reduction while preserving the information in our data, and that the estimated state does the best possible job of capturing any label information that can be inferred by a linear model.\nLet L be an n \u00d7 hv matrix giving the words in the left context of each of the n tokens, where the context is of length h, R be the corresponding n \u00d7 hv matrix for the right context, and W be an n \u00d7 v matrix of indicator functions for the words themselves.\nWe will use the following assumptions at various points in our proof:\nAssumption 1. L, W, and R come from a rank k HMM, i.e. it has a rank k observation matrix and a rank k transition matrix, both of which have the same domain.\n\nFor example, if the dimension of the hidden state is k and the vocabulary size is v then the observation matrix, which is k \u00d7 v, has rank k. This rank condition is similar to the one used by [10].\nAssumption 1A. For the three views L, W and R, assume that there exists a \u201chidden state H\u201d of dimension n \u00d7 k, where each row Hi has the same non-singular variance-covariance matrix, such that E(Li|Hi) = Hi \u03b2L^T, E(Ri|Hi) = Hi \u03b2R^T and E(Wi|Hi) = Hi \u03b2W^T, where all \u03b2\u2019s are of rank k and Li, Ri and Wi are the rows of L, R and W respectively.\n\nAssumption 1A follows from Assumption 1.\nAssumption 2. 
\u03c1(L, W), \u03c1(L, R) and \u03c1(W, R) all have rank k, where \u03c1(X1, X2) is the expected correlation between X1 and X2.\n\nAssumption 2 is a rank condition similar to that in [9].\nAssumption 3. \u03c1([L, R], W) has k distinct singular values.\n\nAssumption 3 just makes the proof a little cleaner, since if there are repeated singular values, then the singular vectors are not unique. Without it, we would have to phrase results in terms of subspaces with identical singular values.\nWe also need to define the CCA function that computes the left and right singular vectors for a pair of matrices:\nDefinition 1 (CCA). Compute the CCA between two matrices X1 and X2. Let \u03a6X1 be a matrix containing the d largest singular vectors for X1 (sorted from the largest on down). Likewise for \u03a6X2. Define the function CCAd(X1, X2) = [\u03a6X1, \u03a6X2]. When we want just one of these \u03a6\u2019s, we will use CCAd(X1, X2)left = \u03a6X1 for the left singular vectors and CCAd(X1, X2)right = \u03a6X2 for the right singular vectors.\n\nNote that the resulting singular vectors [\u03a6X1, \u03a6X2] can be used to give two redundant estimates, X1\u03a6X1 and X2\u03a6X2, of the \u201chidden\u201d state relating X1 and X2, if such a hidden state exists.\nDefinition 2. Define the symbol \u201c\u2248\u201d to mean\n\nX1 \u2248 X2 \u21d0\u21d2 lim_{n\u2192\u221e} X1 = lim_{n\u2192\u221e} X2\n\nwhere n is the sample size.\nLemma 1. Define A by the following limit of the right singular vectors:\n\nCCAk([L, R], W)right \u2248 A.\n\nUnder Assumptions 2, 3 and 1A, if CCAk(L, R) \u2261 [\u03a6L, \u03a6R] then\n\nCCAk([L\u03a6L, R\u03a6R], W)right \u2248 A.\n\nLemma 1 shows that instead of finding the CCA between the full context and the words, we can take the CCA between the Left and Right contexts, estimate a k dimensional state from them, and take the CCA of that state with the words and get the same result. See the supplementary material for the proof.\nLet \u00c3h denote a matrix formed by stacking h copies of A on top of each other. Right multiplying L or R by \u00c3h projects each of the words in that context into the k-dimensional reduced rank space.\nThe following theorem addresses the core of the LR-MVL algorithm, showing that there is an A which gives the desired dimensionality reduction. Specifically, it shows that the previous lemma also holds in the reduced rank space.\nTheorem 1. Under Assumptions 1, 2 and 3 there exists a unique matrix A such that if CCAk(L\u00c3h, R\u00c3h) \u2261 [\u03a6\u0303L, \u03a6\u0303R] then\n\nCCAk([L\u00c3h\u03a6\u0303L, R\u00c3h\u03a6\u0303R], W)right \u2248 A\n\nwhere \u00c3h is the stacked form of A. See the supplementary material for the proof.1\n\n1It is worth noting that our matrix A corresponds to the matrix \u00db used by [9, 10]. They showed that U is sufficient to compute the probability of a sequence of words generated by an HMM; although we do not show it here (due to limited space), our A provides a more statistically efficient estimate of U than their \u00db, and hence can also be used to estimate the sequence probabilities.\n\nUnder the above assumptions, there is asymptotically (in the limit of infinite data) no benefit to first estimating state by finding the CCA between the left and right contexts and then finding the CCA between the estimated state and the words. One could instead just directly find the CCA between the combined left and right contexts and the words. However, because of the Zipfian distribution of words, many words are rare or even unique, and hence one is not in the asymptotic limit. In this case, CCA between the rare words and context will not be informative, whereas finding the CCA between the left and right contexts gives a good state vector estimate even for unique words. 
One can then fruitfully find the CCA between the contexts and the estimated state vector for their associated words.\n\n3.3 Using Exponential Smooths\n\nIn practice, we replace the projected left and right contexts with exponential smooths of them at a few different time scales (weighted averages of the previous (or next) token\u2019s state, i.e. Zt\u22121 (or Zt+1), and the previous (or next) token\u2019s smoothed state, i.e. St\u22121 (or St+1)). This gives a further dimension reduction by a factor of the context length h (say 100 words) divided by the number of smooths (often 5-7). We use a mixture of both very short and very long contexts, which capture the short and long range dependencies required by NLP problems such as NER, chunking and WSD. Since exponential smooths are linear, we preserve the linearity of our method.\n\n3.4 The LR-MVL Algorithm\n\nThe LR-MVL algorithm (using exponential smooths) is given in Algorithm 1; it computes the pair of CCAs described above in Theorem 1.\n\nAlgorithm 1 LR-MVL Algorithm - Learning from Large Amounts of Unlabeled Data\n1: Input: Token sequence Wn\u00d7v, state space size k, smoothing rates \u03b1j\n2: Initialize the eigenfeature dictionary A to random values N(0, 1).\n3: repeat\n4: Set the state Zt (1 < t \u2264 n) of each token wt to the eigenfeature vector of the corresponding word: Zt = (Aw : w = wt)\n5: Smooth the state estimates before and after each token to get a pair of views for each smoothing rate \u03b1j:\nS_t^{(l,j)} = (1 \u2212 \u03b1j) S_{t\u22121}^{(l,j)} + \u03b1j Z_{t\u22121} // left view L\nS_t^{(r,j)} = (1 \u2212 \u03b1j) S_{t+1}^{(r,j)} + \u03b1j Z_{t+1} // right view R\nwhere the tth rows of L and R are, respectively, concatenations of the smooths S_t^{(l,j)} and S_t^{(r,j)} for each of the \u03b1j\u2019s.\n6: Find the left and right canonical correlates, which are the eigenvectors \u03a6l and \u03a6r of\n(L\u2019L)^{-1} L\u2019R (R\u2019R)^{-1} R\u2019L \u03a6l = \u03bb\u03a6l\n(R\u2019R)^{-1} R\u2019L (L\u2019L)^{-1} L\u2019R \u03a6r = \u03bb\u03a6r\n7: Project the left and right views onto the space spanned by the top k/2 left and right CCAs respectively: Xl = L \u03a6l^{(k/2)} and Xr = R \u03a6r^{(k/2)}, where \u03a6l^{(k/2)}, \u03a6r^{(k/2)} are matrices composed of the singular vectors of \u03a6l, \u03a6r with the k/2 largest magnitude singular values. Estimate the state for each word wt as the union of the left and right estimates: Z = [Xl, Xr]\n8: Estimate the eigenfeatures of each word type, w, as the average of the states estimated for that word: Aw = avg(Zt : wt = w)\n9: Compute the change in A from the previous iteration\n10: until |\u0394A| < \u03b5\n11: Output: \u03a6l^k, \u03a6r^k, A.\n\nA few iterations (\u223c 5) of the above algorithm are sufficient to converge to the solution. (Since the problem is convex, there is a single solution, so there is no issue of local minima.) As [14] show for PCA, one can start with a random matrix that is only slightly larger than the true rank k of the correlation matrix, and with extremely high likelihood converge in a few iterations to within a small distance of the true principal components. In our case, if the assumptions detailed above (1, 1A, 2 and 3) are satisfied, our method converges equally rapidly to the true canonical variates.\nAs mentioned earlier, we get further dimensionality reduction in Step 5, by replacing the Left and Right context matrices with a set of exponentially smoothed values of the reduced rank projections of the context words. Step 6 finds the CCA between the Left and Right contexts. Step 7 estimates the state by combining the estimates from the left and right contexts, since we don\u2019t know which will best estimate the state. Step 8 takes the CCA between the estimated state Z and the matrix of words W. 
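Steps 5 and 8 of Algorithm 1 are simple to implement. The following is a minimal numpy sketch of the exponential-smoothing views and the word-type averaging; the function names and the zero initialization of the smooths at the sequence boundaries are our own assumptions, not spelled out in the paper.

```python
import numpy as np

def smooth_views(Z, alphas):
    """Step 5: build left/right views by exponentially smoothing the token
    states Z (n x k) at several rates; row t of the left (right) view
    smooths only the states strictly before (after) position t."""
    n, k = Z.shape
    L = np.zeros((n, k * len(alphas)))
    R = np.zeros((n, k * len(alphas)))
    for j, a in enumerate(alphas):
        s = np.zeros(k)                    # S_t = (1-a) S_{t-1} + a Z_{t-1}
        for t in range(1, n):
            s = (1 - a) * s + a * Z[t - 1]
            L[t, j * k:(j + 1) * k] = s
        s = np.zeros(k)                    # S_t = (1-a) S_{t+1} + a Z_{t+1}
        for t in range(n - 2, -1, -1):
            s = (1 - a) * s + a * Z[t + 1]
            R[t, j * k:(j + 1) * k] = s
    return L, R

def update_dictionary(Z, word_ids, v):
    """Step 8: the eigenfeature vector of a word type is the average of the
    states estimated for its tokens, A_w = avg(Z_t : w_t = w)."""
    A = np.zeros((v, Z.shape[1]))
    counts = np.zeros(v)
    np.add.at(A, word_ids, Z)
    np.add.at(counts, word_ids, 1)
    return A / np.maximum(counts, 1)[:, None]
```

Because both smoothing and averaging are linear maps of the states, these steps preserve the linearity that the theory in Section 3.2 relies on.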
Because W is a vector of indicator functions, this CCA takes the trivial form of a set of averages.\nOnce we have estimated the CCA model, it is used to generate context specific embeddings for the tokens from the training, development and test sets (as described in Algorithm 2). These embeddings are further supplemented with other baseline features and used in a supervised learner to predict the label of the token.\n\nAlgorithm 2 LR-MVL Algorithm - Inducing Context Specific Embeddings for Train/Dev/Test Data\n1: Input: Model (\u03a6l^k, \u03a6r^k, A) output from the above algorithm, and token sequences Wtrain, (Wdev, Wtest)\n2: Project the left and right views L and R after smoothing onto the space spanned by the top k left and right CCAs respectively, Xl = L\u03a6l^k and Xr = R\u03a6r^k, and the words onto the eigenfeature dictionary, Xw = Wtrain A\n3: Form the final embedding matrix Xtrain:embed by concatenating these three estimates of state: Xtrain:embed = [Xl, Xw, Xr]\n4: Output: The embedding matrices Xtrain:embed, (Xdev:embed, Xtest:embed) with context-specific representations for the tokens. These embeddings are augmented with the baseline set of features mentioned in Sections 4.1.1 and 4.1.2 before learning the final classifier.\n\nNote that we can get context \u201coblivious\u201d embeddings, i.e. one embedding per word type, just by using the eigenfeature dictionary (Av\u00d7k) output by Algorithm 1.\n\n4 Experimental Results\n\nIn this section we present the experimental results of LR-MVL on Named Entity Recognition (NER) and Syntactic Chunking tasks. 
We compare LR-MVL to state-of-the-art semi-supervised approaches like [1] (Alternating Structures Optimization (ASO)) and [2] (semi-supervised extension of CRFs), as well as to embeddings like C&W, HLBL and Brown Clustering.\n\n4.1 Datasets and Experimental Setup\n\nFor the NER experiments we used the data from the CoNLL 2003 shared task and for the Chunking experiments we used the CoNLL 2000 shared task data2, with standard training, development and testing set splits. The CoNLL \u201903 and the CoNLL \u201900 datasets had \u223c 204K/51K/46K and \u223c 212K/\u2212/47K tokens respectively for the Train/Dev./Test sets.\n\n4.1.1 Named Entity Recognition (NER)\n\nWe use the same set of baseline features as used by [15, 16] in their experiments. The detailed list of features is as below:\n\n\u2022 Current word wi; its type information: all-capitalized, is-capitalized, all-digits and so on; prefixes and suffixes of wi\n\u2022 Word tokens in a window of 2 around the current word, i.e. d = (wi\u22122, wi\u22121, wi, wi+1, wi+2); and the capitalization pattern in the window.\n\u2022 Previous two predictions yi\u22121 and yi\u22122 and the conjunction of d and yi\u22121\n\u2022 Embedding features (LR-MVL, C&W, HLBL, Brown etc.) in a window of 2 around the current word (if applicable).\n\nFollowing [17] we use a regularized averaged perceptron model with the above set of baseline features for the NER task. We also used their BILOU text chunk representation and fast greedy inference, as it was shown to give superior performance.\n\n2More details about the data and competition are available at http://www.cnts.ua.ac.be/conll2003/ner/ and http://www.cnts.ua.ac.be/conll2000/chunking/\n\nWe also augment the above set of baseline features with gazetteers, as is standard practice in NER experiments. 
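As an illustration, the baseline feature templates listed above can be sketched as a simple per-token feature map. This is our own illustrative code (the function name and exact template strings are hypothetical), not the authors' implementation:

```python
def token_features(words, i, embeddings=None, window=2):
    """Sketch of per-token features: the word and its type information,
    affixes, the surrounding word window, and (optionally) embedding
    features for each word in the window."""
    w = words[i]
    feats = {
        "w=" + w: 1.0,
        "all_caps": float(w.isupper()),
        "is_cap": float(w[:1].isupper()),
        "all_digits": float(w.isdigit()),
        "prefix3=" + w[:3]: 1.0,
        "suffix3=" + w[-3:]: 1.0,
    }
    for d in range(-window, window + 1):
        j = i + d
        if 0 <= j < len(words):
            feats["w[%d]=%s" % (d, words[j])] = 1.0
            if embeddings is not None:
                # append the dense embedding of each window word, if known
                for m, x in enumerate(embeddings.get(words[j], ())):
                    feats["emb[%d][%d]" % (d, m)] = x
    return feats
```

In a real system these sparse indicator features and the dense embedding features would be fed jointly to the perceptron or CRF described in the text.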
We tuned our free parameter, namely the size of the LR-MVL embedding, on the development set, and scaled our embedding features to have an \u21132 norm of 1 for each token and further multiplied them by a normalization constant (also chosen by cross validation), so that when they are used in conjunction with other categorical features in a linear classifier, they do not exert extra influence. The size of LR-MVL embeddings (state-space) that gave the best performance on the development set was k = 50 (50 each for Xl, Xw, Xr in Algorithm 2), i.e. the total size of the embeddings was 50\u00d73, and the best normalization constant was 0.5. We omit validation plots due to paucity of space.\n\n4.1.2 Chunking\n\nFor our chunking experiments we use a similar base set of features as above:\n\n\u2022 Current word wi and word tokens in a window of 2 around the current word, i.e. d = (wi\u22122, wi\u22121, wi, wi+1, wi+2);\n\u2022 POS tags ti in a window of 2 around the current word.\n\u2022 Word conjunction features wi \u2229 wi+1, i \u2208 {\u22121, 0} and tag conjunction features ti \u2229 ti+1, i \u2208 {\u22122, \u22121, 0, 1} and ti \u2229 ti+1 \u2229 ti+2, i \u2208 {\u22122, \u22121, 0}.\n\u2022 Embedding features in a window of 2 around the current word (when applicable).\n\nSince the CoNLL 00 chunking data does not have a development set, we randomly sampled 1000 sentences from the training data (8936 sentences) for development. So, we trained our chunking models on 7936 training sentences, evaluated their F1 score on the 1000 development sentences, and used a CRF3 as the supervised classifier. We tuned the size of the embedding and the magnitude of the \u21132 regularization penalty in the CRF on the development set, and took the log (or -log of the magnitude) of the value of the features.4 The regularization penalty that gave the best performance on the development set was 2 and here again the best size of LR-MVL embeddings (state-space) was k = 50. 
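The embedding scaling described above (unit l2 norm per token, times a cross-validated constant) is a one-liner; a minimal numpy sketch follows (the function name is ours):

```python
import numpy as np

def scale_embeddings(E, c=0.5):
    """Normalize each row (token embedding) of E to unit l2 norm, then
    multiply by the cross-validated constant c, so the dense embedding
    features do not overwhelm the sparse categorical features in a linear
    classifier. c = 0.5 was the best value reported for NER above."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    return c * E / np.maximum(norms, 1e-12)  # guard against zero rows
```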
Finally, we\ntrained the CRF on the entire (\u201coriginal\u201d) training data i.e. 8936 sentences.\n4.1.3 Unlabeled Data and Induction of embeddings\nFor inducing the embeddings we used the RCV1 corpus containing Reuters newswire from Aug \u201996\nto Aug \u201997 and containing about 63 million tokens in 3.3 million sentences5. Case was left intact and\nwe did not do the \u201ccleaning\u201d as done by [18, 16] i.e. remove all sentences which are less than 90%\nlowercase a-z, as our multi-view learning approach is robust to such noisy data, like news byline\ntext (mostly all caps) which does not correlate strongly with the text of the article.\nWe induced our LR-MVL embeddings over a period of 3 days (70 core hours on 3.0 GHz CPU)\non the entire RCV1 data by performing 4 iterations, a vocabulary size of 300k and using a variety\nof smoothing rates (\u03b1 in Algorithm 1) to capture correlations between shorter and longer contexts\n\u03b1 = [0.005, 0.01, 0.05, 0.1, 0.5, 0.9]; theoretically we could tune the smoothing parameters on the\ndevelopment set but we found this mixture of long and short term dependencies to work well in\npractice.\nAs far as the other embeddings are concerned i.e. C&W, HLBL and Brown Clusters, we downloaded\nthem from http://metaoptimize.com/projects/wordreprs. The details about their\ninduction and parameter tuning can be found in [16]; we report their best numbers here. 
It is also worth noting that the unsupervised training of LR-MVL was more than 1.5 times faster than that of the other embeddings.6\n\n4.2 Results\n\nThe results for NER and Chunking are shown in Tables 1 and 2, respectively, which show that LR-MVL performs significantly better than state-of-the-art competing methods on both the NER and Chunking tasks.\n\n3http://www.chokkan.org/software/crfsuite/\n4Our embeddings are learnt using a linear model whereas the CRF is a log-linear model, so to keep things on the same scale we did this normalization.\n5We chose this particular dataset to make a fair comparison with [1, 16], who report results using RCV1 as unlabeled data.\n6As some of these embeddings were trained on a GPGPU, which makes our method even faster comparatively.\n\nEmbedding/Model | Dev. Set F1 | Test Set F1\n(No Gazetteers)\nBaseline | 90.03 | 84.39\nC&W, 200-dim | 92.46 | 87.46\nHLBL, 100-dim | 92.00 | 88.13\nBrown 1000 clusters | 92.32 | 88.52\nAndo & Zhang \u201905 | 93.15 | 89.31\nSuzuki & Isozaki \u201908 | 93.66 | 89.36\nLR-MVL (CO) 50 \u00d7 3-dim | 93.11 | 89.55\nLR-MVL 50 \u00d7 3-dim | 93.61 | 89.91\n(With Gazetteers)\nHLBL, 100-dim | 92.91 | 89.35\nC&W, 200-dim | 92.98 | 88.88\nBrown, 1000 clusters | 93.25 | 89.41\nLR-MVL (CO) 50 \u00d7 3-dim | 93.91 | 89.89\nLR-MVL 50 \u00d7 3-dim | 94.41 | 90.06\n\nTable 1: NER Results. Note: 1). LR-MVL (CO) are Context Oblivious embeddings which are obtained from A in Algorithm 1. 2). F1-score = harmonic mean of precision and recall. 3). 
The current state-of-the-art for this NER task is 90.90 (Test Set), but using 700 billion tokens of unlabeled data [19].\n\nEmbedding/Model | Test Set F1-Score\nBaseline | 93.79\nHLBL, 50-dim | 94.00\nC&W, 50-dim | 94.10\nBrown 3200 Clusters | 94.11\nAndo & Zhang \u201905 | 94.39\nSuzuki & Isozaki \u201908 | 94.67\nLR-MVL (CO) 50 \u00d7 3-dim | 95.02\nLR-MVL 50 \u00d7 3-dim | 95.44\n\nTable 2: Chunking Results.\n\nIt is important to note that in problems like NER, the final accuracy depends on performance on rare words, and since LR-MVL is robustly able to correlate past with future views, it is able to learn better representations for rare words, resulting in overall better accuracy. On rare words (occurring < 10 times in the corpus), we got 11.7%, 10.7% and 9.6% relative reduction in error over C&W, HLBL and Brown respectively for NER; on chunking the corresponding numbers were 6.7%, 7.1% and 8.7%.\nAlso, it is worth mentioning that modeling the context in embeddings gives decent improvements in accuracies on both the NER and Chunking problems. For NER, the polysemous words were mostly ones like Chicago, Wales and Oakland, which could be either a location or an organization (sports teams, banks etc.). When we did not use the gazetteer features (which are known lists of cities, persons, organizations etc.), we saw a higher increase in F-score from modeling context than when we already had gazetteer features, which captured most of the information about polysemous words in the NER dataset, so there modeling the context did not help as much. The polysemous words for the Chunking dataset were ones like spot (VP/NP), never (VP/ADVP), more (NP/VP/ADVP/ADJP) etc. 
and in this case embeddings with context helped signi\ufb01cantly, giving 3.1 \u2212 6.5% relative im-\nprovement in accuracy over context oblivious embeddings.\n5 Summary and Conclusion\nIn this paper, we presented a novel CCA-based multi-view learning method, LR-MVL, for large\nscale sequence learning problems such as arise in NLP. LR-MVL is a spectral method that works\nin low dimensional state-space so it is computationally ef\ufb01cient, and can be used to train using\nlarge amounts of unlabeled data; moreover it does not get stuck in local optima like an EM trained\nHMM. The embeddings learnt using LR-MVL can be used as features with any supervised learner.\nLR-MVL has strong theoretical grounding; is much simpler and faster than competing methods and\nachieves state-of-the-art accuracies on NER and Chunking problems.\nAcknowledgements: The authors would like to thank Alexander Yates, Ted Sandler and the three\nanonymous reviews for providing valuable feedback. We would also like to thank Lev Ratinov and\nJoseph Turian for answering our questions regarding their paper [16].\n\n8\n\n\fReferences\n[1] Ando, R., Zhang, T.: A framework for learning predictive structures from multiple tasks and\n\nunlabeled data. Journal of Machine Learning Research 6 (2005) 1817\u20131853\n\n[2] Suzuki, J., Isozaki, H.: Semi-supervised sequential labeling and segmentation using giga-word\n\nscale unlabeled data. In: In ACL. (2008)\n\n[3] Brown, P., deSouza, P., Mercer, R., Pietra, V.D., Lai, J.: Class-based n-gram models of natural\n\nlanguage. Comput. Linguist. 18 (December 1992) 467\u2013479\n\n[4] Pereira, F., Tishby, N., Lee, L.: Distributional clustering of English words. In: 31st Annual\n\nMeeting of the ACL. (1993) 183\u2013190\n\n[5] Huang, F., Yates, A.: Distributional representations for handling sparsity in supervised\nsequence-labeling. 
ACL \u201909, Stroudsburg, PA, USA, Association for Computational Linguis-\ntics (2009) 495\u2013503\n\n[6] Collobert, R., Weston, J.: A uni\ufb01ed architecture for natural language processing: deep neural\n\nnetworks with multitask learning. ICML \u201908, New York, NY, USA, ACM (2008) 160\u2013167\n\n[7] Mnih, A., Hinton, G.: Three new graphical models for statistical language modelling. ICML\n\n\u201907, New York, NY, USA, ACM (2007) 641\u2013648\n\n[8] Dumais, S., Furnas, G., Landauer, T., Deerwester, S., Harshman, R.: Using latent semantic\nanalysis to improve access to textual information. In: SIGCHI Conference on human factors\nin computing systems, ACM (1988) 281\u2013285\n\n[9] Hsu, D., Kakade, S., Zhang, T.: A spectral algorithm for learning hidden markov models. In:\n\nCOLT. (2009)\n\n[10] Siddiqi, S., Boots, B., Gordon, G.J.: Reduced-rank hidden Markov models. In: AISTATS-\n\n2010. (2010)\n\n[11] Song, L., Boots, B., Siddiqi, S.M., Gordon, G.J., Smola, A.J.: Hilbert space embeddings of\n\nhidden Markov models. In: ICML. (2010)\n\n[12] Hotelling, H.: Canonical correlation analysis (cca). Journal of Educational Psychology (1935)\n[13] Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT\u2019\n\n98. (1998) 92\u2013100\n\n[14] Halko, N., Martinsson, P.G., Tropp, J.: Finding structure with randomness: Probabilistic\n\nalgorithms for constructing approximate matrix decompositions. (Dec 2010)\n\n[15] Zhang, T., Johnson, D.: A robust risk minimization based named entity recognition system.\n\nCONLL \u201903 (2003) 204\u2013207\n\n[16] Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for\nsemi-supervised learning. ACL \u201910, Stroudsburg, PA, USA, Association for Computational\nLinguistics (2010) 384\u2013394\n\n[17] Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In:\n\nCONLL. 
(2009) 147\u2013155\n\n[18] Liang, P.: Semi-supervised learning for natural language. Master\u2019s thesis, Massachusetts\n\nInstitute of Technology (2005)\n\n[19] Lin, D., Wu, X.: Phrase clustering for discriminative learning. In: Proceedings of the Joint\nConference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference\non Natural Language Processing of the AFNLP: Volume 2 - Volume 2. ACL \u201909, Stroudsburg,\nPA, USA, Association for Computational Linguistics (2009) 1030\u20131038\n\n9\n\n\f", "award": [], "sourceid": 157, "authors": [{"given_name": "Paramveer", "family_name": "Dhillon", "institution": null}, {"given_name": "Dean", "family_name": "Foster", "institution": null}, {"given_name": "Lyle", "family_name": "Ungar", "institution": null}]}