{"title": "Contrastive Learning Using Spectral Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 2238, "page_last": 2246, "abstract": "In many natural settings, the analysis goal is not to characterize a single data set in isolation, but rather to understand the difference between one set of observations and another. For example, given a background corpus of news articles together with writings of a particular author, one may want a topic model that explains word patterns and themes specific to the author. Another example comes from genomics, in which biological signals may be collected from different regions of a genome, and one wants a model that captures the differential statistics observed in these regions. This paper formalizes this notion of contrastive learning for mixture models, and develops spectral algorithms for inferring mixture components specific to a foreground data set when contrasted with a background data set. The method builds on recent moment-based estimators and tensor decompositions for latent variable models, and has the intuitive feature of using background data statistics to appropriately modify moments estimated from foreground data. A key advantage of the method is that the background data need only be coarsely modeled, which is important when the background is too complex, noisy, or not of interest. The method is demonstrated on applications in contrastive topic modeling and genomic sequence analysis.", "full_text": "Contrastive Learning Using Spectral Methods\n\nJames Zou\n\nHarvard University\n\nDaniel Hsu\n\nColumbia University\n\nDavid Parkes\n\nHarvard University\n\nRyan Adams\n\nHarvard University\n\nAbstract\n\nIn many natural settings, the analysis goal is not to characterize a single data set in\nisolation, but rather to understand the difference between one set of observations\nand another. 
For example, given a background corpus of news articles together with writings of a particular author, one may want a topic model that explains word patterns and themes specific to the author. Another example comes from genomics, in which biological signals may be collected from different regions of a genome, and one wants a model that captures the differential statistics observed in these regions. This paper formalizes this notion of contrastive learning for mixture models, and develops spectral algorithms for inferring mixture components specific to a foreground data set when contrasted with a background data set. The method builds on recent moment-based estimators and tensor decompositions for latent variable models, and has the intuitive feature of using background data statistics to appropriately modify moments estimated from foreground data. A key advantage of the method is that the background data need only be coarsely modeled, which is important when the background is too complex, noisy, or not of interest. The method is demonstrated on applications in contrastive topic modeling and genomic sequence analysis.

1 Introduction

Generative latent variable models offer an intuitive way to explain data in terms of hidden structure, and are a cornerstone of exploratory data analysis. Popular examples of generative latent variable models include Latent Dirichlet Allocation (LDA) [1] and Hidden Markov Models (HMMs) [2], although the modularity of the generative approach has led to a wide range of variations. One of the challenges of using latent variable models for exploratory data analysis, however, is developing models and learning techniques that accurately reflect the intuitions of the modeler. 
In particular, when analyzing multiple specialized data sets, it is often the case that the most salient statistical structure (the structure most easily found by fitting latent variable models) is shared across all the data and does not reflect interesting local structure specific to each set. For example, if we apply a topic model to a set of English-language scientific papers on computer science, we might hope to identify different co-occurring words within subfields such as theory, systems, graphics, etc. Instead, such a model will simply learn about English syntactic structure and invent topics that reflect uninteresting statistical correlations between stop words [3]. Intuitively, what we would like from such an exploratory analysis is to answer the question: what makes these data different from other sets of data in the same broad category?
To answer this question, we develop a new set of techniques that we refer to as contrastive learning methods. These methods differentiate between foreground and background data and seek to learn a latent variable model that captures statistical relationships that appear in the foreground but do not appear in the background. Revisiting the previous scientific-topics example, contrastive learning could treat computer science papers as a foreground corpus and (say) English-language news articles as a background corpus. As both corpora share the same broad syntactic structure, a contrastive foreground topic model would be more likely to discover semantic relationships between words that are specific to computer science. This intuition has broad applicability in other models and domains as well. For example, in genomics one might use a contrastive hidden Markov model to amplify the signal of a particular class of sequences, relative to the broader genome.

[Figure 1: These figures show foreground and background data from Gaussian distributions. The foreground data has greater variance in its minor direction, but the same variance in its major direction. The means are slightly different. Different projection lines are shown for different methods, to illustrate the difference between (a) the purely unsupervised variance-preserving linear projection of principal component analysis, and (b) the contrastive foreground projection that captures variance that is not present in the background.]

Note that the objective of contrastive learning is not to discriminate between foreground and background data, but to learn an interpretable generative model that captures the differential statistics between the two data sets. To clarify this difference, consider the difference between principal component analysis and contrastive analysis. Principal component analysis finds the linear projection that maximally preserves variance without regard to foreground versus background. A contrastive approach, however, would try to find a linear projection that maximally preserves the foreground variance that is not explained by the background. Figure 1 illustrates the differences between these two approaches. Novelty detection [4] is also related, but it does not directly learn a generative model of the novelty.

Our contributions. We formalize the concept of contrastive learning for mixture models and present new spectral contrast algorithms. We prove that by appropriately "subtracting" background moments from the foreground moments, our algorithms recover the model for the foreground-specific data. To achieve this, we extend recent developments in learning latent variable models with moment matching and tensor decompositions. 
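The geometric contrast in Figure 1 is easy to reproduce numerically. The following is an illustrative sketch (not from the paper, with made-up synthetic data): the PCA direction is the top eigenvector of the foreground covariance, while a simple linear contrastive direction is the top eigenvector of the foreground covariance minus the background covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data in the spirit of Figure 1: both sets share their major-axis
# variance; the foreground has extra variance along the minor axis and a
# slightly shifted mean. All numbers here are made up for illustration.
background = rng.normal(size=(4000, 2)) * np.array([3.0, 0.5])
foreground = rng.normal(size=(4000, 2)) * np.array([3.0, 1.5]) + 0.5

def top_direction(sym):
    # Unit eigenvector for the largest (signed) eigenvalue of a symmetric matrix.
    vals, vecs = np.linalg.eigh(sym)
    return vecs[:, np.argmax(vals)]

pca_dir = top_direction(np.cov(foreground, rowvar=False))
contrast_dir = top_direction(np.cov(foreground, rowvar=False)
                             - np.cov(background, rowvar=False))
```

On this toy data, PCA selects the shared major axis, whereas the contrastive direction selects the minor axis, the one direction in which the foreground has variance the background cannot explain.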
We demonstrate the effectiveness, robustness, and scalability of our method in contrastive topic modeling and contrastive genomics.

2 Contrastive learning in mixture models

Many data can be naturally described by a mixture model. The general mixture model has the form

p({x_n}_{n=1}^N ; {(µ_j, w_j)}_{j=1}^J) = Π_{n=1}^N [ Σ_{j=1}^J w_j f(x_n | µ_j) ]   (1)

where {µ_j} are the parameters of the mixture components, {w_j} are the mixture weights, and f(·|µ_j) is the density of the j-th mixture component. Each µ_j is a vector in some parameter space, and a common estimation task is to infer the component parameters {(µ_j, w_j)} given the observed data {x_n}.
In many applications, we have two sets of observations {x^f_n} and {x^b_n}, which we call the foreground data and the background data, respectively. The foreground and background are generated by two possibly overlapping sets of mixture components. More concretely, let {µ_j}_{j∈A}, {µ_j}_{j∈B}, and {µ_j}_{j∈C} be three disjoint sets of parameters, with A, B, and C being three disjoint index sets. The foreground {x^f_n} is generated from the mixture model {(µ_j, w^f_j)}_{j∈A∪B}, and the background {x^b_n} is generated from {(µ_j, w^b_j)}_{j∈B∪C}.
The goal of contrastive learning is to infer the parameters {(µ_j, w^f_j)}_{j∈A}, which we call the foreground-specific model. The direct approach would be to infer {(µ_j, w^f_j)}_{j∈A∪B} just from {x^f_n}, and in parallel infer {(µ_j, w^b_j)}_{j∈B∪C} just from {x^b_n}, and then pick out the components specific to the foreground. However, this involves explicitly learning a model for the background data, which is undesirable if the background is too complex, if {x^b_n} is too noisy, or if we do not want to devote computational power to learn the background. 
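To make the foreground/background setup concrete, the following sketch (with hypothetical 1-D Gaussian components; none of these values come from the paper) generates foreground data from components in A ∪ B and background data from components in B ∪ C:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D Gaussian components, indexed by the disjoint sets in the text:
# A = {0} (foreground-specific), B = {1} (shared), C = {2} (background-specific).
mu = np.array([-4.0, 0.0, 4.0])

def sample_mixture(weights, n):
    # Draw n points from sum_j w_j f(x | mu_j), with f a unit-variance Gaussian.
    comps = list(weights)
    probs = np.array([weights[j] for j in comps])
    picks = rng.choice(comps, size=n, p=probs)
    return rng.normal(mu[picks], 1.0)

x_f = sample_mixture({0: 0.6, 1: 0.4}, 5000)   # foreground: components in A ∪ B
x_b = sample_mixture({1: 0.7, 2: 0.3}, 5000)   # background: components in B ∪ C
```

The contrastive goal on such data would be to recover component 0 (and its foreground weight) without modeling component 2 accurately.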
In many applications, we are only interested in learning a generative model for the difference between the foreground and background, because that contrast is the interesting signal.
In this paper, we introduce an efficient and general approach to learn the foreground-specific model without having to learn an accurate model of the background. Our approach is based on a method-of-moments that uses higher-order tensor decompositions for estimation [5]; we generalize the tensor decomposition technique to deal with our task of contrastive learning. Many other recent spectral learning algorithms for latent variable models are also based on the method-of-moments (e.g., [6–13]), but their parameter estimation cannot account for the asymmetry between foreground and background.
We demonstrate spectral contrastive learning through two concrete applications: contrastive topic modeling and contrastive genomics. In contrastive topic modeling we are given a foreground corpus of documents and a background corpus. We want to learn a fully generative topic model that explains the foreground-specific documents (the contrast). We show that even when the background is extremely sparse (too noisy to learn a good background topic model) our spectral contrast algorithm still recovers foreground-specific topics. In contrastive genomics, sequence data is modeled by HMMs. The foreground data is generated by a mixture of two HMMs; one is foreground-specific, and the other captures some background process. The background data is generated by this second HMM. Contrastive learning amplifies the foreground-specific signal, which has a meaningful biological interpretation.

3 Contrastive topic modeling

To illustrate contrastive analysis and introduce tensor methods, we consider a simple topic model where each document is generated by exactly one topic. 
In LDA [1], this corresponds to setting the Dirichlet prior hyper-parameter α → 0. The techniques here can be extended to the general α > 0 case using the moment transformations given in [10]. The generative topic model for a document is as follows.

• A word x is represented by an indicator vector e_x ∈ R^D which is 1 in its x-th entry and 0 elsewhere; D is the size of the vocabulary. A document is a bag-of-words and is represented by a vector c ∈ R^D with non-negative integer word counts.
• A topic is first chosen according to the distribution on [K] := {1, 2, . . . , K} specified by the probability vector w ∈ R^K.
• Given that the chosen topic is t, the words in the document are drawn independently from the distribution specified by the probability vector µ_t ∈ R^D.

Following previous work (e.g., [10]) we assume that µ_1, µ_2, . . . , µ_K are linearly independent probability vectors in R^D. Let the foreground corpus of documents be generated by the mixture of |A| + |B| topics {(µ_t, w^f_t)}_{t∈A} ∪ {(µ_t, w^f_t)}_{t∈B}, and the background corpus be generated by the mixture of |B| + |C| topics {(µ_t, w^b_t)}_{t∈B} ∪ {(µ_t, w^b_t)}_{t∈C} (here, we assume (A, B, C) is a non-trivial partition of [K], and that w^f_t, w^b_t > 0 for all t). Our goal is to learn {(µ_t, w^f_t)}_{t∈A}.

3.1 Moment decompositions

We use the symbol ⊗ to denote the tensor product of vectors, so a ⊗ b is the matrix whose (i, j)-th entry is a_i b_j, and a ⊗ b ⊗ c is the third-order tensor whose (i, j, k)-th entry is a_i b_j c_k. 
Given a third-order tensor T ∈ R^{d1×d2×d3} and vectors a ∈ R^{d1}, b ∈ R^{d2}, and c ∈ R^{d3}, we let T(I, b, c) ∈ R^{d1} denote the vector whose i-th entry is Σ_{j,k} T_{i,j,k} b_j c_k, and T(a, b, c) denote the scalar Σ_{i,j,k} T_{i,j,k} a_i b_j c_k.
We review the moments of the word observations in this model (see, e.g., [10]). Let x1, x2, x3 ∈ [D] be three random words sampled from a random document generated by the foreground model (the discussion here also applies to the background model). The second-order (cross) moment matrix M^f_2 := E[e_{x1} ⊗ e_{x2}] is the matrix whose (i, j)-th entry is the probability that x1 = i and x2 = j. Similarly, the third-order (cross) moment tensor M^f_3 := E[e_{x1} ⊗ e_{x2} ⊗ e_{x3}] is the third-order tensor whose (i, j, k)-th entry is the probability that x1 = i, x2 = j, and x3 = k. Observe that for any t ∈ A ∪ B, the i-th entry of E[e_{x1} | topic = t] is precisely the probability that x1 = i given topic = t, which is the i-th entry of µ_t. Therefore, E[e_{x1} | topic = t] = µ_t. 

Algorithm 1 Contrastive Topic Model estimator
input: Foreground and background documents {c^f_n}, {c^b_n}; parameter γ > 0; number of topics K.
output: Foreground-specific topics Topics^f.
1: Let M̂^f_2 and M̂^f_3 (M̂^b_2 and M̂^b_3) be the foreground (background) second- and third-order moment estimates based on {c^f_n} ({c^b_n}), and let M̂_2 := M̂^f_2 − γ M̂^b_2 and M̂_3 := M̂^f_3 − γ M̂^b_3.
2: Run Algorithm 2 with input M̂_2, M̂_3, K, and N to obtain {(â_t, λ̂_t) : t ∈ [K]}.
3: Topics^f := {(â_t / ‖â_t‖_1, 1/λ̂_t^2) : t ∈ [K], λ̂_t > 0}.
Since the words are independent given the topic, the (i, j)-th entry of E[e_{x1} ⊗ e_{x2} | topic = t] is the product of the i-th and j-th entries of µ_t, i.e., E[e_{x1} ⊗ e_{x2} | topic = t] = µ_t ⊗ µ_t. Similarly, E[e_{x1} ⊗ e_{x2} ⊗ e_{x3} | topic = t] = µ_t ⊗ µ_t ⊗ µ_t. Averaging over the choices of t ∈ A ∪ B with the weights w^f_t implies that the second- and third-order moments are

M^f_2 = E[e_{x1} ⊗ e_{x2}] = Σ_{t∈A∪B} w^f_t µ_t ⊗ µ_t   and   M^f_3 = E[e_{x1} ⊗ e_{x2} ⊗ e_{x3}] = Σ_{t∈A∪B} w^f_t µ_t ⊗ µ_t ⊗ µ_t.

(We discuss how to efficiently use documents of length > 3 in Section 5.2.) We can similarly decompose the background moments M^b_2 and M^b_3 in terms of tensor products of {µ_t}_{t∈B∪C}. These equations imply the following proposition (proved in Appendix A).
Proposition 1. Let M^f_2, M^b_2 and M^f_3, M^b_3 be the second- and third-order moments from the foreground and background data, respectively. Define

M_2 := M^f_2 − γ M^b_2   and   M_3 := M^f_3 − γ M^b_3.

If γ ≥ max_{j∈B} w^f_j / w^b_j, then

M_2 = Σ_{t=1}^K ω_t µ_t ⊗ µ_t   and   M_3 = Σ_{t=1}^K ω_t µ_t ⊗ µ_t ⊗ µ_t   (2)

where ω_t = w^f_t > 0 for t ∈ A (foreground-specific topic), and ω_t ≤ 0 for t ∈ B ∪ C.

Using tensor decompositions. Proposition 1 implies that the modified moments M_2 and M_3 have low-rank decompositions in which the components t with positive multipliers ω_t correspond to the foreground-specific topics {(µ_t, w^f_t)}_{t∈A}. A main technical innovation of this paper is a generalized tensor power method, described in Section 5, which takes as input (estimates of) second- and third-order tensors of the form in (2), and approximately recovers the individual components. 
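Proposition 1 can be checked numerically on a toy instance. In the sketch below (illustrative values only, not from the paper), topic 0 plays the role of A, topic 1 of B, and topic 2 of C; subtracting γ times the background moments leaves a positive multiplier only on the foreground-specific topic:

```python
import numpy as np

# Toy instance of Proposition 1 with made-up numbers: topic 0 ∈ A
# (foreground-specific), topic 1 ∈ B (shared), topic 2 ∈ C (background-only).
mu = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.7, 0.2],
               [0.2, 0.1, 0.7]])
w_f = {0: 0.6, 1: 0.4}   # foreground weights on A ∪ B
w_b = {1: 0.5, 2: 0.5}   # background weights on B ∪ C

def moments(weights):
    M2 = sum(w * np.einsum("i,j->ij", mu[t], mu[t]) for t, w in weights.items())
    M3 = sum(w * np.einsum("i,j,k->ijk", mu[t], mu[t], mu[t])
             for t, w in weights.items())
    return M2, M3

gamma = w_f[1] / w_b[1]          # max_{j in B} w^f_j / w^b_j (here B = {1})
M2f, M3f = moments(w_f)
M2b, M3b = moments(w_b)
M2 = M2f - gamma * M2b
M3 = M3f - gamma * M3b

# The multipliers omega_t of equation (2): positive only for t ∈ A.
omega = [w_f[0], w_f[1] - gamma * w_b[1], -gamma * w_b[2]]
assert omega[0] > 0 and omega[1] <= 0 and omega[2] <= 0
assert np.allclose(M2, sum(o * np.einsum("i,j->ij", mu[t], mu[t])
                           for t, o in enumerate(omega)))
```

With γ set exactly to the lower bound, the shared topic's multiplier cancels to zero and the background-only topic's multiplier is negative, as the proposition states.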
We argue that under some natural conditions, the generalized power method is robust to large perturbations in M^b_2 and M^b_3, which suggests that foreground-specific topics can be learned even when it is not possible to accurately model the background. We use the generalized tensor power method to estimate the foreground-specific topics in our Contrastive Topic Model estimator (Algorithm 1). Proposition 1 gives the lower bound on γ; we empirically find that γ ≈ max_{j∈B} w^f_j / w^b_j gives good results. When γ is too large, the convergence of the tensor power method worsens. Where possible in practice, we recommend using prior belief about foreground and background compositions to estimate max_{j∈B} w^f_j / w^b_j, and then vary γ as part of the exploratory analysis.

3.2 Experiments with contrastive topic modeling

We test our contrastive topic models on the RCV1 dataset, which consists of ≈ 800000 news articles. Each document comes with multiple category labels (e.g., economics, entertainment) and region labels (e.g., USA, Europe, China). The corpus spans a large set of complex and overlapping categories, making this a good dataset to validate our contrastive learning algorithm.
In one set of experiments, we take documents associated with one region as the foreground corpus, and documents associated with a general theme, such as economics, as the background. 
The goal of the contrast is to find the region-specific topics which are not relevant to the background theme. The top half of Table 1 shows the example where we take USA-related documents as the foreground and Economics as the background theme. We first set the contrast parameter γ = 0 in Algorithm 1; this learns the topics from the foreground dataset alone. Due to the composition of the corpus, the foreground topics for USA are dominated by topics relevant to stock markets and trade; representative topics and keywords are shown on the left of Table 1. Then we increase γ to observe the effects of contrast. In the right half of Table 1, we show the heavily weighted topics and keywords for when γ = 2. The topics involving market and trade are also present in the background corpus, so their weights are reduced through contrast. Topics which are very USA-specific and distinct from economics rise to the top: basketball, baseball, scientific research, etc. A similar experiment with China-related articles as foreground, and the same economics-themed background, is shown in the bottom of Table 1.

[Table 1: Top words from representative topics: foreground alone (left); foreground/background contrast (right). Each column corresponds to one topic. Panels: USA foreground; USA foreground with Economics background; China foreground; China foreground with Economics background.]

[Figure 2: (a) Relative AUC (classification score) as a function of γ (Sec. 3.2), for background corpora of N = 10000, 1000, 100, and 50 documents. (b) Emission probabilities of HMM states (Sec. 4), foreground versus contrast.]

These examples illustrate that Algorithm 1 learns topics which are unique to the foreground. To quantify this effect, we devised a specificity test. 
Using the RCV1 labels, we partition the foreground USA documents into two disjoint groups: documents with any economics-related labels (group 0) and the rest (group 1). Because Algorithm 1 learns the full probabilistic model, we use the inferred topic parameters to compute the marginal likelihood for each foreground document given the model. We then use the likelihood value to classify each foreground document as belonging to group 0 or 1. The performance of the classifier is summarized by the AUC score.
We first set γ = 0 and compute the AUC score, which corresponds to how well a topic model learned from only the foreground can distinguish between the two groups. We use this score as the baseline and normalize so it is equal to 1. The hope is that by using the background data, the contrastive model can better identify the documents that are generated by foreground-specific topics. Indeed, as γ increases, the AUC score improves significantly over the benchmark (dark blue bars in Figure 2(a)). For γ > 2 we find that the foreground-specific topics do not change qualitatively.
A major advantage of our approach is that we do not need to learn a very accurate background model to learn the contrast. To validate this, we downsample the background corpus to 1000, 100, and 50 documents. This simulates settings where the background is very sparsely sampled, so it is not possible to learn a background model very accurately. Qualitatively, we observe that even with only 50 randomly sampled background documents, Algorithm 1 still recovers topics specific to USA and not related to Economics. At γ = 2, it learns sports and NASA/space as the most prominent foreground-specific topics. 
This is supported by the specificity test, where contrastive topic models with sparse background better identify foreground-specific documents relative to the γ = 0 (foreground data-only) model.

4 Contrastive Hidden Markov Models

Hidden Markov Models (HMMs) are commonly used to model sequence and time series data. For example, a biologist may collect several sequences from an experiment; some of the sequences are generated by a biological process of interest (modeled by an HMM), while others are generated by a different "background" process, e.g., noise or a process that is not of primary interest.
Consider a simple generative process where foreground data are generated by a mixture of two HMMs, (1 − γ) HMM_A + γ HMM_B, and background data are generated by HMM_B. The goal is to learn the parameters of HMM_A, which models the biological process of interest. As we did for topic models, we can estimate a contrastive HMM by taking appropriate combinations of observable moments. Let x^f_1, x^f_2, x^f_3, . . . be a random emission sequence in R^D generated by the foreground model (1 − γ) HMM_A + γ HMM_B, and x^b_1, x^b_2, x^b_3, . . . be the sequence generated by the background model HMM_B. Following [5], we estimate the cross moment matrices and tensors M^f_{1,2} := E[x^f_1 ⊗ x^f_2], M^f_{1,3} := E[x^f_1 ⊗ x^f_3], M^f_{2,3} := E[x^f_2 ⊗ x^f_3], and M^f_{1,2,3} := E[x^f_1 ⊗ x^f_2 ⊗ x^f_3], as well as the corresponding moments for the background model. This is similar to the estimation of the word pair and triple frequencies in LDA. Here we only use the first three observations in the sequence, but it is also justifiable to average over all consecutive observation triplets [14]. Then, analogous to Proposition 1, we define the contrastive moments as M_{u,v} := M^f_{u,v} − γ M^b_{u,v} (for {u, v} ⊂ {1, 2, 3}) and M_{1,2,3} := M^f_{1,2,3} − γ M^b_{1,2,3}. In the Appendix (Sec. D and Algorithm 3), we describe how to recover the foreground-specific model HMM_A. The key technical difference from contrastive LDA lies in the asymmetric generalization of the Tensor Power Method of Algorithm 2.

Application to contrastive genomics. For many biological problems, it is important to understand how signals in certain data are enriched relative to some related background data. For instance, we may want to contrast foreground data composed of gene expressions (or mutation rates, protein levels, etc.) from one population against background data taken from (say) a control experiment, a different cell type, or a different time point. The contrastive analysis methods developed here can be a powerful exploratory tool for biology.
As a concrete illustration, we use spectral contrast to refine the characterization of chromatin states. The human genome consists of ≈ 3 billion DNA bases, and it has recently been shown that these bases can be naturally segmented into a handful of chromatin states [15, 16]. Each state describes a set of genomic properties: several states describe different active and regulatory features, while other states describe repressive features. The chromatin state varies across the genome, remaining constant for relatively short regions (say, several thousand bases). Learning the nature of the chromatin states is of great interest in genomics. The state-of-the-art approach for modeling chromatin states uses an HMM [16]. The observable data are, at every 200 bases, a binary feature vector in {0, 1}^10. Each feature indicates the presence/absence of a specific chemical feature at that site (assumed independent given the chromatin state). This corresponds to ≈ 15 million observations across the genome, which are used to learn the parameters of an HMM. 
Each chromatin state corresponds to a latent state, characterized by a vector of 10 emission probabilities.
We take as foreground data the observations from exons, introns and promoters, which account for about 30% of the genome; as background data, we take observations from intergenic regions. Because exons and introns are transcribed, we expect the foreground to be a mixture of functional chromatin states and spurious states due to noise, and expect more of the background observations to be due to non-functional processes. The contrastive HMM should capture biologically meaningful signals in the foreground data. In Figure 2(b), we show the emission matrix for the foreground HMM and for the contrastive HMM. We learn K = 7 latent states, corresponding to 7 chromatin states.

Algorithm 2 Generalized Tensor Power Method
input: M̂_2 ∈ R^{D×D}; M̂_3 ∈ R^{D×D×D}; target rank K; number of iterations N.
output: Estimates {(â_t, λ̂_t) : t ∈ [K]}.
1: Let M̂†_2 := Moore-Penrose pseudoinverse of the rank-K approximation to M̂_2; initialize T := M̂_3.
2: for t = 1 to K do
3:   Randomly draw u^(0) ∈ R^D from any distribution with full support in the range of M̂_2.
4:   Repeat the power iteration update N times: u^(i+1) := T(I, M̂†_2 u^(i), M̂†_2 u^(i)).
5:   â_t := u^(N) / |⟨u^(N), M̂†_2 u^(N)⟩|^{1/2}; λ̂_t := T(M̂†_2 â_t, M̂†_2 â_t, M̂†_2 â_t); T := T − |λ̂_t| â_t ⊗ â_t ⊗ â_t.
6: end for

Each row in Figure 2(b) is a chemical feature of the genome. The foreground states recover the known biological chromatin states from the literature [16]. For example, state 6, with high emission for K36me3, is transcribed genes; state 5 is active enhancers; state 4 is poised enhancers. In the contrastive HMM, most of the states are the same as before. 
Interestingly, state 7, which is associated with feature K20me1, drops from the largest component of the foreground to a very small component of the contrast. This finding suggests that state 7 and K20me1 are less specific to gene bodies than previously thought [17], and raises more questions regarding its function, which is relatively unknown.

5 Generalized tensor power method

We now describe our general approach for tensor decomposition used in Algorithm 1. Let a_1, a_2, . . . , a_K ∈ R^D be linearly independent vectors, and set A := [a_1 | a_2 | ··· | a_K]. Let M_2 := Σ_{i=1}^K σ_i a_i ⊗ a_i and M_3 := Σ_{i=1}^K λ_i a_i ⊗ a_i ⊗ a_i, where σ_i = sign(λ_i) ∈ {±1}. The goal is to recover {(a_t, λ_t) : t ∈ [K]} from (estimates of) M_2 and M_3.
The following proposition shows that one of the vectors a_i (and its associated λ_i) can be obtained from M_2 and M_3 using a simple power method similar to that from [5, 18] (note that which of the K components is obtained depends on the initialization of the procedure). Note that the error ε is exponentially small in 2^t after t iterations, so the number of iterations required to converge is very small. Below, we use (·)† to denote the Moore-Penrose pseudoinverse.
Proposition 2 (Informal statement). Consider the sequence u^(0), u^(1), . . . in R^D determined by u^(i+1) := M_3(I, M†_2 u^(i), M†_2 u^(i)). 
Then for any ε ∈ (0, 1) and almost all u^(0) ∈ range(A), there exist t* ∈ [K] and c_1, c_2 > 0 (all depending on u^(0) and {(a_t, λ_t) : t ∈ [K]}) such that

‖ũ^(i) − a_{t*}‖_2 ≤ ε   and   |λ̃ − |λ_{t*}|| ≤ |λ_{t*}| ε + max_{t≠t*} |λ_t| ε^{3/2}

for ε := c_1 exp(−c_2 2^i), where ũ^(i) := σ_{t*} u^(i) / ‖A† u^(i)‖ and λ̃ := M_3(M†_2 ũ^(i), M†_2 ũ^(i), M†_2 ũ^(i)).
See Appendix B for the formal statement and proof, which give explicit dependencies. We use the iterations from Proposition 2 in our main decomposition algorithm (Algorithm 2), which is a variant of the main algorithm from [5]. The main difference is that we do not require M_2 to be positive semi-definite; this is essential for our application, but requires subtle modifications. For simplicity, we assume we run Algorithm 2 with exact moments M_2 and M_3; a detailed perturbation analysis would be similar to that in [5] but is beyond the scope of this paper. Proposition 2 shows that a single component can be accurately recovered, and we use deflation to recover subsequent components (normalization and deflation are further discussed in Appendix B). As noted in [5], errors introduced in this deflation step have only a lower-order effect, and therefore deflation can be used reliably to recover all K components. For increased robustness, we actually repeat steps 3–5 in Algorithm 2 several times, and use the results of the trial in which |λ̂_t| takes the median value.

5.1 Robustness to sparse background sampling

Algorithm 1 can recover the foreground-specific {µ_t}_{t∈A} even with relatively small numbers of background data. 
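A minimal numpy sketch of Algorithm 2, assuming exact moments as in the discussion above. The only liberty taken is rescaling u after each power iteration update for numerical stability; this does not change the direction the iteration converges to.

```python
import numpy as np

def generalized_power_method(M2, M3, K, n_iters=30, rng=None):
    # Sketch of Algorithm 2 (exact moments assumed): recover {(a_t, lambda_t)}
    # from M2 = sum_t sign(lambda_t) a_t ⊗ a_t and
    # M3 = sum_t lambda_t a_t ⊗ a_t ⊗ a_t.
    rng = rng if rng is not None else np.random.default_rng()
    # Step 1: pseudoinverse of the rank-K approximation of M2 (M2 need not be PSD).
    vals, vecs = np.linalg.eigh(M2)
    top = np.argsort(np.abs(vals))[-K:]
    M2_pinv = (vecs[:, top] / vals[top]) @ vecs[:, top].T
    T = M3.copy()
    estimates = []
    for _ in range(K):
        # Step 3: random start in the range of M2.
        u = vecs[:, top] @ rng.normal(size=K)
        for _ in range(n_iters):
            # Step 4: u <- T(I, M2_pinv u, M2_pinv u), rescaled for stability.
            u = np.einsum("ijk,j,k->i", T, M2_pinv @ u, M2_pinv @ u)
            u = u / np.linalg.norm(u)
        # Step 5: normalize, read off the eigenvalue, and deflate.
        a = u / np.sqrt(abs(u @ (M2_pinv @ u)))
        b = M2_pinv @ a
        lam = np.einsum("ijk,i,j,k->", T, b, b, b)
        T = T - abs(lam) * np.einsum("i,j,k->ijk", a, a, a)
        estimates.append((a, lam))
    return estimates

# Example: orthonormal components (the standard basis) with lambdas 5, 3, 2.
M2 = np.eye(3)
M3 = np.zeros((3, 3, 3))
for i, lam in enumerate([5.0, 3.0, 2.0]):
    M3[i, i, i] = lam
recovered = generalized_power_method(M2, M3, K=3, rng=np.random.default_rng(0))
```

On this orthonormal example the deflation rounds recover the three components with eigenvalues {5, 3, 2} (up to sign of the vectors); which component a given round finds depends on the random initialization, as Proposition 2 notes.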
We can illustrate this robustness under the assumption that the support of the foreground-specific topics, S_0 := ∪_{t∈A} supp(μ_t), is disjoint from that of the other topics, S_1 := ∪_{t∈B∪C} supp(μ_t) (similar to Brown clusters [19]). Suppose that M_2^f is estimated accurately using a large sample of foreground documents. Then because S_0 and S_1 are disjoint, Algorithm 1 (using sufficiently large γ) will accurately recover the topics {(μ_t, w_t^f) : t ∈ A} in Topics^f. The remaining concern is that sampling errors will cause Algorithm 1 to mistakenly return additional topics in Topics^f, namely the topics t ∈ B ∪ C. It thus suffices to guarantee that the signs of the λ̂_t returned by Algorithm 2 are correct. The sample size requirement for this is independent of the desired accuracy level for the foreground-specific topics; it depends only on γ and fixed properties of the background model.1 As reported in Section 3.2, this robustness is borne out in our experiments.

5.2 Scalability

Our algorithms are scalable to large datasets when implemented to exploit sparsity and low-rank structure (each experiment we report runs on a standard laptop in a few minutes). Two important details are (i) how the moments M_2 and M_3 are represented, and (ii) how to execute the power iteration update in Algorithm 2. These issues are only briefly mentioned in [5], and without proof, so in this section we address them in detail.

Efficient moment estimates for topic models. We first discuss how to represent empirical estimates of the second- and third-order moments M_2^f and M_3^f for the foreground documents (the same will hold for the background documents).
Let document n ∈ [N] have length ℓ_n, and let c_n ∈ N^D be its word count vector (its i-th entry c_n(i) is the number of times word i appears in document n).

Proposition 3 (Estimator for M_2^f). Assume ℓ_n ≥ 2. For any distinct i, j ∈ [D], E[(c_n(i)^2 − c_n(i))/(ℓ_n(ℓ_n − 1))] = [M_2^f]_{i,i} and E[c_n(i)c_n(j)/(ℓ_n(ℓ_n − 1))] = [M_2^f]_{i,j}.

By Proposition 3, an unbiased estimator of M_2^f is M̂_2^f := N^{−1} Σ_{n=1}^N (ℓ_n(ℓ_n − 1))^{−1} (c_n ⊗ c_n − diag(c_n)). Since M̂_2^f is a sum of sparse matrices, it can be represented efficiently, and we may use sparsity-aware methods for computing its low-rank spectral decompositions. It is similarly easy to obtain such a decomposition for M̂_2^f − γ M̂_2^b, from which one can compute its pseudoinverse and represent it in factored form as PQ^⊤ for some P, Q ∈ R^{D×K}.

Proposition 4 (Estimator for M_3^f). Assume ℓ_n ≥ 3. For any distinct i, j, k ∈ [D], E[(c_n(i)^3 − 3c_n(i)^2 + 2c_n(i))/(ℓ_n(ℓ_n − 1)(ℓ_n − 2))] = [M_3^f]_{i,i,i}, E[(c_n(i)^2 c_n(j) − c_n(i)c_n(j))/(ℓ_n(ℓ_n − 1)(ℓ_n − 2))] = [M_3^f]_{i,i,j}, and E[c_n(i)c_n(j)c_n(k)/(ℓ_n(ℓ_n − 1)(ℓ_n − 2))] = [M_3^f]_{i,j,k}.

By Proposition 4, an unbiased estimator of M_3^f(I, v, v) for any vector v ∈ R^D is M̂_3^f(I, v, v) := N^{−1} Σ_{n=1}^N (ℓ_n(ℓ_n − 1)(ℓ_n − 2))^{−1} (⟨c_n, v⟩^2 c_n − 2⟨c_n, v⟩(c_n ∘ v) − ⟨c_n, v ∘ v⟩ c_n + 2 c_n ∘ v ∘ v) (where ∘ denotes the component-wise product of vectors). Let nnz(c_n) be the number of non-zero entries in c_n; then each term in the sum takes only O(nnz(c_n)) operations to compute. So the time to compute M̂_3^f(I, v, v) is proportional to the number of non-zero entries of the term-document matrix, using just a single pass over the document corpus.

Power iteration computation.
Each power iteration update in Algorithm 2 just requires evaluating M̂_3^f(I, v, v) − γ M̂_3^b(I, v, v) (one-pass linear time, as shown above) for v := M̂_2^† u(i), and computing the deflation Σ_τ λ̂_τ ⟨â_τ, v⟩^2 â_τ (O(DK) time). Since M̂_2^† is kept in rank-K factored form
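To illustrate Propositions 3 and 4, here is a small sketch (our own; the function names are hypothetical, and a dense per-document loop stands in for a sparsity-aware implementation) of the unbiased estimator M̂_2^f and the one-pass contraction M̂_3^f(I, v, v), computed from a term-document count matrix:

```python
import numpy as np

def m2_hat(C, lens):
    """Unbiased estimator of M2^f (Proposition 3).

    C: (N, D) array of word counts c_n; lens: document lengths, each >= 2.
    Returns N^{-1} sum_n (l_n(l_n - 1))^{-1} (c_n (x) c_n - diag(c_n)).
    """
    N, D = C.shape
    M2 = np.zeros((D, D))
    for c, l in zip(C, lens):
        M2 += (np.outer(c, c) - np.diag(c)) / (l * (l - 1))
    return M2 / N

def m3_hat_Ivv(C, lens, v):
    """One-pass evaluation of M3_hat^f(I, v, v) (Proposition 4); lens >= 3.

    Each document contributes
      <c,v>^2 c - 2<c,v>(c o v) - <c, v o v> c + 2 c o v o v,
    scaled by 1/(l(l-1)(l-2)); with sparse count vectors each term costs
    only O(nnz(c)) operations.
    """
    N, D = C.shape
    out = np.zeros(D)
    for c, l in zip(C, lens):
        cv = c @ v
        out += (cv * cv * c - 2 * cv * (c * v)
                - (c @ (v * v)) * c + 2 * c * v * v) / (l * (l - 1) * (l - 2))
    return out / N
```

A sparse implementation replaces the dense row loop with operations on each document's non-zero entries only, so evaluating the contraction is a single pass over the non-zeros of the term-document matrix, as claimed above.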