{"title": "Joint Modeling of a Matrix with Associated Text via Latent Binary Features", "book": "Advances in Neural Information Processing Systems", "page_first": 1556, "page_last": 1564, "abstract": "A new methodology is developed for joint analysis of a matrix and accompanying documents, with the documents associated with the matrix rows/columns. The documents are modeled with a focused topic model, inferring latent binary features (topics) for each document. A new matrix decomposition is developed, with latent binary features associated with the rows/columns, and with imposition of a low-rank constraint. The matrix decomposition and topic model are coupled by sharing the latent binary feature vectors associated with each. The model is applied to roll-call data, with the associated documents defined by the legislation. State-of-the-art results are manifested for prediction of votes on a new piece of legislation, based only on the observed text legislation. The coupling of the text and legislation is also demonstrated to yield insight into the properties of the matrix decomposition for roll-call data.", "full_text": "Joint Modeling of a Matrix with Associated Text\n\nvia Latent Binary Features\n\nXianXing Zhang\nDuke University\n\nxianxing.zhang@duke.edu\n\nLawrence Carin\nDuke University\nlcarin@duke.edu\n\nAbstract\n\nA new methodology is developed for joint analysis of a matrix and accompanying\ndocuments, with the documents associated with the matrix rows/columns. The\ndocuments are modeled with a focused topic model, inferring interpretable latent\nbinary features for each document. A new matrix decomposition is developed,\nwith latent binary features associated with the rows/columns, and with imposition\nof a low-rank constraint. The matrix decomposition and topic model are coupled\nby sharing the latent binary feature vectors associated with each. 
The model is applied to roll-call data, with the associated documents defined by the legislation. Advantages of the proposed model are demonstrated for prediction of votes on a new piece of legislation, based only on the observed text of legislation. The coupling of the text and legislation is also shown to yield insight into the properties of the matrix decomposition for roll-call data.

1 Introduction

The analysis of legislative roll-call data provides an interesting setting for recent developments in the joint analysis of matrices and text [23, 8]. While the roll-call data matrix is typically binary, the modeling framework is general, in that it may be readily extended to categorical, integer or real observations. The problem is made interesting because, in addition to the matrix of votes, we have access to the text of the legislation (e.g., characteristic of the columns of the matrix, with each column representing a piece of legislation and each row a legislator). While roll-call data provides an interesting proving ground, the basic methodologies are applicable to any setting for which one is interested in analysis of matrices, and there is text associated with the rows or columns (e.g., the text may correspond to content on a website; each column of the matrix may represent a website, and each row an individual, with the matrix representing number of visits).

The analysis of roll-call data is of significant interest to political scientists [15, 6]. In most such research the binary data are typically analyzed with a probit or logistic link function, and the underlying real matrix is assumed to have rank one.
Each legislator and piece of legislation exists at a point along this one dimension, which is interpreted as characterizing a (one-dimensional) political philosophy (e.g., from "conservative" to "liberal").

Roll-call data analysis has principally been concerned with inferring the position of legislators in the one-dimensional latent space, with this dictated in part by the fact that the ability to perform prediction is limited. As in much matrix-completion research [17, 18], one typically can only infer votes that are missing at random. It is not possible to predict the votes of legislators on a new piece of legislation (for which, for example, an entire column of votes is missing). This has motivated the joint analysis of roll-call votes and the associated legislation [23, 8]: by modeling the latent space of the text legislation with a topic model, and making connections between topics and the latent space of the matrix decomposition, one may infer votes for an entire missing column of the matrix, assuming access to the text associated with that new legislation.

While the research in [23, 8] showed the potential of joint text-matrix analysis, there were several open questions that motivate this paper. In [23, 8] a latent Dirichlet allocation (LDA) [5] topic model was employed for the text. It has been demonstrated that LDA yields inferior perplexity scores when compared to modern Bayesian topic models, such as the focused topic model (FTM) [24]. Another significant issue with [23, 8] concerns how the topic (text) and matrix models are coupled. In [23, 8] the frequency with which a given topic is utilized in the text legislation is used to infer the associated matrix parameters (e.g., to infer the latent feature vector associated with the respective column of the matrix).
This is undesirable, because the frequency with which a topic is used in a document is characteristic of the style of writing: there may be a topic that is only mentioned briefly in the document but that is critical to the outcome of the vote, while other topics may not impact the vote yet are discussed frequently in the legislation. We also wish to move beyond the rank-one matrix assumption in [15, 6, 8].

Motivated by these limitations, in this paper the FTM is employed to model the text of legislation, with each piece of legislation characterized by a latent binary vector that defines the sparse set of associated topics. A new probabilistic low-rank matrix decomposition is developed for the votes, utilizing latent binary features; this leverages the merits of what were previously two distinct lines of matrix-factorization methods [13, 17]. Unlike previous approaches, the rank is not fixed a priori but inferred adaptively, with theoretical justification. For a piece of legislation, the latent binary feature vectors for the FTM and matrix decomposition are shared, yielding a new means of jointly modeling text and matrices. This linkage between text and matrices is innovative because: (i) it is based on whether a topic is relevant to a document/legislation, not on the frequency with which the topic is used in the document (i.e., not on the style of writing); (ii) it enables interpretation of the underlying latent binary features [13, 9] based upon available text data. The rest of the paper is organized as follows. Section 2 first reviews the focused topic model, then introduces a new low-rank matrix decomposition method, and then the joint model of the two. Section 3 discusses posterior inference.
In Section 4 quantitative results are presented for prediction of columns of roll-call votes based on the associated text legislation, and the joint model is demonstrated qualitatively to infer meaning/insight for the characteristics of legislation and voting patterns; Section 5 concludes.

2 Model and Analysis

2.1 Focused topic modeling

The focused topic model (FTM) [24] was developed to address a limitation of related models based on the hierarchical Dirichlet process (HDP) [21]: the HDP shares a set of "global" topics across all documents, and each topic is in general manifested with non-zero probability in each document. This property of the HDP tends to yield less "focused" or descriptive topics. It is desirable to share a set of topics across all documents, but with the additional constraint that a given document only utilize a small subset of the topics; this tends to yield more descriptive/focused topics, characteristic of detailed properties of the documents. An FTM is manifested as a compound linkage of the Indian buffet process (IBP) [10] and the Dirichlet process (DP). Each document draws latent binary features from an IBP to select a finite subset of atoms/topics from the DP.
In the model details, the DP is represented in terms of a normalized gamma process [7] with weighting by the binary feature vector, constituting a document-specific topic distribution in which only a subset of topics is manifested with non-zero probability.

The key components of the FTM are summarized as follows [24]:

$$b_{jt}\mid\pi_t \sim \mathrm{Bernoulli}(b_{jt}\mid\pi_t), \quad \pi_t = \prod_{l=1}^{t}\nu_l, \quad \nu_t\mid\alpha_r \sim \mathrm{Beta}(\nu_t\mid\alpha_r, 1)$$
$$\theta_j\mid\{b_{j:}, \lambda\} \sim \mathrm{Dirichlet}(\theta_j\mid b_{j:} \odot \lambda), \quad \lambda_t\mid\gamma \sim \mathrm{Gamma}(\lambda_t\mid\gamma, 1) \qquad (1)$$

where $b_{jt} \in \{0, 1\}$ indicates whether document $j$ uses topic $t$, and is modeled as drawn from an IBP parameterized by $\alpha_r$ under the stick-breaking construction [20], as shown in the first line of (1).¹ $\lambda = \{\lambda_t\}_{t=1}^{K_r}$ represents the relative mass on the $K_r$ topics ($K_r$ could be infinite in principle); $\lambda$ is shared across all documents, analogous to the "top layer" of the HDP. $\theta_j$ is the topic distribution for the $j$th document, and the expression $b_{j:} \odot \lambda$ denotes the pointwise vector product between $b_{j:}$ and $\lambda$, thereby selecting a subset of topics for document $j$ (those for which the corresponding components of $b_{j:}$ are non-zero). The rest of the FTM is constructed similarly to LDA [5]: for each token $n$ in document $j$, a topic indicator is drawn as $z_{jn}\mid\theta_j \sim \mathrm{Mult}(z_{jn}\mid 1, \theta_j)$.

¹Throughout this paper the notation $b_{ij}$ denotes the entry located at the $i$th row and $j$th column of matrix $B$; $b_{j:}$ and $b_{:k}$ denote the $j$th row and $k$th column of $B$, respectively.
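The document-level draws in (1) can be illustrated with a brief simulation (a minimal sketch with illustrative variable names; the small shape offset simply avoids a zero Dirichlet parameter for unused topics):

```python
import numpy as np

rng = np.random.default_rng(0)
K_r, alpha_r, gamma = 50, 1.0, 5.0

# IBP stick-breaking weights: pi_t = prod_{l<=t} nu_l, with nu_l ~ Beta(alpha_r, 1)
nu = rng.beta(alpha_r, 1.0, size=K_r)
pi = np.cumprod(nu)

# Global relative topic masses lambda_t ~ Gamma(gamma, 1), shared across documents
lam = rng.gamma(gamma, 1.0, size=K_r)

def draw_document_topics():
    """Draw one document's binary topic usage b_j: and topic distribution theta_j."""
    b = rng.random(K_r) < pi                # b_jt ~ Bernoulli(pi_t)
    if not b.any():                         # guard: ensure at least one active topic
        b[0] = True
    # Dirichlet(b * lambda) via normalized gamma draws (normalized gamma process)
    g = rng.gamma(b * lam + 1e-12, 1.0)
    theta = g / g.sum()
    return b, theta

b, theta = draw_document_topics()
# theta places essentially all of its mass on the topics selected by b
```

The draw of $\theta_j$ via normalized gammas is exactly the construction referred to above: topics with $b_{jt}=0$ receive (essentially) zero probability in the document.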
Conditional on $z_{jn}$ and the topics $\{\beta_k\}_{k=1}^{K_r}$, a word is drawn as $w_{jn}\mid z_{jn}, \{\beta_k\}_{k=1}^{K_r} \sim \mathrm{Mult}(w_{jn}\mid 1, \beta_{z_{jn}})$, where $\beta_k\mid\eta \sim \mathrm{Dirichlet}(\beta_k\mid\eta)$.

Although in (1) $b_{j:}$ is mainly designed to map the global prevalence of topics across the corpus, $\lambda$, to a within-document proportion of topic usage, $\theta_j$, the latent features $b_{j:}$ are informative in their own right, as they indicate which subset of topics is relevant to a given document. The document-dependent topic usage $b_{j:}$ may be more important than $\theta_j$ when characterizing the meaning of a document: $\theta_j$ specifies the frequency with which each of the selected topics is utilized in document $j$ (this is related to writing style – verbosity or parsimony – and less related to meaning); it may be more important to just know which underlying topics are used in the document, as characterized by $b_{j:}$. We therefore make the linkage between documents and an associated matrix via the $b_{j:}$, not via $\theta_j$ (whereas [23, 8] base the document-matrix linkage on $\theta_j$ or its empirical estimate).

2.2 Matrix factorization with binary latent factors and a low-rank assumption

Binary matrix factorization (BMF) [13, 14] is a general framework in which a real latent matrix $X \in \mathbb{R}^{P\times N}$ is decomposed as $X = LHR^T$, where $L \in \{0,1\}^{P\times K_l}$ and $R \in \{0,1\}^{N\times K_r}$ are binary, and $H \in \mathbb{R}^{K_l\times K_r}$ is real. The rows of $L$ and $R$ are modeled via IBPs, parameterized by $\alpha_l$ and $\alpha_r$ respectively, and $K_l$ and $K_r$ are the truncation levels for the IBPs, which again can be infinite in principle. The observed matrix is $Y$, which may be real, binary, or categorical [12]. The observations are modeled in an element-wise fashion: $y_{ij} = f(x_{ij})$.
We focus on binary observed matrices, $Y \in \{0,1\}^{P\times N}$, and utilize $f(\cdot)$ as a probit model [2]:

$$y_{ij} = \begin{cases} 1 & \text{if } \hat{x}_{ij} \geq 0 \\ 0 & \text{if } \hat{x}_{ij} < 0 \end{cases} \qquad (2)$$

with $\hat{x}_{ij} = x_{ij} + \epsilon_{ij}$, where $\epsilon_{ij} \sim \mathcal{N}(0, 1)$.

We generalize the BMF framework by imposing that $H$ be low-rank. Specifically, we impose the rank-1 expansion $H = \sum_{k=1}^{K_c} u_{:k}v_{:k}^T$, where $u_{:k}$ and $v_{:k}$ are column vectors (thus their outer product is a rank-1 matrix), each modeled here by a Gaussian distribution:

$$u_{:k} \sim \mathcal{N}(u_{:k}\mid 0, I_{K_l}), \qquad v_{:k} \sim \mathcal{N}(v_{:k}\mid 0, I_{K_r}) \qquad (3)$$

and $K_c$ is the number of such rank-1 matrices, with $K_c < \min(K_l, K_r)$, i.e., $H$ is low-rank.

To motivate this model, consider the representation $H = \sum_{k=1}^{K_c} u_{:k}v_{:k}^T$ in the decomposition $X = LHR^T$, which implies $X = \sum_{k=1}^{K_c} (Lu_{:k})(Rv_{:k})^T$. Therefore, we may also express $X = \Psi\Phi^T$, with $\Psi \in \mathbb{R}^{P\times K_c}$ and $\Phi \in \mathbb{R}^{N\times K_c}$; the $k$th column of $\Psi$ is defined by $Lu_{:k}$ and the $k$th column of $\Phi$ by $Rv_{:k}$. Consequently, the low-rank assumption for $H$ yields a low-rank model $X = \Psi\Phi^T$, precisely as in [17, 18]. Thus the definition of $\Psi$ and $\Phi$ via the binary matrices $L$ and $R$ and the linkage matrix $H$ merges what were previously two distinct lines of matrix-factorization methods. In the context of the application considered here, the decomposition $X = LHR^T$ will prove convenient, as we may share the binary matrices $L$ or $R$ with the topic usage of available documents. The binary features in $L$ and $R$ are therefore characteristic of the presence/absence of underlying topics, or related latent processes, and the matrix $H$ provides the mapping of how these binary features map to observed data.

However, how to specify $K_c$ remains an open question for the above low-rank construction.
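A minimal numerical sketch of this construction (illustrative sizes, and Bernoulli stand-ins for the IBP draws; not the authors' implementation) makes the equivalence $X = LHR^T = \Psi\Phi^T$ concrete:

```python
import numpy as np

rng = np.random.default_rng(1)
P, N, K_l, K_r, K_c = 20, 30, 8, 10, 3   # K_c < min(K_l, K_r): H is low-rank

# Binary feature matrices (rows drawn from IBPs in the full model)
L = (rng.random((P, K_l)) < 0.3).astype(float)
R = (rng.random((N, K_r)) < 0.3).astype(float)

# H = sum_k u_:k v_:k^T, a sum of K_c rank-1 matrices, as in (3)
U = rng.normal(size=(K_l, K_c))          # columns u_:k ~ N(0, I_{K_l})
V = rng.normal(size=(K_r, K_c))          # columns v_:k ~ N(0, I_{K_r})
H = U @ V.T

# Low-rank factorization view: Psi = LU, Phi = RV, so X = Psi Phi^T
X = L @ H @ R.T
Psi, Phi = L @ U, R @ V
assert np.allclose(X, Psi @ Phi.T)

# Probit observation (2): y_ij = 1 iff x_ij + noise >= 0
Y = (X + rng.normal(size=(P, N)) >= 0).astype(int)
```

The assertion checks the identity used in the text: the binary-feature decomposition and the $\Psi\Phi^T$ low-rank view are the same matrix.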
As a contribution of this paper, we provide a new means of imposing a low-rank model within the prior. We model the "significance" of each rank-1 term in the expansion explicitly, using a stochastic process $\{s_k\}_{k=1}^{K_c}$; $H$ can therefore be decomposed as $H = \sum_{k=1}^{K_c} s_k u_{:k}v_{:k}^T$, where $K_c$ can be infinite in principle. As a result, the hierarchical representation for modeling the latent matrix $X$ in the probit model can be summarized as:

$$\hat{x}_{ij}\mid\big\{l_{i:}, r_{j:}, \{u_{:k}, v_{:k}, s_k\}_{k=1}^{K_c}\big\} \sim \mathcal{N}\Big(\hat{x}_{ij}\,\Big|\,\textstyle\sum_{k=1}^{K_c} s_k (l_{i:}u_{:k})(r_{j:}v_{:k})^T,\; 1\Big) \qquad (4)$$

Note that $s_k$ in (4) is similar in spirit to a singular value in the SVD. Intuitively, we wish to impose that $|s_k|$ decrease "fast" with increasing index $k$, so that the rank-1 matrices with large indices have negligible impact in (4); $K_c$ thereby plays a role similar to the truncation level in the stick-breaking constructions for the DP [11] and IBP [20]. To achieve this end, we model each $s_k$ as a Gaussian random variable with a conjugate multiplicative gamma process (MGP) placed on its precision parameter:

$$s_k\mid\tau_k \sim \mathcal{N}(s_k\mid 0, \tau_k^{-1}), \qquad \tau_k = \prod_{l=1}^{k}\delta_l, \qquad \delta_l\mid\alpha_c \sim \mathrm{Gamma}(\delta_l\mid\alpha_c, 1) \qquad (5)$$

The MGP was originally proposed in [3] for learning sparse factor models, and was further extended to tree-structured sparse factor models [26] and the change-point stick-breaking process [25]; one of its properties is that it increasingly shrinks $s_k$ towards zero as the index $k$ increases. Next we make the above intuition rigorous.
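The shrinkage that (5) induces on the $s_k$ can be seen in a short simulation (illustrative constants, not tied to the experiments later in the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha_c, K_c = 3.0, 25

# MGP: tau_k = prod_{l<=k} delta_l, with delta_l ~ Gamma(alpha_c, 1).
# For alpha_c > 1 the precisions tau_k grow (on average) geometrically in k.
delta = rng.gamma(alpha_c, 1.0, size=K_c)
tau = np.cumprod(delta)

# s_k ~ N(0, 1/tau_k): growing precision shrinks |s_k| toward zero as k increases
s = rng.normal(0.0, 1.0 / np.sqrt(tau))

print(np.abs(s[:3]))    # early terms: typically non-negligible magnitudes
print(np.abs(s[-3:]))   # late terms: typically many orders of magnitude smaller
```

This is the behavior that Theorem 1 and Lemma 1 make precise: the tail terms of the rank-1 expansion become negligible, so a finite truncation loses little.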
Theorem 1 below formally states that if $s_k$ is modeled by the MGP as in (5), the rank-1 expansion in (4) converges as $K_c \to \infty$.

Theorem 1. When $\alpha_c > 1$, the sequence $\sum_{k=1}^{K_c} s_k (l_{i:}u_{:k})(r_{j:}v_{:k})^T$ converges in $\ell_2$, as $K_c \to \infty$.

Although in the MGP $K_c$ is unbounded [3], for computational considerations we would like to truncate it to a finite value $K_c \ll \max(P, N)$, without much loss of information. As justification, the following theoretical bound is obtained, in a manner similar to its counterparts for the DP [11].

Lemma 1. Denoting $M_{ij}^{K_c} = \sum_{k=K_c+1}^{\infty} s_k (l_{i:}u_{:k})(r_{j:}v_{:k})^T$, then $\forall \epsilon > 0$ we have $p\{(M_{ij}^{K_c})^2 > \epsilon\} < \frac{ab(1-1/\alpha_c)}{\epsilon\,\alpha_c^{K_c}}$, where $a = \max_k E(l_{i:}u_{:k})^2$ and $b = \max_k E(r_{j:}v_{:k})^2$.

Lemma 1 states that, when $\alpha_c > 1$, the approximation error introduced by the truncation level $K_c$ decays exponentially fast to 0 as $K_c \to \infty$. In Section 3 an MCMC method is developed to adaptively choose $K_c$ at each iteration, which frees us from fixing it a priori. The proofs of Theorem 1 and Lemma 1 can be found in the Supplemental Material.

2.3 Joint learning of FTM and BMF

Via the FTM and BMF frameworks of the previous subsections, each piece of legislation $j$ is represented by two latent binary feature vectors, $b_{j:}$ and $r_{j:}$. To jointly model the matrix of votes with the associated text of legislation, a natural choice is to impose $b_{j:} = r_{j:}$. As a result, the full joint model is specified by equations (1)–(5), with $b_{jt}$ in (1) replaced by $r_{jt}$. Note that the joint model links the topics characteristic of the text to the latent binary features characteristic of legislation in the matrix decomposition; this linkage leverages the statistical strength of the two data sources across the latent variables of the joint model during posterior inference.
A graphical representation of the joint model can be found in the Supplemental Material.

In the context of the model for $Y = f(X)$, with $X = LHR^T$, if one were to learn $L$ and $H$ based upon available training data, then a new piece of legislation $y_{:N+1}$ could be predicted if we had access to $r_{:N+1}$. Via the construction above, not only do we gain a predictive advantage (the new legislation's latent binary features $r_{:N+1}$ can be obtained by modeling its document as in (1)), but the model also provides powerful interpretative insights. Specifically, the topics inferred from the documents may be used to interpret the latent binary features associated with the matrix factorization. These advantages will be demonstrated through experiments on legislative roll-call data in Section 4.

2.4 Related work

The ideal point topic model (IPTM) was developed in [8], where the supervised latent Dirichlet allocation (sLDA) [4] model was used to link empirical topic-usage frequencies to the latent factors via regression. In that work the dimension of the latent factors was set to 1, i.e., fixing $K_c = 1$ in our nomenclature. In [23] the authors proposed to jointly analyze the voting matrix and the associated text through a mixture model, where each legislation's latent feature factor is assigned to a mixture component coupled with that legislation's document topic distribution $\theta$. Note that in their case each piece of legislation can belong to only one cluster, while in our case the latent binary features for each document can be effectively treated as being grouped into multiple clusters [13] (a mixed-membership model, manifested in terms of the binary feature vectors). Similar research linking collaborative filtering and topic models can also be found in web content recommendation [1], movie recommendation [19], and scientific paper recommendation [22].
None of these methods makes use of the binary indicators as the characterization of the associated documents; instead they perform the linking via the topic distribution $\theta$ and the latent (real) features, in different ways.

3 Posterior Inference

We use Gibbs sampling for posterior inference over the latent variables, and only the sampling equations that are unique to this model are discussed here. The rest are similar to those in [24, 13]. In the following we use $p(\cdot\mid -)$ to denote the conditional posterior of one variable given all others.

Sampling $\{v_{:k}, u_{:k}\}_{k=1:K_c}$: Based on (3) and (4), the conditional posterior of $v_{:k}$ can be written as $p(v_{:k}\mid -) \propto \prod_{j=1}^{N} \mathcal{N}\big(\hat{x}_{:j}\mid \sum_{k=1}^{K_c} s_k (Lu_{:k})(r_{j:}v_{:k}), 1\big)\, \mathcal{N}(v_{:k}\mid 0, I_{K_r})$. It can be shown that $p(v_{:k}\mid -) = \mathcal{N}(v_{:k}\mid \mu_{v_{:k}}, \Sigma_{v_{:k}})$, with mean $\mu_{v_{:k}} = s_k \Sigma_{v_{:k}} \sum_{j=1}^{N} (Lu_{:k}r_{j:})^T \tilde{x}_{:j}^{-k}$ and covariance matrix $\Sigma_{v_{:k}} = \big[I_{K_r} + s_k^2 \sum_{j=1}^{N} (Lu_{:k}r_{j:})^T (Lu_{:k}r_{j:})\big]^{-1}$, where $\tilde{x}_{:j}^{-k} = \hat{x}_{:j} - \sum_{l\neq k} s_l (Lu_{:l})(r_{j:}v_{:l})$ is the residual of $\hat{x}_{:j}$ with the $k$th rank-1 term removed. By repeating the above procedure, $p(u_{:k}\mid -)$ can be derived similarly.

Sampling $\{s_k\}_{k=1:K_c}$: Based on (4) and (5), the conditional posterior of $s_k$ can be written as $p(s_k\mid -) \propto \prod_{j=1}^{N} \mathcal{N}\big(\hat{x}_{:j}\mid \sum_{k=1}^{K_c} s_k (Lu_{:k})(r_{j:}v_{:k}), 1\big)\, \mathcal{N}(s_k\mid 0, \tau_k^{-1})$. It can be shown that $p(s_k\mid -) = \mathcal{N}(s_k\mid \mu_{s_k}, \sigma_{s_k}^2)$, with mean $\mu_{s_k} = \sigma_{s_k}^2 \sum_{j=1}^{N} \big((Lu_{:k})(r_{j:}v_{:k})\big)^T \tilde{x}_{:j}^{-k}$ and variance $\sigma_{s_k}^2 = 1/\big(\tau_k + \sum_{j=1}^{N} ((Lu_{:k})(r_{j:}v_{:k}))^T ((Lu_{:k})(r_{j:}v_{:k}))\big)$.

Sampling $\{\tau_k, \delta_k\}_{k=1:K_c}$: Based on (5), given a fixed truncation level $K_c$, $\delta_k$ can be sampled directly from its posterior distribution: $p(\delta_k\mid -) = \mathrm{Gamma}\big(\delta_k \mid \alpha_c + \frac{K_c-k+1}{2},\; 1 + \frac{1}{2}\sum_{l=k}^{K_c} \nu_l^{(k)} s_l^2\big)$, where $\nu_l^{(k)} = \prod_{t=1, t\neq k}^{l} \delta_t$.
$\tau_k$ can then be reconstructed from $\delta_{1:k}$ as in (5).

Sampling $\{r_{jt}\}_{j=1:N,\,t=1:K_r}$: Similar to the derivation in [24], $p(r_{jt} = 1\mid -) = 1$ if $N_{jt} > 0$, where $N_{jt}$ denotes the number of times document $j$ uses topic $t$. When $N_{jt} = 0$, based on (1) and (4) the conditional posterior of $r_{jt}$ can be written as $p(r_{jt} = 1\mid -) \propto \frac{\pi_t}{\pi_t + 2^{\lambda_t}(1-\pi_t)} \exp\big\{-\frac{1}{2}\big[(Lh_{t:}^T)^T(Lh_{t:}^T) - 2(Lh_{t:}^T)^T \tilde{x}_{:j}^{-t}\big]\big\}$, where $h_{t:}$ represents the $t$th row of $H^T$ (with $H = \sum_{k=1}^{K_c} s_k u_{:k}v_{:k}^T$) and $\tilde{x}_{:j}^{-t}$ is the residual of $\hat{x}_{:j}$ with the contribution of feature $t$ removed; and $p(r_{jt} = 0\mid -) \propto \frac{2^{\lambda_t}(1-\pi_t)}{\pi_t + 2^{\lambda_t}(1-\pi_t)}$. $\{l_{it}\}_{i=1:P,\,t=1:K_l}$ is sampled as described in [13].

Adaptive sampler for the MGP: The above Gibbs sampler needs a predefined truncation level $K_c$. In [3, 26] the authors proposed an adaptive sampler, tuning $K_c$ as the sampler progresses, with convergence of the chain guaranteed [16]. Specifically, the adaptation procedure is triggered with probability $p(t) = \exp(z_0 + z_1 t)$ at the $t$th iteration, with $z_0, z_1$ chosen so that adaptation occurs frequently at the beginning of the chain but decreases in frequency exponentially fast. When adaptation is triggered at the $t$th iteration, let $q_\kappa(t) = \{k \mid d_\infty(s_k L u_{:k}v_{:k}^T R^T) \leq \kappa\}$ denote the indices of the rank-1 matrices whose maximum-magnitude entry is less than a predefined threshold $\kappa$; these intuitively have a negligible contribution at the $t$th iteration, and are thus deleted, and $K_c$ decreases.
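The adaptation step can be sketched as follows (a simplified sketch with illustrative names: each rank-1 term whose largest-magnitude entry falls below the threshold $\kappa$ is deleted, and if none qualifies one new term is drawn from the prior):

```python
import numpy as np

def adapt_rank(s, U, V, L, R, kappa=0.05, rng=None):
    """One MGP adaptation step over the rank-1 terms s_k (L u_:k)(R v_:k)^T."""
    rng = rng or np.random.default_rng(3)
    keep = [k for k in range(len(s))
            if np.abs(s[k] * np.outer(L @ U[:, k], R @ V[:, k])).max() > kappa]
    if len(keep) < len(s):                 # q_kappa nonempty: delete negligible terms
        return s[keep], U[:, keep], V[:, keep]
    # q_kappa empty: grow the truncation by one term drawn from the prior
    u_new = rng.normal(size=(U.shape[0], 1))
    v_new = rng.normal(size=(V.shape[0], 1))
    return np.append(s, rng.normal()), np.hstack([U, u_new]), np.hstack([V, v_new])
```

Deleting terms shrinks $K_c$ by the number of negligible rank-1 matrices; otherwise $K_c$ grows by one.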
On the other hand, if $q_\kappa(t)$ is empty, this suggests that more rank-1 matrices are needed; in this case we increase $K_c$ by one and draw $u_{:K_c}$ and $v_{:K_c}$ from their respective prior distributions.

4 Experimental Results

4.1 Experiment setting

We have performed joint matrix and text analysis on data from the House of Representatives (House), sessions 106–111²; we model each session's roll-call votes separately as a binary matrix $Y$. Entry $y_{ij} = 1$ denotes that the $i$th legislator's response to legislation $j$ is either "Yea" or "Yes", and $y_{ij} = 0$ denotes that the corresponding response is either "Nay" or "No". The data are preprocessed in the same way as described in [8]. We recommend setting the IBP hyperparameters $\alpha_l = \alpha_r = 1$, the MGP hyperparameter $\alpha_c = 3$, the FTM hyperparameter $\gamma = 5$, and the topic-model hyperparameter $\eta = 0.01$. We also considered using a random-walk MH algorithm with non-informative gamma priors to infer those hyperparameters, as described in [24, 3], and the Markov chain manifested similar mixing performance. The truncation level $K_c$ in the MGP is not fixed, but inferred by the adaptive sampler, with the threshold parameter $\kappa$ set to 0.05 (it is recommended to be set small for most applications). In the study below, for each model we run 5000 iterations of the Gibbs sampler, with the first 1000 iterations discarded as burn-in; 400 samples are then collected, taking every tenth iteration, to perform Bayesian estimation of the objects of interest.

²These data are available from thomas.loc.gov

4.2 Predicting random missing votes

In this section we study the classical problem of estimating the values of matrix data that are missing uniformly at random (in-matrix missing votes), without the use of associated documents. We compare the model proposed in (4) to the probabilistic matrix factorization (PMF) found in [17, 18].
This is done by decomposing the latent matrix $X = \Psi\Phi^T$, where each row of $\Psi$ and $\Phi$ is drawn from a Gaussian distribution with mean and covariance matrix modeled by a Gaussian-Wishart distribution. To study the behavior of the proposed MGP prior in (5), we (i) vary the number of columns (rank) $K_c$ in $\Psi$ and $\Phi$ as a free parameter, and call this model PMF; and (ii) incorporate the MGP into the decomposition $X = \Psi S\Phi^T$, where $S \in \mathbb{R}^{K_c\times K_c}$ is a diagonal matrix with each diagonal element specified as $s_k$. The model in (ii) is called PMF+MGP. Additionally, to check whether the low-rank assumption detailed in Section 2.2 is effective for BMF, we also compare against the BMF model originally proposed in [13], which we term BMF-Original.

We compared these models on predicting missing values selected uniformly at random, with different percentages (90%, 95%, 99%) of missingness. This study was performed on House data from sessions 106 to 111; however, to conserve space we summarize only the experimental results on the 110th House data, in Figure 1; similar results are observed across all sessions. In Figure 1 each panel corresponds to a certain percentage of missingness; the horizontal axis is the number of columns (rank), which varies as a free parameter of PMF, while the vertical axis is the prediction accuracy. The MGP is observed to be generally effective in modeling the rank across the three panels, and the low-rank assumption is critical for good BMF performance. When the percentage of missingness is relatively low, e.g., 90% or 95%, PMF performs better than BMF; however, when the percentage of missingness is high, e.g., 99%, BMF (with the low-rank assumption) is very competitive with PMF.
This is probably because of the way BMF encourages the sharing of statistical strength among all rows and columns via the matrix $H$, as described in [13], which is most effective when data are scarce.

4.3 Predicting new bills based on text

We study the predictive power of the proposed model when the legislative roll-call votes and the associated bill documents are modeled jointly, as described in Section 2.3. We compare our proposed model with the IPTM of [8], in which the authors fixed the rank to $K_c = 1$; we term this model IPTM($K_c = 1$). In [8] the authors suggested that fixing the rank to one might be over-restrictive; we therefore also propose to model the rank in the ideal point model using the MGP, in a manner similar to what was done for the PMF model, and call this model IPTM. We also compare our model with that in [23], where the authors proposed to combine a factor-analysis model and a topic model via a compounded mixture model, with all sessions of roll-call data modeled jointly via a Markov process. Since our main goal is to predict new bills, not to model the matrices dynamically, in the following experiments we remove the Markov process and model each session of House data separately; we call this model FATM. In [23] the authors proposed to use a beta-Bernoulli distributed binary variable $b_k$ to model whether the $k$th rank-1 matrix is used in the matrix decomposition. When performing posterior inference we find that $b_k$ tends to be easily trapped in local maxima, whereas with the MGP, which models the significance of usage (rather than the binary usage) of each rank-1 matrix via $s_k$, smoother estimates and better mixing were observed.

For each session the bills are partitioned into 6 folds, and we iteratively remove a fold and train the model with the remaining folds; predictions are then performed on the bills in the removed fold. The experimental results are summarized in Figure 2.
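Once the latent variables are learned, predicting an entire held-out column reduces to inferring the new bill's binary features from its text and thresholding $LHr^T$, as in Section 2.3. A minimal sketch, with made-up stand-ins for the inferred quantities:

```python
import numpy as np

rng = np.random.default_rng(4)
P, K_l, K_r, K_c = 10, 6, 8, 3

# Stand-ins for quantities inferred from training votes and documents
L = (rng.random((P, K_l)) < 0.4).astype(float)   # legislators' binary features
U = rng.normal(size=(K_l, K_c))
V = rng.normal(size=(K_r, K_c))
s = rng.normal(size=K_c)
H = (U * s) @ V.T                                # H = sum_k s_k u_:k v_:k^T

# Binary topic usage of the held-out bill, inferred from its text via (1)
r_new = (rng.random(K_r) < 0.3).astype(float)

x_hat = L @ H @ r_new                            # latent column for the new bill
votes = (x_hat >= 0).astype(int)                 # probit MAP prediction of yea/nay
```

The text is what supplies `r_new` for a bill with no observed votes; the vote matrix alone could not produce it.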
Note that since $r_{j:}$ is modeled via the stick-breaking construction of the IBP as in (1), the total number of latent binary features $K_r$ is unbounded, and we face the risk of having the latent binary features important for explaining the votes $Y$ and those important for explaining the associated text learned separately. This may lead to the undesirable consequence that the latent features learned from text are not discriminative in predicting a new piece of legislation. To reduce this risk, in practice we could either set $\alpha_r$ such that it strongly favors fewer latent binary features, or truncate the stick-breaking construction at a predefined level $K_r$. For a clearer comparison with the other models, where the number of topics is fixed, we choose the second approach and let $K_r$ vary as the maximum number of possible topics.

Figure 1: Comparison of prediction accuracy for votes missing uniformly at random, for the 110th House data. Different panels correspond to different percentages of missingness; for each panel the vertical axis represents accuracy and the horizontal axis represents the rank set for PMF. For PMF+MGP and our proposed method, the inferred rank $K_c$ is shown for the most-probable collection sample.

Figure 2: Prediction accuracy for held-out legislation across the 106th–111th House data; prediction of an entire column of missing votes based on text. In each panel the vertical axis represents accuracy and the horizontal axis represents the number of topics used for each model. Results are averaged across the 6 folds, with variances too small to see.

Across all sessions IPTM consistently performs better than its counterpart with $K_c = 1$; this again demonstrates the effectiveness of the MGP in modeling the rank. Although there is no significant advantage for our proposed model when the truncation on the number of topics $K_r$ (horizontal axis) is small (e.g., 30–50), over-fitting is observed for all models except our proposed model.
As we increase the number of topics, the performance of the other models drops significantly (vertical axis). Across all sessions, the best quantitative results are obtained by the proposed model when $K_r > 100$.

4.4 Latent binary feature interpretation

In this study we partition all the bills into two groups: (i) bills for which there is near-unanimous agreement, with "Yea" or "Yes" votes exceeding 90%; and (ii) contentious bills, with the percentage of votes received as "Yea" or "Yes" less than 60%. By linking the inferred binary latent features to the topics for those two groups, we can gain insight into the characteristics of legislation and voting patterns, e.g., what influenced a near-unanimous yes vote, and what influenced more contention. Figure 3 compares the latent-feature usage patterns of those two groups; the horizontal axis represents the latent features, where we set $K_r = 100$ for illustration purposes, and the vertical axis is the aggregate frequency with which a feature/topic is used by all the bills in each of the two groups. The frequency is normalized within each group for easy interpretation. For each group, we select three discriminative features: ones heavily used in one group but rarely used in the other (these selected features are highlighted in blue/red). For example, in the left panel the features highlighted in blue are widely used by bills in the left group, but rarely used by bills in the right group.
Figure 3: Comparison of the frequencies of binary feature usage between the two groups of bills. Left: near-unanimous affirmative bills (bills with more than 90% of votes received as "Yes" or "Yea"). Right: contentious bills (bills with less than 60% of votes received as "Yes" or "Yea"). Data from the 110th House, with Kr = 100. The vertical axis represents the normalized frequency of feature/topic usage within the corresponding group. The six most discriminative features/topics (labeled in the figure) are shown in Table 1.

Table 1: Six discriminative topics of unanimously agreed/highly debated bills learned from the 110th House of Representatives, with the top-ten most probable words shown. (R) and (B) denote topics depicted in Figure 3 in red and blue, respectively.

TOPIC 22 (B): CHILDREN, CHILD, YOUTH, PORNOGRAPHY, INTERNET, FATHER, FAMILY, PARENT, SCHOOL
TOPIC 31 (R): CONCURRENT RESOLUTION, ADJOURN, MAJORITY LEADER, DESIGNEE, AVIATION, RECESS, MINORITY LEADER, FEBRUARY, MOTION OFFER
TOPIC 38 (R): TAX, CORPORATION, TAXABLE, CREDIT, PENALTY, REVENUE, TAXPAYER, SPECIAL, FILE
TOPIC 62 (B): PEOPLE, WORLD, HOME, SANITATION, WATER, INTERNATIONAL, SOUTHERN, COMPENSATION, ASSOCIATION, ECONOMIC
TOPIC 73 (B): NATION, ATTACK, TERRORIST, PEOPLE, SEPTEMBER, VOLUNTEER, CITIZEN, PAKISTAN, LEGITIMATE, FUTURE
TOPIC 83 (R): CLAUSE, PRINT, WAIVE, SUBSTITUTE, COMMITTEE AMENDMENT, READ, DEBATE, OFFER, DIVIDE AND CONTROL, MOTION, EMERGENCY, STAND

As observed from Figure 3, the learned binary features are discriminative, as the usage patterns of the two groups are quite different.
We also study the interpretation of the latent features by linking them to the topics inferred from the texts. As an example, the six highlighted features are linked to their corresponding topics and depicted in Table 1, with the top-ten most probable words within each topic shown. From Table 1 we can read that the unanimously agreed bills are highly likely to relate to topics about the education of youth (Topic 22) or the prevention of terrorism (Topic 73), while the bills from the contentious group tend to be more related to making amendments to an existing piece of legislation (Topic 83) or to taxation (Topic 38).
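Linking a binary feature to its topic amounts to reading off the most probable words from that topic's word distribution. A minimal sketch with synthetic data follows; the vocabulary, topic-word matrix `phi`, and function name are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical topic-word distributions: Kr topics over a small vocabulary.
vocab = np.array(["tax", "credit", "children", "school", "water",
                  "terrorist", "amendment", "motion", "internet", "citizen"])
phi = rng.dirichlet(np.ones(len(vocab)), size=100)  # each row sums to 1

def top_words(topic_id, n=5):
    """Return the n most probable words of one topic, most probable first."""
    order = np.argsort(phi[topic_id])[::-1]
    return vocab[order[:n]].tolist()

# Inspect the topics tied to discriminative binary features, e.g.:
for k in (22, 38, 73):
    print(k, top_words(k))
```

Because the same binary vector indexes both the matrix factorization and the focused topic model, this lookup gives each latent feature a direct textual interpretation.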
Note that, compared with conventional topic modeling, these inferred topics are not only informative about the semantic meaning of the bills, but also discriminative in predicting the outcomes of the bills.

5 Conclusion

A new methodology has been developed for the joint analysis of a matrix with associated text, based on sharing latent binary features modeled via the Indian buffet process. The model has been demonstrated on analysis of voting data from the US House of Representatives. Imposition of a low-rank representation for the latent real matrix has proven important, with this done in a new manner via the multiplicative gamma process. Encouraging quantitative results are demonstrated, and the model has also been shown to yield interesting insights into the meaning of the latent features. The sharing of latent binary features provides a general joint-learning framework for Indian buffet process based models [9], of which the focused topic model and binary matrix factorization are two examples; exploring other possibilities in different scenarios could be an interesting direction.

Acknowledgements
The authors would like to thank the anonymous reviewers for providing useful comments. The research reported here was supported by ARO, DOE, NGA, ONR, and DARPA (under the MSEE program).

References
[1] D. Agarwal and B. Chen. fLDA: matrix factorization through latent Dirichlet allocation. In WSDM, 2010.
[2] J. H. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 1993.
[3] A. Bhattacharya and D. B. Dunson. Sparse Bayesian infinite factor models.
Biometrika, 2011.
[4] D. M. Blei and J. D. McAuliffe. Supervised topic models. In Advances in Neural Information Processing Systems, 2007.
[5] D. M. Blei, A. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 2003.
[6] J. Clinton, S. Jackman, and D. Rivers. The statistical analysis of roll call data. Am. Political Sc. Review, 2004.
[7] T. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1973.
[8] S. Gerrish and D. M. Blei. Predicting legislative roll calls from text. In ICML, 2011.
[9] T. L. Griffiths and Z. Ghahramani. The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12:1185-1224, 2011.
[10] T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, 2005.
[11] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 2001.
[12] P. McCullagh and J. Nelder. Generalized Linear Models. Chapman and Hall, 1989.
[13] E. Meeds, Z. Ghahramani, R. Neal, and S. Roweis. Modeling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems, 2007.
[14] K. Miller, T. Griffiths, and M. I. Jordan. Nonparametric latent feature models for link prediction. In Advances in Neural Information Processing Systems, 2009.
[15] K. T. Poole. Recent developments in analytical models of voting in the U.S. congress. Am. Political Sc. Review, 1988.
[16] G. O. Roberts and J. S. Rosenthal. Coupling and ergodicity of adaptive MCMC. Journal of Applied Probability, 2007.
[17] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, 2007.
[18] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In ICML, 2008.
[19] H. Shan and A. Banerjee. Generalized probabilistic matrix factorizations for collaborative filtering. In ICDM, 2010.
[20] Y. W. Teh, D. Görür, and Z. Ghahramani. Stick-breaking construction for the Indian buffet process. In AISTATS, 2007.
[21] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 2006.
[22] C. Wang and D. M. Blei. Collaborative topic modeling for recommending scientific articles. In KDD, 2011.
[23] E. Wang, D. Liu, J. Silva, D. B. Dunson, and L. Carin. Joint analysis of time-evolving binary matrices and associated documents. In Advances in Neural Information Processing Systems, 2010.
[24] S. Williamson, C. Wang, K. A. Heller, and D. M. Blei. The IBP compound Dirichlet process and its application to focused topic modeling. In ICML, 2010.
[25] X. Zhang, D. Dunson, and L. Carin. Hierarchical topic modeling for analysis of time-evolving personal choices. In Advances in Neural Information Processing Systems, 2011.
[26] X. Zhang, D. Dunson, and L. Carin. Tree-structured infinite sparse factor model. In ICML, 2011.