{"title": "Content-based recommendations with Poisson factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 3176, "page_last": 3184, "abstract": "We develop collaborative topic Poisson factorization (CTPF), a generative model of articles and reader preferences. CTPF can be used to build recommender systems by learning from reader histories and content to recommend personalized articles of interest. In detail, CTPF models both reader behavior and article texts with Poisson distributions, connecting the latent topics that represent the texts with the latent preferences that represent the readers. This provides better recommendations than competing methods and gives an interpretable latent space for understanding patterns of readership. Further, we exploit stochastic variational inference to model massive real-world datasets. For example, we can fit CTPF to the full arXiv usage dataset, which contains over 43 million ratings and 42 million word counts, within a day. We demonstrate empirically that our model outperforms several baselines, including the previous state-of-the-art approach.", "full_text": "Content-based recommendations with Poisson factorization\n\nPrem Gopalan\nDepartment of Computer Science\nPrinceton University\nPrinceton, NJ 08540\npgopalan@cs.princeton.edu\n\nLaurent Charlin\nDepartment of Computer Science\nColumbia University\nNew York, NY 10027\nlcharlin@cs.columbia.edu\n\nDavid M. Blei\nDepartments of Statistics & Computer Science\nColumbia University\nNew York, NY 10027\ndavid.blei@columbia.edu\n\nAbstract\n\nWe develop collaborative topic Poisson factorization (CTPF), a generative model of articles and reader preferences. CTPF can be used to build recommender systems by learning from reader histories and content to recommend personalized articles of interest. 
In detail, CTPF models both reader behavior and article texts with Poisson distributions, connecting the latent topics that represent the texts with the latent preferences that represent the readers. This provides better recommendations than competing methods and gives an interpretable latent space for understanding patterns of readership. Further, we exploit stochastic variational inference to model massive real-world datasets. For example, we can fit CTPF to the full arXiv usage dataset, which contains over 43 million ratings and 42 million word counts, within a day. We demonstrate empirically that our model outperforms several baselines, including the previous state-of-the-art approach.\n\n1 Introduction\n\nIn this paper we develop a probabilistic model of articles and reader behavior data. Our model is called collaborative topic Poisson factorization (CTPF). It identifies the latent topics that underlie the articles, represents readers in terms of their preferences for those topics, and captures how documents about one topic might be interesting to the enthusiasts of another.\n\nAs a recommendation system, CTPF performs well in the face of massive, sparse, and long-tailed data. Such data is typical because most readers read or rate only a few articles, while a few readers may read thousands of articles. Further, CTPF provides a natural mechanism to solve the \u201ccold start\u201d problem, the problem of recommending previously unread articles to existing readers. Finally, CTPF provides a new exploratory window into the structure of the collection. It organizes the articles according to their topics and identifies important articles both in terms of those important to their topic and those that have transcended disciplinary boundaries.\n\nWe illustrate the model with an example. Consider the classic paper \u201cMaximum likelihood from incomplete data via the EM algorithm\u201d [5]. 
This paper, published in the Journal of the Royal Statistical Society (B) in 1977, introduced the expectation-maximization (EM) algorithm. The EM algorithm is a general method for finding maximum likelihood estimates in models with hidden random variables. As many readers will know, EM has had an enormous impact on many fields, including computer vision, natural language processing, and machine learning. This original paper has been cited over 37,000 times.\n\nFigure 1 illustrates the CTPF representation of the EM paper. (This model was fit to the shared libraries of scientists on the Mendeley website; the number of readers is 80,000 and the number of articles is 261,000.) In the figure, the horizontal axes contain topics, latent themes that pervade the collection [2]. Consider the black bars in the left figure. These represent the topics that the EM paper is about. (These were inferred from the abstract of the paper.) Specifically, it is about probabilistic modeling and statistical algorithms. Now consider the red bars on the right, which are summed with the black bars. These represent the preferences of the readers who have the EM paper in their libraries. CTPF has uncovered the interdisciplinary impact of the EM paper. It is popular with readers interested in many fields outside of those the paper discusses, including computer vision and statistical network analysis.\n\nThe CTPF representation has advantages. For forming recommendations, it naturally interpolates between using the text of the article (the black bars) and the inferred representation from user behavior data (the red bars). On one extreme, it recommends rarely or never read articles based mainly on their text; this addresses the cold start problem. On the other extreme, it recommends widely-read articles based mainly on their readership. 
In this setting, it can make good inferences about the red bars. Further, in contrast to traditional matrix factorization algorithms, we combine the space of preferences and articles via interpretable topics. CTPF thus offers reasons for making recommendations, readable descriptions of reader preferences, and an interpretable organization of the collection. For example, CTPF can recognize that the EM paper is among the most important statistics papers that have had an interdisciplinary impact.\n\nIn more detail, CTPF draws on ideas from two existing models: collaborative topic regression [20] and Poisson factorization [9]. Poisson factorization is a form of probabilistic matrix factorization [17] that replaces the usual Gaussian likelihood and real-valued representations with a Poisson likelihood and non-negative representations. Compared to Gaussian factorization, Poisson factorization enjoys more efficient inference and better handling of sparse data. However, PF is a basic recommendation model. It cannot handle the cold start problem or easily give topic-based representations of readers and articles.\n\nCollaborative topic regression is a model of text and reader data that is based on the same intuitions as we described above. (Wang and Blei [20] also use the EM paper as an example.) However, in its implementation, collaborative topic regression is a non-conjugate model that is complex to fit, difficult to work with on sparse data, and difficult to scale without stochastic optimization. Further, it is based on a Gaussian likelihood of reader behavior. Collaborative topic Poisson factorization, because it is based on Poisson and gamma variables, enjoys an easier-to-implement and more efficient inference algorithm and a better fit to sparse real-world data. 
As we show below, it scales more easily and provides significantly better recommendations than collaborative topic regression.\n\n2 The collaborative topic Poisson factorization model\n\nIn this section we describe the collaborative topic Poisson factorization model (CTPF) and discuss its statistical properties. We are given data about users (readers) and documents (articles), where each user has read or placed in his library a set of documents. The rating rud equals one if user u consulted document d, can be greater than one if the user rated the document, and is zero otherwise. Most of the values of the matrix r are typically zero, due to the sparsity of user behavior data.\n\nBackground: Poisson factorization. CTPF builds on Poisson matrix factorization [9]. In collaborative filtering, Poisson factorization (PF) is a probabilistic model of users and items. It associates each user with a latent vector of preferences, each item with a latent vector of attributes, and constrains both sets of vectors to be sparse and non-negative. Each cell of the observed matrix is assumed drawn from a Poisson distribution, whose rate is a linear combination of the corresponding user and item attributes. Poisson factorization has also been used as a topic model [3], and developed as an alternative text model to latent Dirichlet allocation (LDA). In both applications Poisson factorization has been shown to outperform competing methods [3, 9]. PF is also more easily applicable to real-life preference datasets than the popular Gaussian matrix factorization [9].\n\nFigure 1: We visualize the inferred topic intensities \u03b8 (the black bars) and the topic offsets \u03b5 (the red bars) of an article in the Mendeley [13] dataset. The plots are for the statistics article titled \u201cMaximum likelihood from incomplete data via the EM algorithm\u201d. The black bars represent the topics that the EM paper is about. 
These include probabilistic modeling and statistical algorithms. The red bars represent the preferences of the readers who have the EM paper in their libraries. It is popular with readers interested in many fields outside of those the paper discusses, including computer vision and statistical network analysis.\n\nCollaborative topic Poisson factorization. CTPF is a latent variable model of user ratings and document content. CTPF uses Poisson factorization to model both types of data. Rather than modeling them as independent factorization problems, we connect the two latent factorizations using a correction term [20], which we describe below.\n\nSuppose we have data containing D documents and U users. CTPF assumes a collection of K unnormalized topics \u03b21:K. Each topic \u03b2k is a collection of word intensities on a vocabulary of size V. Each component \u03b2vk of the unnormalized topics is drawn from a Gamma distribution. Given the topics, CTPF assumes that a document d is generated with a vector of K latent topic intensities \u03b8d, and represents users with a vector of K latent topic preferences \u03b7u. Additionally, the model associates each document with K latent topic offsets \u03b5d that capture the document\u2019s deviation from the topic intensities. These deviations occur when the content of a document is insufficient to explain its ratings. For example, these variables can capture that a machine learning article is interesting to a biologist, because other biologists read it.\n\nWe now define the generative process for the observed word counts in documents and observed user ratings of documents under CTPF:\n\n1. Document model:\n(a) Draw topics \u03b2vk \u223c Gamma(a, b).\n(b) Draw document topic intensities \u03b8dk \u223c Gamma(c, d).\n(c) Draw word count wdv \u223c Poisson(\u03b8d^T \u03b2v).\n\n2. 
Recommendation model:\n(a) Draw user preferences \u03b7uk \u223c Gamma(e, f).\n(b) Draw document topic offsets \u03b5dk \u223c Gamma(g, h).\n(c) Draw rating rud \u223c Poisson(\u03b7u^T (\u03b8d + \u03b5d)).\n\nCTPF specifies that the conditional probability that a user u rated document d with rating rud is drawn from a Poisson distribution with rate parameter \u03b7u^T (\u03b8d + \u03b5d). The form of the factorization couples the user preferences with both the document topic intensities \u03b8d and the document topic offsets \u03b5d. This allows the user preferences to be interpreted as affinity to latent topics.\n\nCTPF has two main advantages over previous work (e.g., [20]), both of which contribute to its superior empirical performance (see Section 5). First, CTPF is a conditionally conjugate model when augmented with auxiliary variables. This allows CTPF to conveniently use standard variational inference with closed-form updates (see Section 3). Second, CTPF is built on Poisson factorization; it can take advantage of the natural sparsity of user consumption of documents and can analyze massive real-world data. This follows from the likelihood of the observed data under the model [9].\n\n[Figure 1 plots: topic intensities for the EM paper over topics such as \u201cprobability, prior, bayesian, likelihood, inference, maximum\u201d and \u201calgorithm, efficient, optimal, clustering, optimization, show\u201d, with reader-preference offsets on topics such as \u201cimage, object, matching, tracking, motion, segmentation\u201d and \u201cnetwork, connected, modules, nodes, links, topology\u201d.]\n\nWe analyze user preferences and document content with CTPF via its posterior distribution over latent variables p(\u03b21:K, \u03b81:D, \u03b51:D, \u03b71:U | w, r). By estimating this distribution over the latent structure, we can characterize user preferences and document readership in many useful ways. 
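The generative process above can be simulated directly. The following is a minimal illustrative sketch (not the paper's C++ implementation), using only Python's standard library; the hyperparameter value 0.3 follows Section 3, and the `poisson` helper is a simple sampler written for this sketch.

```python
import math
import random

def poisson(rate, rng):
    # Knuth's method; adequate for the small rates in this sketch
    L, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def sample_ctpf(D, U, V, K, shp=0.3, rte=0.3, seed=0):
    rng = random.Random(seed)
    gam = lambda: rng.gammavariate(shp, 1.0 / rte)  # Gamma(shape, rate)
    beta = [[gam() for _ in range(K)] for _ in range(V)]   # topics beta_vk
    theta = [[gam() for _ in range(K)] for _ in range(D)]  # intensities theta_dk
    eps = [[gam() for _ in range(K)] for _ in range(D)]    # offsets eps_dk
    eta = [[gam() for _ in range(K)] for _ in range(U)]    # preferences eta_uk
    # Word counts: w_dv ~ Poisson(theta_d . beta_v)
    w = [[poisson(sum(theta[d][k] * beta[v][k] for k in range(K)), rng)
          for v in range(V)] for d in range(D)]
    # Ratings: r_ud ~ Poisson(eta_u . (theta_d + eps_d))
    r = [[poisson(sum(eta[u][k] * (theta[d][k] + eps[d][k]) for k in range(K)), rng)
          for d in range(D)] for u in range(U)]
    return w, r
```

With the sparse Gamma(0.3, 0.3) priors, most sampled counts are zero, mirroring the sparsity the model is designed to exploit.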
Figure 1 gives an example.\n\nRecommending old and new documents. Once the posterior is fit, we use CTPF to recommend in-matrix documents and out-matrix or cold-start documents to users. We define in-matrix documents as those that have been rated by at least one user in the recommendation system. All other documents are new to the system. A cold-start recommendation of a new document is based entirely on its content. For predicting both in-matrix and out-matrix documents, we rank each user\u2019s unread documents by their posterior expected Poisson parameters,\n\nscoreud = E[\u03b7u^T (\u03b8d + \u03b5d) | w, r].   (1)\n\nThe intuition behind the CTPF posterior is that when there is no reader data, we depend on the topics to make recommendations. When there is both reader data and article content, this gives information about the topic offsets. We emphasize that under CTPF the in-matrix recommendations and cold-start recommendations are not disjoint tasks. There is a continuum between these tasks. For example, the model can provide better predictions for articles with few ratings by leveraging its latent topic intensities \u03b8d.\n\n3 Approximate posterior inference\n\nGiven a set of observed document ratings r and their word counts w, our goal is to infer the topics \u03b21:K, the user preferences \u03b71:U, the document topic intensities \u03b81:D, and the document topic offsets \u03b51:D. With estimates of these quantities, we can recommend in-matrix and out-matrix documents to users.\n\nComputing the exact posterior distribution p(\u03b21:K, \u03b81:D, \u03b51:D, \u03b71:U | w, r) is intractable; we use variational inference [15]. We first develop a coordinate ascent algorithm\u2014a batch algorithm that iterates over only the non-zero document-word counts and the non-zero user-document ratings. 
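Under the mean-field approximation of Section 3, the expectation in Equation 1 factorizes: the mean of a product of independent Gamma variables is the product of their means (shape over rate). A small sketch of the scoring and ranking step, with hypothetical `(shape, rate)` containers (the names `eta_q`, `docs_q` are this sketch's, not the paper's):

```python
def expected_score(eta_q, theta_q, eps_q):
    """E[eta_u^T (theta_d + eps_d)] under mean-field q; each argument is a
    length-K list of (shape, rate) pairs for one user's or document's factors."""
    return sum((es / er) * (ts / tr + xs / xr)
               for (es, er), (ts, tr), (xs, xr) in zip(eta_q, theta_q, eps_q))

def recommend(user_q, docs_q, read, M):
    # Rank the user's unread documents by expected score; return the top M
    scores = {d: expected_score(user_q, th, ep)
              for d, (th, ep) in docs_q.items() if d not in read}
    return sorted(scores, key=scores.get, reverse=True)[:M]
```

For a cold-start document the offset factors carry little information beyond their prior, so the score is driven by the content-based intensities, matching the continuum described above.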
We then present a more scalable stochastic variational inference algorithm.\n\nIn variational inference we first define a parameterized family of distributions over the hidden variables. We then fit the parameters to find a distribution that minimizes the KL divergence to the posterior. The model is conditionally conjugate if the complete conditional of each latent variable is in the exponential family and is in the same family as its prior. (The complete conditional is the conditional distribution of a latent variable given the observations and the other latent variables in the model [8].) For the class of conditionally conjugate models, we can perform this optimization with a coordinate ascent algorithm and closed-form updates.\n\nAuxiliary variables. To facilitate inference, we first augment CTPF with auxiliary variables. Following Ref. [6] and Ref. [9], we add K latent variables zdv,k \u223c Poisson(\u03b8dk \u03b2vk), which are integers such that wdv = \u03a3_k zdv,k. Similarly, for each observed rating rud, we add K latent variables ya_ud,k \u223c Poisson(\u03b7uk \u03b8dk) and K latent variables yb_ud,k \u223c Poisson(\u03b7uk \u03b5dk) such that rud = \u03a3_k (ya_ud,k + yb_ud,k). A sum of independent Poisson random variables is itself a Poisson with rate equal to the sum of the rates. Thus, these new latent variables preserve the marginal distribution of the observations, wdv and rud. Further, when the observed counts are 0, these auxiliary variables are not random. Consequently, our inference procedure need only consider the auxiliary variables for non-zero observations.\n\nCTPF with the auxiliary variables is conditionally conjugate; its complete conditionals are shown in Table 1. The complete conditionals of the Gamma variables \u03b2vk, \u03b8dk, \u03b5dk, and \u03b7uk are Gamma distributions with shape and rate parameters as shown in Table 1. 
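The superposition property invoked above is easy to check numerically: independent Poisson counts with rates \u03bbk sum to a Poisson with rate \u03a3\u03bbk, and each component's share of the total concentrates at \u03bbk/\u03a3\u03bbk, which is what the multinomial conditionals for zdv and yud exploit. A quick stand-alone simulation (illustrative only; the rates are arbitrary):

```python
import math
import random

def poisson(rate, rng):
    # Knuth's method; fine for small rates
    L, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(1)
rates = [0.5, 1.0, 1.5]        # e.g. theta_dk * beta_vk for k = 1..K
n, total, comp0 = 20000, 0, 0
for _ in range(n):
    z = [poisson(lam, rng) for lam in rates]
    total += sum(z)
    comp0 += z[0]

mean_sum = total / n           # close to sum(rates) = 3.0
share0 = comp0 / total         # close to 0.5 / 3.0
```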
For the auxiliary Poisson variables, observe that zdv is a K-dimensional latent vector of Poisson counts, which, when conditioned on their observed sum wdv, is distributed as a multinomial [14, 4]. A similar reasoning underlies the conditional for yud, which is a 2K-dimensional latent vector of Poisson counts. With our complete conditionals in place, we now derive the coordinate ascent algorithm for the expanded set of latent variables.\n\nTable 1: CTPF: latent variables, complete conditionals, and variational parameters.\n\n\u03b8dk (Gamma): shape c + \u03a3_v zdv,k + \u03a3_u ya_ud,k; rate d + \u03a3_v \u03b2vk + \u03a3_u \u03b7uk; variational parameters \u02dc\u03b8shp_dk, \u02dc\u03b8rte_dk.\n\u03b2vk (Gamma): shape a + \u03a3_d zdv,k; rate b + \u03a3_d \u03b8dk; variational parameters \u02dc\u03b2shp_vk, \u02dc\u03b2rte_vk.\n\u03b7uk (Gamma): shape e + \u03a3_d (ya_ud,k + yb_ud,k); rate f + \u03a3_d (\u03b8dk + \u03b5dk); variational parameters \u02dc\u03b7shp_uk, \u02dc\u03b7rte_uk.\n\u03b5dk (Gamma): shape g + \u03a3_u yb_ud,k; rate h + \u03a3_u \u03b7uk; variational parameters \u02dc\u03b5shp_dk, \u02dc\u03b5rte_dk.\nzdv (Mult): log \u03b8dk + log \u03b2vk; variational parameter \u03c6dv.\nyud (Mult): log \u03b7uk + log \u03b8dk if k < K, log \u03b7uk + log \u03b5dk if K \u2264 k < 2K; variational parameter \u03beud.\n\nVariational family. 
We define the mean-field variational family q(\u03b2, \u03b8, \u03b5, \u03b7, z, y) over the latent variables, where we consider these variables to be independent and each governed by its own distribution,\n\nq(\u03b2, \u03b8, \u03b5, \u03b7, z, y) = \u03a0_{v,k} q(\u03b2vk) \u03a0_{d,k} q(\u03b8dk) q(\u03b5dk) \u03a0_{u,k} q(\u03b7uk) \u03a0_{ud,k} q(yud,k) \u03a0_{dv,k} q(zdv,k).   (2)\n\nThe variational factors for topic components \u03b2vk, topic intensities \u03b8dk, and user preferences \u03b7uk are all Gamma distributions\u2014the same as their conditional distributions\u2014with freely set shape and rate variational parameters. For example, the variational distribution for the topic intensities \u03b8dk is Gamma(\u03b8dk; \u02dc\u03b8shp_dk, \u02dc\u03b8rte_dk). We denote shape with the superscript \u201cshp\u201d and rate with the superscript \u201crte\u201d. The variational factor for zdv is a multinomial Mult(wdv, \u03c6dv), where the variational parameter \u03c6dv is a point on the K-simplex. The variational factor for yud = (ya_ud, yb_ud) is also a multinomial, Mult(rud, \u03beud), but here \u03beud is a point on the 2K-simplex.\n\nOptimal coordinate updates. In coordinate ascent we iteratively optimize each variational parameter while holding the others fixed. Under the conditionally conjugate augmented CTPF, we can optimize each coordinate in closed form by setting the variational parameter equal to the expected natural parameter (under q) of the complete conditional. For a given random variable, this expected conditional parameter is the expectation of a function of the other random variables and observations. (For details, see [9, 10].) 
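Two standard Gamma expectations do all the work in these closed-form updates: Eq[\u03b8] = shp/rte and Eq[log \u03b8] = \u03a8(shp) \u2212 log(rte). These identities are general facts about the Gamma distribution, not specific to the paper; they can be checked by Monte Carlo, approximating the digamma function numerically since Python's `math` module does not provide one:

```python
import math
import random

def digamma(x, h=1e-5):
    # Psi(x): numerical derivative of log-Gamma (math has lgamma but no digamma)
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

shp, rte = 2.0, 4.0
rng = random.Random(0)
draws = [rng.gammavariate(shp, 1.0 / rte) for _ in range(100000)]

mc_mean = sum(draws) / len(draws)                      # ~ shp / rte = 0.5
mc_log_mean = sum(map(math.log, draws)) / len(draws)   # ~ Psi(shp) - log(rte)
```

The first identity supplies the rate terms of the Gamma updates; the second supplies the log terms of the multinomial updates.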
We now describe two of these updates; the other updates are similarly derived.\n\nThe update for the variational shape and rate parameters of the topic intensities \u03b8dk is\n\n\u02dc\u03b8shp_dk = c + \u03a3_v wdv \u03c6dv,k + \u03a3_u rud \u03beud,k,   \u02dc\u03b8rte_dk = d + \u03a3_v \u02dc\u03b2shp_vk / \u02dc\u03b2rte_vk + \u03a3_u \u02dc\u03b7shp_uk / \u02dc\u03b7rte_uk.   (3)\n\nThe Gamma update in Equation 3 derives from the expected natural parameter (under q) of the complete conditional for \u03b8dk in Table 1. In the shape parameter for the topic intensities for document d, we use that Eq[zdv,k] = wdv \u03c6dv,k for the word indexed by v and Eq[ya_ud,k] = rud \u03beud,k for the user indexed by u. In the rate parameter, we use that the expectation of a Gamma variable is the shape divided by the rate.\n\nThe update for the multinomial \u03c6dv is\n\n\u03c6dv,k \u221d exp{\u03a8(\u02dc\u03b8shp_dk) \u2212 log \u02dc\u03b8rte_dk + \u03a8(\u02dc\u03b2shp_vk) \u2212 log \u02dc\u03b2rte_vk},   (4)\n\nwhere \u03a8(\u00b7) is the digamma function (the first derivative of the log \u0393 function). This update comes from the expectation of the log of a Gamma variable, for example, Eq[log \u03b8dk] = \u03a8(\u02dc\u03b8shp_dk) \u2212 log \u02dc\u03b8rte_dk.\n\nCoordinate ascent algorithm. The CTPF coordinate ascent algorithm is illustrated in Figure 2. Similar to the algorithm of [9], our algorithm is efficient on sparse matrices. In steps 1 and 2, we need only update variational multinomials for the non-zero word counts wdv and the non-zero ratings rud. In step 3, the sums over the expected zdv,k and the expected yud,k need only consider non-zero observations. This efficiency comes from the likelihood of the full matrix depending only on the non-zero observations [9].\n\nInitialize the topics \u03b21:K and topic intensities \u03b81:D using LDA [2] as described in Section 3. Repeat until convergence:\n\n1. 
For each word count wdv > 0, set \u03c6dv to the expected conditional parameter of zdv.\n2. For each rating rud > 0, set \u03beud to the expected conditional parameter of yud.\n3. For each document d and each k, update the block of variational topic intensities \u02dc\u03b8dk to their expected conditional parameters using Equation 3. Perform similar block updates for \u02dc\u03b2vk, \u02dc\u03b7uk, and \u02dc\u03b5dk, in sequence.\n\nFigure 2: The CTPF coordinate ascent algorithm. The expected conditional parameters of the latent variables are computed from Table 1.\n\nStochastic algorithm. The CTPF coordinate ascent algorithm is efficient: it only iterates over the non-zero observations in the observed matrices. The algorithm computes approximate posteriors for datasets with ten million observations within hours (see Section 5). To fit still larger datasets within hours, we develop an algorithm that subsamples a document and estimates variational parameters using stochastic variational inference [10]. The stochastic algorithm is also useful in settings where new items continually arrive in a stream. The CTPF SVI algorithm is described in the Appendix.\n\nComputational efficiency. The SVI algorithm is more efficient than the batch algorithm. The batch algorithm has a per-iteration computational complexity of O((W + R)K), where R and W are the total number of non-zero observations in the document-user and document-word matrices, respectively. For the SVI algorithm, this is O((wd + rd)K), where rd is the number of users rating the sampled document d and wd is the number of unique words in it. (We assume that a single document is sampled in each iteration.) In Figure 2, the sums involving the multinomial parameters can be tracked for efficient memory usage. The bound on memory usage is O((D + V + U)K).\n\nHyperparameters, initialization, and stopping criteria. Following [9], we fix each Gamma shape and rate hyperparameter at 0.3. 
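Step 3 of the algorithm in Figure 2 (the Equation 3 update) can be sketched as follows. This is an illustrative sketch only; the sparse containers `w_nz`, `r_nz`, `phi`, `xi` and the precomputed column sums are hypothetical names, not the paper's implementation, but the sketch shows why the per-document cost is O((wd + rd)K).

```python
def update_theta_d(d, c_hyp, d_hyp, w_nz, r_nz, phi, xi, Ebeta_sum, Eeta_sum, K):
    """Block update of theta_d's variational Gamma parameters (Equation 3).

    w_nz[d]: list of (v, w_dv) with w_dv > 0; r_nz[d]: list of (u, r_ud) > 0.
    phi[(d, v)]: length-K multinomial; xi[(u, d)]: its first K entries (y^a part).
    Ebeta_sum[k] = sum_v E[beta_vk]; Eeta_sum[k] = sum_u E[eta_uk]."""
    shape = [c_hyp] * K
    for v, wdv in w_nz.get(d, []):          # only non-zero word counts
        for k in range(K):
            shape[k] += wdv * phi[(d, v)][k]   # E[z_dv,k] = w_dv * phi_dv,k
    for u, rud in r_nz.get(d, []):          # only non-zero ratings
        for k in range(K):
            shape[k] += rud * xi[(u, d)][k]    # E[y^a_ud,k] = r_ud * xi_ud,k
    rate = [d_hyp + Ebeta_sum[k] + Eeta_sum[k] for k in range(K)]
    return list(zip(shape, rate))
```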
We initialize the variational parameters for \u03b7uk and \u03b5dk to the prior on the corresponding latent variables and add small uniform noise. We initialize \u02dc\u03b2vk and \u02dc\u03b8dk using estimates of their normalized counterparts from LDA [2] fitted to the document-word matrix w. For the SVI algorithm described in the Appendix, we set learning rate parameters \u03c40 = 1024, \u03ba = 0.5 and use a mini-batch size of 1024. In both algorithms, we declare convergence when the change in expected predictive likelihood is less than 0.001%.\n\n4 Related work\n\nSeveral research efforts propose joint models of item covariates and user activity. Singh and Gordon [19] present a framework for simultaneously factorizing related matrices, using generalized link functions and coupled latent spaces. Hong et al. [11] propose Co-factorization machines for modeling user activity on Twitter with tweet features, including content. They study several design choices for sharing latent spaces. While CTPF is roughly an instance of these frameworks, we focus on the task of recommending articles to readers.\n\nAgarwal and Chen [1] propose fLDA, a latent factor model which combines document features, through their empirical LDA [2] topic intensities, and other covariates to predict user preferences. The coupling of matrix decomposition and topic modeling through shared latent variables is also considered in [18, 22]. Like fLDA, both papers tie latent spaces without corrective terms. Wang and Blei [20] have shown the importance of using corrective terms through the collaborative topic regression (CTR) model, which uses a latent topic offset to adjust a document\u2019s topic proportions. CTR has been shown to outperform a variant of fLDA [20]. Our proposed model CTPF uses the CTR approach to sharing latent spaces.\n\nCTR [20] combines topic modeling using LDA [2] with Gaussian matrix factorization for one-class collaborative filtering [12]. 
Like CTPF, the underlying MF algorithm has a per-iteration complexity that is linear in the number of non-zero observations. Unlike CTPF, CTR is not conditionally conjugate, and its inference algorithm depends on numerical optimization of topic intensities. Further, CTR requires setting confidence parameters that govern uncertainty around a class of observed ratings. As we show in Section 5, CTPF scales more easily and provides significantly better recommendations than CTR.\n\nFigure 3: The CTPF coordinate ascent algorithm outperforms CTR and other competing algorithms on both in-matrix and out-matrix predictions. Each panel shows the in-matrix or out-matrix recommendation task on the Mendeley data set or the 1-year arXiv data set. Note that the Ratings-only model cannot make out-matrix predictions. The mean precision and mean recall are computed from a random sample of 10,000 users.\n\n5 Empirical results\n\nWe use the predictive approach to evaluating model fitness [7], comparing the predictive accuracy of the CTPF coordinate ascent algorithm in Figure 2 to collaborative topic regression (CTR) [21]. We also compare to variants of CTPF to demonstrate that coupling the latent spaces using corrective terms is essential for good predictive performance, and that CTPF predicts significantly better than its variants and CTR. Finally, we explore large real-world data sets, revealing the interaction patterns between readers and articles.\n\nData sets. We study the CTPF algorithm of Figure 2 on two data sets. The Mendeley data set [13] of scientific articles is a binary matrix of 80,000 users and 260,000 articles with 5 million observations. 
Each cell corresponds to the presence or absence of an article in a scientist\u2019s online library. The arXiv data set is a matrix of 120,297 users and 825,707 articles, with 43 million observations. Each observation indicates whether or not a user has consulted an article (or its abstract). This data was collected from the access logs of registered users on the http://arXiv.org paper repository. The articles and the usage data span a timeline of 10 years (2003-2012). In our experiments on predictive performance, we use a subset of the data set, with 64,978 users, 636,622 papers, and 7.6 million clicks, which spans one year of usage data (2012). We treat the user clicks as implicit feedback and specifically as binary data. For each article in the above data sets, we remove stop words and use tf-idf to choose the top 10,000 distinct words (14,000 for arXiv) as the vocabulary. We implemented the batch and stochastic algorithms for CTPF in 4500 lines of C++ code.1\n\nCompeting methods. We study the predictive performance of the following models. With the exception of Poisson factorization [9], which does not model content, the topics and topic intensities (or proportions) in all CTPF models are initialized using LDA [2] and fit using batch variational inference. 
We set K = 100 in all of our experiments.\n\n\u2022 CTPF: CTPF is our proposed model (Section 2) with latent user preferences tied to a single vector \u03b7u, and interpreted as affinity to latent topics \u03b2.\n\n1Our source code is available from: https://github.com/premgopalan/collabtm\n\nFigure 4: The top articles by the expected weight \u03b8dk from a component discovered by our stochastic variational inference in the arXiv data set (Left) and Mendeley (Right). Using the expected topic proportions \u03b8dk and the expected topic offsets \u03b5dk, we identified subclasses of articles: A) corresponds to the top articles by topic proportions in the field of \u201cStatistical inference algorithms\u201d for arXiv and \u201cOntologies and applications\u201d for Mendeley; B) corresponds to the top articles with low topic proportions in this field, but a large \u03b8dk + \u03b5dk, demonstrating the outside interests of readers of that field (e.g., very popular papers often appear, such as \u201cThe Proof of Innocence\u201d, which describes a rigorous way to \u201cfight your traffic tickets\u201d). 
C) corresponds to the top articles with high topic proportions in this field but that also draw significant interest from outside readers.\n\n\u2022 Decoupled Poisson Factorization: This model is similar to CTPF but decouples the user latent preferences into distinct components pu and qu, each of dimension K. We have\n\nwdv \u223c Poisson(\u03b8d^T \u03b2v);   rud \u223c Poisson(pu^T \u03b8d + qu^T \u03b5d).   (5)\n\nThe user preference parameters for content and ratings can vary freely. The qu are independent of topics and offer greater modeling flexibility, but they are less interpretable than the \u03b7u in CTPF. Decoupling the factorizations has been proposed by Porteous et al. [16].\n\n\u2022 Content Only: We use the CTPF model without the document topic offsets \u03b5d. This resembles the idea developed in [1] but using Poisson generating distributions.\n\n\u2022 Ratings Only [9]: We use Poisson factorization on the observed ratings. This model can only make in-matrix predictions.\n\n\u2022 CTR [20]: A full optimization of this model does not scale to the size of our data sets despite running for several days. Accordingly, we fix the topics and document topic proportions to their LDA values. This procedure is shown to perform almost as well as jointly optimizing the full model in [20]. We follow the authors\u2019 experimental settings. Specifically, for hyperparameter selection we started with the values of hyperparameters suggested by the authors and explored various values of the learning rate as well as the variance of the prior over the correction factor (\u03bbv in [20]). Training convergence was assessed using the model\u2019s complete log-likelihood on the training observations. (CTR does not use a validation set.)\n\nEvaluation. Prior to training models, we randomly select 20% of ratings and 1% of documents in each data set to be used as a held-out test set. 
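Mean precision and mean recall at M recommendations, the metrics reported in Figure 3, can be computed as follows. This is a generic sketch of the standard definitions, not the paper's evaluation code:

```python
def precision_recall_at_m(ranked, relevant, M):
    """Precision and recall at M for one user: `ranked` lists recommended
    documents best-first; `relevant` is the user's held-out test set."""
    hits = sum(1 for doc in ranked[:M] if doc in relevant)
    precision = hits / M
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def mean_precision_recall(users, M):
    # users: list of (ranked, relevant) pairs; average the metrics over users
    pairs = [precision_recall_at_m(rk, rel, M) for rk, rel in users]
    n = len(pairs)
    return (sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n)
```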
Additionally, we set aside 1% of the training ratings as a validation set (20% for arXiv) and use it to determine convergence. We used the CTPF settings described in Section 3 across both data sets. During testing, we generate the top M recommendations for each user as those items with the highest predictive score under each method. Figure 3 shows the mean precision and mean recall at varying numbers of recommendations for each method and data set. We see that CTPF outperforms CTR and the Ratings-only model on all data sets. CTPF outperforms the Decoupled PF model and the Content-only model on all data sets except on cold-start predictions on the arXiv data set, where it performs equally well. The Decoupled PF model lacks CTPF's interpretable latent space. The Content-only model performs poorly on most tasks; it lacks a corrective term on topics to account for user ratings. In Figure 4, we explore the Mendeley and arXiv data sets using CTPF. We fit the Mendeley data set using the coordinate ascent algorithm, and the full arXiv data set using the stochastic algorithm from Section 3.
Using the expected document topic intensities θdk and the expected document topic offsets εdk, we identified interpretable topics and subclasses of articles that reveal the interaction patterns between readers and articles.

[Figure 4 content. arXiv, topic "Statistical Inference Algorithms" — A) Articles about the topic; readers in the field: "On the ergodicity properties of adaptive MCMC algorithms", "Particle filtering within adaptive Metropolis Hastings sampling", "An Adaptive Sequential Monte Carlo Sampler". B) Articles outside the topic; readers in the field: "A comparative review of dimension reduction methods in ABC", "Computational methods for Bayesian model choice", "The Proof of Innocence". C) Articles about this field; readers outside the field: "Introduction to Monte Carlo Methods", "An introduction to Monte Carlo simulation of statistical...", "The No-U-Turn Sampler: Adaptively setting path lengths...". Mendeley, topic "Information Retrieval" — A) Articles about the topic; readers in the field: "The anatomy of a large-scale hypertextual Web search engine", "Authoritative sources in a hyperlinked environment", "A translation approach to portable ontology specifications". B) Articles outside the topic; readers in the field: "How to choose a good scientific problem.", "Practical Guide to Support Vector Classification", "Maximum likelihood from incomplete data via the EM…". C) Articles about this field; readers outside the field: "Data clustering: a review", "Defrosting the digital library: bibliographic tools…", "Top 10 algorithms in data mining".]

References

[1] D. Agarwal and B. Chen. fLDA: Matrix factorization through latent Dirichlet allocation. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, pages 91–100. ACM, 2010.

[2] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.

[3] J. Canny. GaP: A factor model for discrete data.
In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2004.

[4] A. Cemgil. Bayesian inference for nonnegative matrix factorization models. Computational Intelligence and Neuroscience, 2009.

[5] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

[6] D. B. Dunson and A. H. Herring. Bayesian latent variable models for mixed discrete outcomes. Biostatistics, 6(1):11–25, 2005.

[7] S. Geisser and W. F. Eddy. A predictive approach to model selection. Journal of the American Statistical Association, pages 153–160, 1979.

[8] Z. Ghahramani and M. Beal. Variational inference for Bayesian mixtures of factor analysers. In Neural Information Processing Systems, volume 12, 2000.

[9] P. Gopalan, J. M. Hofman, and D. Blei. Scalable recommendation with Poisson factorization. arXiv preprint arXiv:1311.1704, 2013.

[10] M. Hoffman, D. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347, 2013.

[11] L. Hong, A. S. Doumith, and B. D. Davison. Co-factorization machines: Modeling user interests and predicting individual decisions in Twitter. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 557–566. ACM, 2013.

[12] Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In Eighth IEEE International Conference on Data Mining, pages 263–272. IEEE, 2008.

[13] K. Jack, J. Hammerton, D. Harvey, J. J. Hoyt, J. Reichelt, and V. Henning. Mendeley's reply to the datatel challenge. Procedia Computer Science, 1(2):1–3, 2010. URL http://www.mendeley.com/research/sei-whale/.

[14] N. Johnson, A. Kemp, and S. Kotz. Univariate Discrete Distributions.
John Wiley & Sons, 2005.

[15] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. Introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.

[16] I. Porteous, A. U. Asuncion, and M. Welling. Bayesian matrix factorization with side information and Dirichlet process mixtures. In Maria Fox and David Poole, editors, Proceedings of the Conference of the Association for the Advancement of Artificial Intelligence. AAAI Press, 2010.

[17] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, pages 880–887, 2008.

[18] H. Shan and A. Banerjee. Generalized probabilistic matrix factorizations for collaborative filtering. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 1025–1030. IEEE, 2010.

[19] A. P. Singh and G. J. Gordon. Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 650–658. ACM, 2008.

[20] C. Wang and D. Blei. Collaborative topic modeling for recommending scientific articles. In Knowledge Discovery and Data Mining, 2011.

[21] C. Wang, J. Paisley, and D. Blei. Online variational inference for the hierarchical Dirichlet process. In Artificial Intelligence and Statistics, 2011.

[22] X. Zhang and L. Carin. Joint modeling of a matrix with associated text via latent binary features. In Advances in Neural Information Processing Systems, pages 1556–1564, 2012.