{"title": "Gamma-Poisson Dynamic Matrix Factorization Embedded with Metadata Influence", "book": "Advances in Neural Information Processing Systems", "page_first": 5824, "page_last": 5835, "abstract": "A conjugate Gamma-Poisson model for Dynamic Matrix Factorization incorporated with metadata influence (mGDMF for short) is proposed to effectively and efficiently model massive, sparse and dynamic data in recommendations. Modeling recommendation problems with a massive number of ratings and very sparse or even no ratings on some users/items in a dynamic setting is very demanding and poses critical challenges to well-studied matrix factorization models due to the large-scale, sparse and dynamic nature of the data. Our proposed mGDMF tackles these challenges by introducing three strategies: (1) constructing a stable Gamma-Markov chain model that smoothly drifts over time by combining both static and dynamic latent features of data; (2) incorporating the user/item metadata into the model to tackle sparse ratings; and (3) undertaking stochastic variational inference to efficiently handle massive data. mGDMF is conjugate, dynamic and scalable. Experiments show that mGDMF significantly (both effectively and efficiently) outperforms the state-of-the-art static and dynamic models on large, sparse and dynamic data.", "full_text": "Gamma-Poisson Dynamic Matrix Factorization\n\nEmbedded with Metadata In\ufb02uence\n\nTrong Dinh Thac Do\n\nAdvanced Analytics Institute\n\nUniversity of Technology Sydney\n\nthacdtd@gmail.com\n\nLongbing Cao \u2217\n\nAdvanced Analytics Institute\n\nUniversity of Technology Sydney\n\nlongbing.cao@gmail.com\n\nAbstract\n\nA conjugate Gamma-Poisson model for Dynamic Matrix Factorization incorpo-\nrated with metadata in\ufb02uence (mGDMF for short) is proposed to effectively and\nef\ufb01ciently model massive, sparse and dynamic data in recommendations. 
Modeling recommendation problems with a massive number of ratings and very sparse or even no ratings on some users/items in a dynamic setting is very demanding and poses critical challenges to well-studied matrix factorization models due to the large-scale, sparse and dynamic nature of the data. Our proposed mGDMF tackles these challenges by introducing three strategies: (1) constructing a stable Gamma-Markov chain model that smoothly drifts over time by combining both static and dynamic latent features of the data; (2) incorporating user/item metadata into the model to tackle sparse ratings; and (3) undertaking stochastic variational inference to efficiently handle massive data. mGDMF is conjugate, dynamic and scalable. Experiments show that mGDMF significantly (both effectively and efficiently) outperforms state-of-the-art static and dynamic models on large, sparse and dynamic data.

1 Introduction

An increasing amount of research [8, 21, 13, 25, 3, 16, 34] focuses on the significant real-life recommendation challenge of modeling massive and evolving ratings (e.g., a girl may like cartoon movies in her childhood but prefer romantic movies when she is older) where some users (or items) have only a few or even no ratings (forming sparse or cold-start user/item ratings). 
For\nexample, Net\ufb02ix data have 97.5M ratings, 225K users, 14K movies, and 98.8% missing ratings.\nThe intensively-studied collaborative \ufb01ltering models, in particular matrix factorization (MF) models,\nfail to model such massive, dynamic and sparse recommendation problems as they usually model\nstatic data, assume certain user/item rating similarity, and are too costly to estimate missing ratings in\nmassive data.\nSeveral Poisson-based MF models were proposed recently to model large and sparse static ratings,\ne.g., Poisson Factorization (PF) [25], and collaborative topic PF for modeling content-based recom-\nmendations [27]; and dynamic data, e.g., dynamic PF (dPF) [16], dynamic compound PF (DCPF)\n[34], and deep dynamic PF [24]. However, none of these can effectively and ef\ufb01ciently handle\nmassive, dynamic and sparse ratings simultaneously (see more analysis in Section 2).\nTo effectively model sparse, dynamic and massive ratings, we propose a Gamma-Poisson Dynamic\nMatrix Factorization model incorporated with metadata in\ufb02uence (mGDMF). mGDMF has three built-\nin mechanisms to jointly address user/item rating sparsity, large-scale ratings, and rating dynamics.\nFirst, mGDMF is a factorization model that uses a Gamma-Poisson structure to model massive,\n\n\u2217Corresponding author\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fsparse and long-tailed data [32, 45, 43]; the Gamma-Poisson structure with Poisson likelihood and\nnon-negative representations enjoys more ef\ufb01cient inference and better handling of sparse data than\nthe Gaussian Factorization in PMF [25, 19, 20].\nSecond, mGDMF has a conjugate Gamma-Gamma of integrating the observable user/item metadata\n(e.g., \u2018age\u2019 of a user and \u2018genre\u2019 of a movie) with latent user preference and latent item attractiveness\nfactorized from ratings to model user/item rating sparsity. 
This is inspired by the observation that\nrating behaviors are driven by user/item metadata, and the couplings between users/items (i.e., users\n(or items) with similar metadata may share similar items (or users)) [10, 9, 11]. This metadata-based\nrepresentation leverages the rating factorization to handle sparse item/user ratings or cold-start rating\nissues.\nLastly, mGDMF has the conjugate Gamma-Markov chains to model user preferences and item features\nthat change smoothly over time. As a result of jointly handling all three challenges: scalability,\nsparsity and dynamics, mGDMF forms a conjugate Gamma-Gamma-Gamma-Poisson structure, on\nwhich we perform the stochastic variational inference to model massive data.\nExtensive empirical results show that mGDMF effectively and ef\ufb01ciently outperforms the state-of-\nthe-art static and dynamic PF models on \ufb01ve large and sparse datasets.\n\n2 Related Work\n\nHere, we review the work related to ours, including MF-based models and statistical models for\ndynamic data and for handling user/item rating sparsity.\nMF-based models. Classic MF models are improper for handling a large number of ratings [39, 25]\nas they require intensive mathematical computation and may fail to \ufb01nd similar users in sparse data\n(they assume two users have rated at least some items in common). Many probabilistic MF models,\nsuch as PMF [39], have been proposed to deal with large data. However, data with sparse ratings\nsigni\ufb01cantly challenges them since they have to compute all data with many missing ratings.\nModeling dynamic data. Several previous studies tend to capture the evolving characteristics of\nusers and items over time, such as TimeSVD++ [36] and Bayesian Probabilistic Tensor (BPTF)\nFactorization [44]. TimeSVD++ only captures the user-evolving factors but ignores the item-evolving\nfactors. 
BPTF models the user and item factors at each time index independently from previous ones, and cannot handle specific users/items. The work in [17, 18, 41] extends PMF to dynamic data, but adopts a Gaussian state space and cannot handle sparsity in the way the long-tailed Gamma priors in PF [25] do. Further, computing on all data (including missing and non-missing elements) makes these models inefficient on large data. Poisson-based dynamic matrix factorization models are recent advances for modeling dynamic data, such as dPF [16] and DCPF [34] for recommendations. dPF faces the same problem as dynamic PMF since it uses a Gaussian state space. DCPF uses conjugate Gamma-Markov chains but assumes the static portions as a prior on the dynamic portions. This makes the chains grow too fast or too slow [34], resulting in unpredictable results. In addition, recent dynamic Poisson-based models such as [24, 47, 40, 1] analyze sequential count vectors. In contrast, mGDMF has conjugate Gamma-Markov chains, aggregates the static portions with the dynamic portions at each time slice, and prevents the instability of the chains. These chains stably capture evolving user preferences and item attractiveness over time. As a result, mGDMF models the nature of dynamic observations more efficiently and effectively.
Handling user/item rating sparsity. No reported work directly incorporates user/item metadata into PF for dynamic data. The work in [2, 27, 46, 31] integrates a document-word matrix into PF. Other recent work [48, 22] also tends to integrate observable attributes into some probabilistic models for link prediction, but only works on small data, so sparsity is not addressed there. In addition, SPF [15] and RPF [30] only incorporate the binary relations (0 and 1) of users; in contrast, our method can weight the relations of both users and items. 
The Gamma-Poisson models in [19, 20] incorporate\ngeneral attributes for modeling large and sparse data but cannot model dynamic data. mGDMF is\nthe only PF model that incorporates user/item metadata, embedded with more general attributes\n(e.g., categorical attributes) to work with dynamic data. First, the user/item metadata is modeled\nas Gamma priors of the latent user preference/latent item attractiveness in the static portion (see\n1.(a).ii/1.(b).ii in the mGDMF generative process). Second, these latent user/item features are further\ngiven as Gamma priors for the user/item global static factors (as shown in 1.(a).iii/1.(b).iii). Lastly,\n\n2\n\n\fwe aggregate the above static user/item latent features with the user/item local dynamic factors. This\niterative integration process adds more weight to similar items/users, which cannot be captured by\nany existing PF models.\n\n3 The mGDMF Model\n\nFor a recommendation problem, we assume the availability of the rating matrix Yt, where each entry\nis the rating given by user u to item i at time slice t and 0 indicates no rating, and the user metadata\nHU and item metadata HI. The time slice corresponds to the period of time (e.g., months) when\nusers place ratings on items.\nFurther, we assume the rating matrix Yt at time slice t follows the Poisson distribution and can\nbe factorized to vectors representing K latent user preferences and vectors representing K latent\nitem attractiveness. The latent user preference vectors and the latent item attractiveness vectors are\nassumed to combine both static and dynamic portions. The static portions (\u03b8uk for user and \u03b2ik for\nitem) represent the time-independent aspects of users/items, while the dynamic portions (\u03b8uk,t for\nuser and \u03b2ik,t for item) capture the time-evolving aspects of users/items.\nMetadata Integration. 
The static portions (θuk and βik) capture the global stationary (i.e., global static) factors for user u and item i, which are not influenced by time and are assumed to follow the Gamma distribution. We further assume a Gamma distribution for each user's latent preference, ξu, and each item's latent attractiveness, ηi. The influence of the user (item) metadata is captured by setting the second parameter (i.e., the rate) of the Gamma distribution of a user's latent preference (an item's latent attractiveness) to the product of the weights of the attribute values that appear in the user's (item's) metadata. The weight (i.e., the importance) of each user attribute, hum, is given a Gamma prior. The weight of user attribute hum affects the preference of a user, ξu, and further affects the static portion of the user representation, θuk, if and only if f^u_{u,m} = 1. We note that f^u_{u,m} is a binary value that indicates whether user u has attribute m (i.e., f^u_{u,m} = 1) or not (i.e., f^u_{u,m} = 0). hum measures the degree of influence of each user attribute; e.g., a user's 'location' may have less influence than the user's 'age' on movie ratings. The weight of an item attribute, hin, is also assumed to follow a Gamma distribution. hin affects the item's latent attractiveness, ηi, and further affects the static portion of the item representation, βik, when item i has attribute n (i.e., f^i_{i,n} = 1).
Dynamic Modelling. The dynamic portions (θuk,t and βik,t) serve as the local non-stationary (i.e., local dynamic) factors to capture the evolution of users and items over time. As shown in [14], it is possible to define a Gamma-Markov chain in a straightforward way by θuk,t ∼ Gamma(aθ, θuk,t−1/aθ). The full conditional distribution p(θuk,t|θuk,t−1, θuk,t+1) is conjugate. However, it is not possible to
However, it is not possible to\nattain a positive correlation between \u03b8uk,t and \u03b8uk,t\u22121 since E[\u03b8uk,t] = 1/\u03b8uk,t\u22121. Hence, we build\na chain that smoothly evolves over time by adding the auxiliary variables \u03bbuk,t between \u03b8uk,t and\n\u03b8uk,t\u22121. The auxiliary variables make E[\u03b8uk,t] = \u03b8uk,t\u22121. Hence, \u03b8uk,t increases/decreases when\n\u03b8uk,t\u22121 increases/decreases. Operations similar to the above are also taken on the item\u2019s dynamic\nportions.\nThe generative process of mGDMF is presented below and the graphical model of mGDMF can be\nfound in the supplementary.\n\n1. Metadata Integration:\n\n(a) For each user:\n\ni. Draw the weight of mth attribute in user metadata hum \u223c Gamma(a(cid:48), b(cid:48))\n\nii. Draw latent user preference \u03beu \u223c Gamma(a,(cid:81)M\nii. Draw latent item attractiveness \u03b7i \u223c Gamma(c,(cid:81)N\n\niii. Draw global static factor \u03b8uk \u223c Gamma(b, \u03beu)\ni. Draw the weight of nth attribute in item metadata hin \u223c Gamma(c(cid:48), d(cid:48))\n\nm=1 huf uu,m\n\nm\n\n)\n\n(b) For each item:\n\niii. Draw global static factor \u03b2ik \u223c Gamma(d, \u03b7i)\n\nn=1 hif ii,n\n\nn\n\n)\n\n2. Dynamic Modeling:\n\n(a) For each user:\n\n3\n\n\fi. Draw initialized state of local dynamic factor \u03b8uk,1 \u223c Gamma(a\u03b8, a\u03b8b\u03b8)\nii. For each time slice t > 1:\nA. Draw auxiliary variable \u03bbuk,t\u22121 \u223c Gamma(a\u03bb, a\u03bb\u03b8uk,t\u22121)\nB. Draw local dynamic factor \u03b8uk,t \u223c Gamma(a\u03b8, a\u03b8\u03bbuk,t\u22121)\n\n(b) For each item:\n\ni. Draw initialized state of local dynamic factor \u03b2ik,1 \u223c Gamma(a\u03b2, a\u03b2b\u03b2)\nii. For each time slice t > 1:\nA. Draw auxiliary variable \u03b9ik,t\u22121 \u223c Gamma(a\u03b9, a\u03b9\u03b2ik,t\u22121)\nB. Draw local dynamic factor \u03b2ik,t \u223c Gamma(a\u03b2, a\u03b2\u03b9ik,t\u22121)\n\n3. 
For each rating:
   (a) Draw yui,t ∼ Poisson(Σ_k (θuk,t + θuk)(βik,t + βik))

As a result, mGDMF effectively models both the static and dynamic characteristics of user preference and item attractiveness in the context of having sparse ratings on users/items.
Handling massive data. We further describe how mGDMF models massive data. We calculate the probability of rating yui,t by user u on item i at time slice t as:

p(yui,t | θu,t, θu, βi,t, βi) = (Σ_k (θuk,t + θuk)(βik,t + βik))^{yui,t} exp{−Σ_k (θuk,t + θuk)(βik,t + βik)} / yui,t!   (1)

When yui,t = 0, it does not affect the probability. Owing to the Poisson factorization [25], mGDMF does not require the costly optimization over all (including missing) entries used in classical MF [38]. The probability depends only on θu,t, θu, βi,t and βi.
Better prediction with metadata integration. Richer priors are provided by integrating user metadata into the representation of the user's latent preference ξu, as in Eq. (2). This enhanced latent preference representation ξu in turn provides richer priors for the user's global static factor θuk.

ξu | θ ∼ Gamma(a + Kb, ∏_{m=1}^{M} hum^{f^u_{u,m}} + Σ_k θuk)   (2)

The global static factors then affect the time-sensitive local dynamic factors. Similarly, we integrate item metadata into the representation of latent item attractiveness and its evolution. As a result, mGDMF integrates both observable user/item metadata and latent static/dynamic portions.

4 Stochastic Variational Inference for mGDMF

We first apply mean-field Variational Inference (VI) [42] to the approximate inference of the posterior distribution, which is shown [42] to be more efficient than methods such as MCMC [23] for large-scale probabilistic models. Mean-field VI chooses a family of variational distributions over all hidden variables. 
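Before turning to inference, the generative process of Section 3 can be illustrated with a small numpy simulation. This is our own sketch, not the authors' code: all dimensions and hyper-parameter values below are assumed toy choices for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

U, I, K, T = 50, 40, 5, 8      # users, items, latent components, time slices
M, N = 3, 4                    # number of user / item metadata attributes
a = b = c = d = 0.3            # static-portion hyper-parameters (assumed)
a_p, b_p, c_p, d_p = 0.1, 0.1, 0.1, 0.1         # a', b', c', d' (assumed)
a_th = a_lam = a_be = a_io = b_th = b_be = 1.0  # chain hyper-parameters

fu = rng.integers(0, 2, size=(U, M))  # binary indicators f^u_{u,m}
fi = rng.integers(0, 2, size=(I, N))  # binary indicators f^i_{i,n}

# 1. Metadata integration (numpy draws Gamma(shape, scale), so scale = 1/rate)
hu = rng.gamma(a_p, 1.0 / b_p, size=M)              # attribute weights h^u_m
xi = rng.gamma(a, 1.0 / np.prod(hu ** fu, axis=1))  # latent preference xi_u
theta_s = rng.gamma(b, 1.0 / xi[:, None], size=(U, K))  # static theta_uk
hi = rng.gamma(c_p, 1.0 / d_p, size=N)
eta = rng.gamma(c, 1.0 / np.prod(hi ** fi, axis=1))
beta_s = rng.gamma(d, 1.0 / eta[:, None], size=(I, K))  # static beta_ik

# 2. Dynamic modelling: Gamma-Markov chains smoothed by auxiliary variables,
#    so that E[theta_{uk,t}] = theta_{uk,t-1} (and likewise for items)
theta_t = np.empty((T, U, K))
beta_t = np.empty((T, I, K))
theta_t[0] = rng.gamma(a_th, 1.0 / (a_th * b_th), size=(U, K))
beta_t[0] = rng.gamma(a_be, 1.0 / (a_be * b_be), size=(I, K))
for t in range(1, T):
    lam = rng.gamma(a_lam, 1.0 / (a_lam * theta_t[t - 1]))  # lambda_{uk,t-1}
    theta_t[t] = rng.gamma(a_th, 1.0 / (a_th * lam))
    iota = rng.gamma(a_io, 1.0 / (a_io * beta_t[t - 1]))    # iota_{ik,t-1}
    beta_t[t] = rng.gamma(a_be, 1.0 / (a_be * iota))

# 3. Ratings: Poisson over summed products of (static + dynamic) factors
rate = np.einsum('tuk,tik->tui', theta_t + theta_s, beta_t + beta_s)
y = rng.poisson(rate)
```

The resulting tensor `y` of shape (T, U, I) plays the role of the rating tables Yt; note how every factor stays non-negative by construction.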
The posteriors of all variational distributions are then approximated by tuning the parameters to minimize the Kullback-Leibler divergence to the true posterior.
Given the rating tables Yt and the user/item metadata HU and HI, we compute the posterior distributions of the weight of each user attribute in the metadata, hum, the weight of each item attribute in the metadata, hin, the user's latent preference ξu (expressed via the static global factor θuk and the local dynamic factor θuk,t), and the item's latent attractiveness ηi (represented via the static global factor βik and the local dynamic factor βik,t).
To ensure the conjugacy of the model structure, inspired by [25, 21, 49, 26], the rating yui,t is replaced with the auxiliary latent variables zui,k,t ∼ Poisson((θuk,t + θuk)(βik,t + βik)). With the additive property of the Poisson distribution, yui,t is expressed as yui,t = Σ_k zui,k,t. The mean-field family then assumes each distribution is independent of the others and is governed by its own distribution:

q(hu, hi, ξ, η, θ, β, λ, ι, z) = ∏_m q(hum|ζm) ∏_n q(hin|ρn) ∏_u q(ξu|κu) ∏_i q(ηi|τi) ∏_{u,k} q(θuk|νuk) ∏_{i,k} q(βik|µik) ∏_{u,k,t} q(θuk,t|νuk,t) ∏_{i,k,t} q(βik,t|µik,t) ∏_{u,k,t} q(λuk,t|γuk,t) ∏_{i,k,t} q(ιik,t|ωik,t) ∏_{u,i,t,k} q(zui,t,k|φui,t,k)   (3)

We use the class of conditionally conjugate priors for hum, hin, ξu, ηi, θuk, βik, θuk,t, λuk,t, βik,t, ιik,t and zui,t,k to update the variational parameters {ζ, ρ, κ, τ, ν, µ, ν, γ, µ, ω, φ}. 
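The closed-form updates that follow repeatedly use two expectations under a Gamma variational factor q(x) = Gamma(shp, rte): E[x] = shp/rte and E[log x] = Ψ(shp) − log(rte). A minimal sketch of these quantities (our own helper functions, not the authors' code; Ψ is approximated numerically here via the derivative of log-Gamma):

```python
from math import lgamma, log

def digamma(x, h=1e-6):
    """Numerical Psi(x): central-difference derivative of log Gamma(x)."""
    return (lgamma(x + h) - lgamma(x - h)) / (2 * h)

def gamma_expectation(shp, rte):
    """E[x] for x ~ Gamma(shp, rte) under the shape/rate parameterization."""
    return shp / rte

def gamma_log_expectation(shp, rte):
    """E[log x] = Psi(shp) - log(rte); terms of this form build the
    exp{E[log theta]} and exp{E[log beta]} factors in the update of phi."""
    return digamma(shp) - log(rte)

# e.g. for a Gamma(2, 4) variational factor:
ex = gamma_expectation(2.0, 4.0)        # = 0.5
elog = gamma_log_expectation(2.0, 4.0)  # = Psi(2) - log 4 ~= -0.9635
```

In practice one would use `scipy.special.digamma` instead of the finite-difference approximation; the approximation keeps the sketch dependency-free.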
For the Gamma distribution, we update both hyper-parameters: shape and rate.

Table 1: Latent Variables, Type, Variational Variables and Variational Updates for Users. Similar variables for items (i.e., hin, ηi, βik, βik,t, ιik,t) can be found in the supplementary material. ℵm is the number of users having the mth attribute, K is the number of latent components, and Ψ(.) is the digamma function. The Gamma distribution is parameterized by shape (shp) and rate (rte).

hum (Gamma; variational variables ζ^shp_m, ζ^rte_m):
    ζ^shp_m = a′ + ℵm a
    ζ^rte_m = b′ + Σ_{u: f^u_{u,m}=1} (κ^shp_u / κ^rte_u) ∏_{m′≠m} (ζ^shp_{m′} / ζ^rte_{m′})^{f^u_{u,m′}}

ξu (Gamma; κ^shp_u, κ^rte_u):
    κ^shp_u = a + Kb
    κ^rte_u = ∏_{m=1}^{M} (ζ^shp_m / ζ^rte_m)^{f^u_{u,m}} + Σ_k ν^shp_{uk} / ν^rte_{uk}

zui,t,k (Mult; φui,t,k):
    φui,t,k ∝ (exp{Ψ(ν^shp_{uk,t}) − log(ν^rte_{uk,t})} + exp{Ψ(ν^shp_{uk}) − log(ν^rte_{uk})}) · (exp{Ψ(µ^shp_{ik,t}) − log(µ^rte_{ik,t})} + exp{Ψ(µ^shp_{ik}) − log(µ^rte_{ik})})

θuk (Gamma; ν^shp_{uk}, ν^rte_{uk}):
    ν^shp_{uk} = b + Σ_{i,t} yui,t φui,t,k
    ν^rte_{uk} = κ^shp_u / κ^rte_u + Σ_t Σ_i (µ^shp_{ik,t} / µ^rte_{ik,t} + µ^shp_{ik} / µ^rte_{ik})

θuk,t (Gamma; ν^shp_{uk,t}, ν^rte_{uk,t}):
    ν^shp_{uk,t} = aθ + aλ + Σ_i yui,t φui,t,k
    ν^rte_{uk,1} = aθbθ + aλ γ^shp_{uk,1} / γ^rte_{uk,1} + Σ_i (µ^shp_{ik,1} / µ^rte_{ik,1} + µ^shp_{ik} / µ^rte_{ik})
    ν^rte_{uk,t (t>1)} = aθ γ^shp_{uk,t−1} / γ^rte_{uk,t−1} + aλ γ^shp_{uk,t} / γ^rte_{uk,t} + Σ_i (µ^shp_{ik,t} / µ^rte_{ik,t} + µ^shp_{ik} / µ^rte_{ik})

λuk,t (Gamma; γ^shp_{uk,t}, γ^rte_{uk,t}):
    γ^shp_{uk,t} = aλ + aθ
    γ^rte_{uk,t} = aλ ν^shp_{uk,t} / ν^rte_{uk,t} + aθ ν^shp_{uk,t+1} / ν^rte_{uk,t+1}

With the mean-field VI, coordinate ascent is used to iteratively optimize each variational parameter while holding the others fixed [35]. The full variational parameter updates are shown in Table 1, and the batch algorithm can be found in the supplementary material.
Stochastic Algorithm. Mean-field VI iterates to update the variational parameters by involving the entire dataset at each iteration until convergence to a local optimum, which can be computationally intensive for large data.
We thus adopt stochastic optimization by sampling a data point from the ratings, yui,t of user u on item i, updating its local parameters as in batch inference, and then updating the global variables (similar to [29]). For example, to update the shape of a user's dynamic factors, we form the intermediate shape of the user's dynamic factors with the sampled rating's optimized local parameters as follows:

ν^shp(imd)_{uk,t} = aθ + aλ + I · yui,t φui,t,k   (4)

where I is the total number of items in the dataset.
We then update the global variational parameters by taking a step in the direction of the stochastic natural gradient:

ν^shp(iter+1)_{uk,t} = (1 − σiter) ν^shp(iter)_{uk,t} + σiter ν^shp(imd)_{uk,t}   (5)

where σiter > 0 is the step size at iteration iter. As shown in [7, 29], to ensure convergence, one possible choice of σiter is (iter0 + iter)^−κ for iter0 > 0 and κ ∈ (0.5, 1]. 
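A schematic of this stochastic update of Eqs. (4)-(5) for a single global shape parameter, with the (iter0 + iter)^−κ step-size schedule (all function and variable names here are our own illustration, not the authors' implementation):

```python
def step_size(it, iter0=10000, kappa=0.7):
    """Robbins-Monro step size (iter0 + it)^(-kappa), kappa in (0.5, 1]."""
    return (iter0 + it) ** (-kappa)

def svi_shape_update(nu_shp, a_theta, a_lambda, y_uit, phi_uitk, n_items, it):
    """One stochastic update of the shape of a user's dynamic factor.

    Intermediate estimate (Eq. 4): scale the sampled rating's contribution
    by the total number of items I, then take a weighted step (Eq. 5).
    """
    nu_imd = a_theta + a_lambda + n_items * y_uit * phi_uitk   # Eq. (4)
    rho = step_size(it)
    return (1.0 - rho) * nu_shp + rho * nu_imd                 # Eq. (5)

# repeated noisy updates pull the iterate toward the intermediate estimate
nu = 1.0
for it in range(1, 2001):
    nu = svi_shape_update(nu, a_theta=1.0, a_lambda=1.0,
                          y_uit=3, phi_uitk=0.2, n_items=100, it=it)
```

With the paper's reported settings (iter0 = 10,000, κ = 0.7), the step sizes start tiny and decay slowly, which is why convergence takes many iterations but each one touches only a single sampled rating.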
iter0 and κ are called the learning rate delay and the learning rate power, respectively.

Algorithm 1 SVI for mGDMF
    Initialize {ζ, ρ, κ, τ, ν, µ, ν, µ, γ, ω, φ}.
    Set K: # latent components, U: # users, I: # items, iter0 and κ.
    repeat
        for each time slice t = 1...T do
            Sample a rating yui,t uniformly from the dataset.
            Update the local variational parameter of the multinomial parameter φ.
            Update all intermediate variational parameters as in Eq. (4).
            Update all global variational parameters as in Eq. (5).
            Update the learning rate σiter.
        end for
    until convergence

The updating of the other global variational parameters is similar to that of the user's dynamic factors. The SVI algorithm for mGDMF is presented in Algorithm 1.
Computational efficiency. SVI is more efficient than batch inference algorithms when working with massive data. The computational cost per iteration of the batch algorithm is O((U0 + I0)K), where U0 and I0 are the numbers of non-zero observations in the rating matrix. This is already more efficient than classic PMF, whose computational cost is O((U + I)K) (U and I are the total numbers of user/item observations, respectively). The computational cost of the SVI algorithm per iteration is O((Us + Is)K), where Us and Is are the numbers of non-zero observations sampled from users and items, respectively.

5 Experiments

Here, mGDMF is evaluated in terms of its capability to model dynamic data and data with sparse user/item ratings, and its efficiency in handling large data.

5.1 Datasets

GDMF/mGDMF are tested on the following five public dynamic datasets with massive and sparse/cold-start ratings and some metadata.
Netflix-Time. 
A procedure similar to [37, 16, 34] is taken to obtain a subset of the Netflix Prize data [4] with only active users and movies between 1998 and 2005. It contains 10.4K users, 6.5K movies and 2.5M non-missing ratings (from 1 to 5) over 16 time slices (each of 3 months), with the metadata: the 'year of release' of the movies.
Netflix-Full. We also fit our models on the whole Netflix dataset: 225K users, 14K movies and 6.9M observations over 14 time slices (each of 6 months), with the metadata: the 'year of release' of the movies. This dataset challenges inference and analysis since its rating distribution is extremely long-tailed and the users are very inactive. It tests the effect of modeling sparse items/users when a large number of items are associated with only a few (or no) ratings from users.
Yelp-Active. A subset of the Yelp Academic Challenge data is obtained similarly to [34]: 0.9K active customers and 6.7K business units between 2008 and 2015 over 12 time slices (each of 6 months). It includes the user metadata 'year join' and 'average star', and the item metadata 'location', 'star', 'take out' (0 or 1), and 'parking' (0 or 1).
LFM-Tracks. 
It contains the number of times a user listened to a song during a given time period [12]: 16 time slices of 0.9K users and 1M tracks (i.e., songs), similar to [34]. User metadata includes 'age' (partitioned into ranges: 1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44", 45: "45-49", 50: "50-55", and 56: "56+"), 'gender', 'country' and 'sign up year'; item metadata includes the 'genre' of the music derived from the 'tag' of each track (e.g., "rock", "pop" or "alternative").
LFM-Bands. We consider bands instead of tracks as items in LFM-Tracks: 0.9K users and 107K bands with the same metadata as LFM-Tracks.

5.2 Baseline Methods

We compare mGDMF with both state-of-the-art static and dynamic PF models in handling dynamic data. mGDMF without metadata integration is a Gamma-Poisson Dynamic Matrix Factorization (GDMF) model, used to test the effect of metadata integration. GDMF replaces the rate of the Gamma distribution of the latent user preference ξu with the hyper-parameter a, and that of the item's latent attractiveness ηi with the hyper-parameter c, instead of the product of the user/item attribute weights.
Static Models. As shown in [25], HPF outperforms baselines including Non-negative Matrix Factorization [5], LDA [6], PMF [39], and basic PF [8, 21, 13], while HCPF [3] is the latest static model in the PF family. We thus compare mGDMF with two different versions of HPF and HCPF to demonstrate the benefits of mGDMF in modeling the time-evolving effect. Similar to [16], HPF-all and HCPF-all are trained on all training ratings, while HPF-last and HCPF-last are trained by using the last time slice in the training set as the observations.
Dynamic Models. The two most recently reported PF models for recommendation are dPF [16] and DCPF [34]. 
dPF was shown to outperform state-of-the-art dynamic collaborative filtering algorithms, specifically BPTF [44] and TimeSVD++ [36]. Therefore, we show the advantage of mGDMF, with its metadata integration and Gamma-Poisson structure, by comparing it with dPF and DCPF.

5.3 Experiment Settings and Evaluation

Initialization. For the static portions, we set a = b = c = d = 0.3 in the same way as in HPF. The metadata hyper-parameters a′, b′, c′ and d′ are set to a small value, 0.1, so that the user/item attribute weights automatically grow over time. We also set aθ = aλ = aβ = bθ = bβ = aι = 1 to keep the chains small at the beginning. We test a wide range of numbers of latent components K, from 10 to 200, and choose the best, K = 50, for mGDMF/GDMF. For the SVI hyper-parameters, we assign 10,000 to the learning rate delay iter0 and 0.7 to the learning rate power κ, similar to [34] and [3].
Evaluation Metrics. We hold out the last time slice for testing and keep the previous slices for training, i.e., at each slice t, the observations are the data before t (i.e., 1 ... t − 1) and the test set includes the ratings at t. The results are the average over all slices t from 2 to the maximal slice T. We then randomly sample and assign 5% of the test set for validation, similar to [16, 34]. The top-N recommendations are obtained in training w.r.t. the highest prediction score. In testing, we compute precision@N, which measures the fraction of relevant items in a user's top-N recommendations, and recall@N, which is the fraction of the testing items that are present in the top-N recommendations. We also calculate the same evaluation metrics as DCPF: Normalized Discounted Cumulative Gain (NDCG) [33] and the Area Under the ROC Curve (AUC) [28].
Prediction. 
We test a practical scenario in which, given past ratings, we predict the ratings at the current time slice t; items are ranked by their posterior-expected Poisson parameters as follows:

scoreui,t = Σ_k Eposterior[(θuk,t + θuk)(βik,t + βik)]   (6)

Convergence. Convergence is measured by computing the prediction accuracy on the validation set. mGDMF stops when the change of prediction accuracy w.r.t. log likelihood is less than 0.0001%.

5.4 Results

Effect of metadata and dynamic data modeling. Figure 1 reports the results w.r.t. Precision@50 and Recall@50 of mGDMF together with GDMF (without metadata integration), compared to DCPF, dPF and four static models: HCPF-all, HCPF-last, HPF-all and HPF-last. It is clear that the dynamic models (mGDMF, GDMF, DCPF and dPF) are more effective than the static ones (HCPF-all, HCPF-last, HPF-all and HPF-last) on all datasets. Our models, both with and without metadata, perform the best of the dynamic models. mGDMF gains as much as 10% improvement on Netflix-Full, which has an extremely long-tailed rating distribution (i.e., many sparse items/users). mGDMF and GDMF make a large performance difference on Yelp-Active, which has the richest metadata. GDMF efficiently models dynamic data by gaining an average of 5% over DCPF and 4.6% over dPF w.r.t. Precision@50 on the five datasets. mGDMF effectively integrates metadata by further gaining an average of 1.2% on top of our GDMF. The results are consistent w.r.t. NDCG/AUC, as shown in Table 2, where mGDMF gains maximally 240.69% and minimally 38.38% NDCG improvement, and maximally 27.12% and minimally 2.76% AUC improvement, over baselines on Yelp-Active.

Figure 1: Top-50 Recommendations Compared with Baselines. (a) Precision@50; (b) Recall@50.

Table 2: Predictive Performance on Five Datasets w.r.t. 
NDCG and AUC.

            Netflix-Time     Netflix-Full     Yelp-Active      LFM-Tracks       LFM-Bands
            NDCG    AUC      NDCG    AUC      NDCG    AUC      NDCG    AUC      NDCG    AUC
mGDMF       0.389   0.9145   0.403   0.9321   0.494   0.8650   0.310   0.8245   0.367   0.8217
GDMF        0.367   0.9121   0.398   0.9320   0.416   0.8512   0.275   0.8101   0.354   0.8139
DCPF        0.293   0.9023   0.315   0.8991   0.357   0.8418   0.231   0.8098   0.275   0.8011
dPF         0.257   0.9012   0.301   0.8901   0.332   0.8321   0.210   0.8019   0.298   0.8122
HCPF-all    0.241   0.8012   0.245   0.8370   0.243   0.8032   0.209   0.7010   0.213   0.7121
HCPF-last   0.183   0.7423   0.201   0.7600   0.172   0.7312   0.132   0.5893   0.160   0.6101
HPF-all     0.231   0.8035   0.250   0.8124   0.248   0.8130   0.179   0.7084   0.184   0.7013
HPF-last    0.162   0.7213   0.198   0.7540   0.145   0.6810   0.143   0.6050   0.141   0.5982
δmin(%)     32.76   1.35     27.94   3.67     38.38   2.76     34.20   1.82     23.15   1.70
δmax(%)     140.12  26.78    103.54  23.62    240.69  27.12    134.85  44.83    160.28  37.36

Effect of handling sparse users/items and the 'cold-start' problem. We report the 10 users with the most precisely recommended items and the percentage of precisely recommended sparse items in the top-100 recommendations to compare mGDMF and GDMF with DCPF on LFM-Tracks, which has the most dynamic and richest metadata. Item sparsity is calculated as the fraction of non-missing ratings for that item over the total number of users (rows). For items with < 6% ratings (i.e., sparse items), Figure 2 shows that mGDMF recommends more of these sparse items to the 10 users (on average 7.6% of items per user) than DCPF (on average 3.1%). For items with sparsity < 1%, mGDMF recommends on average 1.6% of items per user compared to 0.2% by DCPF, attributed to the richer priors and to mGDMF's metadata integration and aggregation of the static and dynamic portions. Further, while DCPF cannot recommend any items with sparsity = 0% (i.e., cold-start items), mGDMF recommends on average 0.5% of items. 
In addition, Figure 2 shows that GDMF recommends more sparse items than DCPF and is also more efficient, since the static portions aggregate with the dynamic portions. However, GDMF recommends fewer of these items than mGDMF, especially items with sparsity = 0% (i.e., cold-start items). This shows our model's advantage in incorporating metadata.
Case study of mGDMF-based recommendation. We illustrate the contributions of mGDMF w.r.t. making recommendations on the Last.fm data, shown in Figure 3, where two users 'U270' and 'U437' have the same metadata ('age', 'gender', 'country' and 'sign-up year') and hence similar hobbies, as shown on the right of Figure 3. From the dynamic perspective, since both users change their interest from 'pop' to 'rock' music, it is unreasonable to continue to recommend pop music to user U270. However, the static models continue to recommend pop music because the numbers of times the users listened to pop and rock are similar, as shown on the right of Figure 3. mGDMF makes better recommendations (8 out of 10 are in the 'rock' category and no 'pop' track is recommended). Further, U270 has not listened to 'Zombie' in the past, but we can still precisely recommend 'Zombie' because it is in the list of tracks listened to by U437, a user with similar metadata.

Figure 2: Percentage (%) of Sparse Items Recommended Precisely for 10 Users by (a) mGDMF, (b) GDMF and (c) DCPF.

Figure 3: Analysis of two users 'U270' and 'U437' with the same metadata in Last.fm. The number of times the users listened to two 'rock' and 'pop' tracks over 16 time slices is shown in (a). The number of times the 'Zombie' track was listened to by 'U270' and 'U437' over 16 time slices is shown in (b).
The top-10 precise recommendations by mGDMF are shown in (c). The distributions of the number of times that U270 and U437 listened to the top-10 'rock' and 'pop' tracks are shown in (d) and (e).

6 Conclusions

We proposed a novel dynamic PF model, Gamma-Poisson Dynamic Matrix Factorization with Metadata Influence (mGDMF), to effectively and efficiently address three major challenges in real-life recommendation: massive data with sparse and evolving ratings. mGDMF significantly outperforms the state-of-the-art static and dynamic models on five large datasets. mGDMF can be further extended to tackle massive, sparse and evolving data by involving time-dependent metadata for scalable recommendation, and to tackle challenges in other applications, such as context-aware chatbots, by involving textual and sequential information.

References
[1] A. Acharya, J. Ghosh, and M. Zhou. Nonparametric Bayesian factor analysis for dynamic count matrices. In AISTATS, pages 1-9, 2015.
[2] A. Acharya, D. Teffer, J. Henderson, M. Tyler, M. Zhou, and J. Ghosh. Gamma process Poisson factorization for joint modeling of network and documents. In ECML PKDD, pages 283-299. Springer, 2015.
[3] M. Basbug and B. Engelhardt. Hierarchical compound Poisson factorization. In ICML, pages 1795-1803, 2016.
[4] J. Bennett, S. Lanning, et al. The Netflix prize. In Proceedings of KDD Cup and Workshop, volume 2007, page 35. New York, NY, USA, 2007.
[5] M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1):155-173, 2007.
[6] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3(Jan):993-1022, 2003.
[7] L. Bottou. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9):142, 1998.
[8] J. Canny.
GaP: a factor model for discrete data. In SIGIR, pages 122-129. ACM, 2004.
[9] L. Cao. Coupling learning of complex interactions. J. Information Processing and Management, 51(2):167-186, 2015.
[10] L. Cao. Non-iid recommender systems: A review and framework of recommendation paradigm shifting. Engineering, 2(2):212-224, 2016.
[11] L. Cao, Y. Ou, and P. S. Yu. Coupled behavior analysis with applications. IEEE TKDE, 24(8):1378-1392, 2012.
[12] Ò. Celma Herrada. Music recommendation and discovery in the long tail. 2009.
[13] A. T. Cemgil. Bayesian inference for nonnegative matrix factorisation models. Computational Intelligence and Neuroscience, 2009, 2009.
[14] A. T. Cemgil and O. Dikmen. Conjugate Gamma Markov random fields for modelling nonstationary sources. In ICA, pages 697-705. Springer, 2007.
[15] A. J. Chaney, D. M. Blei, and T. Eliassi-Rad. A probabilistic model for using social networks in personalized item recommendation. In RecSys, pages 43-50. ACM, 2015.
[16] L. Charlin, R. Ranganath, J. McInerney, and D. M. Blei. Dynamic Poisson factorization. In RecSys, pages 155-162. ACM, 2015.
[17] S. Chatzis. Dynamic Bayesian probabilistic matrix factorization. In AAAI, pages 1731-1737, 2014.
[18] F. C. T. Chua, R. J. Oentaryo, and E.-P. Lim. Modeling temporal adoptions using dynamic matrix factorization. In ICDM, pages 91-100. IEEE, 2013.
[19] T. D. T. Do and L. Cao. Coupled Poisson factorization integrated with user/item metadata for modeling popular and sparse ratings in scalable recommendation. In AAAI, pages 2918-2925, 2018.
[20] T. D. T. Do and L. Cao. Metadata-dependent infinite Poisson factorization for efficiently modelling sparse and large matrices in recommendation. In IJCAI, pages 5010-5016, 2018.
[21] D. B. Dunson and A. H. Herring.
Bayesian latent variable models for mixed discrete outcomes. Biostatistics, 6(1):11-25, 2005.
[22] X. Fan, R. Y. Da Xu, L. Cao, and Y. Song. Learning nonparametric relational models by conjugately incorporating node information in a network. IEEE Transactions on Cybernetics, 47(3):589-599, 2017.
[23] W. R. Gilks, S. Richardson, and D. Spiegelhalter. Markov Chain Monte Carlo in Practice. CRC Press, 1995.
[24] C. Gong et al. Deep dynamic Poisson factorization model. In NIPS, pages 1665-1673, 2017.
[25] P. Gopalan, J. M. Hofman, and D. M. Blei. Scalable recommendation with hierarchical Poisson factorization. In UAI, pages 326-335, 2015.
[26] P. Gopalan, F. J. Ruiz, R. Ranganath, and D. Blei. Bayesian nonparametric Poisson factorization for recommendation systems. In AISTATS, pages 275-283, 2014.
[27] P. K. Gopalan, L. Charlin, and D. Blei. Content-based recommendations with Poisson factorization. In NIPS, pages 3176-3184, 2014.
[28] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29-36, 1982.
[29] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. JMLR, 14(1):1303-1347, 2013.
[30] S. A. Hosseini, K. Alizadeh, A. Khodadadi, A. Arabzadeh, M. Farajtabar, H. Zha, and H. R. Rabiee. Recurrent Poisson factorization for temporal recommendation. In KDD, pages 847-855. ACM, 2017.
[31] C. Hu, P. Rai, and L. Carin. Topic-based embeddings for learning from large knowledge graphs. In AISTATS, pages 1133-1141, 2016.
[32] L. Hu, L. Cao, J. Cao, Z. Gu, G. Xu, and J. Wang. Improving the quality of recommendations for users and items in the tail of distribution. ACM Transactions on Information Systems (TOIS), 35(3):25, 2017.
[33] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques.
TOIS, 20(4):422-446, 2002.
[34] G. Jerfel, M. Basbug, and B. Engelhardt. Dynamic collaborative filtering with compound Poisson factorization. In AISTATS, pages 738-747, 2017.
[35] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.
[36] Y. Koren. Collaborative filtering with temporal dynamics. Communications of the ACM, 53(4):89-97, 2010.
[37] B. Li, X. Zhu, R. Li, C. Zhang, X. Xue, and X. Wu. Cross-domain collaborative filtering over time. In IJCAI, pages 2293-2298. AAAI Press, 2011.
[38] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. JMLR, 11(Jan):19-60, 2010.
[39] A. Mnih and R. R. Salakhutdinov. Probabilistic matrix factorization. In NIPS, pages 1257-1264, 2008.
[40] A. Schein, H. Wallach, and M. Zhou. Poisson-gamma dynamical systems. In NIPS, pages 5005-5013, 2016.
[41] J. Z. Sun, D. Parthasarathy, and K. R. Varshney. Collaborative Kalman filtering for dynamic matrix factorization. IEEE Trans. Signal Processing, 62(14):3499-3509, 2014.
[42] M. J. Wainwright, M. I. Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1-2):1-305, 2008.
[43] W. Wei, J. Yin, J. Li, and L. Cao. Modeling asymmetry and tail dependence among multiple variables by using partial regular vine. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 776-784. SIAM, 2014.
[44] L. Xiong, X. Chen, T.-K. Huang, J. Schneider, and J. G. Carbonell. Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In SDM, pages 211-222. SIAM, 2010.
[45] J. Xu and L. Cao. Vine copula-based asymmetry and tail dependence modeling.
In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 285-297. Springer, 2018.
[46] W. Zhang and J. Wang. A collective Bayesian Poisson factorization model for cold-start local event recommendation. In SIGKDD, pages 1455-1464. ACM, 2015.
[47] Y. Zhang, Y. Zhao, L. David, R. Henao, and L. Carin. Dynamic Poisson factor analysis. In ICDM, pages 1359-1364. IEEE, 2016.
[48] H. Zhao, L. Du, and W. Buntine. Leveraging node attributes for incomplete relational data. In ICML, pages 4072-4081, 2017.
[49] M. Zhou, L. Hannah, D. B. Dunson, and L. Carin. Beta-negative binomial process and Poisson factor analysis. In AISTATS, pages 1462-1471, 2012.