{"title": "Coevolutionary Latent Feature Processes for Continuous-Time User-Item Interactions", "book": "Advances in Neural Information Processing Systems", "page_first": 4547, "page_last": 4555, "abstract": "Matching users to the right items at the right time is a fundamental task in recommendation systems. As users interact with different items over time, users' and items' feature may evolve and co-evolve over time. Traditional models based on static latent features or discretizing time into epochs can become ineffective for capturing the fine-grained temporal dynamics in the user-item interactions. We propose a coevolutionary latent feature process model that accurately captures the coevolving nature of users' and items' feature. To learn parameters, we design an efficient convex optimization algorithm with a novel low rank space sharing constraints. Extensive experiments on diverse real-world datasets demonstrate significant improvements in user behavior prediction compared to state-of-the-arts.", "full_text": "Coevolutionary Latent Feature Processes for\n\nContinuous-Time User-Item Interactions\n\nYichen Wang\u21e7, Nan Du\u21e4, Rakshit Trivedi\u21e7, Le Song\u21e7\n\n\u21e4Google Research\n\n\u21e7College of Computing, Georgia Institute of Technology\n\n{yichen.wang, rstrivedi}@gatech.edu, dunan@google.com\n\nlsong@cc.gatech.edu\n\nAbstract\n\nMatching users to the right items at the right time is a fundamental task in recom-\nmendation systems. As users interact with different items over time, users\u2019 and\nitems\u2019 feature may evolve and co-evolve over time. Traditional models based on\nstatic latent features or discretizing time into epochs can become ineffective for\ncapturing the \ufb01ne-grained temporal dynamics in the user-item interactions. We\npropose a coevolutionary latent feature process model that accurately captures the\ncoevolving nature of users\u2019 and items\u2019 feature. 
To learn the parameters, we design an efficient convex optimization algorithm with novel low-rank space-sharing constraints. Extensive experiments on diverse real-world datasets demonstrate significant improvements in user behavior prediction over the state of the art.\n\n1 Introduction\n\nOnline social platforms and service websites, such as Reddit, Netflix and Amazon, attract thousands of users every minute. Effectively recommending the appropriate service items is a fundamentally important task for these online services. By understanding the needs of users and serving them with potentially interesting items, these online platforms can improve user satisfaction and boost site activity or revenue through increased user postings, product purchases, virtual transactions, and/or advertisement clicks [30, 9].\nAs the famous saying goes, \u201cYou are what you eat and you think what you read\u201d, both users' interests and items' semantic features are dynamic and can evolve over time [18, 4]. The interactions between users and service items play a critical role in driving the evolution of user interests and item features. For example, in a movie streaming service, a long-time fan of comedy watches an interesting science fiction movie one day, and starts to watch more science fiction movies in place of comedies. Likewise, a single movie may serve different segments of the audience at different times. For example, a movie initially targeted at an older generation may become popular among the younger generation, and the features of this movie need to be redefined.\nAnother important aspect is that users' interests and items' features can co-evolve over time, that is, their evolutions are intertwined and can influence each other. 
For instance, in online discussion forums such as Reddit, although a group (item) is initially created for political topics, users with very different interest profiles can join this group (user \u2192 item). Therefore, the participants can shape the actual direction (or features) of the group through their postings and responses. It is not unlikely that this group eventually becomes one about education, simply because most users there are concerned about education (item \u2192 user). As the group evolves towards topics on education, some users may become more attracted to education topics, to the extent that they even participate in other dedicated groups on education. On the opposite side, some users may gradually gain interest in sports groups, lose interest in political topics and become inactive in this group. Such a coevolutionary nature of user-item interactions raises very interesting questions on how to model them elegantly and how to learn them from observed interaction data.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nNowadays, user-item interaction data are archived at increasing temporal resolution and are becoming increasingly available. Each individual user-item interaction is typically logged in a database with the precise time-stamp of the interaction, together with additional context, such as tag, text, image, audio and video. Furthermore, user-item interaction data are generated in an asynchronous fashion, in the sense that any user can interact with any item at any time, and there may not be any coordination or synchronization between two interaction events. 
These types of event data call for new representations, models, learning and inference algorithms.\nDespite the temporal and asynchronous nature of such event data, for a long time the data have been treated predominantly as a static graph, with fixed latent features assigned to each user and item [21, 5, 2, 10, 29, 30, 25]. In more sophisticated methods, time is divided into epochs, and static latent feature models are applied to each epoch to capture some temporal aspects of the data [18, 17, 28, 6, 13, 4, 20, 12, 15, 24, 23]. For such epoch-based methods, it is not clear how to choose the epoch length parameter, owing to the asynchronous nature of the user-item interactions. First, different users may have very different time-scales when they interact with service items, making it very difficult to choose a unified epoch length. Second, it is not easy for the learned model to answer fine-grained time-sensitive queries, such as when a user will come back for a particular service item; it can only make such predictions down to the resolution of the chosen epoch length. Most recently, [9] proposed an efficient low-rank point process model for time-sensitive recommendations from recurrent user activities. However, it fails to capture the heterogeneous coevolutionary properties of user-item interactions and has much more limited model flexibility. Furthermore, it is difficult for this approach to incorporate observed context features.\nIn this paper, we propose a coevolutionary latent feature process for continuous-time user-item interactions, designed specifically to take into account the asynchronous nature of event data and the co-evolving nature of users' and items' latent features. 
Our model assigns an evolving latent feature process to each user and item, and the co-evolution of these latent feature processes is captured by two parallel components:\n\u2022 (Item \u2192 User) A user's latent feature is determined by the latent features of the items he interacted with. Furthermore, the contributions of these items' features are temporally discounted by an exponentially decaying kernel function, which we call a Hawkes [14] feature averaging process.\n\u2022 (User \u2192 Item) Conversely, an item's latent features are determined by the latent features of the users who interact with the item. Similarly, the contribution of these users' features is also modeled as a Hawkes feature averaging process.\n\nBesides the two sets of intertwined latent feature processes, our model can also take into account potentially high-dimensional observed context features, linking the latent features to the observed context features through a low-dimensional projection. Despite the sophistication of our model, we show that model parameter estimation, a seemingly non-convex problem, can be transformed into a convex optimization problem, which can be efficiently solved by the latest conditional gradient-like algorithm. Finally, the coevolutionary latent feature processes can be used for downstream inference tasks such as next-item and return-time prediction. We evaluate our method over a variety of datasets, verifying that it leads to significant improvements in user behavior prediction compared to the state of the art.\n2 Background on Temporal Point Processes\nThis section provides the necessary concepts of temporal point processes [7]. A temporal point process is a random process whose realization consists of a list of events localized in time, {t_i} with t_i \u2208 R_+. Equivalently, a given temporal point process can be represented as a counting process, N(t), which records the number of events before time t. 
An important way to characterize a temporal point process is via the conditional intensity function \u03bb(t), a stochastic model for the time of the next event given all previous events. Formally, \u03bb(t)dt is the conditional probability of observing an event in a small window [t, t+dt) given the history T(t) up to t, i.e., \u03bb(t)dt := P{event in [t, t+dt) | T(t)} = E[dN(t) | T(t)], where one typically assumes that only one event can happen in a small window of size dt, i.e., dN(t) \u2208 {0, 1}. The functional form of the intensity is often designed to capture the phenomena of interest. One commonly used form is the Hawkes process [14, 11, 27, 26], whose intensity models the excitation between events: \u03bb(t) = \u03bc + \u03b1 \u03a3_{t_i \u2208 T(t)} \u03ba_\u03c9(t - t_i), where \u03ba_\u03c9(t) := exp(-\u03c9 t) is an exponential triggering kernel and \u03bc > 0 is a baseline intensity independent of the history. Here, the occurrence of each historical event increases the intensity by a certain amount determined by the kernel \u03ba_\u03c9 and the weight \u03b1 > 0, making the intensity history-dependent and a stochastic process by itself. 
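The Hawkes intensity above is simple to compute directly. Below is a minimal sketch (parameter values are illustrative, not from the paper) of the intensity \u03bb(t) = \u03bc + \u03b1 \u03a3_{t_i < t} exp(-\u03c9(t - t_i)):

```python
import math

def hawkes_intensity(t, history, mu=0.5, alpha=0.8, omega=1.0):
    """Conditional intensity of a Hawkes process with exponential kernel.

    mu    : baseline intensity, independent of the history
    alpha : excitation weight of each past event
    omega : decay rate of the exponential triggering kernel
    """
    return mu + alpha * sum(math.exp(-omega * (t - ti)) for ti in history if ti < t)

events = [1.0, 2.0, 2.5]
print(hawkes_intensity(3.0, events))   # elevated shortly after a burst of events
print(hawkes_intensity(30.0, events))  # decays back toward the baseline mu
```

Each past event bumps the intensity by at most alpha, and the bump decays at rate omega, which is exactly the self-exciting, history-dependent behavior described above.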
From the survival analysis theory [1], given the history T = {t_1, . . . , t_n}, for any t > t_n, we characterize the conditional probability that no event happens during [t_n, t) as S(t|T) = exp(-\u222b_{t_n}^{t} \u03bb(\u03c4) d\u03c4). Moreover, the conditional density that an event occurs at time t is f(t|T) = \u03bb(t) S(t|T).\n\nFigure 1: Model illustration. (a) User-item interaction events data as a bipartite graph: each edge contains user, item, time, and interaction feature. (b) User latent feature process: Alice's latent feature consists of three components: the drift of the base feature, the time-weighted average of interaction features, and the weighted average of item features. (c) The symmetric item latent feature process. A, B, C, D are embedding matrices from the high-dimensional feature space to the latent space; \u03ba_\u03c9(t) = exp(-\u03c9 t) is an exponentially decaying kernel.\n\n3 Coevolutionary Latent Feature Processes\nIn this section, we present the framework to model the temporal dynamics of user-item interactions. We first explicitly capture the co-evolving nature of users' and items' latent features. Then, based on the compatibility between a user's and an item's latent features, we model each user-item interaction by a temporal point process and parametrize the intensity function by the feature compatibility.\n3.1 Event Representation\nGiven m users and n items, the input consists of all users' history events: T = {e_k}, where e_k = (u_k, i_k, t_k, q_k) means that user u_k interacts with item i_k at time t_k and generates an interaction feature vector q_k \u2208 R^D. For instance, the interaction feature can be a textual message delivered from the user to a chatting-group in Reddit, or a review of a business in Yelp. It can also be unobserved if the data only contain temporal information.\n3.2 Latent Feature Processes\nWe associate a latent feature vector u_u(t) \u2208 R^K with each user u and i_i(t) \u2208 R^K with each item i. These features represent subtle properties which cannot be directly observed, such as the interests of a user and the semantic topics of an item. Specifically, we model u_u(t) and i_i(t) as follows:\nUser latent feature process. For each user u, we formulate u_u(t) as:\n\nu_u(t) = A \u03c6_u(t) [base drift] + B \u03a3_{e_k : u_k = u, t_k < t} \u03ba_\u03c9(t - t_k) q_k [Hawkes interaction feature averaging] + \u03a3_{e_k : u_k = u, t_k < t} \u03ba_\u03c9(t - t_k) i_{i_k}(t_k) [co-evolution: Hawkes item feature averaging],    (1)\n\nItem latent feature process. For each item i, we specify i_i(t) as:\n\ni_i(t) = C \u03c6_i(t) [base drift] + D \u03a3_{e_k : i_k = i, t_k < t} \u03ba_\u03c9(t - t_k) q_k [Hawkes interaction feature averaging] + \u03a3_{e_k : i_k = i, t_k < t} \u03ba_\u03c9(t - t_k) u_{u_k}(t_k) [co-evolution: Hawkes user feature averaging],    (2)\n\nwhere A, B, C, D \u2208 R^{K\u00d7D} are embedding matrices mapping from the explicit high-dimensional feature space into the low-rank latent feature space. Figure 1 highlights the basic setting of our model. Next we discuss the rationale of each term in detail.\nDrift of base features. \u03c6_u(t) \u2208 R^D and \u03c6_i(t) \u2208 R^D are the explicitly observed properties of user u and item i, which allow the basic features of users (e.g., a user's self-crafted interests) and items (e.g., textual categories and descriptions) to smoothly drift through time. Such changes of basic features are normally caused by external influences. One can parametrize \u03c6_u(t) and \u03c6_i(t) in many different ways, e.g., using exponentially decaying basis functions to interpolate features observed at different times.\nEvolution with interaction features. Users' and items' features can evolve and be influenced by the characteristics of their interactions. For instance, genre changes of movies indicate the changing tastes of users, and the theme of a chatting-group can easily be shifted to certain topics by the involved discussions. This term therefore captures the cumulative influence of past interaction features on the changes of the latent user (item) features. The triggering kernel \u03ba_\u03c9(t - t_k) associated with each past interaction at t_k quantifies how such influence changes through time. Its parametrization depends on the phenomena of interest. 
Without loss of generality, we choose the exponential kernel \u03ba_\u03c9(t) = exp(-\u03c9 t) to discount the influence of each past event; in other words, only the most recent interaction events have large influence. Finally, the embeddings B, D map the observable high-dimensional interaction features to the latent space.\nCoevolution with Hawkes feature averaging processes. Users' and items' latent features can mutually influence each other. This term captures the two parallel processes:\n\u2022 Item \u2192 User. A user's latent feature is determined by the latent features of the items he interacted with. At each time t_k, the latent item feature is i_{i_k}(t_k). Furthermore, the contributions of these items' features are temporally discounted by the kernel function \u03ba_\u03c9(t), which we call a Hawkes feature averaging process. The name comes from the fact that the Hawkes process captures the temporal influence of history events in its intensity function; in our model, we capture both the temporal influence and the feature of each history item as a latent process.\n\u2022 User \u2192 Item. Conversely, an item's latent features are determined by the latent features of all the users who interact with the item. At each time t_k, the latent user feature is u_{u_k}(t_k). Similarly, the contribution of these users' features is also modeled as a Hawkes feature averaging process.\n\nNote that to compute the third co-evolution term, we need to keep track of the user's and item's latent features after each interaction event, i.e., at each t_k we need to compute u_{u_k}(t_k) and i_{i_k}(t_k) in (1) and (2), respectively. 
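This bookkeeping can be done by sweeping over events in time order and storing each entity's latent feature right after its events. The sketch below is illustrative, not the paper's implementation: the embeddings, base features and events are toy values, and the co-evolution term reuses the stored partner feature from each past event, per (1) and (2):

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, omega = 2, 3, 1.0                      # latent dim, feature dim, kernel decay rate
A, B, C, D_ = (rng.standard_normal((K, D)) * 0.1 for _ in range(4))  # toy embeddings

def kappa(dt):
    """Exponential triggering kernel kappa_omega(dt) = exp(-omega * dt)."""
    return np.exp(-omega * dt)

# Events (user, item, time, interaction feature q_k), sorted by time.
events = [(0, 0, 1.0, rng.standard_normal(D)),
          (0, 1, 2.0, rng.standard_normal(D)),
          (1, 0, 2.5, rng.standard_normal(D))]

phi_u = {0: np.ones(D), 1: np.ones(D)}       # base user features (held constant here)
phi_i = {0: np.ones(D), 1: np.ones(D)}       # base item features
hist_u, hist_i = {}, {}                      # past (t_k, q_k, partner latent feature)
u_feat, i_feat = {}, {}                      # latent features right after each event

for (u, i, t, q) in events:
    # Eq. (1): base drift + interaction feature averaging + item feature averaging.
    uu = A @ phi_u[u]
    for (tk, qk, item_k) in hist_u.get(u, []):
        uu += kappa(t - tk) * (B @ qk) + kappa(t - tk) * item_k
    # Eq. (2): the symmetric update for the item.
    ii = C @ phi_i[i]
    for (tk, qk, user_k) in hist_i.get(i, []):
        ii += kappa(t - tk) * (D_ @ qk) + kappa(t - tk) * user_k
    u_feat[u], i_feat[i] = uu, ii
    hist_u.setdefault(u, []).append((t, q, ii.copy()))  # item feature seen by user u
    hist_i.setdefault(i, []).append((t, q, uu.copy()))  # user feature seen by item i
```

Because each event only appends to the history and reads stored features, the latent processes can be tracked online as events stream in.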
Letting I(\u00b7) denote the indicator function, we can show by induction that\n\nu_{u_k}(t_k) = A[\u03a3_{j=1}^{k} I[u_j = u_k] \u03ba_\u03c9(t_k - t_j) \u03c6_{u_j}(t_j)] + B[\u03a3_{j=1}^{k} I[u_j = u_k] \u03ba_\u03c9(t_k - t_j) q_j] + C[\u03a3_{j=1}^{k-1} I[u_j = u_k] \u03ba_\u03c9(t_k - t_j) \u03c6_{i_j}(t_j)] + D[\u03a3_{j=1}^{k-1} I[u_j = u_k] \u03ba_\u03c9(t_k - t_j) q_j]\n\ni_{i_k}(t_k) = C[\u03a3_{j=1}^{k} I[i_j = i_k] \u03ba_\u03c9(t_k - t_j) \u03c6_{i_j}(t_j)] + D[\u03a3_{j=1}^{k} I[i_j = i_k] \u03ba_\u03c9(t_k - t_j) q_j] + A[\u03a3_{j=1}^{k-1} I[i_j = i_k] \u03ba_\u03c9(t_k - t_j) \u03c6_{u_j}(t_j)] + B[\u03a3_{j=1}^{k-1} I[i_j = i_k] \u03ba_\u03c9(t_k - t_j) q_j]\n\nIn summary, we have incorporated both exogenous and endogenous influences into a single model. First, each process evolves according to its respective exogenous base temporal user (item) features \u03c6_u(t) (\u03c6_i(t)). Second, the two processes also inter-depend on each other due to the endogenous influences from the interaction features and the entangled latent features. We present our model in its most general form; the specific choices of u_u(t), i_i(t) depend on the application. For example, if no interaction feature is observed, we drop the second term in (1) and (2).\n3.3 User-Item Interactions as Temporal Point Processes\nFor each user, we model the recurrent occurrences of user u's interactions with all items as a multi-dimensional temporal point process. In particular, the intensity in the i-th dimension (item i) is:\n\n\u03bb_{u,i}(t) = \u03b7_{u,i} [long-term preference] + u_u(t)^\u22a4 i_i(t) [short-term preference],    (3)\n\nwhere \u03b7 = (\u03b7_{u,i}) is a baseline preference matrix. The rationale of this formulation is threefold. First, instead of discretizing time, we explicitly model the timing of each event occurrence as a continuous random variable, which naturally captures the heterogeneity of the temporal interactions between users and items. Second, the base intensity \u03b7_{u,i} represents the long-term preference of user u for item i, independent of the history. 
Third, the tendency for user u to interact with item i at time t depends on the compatibility of their instantaneous latent features. Such compatibility is evaluated through the inner product of their time-varying latent features.\nOur model inherits the merits of classic content filtering, collaborative filtering, and the most recent temporal models. For cold-start users having few interactions with the items, the model adaptively utilizes the purely observed user (item) base properties and interaction features to adjust its predictions, which incorporates the key idea of feature-based algorithms. When the observed features are missing or non-informative, the model makes use of the user-item interaction patterns to make predictions, which is the strength of collaborative filtering algorithms. However, unlike conventional matrix-factorization models, the latent user and item features in our model are entangled and able to co-evolve over time. Finally, the general temporal point process ingredient of the model makes it possible to capture the dynamic preferences of users for items and their recurrent interactions, which is more flexible and expressive.\n4 Parameter Estimation\nIn this section, we propose an efficient framework to learn the parameters. A key challenge is that the objective function is non-convex in the parameters; however, we reformulate it as a convex optimization by creating new parameters. Finally, we present the generalized conditional gradient algorithm to efficiently solve the objective.\nGiven a collection of events T recorded within a time window [0, T), we estimate the parameters using maximum likelihood estimation over all events. 
The joint negative log-likelihood [1] is:\n\n\u2113 = -\u03a3_{e_k} log \u03bb_{u_k,i_k}(t_k) + \u03a3_{u=1}^{m} \u03a3_{i=1}^{n} \u222b_0^T \u03bb_{u,i}(\u03c4) d\u03c4    (4)\n\nThe objective function is non-convex in the variables {A, B, C, D} due to the inner product term in (3). To learn these parameters, one approach is to fix the matrix rank and update each matrix using gradient-based methods; however, this is easily trapped in local optima, and one needs to tune the rank for the best performance. Instead, observing that the product of two low-rank matrices yields a low-rank matrix, we optimize over new matrix variables and obtain a convex objective function.\n4.1 Convex Objective Function\nWe create new parameters such that the intensity function is convex. Since u_u(t) contains the averaging of i_{i_k}(t_k) in (1), C, D will appear in u_u(t). Similarly, A, B will appear in i_i(t). Hence the matrix products X = {A^\u22a4A, B^\u22a4B, C^\u22a4C, D^\u22a4D, A^\u22a4B, A^\u22a4C, A^\u22a4D, B^\u22a4C, B^\u22a4D, C^\u22a4D} will appear in (3) after expansion, due to the inner product i_i(t)^\u22a4 u_u(t). We denote each matrix product in X as a new variable X_i and optimize the objective function over these variables. We denote the corresponding coefficient of X_i as x_i(t), which can be computed exactly. Writing \u039b(t) = (\u03bb_{u,i}(t)), we can rewrite the intensity in (3) in matrix form as:\n\n\u039b(t) = \u03b7 + \u03a3_{i=1}^{10} x_i(t) X_i    (5)\n\nThe intensity is convex in each new variable X_i, and hence so is the objective function. We optimize over the new set of variables X subject to the constraints that i) some of them share the same low-rank space, e.g., A^\u22a4 is shared as the column space in A^\u22a4A, A^\u22a4B, A^\u22a4C, A^\u22a4D, and ii) the new variables are low-rank (the product of low-rank matrices is low-rank). 
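The key algebraic fact behind the reformulation is that the bilinear inner product u_u(t)^\u22a4 i_i(t) is linear in the matrix products X_i once the time-dependent coefficients are fixed. The toy check below (dimensions and the random coefficient vectors a, b, c, d are illustrative, and the co-evolution terms are omitted, which is why only the cross products A^\u22a4C, A^\u22a4D, B^\u22a4C, B^\u22a4D appear) verifies this numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
K, D = 2, 3
A, B, C, Dm = (rng.standard_normal((K, D)) for _ in range(4))

# Known time-dependent coefficient vectors multiplying each embedding at time t
# (e.g., kernel-weighted sums of base and interaction features):
a, b, c, d = (rng.standard_normal(D) for _ in range(4))

u = A @ a + B @ b            # user latent feature at time t
v = C @ c + Dm @ d           # item latent feature at time t
direct = u @ v               # bilinear in (A, B, C, Dm): non-convex jointly

# The same quantity expressed linearly in the products X_i = A^T C, A^T D, ...
via_products = (a @ (A.T @ C) @ c + a @ (A.T @ Dm) @ d
                + b @ (B.T @ C) @ c + b @ (B.T @ Dm) @ d)

assert np.isclose(direct, via_products)
```

Since the intensity is affine in each X_i, plugging it into the (convex) point-process negative log-likelihood keeps the overall objective convex in the new variables.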
Next, we show how to incorporate the space-sharing constraint for a general objective function with an efficient algorithm.\nFirst, we create a symmetric block matrix X \u2208 R^{4D\u00d74D} and place each X_i as follows:\n\nX = [A^\u22a4A A^\u22a4B A^\u22a4C A^\u22a4D; B^\u22a4A B^\u22a4B B^\u22a4C B^\u22a4D; C^\u22a4A C^\u22a4B C^\u22a4C C^\u22a4D; D^\u22a4A D^\u22a4B D^\u22a4C D^\u22a4D] = [X_1 X_2 X_3 X_4; X_2^\u22a4 X_5 X_6 X_7; X_3^\u22a4 X_6^\u22a4 X_8 X_9; X_4^\u22a4 X_7^\u22a4 X_9^\u22a4 X_10]    (6)\n\nIntuitively, minimizing the nuclear norm of X ensures all the low-rank space-sharing constraints. First, the nuclear norm \u2016\u00b7\u2016_* is the sum of all singular values and is commonly used as a convex surrogate for the matrix rank function [22]; hence minimizing \u2016X\u2016_* ensures that X is low-rank and gives the unique low-rank factorization of X. Second, since X_1, X_2, X_3, X_4 are in the same block row and share A^\u22a4, the space-sharing constraints are naturally satisfied.\nFinally, since it is typically believed that users' long-term preferences for items can be categorized into a limited number of prototypical types, we set \u03b7 to be low-rank. Hence the objective is:\n\nmin_{\u03b7\u22650, X\u22650} \u2113(X, \u03b7) + \u03b1\u2016\u03b7\u2016_* + \u03b2\u2016X\u2016_* + \u03b3\u2016X - X^\u22a4\u2016_F^2    (7)\n\nwhere \u2113 is defined in (4), \u2016\u00b7\u2016_F is the Frobenius norm, and the associated constraint ensures X to be symmetric. {\u03b1, \u03b2, \u03b3} control the trade-off between the constraints. After obtaining X, one can directly apply (5) to compute the intensity and make predictions.\n4.2 Generalized Conditional Gradient Algorithm\nWe use the latest generalized conditional gradient algorithm [9] to solve the optimization problem (7); details are provided in the appendix. It has an alternating update scheme and efficiently handles the nonnegativity constraint using proximal gradient descent and the nuclear norm constraint using conditional gradient descent. It is guaranteed to converge in O(1/t^2), where t is the number of iterations. 
For both the proximal and the conditional gradient parts, the algorithm achieves the corresponding optimal convergence rates. If there is no nuclear norm constraint, the results recover the well-known optimal O(1/t^2) rate achieved by the proximal gradient method for smooth convex optimization. If there is no nonnegativity constraint, the results recover the well-known O(1/t) rate attained by the conditional gradient method for smooth convex minimization. Moreover, the per-iteration complexity is linear in the total number of events, O(mnk), where m is the number of users, n is the number of items and k is the number of events per user-item pair.\n5 Experiments\nWe evaluate our framework, COEVOLVE, on synthetic and real-world datasets. We use all events up to time T \u00b7 p as training data and the rest as testing data, where T is the length of the observation window. We tune hyper-parameters and the latent rank of the baselines using 10-fold cross validation with grid search. We vary the proportion p \u2208 {0.7, 0.72, 0.74, 0.76, 0.78} and report results averaged over five runs on two tasks:\n(a) Item recommendation: for each user u, at every testing time t, we compute the survival probability S_{u,i}(t) = exp(-\u222b_{t_n^{u,i}}^{t} \u03bb_{u,i}(\u03c4) d\u03c4) of each item i up to time t, where t_n^{u,i} is the last training event time of (u, i). We then rank all items in ascending order of S_{u,i}(t) to produce a recommendation list. Ideally, the item associated with the testing time t should rank first, hence a smaller value indicates better predictive performance. We repeat the evaluation at each testing moment and report the Mean Average Rank (MAR) of the respective testing items across all users.\n(b) Time prediction: we predict the time when a testing event will occur between a given user-item pair (u, i) by calculating the density of the next event time as f(t) = \u03bb_{u,i}(t) S_{u,i}(t). With this density, we compute the expected time of the next event by sampling future events as in [9]. 
We report the Mean Absolute Error (MAE) between the predicted and true times. Furthermore, we also report the prediction error relative to the length of the entire testing time window.\n5.1 Competitors\nTimeSVD++ is the classic matrix factorization method [18]; the latent factors of users and items are designed as decay functions of time and are also linked to each other through time. FIP is a static low-rank latent factor model that uncovers the compatibility between user and item features [29]. TimeSVD++ and FIP are only designed for data with explicit ratings, so we convert the series of user-item interaction events into explicit ratings using the frequency of a user's item consumption [3]. STIC fits a semi-hidden Markov model to each observed user-item pair [16] and is only designed for time prediction. PoissonTensor uses Poisson regression as the loss function [6] and has been shown to outperform factorization methods based on squared loss [17, 28] on recommendation tasks. There are two choices for reporting its performance: i) use the parameters fitted only in the last time interval, or ii) use the average parameters over all intervals; we report the better of the two. LowRankHawkes is a Hawkes process based model that assumes user-item interactions are independent [9].\n5.2 Experiments on Synthetic Data\nWe simulate 1,000 users and 1,000 items. For each user, we generate 10,000 events with Ogata's thinning algorithm [19]. We compute the MAE by comparing the estimated \u03b7, X with the ground truth. The baseline drift feature is set to be constant. Figure 2 (a) shows that only a few hundred iterations are required to descend to a small error, and (b) indicates that only a modest number of events is required to achieve a good estimation. 
Finally, (c) demonstrates that our method scales linearly as the total number of training events grows.\nFigure 2 (d-f) show that COEVOLVE achieves the best predictive performance. Because POISSONTENSOR applies an extra time dimension and fits each time interval as a Poisson regression, it outperforms TIMESVD++ by capturing the fine-grained temporal dynamics. Finally, our method automatically adapts the contribution of each past item factor to better capture users' current latent features, hence it achieves the best prediction performance overall.\n\nFigure 2: Estimation error (a) vs. #iterations and (b) vs. #events per user; (c) scalability vs. #events per user; (d) average rank of the recommended items; (e) and (f) time prediction error.\n\n5.3 Experiments on Real-World Data\nDatasets. Our datasets are obtained from three different domains: a TV streaming service (IPTV), a commercial review website (Yelp) and an online media service (Reddit). IPTV contains 7,100 users' watching history of 436 TV programs over 11 months, with 2,392,010 events and 1,420 movie features, including 1,073 actors, 312 directors, 22 genres, 8 countries and 5 years. Yelp is available from the Yelp Dataset Challenge Round 7 and contains reviews of various businesses from October 2004 to December 2015. We keep users with more than 100 posts, which leaves 100 users and 17,213 businesses with around 35,093 reviews. Reddit contains discussion events in January 2014; we randomly selected 1,000 users and collected the 1,403 groups in which these users have discussions, with a total of 10,000 discussion events. For item base features, IPTV has movie features, Yelp has business descriptions, and Reddit has none. 
In the experiments we fix the baseline features. There are no base features for users. For interaction features, Reddit and Yelp have reviews represented as bag-of-words; IPTV has no such feature.\nFigure 3 shows the predictive performance. For time prediction, COEVOLVE outperforms the baselines significantly, since we explicitly reason about and model the effect that past consumption behaviors change users' interests and items' features. In particular, compared with LOWRANKHAWKES, our model captures the interactions of each user-item pair within a multi-dimensional temporal point process, which is more expressive than the respective one-dimensional Hawkes processes used by LOWRANKHAWKES, which ignore the mutual influence among items. Furthermore, since the unit of time is an hour, the improvement over the state of the art is around two weeks on IPTV and around two days on Reddit. Hence our method significantly helps online services make better demand predictions.\nFor item recommendation, COEVOLVE also achieves performance comparable to LOWRANKHAWKES on IPTV and Reddit. The reason behind this phenomenon is that one needs to compute the rank of the intensity function for the item prediction task, but the value of the intensity for time prediction. LOWRANKHAWKES may be better at differentiating the rank of the intensities than COEVOLVE, yet unable to learn the actual value of the intensity accurately; hence our method has an order-of-magnitude improvement in the time prediction task.\nIn addition to its strong predictive performance, COEVOLVE also learns the time-varying latent features of users and items. Figure 4 (a) shows that the user is initially interested in TV programs of adventure, but the interest then changes to sitcom, family and comedy, and finally switches to romance TV programs. Figure 4 (b) shows that Facebook and Apple are the two hot topics in the month of January 2014. 
The discussions about Apple suddenly increased on 01/21/2014, which can be traced to the news that Apple won a lawsuit against Samsung1. This further demonstrates that our model can explain and capture user behavior in the real world.

Figure 3: Prediction results on IPTV, Reddit and Yelp (rows), for (a) item recommendation, (b) time prediction (MAE) and (c) time prediction (relative error). Results are averaged over five runs with different portions of training data, and error bars represent the variance.

Figure 4: Learned time-varying features of (a) a user in IPTV and (b) the Technology group in Reddit.

6 Conclusion

We have proposed an efficient framework for modeling the co-evolving nature of users' and items' latent features. Empirical evaluations on large synthetic and real-world datasets demonstrate its scalability and superior predictive performance. Future work includes extending it to other applications, such as modeling the dynamics of social groups and understanding people's behaviors on Q&A sites.

Acknowledgements.
This project was supported in part by NSF/NIH BIGDATA 1R01GM108341, ONR N00014-15-1-2340, NSF IIS-1218749, and NSF CAREER IIS-1350983.

1http://techcrunch.com/2014/01/22/apple-wins-big-against-samsung-in-court/