{"title": "A Meta-Learning Perspective on Cold-Start Recommendations for Items", "book": "Advances in Neural Information Processing Systems", "page_first": 6904, "page_last": 6914, "abstract": "Matrix factorization (MF) is one of the most popular techniques for product recommendation, but is known to suffer from serious cold-start problems. Item cold-start problems are particularly acute in settings such as Tweet recommendation where new items arrive continuously. In this paper, we present a meta-learning strategy to address item cold-start when new items arrive continuously. We propose two deep neural network architectures that implement our meta-learning strategy. The first architecture learns a linear classifier whose weights are determined by the item history while the second architecture learns a neural network whose biases are instead adjusted. We evaluate our techniques on the real-world problem of Tweet recommendation. On production data at Twitter, we demonstrate that our proposed techniques significantly beat the MF baseline and also outperform production models for Tweet recommendation.", "full_text": "A Meta-Learning Perspective on Cold-Start Recommendations for Items

Manasi Vartak*
Massachusetts Institute of Technology
mvartak@csail.mit.edu

Arvind Thiagarajan
Twitter Inc.
arvindt@twitter.com

Conrado Miranda
Twitter Inc.
cmiranda@twitter.com

Jeshua Bratman
Twitter Inc.
jbratman@twitter.com

Hugo Larochelle†
Google Brain
hugolarochelle@google.com

Abstract

Matrix factorization (MF) is one of the most popular techniques for product recommendation, but is known to suffer from serious cold-start problems. Item cold-start problems are particularly acute in settings such as Tweet recommendation where new items arrive continuously. In this paper, we present a meta-learning strategy to address item cold-start when new items arrive continuously.
We propose two deep neural network architectures that implement our meta-learning strategy. The first architecture learns a linear classifier whose weights are determined by the item history while the second architecture learns a neural network whose biases are instead adjusted. We evaluate our techniques on the real-world problem of Tweet recommendation. On production data at Twitter, we demonstrate that our proposed techniques significantly beat the MF baseline and also outperform production models for Tweet recommendation.

1 Introduction

The problem of recommending items to users — whether in the form of products, Tweets, or ads — is a ubiquitous one. Recommendation algorithms in each of these settings seek to identify patterns of individual user interest and use these patterns to recommend items. Matrix factorization (MF) techniques [19] have been shown to work extremely well in settings where many users rate the same items. MF works by learning separate vector embeddings (in the form of lookup tables) for each user and each item. However, these techniques are well known for facing serious challenges when making cold-start recommendations, i.e., when having to deal with a new user or item for which a vector embedding hasn't been learned. Cold-start problems related to items (as opposed to users) are particularly acute in settings where new items arrive continuously. A prime example of this scenario is Tweet recommendation in the Twitter Home Timeline [1]. Hundreds of millions of Tweets are sent on Twitter every day. To ensure freshness of content, the Twitter timeline must continually rank the latest Tweets and recommend relevant Tweets to each user. In the absence of user-item rating data for the millions of new Tweets, traditional matrix factorization approaches that depend on ratings cannot be used.
Similar challenges related to item cold-start arise when making recommendations for news [20], other types of social media, and streaming data applications.
In this paper, we consider the problem of item cold-start (ICS) recommendation, focusing specifically on providing recommendations when new items arrive continuously. Various techniques [3, 14, 27, 17] have been proposed in the literature to extend MF (traditional and probabilistic) to address cold-start problems. The majority of these extensions for item cold-start involve the incorporation of item-specific features based on item description, content, or intrinsic value. From these, a model is prescribed that can infer a vector embedding for that item parametrically (as opposed to through a lookup table). Such item embeddings can then be compared with the embeddings from the user lookup table to perform recommendation of new items to these users.

*Work done as an intern at Twitter
†Work done while at Twitter

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

However, in a setting where new items arrive continuously, we posit that relying on user embeddings trained offline into a lookup table is sub-optimal. Indeed, this approach cannot capture substantial variations in user interests occurring on short timescales, a common phenomenon with continuously produced content. This problem is only partially addressed when user embeddings are retrained periodically or incrementally adjusted online.
Alternatively, recommendations could be made by performing content-based filtering [21], where we compare each new item to other items the user has rated in the recent past. However, a pure content-based filtering approach does not let us share and transfer information between users.
Instead, we would like a method that, like content-based filtering, operates on a user's item history, but that shares information across users through some form of transfer learning between recommendation tasks. In other words, we would like to learn a common procedure that takes as input a set of items from any user's history and produces a scoring function that can be applied to new test items and indicates how likely this user is to prefer each item.
In this formulation, we notice that the recommendation problem is equivalent to a meta-learning problem [28] where the objective is to learn a learning algorithm that can take as input a (small) set of labeled examples (a user's history) and will output a model (scoring function) that can be applied to new examples (new items). In meta-learning, training takes place in an episodic manner, where a training set is presented along with a test example that must be correctly labeled. In our setting, an episode is equivalent to presenting a set of historical items (and ratings) for a particular user along with test items that must be correctly rated for that user.
The meta-learning perspective is appealing for a few reasons. First, we are no longer tied to the MF model where a rating is usually the inner product of the user and item embeddings; instead, we can explore a variety of ways to combine user and item information. Second, it enables us to take advantage of deep neural networks to learn non-linear embeddings. And third, it specifies an effective way to perform transfer learning across users (by means of shared parameters), thus enabling us to cope with the limited amount of data per user.
A key part of designing a meta-learning algorithm is the specification of how a model is produced for different tasks. In this work, we propose and evaluate two strategies for conditioning the model based on task.
The \ufb01rst, called linear weight adaptation, is a light-weight method that builds a linear\nclassi\ufb01er and adapts weights of the classi\ufb01er based on the task information. The second, called\nnon-linear bias adaptation, builds a neural network classi\ufb01er that uses task information to adapt the\nbiases of the neural network while sharing weights across all tasks.\nThus our contributions are: (1) we introduce a new hybrid recommendation method for the item\ncold-start setting that is motivated by limitations in current MF extensions for ICS; (2) we introduce\na meta-learning formulation of the recommendation problem and elaborate why a meta-learning\nperspective is justi\ufb01ed in this setting; (3) we propose two key architectures for meta-learning in\nthis recommendation context; and (4) we evaluate our techniques on a production Twitter dataset\nand demonstrate that they outperform an approach based on lookup table embeddings as well as\nstate-of-the-art industrial models.\n\n2 Problem Formulation\n\nSimilar to other large-scale recommender systems that must address the ICS problem [6], we view\nrecommendation as a binary classi\ufb01cation problem as opposed to a regression problem. Speci\ufb01cally,\nfor an item ti and user uj, the outcome eij \u2208 {0, 1} indicates whether the user engaged with the\nitem. Engagement can correspond to different actions in different settings. For example, in video\nrecommendation, engagement can be de\ufb01ned as a user viewing the video; in ad-click prediction,\nengagement can be de\ufb01ned as clicking on an ad; and on Twitter, engagement can be an action related\nto a Tweet such as liking, Retweeting or replying. 
Our goal in this context is to predict the probability that uj will engage with ti:

Pr(eij=1|ti, uj).    (1)

Once the engagement probability has been computed, it can be combined with other signals to produce a ranked list of recommended items.
As discussed in Section 1, in casting recommendations as a form of meta-learning, we view the problem of making predictions for one user as an individual task or episode. Let Tj be the set of items to which a user uj has been exposed (e.g. Tweets recommended to uj). We represent each user in terms of their item history, i.e., the set of items to which they have been exposed and their engagement for each of these items. Specifically, user uj is represented by their item history Vj = {(tm, emj)} : tm ∈ Tj. Note that we limit item history to only those items that were seen before ti.
We then model the probability of Eq. 1 as the output of a model f(ti; θ) where the parameters θ are produced from the item history Vj of user uj:

Pr(eij=1|ti, uj) = f(ti; H(Vj))    (2)

Thus meta-learning consists of learning the function H(Vj) that takes history Vj and produces parameters of the model f(ti; θ).
In this paper, we propose two neural network architectures for learning H(Vj), depicted in Fig. 1. The first approach, called Linear Weight Adaptation (LWA) and shown in Fig. 1a, assumes f(ti; θ) is a linear classifier on top of non-linear representations of items and uses the user history to adapt classifier weights. The second approach, called Non-Linear Bias Adaptation (NLBA) and shown in Fig.
1b, assumes f(ti; θ) to be a neural network classifier on top of item representations and uses the user history to adapt the biases of the units in the classifier.
In the following subsections, we describe the two architectures and differences in how they model the classification of a new item ti from its representation F(ti).

3 Proposed Architectures

As shown in Fig. 1, both architectures we propose take as input (a) the items to which a user has been exposed along with the rating (i.e. class) assigned to each item by this user (positive, i.e., 1, or negative, i.e., 0), and (b) a test item for which we seek to predict a rating. Each of our architectures in turn consists of two sub-networks. The first sub-network learns a common representation of items (historical and new), which we denote F(t). In our implementation, item representations F(t) are learned by a deep feed-forward neural network. We then compute a single representative per class by applying an aggregating function G to the representations of items tm ∈ Tj from each class. A simple example of G is the unweighted mean, while more complicated functions may order items by recency and learn to weigh individual embeddings differently.
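To make the aggregation step concrete, here is a minimal NumPy sketch of the simplest instance of G, the unweighted per-class mean. The function name, the zero-vector fallback for an empty class, and the toy values are our own illustration, not from the paper:

```python
import numpy as np

def class_representative(item_embeddings, engagements, c):
    """Unweighted-mean instance of the aggregator G: average the
    embeddings F(t_m) of history items whose engagement label is c."""
    members = [emb for emb, e in zip(item_embeddings, engagements) if e == c]
    if not members:
        # Assumed fallback for an empty class: a zero vector.
        return np.zeros_like(item_embeddings[0])
    return np.mean(members, axis=0)

# Toy history: four 3-d item embeddings with engagement labels.
history = [np.array([1.0, 0.0, 2.0]), np.array([3.0, 0.0, 0.0]),
           np.array([0.0, 1.0, 1.0]), np.array([0.0, 3.0, 1.0])]
labels = [1, 1, 0, 0]
r_pos = class_representative(history, labels, 1)  # -> [2.0, 0.0, 1.0]
r_neg = class_representative(history, labels, 0)  # -> [0.0, 2.0, 1.0]
```

A learned, recency-aware weighting would replace the plain `np.mean` with a weighted average over the same per-class embeddings.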
Thus, the class representative embedding for class c ∈ {0, 1} can be expressed as shown below.

R^c_j = G({F(tm)} : tm ∈ Tj ∧ (emj = c))    (3)

Once we have obtained class representatives, the second sub-network applies the LWA and NLBA approaches to adapt the learned model based on item histories.

3.1 Linear Classifier with Task-dependent Weights

Our first approach to conditioning predictions on users' item histories has parallels to latent factor models and is appealing due to its simplicity: we learn a linear classifier (for new items) whose weights are determined by the user's history Vj.
Given the two class-representative embeddings R^0_j, R^1_j described above, LWA provides the bias (first term) and weights (second term) of a linear logistic regression classifier as follows:

[b, (w0 R^0_j + w1 R^1_j)] = H(Vj)    (4)

where scalars b, w0, w1 are trainable parameters. More concretely, with f(ti; θ) being logistic regression, Eq. 2 becomes:

Pr(eij=1|ti, uj) = σ(b + F(ti) · (w0 R^0_j + w1 R^1_j))    (5)

(a) Linear Classifier with Weight Adaptation. Changes in the shading of each connection with the output unit for two users illustrate that the weights of the classifier vary based on each user's item history. The output bias, indicated by the shades of the circles, however remains the same.

(b) Non-linear Classifier with Bias Adaptation. Changes in the shading of each unit between two users illustrate that the biases of these units vary based on each user's item history. The weights however remain the same.

Figure 1: Proposed meta-learning architectures

While the bias b of the classifier is constant across users, its weight vector varies with user-specific item histories (i.e., based on the representative embeddings for classes).
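Concretely, once the class representatives are available, the LWA prediction of Eq. 5 is just a user-specific linear classifier followed by a sigmoid. A minimal NumPy sketch; the parameter values here are hypothetical (in the paper, b, w0, w1 are learned):

```python
import numpy as np

def lwa_predict(f_ti, r0, r1, b, w0, w1):
    """Eq. 5: sigma(b + F(t_i) . (w0 * R0_j + w1 * R1_j)).
    r0/r1 are the negative/positive class representatives of user j;
    b, w0, w1 are scalar parameters shared across all users."""
    user_weights = w0 * r0 + w1 * r1        # user-specific weight vector
    logit = b + np.dot(f_ti, user_weights)  # linear score for the new item
    return 1.0 / (1.0 + np.exp(-logit))     # sigmoid -> P(engagement)

p = lwa_predict(f_ti=np.array([1.0, 2.0, 0.5]),
                r0=np.array([0.2, 0.1, 0.0]),
                r1=np.array([0.5, 0.4, 0.3]),
                b=-0.1, w0=-1.0, w1=1.0)   # p is approximately 0.72
```

Note that only `f_ti` depends on the incoming item; `user_weights` can be precomputed per user, which is what makes LWA cheap at prediction time.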
This means that different dimensions of F(ti), such as properties of item ti, get weighted differently depending on the user.
While simple, the LWA method can be quite effective (see Section 5). Moreover, in some cases, it may be preferred over more complex methods because it allows a significant amount of computation to be performed offline. For example, in Eq. 5, the only quantities that are unknown at prediction time are F(ti). All the rest, including R^c_j, can be pre-computed, reducing the cost of prediction to the computation of one dot product and one sigmoid.

3.2 Non-linear Classifier with Task-dependent Bias

Our first meta-learning strategy is simple and is reminiscent of MF with non-linear embeddings. However, it limits the effect of task information, specifically R^0_j and R^1_j, on the final prediction.
Our second strategy, NLBA, learns a neural network classifier with H hidden layers where the bias (first term) and weights (second term) of the output, as well as the biases (third term) and weights (fourth term) of all hidden layers, are determined as follows:

[v0 R^0_j + v1 R^1_j, w, {V^0_l R^0_j + V^1_l R^1_j}^H_{l=1}, {W_l}^H_{l=1}] = H(Vj)    (6)

Here, the vectors v0, v1, w and matrices {V^0_l}^H_{l=1}, {V^1_l}^H_{l=1}, {W_l}^H_{l=1} are all trainable parameters. In contrast to LWA, all weights (output and hidden) in NLBA are constant across users, while the biases of output and hidden units are adapted per user. One can think of this approach as learning a shared pool of hidden units whose activation can be modulated depending on the user (e.g.
a unit could be entirely shut down for a user with a large negative bias).
Compared to LWA, NLBA produces a non-linear classifier of the item representations F(ti) and can model complex interactions between classes and also between the classes and the test item. For example, interactions allow NLBA to explore a different part of the classifier function space that is not accessible to LWA (e.g., the ratio of the kth dimensions of the class representatives). We find empirically that NLBA significantly improves model performance compared to LWA (Section 5).
Selecting Historical Impressions. A natural question with our meta-learning formulation is the minimum item history size required to make accurate predictions. In general, item history size depends on the problem and variability within items, and must be empirically determined. Often, due to the long tail of users, item histories can be very large (e.g., consider a bot which likes every item). Therefore, we recommend setting an upper limit on item history size. Further, for any user, the number of items with positive engagement (eij=1) can be quite different from those with negative engagement (eij=0). Therefore, in our experiments, we choose to independently sample histories for the two classes up to a maximum size k for each class. Note that while this sampling strategy makes the problem more tractable, it throws off certain features (e.g. user click-through rate) that would benefit from having the true proportions of positive and negative engagements.

4 Related Work

Algorithms for recommendation broadly fall into two categories: content-filtering and collaborative-filtering. Content filtering [21] uses information about items (e.g. product categories, item content, reviews, price) and users (e.g. age, gender, location) to match users to items. In contrast, collaborative filtering [19, 23] uses past user-item ratings to predict future ratings.
The most popular technique for performing collaborative filtering is latent factor models, where items and users are represented in a common latent space and ratings are computed as the inner product between the user and item representations. Matrix factorization (MF) is the most popular instantiation of latent factor models and has been used for large-scale recommendations of products [19], movies [15], and news [7]. A significant problem with traditional MF approaches is that they suffer from cold-start problems, i.e., they cannot be applied to new items or users. In order to address the cold-start problem, work such as [3, 14, 27, 17] has extended the MF model so that user- and item-specific terms can be included in their respective representations. These methods are called hybrid methods. Given the power of deep neural networks to learn representations of images and text, many of the new hybrid methods such as [30] and [12] use deep neural networks to learn item representations. Deep learning models based on ID embeddings (as opposed to content embeddings) have also been used for performing large-scale video recommendation in [6].
The work in [5, 30] operates in a problem setting similar to ours where new scientific articles must be recommended to users based on other articles in their library. In these settings, users are represented in terms of the scientific papers in their "libraries". Note that unlike our setting where we have positive as well as negative information, there are no negative examples present in this setting. [9, 10] propose RNN architectures for a similar problem, namely that of recommending items during short-lived web sessions.
[11] propose co-factorization machines to jointly model topics in Tweets while making recommendations.
In this paper, we propose to view recommendations from a meta-learning perspective [28, 18]. Recently, meta-learning has been explored as a popular strategy for learning from a small number of items (also called few-shot learning [16, 13]). Successful applications of meta-learning include MatchingNets [29], in which an episodic scheme is used to train a meta-learner to classify images given very few examples belonging to each class. In particular, MatchingNets use LSTMs to learn attention weights over all points in the support set and use a weighted sum to make predictions for the test item. Similarly, in [24], the authors propose an LSTM-based meta-learner to learn another learner that performs few-shot classification. [25] proposes a memory-augmented neural network for meta-learning. The key idea is that the network can slowly learn useful representations of data through weight updates while the external memory can cache new data for rapid learning. Most recently, [4] proposes to learn active learning algorithms via a technique based on MatchingNets. While the above state-of-the-art meta-learning techniques are powerful and potentially useful for recommendations, they do not scale to large datasets with hundreds of millions of examples.
Our approach of computing a mean representative per class is similar to [26] and [22] in terms of learning class representative embeddings. Our work also has parallels to the recent work on DeepSets [31] where the authors propose a general strategy for performing machine learning tasks on sets. The authors propose to learn an embedding per item and then use a permutation-invariant operation, usually a sumpool or maxpool, to learn a single representation that is then passed to another neural network for performing the final classification or regression.
Our techniques differ in that\nour input sets are not homogeneous as assumed in DeepSets and therefore we need to learn multiple\nrepresentatives, and unlike DeepSets, our network must work for variable size histories and therefore\na weighted sum is more effective than the unweighted sum.\n\n5 Evaluation\n\n5.1 Recommending Tweets on Twitter Home Timeline\n\nWe evaluated our proposed techniques on the real-world problem of Tweet recommendation. The\nTwitter Home timeline is the primary means by which users on Twitter consume Tweets from their\nnetworks [1]. 300+ million monthly active users on Twitter send hundreds of millions of Tweets per\nday. Every time a user visits Twitter, the timeline ranking algorithm scores Tweets from the accounts\nthey follow and identi\ufb01es the most relevant Tweets for that user. We model the timeline ranking\nproblem as one of engagement prediction as described in Section 2. Given a Tweet ti and a user\nuj, the task is to predict the probability of uj engaging with ti, i.e., Pr(eij=1|ti, uj; \u03b8). Engagement\ncan be any action related to the Tweet such as Retweeting, liking or replying. For the purpose of\nthis paper, we will limit our analysis to prediction of one kind of engagement, namely that of liking\na Tweet. Because hundreds of millions of new Tweets arrive every day, as discussed in Section 1,\ntraditional matrix factorization schemes suffer from acute cold-start problems and cannot be used for\nTweet recommendation. In this work, we cast the problem in terms of meta-learning and adopt the\nformulation developed in Eq. 2.\nDataset. We used production data regarding Tweet engagement to perform an of\ufb02ine evaluation of\nour techniques. Speci\ufb01cally, the training data was generated as follows: for a particular day d, we\ncollect data for all Tweet impressions (i.e., Tweets shown to a user) generated during that day. 
Each\ndata point consists of a Tweet ti, the user uj to whom it was shown, and the engagement outcome eij.\nWe then join impression data with item histories (per user) that are computed using impressions from\nthe week prior to d. As discussed in Section 2, there are different strategies for selecting items to build\nthe item history. For this problem, we independently sample impressions with positive engagement\nand negative engagement, up to a maximum of k engagements in each class. We experimented with\ndifferent values of k and chose the smallest one that did not produce a signi\ufb01cant drop in performance.\nAfter applying other typical \ufb01ltering operations, our training dataset consists of hundreds of millions\nof examples (i.e., ti, uj pairs) for day d. The test and validation sets were similarly constructed,\nbut for different days. For feature preprocessing, we scale and discretize continuous features and\none-hot-encode categorical features.\nBaseline Models. We implemented different architectural variations of the two meta-learning\napproaches proposed in Section 2. Along with comparisons within architectures, we compare our\nmodels against three external models: (a) \ufb01rst, an industrial baseline (PROD-BASE) not using\nmeta-learning; (b) the industrial baseline augmented with a latent factor model for users (MF); and\n(c) the state-of-the-art production model for engagement prediction (PROD-BEST).\nPROD-BASE is a deep feed-forward neural network that uses information pertaining only to the\ncurrent Tweet in order to predict engagement. This information includes features of the current Tweet\nti (e.g. its recency, whether it contains a photo, number of likes), features about the user uj (e.g. how\nheavily the user uses Twitter, their network), and the Tweet\u2019s author (e.g. strength of connection\nbetween the user and author, past interactions). Note that this network uses no historical information\nabout the user. 
This baseline learns a combined item-user embedding (due to user features present in the input) and performs classification based on this embedding. While relatively simple, this model presents a very strong baseline due to the presence of high-quality features.

Model          AUROC     AUROC (w/ CTR)
MF (shallow)   +0.22%    +1.32%
MF (deep)      +0.55%    +1.87%
PROD-BEST      +2.54%    +2.54%
LWA            +1.76%    +2.43%
LWA*           +1.98%    +2.31%

Table 1: Performance with LWA

Model          AUROC     AUROC (w/ CTR)
MF (shallow)   +0.22%    +1.32%
MF (deep)      +0.55%    +1.87%
PROD-BEST      +2.54%    +2.54%
NLBA           +2.65%    +2.76%

Table 2: Performance with NLBA

To mimic latent factor models in cold-start settings, in the second baseline, MF, we augmented PROD-BASE to learn a latent-space representation of users based on ratings. MF uses an independently learned user representation and a current Tweet representation whose inner product is used to make the classification. We evaluate two versions of MF, one that uses a shallow network (1 layer) for learning the representations and another that uses a deep network (5 layers) to learn representations.
PROD-BEST is the production model for engagement prediction based on deep neural networks. PROD-BEST uses features not only for the current Tweet but historical features as described in [8]. PROD-BEST is a highly tuned model and represents the state-of-the-art in Tweet engagement prediction.
Experimental Setup. All models were implemented in the Twitter Deep Learning platform [2]. Models were trained to minimize cross-entropy loss and SGD was used for optimization. We use AUROC as the evaluation metric in our experiments. All performance numbers denote percent AUROC improvement relative to the production baseline model, PROD-BASE.
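For reference, AUROC itself and one plausible reading of the relative-improvement numbers can be sketched as follows. Treating the improvement as a ratio of AUROCs (rather than a difference of percentage points) is our assumption, as are the toy labels and scores:

```python
import numpy as np

def auroc(labels, scores):
    """AUROC via its rank formulation: the probability that a randomly
    chosen positive example is scored above a randomly chosen negative."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()  # ties count as half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def relative_improvement(model_auc, baseline_auc):
    """Percent AUROC improvement of a model over the baseline."""
    return 100.0 * (model_auc - baseline_auc) / baseline_auc

y = [1, 0, 1, 0, 1]
baseline_scores = [0.6, 0.5, 0.4, 0.3, 0.2]  # AUROC 0.5 on this toy set
model_scores = [0.9, 0.2, 0.8, 0.3, 0.7]     # AUROC 1.0 on this toy set
gain = relative_improvement(auroc(y, model_scores), auroc(y, baseline_scores))
```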
For every model, we performed hyperparameter tuning on a validation set using random search and report results for the best performing model.

5.2 Results

Linear Classifier with Weight Adaptation. We evaluated two key instantiations of the LWA approach. First, we test the basic architecture described in Section 3.1 where we calculate one representative embedding from the positive and negative class, and take a linear combination of the dot products of the new item with the respective embeddings. We denote this architecture LWA (see Fig. 1a). When learning class representatives, we use a deep feed-forward neural network (F in Fig. 1a) to first compute embeddings and then use a weighted average to learn class representatives (G in Fig. 1a). We also evaluate a model variation where we augment LWA with a network that uses only the new item embedding to produce a prediction that is then combined with the prediction of LWA to produce the final probability. The intuition behind this model, called LWA*, is that the linear weight adaptation approach works well when both engagement classes contain items. In cases where one of the classes is empty, the model can fall back on the current item to predict engagement. We show the performance of all LWA variants in Table 1.
Note that for all models, we also test a variant where we explicitly pass a user-specific click-through rate (CTR) to the model. The reason is that the CTR provides a strong prior on the probability p(eij = 1) but cannot be easily learned from our architectures because the ratio of positive to negative engagements in the item history is not proportional to the CTR. User-specific CTR can be thought of as being part of the bias term from Eq. 2.
We find that the simplest classifier adaptation model, LWA, already improves upon the production baseline (PROD-BASE) by >1.5 percent.
Adding the bias in the form of CTR improves the AUROC even further. Because learning a good class representative embedding is key for our meta-learning approaches, we performed experiments varying only the architectures used to learn class representative embeddings (i.e., architectures for F, G in Fig. 1a). The top-level classifier was kept constant at LWA but we varied the aggregation function and depth of the feed-forward network used to learn F. Results of this experimentation are shown in Table 3. We find that deep networks work better than shallow networks but a model with 9 hidden layers performs worse than a 5-layer network, possibly due to over-fitting. We also find that weighted combinations of embeddings (when items are sorted by time) perform significantly better than simple averages. A likely reason for the effectiveness of weighted combinations is that item histories can be variable sized; therefore, weighing non-zero entries higher produces better representatives.

Hidden Layers   AVG       Weighted AVG
1               +1.8%     +2.31%
5               +2.20%    +2.42%
9               +2.09%    +1.65%

Table 3: Learning a representative per class

Engagements Used   AUROC
POS/NEG            +2.54%
POS-ONLY           +1.76%
NEG-ONLY           +1.87%

Table 4: Effect of different engagements

Non-Linear Classifier with Bias Adaptation. As with LWA, we evaluated different variations of the non-linear bias adaptation approach. Results of this evaluation are shown in Table 2. We use a weighted mean for computing class representatives in NLBA. We see that this network immediately beats PROD-BASE by a large margin. Moreover, it also readily beats the state-of-the-art model, PROD-BEST. Augmenting NLBA with user-specific CTR further allows the network to cleanly beat the highly optimized PROD-BEST.
For NLBA architectures, we also evaluated the impact on model AUROC when examples of only one class (eij=0 or 1) are present in the item history.
These architectures replicate the strategy of only using one class of items to make predictions. These numbers approximate the gain that could be achieved by using a DeepSets [31]-like approach. The results of this evaluation are shown in Table 4. As expected, we find that dropping either type of engagement reduces performance significantly.
Summary of Results. We find that both our proposed approaches improve on the baseline production model (PROD-BASE) by up to 2.76% and NLBA readily beats the state-of-the-art production model (PROD-BEST). As discussed in Sec. 3.2, we find that NLBA clearly outperforms LWA because of the non-linear classifier and access to a larger space of functions. A breakdown of NLBA performance by overall user engagement shows that NLBA yields large gains for highly engaged users. For both techniques, although the improvements in AUROC may appear small numerically, they have large product impact because they translate to a significantly higher number of engagements. This gain is particularly noteworthy when compared to the highly optimized and tuned PROD-BEST model.

6 Discussion

In this paper, we proposed to view recommendations from a meta-learning perspective and proposed two architectures for building meta-learning models for recommendation. While our techniques show clear wins over state-of-the-art models, there are several avenues for improving the model and operationalizing it. First, our model does not explicitly model the time-varying aspect of engagements. While weighting impressions differently is a way of modeling time dependencies (e.g., more recent impressions get more weight), scalable versions of sequence models such as [29, 10] could be used to explicitly model time dynamics.
Second, while we chose a balanced sampling strategy for producing item histories, different strategies may be more appropriate in different recommendation settings and thus merit further exploration. Third, producing and storing item histories for every user can be expensive in terms of both computation and storage. One remedy would be to compute representative embeddings in an online fashion, so that at any given time the system tracks only the k most representative embeddings. Finally, we believe there is room for future work on effective visualizations of learned embeddings when input items are not easy to interpret (i.e., beyond images and text).

7 Conclusions

In this paper, we study the recommendation problem when new items arrive continuously. We propose to view item cold-start through the lens of meta-learning, where making recommendations for one user is treated as a single task and our goal is to learn across many such tasks. We formally define the meta-learning problem and propose two distinct approaches to condition the recommendation model on the task. The linear weight adaptation approach adapts the weights of a linear classifier based on task information, while the non-linear bias adaptation approach learns task-specific item representations and adapts the biases of a neural network based on task information. We perform an empirical evaluation of our techniques on the Tweet recommendation problem. On production Twitter data, we show that our meta-learning approaches comfortably beat the state-of-the-art production models for Tweet recommendation. We show that the non-linear bias adaptation approach outperforms the linear weight adaptation approach.
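The distinction between the two conditioning mechanisms can be sketched as follows. This is a minimal, single-layer illustration with names and shapes of our own choosing, not the architectures' full form: in LWA the task enters through the classifier's weights, while in NLBA a shared non-linear network is used whose hidden biases are adjusted per task.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lwa_score(item_emb, task_weights):
    """Linear weight adaptation: the weights of a linear classifier are
    produced from the task (user history); the classifier itself is linear."""
    return sigmoid(task_weights @ item_emb)

def nlba_score(item_emb, W_hidden, task_bias, w_out):
    """Non-linear bias adaptation: a shared non-linear network scores the
    item, with its hidden-layer biases adjusted per task."""
    hidden = np.maximum(0.0, W_hidden @ item_emb + task_bias)  # ReLU layer
    return sigmoid(w_out @ hidden)
```

Both functions return an engagement probability for one (user, item) pair; only the source of the task-dependent parameters differs.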
We thus demonstrate that recommendation through meta-learning is effective for item cold-start recommendations and may be extended to other recommendation settings as well.

Addressing Reviewer Comments

We thank the anonymous reviewers for their feedback on the paper. We have incorporated responses to reviewer comments in the paper text.

Acknowledgement

We would like to thank all members of the Twitter Timelines Quality team as well as the Twitter Cortex team for their help and guidance with this work.

References

[1] "Never miss important Tweets from people you follow". https://blog.twitter.com/2016/never-miss-important-tweets-from-people-you-follow. Accessed: 2017-05-03.

[2] Using deep learning at scale in Twitter's timelines. https://blog.twitter.com/2017/using-deep-learning-at-scale-in-twitter-s-timelines. Accessed: 2017-05-09.

[3] D. Agarwal and B.-C. Chen. Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 19–28. ACM, 2009.

[4] P. Bachman, A. Sordoni, and A. Trischler. Learning algorithms for active learning. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 301–310, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[5] L. Charlin, R. S. Zemel, and H. Larochelle. Leveraging user libraries to bootstrap collaborative filtering. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 173–182. ACM, 2014.

[6] P. Covington, J. Adams, and E. Sargin. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 191–198. ACM, 2016.

[7] A. S. Das, M. Datar, A. Garg, and S.
Rajaram. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th international conference on World Wide Web, pages 271–280. ACM, 2007.

[8] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, et al. Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pages 1–9. ACM, 2014.

[9] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015.

[10] B. Hidasi, M. Quadrana, A. Karatzoglou, and D. Tikk. Parallel recurrent neural network architectures for feature-rich session-based recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 241–248. ACM, 2016.

[11] L. Hong, A. S. Doumith, and B. D. Davison. Co-factorization machines: modeling user interests and predicting individual decisions in twitter. In Proceedings of the sixth ACM international conference on Web search and data mining, pages 557–566. ACM, 2013.

[12] D. Kim, C. Park, J. Oh, S. Lee, and H. Yu. Convolutional matrix factorization for document context-aware recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 233–240. ACM, 2016.

[13] G. Koch. Siamese Neural Networks for One-Shot Image Recognition. PhD thesis, University of Toronto, 2015.

[14] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 426–434. ACM, 2008.

[15] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8), 2009.

[16] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum.
Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

[17] X. N. Lam, T. Vu, T. D. Le, and A. D. Duong. Addressing cold-start problem in recommendation systems. In Proceedings of the 2nd international conference on Ubiquitous information management and communication, pages 208–211. ACM, 2008.

[18] C. Lemke, M. Budka, and B. Gabrys. Metalearning: a survey of trends and technologies. Artificial intelligence review, 44(1):117–130, 2015.

[19] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet computing, 7(1):76–80, 2003.

[20] J. Liu, P. Dolan, and E. R. Pedersen. Personalized news recommendation based on click behavior. In Proceedings of the 15th international conference on Intelligent user interfaces, pages 31–40, 2010.

[21] P. Lops, M. De Gemmis, and G. Semeraro. Content-based recommender systems: State of the art and trends. In Recommender systems handbook, pages 73–105. Springer, 2011.

[22] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. Computer Vision–ECCV 2012, pages 488–501, 2012.

[23] A. Mnih and R. R. Salakhutdinov. Probabilistic matrix factorization. In Advances in neural information processing systems, pages 1257–1264, 2008.

[24] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. ICLR, 2017.

[25] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In International conference on machine learning, pages 1842–1850, 2016.

[26] J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. CoRR, abs/1703.05175, 2017.

[27] D. H. Stern, R. Herbrich, and T. Graepel.
Matchbox: large scale online bayesian recommendations. In Proceedings of the 18th international conference on World wide web, pages 111–120. ACM, 2009.

[28] R. Vilalta and Y. Drissi. A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2):77–95, 2002.

[29] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.

[30] C. Wang and D. M. Blei. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 448–456. ACM, 2011.

[31] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Póczos, R. Salakhutdinov, and A. J. Smola. Deep sets. CoRR, abs/1703.06114, 2017.