{"title": "DropoutNet: Addressing Cold Start in Recommender Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 4957, "page_last": 4966, "abstract": "Latent models have become the default choice for recommender systems due to their performance and scalability. However, research in this area has primarily focused on modeling user-item interactions,  and few latent models have been developed for cold start. Deep learning has recently achieved remarkable success showing excellent results for diverse input types. Inspired by these results we propose a neural network based latent model called DropoutNet to address the cold start problem in recommender systems. Unlike existing approaches that incorporate additional content-based objective terms, we instead focus on the optimization and show that neural network models can be explicitly trained for cold start through dropout. Our model can  be applied on top of any existing latent model effectively providing cold start capabilities, and full power of deep architectures. Empirically we demonstrate  state-of-the-art accuracy on publicly available benchmarks. Code is available at  https://github.com/layer6ai-labs/DropoutNet.", "full_text": "DropoutNet: Addressing Cold Start\n\nin Recommender Systems\n\nMaksims Volkovs\n\nlayer6.ai\n\nmaks@layer6.ai\n\nGuangwei Yu\n\nlayer6.ai\n\nguang@layer6.ai\n\nAbstract\n\nTomi Poutanen\n\nlayer6.ai\n\ntomi@layer6.ai\n\nLatent models have become the default choice for recommender systems due to\ntheir performance and scalability. However, research in this area has primarily fo-\ncused on modeling user-item interactions, and few latent models have been devel-\noped for cold start. Deep learning has recently achieved remarkable success show-\ning excellent results for diverse input types. Inspired by these results we propose\na neural network based latent model called DropoutNet to address the cold start\nproblem in recommender systems. 
Unlike existing approaches that incorporate ad-\nditional content-based objective terms, we instead focus on the optimization and\nshow that neural network models can be explicitly trained for cold start through\ndropout. Our model can be applied on top of any existing latent model effectively\nproviding cold start capabilities, and full power of deep architectures. Empirically\nwe demonstrate state-of-the-art accuracy on publicly available benchmarks. Code\nis available at https://github.com/layer6ai-labs/DropoutNet.\n\n1\n\nIntroduction\n\nPopularity of online content delivery services, e-commerce, and social web has highlighted an im-\nportant challenge of surfacing relevant content to consumers. Recommender systems have proven\nto be effective tools for this task, receiving increasingly more attention. One common approach to\nbuilding accurate recommender models is collaborative \ufb01ltering (CF). CF is a method of making\npredictions about an individual\u2019s preferences based on the preference information from other users.\nCF has been shown to work well across various domains [19], and many successful web-services\nsuch as Net\ufb02ix, Amazon and YouTube use CF to deliver highly personalized recommendations to\ntheir users.\nThe majority of the existing approaches in CF can be divided into two categories: neighbor-based\nand model-based. Model-based approaches, and in particular latent models, are typically the pre-\nferred choice since they build compact representations of the data and achieve high accuracy. These\nrepresentations are optimized for fast retrieval and can be scaled to handle millions of users in\nreal-time. For these reasons we concentrate on latent approaches in this work. Latent models are\ntypically learned by applying a variant of low rank approximation to the target preference matrix.\nAs such, they work well when lots of preference information is available but start to degrade in\nhighly sparse settings. 
The most extreme case of sparsity known as cold start occurs when no pref-\nerence information is available for a given user or item. In such cases, the only way a personalized\nrecommendation can be generated is by incorporating additional content information. Base latent\napproaches cannot incorporate content, so a number of hybrid models have been proposed [3, 21, 22]\nto combine preference and content information. However, most hybrid methods introduce additional\nobjective terms considerably complicating learning and inference. Moreover, the content part of the\nobjective is typically generative [21, 9, 22] forcing the model to \u201cexplain\u201d the content rather than\nuse it to maximize recommendation accuracy.\nRecently, deep learning has achieved remarkable success in areas such as computer vision [15, 11],\nspeech [12, 10] and natural language processing [5, 16]. In all of these areas end-to-end deep neu-\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fral network (DNN) models achieve state-of-the-art accuracy with virtually no feature engineering.\nThese results suggest that deep learning should also be highly effective at modeling content for rec-\nommender systems. However, while there has been some recent progress in applying deep learning\nto CF [7, 22, 6, 23], little investigation has been done on using deep learning to address the cold start\nproblem.\nIn this work we propose a model to address this gap. Our approach is based on the observation that\ncold start is equivalent to the missing data problem where preference information is missing. Hence,\ninstead of adding additional objective terms to model content, we modify the learning procedure to\nexplicitly condition the model for the missing input. The key idea behind our approach is that by\napplying dropout [18] to input mini-batches, we can train DNNs to generalize to missing input. 
By selecting an appropriate amount of dropout we show that it is possible to learn a DNN-based latent model that performs comparably to state-of-the-art models on warm start while significantly outperforming them on cold start. The resulting model is simpler than most hybrid approaches and uses a single objective function, jointly optimizing all components to maximize recommendation accuracy.

An additional advantage of our approach is that it can be applied on top of any existing latent model to provide or enhance its cold start capability. This requires virtually no modification to the original model, minimizing the implementation barrier for any production environment that is already running latent models. In the following sections we give a detailed description of our approach and show empirical results on publicly available benchmarks.

2 Framework

In a typical CF problem we have a set of N users U = {u_1, ..., u_N} and a set of M items V = {v_1, ..., v_M}. The users' feedback for the items can be represented by an N x M preference matrix R where R_uv is the preference for item v by user u. R_uv can either be explicitly provided by the user in the form of a rating, like/dislike etc., or inferred from implicit interactions such as views, plays and purchases. In the explicit setting R typically contains graded relevance (e.g., 1-5 ratings), while in the implicit setting R is often binary; we consider both cases in this work. When no preference information is available, R_uv = 0. We use U(v) = {u in U | R_uv != 0} to denote the set of users that expressed a preference for v, and V(u) = {v in V | R_uv != 0} to denote the set of items that u expressed a preference for. We formally define cold start as the case where V(u) is empty for a given user u, or U(v) is empty for a given item v.

Additionally, in many domains we often have access to content information for both users and items.
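To make the notation above concrete, here is a toy example (the matrix values are hypothetical) showing R, the sets U(v) and V(u), and the cold start condition:

```python
import numpy as np

# Toy preference matrix R (N=3 users, M=4 items); 0 means no observed preference.
R = np.array([
    [5, 0, 3, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],   # third user has no interactions -> cold start user
])

def users_of(v):
    """U(v): set of users that expressed a preference for item v (0-indexed)."""
    return {u for u in range(R.shape[0]) if R[u, v] != 0}

def items_of(u):
    """V(u): set of items that user u expressed a preference for (0-indexed)."""
    return {v for v in range(R.shape[1]) if R[u, v] != 0}

print(items_of(0))   # {0, 2}
print(users_of(3))   # set() -> fourth item is cold start
```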
For items, this information can come in the form of text, audio or images/video. For users we could have profile information (age, gender, location, device etc.) and social media data (Facebook, Twitter etc.). This data can provide a highly useful signal for recommender models, and is particularly effective in sparse and cold start settings where little or no preference information is available. After applying relevant transformations, most content information can be represented by fixed-length feature vectors. We use Phi^U and Phi^V to denote the content features for users and items respectively, where Phi^U_u (Phi^V_v) is the content feature vector for user u (item v). When content is missing, the corresponding feature vector is set to 0. The goal is to use the preference information R together with content Phi^U and Phi^V to learn an accurate and robust recommendation model. Ideally this model should handle all stages of the user/item journey: from cold start, to early-stage sparse preferences, to a late-stage well-defined preference profile.

3 Relevant Work

A number of hybrid latent approaches have been proposed to address cold start in CF. One of the more popular models is collaborative topic regression (CTR) [21], which combines latent Dirichlet allocation (LDA) [4] and weighted matrix factorization (WMF) [13]. CTR interpolates between LDA representations in cold start and WMF when preferences are available. Recently, several related approaches have been proposed. Collaborative topic Poisson factorization (CTPF) [8] uses a similar interpolation architecture but replaces both LDA and WMF components with Poisson factorization [9]. Collaborative deep learning (CDL) [22] is another approach with an analogous architecture where LDA is replaced with a stacked denoising autoencoder [20].

Figure 1: DropoutNet architecture diagram.
For each user u, the preference U_u and content Phi^U_u inputs are first passed through the corresponding DNNs f_U and f_PhiU. Top-layer activations are then concatenated and passed to the fine-tuning network, which outputs the latent representation hat{U}_u. Items are handled in a similar fashion with f_V and f_PhiV to produce hat{V}_v. All components are optimized jointly with back-propagation and then kept fixed during inference. Retrieval is done in the new latent space using hat{U} and hat{V}, which replace the original representations U and V.

While these models achieve highly competitive performance, they also share several disadvantages. First, they incorporate both preference and content components into the objective function, making it highly complex. CDL, for example, contains four objective terms and requires tuning three combining weights in addition to the WMF and autoencoder parameters. This makes it challenging to tune these models on large datasets where every parameter-setting experiment is expensive and time consuming. Second, the formulation of each model assumes cold start items and is not applicable to cold start users. Most online services have to frequently incorporate new users and items and thus require models that can handle both. In principle it is possible to derive an analogous model for users and jointly optimize both models. However, this would require an even more complex objective, nearly doubling the number of free parameters. One of the main questions that we aim to address with this work is whether we can develop a simpler cold start model that is applicable to both users and items.

In addition to CDL, a number of approaches have been proposed to leverage DNNs for CF. One of the earlier approaches, DeepMusic [7], aimed to predict latent representations learned by a latent model using a content-only DNN.
Recently, [6] described YouTube's two-stage recommendation model that takes as input the user session (recent plays and searches) and profile information. Latent representations for items in a given session are averaged, concatenated with profile information, and passed to a DNN which outputs a session-dependent latent representation for the user. Averaging the items addresses the variable-length input problem but can lose temporal aspects of the session. To more accurately model how users' preferences change over time, a recurrent neural network (RNN) approach was proposed by [23]. The RNN is applied sequentially to one item at a time, and after all items are processed the hidden layer activations are used as the latent representation.

Many of these models show clear benefits of applying deep architectures to CF. However, few investigate cold start and sparse-setting performance when content information is available. Arguably, we expect deep learning to be most beneficial in these scenarios due to its excellent generalization to various content types. Our proposed approach aims to leverage this advantage and is most similar to [6]. We also use latent representations as preference feature input for users and items, and combine them with content to train a hybrid DNN-based model. But unlike [6], which focuses primarily on warm start users, we develop analogous models for both users and items, and then show how these models can be trained to explicitly handle cold start.

4 Our Approach

In this section we describe the architecture of our model, which we call DropoutNet, together with the learning and inference procedures. We begin with input representation. Our aim is to develop a model that is able to handle both cold and warm start scenarios. Consequently, input to the model needs to contain content and preference information. One option is to directly use rows and columns of R in their raw form.
However, these become prohibitively large as the number of users and items grows. Instead, we take a similar approach to [6] and [23], and use latent representations as preference input. Latent models typically approximate the preference matrix with a product of low-rank matrices U and V:

    R_uv ~= U_u V_v^T    (1)

where U_u and V_v are the latent representations for user u and item v respectively. Both U and V are dense and low dimensional with rank D << min(N, M). Noting the strong performance of latent approaches on a wide range of CF datasets, it is adequate to assume that the latent representations accurately summarize preference information about users and items. Moreover, low input dimensionality significantly reduces model complexity for DNNs since the activation size of the first hidden layer is directly proportional to the input size. Given these advantages we set the input to [U_u, Phi^U_u] and [V_v, Phi^V_v] for each user u and item v respectively.

4.1 Model Architecture

Given the joint preference-content input, we propose to apply a DNN model to map it into a new latent space that incorporates both content and preference information. Formally, the preference U_u and content Phi^U_u inputs are first passed through the corresponding DNNs f_U and f_PhiU. Top-layer activations are then concatenated and passed to the fine-tuning network, which outputs the latent representation hat{U}_u. Items are handled in a similar fashion with f_V and f_PhiV to produce hat{V}_v. We use separate components for preference and content inputs to handle complex structured content, such as images, that can't be directly concatenated with preference input in raw form. Another advantage of using a split architecture is that it allows us to use any of the publicly available (or proprietary) pre-trained models for f_PhiU and/or f_PhiV.
Training can then be significantly accelerated by updating only the last few layers of each pre-trained network. For domains such as vision, where models can exceed 100 layers [11], this can effectively reduce training time from days to hours. Note that when content input is "compatible" with preference representations we remove the branch sub-networks and directly apply the fine-tuning network to the concatenated input [U_u, Phi^U_u]. To avoid notation clutter we omit the sub-networks and use f_U and f_V to denote the full user and item models in subsequent sections.

During training all components are optimized jointly with back-propagation. Once the model is trained we fix it, and make forward passes to map U -> hat{U} and V -> hat{V}. All retrieval is then done using hat{U} and hat{V}, with relevance scores estimated as before by hat{s}_uv = hat{U}_u hat{V}_v^T. Figure 1 shows the full model architecture with both user and item components.

4.2 Training For Cold Start

During training we aim to generalize the model to cold start while preserving warm start accuracy. We discussed that existing hybrid models approach this problem by adding additional objective terms and training the model to fall back on content representations when preferences are not available. However, this complicates learning by forcing the implementer to balance multiple objective terms in addition to training content representations. Moreover, the content part of the objective is typically generative, forcing the model to explain the observed data instead of using it to maximize recommendation accuracy. This can waste capacity by modeling content aspects that are not useful for recommendations.

We take a different approach and borrow ideas from denoising autoencoders [20] by training the model to reconstruct the input from its corrupted version. The goal is to learn a model that would still produce accurate representations when parts of the input are missing.
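The user-side forward pass of Section 4.1 can be sketched as follows; this is a minimal numpy illustration, with hypothetical layer sizes and each branch collapsed to a single layer:

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, H = 200, 300, 500   # latent rank, content dim, hidden size (illustrative)

# One hidden layer per branch, plus a fine-tuning layer on the concatenation.
W_pref, b_pref = rng.normal(0, 0.01, (D, H)), np.zeros(H)    # preference branch
W_cont, b_cont = rng.normal(0, 0.01, (C, H)), np.zeros(H)    # content branch
W_top, b_top = rng.normal(0, 0.01, (2 * H, D)), np.zeros(D)  # fine-tuning net

def user_forward(U_u, Phi_u):
    """Map the joint input [U_u, Phi_u] to the new latent representation hat{U}_u."""
    h = np.concatenate([np.tanh(U_u @ W_pref + b_pref),
                        np.tanh(Phi_u @ W_cont + b_cont)])
    return h @ W_top + b_top

U_u = rng.normal(size=D)     # preference input from e.g. WMF
Phi_u = rng.normal(size=C)   # content feature vector
U_hat = user_forward(U_u, Phi_u)
print(U_hat.shape)           # (200,)

# Cold start: the preference input is simply zeroed out.
U_hat_cold = user_forward(np.zeros(D), Phi_u)
```

The item tower is built the same way on [V_v, Phi^V_v].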
To achieve this we propose an objective that reproduces the relevance scores after the input is passed through the model:

    O = sum_{u,v} (U_u V_v^T - f_U(U_u, Phi^U_u) f_V(V_v, Phi^V_v)^T)^2 = sum_{u,v} (U_u V_v^T - hat{U}_u hat{V}_v^T)^2    (2)

O minimizes the difference between scores produced by the input latent model and the DNN. When all input is available this objective is trivially minimized by setting the content weights to 0 and learning the identity function for the preference input. This is a desirable property for reasons discussed below. In cold start either U_u or V_v (or both) is missing, so our main idea is to train for this by applying input dropout [18]. We use stochastic mini-batch optimization and randomly sample user-item pairs to compute gradients and update the model. In each mini-batch a fraction of users and items is selected at random and their preference inputs are set to 0 before passing the mini-batch to the model. For "dropped out" pairs the model thus has to reconstruct the relevance scores without seeing the preference input:

    user cold start: O_uv = (U_u V_v^T - f_U(0, Phi^U_u) f_V(V_v, Phi^V_v)^T)^2
    item cold start: O_uv = (U_u V_v^T - f_U(U_u, Phi^U_u) f_V(0, Phi^V_v)^T)^2    (3)

Training with dropout has a two-fold effect: pairs with dropout encourage the model to only use content information, while pairs without dropout encourage it to ignore content and simply reproduce the preference input. The net effect is a balance between these two extremes. The model learns to reproduce the accuracy of the input latent model when preference data is available while also generalizing to cold start.
Dropout thus has a similar effect to hybrid preference-content interpolation objectives but with a much simpler architecture that is easy to optimize. An additional advantage of using dropout is that it was originally developed as a way of regularizing the model. We observe a similar effect here, finding that additional regularization is rarely required even for deeper and more complex models.

There are interesting parallels between our model and areas such as denoising autoencoders [20] and dimensionality reduction [17]. Analogous to denoising autoencoders, our model is trained to reproduce the input from a noisy version. The noise comes in the form of dropout that fully removes a subset of input dimensions. However, instead of reconstructing the actual uncorrupted input, we minimize pairwise distances between points in the original and reconstructed spaces. Considering the relevance scores S = {U_u V_v^T | u in U, v in V} and hat{S} = {hat{U}_u hat{V}_v^T | u in U, v in V} as sets of points in one-dimensional space, the goal is to preserve the relative ordering between the points in hat{S} produced by our model and the original set S. We focus on reconstructing distances because it gives greater flexibility, allowing the model to learn an entirely new latent space rather than tying it to a representation learned by another model. This objective is analogous to many popular dimensionality reduction models that project the data to a low-dimensional space where relative distances between points are preserved [17]. In fact, many of the objective functions developed for dimensionality reduction can also be used here.

A drawback of the objective in Equation 2 is that it depends on the input latent model and thus on its accuracy. However, empirically we found this objective to work well, producing robust models.
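As a concrete illustration, Equations 2 and 3 can be sketched as a per-mini-batch loss with preference-input dropout. The networks here are stand-in single layers, dimensions are toy-sized, and (unlike Algorithm 1, where each pair receives one of several transforms) this simplified sketch drops user and item inputs independently:

```python
import numpy as np

rng = np.random.default_rng(1)
D, cU, cV = 8, 12, 15                  # toy dimensions; the paper uses D = 200
U = rng.normal(size=(50, D))           # preference inputs from e.g. WMF
V = rng.normal(size=(60, D))
PhiU = rng.normal(size=(50, cU))       # user content features
PhiV = rng.normal(size=(60, cV))
Wu = rng.normal(0, 0.1, (D + cU, D))   # stand-in for the user DNN f_U
Wv = rng.normal(0, 0.1, (D + cV, D))   # stand-in for the item DNN f_V

def minibatch_loss(pairs, tau=0.5):
    """Objective O over one mini-batch with dropout rate tau (Equations 2-3)."""
    loss = 0.0
    for u, v in pairs:
        U_in, V_in = U[u].copy(), V[v].copy()
        if rng.random() < tau:          # user cold start: drop preference input
            U_in[:] = 0.0
        if rng.random() < tau:          # item cold start: drop preference input
            V_in[:] = 0.0
        U_hat = np.tanh(np.concatenate([U_in, PhiU[u]]) @ Wu)
        V_hat = np.tanh(np.concatenate([V_in, PhiV[v]]) @ Wv)
        loss += (U[u] @ V[v] - U_hat @ V_hat) ** 2
    return loss / len(pairs)

pairs = [(int(rng.integers(50)), int(rng.integers(60))) for _ in range(100)]
print(minibatch_loss(pairs) > 0.0)     # untrained networks mis-predict the scores
```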
The main advantages of this objective are that, first, it is simple to implement and has no additional free parameters to tune, making it easy to apply to large datasets. Second, in mini-batch mode, N*M unique user-item pairs can be sampled to update the networks. Even for moderately sized datasets the number of pairs is in the billions, making it significantly easier to train large DNNs without over-fitting. The performance is particularly robust on sparse implicit datasets commonly found in CF, where R is binary and over 99% sparse. In this setting, training with mini-batches sampled from raw R requires careful tuning to avoid oversampling 0's and getting stuck in bad local optima.

Algorithm 1: Learning Algorithm
Input: R, U, V, Phi^U, Phi^V
Initialize: user model f_U, item model f_V
repeat {DNN optimization}
    sample mini-batch B = {(u_1, v_1), ..., (u_k, v_k)}
    for each (u, v) in B do
        apply one of:
            1. leave as is
            2. user dropout: [U_u, Phi^U_u] -> [0, Phi^U_u]
            3. item dropout: [V_v, Phi^V_v] -> [0, Phi^V_v]
            4. user transform: [U_u, Phi^U_u] -> [mean_{v in V(u)} V_v, Phi^U_u]
            5. item transform: [V_v, Phi^V_v] -> [mean_{u in U(v)} U_u, Phi^V_v]
    end for
    update f_U, f_V using B
until convergence
Output: f_U, f_V

4.3 Inference

Once training is completed, we fix the model and make forward passes to infer new latent representations. Ideally we would apply the model continuously throughout all stages of the user (item) journey: starting from cold start, to the first few interactions, and finally to an established preference profile. However, to update the latent representation hat{U}_u as we observe the first preferences from a cold start user u, we need to infer the input preference vector U_u. As many leading latent models use complex non-convex objectives, updating latent representations with new preferences is a non-trivial task that requires iterative optimization.
To avoid this we use a simple trick: until the input latent model is retrained, we represent each user as a weighted sum of the items that the user interacted with. Formally, given a cold start user u that has generated a new set of interactions V(u), we approximate U_u with the average latent representation of the items in V(u):

    U_u ~= (1 / |V(u)|) sum_{v in V(u)} V_v    (4)

Using this approximation, we then make a forward pass through the user DNN to get the updated representation: hat{U}_u = f_U(mean_{v in V(u)} V_v, Phi^U_u). This procedure can be used continuously in near real-time as new data is collected, until the input latent model is re-trained. Cold start items are handled in a similar way using averages of user representations. The distribution of representations obtained via this approximation can deviate from the one produced by the input latent model. We explicitly train for this using a similar idea to dropout for cold start: throughout learning, the preference input for a randomly chosen subset of users and items in each mini-batch is replaced with Equation 4. We alternate between dropout and this transformation and control the relative frequency of each transformation (i.e., the dropout fraction). Algorithm 1 outlines the full learning procedure.

5 Experiments

To validate the proposed approach, we conducted extensive experiments on two publicly available datasets: CiteULike [21] and the ACM RecSys 2017 challenge dataset [2]. These datasets were chosen because they contain content information, allowing cold start evaluation. We implemented Algorithm 1 using the TensorFlow library [1]. All experiments were conducted on a server with a 20-core Intel Xeon E5-2630 CPU, an Nvidia Titan X GPU and 128GB of RAM. We compare our model against leading CF approaches including WMF [13], CTR [21], DeepMusic [7] and CDL [22], described in Section 3.
For all baselines except DeepMusic, we use the code released by the respective authors, and extensively tune each model to find an optimal setting of hyper-parameters. For DeepMusic we use a modified version of the model, replacing the objective function from [7] with Equation 2, which we found to work better. To make the comparison fair we use the same DNN architecture (number of hidden layers and layer size) for DeepMusic and our models.

All DNN models are trained with mini-batches of size 100, a fixed learning rate and momentum of 0.9. Algorithm 1 is applied directly to the mini-batches, and we alternate between applying dropout and the inference transforms. Using tau to denote the dropout rate, for each batch we randomly select tau x batch_size users and items. Then for batch 1 we apply dropout to the selected users and items, for batch 2 the inference transform, and so on. We found this procedure to work well across different datasets and use it in all experiments.

5.1 CiteULike

At CiteULike, registered users create scientific article libraries and save them for future reference. The goal is to leverage these libraries to recommend relevant new articles to each user. We use a subset of the CiteULike data with 5,551 users, 16,980 articles and 204,986 observed user-article pairs. This is a binary problem with R(u, v) = 1 if article v is in u's library and R(u, v) = 0 otherwise. R is over 99.8% sparse, with each user collecting an average of 37 articles. In addition to preference data, we also have article content information in the form of title and abstract. To make the comparison fair we follow the approach of [21] and use the same vocabulary of top 8,000 words selected by tf-idf. This produces the 16,980 x 8,000 item content matrix Phi^V; since no user content is available, Phi^U is dropped from the model.
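Item content features of this kind can be built with a simple tf-idf computation; the following is a toy stdlib sketch with hypothetical article texts (the paper caps the vocabulary at the top 8,000 words):

```python
import math
from collections import Counter

# Hypothetical article texts (title + abstract concatenated).
docs = [
    "deep learning for recommender systems",
    "matrix factorization for implicit feedback",
    "topic models for article recommendation",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
df = {w: sum(w in d for d in tokenized) for w in vocab}   # document frequency

def tfidf_row(doc):
    """tf-idf vector for one document over the shared vocabulary."""
    counts = Counter(doc)
    return [(counts[w] / len(doc)) * math.log(len(docs) / df[w]) for w in vocab]

Phi_V = [tfidf_row(d) for d in tokenized]   # item content matrix, M x |vocab|
print(len(Phi_V), len(Phi_V[0]))            # 3 rows, one weight per vocab word
# "for" appears in every document, so its idf (and hence its weight) is 0.
print(Phi_V[0][vocab.index("for")])         # 0.0
```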
For all evaluation we use Fold 1 from [21] (results on other folds are nearly identical) and report results on the test set from this fold. We modify the warm start evaluation and measure accuracy by generating recommendations from the full set of 16,980 articles for each user (excluding training interactions). This makes the problem more challenging, and provides a better evaluation of model performance. Cold start evaluation is the same as in [21]: we remove a subset of 3,396 articles from the training data and then generate recommendations from these articles at test time.

Table 1: CiteULike recall@100 warm and cold start test set results.

    Method         Warm Start   Cold Start
    WMF [13]       0.592        -
    CTR [21]       0.597        0.589
    DeepMusic [7]  0.371        0.601
    CDL [22]       0.603        0.573
    DN-WMF         0.593        0.636
    DN-CDL         0.598        0.629

Figure 2: CiteULike warm and cold start results for dropout rates between 0 and 1.

We fix rank D = 200 for all models to stay consistent with the setup used in [21]. For our model we found that 1-hidden-layer architectures with 500 hidden units and tanh activations gave good performance, and going deeper did not significantly improve results. To train the model for cold start we apply dropout to the preference input as outlined in Section 4.2. Here, we only apply dropout to item preferences since only item content is available. Figure 2 shows warm and cold start recall@100 accuracy for dropout rates (probability to drop) between 0 and 1. From the figure we see an interesting pattern where warm start accuracy remains virtually unchanged, decreasing by less than 1% until dropout reaches 0.7, where it rapidly degrades. Cold start accuracy, on the other hand, steadily increases with dropout. Moreover, without dropout cold start performance is poor, and even a dropout of 0.1 improves it by over 60%.
This indicates that there is a region of dropout values where significant gains in cold start accuracy can be achieved without losses on warm start. Similar patterns were observed on other datasets and further validate that the proposed approach of applying dropout for cold start generalization achieves the desired effect.

Warm and cold start recall@100 results are shown in Table 1. To verify that our model can be trained in conjunction with any existing latent model, we trained two versions, denoted DN-WMF and DN-CDL, that use WMF and CDL as input preference models respectively. Both models were trained with a preference input dropout rate of 0.5. From the table we see that most baselines produce similar results on warm start, which is expected since virtually all of these models use the WMF objective to model R. One exception is DeepMusic, which performs significantly worse than the other baselines. This can be attributed to the fact that in DeepMusic item latent representations are functions of content only and thus lack preference information. DN-WMF and DN-CDL, on the other hand, perform comparably to the best baseline, indicating that adding preference information as input into the model significantly improves performance over content-only models like DeepMusic. Moreover, as Figure 2 suggests, even an aggressive dropout of 0.5 does not affect warm start performance, and our model is still able to recover the accuracy of the input latent model.

Cold start results are more diverse; as expected, the best cold start baseline is DeepMusic. Unlike CTR and CDL, which have unsupervised and semi-supervised content components, DeepMusic is end-to-end supervised and can thus learn representations that are better tailored to the target retrieval task. We also see that DN-WMF outperforms all baselines, improving recall@100 by 6% over the best baseline.
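For reference, recall@k (the metric reported throughout this section) can be computed as below; the ranking itself would come from sorting items by the scores hat{U}_u hat{V}_v^T:

```python
def recall_at_k(ranked_items, relevant_items, k=100):
    """Fraction of a user's held-out items that appear in the top-k recommendations."""
    if not relevant_items:
        return 0.0
    hits = len(set(ranked_items[:k]) & set(relevant_items))
    return hits / len(relevant_items)

# Toy check: 2 of the user's 4 held-out articles rank in the top 3.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"b", "c", "x", "y"}
print(recall_at_k(ranked, relevant, k=3))   # 0.5
```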
This indicates that incorporating preference information as input during training can also improve cold start generalization. Moreover, WMF cannot be applied to cold start, so our model effectively adds cold start capability to WMF, with excellent generalization and without affecting performance on warm start. A similar pattern can be seen for DN-CDL, which improves the cold start performance of CDL by almost 10% without affecting warm start.

5.2 RecSys

The ACM RecSys 2017 dataset was released as part of the ACM RecSys 2017 Challenge [2]. It is a large-scale collection of user-job interactions from the career-oriented social network XING (a European analog of LinkedIn). Importantly, this is one of the only publicly available datasets that contains both user and item content information, enabling cold start evaluation on both. In total there are 1.5M users, 1.3M jobs and over 300M interactions. Interactions are divided into six types {impression, click, bookmark, reply, delete, recruiter}, and each interaction is recorded with the corresponding type and timestamp. In addition, for users we have access to profile information such as education, work experience, location and current position. Similarly, for items we have industry, location, title/tags, career level and other related information; see [2] for a full description of the data. After cleaning and transforming all categorical inputs into 1-of-n representation we ended up with 831 user features and 2738 item features, forming ΦU and ΦV respectively.

Figure 3: RecSys warm start (Figure 3(a)), user cold start (Figure 3(b)) and item cold start (Figure 3(c)) results. All figures show test set recall for truncations 50 to 500 in increments of 50. Code released by the authors of CTR and CDL is only applicable to item cold start, so these baselines are excluded from user cold start evaluation.

We process the interaction data by removing duplicate interactions (i.e. multiple clicks on the same item) and deletes, and collapse the remaining interactions into a single binary matrix R where R(u, v) = 1 if user u interacted with job v and R(u, v) = 0 otherwise. We then split the data forward in time, using interactions from the last two weeks as the test set. To evaluate both warm and cold start scenarios simultaneously, test set interactions are further split into three groups: warm start, user cold start and item cold start. The three groups contain approximately 426K, 159K and 184K interactions respectively, with a total of 42,153 cold start users and 49,975 cold start items; the training set contains 18.7M interactions. Cold start users and items are obtained by removing all training interactions for randomly selected subsets of users and items. The goal is to train a single model that is able to handle all three tasks. This simulates real-world scenarios for many online services like XING, where new users and items are added daily and need to be recommended together with existing users and items. We set rank D = 200 for all models and in all experiments train our model (denoted DN-WMF) using latent representations from WMF. During training we alternate between applying dropout and the inference approximation (see Section 4.3) for users and items in each mini-batch with a rate of 0.5.
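The dedup-and-binarize step above can be sketched as follows. This is a toy illustration with hypothetical event tuples; the actual challenge data requires additional cleaning beyond this.

```python
import numpy as np

def build_interaction_matrix(events, n_users, n_items):
    """Collapse raw (user, item, type) events into a binary matrix R.

    Delete events are discarded, and any remaining interaction sets
    R(u, v) = 1; the assignment is idempotent, so duplicate events
    (e.g. repeated clicks on the same job) are harmless.
    """
    R = np.zeros((n_users, n_items), dtype=np.int8)
    for u, v, etype in events:
        if etype != "delete":
            R[u, v] = 1
    return R

events = [(0, 1, "click"), (0, 1, "click"),  # duplicate click, counted once
          (0, 2, "delete"),                  # delete interaction, dropped
          (1, 0, "bookmark")]
R = build_interaction_matrix(events, n_users=2, n_items=3)
# R -> [[0, 1, 0], [1, 0, 0]]
```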
For CTR and CDL the code released by the respective authors only supports item cold start, so we evaluate these models on the warm start and item cold start tasks only.
To find an appropriate DNN architecture we conduct extensive experiments using increasingly deeper DNNs. We follow the approach of [6] and use a pyramid structure where the network gradually compresses the input with each successive layer. For all architectures we use fully connected layers with batch norm [14] and tanh activation functions; other activation functions such as ReLU and sigmoid produced significantly worse results. All models were trained using WMF as the input latent model; note, however, that WMF cannot be applied to either user or item cold start. Table 2 shows warm start, user cold start, and item cold start recall@100 results as the number of layers is increased from one to three. From the table we see that up to three layers, the accuracy on both cold start tasks steadily improves with each additional layer, while the accuracy on warm start remains approximately the same. These results suggest that deeper architectures are highly useful for this task. We use the three layer model in all experiments.
RecSys results are shown in Figure 3. From the warm start results in Figure 3(a) we see a similar pattern where all baselines perform comparably except DeepMusic, suggesting that content-only models are unlikely to perform well on warm start. User and item cold start results are shown in Figures 3(b) and 3(c) respectively.
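The pyramid tower described above can be sketched as a forward pass. This is a minimal NumPy stand-in with random weights, not the trained TensorFlow model: the per-feature normalization approximates inference-mode batch norm, and the 2938-dimensional input assumes the 200-dimensional preference vector concatenated with the 2738 item content features.

```python
import numpy as np

def dense_bn_tanh(x, W, b, eps=1e-5):
    """Fully connected layer followed by normalization and tanh."""
    h = x @ W + b
    h = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)  # batch-norm stand-in
    return np.tanh(h)

def pyramid_forward(x, hidden=(800, 800, 400), out_dim=200, seed=0):
    """Forward pass through the 800 -> 800 -> 400 pyramid tower with a
    final linear projection to the rank D = 200 latent space."""
    rng = np.random.default_rng(seed)
    for d in hidden:
        W = rng.normal(0.0, 0.01, size=(x.shape[1], d))
        x = dense_bn_tanh(x, W, np.zeros(d))
    W_out = rng.normal(0.0, 0.01, size=(x.shape[1], out_dim))
    return x @ W_out

# A toy batch: 200 preference dims + 2738 content dims per item.
batch = np.random.default_rng(1).normal(size=(8, 2938))
z = pyramid_forward(batch)
assert z.shape == (8, 200)
```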
From the figures we see that DeepMusic is the best performing baseline, significantly beating the next best baseline CTR on item cold start. We also see that DN-WMF significantly outperforms DeepMusic, with over 50% relative improvement for most truncations. This is despite the fact that DeepMusic was trained using the same 3-layer architecture and the same objective function as DN-WMF. These results further indicate that incorporating preference information as input into the model is highly important even when the end goal is cold start.

Table 2: RecSys recall@100 warm start, user cold start and item cold start results for different DNN architectures. We use tanh activations and batch norm in each layer.

Network Architecture    Warm    User    Item
WMF                     0.426   --      --
400                     0.421   0.211   0.234
800 → 400               0.420   0.229   0.255
800 → 800 → 400         0.412   0.231   0.265

User inference results are shown in Figure 4. We randomly selected a subset of 10K cold start users that have at least 5 training interactions. Note that all training interactions were removed for these users during training to simulate cold start. For each of the selected users we then incorporate training interactions one at a time into the model in chronological order, using the inference procedure outlined in Section 4.3. The resulting latent representations are tested on the test set. Figure 4 shows recall@100 results as the number of interactions is increased from 0 (cold start) to 5. We compare with WMF by applying a similar procedure from Equation 4 to the WMF representations. From the figure it can be seen that our model is able to seamlessly transition from cold start to preferences without retraining. Moreover, even though our model uses WMF as input, it significantly outperforms WMF at all interaction sizes. Item inference results are similar and are omitted.
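The incremental inference above can be sketched as follows. This is our reading of the Section 4.3 approximation, where a new user's preference input is approximated by the average of the latent vectors of the items they have interacted with; names and shapes are illustrative.

```python
import numpy as np

def incremental_user_input(item_latents, interacted, rank=200):
    """Approximate a user's preference input as the mean of the latent
    vectors of their interacted items. With no interactions this is the
    zero vector, i.e. the pure cold start input."""
    if not interacted:
        return np.zeros(rank)
    return item_latents[interacted].mean(axis=0)

V = np.random.default_rng(0).normal(size=(100, 200))  # item latents from WMF
history = []
for v in [3, 17, 42]:  # interactions arrive in chronological order
    history.append(v)
    u_input = incremental_user_input(V, history)
    # u_input is fed through the user tower; no retraining is needed
```

This is what lets a single trained model serve the same user both before and after their first few interactions.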
These results indicate that training with inference approximations achieves the desired effect, allowing our model to transition from cold start to the first few preferences without re-training and with excellent generalization.

Figure 4: User inference results as the number of interactions is increased from 0 (cold start) to 5.

6 Conclusion

We presented DropoutNet – a deep neural network model for cold start in recommender systems. DropoutNet applies input dropout during training to condition for missing preference information. Optimization with missing data forces the model to leverage preference and content information without explicitly relying on both being present. This leads to excellent generalization in both warm and cold start scenarios. Moreover, unlike existing approaches that typically have complex multi-term objective functions, our objective has only a single term and is easy to implement and optimize. DropoutNet can be applied on top of any existing latent model, effectively providing cold start capabilities and leveraging the full power of deep architectures for content modeling. Empirically, we demonstrate state-of-the-art results on two public benchmarks. Future work includes investigating objective functions that directly incorporate preference information with the aim of improving warm start accuracy beyond the input latent model. We also plan to explore different DNN architectures for both user and item models to better leverage diverse content types.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] F. Abel, Y. Deldjoo, M. Elahi, and D. Kohlsdorf. RecSys Challenge 2017. http://2017.recsyschallenge.com, 2017.

[3] D. Agarwal and B.-C. Chen.
Regression-based latent factor models. In Conference on Knowledge Discovery and Data Mining, 2009.

[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 2003.

[5] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 2011.

[6] P. Covington, J. Adams, and E. Sargin. Deep neural networks for YouTube recommendations. In ACM Recommender Systems, 2016.

[7] A. Van den Oord, S. Dieleman, and B. Schrauwen. Deep content-based music recommendation. In Neural Information Processing Systems, 2013.

[8] P. Gopalan, J. M. Hofman, and D. M. Blei. Scalable recommendation with Poisson factorization. arXiv:1311.1704, 2013.

[9] P. K. Gopalan, L. Charlin, and D. Blei. Content-based recommendations with Poisson factorization. In Neural Information Processing Systems, 2014.

[10] A. Graves, A.-R. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Conference on Acoustics, Speech, and Signal Processing, 2013.

[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.

[12] G. E. Hinton, L. Deng, D. Yu, G. Dahl, A.-R. Mohamed, and N. Jaitly. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing, 2012.

[13] Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In International Conference on Data Engineering, 2008.

[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.

[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks.
In Neural Information Processing Systems, 2012.

[16] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, 2014.

[17] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.

[18] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.

[19] X. Su and T. M. Khoshgoftaar. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009.

[20] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 2010.

[21] C. Wang and D. M. Blei. Collaborative topic modeling for recommending scientific articles. In Conference on Knowledge Discovery and Data Mining, 2011.

[22] H. Wang, N. Wang, and D.-Y. Yeung. Collaborative deep learning for recommender systems. In Conference on Knowledge Discovery and Data Mining, 2015.

[23] C.-Y. Wu, A. Ahmed, A. Beutel, A. Smola, and H. Jing. Recurrent recommender networks. In Conference on Web Search and Data Mining, 2017.