{"title": "Collaborative Ranking With 17 Parameters", "book": "Advances in Neural Information Processing Systems", "page_first": 2294, "page_last": 2302, "abstract": "The primary application of collaborate filtering (CF) is to recommend a small set of items to a user, which entails ranking. Most approaches, however, formulate the CF problem as rating prediction, overlooking the ranking perspective. In this work we present a method for collaborative ranking that leverages the strengths of the two main CF approaches, neighborhood- and model-based. Our novel method is highly efficient, with only seventeen parameters to optimize and a single hyperparameter to tune, and beats the state-of-the-art collaborative ranking methods. We also show that parameters learned on one dataset yield excellent results on a very different dataset, without any retraining.", "full_text": "Collaborative Ranking With 17 Parameters\n\nMaksims N. Volkovs\nUniversity of Toronto\n\nmvolkovs@cs.toronto.edu\n\nRichard S. Zemel\nUniversity of Toronto\n\nzemel@cs.toronto.edu\n\nAbstract\n\nThe primary application of collaborate \ufb01ltering (CF) is to recommend a small set\nof items to a user, which entails ranking. Most approaches, however, formulate the\nCF problem as rating prediction, overlooking the ranking perspective. In this work\nwe present a method for collaborative ranking that leverages the strengths of the\ntwo main CF approaches, neighborhood- and model-based. Our novel method is\nhighly ef\ufb01cient, with only seventeen parameters to optimize and a single hyperpa-\nrameter to tune, and beats the state-of-the-art collaborative ranking methods. 
We also show that parameters learned on datasets from one item domain yield excellent results on a dataset from a very different item domain, without any retraining.\n\n1 Introduction\n\nCollaborative Filtering (CF) is a method of making predictions about an individual's preferences based on the preference information from many users. The emerging popularity of web-based services such as Amazon, YouTube, and Netflix has led to significant developments in CF in recent years. Most applications use CF to recommend a small set of items to the user. For instance, Amazon presents a list of top-T products it predicts a user is most likely to buy next. Similarly, Netflix recommends top-T movies it predicts a user will like based on his/her rating and viewing history. However, while recommending a small ordered list of items is a ranking problem, ranking in CF has gained relatively little attention from the learning-to-rank community. One possible reason for this is the Netflix [3] challenge, which was the primary venue for CF model development and evaluation in recent years. The challenge was formulated as a rating prediction problem, and almost all of the proposed models were designed specifically for this task and were evaluated using the normalized squared error objective. Another potential reason is the absence of user-item features. The standard learning-to-rank problem in information retrieval (IR), which is well explored with many powerful approaches available, always includes item features, which are used to learn the models. These features incorporate a lot of external information and are highly engineered to accurately describe the query-document pairs. While a similar approach can be taken in CF settings, it is likely to be very time consuming to develop analogous features, and features developed for one item domain (books, movies, songs etc.) are likely to not generalize well to another. 
Moreover, user features typically include personal information which cannot be publicly released, preventing open research in the area. An example of this is the second part of the Netflix challenge, which had to be shut down due to privacy concerns. The absence of user-item features makes it very challenging to apply the models from the learning-to-rank domain to this task. However, recent work [23, 15, 2] has shown that by optimizing a ranking objective given only the known ratings, a significantly higher ranking accuracy can be achieved as compared to models that optimize rating prediction.\n\nInspired by these results we propose a new ranking framework where we show how the observed ratings can be used to extract effective feature descriptors for every user-item pair. The features do not require any external information and make it possible to apply any learning-to-rank method to optimize the parameters of the ranking function for the target metric. Experiments on MovieLens and Yahoo! datasets show that our model outperforms existing rating and ranking approaches to CF. Moreover, we show that a model learned with our approach on a dataset from one user/item domain can then be applied to a different domain without retraining and still achieve excellent performance.\n\n2 Collaborative Ranking Framework\n\nIn a typical collaborative filtering (CF) problem we are given a set of N users U = {u1, ..., uN} and a set of M items V = {v1, ..., vM}. The users' ratings of the items can be represented by an N × M matrix R where R(un, vm) is the rating assigned by user un to item vm and R(un, vm) = 0 if vm is not rated by un. We use U(vm) to denote the set of all users that have rated vm and V(un) to denote the set of items that have been rated by un. 
We use vector notation: R(un, :) denotes the n'th row of R (1 × M vector), and R(:, vm) denotes the m'th column (N × 1 vector).\n\nAs mentioned above, most research has concentrated on the rating prediction problem in CF where the aim is to accurately predict the ratings for the unrated items for each user. However, most applications that use CF typically aim to recommend only a small ranked set of items to each user. Thus rather than concentrating on rating prediction we instead approach this problem from the ranking viewpoint and refer to it as Collaborative Ranking (CR). In CR the goal is to rank the unrated items in the order of relevance to the user. A ranking of the items V can be represented as a permutation π : {1, ..., M} → {1, ..., M} where π(m) = l denotes the rank of the item vm and m = π⁻¹(l). A number of evaluation metrics have been proposed in IR to evaluate the performance of the ranking. Here we use the most commonly used metric, Normalized Discounted Cumulative Gain (NDCG) [12]. For a given user un and ranking π the NDCG is given by:\n\nNDCG(un, π, R)@T = (1 / GT(un, R)) Σ_{t=1}^{T} (2^R(un, vπ⁻¹(t)) − 1) / log(t + 1)    (1)\n\nwhere T is a truncation constant, vπ⁻¹(t) is the item in position t in π and GT(un, R) is a normalizing term which ensures that NDCG ∈ [0, 1] for all rankings. T is typically set to a small value to emphasize that the user will only be shown the top-T ranked items and the items below the top-T are not evaluated.\n\n3 Related Work\n\nRelated work in CF and CR can be divided into two categories: neighborhood-based approaches and model-based approaches. 
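In code, the NDCG@T of Equation 1 can be sketched as follows (a minimal pure-Python sketch; `gains` is a hypothetical list holding the ratings R(un, v) of the items in ranked order, and the normalizer GT is computed from the ideal, sorted-by-rating ordering):

```python
import math

def dcg_at_t(gains, T):
    """Discounted cumulative gain over the top-T positions.

    gains[t-1] is the rating R(u_n, v) of the item ranked at position t.
    Uses the (2^r - 1) / log(t + 1) form of Equation 1.
    """
    return sum((2 ** g - 1) / math.log(t + 1)
               for t, g in enumerate(gains[:T], start=1))

def ndcg_at_t(ranked_gains, T):
    """NDCG@T: DCG of the given ranking divided by the DCG of the ideal
    (sorted-by-rating) ranking, so the result lies in [0, 1]."""
    norm = dcg_at_t(sorted(ranked_gains, reverse=True), T)
    return dcg_at_t(ranked_gains, T) / norm if norm > 0 else 0.0
```

A ranking that already places the highest-rated items first scores exactly 1.0, and any other ordering scores strictly less.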
In this section we describe both types of models.\n\n3.1 Neighborhood-Based Approaches\n\nNeighborhood-based CF approaches estimate the unknown ratings for a target user based on the ratings from the set of neighborhood users that tend to rate similarly to the target user. Formally, given the target user un and item vm the neighborhood-based methods find a subset of K neighbor users who are most similar to un and have rated vm, i.e., are in the set U(vm) \\ un. We use K(un, vm) ⊆ U(vm) \\ un to denote the set of K neighboring users. A central component of these methods is the similarity function ψ used to compute the neighbors. Several such functions have been proposed including the Cosine Similarity [4] and the Pearson Correlation [20, 10]:\n\nψcos(un, u') = (R(un, :) · R(u', :)ᵀ) / (‖R(un, :)‖ ‖R(u', :)‖)\n\nψpears(un, u') = ((R(un, :) − µ(un)) · (R(u', :) − µ(u'))ᵀ) / (‖R(un, :) − µ(un)‖ ‖R(u', :) − µ(u')‖)\n\nwhere µ(un) is the average rating for un. Once the K neighbors are found the rating is predicted by taking the weighted average of the neighbors' ratings. An analogous item-based approach [22] can be used when the number of items is smaller than the number of users.\n\nOne problem with the neighborhood-based approaches is that the raw ratings often contain user bias. For instance, some users tend to give high ratings while others tend to give low ones. To correct for these biases various methods have been proposed to normalize or center the ratings [4, 20] before computing the predictions.\n\nAnother major problem with the neighborhood-based approaches arises from the fact that the observed rating matrix R is typically highly sparse, making it very difficult to find similar neighbors reliably. 
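The cosine and Pearson similarities above can be sketched as follows (a pure-Python sketch on dense rating rows with unrated entries stored as 0; taking each user's mean over rated entries only, and centring only those entries, is one common convention and an assumption on my part, since the formulas do not spell out how unrated entries are treated):

```python
import math

def cosine_sim(r_u, r_v):
    """psi_cos: cosine of the angle between two users' rating rows."""
    dot = sum(a * b for a, b in zip(r_u, r_v))
    denom = (math.sqrt(sum(a * a for a in r_u))
             * math.sqrt(sum(b * b for b in r_v)))
    return dot / denom if denom > 0 else 0.0

def pearson_sim(r_u, r_v):
    """psi_pears: cosine similarity of the mean-centred rating rows.
    Each user's mean mu(u) is taken over rated (non-zero) entries and
    centring is applied only to those entries -- an assumed convention."""
    mu_u = sum(a for a in r_u if a > 0) / max(1, sum(1 for a in r_u if a > 0))
    mu_v = sum(b for b in r_v if b > 0) / max(1, sum(1 for b in r_v if b > 0))
    cu = [a - mu_u if a > 0 else 0.0 for a in r_u]
    cv = [b - mu_v if b > 0 else 0.0 for b in r_v]
    return cosine_sim(cu, cv)
```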
To address this sparsity, most methods employ dimensionality reduction [9] and data smoothing [24] to fill in some of the unknown ratings, or to cluster users before computing user similarity. This, however, adds computational overhead and typically requires tuning additional parameters such as the number of clusters.\n\nA neighborhood-based approach to ranking has been proposed recently by Liu & Yang [15]. Instead of predicting ratings, this method uses the neighbors of un to fill in the missing entries in the M × M pairwise preference matrix Yn, where Yn(vm, vl) is the preference strength for vm over vl by un. Once the matrix is completed an approximate Markov chain algorithm is used to infer the ranking from the pairwise preferences. The main drawback of this approach is that the model is not optimized for the target evaluation metric, such as NDCG. The ranking is inferred directly from Yn and no additional parameters are learned. In general, to the best of our knowledge, no existing neighborhood-based CR method takes the target metric into account during optimization.\n\n3.2 Model-Based Approaches\n\nIn contrast to the neighborhood-based approaches, the model-based approaches use the observed ratings to create a compact model of the data which is then used to predict the unobserved ratings. Methods in this category include latent models [11, 16, 21], clustering methods [24] and Bayesian networks [19]. Latent factorization models such as Probabilistic Matrix Factorization (PMF) [21] are the most popular model-based approaches. In PMF every user un and item vm are represented by latent vectors φ(un) and φ(vm) of length D. For a given user-item pair (un, vm) the dot product of the corresponding latent vectors gives the rating prediction: R(un, vm) ≈ φ(un) · φ(vm). 
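A minimal sketch of PMF's prediction and one stochastic gradient step on its regularized squared error (the learning rate, regularization weight and in-place list updates here are illustrative assumptions, not the exact setup of [21]; constant factors of the gradient are folded into the learning rate):

```python
def pmf_sgd_step(phi_u, phi_v, rating, lr=0.01, reg=0.1):
    """One SGD step on (r - phi_u . phi_v)^2 + reg*(||phi_u||^2 + ||phi_v||^2),
    updating both latent vectors in place. Returns the pre-update error."""
    pred = sum(a * b for a, b in zip(phi_u, phi_v))
    err = rating - pred
    for d in range(len(phi_u)):
        gu = -err * phi_v[d] + reg * phi_u[d]   # gradient w.r.t. phi_u[d]
        gv = -err * phi_u[d] + reg * phi_v[d]   # gradient w.r.t. phi_v[d]
        phi_u[d] -= lr * gu
        phi_v[d] -= lr * gv
    return err
```

Repeating such steps over the observed ratings drives the predicted dot products toward the observed values, which is the squared-error training described next.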
The latent representations are learned by minimizing the squared error between the observed ratings and the predicted ones.\n\nLatent models have more expressive power and typically perform better than the neighborhood-based models when the number of observed ratings is small because they are able to learn preference correlations that extend beyond the simple neighborhood similarity. However, this comes at the cost of a large number of parameters and complex optimization. For example, with the suggested setting of D = 20 the PMF model on the full Netflix dataset has over 10 million parameters and is prone to overfitting. To prevent overfitting the weighted ℓ2 norms of the latent representations are minimized together with the squared error during the optimization phase, which introduces additional hyperparameters to tune.\n\nAnother problem with the majority of the model-based approaches is that inference for a new user/item is typically expensive. For instance, in PMF the latent representation has to be learned before any predictions can be made for a new user/item, and if many new users/items are added the entire model has to be retrained. On the other hand, inference for a new user in neighborhood-based methods can be done efficiently by simply computing the K neighbors, which is a key advantage of these approaches.\n\nSeveral model-based approaches to CR have recently been proposed, notably CofiRank [23] and the PMF-based ranking model [2]. CofiRank learns latent representations that minimize a ranking-based loss instead of the squared error. The PMF-based approach uses the latent representations produced by PMF as user-item features and learns a ranking model on these features. The authors of that work also note that the PMF representations might not be optimal for ranking since they are learned using a squared error objective, which is very different from most ranking metrics. 
To account for this they propose an extension where both the user-item features and the weights of the ranking function are optimized during learning. Both methods incorporate NDCG during the optimization phase, which is a significant advantage over most neighborhood-based approaches to CR. However, neither method addresses the optimization or inference problems mentioned above. In the following section we present our approach to CR which leverages the advantages of both neighborhood- and model-based methods.\n\n3.3 Learning-to-Rank\n\nLearning-to-rank has received a lot of attention in the machine learning community due to its importance in a wide variety of applications ranging from information retrieval to natural language processing to computer vision. In IR the learning-to-rank problem consists of a set of training queries where for each query we are given a set of retrieved documents and their relevance labels that indicate the degree of relevance to the query. The documents are represented as query-dependent feature vectors and the goal is to learn a feature-based ranking function to rank the documents in the order of relevance to the query. Existing approaches to this problem can be partitioned into three categories: pointwise, pairwise, and listwise.\n\nFigure 1: An example rating matrix R and the resulting WIN, LOSS and TIE matrices for the user-item pair (u3, v4) with K = 3 (number of neighbors). (1) Top-3 closest neighbors {u1, u5, u6} are selected from U(v4) = {u1, u2, u5, u6} (all users who rated v4). Note that u2 is not selected because the ratings for u2 deviate significantly from those for u3. (2) The WIN, LOSS and TIE matrices are computed for each neighbor using Equation 2. Here g ≡ 1 is used to compute the matrices. For example, u5 gave a rating of 3 to v4, which ties it with v3 and beats v1. Normalizing by |V(u5)| − 1 = 2 gives WIN34(u5) = 0.5, LOSS34(u5) = 0 and TIE34(u5) = 0.5. 
Due to the lack of space we omit the description of the\nindividual approaches here and instead refer the reader to [14] for an excellent overview.\n4 Our Approach\n\nThe main idea behind our approach is to transform the CR problem into a learning-to-rank one and\nthen utilize one of the many developed ranking methods to learn the ranking function. CR can be\nplaced into the learning-to-rank framework by noting that the users correspond to queries and items\nto documents. For each user the observed ratings indicate the relevance of the corresponding items\nto that user and can be used to train the ranking function. The key difference between this setup and\nthe standard learning-to-rank one is the absence of user-item features. In this work we bridge this\ngap and develop a robust feature extraction approach which does not require any external user or\nitem information and is based only on the available training ratings.\n4.1 Feature Extraction\n\nThe PMF-based ranking approach [2] extracts user-item features by concatenating together the latent\nrepresentations learned by the PMF model. The model thus requires the user-item representations\nto be learned before the items can be ranked and hence suffers from the main disadvantages of the\nmodel-based approaches:\nthe large number of parameters, complex optimization, and expensive\ninference for new users and items. In this work we take a different approach which avoids these\ndisadvantages. We propose to use the neighbor preferences to extract the features for a given user-\nitem pair.\nFormally, given a user-item pair (un, vm) and a similarity function \u03c8, we use \u03c8 to extract a subset of\nthe K most similar users to un that rated vm, i.e., K(un, vm). This step is identical to the standard\nneighborhood-based model, and \u03c8 can be any rating or preference based similarity function. 
Once K(un, vm) = {uk}, k = 1, ..., K, is found, instead of using only the ratings for vm, we use all of the observed ratings for each neighbor and summarize the net preference for vm into three K × 1 summary preference matrices WINnm, LOSSnm and TIEnm:\n\nWINnm(k) = (1 / (|V(uk)| − 1)) Σ_{v' ∈ V(uk)\\vm} g(R(uk, vm), R(uk, v')) I[R(uk, vm) > R(uk, v')]\n\nLOSSnm(k) = (1 / (|V(uk)| − 1)) Σ_{v' ∈ V(uk)\\vm} g(R(uk, vm), R(uk, v')) I[R(uk, vm) < R(uk, v')]\n\nTIEnm(k) = (1 / (|V(uk)| − 1)) Σ_{v' ∈ V(uk)\\vm} I[R(uk, vm) = R(uk, v')]    (2)\n\nwhere I[x] is an indicator function evaluating to 1 if x is true and to 0 otherwise, and g : R² → R is the pairwise preference function used to convert ratings to pairwise preferences. A simple choice for g is g ≡ 1 which ignores the rating magnitude and turns the matrices into normalized counts. However, recent work in preference aggregation [8, 13] has shown that additional gain can be achieved by taking the relative rating magnitude into account by using either the normalized rating or log rating difference. All three versions of g address the user bias problem mentioned above by using relative comparisons rather than the absolute rating magnitude. In this form WINnm(k) corresponds to the net positive preference for vm by neighbor uk. Similarly, LOSSnm(k) corresponds to the net negative preference and TIEnm(k) counts the number of ties. Together the three matrices thus describe the relative preferences for vm across all the neighbors of un. Normalization by |V(uk) \\ vm| (the number of observed ratings for uk excluding vm) ensures that the entries are comparable across neighbors with different numbers of ratings. 
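With the simple choice g ≡ 1, the Equation 2 entries for a single neighbor can be sketched as follows (ratings held in a hypothetical dict-of-dicts `R[user][item]` containing observed ratings only):

```python
def win_loss_tie(R, neighbor, item):
    """Normalized WIN/LOSS/TIE entries of Equation 2 for one neighbor u_k,
    using g = 1 (pure preference counts)."""
    r_m = R[neighbor][item]
    others = [r for v, r in R[neighbor].items() if v != item]
    n = max(1, len(others))  # |V(u_k)| - 1
    win = sum(1 for r in others if r_m > r) / n
    loss = sum(1 for r in others if r_m < r) / n
    tie = sum(1 for r in others if r_m == r) / n
    return win, loss, tie
```

On the worked example of Figure 1 (u5 rated v4 and v3 with 3, and v1 with a lower rating) this yields WIN = 0.5, LOSS = 0 and TIE = 0.5, matching the caption.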
For unpopular items vm that do not have many ratings, with |U(vm)| < K, the number of neighbors will be less than K, i.e., |K(un, vm)| < K. When such an item is encountered we shrink the preference matrices to be the same size as |K(un, vm)|. Figure 1 shows an example rating matrix R together with the preference matrices computed for the user-item pair (u3, v4).\n\nGiven the preference matrix WINnm we summarize it with a set of simple descriptive statistics:\n\nγ(WINnm) = [µ(WINnm), σ(WINnm), max(WINnm), min(WINnm), (1/K) Σ_k I[WINnm(k) ≠ 0]]\n\nwhere µ and σ are mean and standard deviation functions respectively. The last statistic counts the number of neighbors (out of K) that express any positive preference towards vm, and together with σ summarizes the overall confidence of the preference. Extending this procedure to the other two preference matrices and concatenating the resulting statistics gives the feature vector for (un, vm):\n\nγ(un, vm) = [γ(WINnm), γ(LOSSnm), γ(TIEnm)]    (3)\n\nIntuitively the features describe the net preference for vm and its variability across the neighbors. Note that since γ is independent of K, N and M this representation will have the same length for every user-item pair. We have thus created a fixed-length feature representation for every user-item pair, effectively transforming the CR problem into a standard learning-to-rank one. During training our aim is now to use the observed training ratings to learn a scoring function f : R^|γ| → R which maximizes the target IR metric, such as NDCG, across all users. At test time, given a user u and items {v1, ..., vM}, we (1) extract features for each item vm using the neighbors of (u, vm); (2) apply the learned scoring function to get the score for every item; and (3) sort the scores to produce the ranking. 
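The summary statistics γ(·) and the concatenation of Equation 3 can be sketched as follows (a sketch only; whether σ is the population or sample standard deviation is not specified in the text, so the population form below is an assumption):

```python
import statistics

def summarize(prefs, K):
    """The five descriptive statistics of one K x 1 preference matrix:
    mean, standard deviation, max, min, and the fraction of the K
    neighbors expressing a non-zero preference."""
    mean = statistics.mean(prefs)
    std = statistics.pstdev(prefs)   # population std-dev; an assumption
    nonzero = sum(1 for p in prefs if p != 0) / K
    return [mean, std, max(prefs), min(prefs), nonzero]

def features(win, loss, tie, K):
    """gamma(u_n, v_m) of Equation 3: the concatenated statistics of the
    WIN, LOSS and TIE matrices -- a 15-dimensional vector."""
    return summarize(win, K) + summarize(loss, K) + summarize(tie, K)
```

Three matrices times five statistics gives |γ| = 15, which is where the "17 parameters" of the title come from once the two biases of the scoring function are added.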
This process is shown in Figure 2.\n\nIt is important to note here that, first, a single scoring function is learned for all users and items, so the number of parameters is independent of the number of users or items and only depends on the size of γ. This is a significant advantage over most model-based approaches where the number of parameters typically scales linearly with the number of users and/or items. Second, given a new user u no optimization is necessary to produce a ranking of the items for u. Similarly to neighborhood-based methods, our approach only requires computing the neighbors to extract the features and applying the learned scoring function to get the ranking. This is also a significant advantage over most user-based approaches where it is typically necessary to learn a new model for every user not present in the training data before predictions can be made. Finally, unlike the existing neighborhood-based methods for CR, our approach allows us to optimize the parameters of the model for the target metric. Moreover, the extracted features incorporate preference confidence information such as the variance across the neighbors and the fraction of the neighbors that generated each preference type (positive, negative and tie). Taking this information into account allows us to adapt the parameters of the scoring function to sparse low-confidence settings and addresses the reliability problem of the neighborhood-based methods (see Section 3.1). Note that an analogous item-based approach can be taken here by similarly summarizing the preferences of un for items that are closest to vm; we leave this for future work. 
A modified version of this approach, adapted to binary ratings, recently placed second in the Million Song Dataset Challenge [18] run by Kaggle.\n\n4.2 Learning the Scoring Function\n\nFigure 2: The flow diagram for WLT, our feature-based CR model.\n\nGiven the user-item features extracted based on the neighbors, our goal is to use the observed training ratings for each user to optimize the parameters of the scoring function for the target IR metric. A key difference between this feature-based CR approach and the typical learning-to-rank setup is the possibility of missing features. If a given training item vm is not rated by any other user except un the feature vector is set to zero (γ(un, vm) ≡ 0). One way to avoid missing features is to learn only with those items that have at least a minimum number of ratings in the training set. However, in very sparse settings this would force us to discard some of the valuable training data. We take a different approach, modifying the conventional linear scoring function to include an additional bias term b0:\n\nf(γ(un, vm), W) = w · γ(un, vm) + b + I[U(vm) \\ un = ∅] b0    (4)\n\nwhere W = {w, b, b0} is the set of free parameters to be learned. Here w has the same dimension as γ, and I is an indicator function. The bias term b0 provides a base score for vm if vm does not have enough ratings in the training data. Several possible extensions of this model are worth mentioning here. First, the scoring function can be made non-linear by adding additional hidden layer(s) as done in conventional multilayer neural networks. Second, user information can be incorporated into the model by learning user-specific weights. To incorporate user information we can learn a separate set of weights wn for each user un or group of users. The weights will provide user-specific information and are then applied to rank the unrated items for the corresponding user(s). 
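Equation 4 in code (with |γ| = 15 this is where the 17 free parameters live: the 15 weights w plus the biases b and b0; treating an all-zero feature vector as the signal that vm has no other raters is a simplifying assumption in this sketch):

```python
def score(gamma, w, b, b0):
    """f(gamma, W) of Equation 4.  An all-zero feature vector is taken to
    mean U(v_m) \\ u_n is empty, so the fallback bias b0 fires; otherwise
    the score is the linear combination w . gamma plus the global bias b."""
    if all(x == 0 for x in gamma):
        return b + b0
    return sum(wi * gi for wi, gi in zip(w, gamma)) + b
```

Items are then ranked for a user by sorting their scores in decreasing order.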
However, this extension makes the approach similar to the model-based approaches, with all the corresponding disadvantages mentioned above. Finally, additional user/item information such as, for example, personal information for users and description/genre etc. for items, can be incorporated by simply concatenating it with γ(un, vm) and expanding the dimensionality of W. Note that if these additional features can be extracted efficiently, incorporating them will not add significant overhead to either learning or inference and the model can still be applied to new users and items very efficiently.\n\nIn the form given by Equation 4 our model has a total of |γ|+2 parameters to be learned. We can use any of the developed learning-to-rank approaches to optimize W. In this work we chose to use the LambdaRank method, due to its excellent performance, having recently won the Yahoo! Learning-To-Rank Challenge [7]. We omit the description of LambdaRank here due to the lack of space, and refer the reader to [6] and [5] for a detailed description.\n\n5 Experiments\n\nTable 1: Dataset statistics.\n\nTo validate the proposed approach we conducted extensive experiments on three publicly available datasets: two movie datasets, MovieLens-1 and MovieLens-2, and a musical artist dataset from Yahoo! [1]. All datasets were kept as is except Yahoo!, which we subsampled by first selecting the 10,000 most popular items and then selecting the 100,000 users with the most ratings. The subsampling was done to speed up the experiments as the original dataset has close to 2 million users and 100,000 items. In addition to subsampling we rescaled user ratings from 0-100 to the 1-5 interval to make the data consistent with the other two datasets. The rescaling was done by mapping 0-19 to 1, 20-39 to 2, etc. The user, item and rating statistics are summarized in Table 1. 
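The 0-100 to 1-5 rescaling just described (0-19 → 1, 20-39 → 2, etc.) amounts to binning in 20-point steps; how a rating of exactly 100 is handled is not spelled out in the text, so clamping it into the top bin is my assumption:

```python
def rescale(r):
    """Map a Yahoo! 0-100 rating onto the 1-5 scale in 20-point bins.
    A rating of exactly 100 would fall into a sixth bin, so it is
    clamped into bin 5 (an assumption)."""
    return min(r // 20 + 1, 5)
```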
To investigate the effect that the number of ratings has on accuracy we follow the framework of [23, 2]. For each dataset we randomly select 10, 20, 30, 40 ratings from each user for training, 10 for validation, and test on the remaining ratings. Users with fewer than 30, 40, 50, 60 ratings were removed to ensure that we could evaluate on at least 10 ratings for each user. Note that the number of test items varies significantly across users with many users having more test ratings than training ones. This simulates the real-life CR scenario where the set of unrated items from which the recommendations are generated is typically much larger than the rated item set for each user.\n\nWe trained our ranking model, referred to as WLT, using stochastic gradient descent with the learning rates 10⁻², 10⁻³, 10⁻⁴ for MovieLens-1, MovieLens-2 and Yahoo! respectively. We found that 1 to 2 iterations¹ were sufficient to train the models. We also found that using smaller learning rates typically resulted in better generalization. We compare WLT with a well-established user-based (UB) collaborative filtering model. We also compare with two collaborative ranking models: the PMF-based ranker [2] (PMF-R) and CofiRank [23] (CO). To make the comparison fair we used the same LambdaRank architecture to train both WLT and PMF-R. Note that both PMF-R and CofiRank report state-of-the-art CR results. To compute the PMF features we used extensive cross-validation to determine the L2 penalty weights and the latent dimension size D (5, 10, 10 for MovieLens-1, MovieLens-2, and Yahoo! datasets respectively). 
Dataset      Users    Items     Ratings\nMovieLens-1    1000     1700     100,000\nMovieLens-2  72,000   10,000  10,000,000\nYahoo!      100,000   10,000  45,729,723\n\n¹Note that 1 iteration of stochastic gradient descent corresponds to |U| weight updates.\n\nFor CofiRank we used the settings suggested in [23] and ran the code available on the author's home page. Similarly to [2], we found that the regression-based objective almost always gave the best results for CofiRank, consistently outperforming NDCG and ordinal objectives.\n\nTable 2: Collaborative Ranking results. NDCG values at different truncation levels are shown within the main columns, which are split based on the number of training ratings. Each model's rounded number of parameters is shown in brackets, with K = thousand, M = million.\n\n20\n\n10\n\n30\n\n40\n\nN@1 N@3 N@5\n\nN@1 N@3 N@5\n\nN@1 N@3 N@5\n\nN@1 N@3 N@5\n\n62.27 64.92 66.14\n74.02 71.55 70.90\n71.43 71.64 71.43\n74.09 71.85 71.52\n\n71.29 70.78 70.87\n70.65 70.04 70.09\n68.80 68.51 68.76\n73.93 72.63 72.37\n\n72.65 71.98 71.90\n72.22 71.48 71.43\n64.60 65.62 66.38\n74.67 73.37 73.04\n\n73.33 72.63 72.42\n72.18 71.60 71.55\n62.82 63.49 64.25\n75.19 73.73 73.30\n\n64.25 65.75 66.58\n72.77 72.23 71.55\n71.60 71.15 70.58\n71.41 71.16 71.02\n\n49.30 54.67 57.36\n69.39 68.33 68.65\n67.28 66.23 66.59\n70.96 68.25 67.98\n\n57.20 55.29 54.31\n52.86 51.98 51.53\n57.42 56.88 56.46\n58.76 55.20 53.53\n\n57.49 61.81 62.88\n72.50 70.42 69.95\n71.82 70.80 70.30\n70.34 69.50 69.21\n\nMovieLens-1:\nUB\nPMF-R(12K)\nCO(240K)\nWLT(17)\nMovieLens-2:\nUB\n67.62 68.23 68.74\nPMF-R(500K) 70.12 69.41 69.35\n70.14 68.40 68.46\nCO(5M)\n72.78 71.70 71.49\nWLT(17)\nYahoo!:\nUB\n68.97 65.89 64.50\nPMF-R(1M)\n69.46 68.05 67.21\nCO(10M)\n61.68 60.78 60.24\n71.50 68.52 67.00\nWLT(17)\n\nFor WLT and UB models we use cosine similarity as the distance function to find the top-K neighbors. 
Note that using the same similarity function ensures that both models select the same neighbor\nsets and allows for fair comparison. The number of neighbors K was cross validated in the range\n[10, 100] on the small MovieLens-1 dataset and set to 200 on all other datasets as we found the\nresults to be insensitive for K above 100 which is consistent with the \ufb01ndings of [15]. In all experi-\nments only ratings in the training set were used to select the neighbors, and make predictions for the\nvalidation and test set items.\n5.1 Results\n\n64.29 61.48 60.16\n63.93 62.42 61.65\n60.59 59.94 59.48\n66.06 62.77 61.21\n\n66.82 63.83 62.42\n66.82 65.41 64.61\n62.07 61.10 60.54\n69.74 66.58 65.02\n\nThe NDCG (N@T) results at truncations 1,3 and 5 are shown in Table 2. From the table it is seen\nthat the WLT model performs comparably to the best baseline on MovieLens-1, outperforms all\nmethods on MovieLens-2 and is also the best overall approach on Yahoo!. Across the datasets the\ngains are especially large at lower truncations N@1 and N@3, which is important since those items\nwill most likely be the ones viewed by the user.\nSeveral patterns can also be seen from the table. First, when the number of users and ratings is small\n(MovieLens-1) the performance of the UB approach signi\ufb01cantly drops. This is likely due to the fact\nthat neighbors cannot be found reliably in this setting since users have little overlap in ratings. By\ntaking into account the con\ufb01dence information such as the number of available neighbors WLT is\nable to signi\ufb01cantly improve over UB while using the same set of neighbors. On MovieLens-1 WLT\noutperforms UB by as much as 20 NDCG points. Second, for larger datasets such as MovieLens-2\nand Yahoo! the model-based approaches have millions of parameters (shown in brackets in Table 2)\nto optimize and are highly prone to over\ufb01tting. 
Tuning the hyper-parameters for these models is difficult and computationally expensive in this setting as it requires conducting many cross-validation runs over large datasets. On the other hand, our approach achieves consistently better performance with only 17 parameters, and a single hyper-parameter K which is fixed to 200. Overall, the results demonstrate the robustness of the proposed features, which generalize well when both few and many users are available.\n\n5.2 Transfer Learning Results\n\nIn addition to the small number of parameters, another advantage of our approach over most model-based methods is that inference for a new user only requires finding the K neighbors. Thus both users and items can be taken from a different set, unseen during training. This transfer learning task is much more difficult than the strong generalization task [17] commonly used to test CF methods on new users. In strong generalization the models are evaluated on users not present at training time while keeping the item set fixed, while here the item set also changes. Note that it is impossible to\n\nTable 3: Transfer learning NDCG results. Original: WLT model trained on the respective dataset. 
WLT-M1\nand WLT-M2 models are trained on MovieLens-1 and MovieLens-2 respectively, WLT-Y is trained on Yahoo!.\nWLT-M1, WLT-M2 and WLT-Y models are applied to other datasets without retraining.\n\n10\n\n20\n\n30\n\n40\n\nN@1 N@3 N@5\n\nN@1 N@3 N@5\n\nN@1 N@3 N@5\n\nN@1 N@3 N@5\n\n72.78 71.70 71.49\n72.90 71.77 71.57\n68.04 68.03 68.41\n\n74.67 73.37 73.04\n74.67 73.36 73.01\n73.15 72.38 72.25\n\n71.41 71.16 71.02\n71.02 70.99 70.88\n67.33 66.99 67.99\n\n74.09 71.85 71.52\n73.28 71.70 71.46\n71.11 69.22 68.95\n\n58.76 55.20 53.53\n57.93 53.91 52.35\n58.81 54.70 53.15\n\n66.06 62.77 61.21\n66.03 62.68 61.18\n65.29 61.95 60.47\n\n69.74 66.58 65.02\n68.93 65.85 64.32\n68.68 65.55 64.07\n\n75.19 73.73 73.30\n75.28 73.76 73.28\n74.00 73.03 72.79\n\n70.96 68.25 67.98\n63.15 62.46 62.75\n44.12 47.06 48.75\n\n70.34 69.50 69.21\n69.66 68.61 68.47\n61.73 62.60 63.57\n\n73.93 72.63 72.37\n73.97 72.59 72.34\n71.54 71.02 71.07\n\nMovieLens-1:\nOriginal\nWLT-M2\nWLT-Y\nMovieLens-2:\nOriginal\nWLT-M1\nWLT-Y\nYahoo!:\nOriginal\n71.50 68.52 67.00\nWLT-M1\n71.15 68.17 66.65\nWLT-M2\n70.84 67.91 66.44\napply PMF-R, CO and most other model-based methods to this setting without re-training the entire\nmodel. Our model, on the other hand, can be applied without re-training by simply extracting the\nfeatures for every new user-item pair and applying the learned scoring function to rank the items.\nthe generalization properties of the model we took the three learned WLT mod-\nTo test\nrespec-\nels (referred to as WLT-M1, WLT-M2, WLT-Y for MovieLens-1&2 and Yahoo!\ntively) and applied each model to the datasets that it was not trained on.\nSo for instance\nWLT-M1 was applied to MovieLens-2 and Yahoo!. Table 3 shows the transfer results for\neach of the datasets along with the original results for the WLT model\ntrained on each\ndataset (referred to as Original). Note that\nnone of the models were re-trained or tuned\nin any way. 
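The inference procedure described above (find the K neighbors, extract features for each new user-item pair, apply the learned scoring function, and sort) can be sketched as follows. This is a hypothetical illustration rather than the paper's implementation: three toy features (mean neighbor rating, neighbor count, and rating spread) stand in for the 17 WLT features, and the function and weight names are assumed.

```python
import numpy as np

def rank_items_for_new_user(neighbor_ratings, w):
    """Score and rank candidate items for a previously unseen user.

    neighbor_ratings: dict mapping item id -> list of ratings that the
        user's K nearest neighbors (found from training data) gave the item.
    w: fixed, already-learned weight vector; nothing is retrained here.
    """
    scores = {}
    for item, ratings in neighbor_ratings.items():
        r = np.asarray(ratings, dtype=float)
        # Toy feature vector: mean neighbor rating, number of neighbors
        # who rated the item, and the spread of their ratings.
        phi = np.array([r.mean(), float(r.size), r.std()])
        scores[item] = float(w @ phi)
    # Higher score means recommended earlier.
    return sorted(scores, key=scores.get, reverse=True)
```

Because the scorer depends only on neighbor-derived statistics and a fixed weight vector, new users and new items require no retraining, which is what makes the transfer setting possible.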
From the table it is seen that our model generalizes very well to different domains. For instance, WLT-M1, trained on MovieLens-1, achieves state-of-the-art performance on MovieLens-2, outperforming all the baselines that were trained on MovieLens-2. Note that MovieLens-2 has over 5 times more items and 72 times more users than MovieLens-1, the majority of which the WLT-M1 model has not seen during training. Moreover, perhaps surprisingly, our model also generalizes well across item domains. The WLT-Y model, trained on musical artist data, achieves state-of-the-art performance on the MovieLens-2 movie data, performing better than all the baselines when 20, 30, and 40 ratings are used for training. Both WLT-M1 and WLT-M2 also achieve very competitive results on Yahoo!, outperforming most of the baselines.

More insight into why the model generalizes well can be gained from Figure 3, which shows the normalized weights learned by the WLT models on each of the three datasets. The weights are partitioned into feature sets from each of the three preference matrices (see Equation 2). From the figure it can be seen that the learned weights share a lot of similarities. The weights on the features from the WIN matrix are mostly positive, while those on the features from the LOSS matrix are mostly negative. The mean preferences and number of neighbors features have the highest absolute weights, which indicates that they are the most useful for predicting the item scores. The similarity between the weight vectors suggests that the features convey very similar information and remain invariant across different user/item sets.

Figure 3: Normalized WLT weights. White/black correspond to positive/negative weights; the weight magnitude is proportional to the square size.

6 Conclusion

In this work we presented an effective approach to extracting user-item features based on neighbor preferences.
The features allow us to apply any learning-to-rank approach to learn the ranking function. Experimental results show that state-of-the-art ranking results can be achieved using these features. Going forward, the strong transfer results call into question whether the complex machinery developed for CF is appropriate when the true goal is recommendation, since the information required to find the best items to recommend can be obtained from basic neighborhood statistics. We are also currently investigating additional features, such as neighbors' rating overlap.

References

[1] The Yahoo! R1 dataset. http://webscope.sandbox.yahoo.com/catalog.php?datatype=r.
[2] S. Balakrishnan and S. Chopra. Collaborative ranking. In WSDM, 2012.
[3] J. Bennet and S. Lanning. The Netflix prize. www.cs.uic.edu/~liub/KDD-cup-2007/NetflixPrize-description.pdf.
[4] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In UAI, 1998.
[5] C. J. C. Burges. From RankNet to LambdaRank to LambdaMART: An overview. Technical Report MSR-TR-2010-82, 2010.
[6] C. J. C. Burges, R. Ragno, and Q. V. Le. Learning to rank with nonsmooth cost functions. In NIPS, 2007.
[7] O. Chapelle, Y. Chang, and T.-Y. Liu. The Yahoo! Learning to Rank Challenge. http://learningtorankchallenge.yahoo.com, 2010.
[8] D. F. Gleich and L.-H. Lim. Rank aggregation via nuclear norm minimization. In SIGKDD, 2011.
[9] K. Y. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2), 2001.
[10] J. Herlocker, J. A. Konstan, and J. Riedl. An empirical analysis of design choices in neighborhood-based collaborative filtering algorithms. Information Retrieval, 5(4), 2002.
[11] T. Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst., 22(1), 2004.
[12] K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly relevant documents. In SIGIR, 2000.
[13] X. Jiang, L.-H. Lim, Y. Yao, and Y. Ye. Statistical ranking and combinatorial Hodge theory. Mathematical Programming, 127, 2011.
[14] H. Li. Learning to Rank for Information Retrieval and Natural Language Processing. Morgan & Claypool, 2011.
[15] N. Liu and Q. Yang. EigenRank: A ranking-oriented approach to collaborative filtering. In SIGIR, 2008.
[16] B. Marlin. Modeling user rating profiles for collaborative filtering. In NIPS, 2003.
[17] B. Marlin. Collaborative filtering: A machine learning perspective. Master's thesis, University of Toronto, 2004.
[18] B. McFee, T. Bertin-Mahieux, D. Ellis, and G. R. G. Lanckriet. The Million Song Dataset Challenge. In WWW, http://www.kaggle.com/c/msdchallenge, 2012.
[19] D. M. Pennock, E. Horvitz, S. Lawrence, and C. L. Giles. Collaborative filtering by personality diagnosis: A hybrid memory- and model-based approach. In UAI, 2000.
[20] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: An open architecture for collaborative filtering of netnews. In CSCW, 1994.
[21] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS, 2008.
[22] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In WWW, 2001.
[23] M. Weimer, A. Karatzoglou, Q. V. Le, and A. J. Smola. CofiRank: maximum margin matrix factorization for collaborative ranking. In NIPS, 2007.
[24] G.-R. Xue, C. Lin, Q. Yang, W. Xi, H.-J. Zeng, Y. Yu, and Z. Chen. Scalable collaborative filtering using cluster-based smoothing. In SIGIR, 2005.