{"title": "COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking", "book": "Advances in Neural Information Processing Systems", "page_first": 1593, "page_last": 1600, "abstract": null, "full_text": "COFIRANK\n\nMaximum Margin Matrix Factorization for\n\nCollaborative Ranking\n\nMarkus Weimer\u2217\n\nAlexandros Karatzoglou\u2020\n\nQuoc Viet Le\u2021\n\nAlex Smola\u00a7\n\nAbstract\n\nIn this paper, we consider collaborative \ufb01ltering as a ranking problem. We present\na method which uses Maximum Margin Matrix Factorization and optimizes rank-\ning instead of rating. We employ structured output prediction to optimize directly\nfor ranking scores. Experimental results show that our method gives very good\nranking scores and scales well on collaborative \ufb01ltering tasks.\n\n1 Introduction\n\nCollaborative \ufb01ltering has gained much attention in the machine learning community due to the\nneed for it in webshops such as those of Amazon, Apple and Net\ufb02ix. Webshops typically offer\npersonalized recommendations to their customers. The quality of these suggestions is crucial to the\noverall success of a webshop. However, suggesting the right items is a highly nontrivial task: (1)\nThere are many items to choose from. (2) Customers only consider very few (typically in the order\nof ten) recommendations. Collaborative \ufb01ltering addresses this problem by learning the suggestion\nfunction for a user from ratings provided by this and other users on items offered in the webshop.\nThose ratings are typically collected on a \ufb01ve star ordinal scale within the webshops.\nLearning the suggestion function can be considered either a rating (classi\ufb01cation) or a ranking prob-\nlem. In the context of rating, one predicts the actual rating for an item that a customer has not rated\nyet. 
On the other hand, for ranking, one predicts a preference ordering over the yet unrated items.
Given the limited size of the suggestion list shown to the customer, both rating and ranking are used to compile a top-N list of recommendations. This list is the direct outcome of a ranking algorithm, and can be computed from the results of a rating algorithm by sorting the items according to their predicted rating. We argue that rating algorithms solve the wrong problem, and one that is actually harder: the absolute value of the rating for an item is highly biased across different users, while the ranking is far less prone to this problem.
One approach is to solve the rating problem using regression. For example, for the Netflix prize, which uses root mean squared error as an evaluation criterion,1 the most straightforward approach is to use regression. However, the same arguments discussed above apply to regression. Thus, we present an algorithm that solves the ranking problem directly, without first computing the rating.

∗Telecooperation Group, TU Darmstadt, Germany, mweimer@tk.informatik.tu-darmstadt.de
†Department of Statistics, TU Wien, alexis@ci.tuwien.ac.at
‡Computer Science Department, Stanford University, Stanford, CA 94305, quoc.le@stanford.edu
§SML, NICTA, Northbourne Av. 218, Canberra 2601, ACT, Australia, alex.smola@nicta.com.au
1We conjecture that this is the case in order to keep the rules simple, since ranking scores are somewhat nontrivial to define, and there are many different ways to evaluate a ranking, as we will see in the following.

For collaborative rating, Maximum Margin Matrix Factorization (MMMF) [11, 12, 10] has proven to be an effective means of estimating the rating function. MMMF takes advantage of collaborative effects: rating patterns from other users are used to estimate ratings for the current user. One key advantage of this approach is that it works without feature extraction. 
Feature extraction is domain specific: procedures developed for movies cannot be applied to books, for example. Thus, it is hard to come up with a consistent feature set in applications with many different types of items, as for example at Amazon. Our algorithm is based on this idea of MMMF, but optimizes ranking measures instead of rating measures.
Given that only the top ranked items will actually be presented to the user, it is much more important to rank the first items right than the last ones. In other words, it is more important to predict what a user likes than what she dislikes. In more technical terms, the value of the error for estimation is not uniform over the ratings. All of the above reasoning leads to the following goals:
• The algorithm needs to be able to optimize ranking scores directly.
• The algorithm needs to be adaptable to different scores.
• The algorithm should not require any features besides the actual ratings.
• The algorithm needs to scale well and parallelize, such as to deal with millions of ratings arising from thousands of items and users with an acceptable memory footprint.
We achieve these goals by combining (a) recent results in optimization, in particular the application of bundle methods to convex optimization problems [14], (b) techniques for representing functions on matrices, in particular maximum margin matrix factorizations [10, 11, 12], and (c) the application of structured estimation to ranking problems. We describe our algorithm COFIRANK in terms of optimizing the ranking measure Normalized Discounted Cumulative Gain (NDCG).

2 Problem Definition

Assume that we have m items and u users. The ratings are stored in the sparse matrix Y where Y_ij ∈ {0, ..., r} is the rating of item j by user i and r is some maximal score. Y_ij is 0 if user i did not rate item j. In rating, one estimates the missing values in Y directly, while we treat this as a ranking task. 
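To make this setup concrete, the sparse rating matrix Y can be pictured as a per-user map in which a missing entry plays the role of Y_ij = 0. This is only an illustrative sketch with made-up data, not the representation used by COFIRANK:

```python
# Toy sparse rating store (illustrative only): Y[user][item] = rating in {1, ..., r};
# a missing key plays the role of Y_ij = 0, i.e. "user i did not rate item j".
Y = {
    0: {10: 5, 11: 3},   # user 0 rated items 10 and 11
    1: {10: 4, 12: 1},   # user 1 rated items 10 and 12
}

def rating(Y, user, item):
    """Return the stored rating, or 0 if the user did not rate the item."""
    return Y.get(user, {}).get(item, 0)

def unrated(Y, user, items):
    """The items a ranking is predicted over: those the user has not rated yet."""
    return [j for j in items if rating(Y, user, j) == 0]

print(rating(Y, 0, 10))             # 5
print(unrated(Y, 1, [10, 11, 12]))  # [11]
```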
Additionally, in NDCG [16], the correct order of higher ranked items is more important than that of lower ranked items:

Definition 1 (NDCG) Denote by y ∈ {1, ..., r}^n a vector of ratings and let π be a permutation of that vector. π_i denotes the position of item i after the permutation. Moreover, let k ∈ N be a truncation threshold and let π_s be the permutation which sorts y in decreasing order. In this case the Discounted Cumulative Gain (DCG@k) score [5] and its normalized variant (NDCG@k) are given by

DCG@k(y, π) = ∑_{i=1}^{k} (2^{y_{π_i}} − 1) / log(i + 2)   and   NDCG@k(y, π) = DCG@k(y, π) / DCG@k(y, π_s).

DCG@k is maximized for π = π_s. The truncation threshold k reflects how many recommendations users are willing to consider. NDCG is a normalized version of DCG so that the score is bounded by [0, 1].
Unlike classification and regression measures, DCG is defined on permutations, not on absolute values of the ratings. Departing from traditional pairwise ranking measures [4], DCG is position-dependent: higher positions have more influence on the score than lower positions. Optimizing DCG has gained much interest in the machine learning and information retrieval (e.g. [2]) communities. However, we present the first effort to optimize this measure for collaborative filtering.
To perform estimation, we need a recipe for obtaining the permutations π. Since we want our system to be scalable, we need a method which scales not much worse than linearly in the number of items to be ranked. The avenue we pursue is to estimate a matrix F ∈ R^{m×u} and to use the values F_ij for the purpose of ranking the items j for user i. 
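Definition 1 translates directly into code. The sketch below is a toy illustration, not the COFIRANK implementation; it represents an ordering as the list of items placed at positions 1..n (the inverse of the convention in Definition 1, which yields the same scores), and uses the natural logarithm as in the formula:

```python
import math

def dcg_at_k(y, pi, k):
    """DCG@k of ratings y for an ordering pi, where pi[i] is the item at position i."""
    return sum((2 ** y[pi[i]] - 1) / math.log(i + 2) for i in range(min(k, len(y))))

def ndcg_at_k(y, pi, k):
    """Normalize by the DCG of the ideal (decreasing) ordering; the result lies in [0, 1]."""
    ideal = sorted(range(len(y)), key=lambda j: -y[j])   # argsort(-y)
    return dcg_at_k(y, pi, k) / dcg_at_k(y, ideal, k)

y = [5, 3, 4, 1, 2]                      # toy ratings of five items
print(ndcg_at_k(y, [0, 2, 1, 4, 3], 3))  # 1.0: this ordering sorts y decreasingly
print(ndcg_at_k(y, [3, 4, 1, 2, 0], 3))  # well below 1: good items are ranked late
```

Note how the truncation at k = 3 means mistakes below position three do not affect the score at all, which is exactly the position dependence the text describes.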
Given a matrix Y of known ratings we are now able to define the performance of F:

R(F, Y) := ∑_{i=1}^{u} NDCG@k(Π^i, Y^i),   (1)

where Π^i := argsort(−F^i) is the permutation which sorts F^i in decreasing order.2 While we would like to maximize R(F, Y_test), we only have access to R(F, Y_train). Hence, we need to restrict the complexity of F to ensure good performance on the test set when maximizing the score on the training set.

3 Structured Estimation for Ranking

Unfortunately, R(F, Y) is non-convex. In fact, it is piecewise constant and therefore clearly not amenable to any type of smooth optimization. To address this issue we take recourse to structured estimation [13, 15]. Note that the score decomposes into a sum over individual users' scores, hence we only need to show how minimizing −NDCG(π, y) can be replaced by minimizing a convex upper bound; summing over the users then provides a convex bound on all of the terms.3 Our conversion works in three steps:

1. Converting NDCG(π, y) into a loss by computing the regret with respect to the optimal permutation argsort(−y).
2. Denote by π a permutation (of the n items a user might want to see) and let f ∈ R^n be an estimated rating vector. We design a mapping ψ(π, f) → R which is linear in f in such a way that maximizing ψ(π, f) with respect to π yields argsort(−f).
3. 
We use the convex upper-bounding technique described by [15] to combine regret and linear map into a convex upper bound which we can minimize efficiently.

Step 1 (Regret Conversion) Instead of maximizing NDCG(π, y) we may also minimize

Δ(π, y) := 1 − NDCG(π, y).   (2)

Δ(π, y) is nonnegative and vanishes for π = π_s.

Step 2 (Linear Mapping) Key to our reasoning is the Polya-Littlewood-Hardy inequality: for any two vectors a, b ∈ R^n, their inner product is maximized by sorting a and b in the same order, that is, ⟨a, b⟩ ≤ ⟨sort(a), sort(b)⟩. This allows us to encode the permutation π = argsort(−f) in the following fashion: denote by c ∈ R^n a decreasing nonnegative sequence. Then the function

ψ(π, f) := ⟨c, f_π⟩   (3)

is linear in f and maximized with respect to π at argsort(−f). Since c_i is decreasing by construction, the Polya-Littlewood-Hardy inequality applies. We found that choosing c_i = (i + 1)^{−0.25} produced good results in our experiments. However, we did not formally optimize this parameter.

Step 3 (Convex Upper Bound) We adapt a result of [15] which describes how to find convex upper bounds on nonconvex optimization problems.

Lemma 2 Assume that ψ is defined as in (3). Moreover, let π* := argsort(−f) be the ranking induced by f. Then the following loss function l(f, y) is convex in f and satisfies l(f, y) ≥ Δ(π*, y):

l(f, y) := max_π [Δ(π, y) + ⟨c, f_π − f⟩].   (4)

Proof We show convexity first. The argument of the maximization over the permutations π is a linear and thus convex function in f. Taking the maximum over a set of convex functions is convex itself, which proves the first claim. 
To see that it is an upper bound, we use the fact that

l(f, y) ≥ Δ(π*, y) + ⟨c, f_{π*} − f⟩ ≥ Δ(π*, y).   (5)

The second inequality follows from the fact that π* maximizes ⟨c, f_π⟩, so ⟨c, f_{π*} − f⟩ ≥ 0.

2M^i denotes row i of matrix M. Matrices are written in upper case, while vectors are written in lower case.
3This also opens the possibility for parallelization in the implementation of the algorithm.

4 Maximum Margin Matrix Factorization

Loss The reasoning in the previous section showed us how to replace the ranking score with a convex upper bound on a regret loss. This allows us to replace the problem of maximizing R(F, Y) by that of minimizing a convex function in F, namely

L(F, Y) := ∑_{i=1}^{u} l(F^i, Y^i).   (6)

Matrix Regularization Having addressed the non-convexity of the performance score, we need an efficient way of performing capacity control on F, since we only have L(F, Y_train) at our disposal, whereas we would like to do well on L(F, Y_test). We overcome this problem by means of a regularizer on F, namely the one proposed for Maximum Margin Matrix Factorization by Srebro and coworkers [10, 11, 12]. The key idea in their reasoning is to introduce a regularizer on F via

Ω[F] := min_{M,U} (1/2) [tr MM⊤ + tr UU⊤] subject to UM = F.   (7)

More specifically, [12] show that the above is a proper norm on F. 
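Before continuing, the three-step construction of Section 3 can be sanity-checked end to end on a toy example by enumerating all permutations, with c_i = (i + 1)^{−1/4} as above. This brute-force sketch is for illustration only; COFIRANK computes the same maximum via a linear assignment problem (Section 5), and the ordering convention here is pi[i] = item placed at position i:

```python
import itertools
import math

def ndcg(y, pi, k):
    """NDCG@k where pi[i] is the item placed at position i."""
    def dcg(order):
        return sum((2 ** y[order[i]] - 1) / math.log(i + 2) for i in range(k))
    ideal = sorted(range(len(y)), key=lambda j: -y[j])
    return dcg(pi) / dcg(ideal)

def loss(f, y):
    """l(f, y) = max_pi [Delta(pi, y) + <c, f_pi - f>], eq. (4), by enumeration."""
    n = len(f)
    c = [(i + 1) ** -0.25 for i in range(n)]  # decreasing nonnegative sequence
    best = -float("inf")
    for pi in itertools.permutations(range(n)):
        delta = 1.0 - ndcg(y, pi, n)                            # regret, eq. (2)
        psi = sum(c[i] * (f[pi[i]] - f[i]) for i in range(n))   # <c, f_pi - f>
        best = max(best, delta + psi)
    return best

f, y = [0.4, 2.0, 1.1], [1, 5, 3]                # toy estimated scores and ratings
pi_star = sorted(range(3), key=lambda j: -f[j])  # ranking induced by f
regret = 1.0 - ndcg(y, pi_star, 3)
print(loss(f, y) >= regret)                      # True: Lemma 2's bound holds
```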
While we could use a semidefinite program as suggested in [11], the latter is intractable for anything but the smallest problems.4 Instead, we replace F by UM and solve the following problem:

minimize_{M,U} L(UM, Y_train) + (λ/2) [tr MM⊤ + tr UU⊤].   (8)

Note that the above matrix factorization approach effectively allows us to learn an item matrix M and a user matrix U which store the specific properties of items and users, respectively. In other words, this approach learns the features of the items and the users. The dimension d of M ∈ R^{d×m} and U ∈ R^{d×u} is chosen mainly based on computational concerns, since a full representation would require d = min(m, u). On large problems the storage requirements for the user matrix can be enormous, and it is convenient to choose d = 10 or d = 100.

Algorithm While (8) is no longer jointly convex in M and U, it is still convex in M and in U individually, whenever the other matrix is kept fixed. We use this insight to perform alternating subspace descent as proposed by [10]. The algorithm does not guarantee global convergence, which is a small price to pay for computational tractability.

repeat
    For fixed M, minimize (8) with respect to U.
    For fixed U, minimize (8) with respect to M.
until no more progress is made or a maximum iteration count has been reached.

Note that on problems of the size of Netflix the matrix Y has 10^8 entries, which means that the number of iterations is typically time limited. We now discuss a general optimization method for solving regularized convex optimization problems. For more details see [14].

5 Optimization

Bundle Methods We discuss the optimization over the user matrix U first, that is, consider the problem of minimizing

R(U) := L(UM, Y_train) + (λ/2) tr UU⊤.   (9)

The regularizer tr UU⊤ is rather simple to compute and minimize. 
On the other hand, L is expensive to compute, since it involves maximizing l for all users.
Bundle methods, as proposed in [14], aim to overcome this problem by performing successive Taylor approximations of L and using them as lower bounds. In other words, they exploit the fact that for all U and U',

L(UM, Y_train) ≥ L(U'M, Y_train) + tr (U − U')⊤ ∂_U L(U'M, Y_train).

Algorithm 1 Bundle Method(ε)
    Initialize t = 0, G_0 = 0, b_0 = 0 and H = ∞
    repeat
        Find the minimizer U_t and the value L_t of the optimization problem
            minimize_U max_{0≤j≤t} [tr U⊤G_j + b_j] + (λ/2) tr U⊤U
        Compute the gradient G_{t+1} = ∂_U L(U_t M, Y_train)
        Compute the offset b_{t+1} = L(U_t M, Y_train) − tr U_t⊤ G_{t+1}
        if H' := L(U_t M, Y_train) + (λ/2) tr U_t U_t⊤ ≤ H then
            Update H ← H'
        end if
    until H − L_t ≤ ε

4In this case we optimize over [A F; F⊤ B] ⪰ 0 where Ω[F] is replaced by (1/2)[tr A + tr B].

Since the Taylor bound holds for arbitrary U', we may pick a set of locations U_j and use the maximum over the Taylor approximations at these locations to lower-bound L. Subsequently, we minimize this piecewise linear lower bound in combination with (λ/2) tr UU⊤ to obtain a new location at which to compute the next Taylor approximation, and iterate until convergence is achieved. Algorithm 1 provides further details. As we proceed with the optimization, we obtain increasingly tight lower bounds on L(UM, Y_train). One may show [14] that the algorithm converges to ε precision with respect to the minimizer of R(U) in O(1/ε) steps. Moreover, the initial distance from the optimal solution enters the bound only logarithmically.
After solving the optimization problem in U, we switch to optimizing over the item matrix M. 
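A scalar caricature may make Algorithm 1 concrete. The sketch below makes simplifying assumptions that go beyond the paper: a single variable w in place of the matrix U, a cheap quadratic standing in for the expensive loss L, and a grid search in place of the exact quadratic-program solution of the cutting-plane model:

```python
# One-variable caricature of a bundle method: minimize R(w) = L(w) + (lam/2) w^2.
# Each iteration adds a Taylor cut a_j*w + b_j lower-bounding L, minimizes the
# piecewise-linear model plus the regularizer, and stops once the gap between
# the best observed objective H and the model minimum drops below eps.
def bundle_method(L, dL, lam=1.0, eps=1e-4, w0=0.0, max_iter=100):
    cuts = []                                   # Taylor lower bounds (a_j, b_j)
    w, H = w0, float("inf")
    for _ in range(max_iter):
        a = dL(w)                               # gradient of L at the current w
        cuts.append((a, L(w) - a * w))          # cut satisfies a*v + b <= L(v)
        H = min(H, L(w) + 0.5 * lam * w * w)    # best upper bound so far
        # Minimize the model on a grid; a real implementation solves this
        # small quadratic program exactly.
        w, low = None, float("inf")
        for i in range(-1000, 7001):
            v = i / 1000.0
            m = max(a * v + b for a, b in cuts) + 0.5 * lam * v * v
            if m < low:
                w, low = v, m
        if H - low <= eps:                      # model is eps-tight: done
            break
    return w

w = bundle_method(lambda v: (v - 3) ** 2, lambda v: 2 * (v - 3))
print(round(w, 2))  # close to 2.0, the minimizer of (w-3)^2 + w^2/2
```

Note how, exactly as in the text, the model is a maximum over linear functions plus the quadratic regularizer, so each iteration only needs one loss and one gradient evaluation.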
The algorithm is virtually identical to that for U, except that we now use the regularizer in M instead of the one in U. We find experimentally that a small number of iterations (less than 10) is more than sufficient for convergence.

Computing the Loss So far we simply used the loss l(f, y) of (4) to define a convex loss without any concern for its computability. To implement Algorithm 1, however, we need to be able to solve the maximization of l with respect to the set of permutations π efficiently. One may show that the π which maximizes l(f, y) can be computed by solving the linear assignment problem min_X ∑_i ∑_j C_ij X_ij over permutation matrices X, with the cost matrix

C_ij = κ_i (2^{Y[j]} − 1) / (DCG(Y, k, π_s) log(i + 1)) − c_i f_j,   where κ_i = 1 if i < k and κ_i = 0 otherwise.

Efficient algorithms [7] based on the Hungarian Marriage method (also referred to as the Kuhn-Munkres algorithm) exist for this problem [8]: it turns out that this integer programming problem can be solved by invoking a linear program. This in turn allows us to compute l(f, y) efficiently.

Computing the Gradients The second ingredient needed for applying the bundle method is the gradient of L(F, Y) with respect to F, since this allows us to compute gradients with respect to M and U by applying the chain rule:

∂_M L(UM, Y) = U⊤ ∂_F L(F, Y)   and   ∂_U L(UM, Y) = ∂_F L(F, Y)⊤ M.

L decomposes into losses on individual users as described in (6). For each user i only row i of F matters. It follows that ∂_F L(F, Y) is composed of the gradients of l(F^i, Y^i). 
Note that for l defined as in (4) we have

∂_{F^i} l(F^i, Y^i) = [c − c_{π̄^{−1}}],

where π̄ denotes the maximizer of the loss in (4) and c_{π̄^{−1}} denotes the application of the inverse permutation π̄^{−1} to the vector c.

6 Experiments

We evaluated COFIRANK with the NDCG loss just defined (denoted by COFIRANK-NDCG) as well as with loss functions which optimize ordinal regression (COFIRANK-Ordinal) and regression (COFIRANK-Regression). COFIRANK-Ordinal applies the algorithm described above to preference ranking by optimizing the preference ranking loss. Similarly, COFIRANK-Regression optimizes for regression using the root mean squared loss. We looked at two real world evaluation settings, "weak" and "strong" generalization [9], on three publicly available data sets: EachMovie, MovieLens and Netflix. Statistics for those can be found in Table 1.

Dataset      Users    Movies   Ratings
EachMovie    61265    1623     2811717
MovieLens    983      1682     100000
Netflix      480189   17770    100480507

Table 1: Data set statistics

Weak generalization is evaluated by predicting the rank of unrated items for users known at training time. To do so, we randomly select N = 10, 20, 50 ratings for each user for training and evaluate on the remaining ratings. Users with less than 20, 30, 60 rated movies, respectively, were removed to ensure that we could evaluate on at least 10 movies per user. We compare COFIRANK-NDCG, COFIRANK-Ordinal, COFIRANK-Regression and MMMF [10]. Experimental results are shown in Table 2.
For all COFIRANK experiments, we choose λ = 10. We did not optimize for this parameter. The results for MMMF were obtained using MATLAB code available from the homepage of the authors of [10]. For those, we used λ = 1/1.9 for EachMovie and λ = 1/1.6 for MovieLens, as these values are reported to yield the best results for MMMF. 
In all experiments, we choose the dimensionality of U and M to be 100. All COFIRANK experiments and those of MMMF on MovieLens were repeated ten times. Unfortunately, we underestimated the runtime and memory requirements of MMMF on EachMovie. Thus, we cannot report results on this data set using MMMF.
Additionally, we performed some experiments on the Netflix data set. However, we cannot compare to any of the other methods on that data set, as to the best of our knowledge COFIRANK is the first collaborative ranking algorithm to be applied to it, presumably because of its large size.

Strong generalization is evaluated on users that were not present at training time. We follow the procedure described in [17]: Movies with less than 50 ratings are discarded. The 100 users with the most rated movies are selected as the test set, and the methods are trained on the remaining users. In evaluation, 10, 20 or 50 ratings of each of the 100 test users are selected. For those ratings, the user training procedure is applied to optimize U. M is kept fixed in this process to the values obtained during training. 
The remaining ratings are tested using the same procedure as for the weak

Method                   N=10               N=20               N=50
EachMovie
COFIRANK-NDCG            0.6562 ± 0.0012    0.6644 ± 0.0024    0.6406 ± 0.0040
COFIRANK-Ordinal         0.6727 ± 0.0309    0.7240 ± 0.0018    0.7214 ± 0.0076
COFIRANK-Regression      0.6114 ± 0.0217    0.6400 ± 0.0354    0.5693 ± 0.0428
MovieLens
COFIRANK-NDCG            0.6400 ± 0.0061    0.6307 ± 0.0062    0.6076 ± 0.0077
COFIRANK-Ordinal         0.6233 ± 0.0039    0.6686 ± 0.0058    0.7169 ± 0.0059
COFIRANK-Regression      0.6420 ± 0.0252    0.6509 ± 0.0190    0.6584 ± 0.0187
MMMF                     0.6061 ± 0.0037    0.6937 ± 0.0039    0.6989 ± 0.0051
Netflix
COFIRANK-NDCG            0.6081             0.6204
COFIRANK-Regression      0.6082             0.6287

Table 2: Results for the weak generalization setting experiments. We report the NDCG@10 accuracy for various numbers of training ratings used per user. For most results we report the mean over ten runs and the standard deviation. We also report the p-values for the best vs. 
second best score.

Method           N=10               N=20               N=50
EachMovie
COFIRANK-NDCG    0.6367 ± 0.001     0.6619 ± 0.0022    0.6771 ± 0.0019
GPR              0.4558 ± 0.015     0.4849 ± 0.0066    0.5375 ± 0.0089
CGPR             0.5734 ± 0.014     0.5989 ± 0.0118    0.6341 ± 0.0114
GPOR             0.3692 ± 0.002     0.3678 ± 0.0030    0.3663 ± 0.0024
CGPOR            0.3789 ± 0.011     0.3781 ± 0.0056    0.3774 ± 0.0041
MMMF             0.4746 ± 0.034     0.4786 ± 0.0139    0.5478 ± 0.0211
MovieLens
COFIRANK-NDCG    0.6237 ± 0.0241    0.6711 ± 0.0065    0.6455 ± 0.0103
GPR              0.4937 ± 0.0108    0.5020 ± 0.0089    0.5088 ± 0.0141
CGPR             0.5101 ± 0.0081    0.5249 ± 0.0073    0.5438 ± 0.0063
GPOR             0.4988 ± 0.0035    0.5004 ± 0.0046    0.5011 ± 0.0051
CGPOR            0.5053 ± 0.0047    0.5089 ± 0.0044    0.5049 ± 0.0035
MMMF             0.5521 ± 0.0183    0.6133 ± 0.0180    0.6651 ± 0.0190

Table 3: The NDCG@10 accuracy over ten runs and the standard deviation for the strong generalization evaluation.

generalization. We repeat the whole process ten times and again use λ = 10 and a dimensionality of 100. We compare COFIRANK-NDCG to Gaussian Process Ordinal Regression (GPOR) [3], Gaussian Process Regression (GPR) and their collaborative extensions (CGPR, CGPOR) [17]. Table 3 shows our results compared to the ones from [17].
COFIRANK performs strongly compared to most of the other tested methods. Particularly in the strong generalization setting, COFIRANK outperforms the existing methods in almost all settings. Note that all methods except COFIRANK and MMMF use additional extracted features which are either provided with the data set or extracted from the IMDB; MMMF and COFIRANK only rely on the rating matrix. 
In the weak generalization experiments on the MovieLens data, COFIRANK performs better for N = 20 but is marginally outperformed by MMMF for the N = 10 and N = 50 cases. We believe that with proper parameter tuning, COFIRANK would perform better in these cases as well.

7 Discussion and Summary

COFIRANK is a novel approach to collaborative filtering which directly solves the ranking problem faced by webshops. It can do so faster and at higher accuracy than approaches which learn a rating in order to produce a ranking. COFIRANK is adaptable to different loss functions such as NDCG, regression and ordinal regression in a plug-and-play manner. Additionally, COFIRANK is well suited for privacy-concerned applications, as the optimization itself does not need ratings from the users, but only gradients.
Our results, which we obtained without parameter tuning, are on par with or outperform several of the most successful approaches to collaborative filtering such as MMMF, even when those are used with tuned parameters. COFIRANK performs best on data sets of realistic sizes such as EachMovie and significantly outperforms other approaches in the strong generalization setting.
In our experiments, COFIRANK proved to be very fast. For example, training on EachMovie with N = 10 can be done in less than ten minutes and uses less than 80 MB of memory on a laptop. For N = 20, COFIRANK obtained an NDCG@10 of 0.72 after the first iteration, which also took less than ten minutes. This is the highest NDCG@10 score on that data set we are aware of (apart from the result of COFIRANK after convergence). A comparison to MMMF in that regard is difficult, as it is implemented in MATLAB and COFIRANK in C++. However, COFIRANK is more than ten times faster than MMMF while using far less memory. In the future, we will exploit the fact that the algorithm is easily parallelizable to obtain even better performance on current multi-core hardware as well as computer clusters. 
Even the current implementation allows us to report the first results on the Netflix data set for direct ranking optimization.
Acknowledgments: Markus Weimer is funded by the German Research Foundation as part of the Research Training Group 1223: "Feedback-Based Quality Management in eLearning".
Software: COFIRANK is available from http://www.cofirank.org

References
[2] C. J. Burges, Q. V. Le, and R. Ragno. Learning to rank with nonsmooth cost functions. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, 2007.
[3] W. Chu and Z. Ghahramani. Gaussian processes for ordinal regression. J. Mach. Learn. Res., 6:1019-1041, 2005.
[4] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 115-132, Cambridge, MA, 2000. MIT Press.
[5] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In ACM Special Interest Group in Information Retrieval (SIGIR), pages 41-48. New York: ACM, 2002.
[7] R. Jonker and A. Volgenant. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38:325-340, 1987.
[8] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83-97, 1955.
[9] B. Marlin. Collaborative filtering: A machine learning perspective. Master's thesis, University of Toronto, 2004.
[10] J. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proc. Intl. Conf. Machine Learning, 2005.
[11] N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In L. K. Saul, Y. Weiss, and L. 
Bottou, editors, Advances in Neural Information Processing Systems 17, Cambridge, MA, 2005. MIT Press.
[12] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In P. Auer and R. Meir, editors, Proc. Annual Conf. Computational Learning Theory, number 3559 in Lecture Notes in Artificial Intelligence, pages 545-560. Springer-Verlag, June 2005.
[13] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 25-32, Cambridge, MA, 2004. MIT Press.
[14] C. H. Teo, Q. Le, A. J. Smola, and S. V. N. Vishwanathan. A scalable modular convex solver for regularized risk minimization. In Conference on Knowledge Discovery and Data Mining, 2007.
[15] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453-1484, 2005.
[16] E. Voorhees. Overview of the TREC 2001 question answering track. In Text REtrieval Conference (TREC) Proceedings. Department of Commerce, National Institute of Standards and Technology, 2001. NIST Special Publication 500-250: The Tenth Text REtrieval Conference (TREC 2001).
[17] S. Yu, K. Yu, V. Tresp, and H. P. Kriegel. Collaborative ordinal regression. In W. W. Cohen and A. Moore, editors, Proc. Intl. Conf. Machine Learning, pages 1089-1096. ACM, 2006.
", "award": [], "sourceid": 612, "authors": [{"given_name": "Markus", "family_name": "Weimer", "institution": null}, {"given_name": "Alexandros", "family_name": "Karatzoglou", "institution": null}, {"given_name": "Quoc", "family_name": "Le", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}]}