{"title": "Mixture-Rank Matrix Approximation for Collaborative Filtering", "book": "Advances in Neural Information Processing Systems", "page_first": 477, "page_last": 485, "abstract": "Low-rank matrix approximation (LRMA) methods have achieved excellent accuracy among today's collaborative filtering (CF) methods. In existing LRMA methods, the rank of user/item feature matrices is typically fixed, i.e., the same rank is adopted to describe all users/items. However, our studies show that submatrices with different ranks could coexist in the same user-item rating matrix, so that approximations with fixed ranks cannot perfectly describe the internal structures of the rating matrix, therefore leading to inferior recommendation accuracy. In this paper, a mixture-rank matrix approximation (MRMA) method is proposed, in which user-item ratings can be characterized by a mixture of LRMA models with different ranks. Meanwhile, a learning algorithm capitalizing on iterated conditional modes is proposed to tackle the non-convex optimization problem pertaining to MRMA. Experimental studies on MovieLens and Netflix datasets demonstrate that MRMA can outperform six state-of-the-art LRMA-based CF methods in terms of recommendation accuracy.", "full_text": "Mixture-Rank Matrix Approximation for Collaborative Filtering

Dongsheng Li(1)  Chao Chen(1)  Wei Liu(2)*  Tun Lu(3,4)  Ning Gu(3,4)  Stephen M. Chu(1)
(1) IBM Research - China
(2) Tencent AI Lab, China
(3) School of Computer Science, Fudan University, China
(4) Shanghai Key Laboratory of Data Science, Fudan University, China
{ldsli, cshchen, schu}@cn.ibm.com, wliu@ee.columbia.edu, {lutun, ninggu}@fudan.edu.cn

Abstract

Low-rank matrix approximation (LRMA) methods have achieved excellent accuracy among today's collaborative filtering (CF) methods. In existing LRMA methods, the rank of user/item feature matrices is typically fixed, i.e., the same rank is adopted to describe all users/items. 
However, our studies show that submatrices with different ranks could coexist in the same user-item rating matrix, so that approximations with fixed ranks cannot perfectly describe the internal structures of the rating matrix, therefore leading to inferior recommendation accuracy. In this paper, a mixture-rank matrix approximation (MRMA) method is proposed, in which user-item ratings can be characterized by a mixture of LRMA models with different ranks. Meanwhile, a learning algorithm capitalizing on iterated conditional modes is proposed to tackle the non-convex optimization problem pertaining to MRMA. Experimental studies on MovieLens and Netflix datasets demonstrate that MRMA can outperform six state-of-the-art LRMA-based CF methods in terms of recommendation accuracy.

1 Introduction

Low-rank matrix approximation (LRMA) is one of the most popular methods in today's collaborative filtering (CF) methods due to its high accuracy [11, 12, 13, 17]. Given a targeted user-item rating matrix R \in R^{m \times n}, the general goal of LRMA is to find two rank-k matrices U \in R^{m \times k} and V \in R^{n \times k} such that R \approx \hat{R} = U V^T. After obtaining the user and item feature matrices, the recommendation score of the i-th user on the j-th item can be obtained by the dot product between their corresponding feature vectors, i.e., U_i V_j^T.
In existing LRMA methods [12, 13, 17], the rank k is considered fixed, i.e., the same rank is adopted to describe all users and items. However, in many real-world user-item rating matrices, e.g., MovieLens and Netflix, users/items have a significantly varying number of ratings, so that submatrices with different ranks could coexist. For instance, a submatrix containing users and items with few ratings should be of a low rank, e.g., 10 or 20, while a submatrix containing users and items with many ratings may be of a relatively higher rank, e.g., 50 or 100. 
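For concreteness, the fixed-rank prediction rule above (the score of user i on item j is the dot product of their feature vectors) can be sketched as follows. This is a minimal illustration; the feature values are hypothetical toy numbers, not factors learned from real rating data.

```python
# Rank-k LRMA stores one length-k feature vector per user and per item;
# a predicted rating is just their dot product: hat R_ij = U_i . V_j.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

k = 2
U = [[0.9, 0.1], [0.2, 0.8]]              # user feature matrix, m = 2 users
V = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # item feature matrix, n = 3 items

# hat R = U V^T: the full m x n reconstruction
R_hat = [[dot(U[i], V[j]) for j in range(3)] for i in range(2)]
assert R_hat[0][0] == 0.9   # user 0 aligns with item 0's first factor
assert R_hat[1][2] == 0.8   # user 1 aligns with item 2's second factor
```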
Adopting a fixed rank for all users and items cannot perfectly model the internal structures of the rating matrix, which will lead to imperfect approximations as well as degraded recommendation accuracy.
In this paper, we propose a mixture-rank matrix approximation (MRMA) method, in which user-item ratings are represented by a mixture of LRMA models with different ranks. For each user/item, a probability distribution with a Laplacian prior is exploited to describe its relationship with different

*This work was conducted while the author was with IBM.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

LRMA models, while a joint distribution of user-item pairs is employed to describe the relationship between the user-item ratings and different LRMA models. To cope with the non-convex optimization problem associated with MRMA, a learning algorithm capitalizing on iterated conditional modes (ICM) [1] is proposed, which can obtain a local maximum of the joint probability by iteratively maximizing the probability of each variable conditioned on the rest. Finally, we evaluate the proposed MRMA method on MovieLens and Netflix datasets. The experimental results show that MRMA achieves better accuracy than state-of-the-art LRMA-based CF methods, further boosting the performance of recommender systems that leverage matrix approximation.

2 Related Work

Low-rank matrix approximation methods have been leveraged by much recent work to achieve accurate collaborative filtering, e.g., PMF [17], BPMF [16], APG [19], GSMF [20], SMA [13], etc. These methods first train one user feature matrix and one item feature matrix, and then use these feature matrices for all users and items without any adaptation. However, all these methods adopt fixed rank values for the targeted user-item rating matrices. 
However, as analyzed in this paper, submatrices with different ranks could coexist in the rating matrices, so adopting only a single fixed rank cannot achieve optimal matrix approximation. Besides stand-alone matrix approximation methods, ensemble methods, e.g., DFC [15], LLORMA [12], WEMAREC [5], etc., and mixture models, e.g., MPMA [4], etc., have been proposed to improve recommendation accuracy and/or scalability by weighing different base models across different users/items. However, these methods do not consider using different ranks to derive different base models. The idea of mixture-rank matrix approximation (MRMA) could be borrowed to generate more accurate base models in these methods and further enhance their accuracy.
In many matrix approximation-based collaborative filtering methods, auxiliary information, e.g., implicit feedback [9], social information [14], contextual information [10], etc., is introduced to improve the recommendation quality of pure matrix approximation methods. The idea of MRMA is orthogonal to these methods, and can thus be employed by them to further improve their recommendation accuracy. In general low-rank matrix approximation methods, it is non-trivial to directly determine the maximum rank of a targeted matrix [2, 3]. Candès et al. [3] proved that a non-convex rank minimization problem can be equivalently transformed into a convex nuclear norm minimization problem. Based on this finding, we can easily determine the range of ranks for MRMA and choose different K values (the maximum rank in MRMA) for different datasets.

3 Problem Formulation

In this paper, upper-case letters such as R, U, V denote matrices, and k denotes the rank for matrix approximation. For a targeted user-item rating matrix R \in R^{m \times n}, m denotes the number of users, n denotes the number of items, and R_{i,j} denotes the rating of the i-th user on the j-th item. 
\hat{R} denotes the low-rank approximation of R. The general goal of rank-k matrix approximation is to determine user and item feature matrices U \in R^{m \times k} and V \in R^{n \times k} such that R \approx \hat{R} = U V^T. The rank k is considered low, because k \ll min{m, n} can already achieve good performance in many CF applications.
In real-world rating matrices, e.g., MovieLens and Netflix, users/items have a varying number of ratings, so a lower rank which best describes the users/items with fewer ratings will easily underfit the users/items with more ratings, and similarly a higher rank will easily overfit the users/items with fewer ratings. A case study conducted on the MovieLens (1M) dataset (with 1M ratings from 6,000 users on 4,000 movies) confirms that internal submatrices with different ranks indeed coexist in the rating matrix. Here, we run the probabilistic matrix factorization (PMF) method [17] using k = 5 and k = 50, and then compare the root mean square errors (RMSEs) for the users/items with less than 10 ratings and with more than 50 ratings.
As shown in Table 1, when the rank is 5, the users/items with less than 10 ratings achieve lower RMSEs than when the rank is 50. This indicates that the PMF model overfits the users/items with less than 10 ratings when k = 50. Similarly, we can conclude that the PMF model underfits the users/items with more than 50 ratings when k = 5. Moreover, PMF with k = 50 achieves a lower overall RMSE (higher accuracy) than PMF with k = 5, but the improvement comes at the cost of sacrificed accuracy for the users and items with a small number of ratings, e.g., less than 10. 
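The methodology of this case study (RMSE computed separately for sparse and dense users) can be sketched as follows. The held-out ratings, predictions, and per-user training counts below are made-up illustrative numbers; the actual values in Table 1 come from training PMF on MovieLens.

```python
# Group held-out ratings by how many training ratings each user has,
# then compute RMSE per group; all data here is hypothetical.
import math
from collections import defaultdict

def rmse(pairs):
    return math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))

# (user, actual rating, predicted rating) on held-out data
test = [("u1", 4.0, 3.5), ("u1", 3.0, 3.2), ("u2", 5.0, 4.1), ("u2", 2.0, 2.4)]
n_train = {"u1": 8, "u2": 120}   # training ratings per user (u1 sparse, u2 dense)

groups = defaultdict(list)
for user, r, p in test:
    groups["<10" if n_train[user] < 10 else ">50"].append((r, p))

per_group = {g: rmse(pairs) for g, pairs in groups.items()}
# A model can do well on one group while doing poorly on the other,
# which is exactly the rank-dependent behavior reported in Table 1.
```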
This study shows that PMF with fixed rank values cannot perfectly model the internal mixture-rank structure of the rating matrix. To this end, it is desirable to model users and items with different ranks.

Table 1: The root mean square errors (RMSEs) of PMF [17] for users/items with different numbers of ratings, when rank k = 5 and k = 50.

                        rank = 5    rank = 50
#user ratings < 10      0.9058      0.9165
#user ratings > 50      0.8416      0.8352
#item ratings < 10      0.9338      0.9598
#item ratings > 50      0.8520      0.8418
All                     0.8614      0.8583

4 Mixture-Rank Matrix Approximation (MRMA)

Figure 1: The graphical model for the proposed mixture-rank matrix approximation (MRMA) method. [Nodes: \mu_\alpha, b_\alpha \to \alpha_i^k; \sigma_U^2 \to U_i^k; \sigma_V^2 \to V_j^k; \mu_\beta, b_\beta \to \beta_j^k; (U_i^k, V_j^k, \alpha_i^k, \beta_j^k, \sigma^2) \to R_{i,j}; plates over i = 1, ..., m, j = 1, ..., n, and k = 1, ..., K.]

Following the idea of PMF, we exploit a probabilistic model with Gaussian noise to model the ratings [17]. As shown in Figure 1, the conditional distribution over the observed ratings for the mixture-rank model can be defined as follows:

    \Pr(R \mid U, V, \alpha, \beta, \sigma^2) = \prod_{i=1}^{m} \prod_{j=1}^{n} \Big[ \sum_{k=1}^{K} \alpha_i^k \beta_j^k \, \mathcal{N}\big(R_{i,j} \mid U_i^k (V_j^k)^T, \sigma^2\big) \Big]^{1_{i,j}},    (1)

where \mathcal{N}(x \mid \mu, \sigma^2) denotes the probability density function of a Gaussian distribution with mean \mu and variance \sigma^2. K is the maximum rank among all internal structures of the user-item rating matrix. \alpha^k and \beta^k are the weight vectors of the rank-k matrix approximation model for all users and items, respectively; thus, \alpha_i^k and \beta_j^k denote the weights of the rank-k model for the i-th user and the j-th item. U^k and V^k are the feature matrices of the rank-k matrix approximation model for all users and items, respectively; likewise, U_i^k and V_j^k denote the feature vectors of the rank-k model for the i-th user and the j-th item. 1_{i,j} is an indicator function, which is 1 if R_{i,j} is observed and 0 otherwise.
By placing a zero-mean isotropic Gaussian prior [6, 17] on the user and item feature vectors, we have

    \Pr(U^k \mid \sigma_U^2) = \prod_{i=1}^{m} \mathcal{N}(U_i^k \mid 0, \sigma_U^2 I), \qquad \Pr(V^k \mid \sigma_V^2) = \prod_{j=1}^{n} \mathcal{N}(V_j^k \mid 0, \sigma_V^2 I).    (2)

For \alpha^k and \beta^k, we choose a Laplacian prior, because the models with the most suitable ranks for each user/item should receive large weights, i.e., \alpha^k and \beta^k should be sparse. By placing the Laplacian prior on the user and item weight vectors, we have

    \Pr(\alpha^k \mid \mu_\alpha, b_\alpha) = \prod_{i=1}^{m} \mathcal{L}(\alpha_i^k \mid \mu_\alpha, b_\alpha), \qquad \Pr(\beta^k \mid \mu_\beta, b_\beta) = \prod_{j=1}^{n} \mathcal{L}(\beta_j^k \mid \mu_\beta, b_\beta),    (3)

where \mu_\alpha and b_\alpha are the location and scale parameters of the Laplacian distribution for \alpha, and accordingly \mu_\beta and b_\beta are the location and scale parameters for \beta.
The log of the posterior distribution over the user and item features and weights can be given as follows:

    l = \ln \Pr(U, V, \alpha, \beta \mid R, \sigma^2, \sigma_U^2, \sigma_V^2, \mu_\alpha, b_\alpha, \mu_\beta, b_\beta)
      \propto \ln \big[ \Pr(R \mid U, V, \alpha, \beta, \sigma^2) \Pr(U \mid \sigma_U^2) \Pr(V \mid \sigma_V^2) \Pr(\alpha \mid \mu_\alpha, b_\alpha) \Pr(\beta \mid \mu_\beta, b_\beta) \big]
      = \sum_{i=1}^{m} \sum_{j=1}^{n} 1_{i,j} \ln \Big[ \sum_{k=1}^{K} \alpha_i^k \beta_j^k \, \mathcal{N}\big(R_{i,j} \mid U_i^k (V_j^k)^T, \sigma^2\big) \Big]
        - \frac{1}{2\sigma_U^2} \sum_{k=1}^{K} \sum_{i=1}^{m} (U_i^k)^2 - \frac{1}{2} Km \ln \sigma_U^2
        - \frac{1}{2\sigma_V^2} \sum_{k=1}^{K} \sum_{j=1}^{n} (V_j^k)^2 - \frac{1}{2} Kn \ln \sigma_V^2
        - \frac{1}{b_\alpha} \sum_{k=1}^{K} \sum_{i=1}^{m} |\alpha_i^k - \mu_\alpha| - \frac{1}{2} Km \ln b_\alpha^2
        - \frac{1}{b_\beta} \sum_{k=1}^{K} \sum_{j=1}^{n} |\beta_j^k - \mu_\beta| - \frac{1}{2} Kn \ln b_\beta^2 + C,    (4)

where C is a constant that does not depend on any parameters. Since the above optimization problem is difficult to solve directly, we obtain its lower bound using Jensen's inequality and then optimize the following lower bound:

    l' = -\frac{1}{2\sigma^2} \sum_{i=1}^{m} \sum_{j=1}^{n} 1_{i,j} \Big[ \sum_{k=1}^{K} \alpha_i^k \beta_j^k \big(R_{i,j} - U_i^k (V_j^k)^T\big)^2 \Big] - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{n} 1_{i,j} \ln \sigma^2
        - \frac{1}{2\sigma_U^2} \sum_{k=1}^{K} \sum_{i=1}^{m} (U_i^k)^2 - \frac{1}{2} Km \ln \sigma_U^2 - \frac{1}{2\sigma_V^2} \sum_{k=1}^{K} \sum_{j=1}^{n} (V_j^k)^2 - \frac{1}{2} Kn \ln \sigma_V^2
        - \frac{1}{b_\alpha} \sum_{k=1}^{K} \sum_{i=1}^{m} |\alpha_i^k - \mu_\alpha| - \frac{1}{2} Km \ln b_\alpha^2 - \frac{1}{b_\beta} \sum_{k=1}^{K} \sum_{j=1}^{n} |\beta_j^k - \mu_\beta| - \frac{1}{2} Kn \ln b_\beta^2 + C.    (5)

If we keep the hyperparameters of the prior distributions fixed, then maximizing l' is similar to the popular least squares error minimization with \ell_2 regularization on U and V and \ell_1 regularization on \alpha and \beta. However, keeping the hyperparameters fixed may easily lead to overfitting because MRMA models have many parameters.

5 Learning MRMA Models

The optimization problem defined in Equation 5 is very likely to overfit if we cannot precisely estimate the hyperparameters, which automatically control the generalization capacity of the MRMA model. For instance, \sigma_U and \sigma_V control the regularization of U and V. 
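As a numerical illustration of the rating model in Eq. (1), the expected rating under MRMA is a weighted sum of the sub-models' predictions. The factors, weights, and ranks below are hypothetical toy values chosen so that each user concentrates its weight on one sub-model.

```python
# MRMA's expected rating: hat R_ij = sum_k alpha_i^k * beta_j^k * (U_i^k . V_j^k).
# Two toy sub-models (K = 2) with ranks 1 and 2; all values are hypothetical.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

ranks = [1, 2]
U = [[[2.0], [0.5]],                   # U[k][i]: rank-1 model, 2 users
     [[1.0, 1.0], [0.3, 0.2]]]        # rank-2 model
V = [[[2.0], [1.5]],                   # V[k][j]: rank-1 model, 2 items
     [[0.5, 0.5], [1.0, 2.0]]]
alpha = [[1.0, 0.0],                   # alpha[k][i]: user 0 -> rank-1 model,
         [0.0, 1.0]]                   #              user 1 -> rank-2 model
beta = [[1.0, 1.0],                    # beta[k][j]: items weigh both models
        [1.0, 1.0]]                    # equally in this toy example

def mrma_predict(i, j):
    return sum(alpha[k][i] * beta[k][j] * dot(U[k][i], V[k][j])
               for k in range(len(ranks)))

assert mrma_predict(0, 0) == 4.0       # user 0 uses only the rank-1 model: 2*2
```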
Therefore, it is more desirable to estimate the parameters and hyperparameters simultaneously during model training. One possible way is to estimate each variable by its maximum a posteriori (MAP) value conditioned on the remaining variables and then iterate until convergence, which is known as iterated conditional modes (ICM) [1].
The ICM procedure for maximizing Equation 5 is presented as follows.
Initialization: Choose initial values for all variables and parameters.
ICM Step: The values of U, V, \alpha and \beta are updated by solving the following minimization problems, each conditioned on the remaining variables and hyperparameters.

    \forall k \in \{1, ..., K\}, \forall i \in \{1, ..., m\}:
    U_i^k \leftarrow \arg\min_{U_i'} \Big\{ \frac{1}{2\sigma^2} \sum_{j=1}^{n} 1_{i,j} \Big[ \sum_{k=1}^{K} \alpha_i^k \beta_j^k \big(R_{i,j} - U_i^k (V_j^k)^T\big)^2 \Big] + \frac{1}{2\sigma_U^2} \sum_{k=1}^{K} (U_i^k)^2 \Big\},
    \alpha_i^k \leftarrow \arg\min_{\alpha_i'} \Big\{ \frac{1}{2\sigma^2} \sum_{j=1}^{n} 1_{i,j} \Big[ \sum_{k=1}^{K} \alpha_i^k \beta_j^k \big(R_{i,j} - U_i^k (V_j^k)^T\big)^2 \Big] + \frac{1}{b_\alpha} \sum_{k=1}^{K} |\alpha_i^k - \mu_\alpha| \Big\}.

    \forall k \in \{1, ..., K\}, \forall j \in \{1, ..., n\}:
    V_j^k \leftarrow \arg\min_{V_j'} \Big\{ \frac{1}{2\sigma^2} \sum_{i=1}^{m} 1_{i,j} \Big[ \sum_{k=1}^{K} \alpha_i^k \beta_j^k \big(R_{i,j} - U_i^k (V_j^k)^T\big)^2 \Big] + \frac{1}{2\sigma_V^2} \sum_{k=1}^{K} (V_j^k)^2 \Big\},
    \beta_j^k \leftarrow \arg\min_{\beta_j'} \Big\{ \frac{1}{2\sigma^2} \sum_{i=1}^{m} 1_{i,j} \Big[ \sum_{k=1}^{K} \alpha_i^k \beta_j^k \big(R_{i,j} - U_i^k (V_j^k)^T\big)^2 \Big] + \frac{1}{b_\beta} \sum_{k=1}^{K} |\beta_j^k - \mu_\beta| \Big\}.

The hyperparameters can be learned as their maximum likelihood estimates by setting the corresponding partial derivatives of l' to 0:

    \sigma^2 \leftarrow \sum_{i=1}^{m} \sum_{j=1}^{n} 1_{i,j} \Big[ \sum_{k=1}^{K} \alpha_i^k \beta_j^k \big(R_{i,j} - U_i^k (V_j^k)^T\big)^2 \Big] \Big/ \sum_{i=1}^{m} \sum_{j=1}^{n} 1_{i,j},
    \sigma_U^2 \leftarrow \sum_{k=1}^{K} \sum_{i=1}^{m} (U_i^k)^2 / Km, \quad \mu_\alpha \leftarrow \sum_{k=1}^{K} \sum_{i=1}^{m} \alpha_i^k / Km, \quad b_\alpha \leftarrow \sum_{k=1}^{K} \sum_{i=1}^{m} |\alpha_i^k - \mu_\alpha| / Km,
    \sigma_V^2 \leftarrow \sum_{k=1}^{K} \sum_{j=1}^{n} (V_j^k)^2 / Kn, \quad \mu_\beta \leftarrow \sum_{k=1}^{K} \sum_{j=1}^{n} \beta_j^k / Kn, \quad b_\beta \leftarrow \sum_{k=1}^{K} \sum_{j=1}^{n} |\beta_j^k - \mu_\beta| / Kn.

Repeat: until convergence or the maximum number of iterations is reached.
Note that ICM is sensitive to initial values. Our empirical studies show that setting the initial values of U^k and V^k by solving the classic PMF model achieves good performance. Regarding \alpha and \beta, a proper initial value is 1/\sqrt{K} (K denotes the number of sub-models in the mixture model). To improve generalization performance and enable online learning [7], we can update U, V, \alpha, \beta using stochastic gradient descent. Meanwhile, the \ell_1 norms in learning \alpha and \beta can be approximated by the smoothed \ell_1 method [18]. To deal with massive datasets, we can use the alternating least squares (ALS) method to learn the parameters of the proposed MRMA model, which is amenable to parallelization.

6 Experiments

This section presents the experimental results of the proposed MRMA method on three well-known datasets: 1) the MovieLens 1M dataset (~1 million ratings from 6,040 users on 3,706 movies); 2) the MovieLens 10M dataset (~10 million ratings from 69,878 users on 10,677 movies); 3) the Netflix Prize dataset (~100 million ratings from 480,189 users on 17,770 movies). 
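The alternating structure of the ICM loop in Section 5 can be sketched on a toy problem as follows. This is a simplified, hypothetical stand-in: a single sub-model with no mixture weights, fixed hyperparameters, and plain gradient steps in place of the exact conditional minimizers.

```python
# ICM-style alternation on a 2x2 toy rating matrix: update U with V fixed,
# then V with U fixed, repeating until the regularized objective stops improving.
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[4.0, 0.0],
              [0.0, 2.0]])
mask = R > 0                      # 1_ij: which entries are observed
k, lam, lr = 2, 0.1, 0.02         # rank, l2 strength, step size (toy choices)
U = rng.normal(size=(2, k))
V = rng.normal(size=(2, k))

def objective():
    err = mask * (R - U @ V.T)
    return 0.5 * (err ** 2).sum() + 0.5 * lam * ((U ** 2).sum() + (V ** 2).sum())

start = objective()
for _ in range(300):
    err = mask * (R - U @ V.T)
    U += lr * (err @ V - lam * U)     # conditional step on U, V held fixed
    err = mask * (R - U @ V.T)
    V += lr * (err.T @ U - lam * V)   # conditional step on V, U held fixed

assert objective() < start            # the alternating passes improve the fit
```

As in the paper's discussion, such coordinate-wise schemes reach a local optimum that depends on the initialization.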
For all accuracy comparisons, we randomly split each dataset into a training set and a test set by the ratio of 9:1. All results are reported by averaging over 5 different splits. The root mean square error (RMSE) is adopted to measure the rating prediction accuracy of different algorithms, and is computed as follows:

    D(\hat{R}) = \sqrt{ \sum_{i,j} 1_{i,j} (R_{i,j} - \hat{R}_{i,j})^2 \Big/ \sum_{i,j} 1_{i,j} },

where 1_{i,j} indicates that entry (i, j) appears in the test set. The normalized discounted cumulative gain (NDCG) is adopted to measure the item ranking accuracy of different algorithms, and is computed as NDCG@N = DCG@N / IDCG@N, where

    DCG@N = \sum_{i=1}^{N} (2^{rel_i} - 1) / \log_2(i + 1),

and IDCG is the DCG value with perfect ranking.
In ICM-based learning, we adopt \epsilon = 0.00001 as the convergence threshold and T = 300 as the maximum number of iterations. Considering efficiency, we only choose a subset of ranks in MRMA, e.g., {10, 20, 30, ..., 300} rather than {1, 2, 3, ..., 300}. 
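The two evaluation metrics defined above can be sketched directly from their formulas; the relevance grades and rankings used in the assertions are hypothetical.

```python
# RMSE over test entries and NDCG@N from graded relevances.
import math

def rmse(actual, predicted):
    # D(hat R) = sqrt( sum (R_ij - hat R_ij)^2 / #test entries )
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

def ndcg_at(rels, n):
    # rels: graded relevances listed in the order the algorithm ranked the items.
    # The definition indexes positions from 1, so the 0-based position i here
    # is discounted by log2(i + 2).
    def dcg(seq):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(seq[:n]))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

assert ndcg_at([3, 2, 1], 3) == 1.0    # predicted order equals the ideal order
assert ndcg_at([1, 2, 3], 3) < 1.0     # best item ranked last is penalized
```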
The parameters of all the compared algorithms are adopted from their original papers, because all of them were evaluated on the same datasets.
We compare the recommendation accuracy of MRMA with six matrix approximation-based collaborative filtering algorithms: 1) BPMF [16], which extends the PMF method from a Bayesian view and estimates model parameters using a Markov chain Monte Carlo scheme; 2) GSMF [20], which learns user/item features with group sparsity regularization in matrix approximation; 3) LLORMA [12], which ensembles the approximations from different submatrices using kernel smoothing; 4) WEMAREC [5], which ensembles different biased matrix approximation models to achieve higher accuracy; 5) MPMA [4], which combines local and global matrix approximations using a mixture model; 6) SMA [13], which yields a stable matrix approximation that can achieve good generalization performance.

Figure 2: Root mean square error comparison between MRMA and PMF with different ranks.

Figure 3: The accuracy and efficiency tradeoff of MRMA.

6.1 Mixture-Rank Matrix Approximation vs. Fixed-Rank Matrix Approximation

Given a fixed rank k, the corresponding rank-k model in MRMA is identical to probabilistic matrix factorization (PMF) [17]. In this experiment, we compare the recommendation accuracy of MRMA with ranks in {10, 20, 50, 100, 150, 200, 250, 300} against that of PMF with fixed ranks on the MovieLens 1M dataset. For PMF, we choose 0.01 as the learning rate, 0.01 as the user feature regularization coefficient, and 0.001 as the item feature regularization coefficient. The convergence condition is the same as for MRMA.
As shown in Figure 2, when the rank increases from 10 to 300, PMF achieves RMSEs between 0.86 and 0.88, whereas the RMSE of MRMA is about 0.84 when mixing all these ranks from 10 to 300. Meanwhile, the accuracy of PMF is not stable when k <= 100. 
For instance, PMF with k = 10 achieves better accuracy than k = 20 but worse accuracy than k = 50. This is because fixed-rank matrix approximation cannot be perfect for all users and items, so many users and items either underfit or overfit at a fixed rank less than 100. When k > 100, overfitting dominates, and PMF achieves consistently better accuracy as k increases because the regularization terms help improve generalization capacity. Nevertheless, PMF at every rank achieves lower accuracy than MRMA, because in MRMA individual users/items can give higher weights to the sub-models with the most suitable ranks and thus alleviate underfitting or overfitting.

6.2 Sensitivity of Rank in MRMA

In MRMA, the set of ranks decides the performance of the final model. However, it is neither efficient nor necessary to choose all the ranks in [1, 2, ..., K]. For instance, a rank-k approximation will be very similar to the rank-(k-1) and rank-(k+1) approximations, i.e., they may have overlapping structures. Therefore, a subset of ranks is sufficient. Figure 3 shows 5 different settings of rank combinations, in which set 1 = {10, 20, 30, ..., 300}, set 2 = {20, 40, ..., 300}, set 3 = {30, 60, ..., 300}, set 4 = {50, 100, ..., 300}, and set 5 = {100, 200, 300}. As shown in this figure, RMSE decreases when more ranks are adopted in MRMA, which is intuitive because more ranks help users/items choose the most appropriate components. However, the computation time also increases when more ranks are adopted in MRMA. 
If a tradeoff between accuracy and efficiency is required, then set 2 or set 3 is desirable, because they achieve only slightly worse accuracy at significantly lower computation overhead.
MRMA contains only three sub-models with different ranks in set 5 = {100, 200, 300}, but it still significantly outperforms PMF with ranks ranging from 10 to 300 in recommendation accuracy (as shown in Figure 2). This further confirms that MRMA can indeed discover the internal mixture-rank structure of the user-item rating matrix and thus achieve better recommendation accuracy due to better approximation.

Table 2: RMSE comparison between MRMA and six state-of-the-art matrix approximation-based collaborative filtering algorithms on MovieLens (10M) and Netflix datasets. Note that MRMA statistically significantly outperforms the other algorithms at the 95% confidence level.

                 MovieLens (10M)      Netflix
BPMF [16]        0.8197 +/- 0.0004    0.8421 +/- 0.0002
GSMF [20]        0.8012 +/- 0.0011    0.8420 +/- 0.0006
LLORMA [12]      0.7855 +/- 0.0002    0.8275 +/- 0.0004
WEMAREC [5]      0.7775 +/- 0.0007    0.8143 +/- 0.0001
MPMA [4]         0.7712 +/- 0.0002    0.8139 +/- 0.0003
SMA [13]         0.7682 +/- 0.0003    0.8036 +/- 0.0004
MRMA             0.7634 +/- 0.0009    0.7973 +/- 0.0002

Table 3: NDCG comparison between MRMA and six state-of-the-art matrix approximation-based collaborative filtering algorithms on MovieLens (1M) and MovieLens (10M) datasets. Note that MRMA statistically significantly outperforms the other algorithms at the 95% confidence level.

MovieLens (1M), NDCG@N:
                 N=1                  N=5                  N=10                 N=20
BPMF             0.6870 +/- 0.0024    0.6981 +/- 0.0029    0.7525 +/- 0.0009    0.8754 +/- 0.0008
GSMF             0.6909 +/- 0.0048    0.7031 +/- 0.0023    0.7555 +/- 0.0017    0.8769 +/- 0.0011
LLORMA           0.7025 +/- 0.0027    0.7101 +/- 0.0005    0.7626 +/- 0.0023    0.8811 +/- 0.0010
WEMAREC          0.7048 +/- 0.0015    0.7089 +/- 0.0016    0.7617 +/- 0.0041    0.8796 +/- 0.0005
MPMA             0.7020 +/- 0.0005    0.7114 +/- 0.0018    0.7606 +/- 0.0006    0.8805 +/- 0.0007
SMA              0.7042 +/- 0.0033    0.7109 +/- 0.0011    0.7607 +/- 0.0008    0.8801 +/- 0.0004
MRMA             0.7153 +/- 0.0027    0.7182 +/- 0.0005    0.7672 +/- 0.0013    0.8837 +/- 0.0004

MovieLens (10M), NDCG@N:
                 N=1                  N=5                  N=10                 N=20
BPMF             0.6563 +/- 0.0005    0.6845 +/- 0.0003    0.7467 +/- 0.0007    0.8691 +/- 0.0002
GSMF             0.6708 +/- 0.0012    0.6995 +/- 0.0008    0.7566 +/- 0.0017    0.8748 +/- 0.0004
LLORMA           0.6829 +/- 0.0014    0.7066 +/- 0.0005    0.7632 +/- 0.0004    0.8782 +/- 0.0012
WEMAREC          0.7013 +/- 0.0003    0.7176 +/- 0.0006    0.7703 +/- 0.0002    0.8824 +/- 0.0006
MPMA             0.6908 +/- 0.0006    0.7133 +/- 0.0002    0.7680 +/- 0.0001    0.8808 +/- 0.0004
SMA              0.7002 +/- 0.0006    0.7134 +/- 0.0004    0.7679 +/- 0.0003    0.8809 +/- 0.0002
MRMA             0.7048 +/- 0.0006    0.7219 +/- 0.0001    0.7743 +/- 0.0001    0.8846 +/- 0.0001

6.3 Accuracy Comparison

6.3.1 Rating Prediction Comparison

Table 2 compares the rating prediction accuracy between MRMA and six matrix approximation-based collaborative filtering algorithms on MovieLens (10M) and Netflix datasets. Note that among the compared algorithms, BPMF, GSMF, MPMA and SMA are stand-alone algorithms, while LLORMA and WEMAREC are ensemble algorithms. 
In this experiment, we adopt the set of ranks {10, 20, 50, 100, 150, 200, 250, 300} for efficiency reasons, which means that the accuracy of MRMA is not optimal. Even so, as shown in Table 2, MRMA statistically significantly outperforms all the other algorithms at the 95% confidence level. The reason is that MRMA can choose different rank values for different users/items, which achieves not only globally better approximation but also better approximation for individual users and items. This further confirms that mixture-rank structure indeed exists in user-item rating matrices in recommender systems. Thus, it is desirable to adopt mixture-rank matrix approximations rather than fixed-rank matrix approximations for recommendation tasks.

6.3.2 Item Ranking Comparison

Table 3 compares the NDCGs of MRMA with the other six state-of-the-art matrix approximation-based collaborative filtering algorithms on MovieLens (1M) and MovieLens (10M) datasets. Note that for each dataset, we keep 20 ratings per user in the test set and remove users with less than 5 ratings in the training set. As shown in the results, MRMA also achieves higher item ranking accuracy than the other compared algorithms, thanks to its capability of better capturing the internal mixture-rank structures of the user-item rating matrices. 
This experiment demonstrates that MRMA can not only provide accurate rating prediction but also achieve accurate item ranking for each user.

6.4 Interpretation of MRMA

Table 4: Top 10 movies with the largest \beta values for the sub-models with rank k = 20 and k = 200 in MRMA. Here, #ratings stands for the average number of ratings in the training set for the corresponding movies.

rank = 20                                     rank = 200
movie name                 beta               movie name                 beta
Smashing Time              0.6114             American Beauty            0.9219
Gate of Heavenly Peace     0.6101             Groundhog Day              0.9146
Man of the Century         0.6079             Fargo                      0.8779
Mamma Roma                 0.6071             Face/Off                   0.8693
Dry Cleaning               0.6071             2001: A Space Odyssey      0.8608
Dear Jesse                 0.6063             Shakespeare in Love        0.8553
Skipped Parts              0.6057             Saving Private Ryan        0.8480
The Hour of the Pig        0.6055             The Fugitive               0.8404
Inheritors                 0.6042             Braveheart                 0.8247
Dangerous Game             0.6034             Fight Club                 0.8153
#ratings (average)         2.4                #ratings (average)         1781.4

To better understand how users/items weigh the different sub-models in the mixture model of MRMA, we present the top 10 movies with the largest \beta values for the sub-models with rank = 20 and rank = 200, show their \beta values, and compare their average numbers of ratings in the training set in Table 4. Intuitively, movies with more ratings (e.g., over 1,000) should give higher weights to more complex models, and movies with fewer ratings (e.g., under 10) should give higher weights to simpler models in MRMA.
As shown in Table 4, the top 10 movies with the largest \beta values for the rank-20 sub-model have only 2.4 ratings on average in the training set. On the contrary, the top 10 movies with the largest \beta values for the rank-200 sub-model have 1781.4 ratings on average; these movies are very popular, and most of them are Oscar winners. 
This confirms our previous claim that MRMA can indeed weigh more complex models (e.g., rank = 200) higher for movies with more ratings to prevent underfitting, and weigh less complex models (e.g., rank = 20) higher for movies with fewer ratings to prevent overfitting. A similar phenomenon has also been observed for users with different \alpha values; we omit those results due to space limits.

7 Conclusion and Future Work

This paper proposes a mixture-rank matrix approximation (MRMA) method, which describes user-item ratings using a mixture of low-rank matrix approximation models with different ranks to achieve better approximation and thus better recommendation accuracy. An ICM-based learning algorithm is proposed to handle the non-convex optimization problem pertaining to MRMA. The experimental results on MovieLens and Netflix datasets demonstrate that MRMA can achieve better accuracy than six state-of-the-art matrix approximation-based collaborative filtering methods, further pushing the frontier of recommender systems. One possible extension of this work is to incorporate other inference methods into learning the MRMA model, e.g., variational inference [8], because ICM may be trapped in local maxima and therefore cannot reach the global maximum without properly chosen initial values.

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grant No. 61332008 and NSAF under Grant No. U1630115.

References
[1] J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, Series B (Methodological), pages 259-302, 1986.

[2] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111-119, 2012.

[3] E. J. Candès and T. Tao. 
The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[4] C. Chen, D. Li, Q. Lv, J. Yan, S. M. Chu, and L. Shang. MPMA: Mixture probabilistic matrix approximation for collaborative filtering. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI '16), pages 1382–1388, 2016.
[5] C. Chen, D. Li, Y. Zhao, Q. Lv, and L. Shang. WEMAREC: Accurate and scalable recommendation through weighted and ensemble matrix approximation. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15), pages 303–312, 2015.
[6] D. Dueck and B. Frey. Probabilistic sparse matrix factorization. University of Toronto Technical Report PSI-2004-23, 2004.
[7] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent, 2015. arXiv:1509.01240.
[8] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[9] Y. Koren. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pages 426–434. ACM, 2008.
[10] Y. Koren. Collaborative filtering with temporal dynamics. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09), pages 447–456. ACM, 2009.
[11] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8), 2009.
[12] J. Lee, S. Kim, G. Lebanon, and Y. Singer. Local low-rank matrix approximation.
In Proceedings of the 30th International Conference on Machine Learning (ICML '13), pages 82–90, 2013.
[13] D. Li, C. Chen, Q. Lv, J. Yan, L. Shang, and S. Chu. Low-rank matrix approximation with stability. In Proceedings of the 33rd International Conference on Machine Learning (ICML '16), pages 295–303, 2016.
[14] H. Ma, H. Yang, M. R. Lyu, and I. King. SoRec: Social recommendation using probabilistic matrix factorization. In Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM '08), pages 931–940. ACM, 2008.
[15] L. W. Mackey, M. I. Jordan, and A. Talwalkar. Divide-and-conquer matrix factorization. In Advances in Neural Information Processing Systems (NIPS '11), pages 1134–1142, 2011.
[16] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), pages 880–887. ACM, 2008.
[17] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems (NIPS '08), pages 1257–1264, 2008.
[18] M. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for L1 regularization: A comparative study and two new approaches. In European Conference on Machine Learning (ECML '07), pages 286–297. Springer, 2007.
[19] K.-C. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization, 6(15):615–640, 2010.
[20] T. Yuan, J. Cheng, X. Zhang, S. Qiu, and H. Lu. Recommendation by mining multiple user behaviors with group sparsity.
In Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI '14), pages 222–228, 2014.