{"title": "Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm", "book": "Advances in Neural Information Processing Systems", "page_first": 2056, "page_last": 2064, "abstract": "We show that matrix completion with trace-norm regularization can be significantly hurt when entries of the matrix are sampled non-uniformly, but that a properly weighted version of the trace-norm regularizer works well with non-uniform sampling. We show that the weighted trace-norm regularization indeed yields significant gains on the highly non-uniformly sampled Netflix dataset.", "full_text": "Collaborative Filtering in a Non-Uniform World:\n\nLearning with the Weighted Trace Norm\n\nBrain and Cognitive Sciences and CSAIL, MIT\n\nToyota Technological Institute at Chicago\n\nNathan Srebro\n\nRuslan Salakhutdinov\n\nCambridge, MA 02139\nrsalakhu@mit.edu\n\nChicago, Illinois 60637\n\nnati@ttic.edu\n\nAbstract\n\nWe show that matrix completion with trace-norm regularization can be signi\ufb01-\ncantly hurt when entries of the matrix are sampled non-uniformly, but that a prop-\nerly weighted version of the trace-norm regularizer works well with non-uniform\nsampling. We show that the weighted trace-norm regularization indeed yields sig-\nni\ufb01cant gains on the highly non-uniformly sampled Net\ufb02ix dataset.\n\n1 Introduction\n\nTrace-norm regularization is a popular approach for matrix completion and collaborative \ufb01ltering,\nmotivated both as a convex surrogate to the rank [7, 6] and in terms of a regularized in\ufb01nite factor\nmodel with connections to large-margin norm-regularized learning [17, 1, 15].\n\nCurrent theoretical guarantees on using the trace-norm for matrix completion assume a uniform\nsampling distribution over entries of the matrix [18, 6, 5, 13]. In a collaborative \ufb01ltering setting,\nwhere rows of the matrix represent e.g. users and columns represent e.g. 
movies, this corresponds to assuming all users are equally likely to rate movies and all movies are equally likely to be rated. This, of course, could not be further from the truth, as invariably some users are more active than others and some movies are rated by many people while others are rarely rated.\n\nIn this paper we show, both analytically and through simulations, that this is not a deficiency of the proof techniques used to establish the above guarantees. Indeed, a non-uniform sampling distribution can lead to a significant deterioration in prediction quality and an increase in the sample complexity. Under non-uniform sampling, as many as \u2126(n^{4/3}) samples might be needed for learning even a simple (e.g. orthogonal low-rank) n \u00d7 n matrix. This is in sharp contrast to the uniform sampling case, in which \u02dcO(n) samples are enough. It is important to note that if the rank could be minimized directly, which is in general not computationally tractable, \u02dcO(n) samples would be enough to learn a low-rank model even under an arbitrary non-uniform distribution.\n\nOur analysis further suggests a weighted correction to the trace-norm regularizer that takes into account the sampling distribution. Although appearing counter-intuitive at first, and indeed being the opposite of a previously suggested weighting [21], this weighting is well motivated by our analysis, and we discuss how it corrects the problems that the unweighted trace-norm has with non-uniform sampling. We show how the weighted trace-norm indeed yields a significant improvement on the highly non-uniformly sampled Netflix dataset.\n\nThe only other work we are aware of that studies matrix completion under non-uniform sampling is work on exact completion (i.e. when the matrix is assumed to be exactly low rank) under power-law sampling [12]. 
Other than being limited to one specific distribution, the requirement that the matrix be exactly low rank is central to that work, and the results cannot be directly applied in the presence of even small noise. Empirically, the approach leads to a deterioration in predictive performance on the Netflix data [12].\n\n2 Complexity Control in terms of Matrix Factorizations\n\nConsider the problem of predicting the entries of some unknown target matrix Y \u2208 R^{n\u00d7m} based on a random subset S of observed entries Y_S. For example, n and m may represent the number of users and the number of movies, and Y may represent a matrix of partially observed rating values. Predicting elements of Y can be done by finding a matrix X minimizing the training error, here measured as a squared error, and some measure c(X) of complexity. That is, minimizing either:\n\nmin_X ||X_S \u2212 Y_S||_F^2 + \u03bb c(X),   (1)\n\nor:\n\nmin_{c(X) \u2264 C} ||X_S \u2212 Y_S||_F^2,   (2)\n\nwhere Y_S, and similarly X_S, denotes the matrix \u201cmasked\u201d by S, i.e. (Y_S)_{ij} = Y_{ij} if (i, j) \u2208 S and 0 otherwise. For now we ignore possible repeated entries in S, and we also assume that n \u2264 m without loss of generality. The two formulations (1) and (2) are equivalent up to some (unknown) correspondence between \u03bb and C, and we will refer to them interchangeably.\n\nA basic measure of complexity is the rank of X, corresponding to the minimal dimensionality k such that X = U^\u22a4V for some U \u2208 R^{k\u00d7n} and V \u2208 R^{k\u00d7m}. Directly constraining the rank of X forms one of the most popular approaches to collaborative filtering. However, the rank is non-convex and hard to minimize. 
It is also not clear if a strict dimensionality constraint is the most appropriate way of measuring complexity.\n\nTrace-Norm Regularization\n\nLately, methods regularizing the norm of the factorization U^\u22a4V, rather than its dimensionality, have been advocated and were shown to enjoy considerable empirical success [14, 15]. This corresponds to measuring complexity in terms of the trace-norm of X, which can be defined equivalently either as the sum of the singular values of X, or as [7]:\n\n||X||_tr = min_{X=U^\u22a4V} (1/2)(||U||_F^2 + ||V||_F^2),   (3)\n\nwhere the dimensionality of U and V is not constrained. Beyond the modeling appeal of norm-based, rather than dimension-based, regularization, the trace-norm is a convex function of X and so can be minimized by either local search or more sophisticated convex optimization techniques.\n\nScaling\n\nThe rank, as a measure of complexity, does not scale with the size of the matrix. That is, even very large matrices can have low rank. Viewing the rank as a complexity measure corresponding to the number of underlying factors, if the data is explained by e.g. two factors, then no matter how many rows (\u201cusers\u201d) and columns (\u201cmovies\u201d) we consider, the data will still have rank two. The trace-norm, however, does scale with the size of the matrix. To see this, note that the trace-norm is the \u21131 norm of the spectrum, while the Frobenius norm is the \u21132 norm of the spectrum, yielding:\n\n||X||_F \u2264 ||X||_tr \u2264 ||X||_F \u221arank(X) \u2264 \u221an ||X||_F.   (4)\n\nThe Frobenius norm certainly increases with the size of the matrix, since the magnitude of each element does not decrease when we have more elements, and so the trace-norm will also increase. The above suggests measuring the trace-norm relative to the Frobenius norm. 
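The chain of inequalities in (4) is easy to check numerically. The following sketch (an illustration, not part of the paper) verifies it with numpy for a random low-rank matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 50, 80, 5  # n <= m, as assumed throughout

# Random rank-k matrix X = U^T V.
U = rng.standard_normal((k, n))
V = rng.standard_normal((k, m))
X = U.T @ V

fro = np.linalg.norm(X, "fro")  # l2 norm of the spectrum
tr = np.linalg.norm(X, "nuc")   # trace-norm: l1 norm of the spectrum
r = np.linalg.matrix_rank(X)

# ||X||_F <= ||X||_tr <= sqrt(rank(X)) ||X||_F <= sqrt(n) ||X||_F
assert fro <= tr <= np.sqrt(r) * fro <= np.sqrt(n) * fro
```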
Without loss of generality, consider each target entry to be of roughly unit magnitude, so that in order to fit Y each entry of X must also be of roughly unit magnitude. This suggests scaling the trace-norm by \u221a(nm). More specifically, we study the trace-norm through the complexity measure:\n\ntc(X) = ||X||_tr^2 / (nm),   (5)\n\nwhich puts the trace-norm on a comparable scale to the rank. In particular, when each entry of X is, on average, of unit magnitude (i.e. has unit variance) we have 1 \u2264 tc(X) \u2264 rank(X).\n\nThe relationship between tc(X) and the rank is tight for \u201corthogonal\u201d low-rank matrices, i.e. low-rank matrices X = U^\u22a4V where the rows of U and also the rows of V are orthogonal and of equal magnitudes. In order for the entries in X to have unit magnitude, i.e. ||X||_F^2 = nm, we must have that rows in U have norm \u221a(n/\u221ak) and rows in V have norm \u221a(m/\u221ak), yielding precisely tc(X) = rank(X). Such an orthogonal low-rank matrix can be obtained, e.g., when entries of U and V are zero-mean i.i.d. Gaussian with variance 1/\u221ak, corresponding to unit-variance entries in X.\n\nGeneralization Guarantees\n\nAnother place where we can see that tc(X) plays a role similar to rank(X) is in the generalization and sample complexity guarantees that can be obtained for low-rank and low-trace-norm learning. If there is a low-rank matrix X* achieving low average error relative to Y (e.g. if Y = X* + noise), then by minimizing the training error subject to a rank constraint (a computationally intractable task), |S| = \u02dcO(rank(X*)(n + m)) samples are enough in order to guarantee learning a matrix X whose overall average error is close to that of X* [16]. 
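As a quick numerical sanity check of the Gaussian construction described above (an illustration, not part of the paper), i.i.d. Gaussian factors with variance 1/\u221ak give approximately unit-variance entries in X and tc(X) close to rank(X) = k:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 500, 600, 5

# Entries of U, V are zero-mean i.i.d. Gaussian with variance 1/sqrt(k),
# i.e. standard deviation k**(-1/4), so entries of X have unit variance.
U = rng.normal(scale=k ** -0.25, size=(k, n))
V = rng.normal(scale=k ** -0.25, size=(k, m))
X = U.T @ V

tc = np.linalg.norm(X, "nuc") ** 2 / (n * m)
print(round(float(X.var()), 2), round(float(tc), 2))  # variance near 1, tc near k
```

For finite n, m the rows of random Gaussian factors are only approximately orthogonal, so tc comes out slightly below k rather than exactly equal.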
Similarly, if there is a low-trace-norm matrix X* achieving low average error, then minimizing the training error and the trace-norm (a convex optimization problem), |S| = \u02dcO(tc(X*)(n + m)) samples are enough in order to guarantee learning a matrix X whose overall average error is close to that of X* [18]. In these bounds tc(X) plays precisely the same role as the rank, up to logarithmic factors.\n\nIn order to get some intuitive understanding of the low-rank learning guarantees, it is enough to consider the number of parameters in the rank-k factorization X = U^\u22a4V. It is easy to see that the number of parameters in the factorization is roughly k(m + n) (perhaps a bit less due to rotational invariance). We would therefore expect to be able to learn X when we have roughly this many samples, as is indeed confirmed by the rigorous sample complexity bounds.\n\nFor low-trace-norm learning, consider a sample S of size |S| \u2264 Cn, for some constant C. Taking entries of Y to be of unit magnitude, we have ||Y_S||_F = \u221a|S| \u2264 \u221a(Cn) (recall that Y_S is defined to be zero outside S). From (4) we therefore have ||Y_S||_tr \u2264 \u221a(Cn) \u00b7 \u221an = \u221aC n, and so tc(Y_S) \u2264 C. That is, we can \u201cshatter\u201d any sample of size |S| \u2264 Cn with tc(X) = C: no matter what the underlying matrix Y is, we can always perfectly fit the training data with a low-trace-norm matrix X s.t. tc(X) \u2264 C, without generalizing at all outside S. On the other hand, we must allow matrices with tc(X) = tc(X*), since otherwise we cannot hope to find X*, and so we can only constrain tc(X) \u2264 C = tc(X*). We therefore cannot expect to learn with fewer than n\u00b7tc(X*) samples. It turns out that this is essentially the largest random sample that can be shattered with tc(X) \u2264 C = tc(X*). 
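This shattering calculation can be reproduced directly (an illustrative sketch, not part of the paper): the masked matrix Y_S itself fits any sample of size |S| = Cn perfectly while having tc(Y_S) at most C, whatever the underlying Y is.

```python
import numpy as np

rng = np.random.default_rng(2)
n, C = 200, 3
S_size = C * n

# Arbitrary unit-magnitude target and a uniform random sample S of Cn entries.
Y = rng.choice([-1.0, 1.0], size=(n, n))
mask = np.zeros(n * n)
mask[rng.choice(n * n, size=S_size, replace=False)] = 1.0
Y_S = Y * mask.reshape(n, n)  # zero outside S

tc_YS = np.linalg.norm(Y_S, "nuc") ** 2 / n ** 2
assert tc_YS <= C  # Y_S "shatters" the sample: zero training error with tc <= C
```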
If we have more than this many samples we can start learning.\n\n3 Trace-Norm Under a Non-Uniform Distribution\n\nIn this section, we analyze trace-norm regularized learning when the sampling distribution is not uniform. That is, when there is some, known or unknown, non-uniform distribution D over entries of the matrix Y (i.e. over index pairs (i, j)) and our sample S is sampled i.i.d. from D. Our objective is to get low average error with respect to the distribution D. That is, we measure generalization performance in terms of the weighted sum-squared-error:\n\n||X \u2212 Y||_D^2 = E_{(i,j)\u223cD}[(X_{ij} \u2212 Y_{ij})^2] = \u03a3_{ij} D(i, j)(X_{ij} \u2212 Y_{ij})^2.   (6)\n\nWe first point out that when using the rank for complexity control, i.e. when minimizing the training error subject to a low-rank constraint, non-uniformity does not pose a problem. The same generalization and learning guarantees that can be obtained in the uniform case also hold under an arbitrary distribution D. In particular, if there is some low-rank X* such that ||X* \u2212 Y||_D^2 is small, then \u02dcO(rank(X*)(n + m)) samples are enough in order to learn (by minimizing training error subject to a rank constraint) a matrix X with ||X \u2212 Y||_D^2 almost as small as ||X* \u2212 Y||_D^2 [16].\n\nHowever, the same does not hold when learning using the trace-norm. To see this, consider an orthogonal rank-k square n\u00d7n matrix, and a sampling distribution which is uniform over an n_A \u00d7 n_A sub-matrix A, with n_A = n^a. That is, the row (e.g. \u201cuser\u201d) is selected uniformly among the first n_A rows, and the column (e.g. \u201cmovie\u201d) is selected uniformly among the first n_A columns. We will use A to denote the subset of entries in the submatrix, i.e. A = {(i, j) | 1 \u2264 i, j \u2264 n_A}. 
For any sample S, we have:\n\ntc(Y_S) = ||Y_S||_tr^2 / n^2 \u2264 ||Y_S||_F^2 rank(Y_S) / n^2 \u2264 |S| n^a / n^2 = |S| / n^{2\u2212a},   (7)\n\nwhere we again take the entries in Y to be of unit magnitude. In the second inequality above we use the fact that Y_S is zero outside of A, and so we can bound the rank of Y_S by the dimensionality n_A = n^a of A.\n\nSetting a < 1, we see that we can shatter any sample of size k n^{2\u2212a} = \u02dc\u03c9(n) with a matrix X for which tc(X) \u2264 k. [...] any matrix X can be factorized as X = U^\u22a4V once k \u2265 min(n, m) [2]. However, in practice using very large values of k becomes computationally expensive. Instead, we consider truncated trace-norm minimization by restricting k to smaller values. In the next section we demonstrate that even when using the truncated trace-norm, its weighted version significantly improves the model's prediction performance.\n\nIn our experiments, we also replace the unknown row marginals p(i) and column marginals q(j) by their empirical estimates \u02c6p(i) = n_i/|S| and \u02c6q(j) = m_j/|S|. This results in the following objective:\n\n\u03a3_{(i,j)\u2208S} [ (Y_{ij} \u2212 U_i^\u22a4 V_j)^2 + \u03bb/(2|S|) ( n_i^{\u03b1\u22121} ||U_i||^2 + m_j^{\u03b1\u22121} ||V_j||^2 ) ].   (13)\n\nSetting \u03b1 = 1, corresponding to the weighted trace-norm (10), results in stochastic gradient updates that do not involve the row and column counts at all, and are in some sense the simplest. Strangely, and likely originating as a \u201cbug\u201d in calculating the stochastic gradients by one of the participants, these steps match the stochastic training used by many practitioners on the Netflix dataset, without explicitly considering the weighted trace-norm [8, 19, 15].\n\n6 Experimental results\n\nWe tested the weighted trace-norm on the Netflix dataset, which is the largest publicly available collaborative filtering dataset. 
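The stochastic gradient training on objective (13) can be sketched as follows (illustrative numpy code with hypothetical names, not the authors' implementation). Note that for \u03b1 = 1 the count factors n_i^{\u03b1\u22121} and m_j^{\u03b1\u22121} are exactly 1 and drop out of the updates, as remarked above:

```python
import numpy as np

def sgd_weighted_trace_norm(ratings, n, m, k=30, lam=0.05, alpha=1.0,
                            eta=0.05, epochs=10, seed=0):
    """SGD on objective (13). `ratings` is a list of (i, j, y) triples."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    S = len(ratings)
    n_i = np.bincount([i for i, _, _ in ratings], minlength=n)  # row counts
    m_j = np.bincount([j for _, j, _ in ratings], minlength=m)  # column counts
    for _ in range(epochs):
        for i, j, y in ratings:
            err = y - U[i] @ V[j]
            u, v = U[i].copy(), V[j].copy()
            # Gradient steps on the per-entry loss plus weighted regularizer;
            # for alpha = 1 the count factors below are exactly 1.
            U[i] += eta * (2 * err * v - (lam / S) * n_i[i] ** (alpha - 1) * u)
            V[j] += eta * (2 * err * u - (lam / S) * m_j[j] ** (alpha - 1) * v)
    return U, V
```

Setting alpha=0 in this sketch corresponds to the unweighted trace-norm, while alpha=1 gives the weighted trace-norm whose updates involve no counts at all.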
The training set contains 100,480,507 ratings from 480,189 anonymous users on 17,770 movie titles. Netflix also provides a qualification set containing 1,408,395 ratings, out of which we set aside 100,000 ratings for validation. The \u201cqualification set\u201d pairs were selected by Netflix from the most recent ratings of a subset of the users. Due to this special selection scheme, ratings from users with few ratings are over-represented in the qualification set, relative to the training set. To be able to report results where the train and test sampling distributions are the same, we also created a \u201ctest set\u201d by randomly selecting and removing 100,000 ratings from the training set. All ratings were normalized to be zero-mean by subtracting 3.6. The dataset is very imbalanced: it includes users with over 10,000 ratings as well as users who rated fewer than 5 movies.\n\nFor various values of \u03b1, we learned a factorization U^\u22a4V with k = 30 and with k = 100 dimensions (factors) using stochastic gradient descent as in (13). For each value of \u03b1 and k we selected the regularization tradeoff \u03bb by minimizing the error on the 100,000 qualification-set examples set aside for validation. Results on both the Netflix qualification set and on the test set we created are reported in Table 1. Recall that the sampling distribution of the \u201ctest set\u201d matches that of the training data, while the qualification set is sampled differently, explaining the large difference in generalization between the two.\n\nTable 1: Root Mean Squared Error (RMSE) on the Netflix qualification set and on a test set that was held out from the training data, for training by minimizing (13). We report the \u03bb/|S| minimizing the error on the validation set (held out from the qualification set), qualification and test errors using this tradeoff, and tc_\u03b1(X) at the optimum. 
Last row: training by regularizing the max-norm.\n\nk = 30:\n\u03b1         \u03bb/|S|   tc_\u03b1(X)        Qual     Test\n1         0.05    4.34           0.9105   0.7607\n0.9       0.07    4.27           0.9091   0.7573\n0.75      0.2     5.04           0.9128   0.7723\n0.5       0.5     7.32           0.9159   0.7823\n0         2.5     10.36          0.9235   0.7889\n\u22121        450     11.41          0.9256   0.7913\n||X||_max         mc(X) = 5.06   0.9131   0.7692\n\nk = 100:\n\u03b1         \u03bb/|S|   tc_\u03b1(X)        Qual     Test\n1         0.08    5.47           0.9071   0.7412\n0.9       0.1     5.23           0.9062   0.7389\n0.75      0.3     6.24           0.9098   0.7491\n0.5       0.8     9.65           0.9127   0.7613\n0         3.0     21.23          0.9203   0.7667\n\u22121        700     23.31          0.9221   0.7713\n||X||_max         mc(X) = 5.77   0.9092   0.7432\n\nFor both k = 30 and k = 100, the weighted trace-norm (\u03b1 = 1) significantly outperformed the unweighted trace-norm (\u03b1 = 0). Interestingly, the optimal weighting (setting of \u03b1) was a bit lower than, but very close to, \u03b1 = 1. For completeness, we also evaluated the weighting suggested by Weimer et al. [21], corresponding to \u03b1 = \u22121. Unsurprisingly, given our analysis, this seemingly intuitive weighting hurts predictive performance.\n\nFor both k = 30 and k = 100, we also observed that for the weighted trace-norm (\u03b1 = 1) good generalization is possible with a wide range of \u03bb settings, while for the unweighted trace-norm (\u03b1 = 0) the results were much more sensitive to the setting of \u03bb. This confirms our previous results on the synthetic experiment and strongly suggests that it may be far easier to search for regularization parameters using the weighted trace-norm.\n\nComparison with the Max-Norm\n\nWe also compared the predictive performance on Netflix to predictions based on max-norm regularization. 
The max-norm is defined as:\n\n||X||_max = min_{X=U^\u22a4V} (1/2)(max_i ||U_i||^2 + max_j ||V_j||^2).   (14)\n\nSimilarly to the rank, but unlike the trace-norm, generalization and learning guarantees based on the max-norm hold also under an arbitrary, non-uniform, sampling distribution. Specifically, defining mc(X) = ||X||_max^2 (no normalization is necessary here), \u02dcO(mc(X)(n + m)) samples are enough for generalization w.r.t. any sampling distribution (just like the rank) [18]. This suggests that perhaps the max-norm can be used as an alternative factorization regularizer in the presence of non-uniform sampling. Indeed, as evident in Table 1, max-norm based regularization does perform much better than the unweighted trace-norm. The differences between the max-norm and the weighted trace-norm are small, but it seems that using the weighted trace-norm is slightly but consistently better.\n\n7 Summary\n\nIn this paper we showed, both analytically and empirically, that under non-uniform sampling, trace-norm regularization can lead to significant performance deterioration and an increase in sample complexity. Our analysis suggests a non-intuitive weighting for the trace-norm in order to correct the problem. Our results on both synthetic data and on the highly imbalanced Netflix dataset further demonstrate that the weighted trace-norm yields significant improvements in prediction quality.\n\nIn terms of optimization, we focused on stochastic gradient descent, both since it is a simple and practical method for very large-scale trace-norm optimization [15, 8], and since the weighting was originally stumbled upon through this optimization approach. However, most recently proposed methods for trace-norm optimization (e.g. 
[3, 10, 9, 11, 20]) can also be easily modified for the weighted trace-norm.\n\nWe hope that the weighted trace-norm, and the discussions in Sections 3 and 4, will be helpful in deriving theoretical learning guarantees for arbitrary non-uniform sampling distributions, both in the form of generalization error bounds as in [18], and generalizing the compressed-sensing inspired work on recovery of noisy low-rank matrices as in [4, 13].\n\nAcknowledgments\n\nRS is supported by NSERC, Shell, and NTT Communication Sciences Laboratory.\n\nReferences\n\n[1] J. Abernethy, F. Bach, T. Evgeniou, and J.P. Vert. A new approach to collaborative filtering: Operator estimation with spectral regularization. Journal of Machine Learning Research, 10:803\u2013826, 2009.\n\n[2] S. Burer and R.D.C. Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427\u2013444, 2005.\n\n[3] J.F. Cai, E.J. Cand\u00e8s, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20:1956, 2010.\n\n[4] E.J. Cand\u00e8s and Y. Plan. Matrix completion with noise. Proceedings of the IEEE (to appear), 2009.\n\n[5] E.J. Cand\u00e8s and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9, 2009.\n\n[6] E.J. Cand\u00e8s and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory (to appear), 2009.\n\n[7] M. Fazel, H. Hindi, and S.P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings American Control Conference, volume 6, 2001.\n\n[8] Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In ACM SIGKDD, pages 426\u2013434, 2008.\n\n[9] Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. 
SIAM Journal on Matrix Analysis and Applications, 31(3):1235\u20131256, 2009.\n\n[10] S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, pages 1\u201333, 2009.\n\n[11] R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287\u20132322, 2010.\n\n[12] R. Meka, P. Jain, and I. S. Dhillon. Matrix completion from power-law distributed samples. In Advances in Neural Information Processing Systems, volume 21, 2009.\n\n[13] B. Recht. A simpler approach to matrix completion. Preprint, available from the author\u2019s webpage, 2009.\n\n[14] J.D.M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, page 719, 2005.\n\n[15] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, volume 20, 2008.\n\n[16] N. Srebro, N. Alon, and T. Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In Advances in Neural Information Processing Systems 17, 2005.\n\n[17] N. Srebro, J. Rennie, and T. Jaakkola. Maximum margin matrix factorization. In Advances in Neural Information Processing Systems 17, 2005.\n\n[18] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In COLT, 2005.\n\n[19] G\u00e1bor Tak\u00e1cs, Istv\u00e1n Pil\u00e1szy, Botty\u00e1n N\u00e9meth, and Domonkos Tikk. Scalable collaborative filtering approaches for large recommender systems. Journal of Machine Learning Research, 10:623\u2013656, 2009.\n\n[20] R. Tomioka, T. Suzuki, M. Sugiyama, and H. Kashima. A fast augmented Lagrangian algorithm for learning low-rank matrices. In ICML, pages 1087\u20131094, 2010.\n\n[21] M. Weimer, A. Karatzoglou, and A. Smola. 
Improving maximum margin matrix factorization. Machine Learning, 72(3):263\u2013276, 2008.\n", "award": [], "sourceid": 779, "authors": [{"given_name": "Nathan", "family_name": "Srebro", "institution": null}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": null}]}