{"title": "Learning with the weighted trace-norm under arbitrary sampling distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 2133, "page_last": 2141, "abstract": "We provide rigorous guarantees on learning with the weighted trace-norm under arbitrary sampling distributions. We show that the standard weighted-trace norm might fail when the sampling distribution is not a product distribution (i.e. when row and column indexes are not selected independently), present a corrected variant for which we establish strong learning guarantees, and demonstrate that it works better in practice. We provide guarantees when weighting by either the true or empirical sampling distribution, and suggest that even if the true distribution is known (or is uniform), weighting by the empirical distribution may be beneficial.", "full_text": "Learning with the Weighted Trace-norm under\n\nArbitrary Sampling Distributions\n\nRina Foygel\n\nDepartment of Statistics\nUniversity of Chicago\n\nrina@uchicago.edu\n\nRuslan Salakhutdinov\nDepartment of Statistics\nUniversity of Toronto\n\nrsalakhu@ustat.toronto.edu\n\nOhad Shamir\n\nMicrosoft Research New England\n\nohadsh@microsoft.com\n\nNathan Srebro\n\nToyota Technological Institute at Chicago\n\nnati@ttic.edu\n\nAbstract\n\nWe provide rigorous guarantees on learning with the weighted trace-norm under\narbitrary sampling distributions. We show that the standard weighted-trace norm\nmight fail when the sampling distribution is not a product distribution (i.e. when\nrow and column indexes are not selected independently), present a corrected vari-\nant for which we establish strong learning guarantees, and demonstrate that it\nworks better in practice. 
We provide guarantees when weighting by either the true or empirical sampling distribution, and suggest that even if the true distribution is known (or is uniform), weighting by the empirical distribution may be beneficial.\n\n1 Introduction\n\nOne of the most common approaches to collaborative filtering and matrix completion is trace-norm regularization [1, 2, 3, 4, 5]. In this approach we attempt to complete an unknown matrix, based on a small subset of revealed entries, by finding a matrix with small trace-norm which matches those entries as well as possible.\n\nThis approach has repeatedly shown good performance in practice, and is theoretically well understood for the case where revealed entries are sampled uniformly [6, 7, 8, 9, 10, 11]. Under such uniform sampling, $\Theta(n \log(n))$ entries are sufficient for good completion of an $n \times n$ matrix, i.e. a nearly constant number of entries per row. However, for arbitrary sampling distributions, the worst-case sample complexity lies between a lower bound of $\Omega(n^{4/3})$ [12] and an upper bound of $O(n^{3/2})$ [13], i.e. requiring between $n^{1/3}$ and $n^{1/2}$ observations per row, indicating that the unweighted trace-norm is not appropriate for matrix completion in this setting.\n\nMotivated by these issues, Salakhutdinov and Srebro [12] proposed to use a weighted variant of the trace-norm, which takes the distribution of the entries into account, and showed experimentally that this variant indeed leads to superior performance. However, although this recent paper established that the weighted trace-norm corrects a specific situation where the standard trace-norm fails, no general learning guarantees are provided, and it is not clear if the weighted trace-norm always leads to the desired behavior. 
The only theoretical analysis of the weighted trace-norm that we are aware of is a recent report by Negahban and Wainwright [10], which provides reconstruction guarantees for a low-rank matrix with i.i.d. noise, but only when the sampling distribution is a product distribution, i.e. the row index and column index of observed entries are selected independently. A product distribution assumption does not seem realistic in many cases: e.g. for the Netflix data, it would indicate that all users have the same (conditional) distribution over which movies they rate.\n\n\fIn this paper we rigorously study learning with a weighted trace-norm under an arbitrary sampling distribution, and show that this situation is indeed more complicated, requiring a correction to the weighting. We show that this correction is necessary, and present empirical results on the Netflix and MovieLens datasets indicating that it is also helpful in practice. We also rigorously consider weighting according to either the true sampling distribution (as in [10]) or the empirical frequencies, as is actually done in practice, and present evidence that weighting by the empirical frequencies might be advantageous. Our setting is also more general than that of [10]: we consider an arbitrary loss and do not rely on i.i.d. noise, instead presenting results in an agnostic learning framework.\n\nSetup and Notation. We consider an arbitrary unknown $n \times m$ target matrix $Y$, where a subset of entries $\{Y_{i_t,j_t}\}_{t=1}^s$ indexed by $S = \{(i_1, j_1), \ldots, (i_s, j_s)\}$ is revealed to us. Without loss of generality, we assume $n \geq m$. Throughout most of the paper, we assume $S$ is drawn i.i.d. according to some sampling distribution $p(i, j)$ (with replacement). 
Based on this subset of entries, we would like to fill in the missing entries and obtain a prediction matrix $\hat{X}_S \in \mathbb{R}^{n \times m}$ with low expected loss $L_p(\hat{X}_S) = \mathbb{E}_{ij \sim p}[\ell((\hat{X}_S)_{ij}, Y_{ij})]$, where $\ell(x, y)$ is some loss function. Note that we measure the loss with respect to the same distribution $p(i, j)$ from which the training set is drawn (this is also the case in [12, 10, 13]).\n\nThe trace-norm of a matrix $X \in \mathbb{R}^{n \times m}$, written $\|X\|_{\mathrm{tr}}$, is defined as the sum of its singular values. Given some distribution $p(i, j)$ on $[n] \times [m]$, the weighted trace-norm of $X$ is given by [12]\n\n\[ \|X\|_{\mathrm{tr}(p_r, p_c)} = \left\| \mathrm{diag}(p_r)^{1/2} \cdot X \cdot \mathrm{diag}(p_c)^{1/2} \right\|_{\mathrm{tr}}, \]\n\nwhere $p_r \in \mathbb{R}^n$ and $p_c \in \mathbb{R}^m$ denote vectors of the row- and column-marginals respectively. Note that the weighted trace-norm only depends on these marginals (but not their joint distribution), and that if $p_r$ and $p_c$ are uniform, then $\|X\|_{\mathrm{tr}(p_r, p_c)} = \frac{1}{\sqrt{nm}} \|X\|_{\mathrm{tr}}$. The weighted trace-norm does not generally scale with $n$ and $m$, and in particular, if $X$ has rank $r$ and entries bounded in $[-1, 1]$, then $\|X\|_{\mathrm{tr}(p_r, p_c)} \leq \sqrt{r}$ regardless of which $p(i, j)$ is used. This motivates us to define the class\n\n\[ \mathcal{W}_r[p] = \left\{ X \in \mathbb{R}^{n \times m} : \|X\|_{\mathrm{tr}(p_r, p_c)} \leq \sqrt{r} \right\}, \]\n\nalthough we emphasize that our results do not directly depend on the rank, and $\mathcal{W}_r[p]$ certainly includes full-rank matrices. We analyze here estimators of the form $\hat{X}_S = \arg\min\{\hat{L}_S(X) : X \in \mathcal{W}_r[p]\}$, where $\hat{L}_S(X) = \frac{1}{s} \sum_{t=1}^s \ell(X_{i_t,j_t}, Y_{i_t,j_t})$ is the empirical error on the observed entries.\n\nAlthough we focus mostly on the standard inductive setting, where the samples are drawn i.i.d. and the guarantee is on generalization for future samples drawn from the same distribution, our results can also be stated in a transductive model, where a training set and a test set are created by splitting a fixed subset of entries uniformly at random (as in [13]). The transductive setting is discussed, and transductive variants of our Theorems are given, in Section 4.2 and in the Supplementary Materials.\n\n2 Learning with the Standard Weighting\n\nIn this Section, we consider learning using the weighted trace-norm as suggested by Salakhutdinov and Srebro [12], i.e. when the weighting is according to the sampling distribution $p(i, j)$. Following the approach of [6] and [11], we base our results on bounding the Rademacher complexity of $\mathcal{W}_r[p]$, as a class of functions mapping index pairs to entry values. However, we modify the analysis for the weighted trace-norm with non-uniform sampling.\n\nFor a class of matrices $\mathcal{X}$ and a sample $S = \{(i_1, j_1), \ldots, (i_s, j_s)\}$ of indexes in $[n] \times [m]$, the empirical Rademacher complexity of the class (with respect to $S$) is given by\n\n\[ \hat{R}_S(\mathcal{X}) = \mathbb{E}_{\sigma \sim \{\pm 1\}^s}\left[ \sup_{X \in \mathcal{X}} \frac{2}{s} \sum_{t=1}^s \sigma_t X_{i_t j_t} \right], \]\n\nwhere $\sigma$ is a vector of signs drawn uniformly at random. Intuitively, $\hat{R}_S(\mathcal{X})$ measures the extent to which the class $\mathcal{X}$ can "overfit" data, by finding a matrix $X$ which correlates as strongly as possible with a sample from a matrix of random noise. 
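To make the definitions above concrete, here is a small numerical sketch (our own illustration, not from the paper; the function name and the Dirichlet-random marginals are ours). It computes the weighted trace-norm via an SVD and checks the $\sqrt{r}$ bound stated above for a rank-$r$ matrix with entries in $[-1, 1]$:

```python
import numpy as np

def weighted_trace_norm(X, p_row, p_col):
    """||diag(p_r)^(1/2) . X . diag(p_c)^(1/2)||_tr : sum of singular values."""
    W = np.sqrt(p_row)[:, None] * X * np.sqrt(p_col)[None, :]
    return np.linalg.svd(W, compute_uv=False).sum()

rng = np.random.default_rng(0)
n, m, r = 40, 30, 3
# Rank-r matrix with entries in [-1, 1]: average of r rank-1 sign matrices.
U = rng.choice([-1.0, 1.0], size=(n, r))
V = rng.choice([-1.0, 1.0], size=(m, r))
X = (U @ V.T) / r
# Arbitrary (non-uniform, non-product) marginals.
p_row = rng.dirichlet(np.ones(n))
p_col = rng.dirichlet(np.ones(m))
# ||X||_tr(pr,pc) <= sqrt(r) regardless of the marginals.
assert weighted_trace_norm(X, p_row, p_col) <= np.sqrt(r) + 1e-8
```

The bound holds because $\|A\|_{\mathrm{tr}} \leq \sqrt{\mathrm{rank}(A)}\,\|A\|_F$ and the weighted Frobenius norm of a $[-1,1]$-bounded matrix is at most 1; with uniform marginals the function reduces to $\frac{1}{\sqrt{nm}}\|X\|_{\mathrm{tr}}$.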
For a loss $\ell(x, y)$ that is Lipschitz in $x$, the Rademacher complexity can be used to uniformly bound the deviations $|L_p(X) - \hat{L}_S(X)|$ for all $X \in \mathcal{X}$, yielding a learning guarantee on the empirical risk minimizer [14].\n\n\f2.1 Guarantees for Special Sampling Distributions\n\nWe begin by providing guarantees for an arbitrary, possibly unbounded, Lipschitz loss $\ell(x, y)$, but only under sampling distributions which are either product distributions (i.e. $p(i, j) = p_r(i) p_c(j)$) or have uniform marginals (i.e. $p_r$ and $p_c$ are uniform, but perhaps the rows and columns are not independent). In Section 2.3 below, we will see why this severe restriction on $p$ is needed.\n\nTheorem 1. For an $l$-Lipschitz loss $\ell$, fix any matrix $Y$, sample size $s$, and distribution $p$, such that $p$ is either a product distribution or has uniform marginals. Let $\hat{X}_S = \arg\min\{\hat{L}_S(X) : X \in \mathcal{W}_r[p]\}$. Then, in expectation over the training sample $S$ drawn i.i.d. from the distribution $p$,\n\n\[ L_p(\hat{X}_S) \leq \inf_{X \in \mathcal{W}_r[p]} L_p(X) + O\left( l \cdot \sqrt{\frac{r n \log(n)}{s}} \right). \quad (1) \]\n\nHere and elsewhere we state learning guarantees in expectation for simplicity. Since the guarantees are obtained by bounding the Rademacher complexity, one can also immediately obtain high-probability guarantees, with logarithmic dependence on the confidence parameter, via standard techniques (e.g. [14]).\n\nProof. We will show how to bound the expected Rademacher complexity $\mathbb{E}_S[\hat{R}_S(\mathcal{W}_r[p])]$, from which the desired result follows using standard arguments (Theorem 8 of [14]^1). Following [11] by including the weights, and using the duality between the spectral norm $\|\cdot\|_{\mathrm{sp}}$ and the trace-norm, we compute:\n\n\[ \mathbb{E}_{S,\sigma}\big[\hat{R}_S(\mathcal{W}_r[p])\big] = \frac{\sqrt{r}}{s}\, \mathbb{E}_{S,\sigma}\left[ \left\| \sum_{t=1}^s \frac{\sigma_t}{\sqrt{p_r(i_t) p_c(j_t)}}\, e_{i_t,j_t} \right\|_{\mathrm{sp}} \right] = \frac{\sqrt{r}}{s}\, \mathbb{E}_{S,\sigma}\left[ \left\| \sum_{t=1}^s Q_t \right\|_{\mathrm{sp}} \right], \]\n\nwhere $e_{i,j} = e_i e_j^T$ and $Q_t = \frac{\sigma_t}{\sqrt{p_r(i_t) p_c(j_t)}} e_{i_t,j_t} \in \mathbb{R}^{n \times m}$. Since the $Q_t$'s are i.i.d. zero-mean matrices, Theorem 6.1 of [15], combined with Remarks 6.4 and 6.5 there, establishes that $\mathbb{E}_{S,\sigma}[\|\sum_{t=1}^s Q_t\|_{\mathrm{sp}}] = O(\rho \sqrt{\log(n)} + R \log(n))$, where $R$ and $\rho$ are defined to satisfy $\|Q_t\|_{\mathrm{sp}} \leq R$ (almost surely) and $\rho^2 = \max\{\|\sum_t \mathbb{E}[Q_t^T Q_t]\|_{\mathrm{sp}}, \|\sum_t \mathbb{E}[Q_t Q_t^T]\|_{\mathrm{sp}}\}$. Calculating these bounds (see Supplementary Material), we get\n\n\[ R \leq \sqrt{\frac{nm}{\min_{i,j}\{n p_r(i) \cdot m p_c(j)\}}}, \quad \text{and} \quad \rho \leq \sqrt{s \max\left\{ \max_i \sum_j \frac{p(i,j)}{p_r(i) p_c(j)},\; \max_j \sum_i \frac{p(i,j)}{p_r(i) p_c(j)} \right\}}. \]\n\nIf $p$ has uniform row- and column-marginals, then for all $i, j$, $n p_r(i) = m p_c(j) = 1$. This yields $\mathbb{E}_S[\hat{R}(\mathcal{W}_r[p])] \leq O\left(\sqrt{\frac{r n \log(n)}{s}}\right)$, as desired. (Here we assume $s > n \log(n)$, since otherwise we need only establish that excess error is $O(l\sqrt{r})$, which holds trivially for any matrix in $\mathcal{W}_r[p]$.)\n\nIf $p$ does not have uniform marginals, but instead is a product distribution, then the quantity $R$ defined above is potentially unbounded, so we cannot apply the same simple argument. However, we can consider the "p-truncated" class of matrices\n\n\[ \mathcal{Z} = \left\{ Z(X) = \left( X_{ij}\, \mathbb{I}\left\{ p(i,j) \geq \frac{\log(n)}{snm} \right\} \right)_{ij} : X \in \mathcal{W}_r[p] \right\}. \]\n\nBy a similar calculation of the expected spectral norms, we can now bound $\mathbb{E}_S[\hat{R}_S(\mathcal{Z})] \leq O\left(\sqrt{\frac{r n \log(n)}{s}}\right)$. Applying Theorem 8 of [14], this bounds $\big(L_p(Z(\hat{X}_S)) - \hat{L}_S(Z(\hat{X}_S))\big)$ (in expectation). Since $Z(\hat{X}_S)_{ij} \neq (\hat{X}_S)_{ij}$ only on the extremely low-probability entries, we can also bound $\big(L_p(\hat{X}_S) - L_p(Z(\hat{X}_S))\big)$ and $\big(\hat{L}_S(Z(\hat{X}_S)) - \hat{L}_S(\hat{X}_S)\big)$. Combining these steps, we can bound $\big(L_p(\hat{X}_S) - \hat{L}_S(\hat{X}_S)\big)$. We similarly bound $\hat{L}_S(X^*) - L_p(X^*)$, where $X^* = \arg\min_{X \in \mathcal{W}_r[p]} L_p(X)$. Since $\hat{L}_S(\hat{X}_S) \leq \hat{L}_S(X^*)$, this yields the desired bound on excess error. The details are given in the Supplementary Materials.\n\n^1Theorem 8 of [14] gives a learning guarantee holding with high probability, but their proof of this theorem (in particular, the last series of displayed equations) contains a guarantee in expectation, which we use here.\n\nExamining the proof of Theorem 1, we see that we can generalize the result by including distributions $p$ with row- and column-marginals that are lower-bounded. 
More precisely, if $p$ satisfies $p_r(i) \geq \frac{1}{Cn}$ and $p_c(j) \geq \frac{1}{Cm}$ for all $i, j$, then the bound (1) holds, up to a factor of $C$. Note that this result does not require an upper bound on the row- and column-marginals, only a lower bound, i.e. it only requires that no marginals are too low. This is important to note since the examples where the unweighted trace-norm fails under a non-uniform distribution are situations where some marginals are very high (but none are too low) [12]. This suggests that the low-probability marginals could perhaps be "smoothed" to satisfy a lower bound, without removing the advantages of the weighted trace-norm. We will exploit this in Section 3 to give a guarantee that holds more generally for arbitrary $p$, when smoothing is applied.\n\n2.2 Guarantees for bounded loss\n\nIn Theorem 1, we showed a strong bound on excess error, but only for a restricted class of distributions $p$. We now show that if the loss function $\ell$ is bounded, then we can give a non-trivial, but weaker, learning guarantee that holds uniformly over all distributions $p$. Since we are in any case discussing Lipschitz loss functions, requiring that the loss function be bounded essentially amounts to requiring that the entries of the matrices involved be bounded. That is, we can view this as a guarantee on learning matrices with bounded entries. In Section 2.3 below, we will show that this boundedness assumption is unavoidable if we want to give a guarantee that holds for arbitrary $p$.\n\nTheorem 2. For an $l$-Lipschitz loss $\ell$ bounded by $b$, fix any matrix $Y$, sample size $s$, and any distribution $p$. Let $\hat{X}_S = \arg\min\{\hat{L}_S(X) : X \in \mathcal{W}_r[p]\}$ for $r \geq 1$. Then, in expectation over the training sample $S$ drawn i.i.d. from the distribution $p$,\n\n\[ L_p(\hat{X}_S) \leq \inf_{X \in \mathcal{W}_r[p]} L_p(X) + O\left( (l + b) \cdot \sqrt[3]{\frac{r n \log(n)}{s}} \right). \quad (2) \]\n\nThe proof is provided in the Supplementary Materials, and is again based on analyzing the expected Rademacher complexity, $\mathbb{E}_S[\hat{R}(\ell \circ \mathcal{W}_r[p])] \leq O\left( (l + b) \cdot \sqrt[3]{\frac{r n \log(n)}{s}} \right)$.\n\n2.3 Problems with the standard weighting\n\nIn the previous Sections, we showed that for distributions $p$ that are either product distributions or have uniform marginals, we can prove a square-root bound on excess error, as shown in (1). For arbitrary $p$, the only learning guarantee we obtain is a cube-root bound, given in (2), for the special case of bounded loss. We would like to know whether the square-root bound might hold uniformly over all distributions $p$, and if not, whether the cube-root bound is the strongest result that we can give for the bounded-loss setting, and whether any bound will hold uniformly over all $p$ in the unbounded-loss setting.\n\nThe examples below demonstrate that we cannot improve the results of Theorems 1 and 2 (up to log factors), by constructing degenerate examples using non-product distributions $p$ with non-uniform marginals. Specifically, in Example 1, we show that in the special case of bounded loss, the cube-root bound in (2) is the best possible bound (up to the log factor) that will hold for all $p$, by giving a construction for arbitrary $n = m$ and arbitrary $s \leq nm$, such that with 1-bounded loss, excess error is $\Omega(\sqrt[3]{n/s})$. In Example 2, we show that with unbounded (Lipschitz) loss, we cannot bound excess error better than a constant bound, by giving a construction for arbitrary $n = m$ and arbitrary $s \leq nm$ in the unbounded-loss regime, where excess error is $\Omega(1)$. 
For both examples we fix $r = 1$. We note that both examples can be modified to fit the transductive setting, demonstrating that smoothing is necessary in the transductive setting as well.\n\n\fExample 1. Let $\ell(x, y) = \min\{1, |x - y|\} \leq 1$, let $a = (2s/n)^{2/3} < n$, and let the matrix $Y$ and the block-wise constant distribution $p$ be given by\n\n\[ Y = \begin{pmatrix} A & 0_{a \times \frac{n}{2}} \\ 0_{(n-a) \times \frac{n}{2}} & 0_{(n-a) \times \frac{n}{2}} \end{pmatrix}, \quad (p(i, j)) = \begin{pmatrix} \frac{1}{2s} \cdot 1_{a \times \frac{n}{2}} & 0_{a \times \frac{n}{2}} \\ 0_{(n-a) \times \frac{n}{2}} & \frac{1 - \frac{an}{4s}}{(n-a)\frac{n}{2}} \cdot 1_{(n-a) \times \frac{n}{2}} \end{pmatrix}, \]\n\nwhere $A \in \{\pm 1\}^{a \times \frac{n}{2}}$ is any sign matrix. Clearly, $\|Y\|_{\mathrm{tr}(p_r, p_c)} \leq 1$, and so $\inf_{X \in \mathcal{W}_r[p]} L_p(X) = 0$. Now suppose we draw a sample $S$ of size $s$ from the matrix $Y$, according to the distribution $p$. We will show an ERM $\hat{Y}$ such that in expectation over $S$, $L_p(\hat{Y}) \geq \frac{1}{8}\sqrt[3]{n/s}$. Consider $Y^S$ where $Y^S_{ij} = Y_{ij}\,\mathbb{I}\{ij \in S\}$, and note that $\|Y^S\|_{\mathrm{tr}(p_r, p_c)} \leq 1$. Since $\hat{L}_S(Y^S) = 0$, it is clearly an ERM. We also have $L_p(Y^S) = \frac{N}{2s}$, where $N$ is the number of $\pm 1$'s in $Y$ which are not observed in the sample. Since $\mathbb{E}[N] \geq \frac{an}{4}$, we see that $\mathbb{E}[L_p(Y^S)] \geq \frac{1}{2s} \cdot \frac{an}{4} \geq \frac{1}{8}\sqrt[3]{n/s}$.\n\nExample 2. Let $\ell(x, y) = |x - y|$. Let $Y = 0_{n \times n}$; trivially, $Y \in \mathcal{W}_r[p]$. Let $p(1, 1) = \frac{1}{s}$, and $p(i, 1) = p(1, j) = 0$ for all $i, j > 1$, yielding $p_r(1) = p_c(1) = \frac{1}{s}$. (The other entries of $p$ may be defined arbitrarily.) We will show an ERM $\hat{Y}$ such that, in expectation over $S$, $L_p(\hat{Y}) \geq 0.25$. Let $A$ be the matrix with $A_{11} = s$ and zeros elsewhere, and note that $\|A\|_{\mathrm{tr}(p_r, p_c)} = 1$. With probability $\geq 0.25$, entry $(1, 1)$ will not appear in $S$, in which case $\hat{Y} = A$ is an ERM, with $L_p(\hat{Y}) = 1$.\n\nThe following table summarizes the learning guarantees that can be established for the (standard) weighted trace-norm. As we saw, these guarantees are tight up to log-factors.\n\n                      1-Lipschitz, 1-bounded loss    1-Lipschitz, unbounded loss\np = product           $\sqrt{rn\log(n)/s}$            $\sqrt{rn\log(n)/s}$\np_r, p_c = uniform    $\sqrt{rn\log(n)/s}$            $\sqrt{rn\log(n)/s}$\np arbitrary           $\sqrt[3]{rn\log(n)/s}$         $1$\n\n3 Smoothing the weighted trace norm\n\nConsidering Theorem 1 and the degenerate examples in Section 2.3, it seems that in order to be able to generalize for non-product distributions, we need to enforce some sort of uniformity on the weights. The Rademacher complexity computations in the proof of Theorem 1 show that the problem lies not with large entries in the vectors $p_r$ and $p_c$ (i.e. if $p_r$ and/or $p_c$ are "spiky"), but with the small entries in these vectors. This suggests the possibility of "smoothing" any overly low row- or column-marginals, in order to improve learning guarantees.\n\nIn Section 3.1, we present such a smoothing, and provide guarantees for learning with a smoothed weighted trace-norm. The result suggests that there is no strong negative consequence to smoothing, but there might be a large advantage when confronted with situations as in Examples 1 and 2. 
In Section 3.2 we check the smoothing correction to the weighted trace-norm on real data, and observe that it can indeed be beneficial in practice.\n\n3.1 Learning guarantee for arbitrary distributions\n\nFix a distribution $p$ and a constant $\alpha \in (0, 1)$, and let $\tilde{p}$ denote the smoothed marginals:\n\n\[ \tilde{p}_r(i) = \alpha \cdot p_r(i) + (1 - \alpha) \cdot \frac{1}{n}, \quad \tilde{p}_c(j) = \alpha \cdot p_c(j) + (1 - \alpha) \cdot \frac{1}{m}. \quad (3) \]\n\nIn the theoretical results below, we use $\alpha = \frac{1}{2}$, but up to a constant factor, the same results hold for any fixed choice of $\alpha \in (0, 1)$.\n\nTheorem 3. For an $l$-Lipschitz loss $\ell$, fix any matrix $Y$, sample size $s$, and any distribution $p$. Let $\hat{X}_S = \arg\min\{\hat{L}_S(X) : X \in \mathcal{W}_r[\tilde{p}]\}$. Then, in expectation over the training sample $S$ drawn i.i.d. from the distribution $p$,\n\n\[ L_p(\hat{X}_S) \leq \inf_{X \in \mathcal{W}_r[\tilde{p}]} L_p(X) + O\left( l \cdot \sqrt{\frac{r n \log(n)}{s}} \right). \quad (4) \]\n\n\f
Then (cid:107)Qt(cid:107)sp \u2264 maxij\n\u02dcpr(i) \u02dcpc(j) \u2264 s \u00b7 maxi\n\u221a\n4sn and applying [15], we obtain the result.\n\n(cid:104)(cid:13)(cid:13)(cid:80)s\n(cid:104)(cid:13)(cid:13)(cid:80)s\n\nt=1 QtQT\nt\nt Qt\n\nand E\nE\n\n\u02dcpr(i) \u02dcpc(j)\np(i,j)\n\n\u221a\n\u2264 2\n\n1\u221a\n\nt=1 QT\n\n1\n\n2 pr(i)\u00b7 1\n\n2m\n\n(cid:13)(cid:13)sp\n\nnm\n\np(i,j)\n\nj\n\nbe a lower bound on the rank, we would need to assume(cid:13)(cid:13)diag (pr)1/2 Xdiag (pc)1/2(cid:13)(cid:13)2\nwe also assume that(cid:13)(cid:13)(X\u2217)(i)\n\nMoving from Theorem 1 to Theorem 3, we are competing with a different class of matrices:\ninf X\u2208Wr[p] Lp(X) (cid:32) inf X\u2208Wr[ \u02dcp] Lp(X).\nIn most applications we can think of, this change is\nnot signi\ufb01cant. For example, we consider the low-rank matrix reconstruction problem, where the\ntrace-norm bound is used as a surrogate for rank. In order for the (squared) weighted trace-norm to\nF \u2264 1 [11]. If\n2 \u2264 n for all rows i and columns j \u2014 i.e. the\nrow and column magnitudes are not \u201cspiky\u201d \u2014 then X\u2217 \u2208 Wr [\u02dcp]. Note that this condition is much\nweaker than placing a spikiness condition on X\u2217 itself, e.g. requiring |X\u2217|\u221e \u2264 1.\n\n(cid:13)(cid:13)2\n2 \u2264 m and(cid:13)(cid:13)(X\u2217)(j)(cid:13)(cid:13)2\n\n3.2 Results on Net\ufb02ix and MovieLens Datasets\n\nWe evaluated different models on two publicly-available collaborative \ufb01ltering datasets: Net\ufb02ix [16]\nand MovieLens [17]. The Net\ufb02ix dataset consists of 100,480,507 ratings from 480,189 users on\n17,770 movies. Net\ufb02ix also provides a quali\ufb01cation set containing 1,408,395 ratings, but due to the\nsampling scheme, ratings from users with few ratings are overrepresented relative to the training\nset. 
To avoid dealing with different training and test distributions, we also created our own validation and test sets, each containing 100,000 ratings set aside from the training set. The MovieLens dataset contains 10,000,054 ratings from 71,567 users on 10,681 movies. We again set aside test and validation sets of 100,000 ratings each. Ratings were normalized to be zero-mean.\n\nWhen dealing with large datasets, the most practical way to fit trace-norm regularized models is via stochastic gradient descent [18, 3, 12]. For computational reasons, however, we consider rank-truncated trace-norm minimization, optimizing within the restricted class $\{X : X \in \mathcal{W}_r[p],\ \mathrm{rank}(X) \leq k\}$ for $k = 30$ and $k = 100$, and for various values of the smoothing parameter $\alpha$ (as in (3)). For each value of $\alpha$ and $k$, the regularization parameter was chosen by cross-validation.\n\nThe following table shows root mean squared error (RMSE) for the experiments. For both $k = 30$ and $k = 100$, the weighted trace-norm with smoothing ($\alpha = 0.9$) significantly outperforms the weighted trace-norm without smoothing ($\alpha = 1$), even on the differently-sampled Netflix qualification set. The proposed weighted trace-norm with smoothing outperforms max-norm regularization [19], and performs comparably to "geometric" smoothing [12]. On the Netflix qualification set, using $k = 30$, max-norm regularization and geometric smoothing achieve RMSE 0.9138 [19] and 0.9091 [12], compared to 0.9096 achieved by the weighted trace-norm with smoothing. 
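The rank-truncated fitting used in these experiments can be sketched roughly as follows. This is our own minimal sketch, not the authors' implementation (which used stochastic gradient descent): it uses alternating ridge regressions on the standard factored form of the weighted trace-norm, $\|X\|_{\mathrm{tr}(p_r,p_c)} = \min_{X=UV^T} \frac{1}{2}(\|\mathrm{diag}(p_r)^{1/2}U\|_F^2 + \|\mathrm{diag}(p_c)^{1/2}V\|_F^2)$, with the rank cap $k$ enforced by the factor width.

```python
import numpy as np

def fit_weighted_als(Ys, mask, p_row, p_col, k=5, lam=1e-3, iters=20, seed=0):
    """Rank-k fit of the observed entries of Ys (mask == 1) with a
    weighted trace-norm penalty in factored form: for each row/column,
    solve a ridge regression with ridge weight lam * marginal."""
    n, m = Ys.shape
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n, k))
    V = rng.standard_normal((m, k))
    for _ in range(iters):
        for i in range(n):
            obs = mask[i] > 0
            G = V[obs].T @ V[obs] + lam * p_row[i] * np.eye(k)
            U[i] = np.linalg.solve(G, V[obs].T @ Ys[i, obs])
        for j in range(m):
            obs = mask[:, j] > 0
            G = U[obs].T @ U[obs] + lam * p_col[j] * np.eye(k)
            V[j] = np.linalg.solve(G, U[obs].T @ Ys[obs, j])
    return U @ V.T

# Toy usage on a synthetic rank-3 matrix with half the entries observed.
rng = np.random.default_rng(1)
n, m = 30, 20
Y = rng.standard_normal((n, 3)) @ rng.standard_normal((3, m))
mask = (rng.random((n, m)) < 0.5).astype(float)
p_r = np.full(n, 1.0 / n)
p_c = np.full(m, 1.0 / m)
X_hat = fit_weighted_als(Y, mask, p_r, p_c, k=5)
err = np.sum(mask * (X_hat - Y) ** 2) / mask.sum()
```

Each alternating step decreases the penalized objective, so the sketch converges reliably on small problems; any SGD variant of the same objective is closer to what was actually run on Netflix-scale data.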
We note that geometric smoothing was proposed by [12] as a heuristic, without any theoretical or conceptual justification.\n\n          Netflix (k=30)      Netflix (k=100)     MovieLens (k=30)   MovieLens (k=100)\n$\alpha$  Test     Qual       Test     Qual       Test               Test\n1         0.7604   0.9107     0.7404   0.9078     0.7852             0.7821\n0.9       0.7589   0.9096     0.7391   0.9068     0.7831             0.7798\n0.5       0.7601   0.9173     0.7419   0.9161     0.7836             0.7815\n0.3       0.7712   0.9198     0.7528   0.9207     0.7864             0.7871\n0         0.7887   0.9249     0.7659   0.9236     0.7997             0.7987\n\n4 The empirically-weighted trace norm\n\nIn practice, the sampling distribution $p$ is not known exactly; it can only be estimated via the locations of the entries which are observed in the sample. Defining the empirical marginals\n\n\[ \hat{p}_r(i) = \frac{\#\{t : i_t = i\}}{s}, \quad \hat{p}_c(j) = \frac{\#\{t : j_t = j\}}{s}, \]\n\nwe would like to give a learning guarantee when $\hat{X}_S$ is estimated via regularization on the $\hat{p}$-weighted trace-norm, rather than the $p$-weighted trace-norm.\n\nIn Section 4.1, we give bounds on excess error when learning with smoothed empirical marginals, which show that there is no theoretical disadvantage as compared to learning with the smoothed true marginals. In fact, we provide evidence that suggests there might even be an advantage to using the empirical marginals. To this end, in Section 4.2, we introduce the transductive learning setting, and give a result based on the empirical marginals which implies a sample complexity bound that is better by a factor of $\log^{1/2}(n)$. 
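The empirical marginals just defined, combined with the smoothing step of (3) at $\alpha = \frac{1}{2}$, are straightforward to compute from the observed index lists; a minimal sketch (our own; the function name is ours):

```python
import numpy as np

def smoothed_empirical_marginals(rows, cols, n, m, alpha=0.5):
    """Empirical row/column frequencies of the sample, smoothed toward uniform."""
    s = len(rows)
    p_hat_r = np.bincount(rows, minlength=n) / s
    p_hat_c = np.bincount(cols, minlength=m) / s
    p_check_r = alpha * p_hat_r + (1 - alpha) / n
    p_check_c = alpha * p_hat_c + (1 - alpha) / m
    return p_check_r, p_check_c

rng = np.random.default_rng(0)
n, m, s = 6, 4, 100
rows = rng.integers(0, n, size=s)
cols = rng.integers(0, m, size=s)
pr, pc = smoothed_empirical_marginals(rows, cols, n, m)
# Smoothed marginals are proper distributions, bounded below by (1 - alpha)/dim,
# which is exactly the lower-bound property the theory relies on.
assert np.isclose(pr.sum(), 1.0) and np.isclose(pc.sum(), 1.0)
assert pr.min() >= 0.5 / n and pc.min() >= 0.5 / m
```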
In Section 4.3, we show that in low-rank matrix reconstruction simulations, using empirical marginals indeed yields better reconstructions.\n\n4.1 Guarantee for the standard (inductive) setting\n\nWe first show that when learning with the smoothed empirical marginals, defined as\n\n\[ \check{p}_r(i) = \frac{1}{2}\left( \hat{p}_r(i) + \frac{1}{n} \right), \quad \check{p}_c(j) = \frac{1}{2}\left( \hat{p}_c(j) + \frac{1}{m} \right), \]\n\nwe can obtain the same guarantee as for learning with the smoothed (true) marginals, given by $\tilde{p}$.\n\nTheorem 4. For an $l$-Lipschitz loss $\ell$, fix any matrix $Y$, sample size $s$, and any distribution $p$. Let $\hat{X}_S = \arg\min\{\hat{L}_S(X) : X \in \mathcal{W}_r[\check{p}]\}$. Then, in expectation over the training sample $S$ drawn i.i.d. from the distribution $p$,\n\n\[ L_p(\hat{X}_S) \leq \inf_{X \in \mathcal{W}_r[\tilde{p}]} L_p(X) + O\left( l \cdot \sqrt{\frac{r \max\{n, m\} \log(n + m)}{s}} \right). \quad (5) \]\n\nNote that although we regularize using the (smoothed) empirically-weighted trace-norm, we still compare ourselves to the best possible matrix in the class defined by the (smoothed) true marginals. The proof of this Theorem (in the Supplementary Material) uses Theorem 3 and involves showing that when $s = \Omega(n \log(n))$, which is required for all Theorems so far to be meaningful, the true and empirical marginals are the same up to a constant factor. For this to be the case, such a sample size is even necessary. In fact, the $\log(n)$ factor in our analysis (e.g. in the proof of Theorem 1) arises from the bound on the expected spectral norm of a matrix, which, for a diagonal matrix, is just a bound on the deviation of empirical frequencies. Might it be possible, then, to avoid this logarithmic factor by using the empirical marginals? 
Although we could not establish such a result in the inductive setting, we now turn to the transductive setting, where we could indeed obtain a better guarantee.\n\n4.2 Guarantee for the transductive setting\n\nIn the transductive model, we fix a set $\bar{S} \subset [n] \times [m]$ of size $2s$, and then randomly split $\bar{S}$ into a training set $S$ and a test set $T$ of equal size $s$. The goal is to obtain a good estimator for the entries in $T$, based on the values of the entries in $S$ as well as the locations (indexes) of all elements of $\bar{S}$. We will use the smoothed empirical marginals of $\bar{S}$ for the weighted trace-norm.\n\nWe now show that, for bounded loss, there may be a benefit to weighting with the smoothed empirical marginals: the sample size requirement can be lowered to $s = O(r n \log^{1/2}(n))$.\n\nTheorem 5. For an $l$-Lipschitz loss $\ell$ bounded by $b$, fix any matrix $Y$ and sample size $s$. Let $\bar{S} \subset [n] \times [m]$ be a fixed subset of size $2s$, split uniformly at random into training and test sets $S$ and $T$, each of size $s$. Let $\bar{p}$ denote the smoothed empirical marginals of $\bar{S}$. Let $\hat{X}_S = \arg\min\{\hat{L}_S(X) : X \in \mathcal{W}_r[\bar{p}]\}$. Then, in expectation over the splitting of $\bar{S}$ into $S$ and $T$,\n\n\[ \hat{L}_T(\hat{X}_S) \leq \inf_{X \in \mathcal{W}_r[\bar{p}]} \hat{L}_T(X) + O\left( l \cdot \sqrt{\frac{r n \log^{1/2}(n)}{s}} + \frac{b}{\sqrt{s}} \right). \quad (6) \]\n\nThis result (proved in the Supplementary Materials) is stated in the transductive setting, with a somewhat different sampling procedure and evaluation criterion, but we believe the main difference is in the use of the empirical weights. Although it is usually straightforward to convert a transductive guarantee to an inductive one, the situation here is more complicated, since the hypothesis class depends on the weighting, and hence on the sample $\bar{S}$. 
Nevertheless, we believe such a conversion might be possible, establishing a similar guarantee for learning with the (smoothed) empirically weighted trace-norm also in the inductive setting. Furthermore, since the empirical marginals are close to the true marginals when $s = \Theta(n \log(n))$, it might be possible to obtain a learning guarantee for the true (non-empirical) weighting with a sample of size $s = O\big(n (r \log^{1/2}(n) + \log(n))\big)$.\n\nTheorem 5 can be viewed as a transductive analog to Theorem 3 (with weights based on the combined sample $\bar{S}$). In the Supplementary Materials we give transductive analogs to Theorems 1 and 2. As mentioned in Section 2.3, our lower bound examples can also be stated in the transductive setting, and thus all our guarantees and lower bounds can also be obtained in this setting.\n\n4.3 Simulations with empirical weights\n\nIn order to numerically investigate the possible advantage of empirical weighting, we performed simulations on low-rank matrix reconstruction under uniform sampling with the unweighted, and the smoothed empirically weighted, trace-norms. We chose to work with uniform sampling in order to emphasize the benefit of empirical weights, even in situations where one might not consider using any weights at all. In all the experiments, we attempt to reconstruct a possibly noisy, random rank-2 "signal" matrix $M$ with singular values $\frac{1}{\sqrt{2}}(n, n, 0, \ldots, 0)$, ensuring $\|M\|_F = n$. We measure error using the squared loss.^2 Simulations were performed using MATLAB, with code adapted from the SOFTIMPUTE code developed by [20]. We performed two types of simulations:\n\nSample complexity comparison in the noiseless setting: We define $Y = M$, and compute $\hat{X}_S = \arg\min\{\|X\| : \hat{L}_S(X) = 0\}$, where $\|X\| = \|X\|_{\mathrm{tr}}$ or $\|X\|_{\mathrm{tr}(\hat{p}_r, \hat{p}_c)}$, as appropriate. In Figure 1(a), we plot the average number of samples per row needed to get average squared error (over 100 repetitions) of at most 0.1, with both uniform weighting and empirical weighting.\n\nExcess error comparison in the noiseless and noisy settings: We define $Y = M + \nu N$, where the noise $N$ has i.i.d. standard normal entries. We compute $\hat{X}_S = \arg\min\{\|X\| : \hat{L}_S(X) \leq \nu^2\}$. In Figure 1(b), we plot the resulting average squared error (over 100 repetitions) over a range of sample sizes $s$ and noise levels $\nu$, with both uniform weighting and empirical weighting. A larger plot including standard error bars is shown in the Supplementary Materials.\n\nThe results from both experiments show a significant benefit to using the empirical marginals.\n\nFigure 1: (a) Left: Sample size needed to obtain avg. error 0.1, with respect to $n$. (b) Right: Excess error level over a range of sample sizes, for fixed $n = 200$. (Axes are on a logarithmic scale.)\n\n5 Discussion\n\nIn this paper, we prove learning guarantees for the weighted trace-norm by analyzing expected Rademacher complexities. We show that weighting with smoothed marginals eliminates degenerate scenarios that can arise in the case of a non-product sampling distribution, and demonstrate in experiments on the Netflix and MovieLens datasets that this correction can be useful in applied settings. We also give results for empirically-weighted trace-norm regularization, and see indications that using the empirical distribution may be better than using the true distribution, even if it is available.\n\n^2Although Lipschitz in a bounded domain, it is probably possible to improve all our results (removing the square root) for the special case of the squared loss, possibly with an i.i.d. 
noise assumption, as in [10].

References

[1] M. Fazel. Matrix rank minimization with applications. PhD thesis, Stanford University, 2002.
[2] N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. Advances in Neural Information Processing Systems, 17, 2004.
[3] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. Advances in Neural Information Processing Systems, 20, 2007.
[4] F. Bach. Consistency of trace-norm minimization. Journal of Machine Learning Research, 9:1019–1048, 2008.
[5] E. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory, 56(5):2053–2080, 2009.
[6] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. 18th Annual Conference on Learning Theory (COLT), pages 545–560, 2005.
[7] B. Recht. A simpler approach to matrix completion. arXiv:0910.0651, 2009.
[8] R. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. Journal of Machine Learning Research, 11:2057–2078, 2010.
[9] V. Koltchinskii, A. Tsybakov, and K. Lounici. Nuclear norm penalization and optimal rates for noisy low rank matrix completion. arXiv:1011.6256, 2010.
[10] S. Negahban and M. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. arXiv:1009.2118, 2010.
[11] R. Foygel and N. Srebro. Concentration-based guarantees for low-rank matrix reconstruction. 24th Annual Conference on Learning Theory (COLT), 2011.
[12] R. Salakhutdinov and N. Srebro. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm.
Advances in Neural Information Processing Systems, 23, 2010.
[13] O. Shamir and S. Shalev-Shwartz. Collaborative filtering with the trace norm: Learning, bounding, and transducing. 24th Annual Conference on Learning Theory (COLT), 2011.
[14] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
[15] J. A. Tropp. User-friendly tail bounds for sums of random matrices. arXiv:1004.4389, 2010.
[16] J. Bennett and S. Lanning. The Netflix prize. In Proceedings of KDD Cup and Workshop, volume 2007, page 35. Citeseer, 2007.
[17] MovieLens dataset. Available at http://www.grouplens.org/node/73, 2006.
[18] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. ACM Int. Conference on Knowledge Discovery and Data Mining (KDD'08), pages 426–434, 2008.
[19] J. Lee, B. Recht, R. Salakhutdinov, N. Srebro, and J. Tropp. Practical large-scale optimization for max-norm regularization. Advances in Neural Information Processing Systems, 23, 2010.
[20] R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287–2322, 2010.