{"title": "Matrix reconstruction with the local max norm", "book": "Advances in Neural Information Processing Systems", "page_first": 935, "page_last": 943, "abstract": "We introduce a new family of matrix norms, the ''local max'' norms, generalizing existing methods such as the max norm, the trace norm (nuclear norm), and the weighted or smoothed weighted trace norms, which have been extensively used in the literature as regularizers for matrix reconstruction problems. We show that this new family can be used to interpolate between the (weighted or unweighted) trace norm and the more conservative max norm. We test this interpolation on simulated data and on the large-scale Netflix and MovieLens ratings data, and find improved accuracy relative to the existing matrix norms. We also provide theoretical results showing learning guarantees for some of the new norms.", "full_text": "Matrix reconstruction with the local max norm\n\nRina Foygel\n\nDepartment of Statistics\n\nStanford University\n\nrinafb@stanford.edu\n\nNathan Srebro\n\nToyota Technological Institute at Chicago\n\nnati@ttic.edu\n\nRuslan Salakhutdinov\n\nDept. of Statistics and Dept. of Computer Science University of Toronto\n\nrsalakhu@utstat.toronto.edu\n\nAbstract\n\nWe introduce a new family of matrix norms, the \u201clocal max\u201d norms, generalizing\nexisting methods such as the max norm, the trace norm (nuclear norm), and the\nweighted or smoothed weighted trace norms, which have been extensively used in\nthe literature as regularizers for matrix reconstruction problems. We show that this\nnew family can be used to interpolate between the (weighted or unweighted) trace\nnorm and the more conservative max norm. We test this interpolation on simulated\ndata and on the large-scale Net\ufb02ix and MovieLens ratings data, and \ufb01nd improved\naccuracy relative to the existing matrix norms. 
We also provide theoretical results showing learning guarantees for some of the new norms.\n\n1 Introduction\n\nIn the matrix reconstruction problem, we are given a matrix Y ∈ R^{n×m} whose entries are only partly observed, and would like to reconstruct the unobserved entries as accurately as possible. Matrix reconstruction arises in many modern applications, including the areas of collaborative filtering (e.g. the Netflix prize), image and video data, and others. This problem has often been approached using regularization with matrix norms that promote low-rank or approximately low-rank solutions, including the trace norm (also known as the nuclear norm) and the max norm, as well as several adaptations of the trace norm described below.\nIn this paper, we introduce a unifying family of norms that generalizes these existing matrix norms, and that can be used to interpolate between the trace and max norms. We show that this family includes new norms, lying strictly between the trace and max norms, that give empirical and theoretical improvements over the existing norms. We give results allowing for large-scale optimization with norms from the new family. Some proofs are deferred to the Supplementary Materials.\n\nNotation. Without loss of generality we take n ≥ m. We let R_+ denote the nonnegative real numbers. For any n ∈ N, let [n] = {1, . . . , n}, and define the simplex on [n] as Δ_[n] = { r ∈ R_+^n : Σ_i r_i = 1 }. We analyze situations where the locations of observed entries are sampled i.i.d. according to some distribution p on [n] × [m]. We write p_{i•} = Σ_j p_{ij} to denote the marginal probability of row i, and p_row = (p_{1•}, . . . , p_{n•}) ∈ Δ_[n] to denote the marginal row distribution. We define p_{•j} and p_col similarly for the columns. 
For any matrix M, M_(i) denotes its ith row.\n\n1.1 Trace norm and max norm\n\nA common regularizer used in matrix reconstruction, and other matrix problems, is the trace norm ||X||_tr, equal to the sum of the singular values of X. This norm can also be defined via a factorization of X [1]:\n\n(1/√(nm)) ||X||_tr = (1/2) min_{AB^T = X} ( (1/n) Σ_i ||A_(i)||_2^2 + (1/m) Σ_j ||B_(j)||_2^2 ) ,   (1)\n\nwhere the minimum is taken over factorizations of X of arbitrary dimension, that is, the number of columns in A and B is unbounded. Note that we choose to scale the trace norm by 1/√(nm) in order to emphasize that we are averaging the squared row norms of A and B.\nRegularization with the trace norm gives good theoretical and empirical results, as long as the locations of observed entries are sampled uniformly (i.e. when p is the uniform distribution on [n] × [m]), and, under this assumption, can also be used to guarantee approximate recovery of an underlying low-rank matrix [1, 2, 3, 4].\nThe factorized definition of the trace norm (1) allows for an intuitive comparison with the max norm, defined as [1]:\n\n||X||_max = (1/2) min_{AB^T = X} ( sup_i ||A_(i)||_2^2 + sup_j ||B_(j)||_2^2 ) .   (2)\n\nWe see that the max norm measures the largest row norms in the factorization, while the rescaled trace norm instead considers the average row norms. The max norm is therefore an upper bound on the rescaled trace norm, and can be viewed as a more conservative regularizer. 
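As a quick numerical sketch of the factorization identity (1) (assuming Python with numpy; the variable names are ours, not from the paper), any single explicit factorization X = AB^T can only overestimate the infimum, so it upper-bounds the rescaled trace norm computed directly from the singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 40, 30, 3
A = rng.standard_normal((n, k))
B = rng.standard_normal((m, k))
X = A @ B.T

# Left-hand side of (1): sum of singular values, rescaled by 1/sqrt(nm).
rescaled_trace = np.linalg.svd(X, compute_uv=False).sum() / np.sqrt(n * m)

# Objective of (1) evaluated at one particular factorization AB^T = X:
# half the sum of the *average* squared row norms of A and B.
factor_bound = 0.5 * ((A ** 2).sum(axis=1).mean() + (B ** 2).sum(axis=1).mean())

# The infimum over all factorizations attains the rescaled trace norm,
# so any one factorization gives an upper bound.
assert rescaled_trace <= factor_bound + 1e-9
```

An SVD-based factorization, suitably rescaled, attains the infimum.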
For the more general setting where p may not be uniform, Foygel and Srebro [4] show that the max norm is still an effective regularizer (in particular, bounds on error for the max norm are not affected by p). On the other hand, Salakhutdinov and Srebro [5] show that the trace norm is not robust to non-uniform sampling: regularizing with the trace norm may yield large error due to overfitting on the rows and columns with high marginals. They obtain improved empirical results by placing more penalization on these over-represented rows and columns, described next.\n\n1.2 The weighted trace norm\n\nTo reduce overfitting on the rows and columns with high marginal probabilities under the distribution p, Salakhutdinov and Srebro propose regularizing with the p-weighted trace norm,\n\n||X||_tr(p) := || diag(p_row)^{1/2} · X · diag(p_col)^{1/2} ||_tr .\n\nIf the row and the column of entries to be observed are sampled independently (i.e. p = p_row · p_col is a product distribution), then the p-weighted trace norm can be used to obtain good learning guarantees even when p_row and p_col are non-uniform [3, 6]. However, for non-uniform non-product sampling distributions, even the p-weighted trace norm can yield poor generalization performance. To correct for this, Foygel et al. 
[6] suggest adding in some “smoothing” to avoid under-penalizing the rows and columns with low marginal probabilities, and obtain improved empirical and theoretical results for matrix reconstruction using the smoothed weighted trace norm:\n\n||X||_tr(p̃) := || diag(p̃_row)^{1/2} · X · diag(p̃_col)^{1/2} ||_tr ,\n\nwhere p̃_row and p̃_col denote smoothed row and column marginals, given by\n\np̃_row = (1 − ζ) · p_row + ζ · 1/n and p̃_col = (1 − ζ) · p_col + ζ · 1/m ,   (3)\n\nfor some choice of smoothing parameter ζ, which may be selected with cross-validation.¹ The smoothed empirically-weighted trace norm is also studied in [6], where p_{i•} is replaced with p̂_{i•} = (# observations in row i)/(total # observations), the empirical marginal probability of row i (and similarly for p̂_{•j}). Using empirical rather than “true” weights yielded lower error in experiments in [6], even when the true sampling distribution was uniform.\nMore generally, for any weight vectors r ∈ Δ_[n] and c ∈ Δ_[m] and a matrix X ∈ R^{n×m}, the (r, c)-weighted trace norm is given by\n\n||X||_tr(r,c) = || diag(r)^{1/2} · X · diag(c)^{1/2} ||_tr .\n\n¹Our ζ parameter here is equivalent to 1 − α in [6].\n\nOf course, we can easily obtain the existing methods of the uniform trace norm, (empirically) weighted trace norm, and smoothed (empirically) weighted trace norm as special cases of this formulation. 
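To make these definitions concrete, here is a small sketch in Python/numpy (`weighted_trace_norm` is our own illustrative helper, not the paper's code) of the (r, c)-weighted trace norm and the smoothed marginals in (3), using randomly generated stand-in marginals:

```python
import numpy as np

def weighted_trace_norm(X, r, c):
    # ||X||_tr(r,c) = || diag(r)^{1/2} . X . diag(c)^{1/2} ||_tr
    W = np.sqrt(r)[:, None] * X * np.sqrt(c)[None, :]
    return np.linalg.svd(W, compute_uv=False).sum()

rng = np.random.default_rng(1)
n, m = 50, 40
X = rng.standard_normal((n, 5)) @ rng.standard_normal((5, m))

# Smoothed marginals as in (3), with illustrative (random) true marginals p.
p_row, p_col = rng.dirichlet(np.ones(n)), rng.dirichlet(np.ones(m))
zeta = 0.3
r = (1 - zeta) * p_row + zeta / n
c = (1 - zeta) * p_col + zeta / m
smoothed = weighted_trace_norm(X, r, c)

# Uniform weights recover the rescaled trace norm of Section 1.1.
uniform = weighted_trace_norm(X, np.full(n, 1 / n), np.full(m, 1 / m))
assert np.isclose(uniform,
                  np.linalg.svd(X, compute_uv=False).sum() / np.sqrt(n * m))
```

The smoothing pulls each weight toward the uniform value 1/n (or 1/m), so no row or column is penalized as if it were never observed.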
Furthermore, the max norm is equal to a supremum over all possible weightings [7]:\n\n||X||_max = sup_{r ∈ Δ_[n], c ∈ Δ_[m]} ||X||_tr(r,c) .\n\n2 The local max norm\n\nWe consider a generalization of these norms, which lies “in between” the trace norm and max norm. For any R ⊆ Δ_[n] and C ⊆ Δ_[m], we define the (R, C)-norm of X:\n\n||X||_(R,C) = sup_{r ∈ R, c ∈ C} ||X||_tr(r,c) .\n\nThis gives a norm on matrices, except in the trivial case where, for some i or some j, r_i = 0 for all r ∈ R or c_j = 0 for all c ∈ C.\nWe now show some existing and novel norms that can be obtained using local max norms.\n\n2.1 Trace norm and max norm\n\nWe can obtain the max norm by taking the largest possible R and C, i.e. ||X||_max = ||X||_(Δ_[n],Δ_[m]), and similarly we can obtain the (r, c)-weighted trace norm by taking the singleton sets R = {r} and C = {c}. As discussed above, this includes the standard trace norm (when r and c are uniform), as well as the weighted, empirically weighted, and smoothed weighted trace norms.\n\n2.2 Arbitrary smoothing\n\nWhen using the smoothed weighted trace norm, we need to choose the amount of smoothing to apply to the marginals, that is, we need to choose ζ in our definition of the smoothed row and column weights, as given in (3). Alternately, we could regularize simultaneously over all possible amounts of smoothing by considering the local max norm with\n\nR = { (1 − ζ) · p_row + ζ · 1/n : any ζ ∈ [0, 1] } ,\n\nand similarly for C. 
That is, R and C are line segments in the simplex: they are larger than any single point as for the uniform or weighted trace norm (or the smoothed weighted trace norm for a fixed amount of smoothing), but smaller than the entire simplex as for the max norm.\n\n2.3 Connection to (β, τ)-decomposability\n\nHazan et al. [8] introduce a class of matrices defined by a property of (β, τ)-decomposability: a matrix X satisfies this property if there exists a factorization X = AB^T (where A and B may have an arbitrary number of columns) such that²\n\nmax{ max_i ||A_(i)||_2^2 , max_j ||B_(j)||_2^2 } ≤ 2β and Σ_i ||A_(i)||_2^2 + Σ_j ||B_(j)||_2^2 ≤ τ .\n\nComparing with (1) and (2), we see that the β and τ parameters essentially correspond to the max norm and trace norm, with the max norm being the minimal 2β* such that the matrix is (β*, τ)-decomposable for some τ, and the trace norm being the minimal τ*/2 such that the matrix is (β, τ*)-decomposable for some β. However, Hazan et al. go beyond these two extremes, and rely on balancing both β and τ: they establish learning guarantees (in an adversarial online model, and thus also under an arbitrary sampling distribution p) which scale with √(β · τ). 
It may therefore be useful to consider a penalty function of the form:\n\nPenalty_(β,τ)(X) = min_{X=AB^T} { √( max_i ||A_(i)||_2^2 + max_j ||B_(j)||_2^2 ) · √( Σ_i ||A_(i)||_2^2 + Σ_j ||B_(j)||_2^2 ) } .   (4)\n\n²Hazan et al. state the property differently, but equivalently, in terms of a semidefinite matrix decomposition.\n\n(Note that max{ max_i ||A_(i)||_2^2 , max_j ||B_(j)||_2^2 } is replaced with max_i ||A_(i)||_2^2 + max_j ||B_(j)||_2^2 for later convenience. This affects the value of the penalty function by at most a factor of √2.)\nThis penalty function does not appear to be convex in X. However, the proposition below (proved in the Supplementary Materials) shows that we can use a (convex) local max norm penalty to compute a solution to any objective function with a penalty function of the form (4):\n\nProposition 1. Let X̂ be the minimizer of a penalized loss function with this modified penalty,\n\nX̂ := arg min_X { Loss(X) + λ · Penalty_(β,τ)(X) } ,\n\nwhere λ ≥ 0 is some penalty parameter and Loss(·) is any convex function. 
Then, for some penalty parameter µ ≥ 0 and some t ∈ [0, 1],\n\nX̂ = arg min_X { Loss(X) + µ · ||X||_(R,C) } , where\n\nR = { r ∈ Δ_[n] : r_i ≥ t / (1 + (n − 1)t) ∀i } and C = { c ∈ Δ_[m] : c_j ≥ t / (1 + (m − 1)t) ∀j } .\n\nWe note that µ and t cannot be determined based on λ alone; they will depend on the properties of the unknown solution X̂.\nHere the sets R and C impose a lower bound on each of the weights, and this lower bound can be used to interpolate between the max and trace norms: when t = 1, each r_i is lower-bounded by 1/n (and similarly for c_j), i.e. R and C are singletons containing only the uniform weights, and we obtain the trace norm. On the other hand, when t = 0, the weights are lower-bounded by zero, and so any weight vector is allowed, i.e. R and C are each the entire simplex and we obtain the max norm. Intermediate values of t interpolate between the trace norm and max norm and correspond to different balances between β and τ.\n\n2.4 Interpolating between trace norm and max norm\n\nWe next turn to an interpolation which relies on an upper bound, rather than a lower bound, on the weights. Consider\n\nR_ε = { r ∈ Δ_[n] : r_i ≤ ε ∀i } and C_δ = { c ∈ Δ_[m] : c_j ≤ δ ∀j } ,   (5)\n\nfor some ε ∈ [1/n, 1] and δ ∈ [1/m, 1]. 
The (R_ε, C_δ)-norm is then equal to the (rescaled) trace norm when we choose ε = 1/n and δ = 1/m, and is equal to the max norm when we choose ε = δ = 1. Allowing ε and δ to take intermediate values gives a smooth interpolation between these two familiar norms, and may be useful in situations where we want more flexibility in the type of regularization.\nWe can generalize this to an interpolation between the max norm and a smoothed weighted trace norm, which we will use in our experimental results. We consider two generalizations; for each one, we state a definition of R, with C defined analogously. The first is multiplicative:\n\nR×_{ζ,γ} := { r ∈ Δ_[n] : r_i ≤ γ · ((1 − ζ) · p_{i•} + ζ · 1/n) ∀i } ,   (6)\n\nwhere γ = 1 corresponds to choosing the singleton set R×_{ζ,γ} = { (1 − ζ) · p_row + ζ · 1/n } (i.e. the smoothed weighted trace norm), while γ = ∞ corresponds to the max norm (for any choice of ζ), since we would get R×_{ζ,γ} = Δ_[n].\nThe second option for an interpolation is instead defined with an exponent:\n\nR_{ζ,τ} := { r ∈ Δ_[n] : r_i ≤ ((1 − ζ) · p_{i•} + ζ · 1/n)^{1−τ} ∀i } .   (7)\n\nHere τ = 0 will yield the singleton set corresponding to the smoothed weighted trace norm, while τ = 1 will yield R_{ζ,τ} = Δ_[n], i.e. the max norm, for any choice of ζ.\nWe find the second (exponent) option to be more natural, because each of the row marginal bounds will reach 1 simultaneously when τ = 1, and hence we use this version in our experiments. On the other hand, the multiplicative version is easier to work with theoretically, and we use it in our learning guarantee in Section 4.2. 
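For intuition about the upper-bounded sets in (5), consider the vector norm they induce, sup_{r ∈ R_ε} √(Σ_i r_i u_i^2); this norm is formally introduced in Section 4.1, where it is noted that for integer 1/ε it equals the root-mean-square of the 1/ε largest entries of u. A small sketch, assuming Python with numpy (`local_vec_norm` is our own hypothetical helper):

```python
import numpy as np

def local_vec_norm(u, eps):
    """sup over r in the simplex with r_i <= eps of sqrt(sum_i r_i * u_i**2).

    The supremum loads weight eps onto the largest entries of u**2,
    with any leftover probability mass on the next-largest entry.
    """
    w = np.sort(np.asarray(u, dtype=float) ** 2)[::-1]
    K = int(1.0 / eps)          # entries that receive the full weight eps
    total = eps * w[:K].sum()
    if K < len(w):              # leftover mass when 1/eps is not an integer
        total += (1.0 - K * eps) * w[K]
    return np.sqrt(total)

u = np.array([3.0, -4.0, 1.0, 0.5])
n = len(u)

# eps = 1: no constraint beyond the simplex, giving the l-infinity norm.
assert np.isclose(local_vec_norm(u, 1.0), np.abs(u).max())

# eps = 1/n: uniform weights are forced, giving the root-mean-square of u.
assert np.isclose(local_vec_norm(u, 1.0 / n), np.sqrt((u ** 2).mean()))

# Intermediate eps: RMS of the 1/eps largest entries (here, of 3 and -4).
assert np.isclose(local_vec_norm(u, 0.5), np.sqrt((3.0**2 + 4.0**2) / 2))
```

Sweeping ε from 1/n to 1 thus moves the induced vector norm from a rescaled ℓ2 norm to ℓ∞, mirroring the matrix-level interpolation from trace norm to max norm.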
If all of the row and column marginals satisfy some loose upper bound, then the two options will not be highly different.\n\n3 Optimization with the local max norm\n\nOne appeal of both the trace norm and the max norm is that they are both SDP-representable [9, 10], and thus easily optimizable, at least in small-scale problems. In the Supplementary Materials we show that the local max norm is also SDP-representable, as long as the sets R and C can be written in terms of linear or semidefinite constraints; this includes all the examples we mention, in all of which the sets R and C are specified in terms of simple linear constraints.\nHowever, for large-scale problems, it is not practical to directly use SDP optimization approaches. Instead, and especially for very large-scale problems, an effective optimization approach for both the trace norm and the max norm is to use the factorized versions of the norms, given in (1) and (2), and to optimize the factorization directly (typically, only factorizations of some truncated dimensionality are used) [11, 12, 7]. As we show in Theorem 1 below, a similar factorization-optimization approach is also possible for any local max norm with convex R and C. We further give a simplified representation which is applicable when R and C are specified through element-wise upper bounds R ∈ R_+^n and C ∈ R_+^m, respectively:\n\nR = { r ∈ Δ_[n] : r_i ≤ R_i ∀i } and C = { c ∈ Δ_[m] : c_j ≤ C_j ∀j } ,   (8)\n\nwith 0 ≤ R_i ≤ 1, Σ_i R_i ≥ 1, 0 ≤ C_j ≤ 1, Σ_j C_j ≥ 1 to avoid triviality. This includes the interpolation norms of Section 2.4.\n\nTheorem 1. If R and C are convex, then the (R, C)-norm can be calculated with the factorization\n\n||X||_(R,C) = (1/2) inf_{AB^T = X} ( sup_{r∈R} Σ_i r_i ||A_(i)||_2^2 + sup_{c∈C} Σ_j c_j ||B_(j)||_2^2 ) .   (9)\n\nIn the special case when R and C are defined by (8), writing (x)_+ := max{0, x}, this simplifies to\n\n||X||_(R,C) = (1/2) inf_{AB^T = X; a,b ∈ R} ( a + Σ_i R_i ( ||A_(i)||_2^2 − a )_+ + b + Σ_j C_j ( ||B_(j)||_2^2 − b )_+ ) .\n\nProof sketch for Theorem 1. 
For convenience we will write r^{1/2} to mean diag(r)^{1/2}, and similarly for c. Using the trace norm factorization identity (1), we have\n\n2||X||_(R,C) = 2 sup_{r∈R, c∈C} || r^{1/2} · X · c^{1/2} ||_tr = sup_{r∈R, c∈C} inf_{CD^T = r^{1/2}·X·c^{1/2}} ( ||C||_F^2 + ||D||_F^2 ) = sup_{r∈R, c∈C} inf_{AB^T = X} ( || r^{1/2}·A ||_F^2 + || c^{1/2}·B ||_F^2 ) ≤ inf_{AB^T = X} ( sup_{r∈R} || r^{1/2}·A ||_F^2 + sup_{c∈C} || c^{1/2}·B ||_F^2 ) ,\n\nwhere for the next-to-last step we set C = r^{1/2}·A and D = c^{1/2}·B, and the last step follows because sup inf ≤ inf sup always (weak duality). The reverse inequality holds as well (strong duality), and is proved in the Supplementary Materials, where we also prove the special-case result.\n\n4 An approximate convex hull and a learning guarantee\n\nIn this section, we look for theoretical bounds on error for the problem of estimating unobserved entries in a matrix Y that is approximately low-rank. 
Our results apply for either uniform or non-uniform sampling of entries from the matrix. We begin with a result comparing the (R, C)-norm unit ball to a convex hull of rank-1 matrices, which will be useful for proving our learning guarantee.\n\n4.1 Convex hull\n\nTo gain a better theoretical understanding of the (R, C)-norm, we first need to define corresponding vector norms on R^n and R^m. For any u ∈ R^n, let\n\n||u||_R := sup_{r∈R} √( Σ_i r_i u_i^2 ) = sup_{r∈R} || diag(r)^{1/2} · u ||_2 .\n\nWe can think of this norm as a way to interpolate between the ℓ2 and ℓ∞ vector norms. For example, if we choose R = R_ε as defined in (5), then ||u||_R is equal to the root-mean-square of the ε^{−1} largest entries of u whenever ε^{−1} is an integer. Defining ||v||_C analogously for v ∈ R^m, we can now relate these vector norms to the (R, C)-norm on matrices.\n\nTheorem 2. For any convex R ⊆ Δ_[n] and C ⊆ Δ_[m], the (R, C)-norm unit ball is bounded above and below by a convex hull as:\n\nConv{ uv^T : ||u||_R = ||v||_C = 1 } ⊆ { X : ||X||_(R,C) ≤ 1 } ⊆ K_G · Conv{ uv^T : ||u||_R = ||v||_C = 1 } ,\n\nwhere K_G ≤ 1.79 is Grothendieck's constant, and implicitly u ∈ R^n, v ∈ R^m.\nThis result is a nontrivial extension of Srebro and Shraibman [1]'s analysis for the max norm and the trace norm. They show that the statement holds for the max norm, i.e. when R = Δ_[n] and C = Δ_[m], and that the trace norm unit ball is exactly equal to the corresponding convex hull (see Corollary 2 and Section 3.2 in their paper, respectively).\n\nProof sketch for Theorem 2. 
To prove the first inclusion, given any X = uv^T with ||u||_R = ||v||_C = 1, we apply the factorization result Theorem 1 to see that ||X||_(R,C) ≤ 1. Since the (R, C)-norm unit ball is convex, this is sufficient. For the second inclusion, we state a weighted version of Grothendieck's Inequality (proof in the Supplementary Materials):\n\nsup{ ⟨Y, UV^T⟩ : U ∈ R^{n×k}, V ∈ R^{m×k}, ||U_(i)||_2 ≤ a_i ∀i, ||V_(j)||_2 ≤ b_j ∀j } = K_G · sup{ ⟨Y, uv^T⟩ : u ∈ R^n, v ∈ R^m, |u_i| ≤ a_i ∀i, |v_j| ≤ b_j ∀j } .\n\nWe then apply this weighted inequality to the dual norm of the (R, C)-norm to prove the desired inclusion, as in Srebro and Shraibman [1]'s work for the max norm case (see Corollary 2 in their paper). Details are given in the Supplementary Materials.\n\n4.2 Learning guarantee\n\nWe now give our main matrix reconstruction result, which provides error bounds for a family of norms interpolating between the max norm and the smoothed weighted trace norm.\n\nTheorem 3. Let p be any distribution on [n] × [m]. Suppose that, for some γ ≥ 1, R ⊇ R×_{1/2,γ} and C ⊇ C×_{1/2,γ}, where these two sets are defined in (6). Let S = {(i_t, j_t) : t = 1, . . . , s} be a random sample of locations in the matrix drawn i.i.d. from p, where s ≥ n. Then, in expectation over the sample S,\n\nΣ_{ij} p_ij |Y_ij − X̂_ij| ≤ inf_{||X||_(R,C) ≤ √k} Σ_{ij} p_ij |Y_ij − X_ij| [approximation error] + O( (1 + log(n)/√γ) · √(kn/s) ) [excess error] ,\n\nwhere X̂ = arg min_{||X||_(R,C) ≤ √k} Σ_{t=1}^{s} |Y_{i_t j_t} − X_{i_t j_t}|. Additionally, if we assume that s ≥ n log(n), then in the excess risk bound, we can reduce the term log(n) to √(log(n)).\n\nProof sketch for Theorem 3. 
The main idea is to use the convex hull formulation from Theorem 2 to show that, for any X with ||X||_(R,C) ≤ √k, there exists a decomposition X = X′ + X′′ with ||X′||_max ≤ O(√k) and ||X′′||_tr(p̃) ≤ O(√(k/γ)), where p̃ represents the smoothed marginals with smoothing parameter ζ = 1/2 as in (3). We then apply known bounds on the Rademacher complexity of the max norm unit ball [1] and the smoothed weighted trace norm unit ball [6] to bound the Rademacher complexity of { X : ||X||_(R,C) ≤ √k }. This then yields a learning guarantee by Theorem 8 of Bartlett and Mendelson [13]. Details are given in the Supplementary Materials.\nAs special cases of this theorem, we can re-derive the existing results for the max norm and the smoothed weighted trace norm. Specifically, choosing γ = ∞ gives us an excess error term of order √(kn/s) for the max norm, previously shown by [1], while setting γ = 1 yields an excess error term of order √(kn·log(n)/s) for the smoothed weighted trace norm as long as s ≥ n log(n), as shown in [6].\nWhat advantage does this new result offer over the existing results for the max norm and for the smoothed weighted trace norm? To simplify the comparison, suppose we choose γ = log²(n), and define R = R×_{1/2,γ} and C = C×_{1/2,γ}. 
Table 1: Matrix fitting for the five methods used in experiments.\n\nNorm | Fixed parameters | Free parameters\nMax norm | ζ arbitrary; τ = 1 | λ\n(Uniform) trace norm | ζ = 1; τ = 0 | λ\nEmpirically-weighted trace norm | ζ = 0; τ = 0 | λ\nArbitrarily-smoothed emp.-wtd. trace norm | τ = 0 | ζ; λ\nLocal max norm | — | ζ; τ; λ\n\nFigure 1: Simulation results for matrix reconstruction with a rank-2 (left) or rank-4 (right) signal, corrupted by noise. The plot shows per-entry squared error averaged over 50 trials, with standard error bars. For the rank-4 experiment, max norm error exceeded 0.20 for each n = 60, 120, 240 and is not displayed in the plot.\n\nThen, comparing to the max norm result (when γ = ∞), we see that the excess error term is the same in both cases (up to a constant), but the approximation error term may in general be much lower for the local max norm than for the max norm. Comparing next to the weighted trace norm (when γ = 1), we see that the excess error term is lower by a factor of log(n) for the local max norm. This may come at a cost of increasing the approximation error, but in general this increase will be very small. 
In particular, the local max norm result allows us to give a meaningful guarantee for a sample size s = Θ(kn), rather than requiring s ≥ Θ(kn·log(n)) as for any trace norm result, but with a hypothesis class significantly richer than the max norm constrained class (though not as rich as the trace norm constrained class).\n\n5 Experiments\n\nWe test the local max norm on simulated and real matrix reconstruction tasks, and compare its performance to the max norm, the uniform and empirically-weighted trace norms, and the smoothed empirically-weighted trace norm.\n\n5.1 Simulations\n\nWe simulate n × n noisy matrices for n = 30, 60, 120, 240, where the underlying signal has rank k = 2 or k = 4, and we observe s = 3kn entries (chosen uniformly without replacement). We performed 50 trials for each of the 8 combinations of (n, k).\n\nData. For each trial, we randomly draw a matrix U ∈ R^{n×k} by drawing each row uniformly at random from the unit sphere in R^k. We generate V ∈ R^{n×k} similarly. We set Y = UV^T + σ · Z, where the noise matrix Z has i.i.d. standard normal entries and σ = 0.3 is a moderate noise level. We also divide the n^2 entries of the matrix into sets S0 ⊔ S1 ⊔ S2, which consist of s = 3kn training entries, s validation entries, and n^2 − 2s test entries, respectively, chosen uniformly at random.\n\nMethods. We use the two-parameter family of norms defined in (7), but replacing the true marginals p_{i•} and p_{•j} with the empirical marginals p̂_{i•} and p̂_{•j}. For each ζ, τ ∈ {0, 0.1, . . . , 0.9, 1} and each penalty parameter value λ ∈ {2^1, 2^2, . . . , 2^10}, we compute the fitted matrix\n\nX̂ = arg min_X { Σ_{(i,j)∈S0} (Y_ij − X_ij)^2 + λ · ||X||_(R_{ζ,τ}, C_{ζ,τ}) } .   (10)\n\n(In fact, we use a rank-8 approximation to this optimization problem, as described in Section 3.) 
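The data-generation and splitting protocol just described can be sketched as follows (assuming Python with numpy; this is our own illustration of the described procedure, not the authors' experiment code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma = 60, 2, 0.3

def random_unit_rows(rows, cols, rng):
    # Each row drawn uniformly at random from the unit sphere in R^cols.
    M = rng.standard_normal((rows, cols))
    return M / np.linalg.norm(M, axis=1, keepdims=True)

U = random_unit_rows(n, k, rng)
V = random_unit_rows(n, k, rng)
Y = U @ V.T + sigma * rng.standard_normal((n, n))  # rank-k signal plus noise

# Split the n^2 entries into s training, s validation, and n^2 - 2s test
# entries, chosen uniformly at random without replacement.
s = 3 * k * n
idx = rng.permutation(n * n)
S0, S1, S2 = idx[:s], idx[s:2 * s], idx[2 * s:]
assert len(S0) == len(S1) == s and len(S2) == n * n - 2 * s
```

The training indices S0 would then feed the data-fit term in (10), with S1 used for parameter selection and S2 held out for test error.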
For each method, we use S1 to select the best ζ, τ, and λ, with restrictions on ζ and/or τ as specified by the definition of the method (see Table 1), then report the error of the resulting fitted matrix on S2.\n\nResults. The results for these simulations are displayed in Figure 1. We see that the local max norm results in lower error than any of the tested existing norms, across all the settings used.\n\nTable 2: Root mean squared error (RMSE) results for estimating movie ratings on Netflix and MovieLens data using a rank-30 model. Setting τ = 0 corresponds to the uniform, weighted, or smoothed weighted trace norm (depending on ζ), while τ = 1 corresponds to the max norm for any ζ value.\n\nMovieLens\nζ | τ = 0.00 | τ = 0.05 | τ = 0.10 | τ = 1.00\n0.00 | 0.7852 | 0.7827 | 0.7838 | 0.7918\n0.05 | 0.7836 | 0.7822 | 0.7842 | —\n0.10 | 0.7831 | 0.7837 | 0.7846 | —\n0.15 | 0.7833 | 0.7842 | 0.7854 | —\n0.20 | 0.7842 | 0.7853 | 0.7866 | —\n1.00 | 0.7997 | — | — | —\n\nNetflix\nζ | τ = 0.00 | τ = 0.05 | τ = 0.10 | τ = 1.00\n0.00 | 0.9107 | 0.9092 | 0.9094 | 0.9131\n0.05 | 0.9095 | 0.9090 | 0.9107 | —\n0.10 | 0.9096 | 0.9098 | 0.9122 | —\n0.15 | 0.9102 | 0.9111 | 0.9131 | —\n0.20 | 0.9126 | 0.9344 | 0.9153 | —\n1.00 | 0.9235 | — | — | —\n\n5.2 Movie ratings data\n\nWe next compare several different matrix norms on two collaborative filtering movie ratings datasets, the Netflix [14] and MovieLens [15] datasets. 
The sizes of the data sets, and the split of the ratings into training, validation and test sets³, are:\n\nDataset | # users | # movies | Training set | Validation set | Test set\nNetflix | 480,189 | 17,770 | 100,380,507 | 100,000 | 1,408,395\nMovieLens | 71,567 | 10,681 | 8,900,054 | 100,000 | 1,000,000\n\nWe test the local max norm given in (7) with ζ ∈ {0, 0.05, 0.1, 0.15, 0.2} and τ ∈ {0, 0.05, 0.1}. We also test τ = 1 (the max norm; here ζ is arbitrary) and ζ = 1, τ = 0 (the uniform trace norm). We follow the test protocol of [6], with a rank-30 approximation to the optimization problem (10).\nTable 2 shows root mean squared error (RMSE) for the experiments. For both the MovieLens and Netflix data, the local max norm with τ = 0.05 and ζ = 0.05 gives strictly better accuracy than any previously-known norm studied in this setting. (In practice, we can use a validation set to reliably select good values for the τ and ζ parameters⁴.) For the MovieLens data, the local max norm achieves an RMSE of 0.7822, compared to 0.7831 achieved by the smoothed empirically-weighted trace norm with ζ = 0.10, which gives the best result among the previously-known norms. For the Netflix dataset, the local max norm achieves an RMSE of 0.9090, improving upon the previous best result of 0.9095 achieved by the smoothed empirically-weighted trace norm [6].\n\n6 Summary\n\nIn this paper, we introduce a unifying family of matrix norms, called the “local max” norms, that generalizes existing methods for matrix reconstruction, such as the max norm and trace norm. We examine some interesting sub-families of local max norms, and consider several different options for interpolating between the trace (or smoothed weighted trace) and max norms. 
We find norms lying strictly between the trace norm and the max norm that give improved accuracy in matrix reconstruction on both simulated data and real movie ratings data. We show that regularizing with any local max norm is fairly simple to optimize, and give a theoretical result suggesting improved matrix reconstruction using new norms in this family.

Acknowledgements

R.F. is supported by NSF grant DMS-1203762. R.S. is supported by NSERC and an Early Researcher Award.

³ For Netflix, the test set we use is their "qualification set", designed for a more uniform distribution of ratings across users relative to the training set. For MovieLens, we choose our test set at random from the available data.

⁴ To check this, we subsample half of the test data at random and use it as a validation set to choose (ζ, τ) for each method (as specified in Table 1). We then evaluate error on the remaining half of the test data. For MovieLens, the local max norm gives an RMSE of 0.7820 with selected parameter values ζ = τ = 0.05, as compared to an RMSE of 0.7829 with selected smoothing parameter ζ = 0.10 for the smoothed weighted trace norm. For Netflix, the local max norm gives an RMSE of 0.9093 with ζ = τ = 0.05, while the smoothed weighted trace norm gives an RMSE of 0.9098 with ζ = 0.05. The other tested methods give higher error on both datasets.

References

[1] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In 18th Annual Conference on Learning Theory (COLT), pages 545–560, 2005.

[2] R. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. Journal of Machine Learning Research, 11:2057–2078, 2010.

[3] S. Negahban and M. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. arXiv:1009.2118, 2010.

[4] R. Foygel and N. Srebro.
Concentration-based guarantees for low-rank matrix reconstruction. In 24th Annual Conference on Learning Theory (COLT), 2011.

[5] R. Salakhutdinov and N. Srebro. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm. Advances in Neural Information Processing Systems, 23, 2010.

[6] R. Foygel, R. Salakhutdinov, O. Shamir, and N. Srebro. Learning with the weighted trace-norm under arbitrary sampling distributions. Advances in Neural Information Processing Systems, 24, 2011.

[7] J. Lee, B. Recht, R. Salakhutdinov, N. Srebro, and J. Tropp. Practical large-scale optimization for max-norm regularization. Advances in Neural Information Processing Systems, 23, 2010.

[8] E. Hazan, S. Kale, and S. Shalev-Shwartz. Near-optimal algorithms for online matrix prediction. In 25th Annual Conference on Learning Theory (COLT), 2012.

[9] M. Fazel, H. Hindi, and S. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the 2001 American Control Conference, volume 6, pages 4734–4739, 2002.

[10] N. Srebro, J.D.M. Rennie, and T.S. Jaakkola. Maximum-margin matrix factorization. Advances in Neural Information Processing Systems, 18, 2005.

[11] J.D.M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning, pages 713–719. ACM, 2005.

[12] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. Advances in Neural Information Processing Systems, 20:1257–1264, 2008.

[13] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

[14] J. Bennett and S. Lanning. The Netflix prize. In Proceedings of KDD Cup and Workshop, volume 2007, page 35. Citeseer, 2007.

[15] MovieLens Dataset.
Available at http://www.grouplens.org/node/73, 2006.