{"title": "Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 845, "page_last": 853, "abstract": "We propose a novel class of algorithms for low rank matrix completion. Our approach builds on novel penalty functions on the singular values of the low rank matrix. By exploiting a mixture model representation of this penalty, we show that a suitably chosen set of latent variables enables to derive an Expectation-Maximization algorithm to obtain a Maximum A Posteriori estimate of the completed low rank matrix. The resulting algorithm is an iterative soft-thresholded algorithm which iteratively adapts the shrinkage coefficients associated to the singular values. The algorithm is simple to implement and can scale to large matrices. We provide numerical comparisons between our approach and recent alternatives showing the interest of the proposed approach for low rank matrix completion.", "full_text": "Probabilistic Low-Rank Matrix Completion with\n\nAdaptive Spectral Regularization Algorithms\n\nAdrien Todeschini\n\nINRIA - IMB - Univ. Bordeaux\n\n33405 Talence, France\n\nAdrien.Todeschini@inria.fr\n\nFranc\u00b8ois Caron\n\nUniv. Oxford, Dept. of Statistics\n\nOxford, OX1 3TG, UK\n\nCaron@stats.ox.ac.uk\n\nMarie Chavent\n\nUniv. Bordeaux - IMB - INRIA\n\n33000 Bordeaux, France\n\nMarie.Chavent@u-bordeaux2.fr\n\nAbstract\n\nWe propose a novel class of algorithms for low rank matrix completion. Our ap-\nproach builds on novel penalty functions on the singular values of the low rank\nmatrix. By exploiting a mixture model representation of this penalty, we show\nthat a suitably chosen set of latent variables enables to derive an Expectation-\nMaximization algorithm to obtain a Maximum A Posteriori estimate of the com-\npleted low rank matrix. 
The resulting algorithm is an iterative soft-thresholding algorithm that adaptively updates the shrinkage coefficients associated with the singular values. The algorithm is simple to implement and can scale to large matrices. We provide numerical comparisons between our approach and recent alternatives, demonstrating the benefits of the proposed approach for low-rank matrix completion.\n\n1 Introduction\n\nMatrix completion has attracted a lot of attention over the past few years. The objective is to “complete” a matrix of potentially large dimension based on a small (and potentially noisy) subset of its entries [1, 2, 3]. One popular application is to build automatic recommender systems, where the rows correspond to users, the columns to items, and entries may be ratings or binary (like/dislike). The objective is then to predict user preferences from a subset of the entries.\n\nIn many cases, it is reasonable to assume that the unknown m × n matrix Z can be approximated by a matrix of low rank Z ≈ ABᵀ, where A and B are respectively of size m × k and n × k, with k ≪ min(m, n). In the recommender system application, the low-rank assumption is sensible as it is commonly believed that only a few factors contribute to users' preferences. The low-rank structure thus implies some sort of collaboration between the different users/items [4].\n\nWe typically observe a noisy version Xij of some entries (i, j) ∈ Ω where Ω ⊂ {1, . . . , m} × {1, . . . , n}. For (i, j) ∈ Ω\n\nXij = Zij + εij,  εij iid∼ N(0, σ²)  (1)\n\nwhere σ² > 0 and N(μ, σ²) is the normal distribution of mean μ and variance σ². Low-rank matrix completion can be addressed by solving the following optimization problem\n\nminimize_Z  (1/(2σ²)) ∑_{(i,j)∈Ω} (Xij − Zij)² + λ rank(Z)  (2)\n\nwhere λ > 0 is some regularization parameter. For general subsets Ω, the optimization problem (2) is computationally hard, and many authors have advocated the use of a convex relaxation of (2) [5, 6, 4], yielding the following convex optimization problem\n\nminimize_Z  (1/(2σ²)) ∑_{(i,j)∈Ω} (Xij − Zij)² + λ‖Z‖∗  (3)\n\nwhere ‖Z‖∗ is the nuclear norm of Z, i.e. the sum of the singular values of Z. [4] proposed an iterative algorithm, called Soft-Impute, for solving the nuclear norm regularized minimization (3).\n\nIn this paper, we show that the solution to the objective function (3) can be interpreted as a Maximum A Posteriori (MAP) estimate when assuming that the singular values of Z are independently and identically distributed (iid) draws from an exponential distribution with rate λ. Using this Bayesian interpretation, we propose alternative concave penalties to the nuclear norm, obtained by considering that the singular values are iid from a mixture of exponential distributions. We show that this class of penalties bridges the gap between the nuclear norm and the rank penalty, and that a simple Expectation-Maximization (EM) algorithm can be derived to obtain MAP estimates. The resulting algorithm iteratively adapts the shrinkage coefficients associated with the singular values. It can be seen as the matrix counterpart of reweighted ℓ1 algorithms [6] for multivariate linear regression. Interestingly, we show that the Soft-Impute algorithm of [4] is obtained as a particular case. We also discuss the extension of our algorithms to binary matrices, building on the same set of ideas, in the supplementary material. 
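The observation model (1) and the masking by Ω can be sketched in NumPy as follows (the sizes, rank and noise level below are illustrative choices, not values taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, q = 100, 100, 5          # matrix sizes and true rank (illustrative)
sigma = 1.0                    # noise standard deviation

# Low-rank matrix Z = A B^T with A of size m x q and B of size n x q
A = rng.standard_normal((m, q))
B = rng.standard_normal((n, q))
Z = A @ B.T

# Noisy observations X_ij = Z_ij + eps_ij on a random subset Omega
X = Z + sigma * rng.standard_normal((m, n))
mask = rng.random((m, n)) < 0.5    # True for (i, j) in Omega (here 50% observed)
X_obs = np.where(mask, X, 0.0)     # P_Omega(X): entries outside Omega set to zero
```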
Finally, we provide empirical evidence of the benefits of the proposed approach on simulated and real data.\n\n2 Complete matrix X\n\nConsider first that we observe the complete matrix X of size m × n. Let r = min(m, n). We consider the following convex optimization problem\n\nminimize_Z  (1/(2σ²)) ‖X − Z‖²_F + λ‖Z‖∗  (4)\n\nwhere ‖·‖_F is the Frobenius norm. The solution to Eq. (4) in the complete case is a soft-thresholded singular value decomposition (SVD) of X [7, 4], i.e.\n\nẐ = S_{λσ²}(X)\n\nwhere S_λ(X) = Ũ D̃_λ Ṽᵀ with D̃_λ = diag((d̃_1 − λ)_+, . . . , (d̃_r − λ)_+) and t_+ = max(t, 0). X = Ũ D̃ Ṽᵀ is the singular value decomposition of X with D̃ = diag(d̃_1, . . . , d̃_r).\n\nThe solution Ẑ to the optimization problem (4) can be interpreted as the Maximum A Posteriori estimate under the likelihood (1) and prior\n\np(Z) ∝ exp(−λ‖Z‖∗)\n\nAssuming Z = UDVᵀ, with D = diag(d_1, d_2, . . . , d_r), this can be further decomposed as\n\np(Z) = p(U) p(V) p(D)\n\nwhere we assume a uniform Haar prior distribution on the unitary matrices U and V, and exponential priors on the singular values d_i, hence\n\np(d_1, . . . , d_r) = ∏_{i=1}^{r} Exp(d_i; λ)  (5)\n\nwhere Exp(x; λ) = λ exp(−λx) is the probability density function (pdf) of the exponential distribution of parameter λ evaluated at x. The exponential distribution has a mode at 0, hence favoring sparse solutions.\n\nWe propose here alternative penalty/prior distributions that bridge the gap between the rank and the nuclear norm penalties. 
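The soft-thresholded SVD solving the complete-case problem (4) can be sketched as follows (the function name is ours; NumPy is used for the SVD):

```python
import numpy as np

def soft_thresholded_svd(X, lam):
    """Return S_lam(X): shrink each singular value of X by lam, truncating at 0."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    d_shrunk = np.maximum(d - lam, 0.0)
    return U @ np.diag(d_shrunk) @ Vt

# With lam = 0 the operator is the identity; larger lam reduces the rank.
X = np.random.default_rng(1).standard_normal((20, 15))
Z0 = soft_thresholded_svd(X, 0.0)
```

Increasing `lam` monotonically shrinks the spectrum, so the rank of the output is non-increasing in `lam`.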
Our penalties are based on hierarchical Bayes constructions, and the related optimization problems to obtain MAP estimates can be solved using an EM algorithm.\n\n2.1 Hierarchical adaptive spectral penalty\n\nWe consider the following hierarchical prior for the low-rank matrix Z. We still assume that Z = UDVᵀ, where the unitary matrices U and V are assigned uniform priors and D = diag(d_1, . . . , d_r). We now assume that each singular value d_i has its own regularization parameter γ_i:\n\np(d_1, . . . , d_r|γ_1, . . . , γ_r) = ∏_{i=1}^{r} p(d_i|γ_i) = ∏_{i=1}^{r} Exp(d_i; γ_i)\n\nWe further assume that the regularization parameters are themselves iid from a gamma distribution\n\np(γ_1, . . . , γ_r) = ∏_{i=1}^{r} p(γ_i) = ∏_{i=1}^{r} Gamma(γ_i; a, b)\n\nwhere Gamma(γ_i; a, b) is the pdf of the gamma distribution of parameters a > 0 and b > 0 evaluated at γ_i. The marginal distribution over d_i is thus a continuous mixture of exponential distributions\n\np(d_i) = ∫_0^∞ Exp(d_i; γ_i) Gamma(γ_i; a, b) dγ_i = a bᵃ / (d_i + b)^(a+1)  (6)\n\nIt is a Pareto distribution, which has heavier tails than the exponential distribution. Figure 1 shows the marginal distribution p(d_i) for a = b = β. The lower β, the heavier the tails of the distribution. When β → ∞, one recovers the exponential distribution of unit rate parameter.\n\nFigure 1: Marginal distribution p(d_i) with a = b = β for different values of the parameter β. The distribution becomes more concentrated around zero with heavier tails as β decreases. The case β → ∞ corresponds to an exponential distribution with unit rate.\n\nFigure 2: Thresholding rules on the singular values d̃_i of X for the soft-thresholding rule (λ = 1), and the hierarchical adaptive soft-thresholding algorithm with a = b = β. 
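The mixture representation (6) can be sanity-checked by Monte Carlo: drawing γ from a Gamma(a, b) and then d from an Exp(γ) yields draws from the Pareto marginal, whose mean is b/(a − 1) when a > 1 (the hyperparameter values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 3.0, 2.0                 # illustrative hyperparameters (a > 1 so the mean exists)
n_samples = 200_000

# Hierarchy: gamma_i ~ Gamma(a, rate=b), then d_i ~ Exp(rate=gamma_i)
gammas = rng.gamma(shape=a, scale=1.0 / b, size=n_samples)
d = rng.exponential(scale=1.0 / gammas)

# Marginal p(d) = a * b**a / (d + b)**(a + 1), a Pareto (Lomax) with mean b / (a - 1)
print(abs(d.mean() - b / (a - 1)))
```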
Let\n\npen(Z) = − log p(Z) = − ∑_{i=1}^{r} log p(d_i) = C_1 + ∑_{i=1}^{r} (a + 1) log(b + d_i)  (7)\n\nbe the penalty induced by the prior p(Z). We call the penalty (7) the Hierarchical Adaptive Spectral Penalty (HASP). Figure 3 (top) represents the balls of constant penalty for a symmetric 2 × 2 matrix, for the HASP, nuclear norm and rank penalties. When the matrix is assumed to be diagonal, one recovers respectively the lasso, hierarchical adaptive lasso (HAL) [6, 8] and ℓ0 penalties, as shown in Figure 3 (bottom).\n\nThe penalty (7) admits as a special case the nuclear norm penalty λ‖Z‖∗ when a = λb and b → ∞. Another closely related penalty is the log-det heuristic [5, 9], defined for a square matrix Z by log det(Z + δI) where δ is some small regularization constant. Both penalties agree on square matrices when a = b = 0 and δ = 0.\n\n2.2 EM algorithm for MAP estimation\n\nUsing the exponential mixture representation (6), we now show how to derive an EM algorithm [10] to obtain a MAP estimate\n\nẐ = arg max_Z [log p(X|Z) + log p(Z)]\n\ni.e. to minimize\n\nL(Z) = (1/(2σ²)) ‖X − Z‖²_F + ∑_{i=1}^{r} (a + 1) log(b + d_i)  (8)\n\nFigure 3: Top: Manifold of constant penalty, for a symmetric 2 × 2 matrix Z = [x, y; y, z] for (a) the nuclear norm, (b-c) the hierarchical adaptive spectral penalty with a = b = β, (b) β = 1 and (c) β = 0.1, and (d) the rank penalty. 
Bottom: contours of constant penalty for a diagonal matrix [x, 0; 0, z], where one recovers the classical (e) lasso, (f-g) hierarchical adaptive lasso and (h) ℓ0 penalties.\n\nWe use the parameters γ = (γ_1, . . . , γ_r) as latent variables in the EM algorithm. The E step is obtained by\n\nQ(Z, Z*) = E[log p(X, Z, γ)|Z*, X] = C_2 − (1/(2σ²)) ‖X − Z‖²_F − ∑_{i=1}^{r} E[γ_i|d*_i] d_i\n\nHence at each iteration of the EM algorithm, the M step consists in solving the optimization problem\n\nminimize_Z  (1/(2σ²)) ‖X − Z‖²_F + ∑_{i=1}^{r} ω_i d_i  (9)\n\nwhere ω_i = E[γ_i|d*_i] = ∂/∂d*_i [− log p(d*_i)] = (a + 1)/(b + d*_i). Problem (9) is an adaptive nuclear norm regularized optimization problem with weights ω_i. Without loss of generality, assume that d*_1 ≥ d*_2 ≥ . . . ≥ d*_r. This implies that\n\n0 ≤ ω_1 ≤ ω_2 ≤ . . . ≤ ω_r  (10)\n\nThe above weights will therefore penalize higher singular values less heavily, hence reducing bias. As shown by [11, 12], a global optimal solution to Eq. (9) under the order constraint (10) is given by a weighted soft-thresholded SVD\n\nẐ = S_{σ²ω}(X)  (11)\n\nwhere S_ω(X) = Ũ D̃_ω Ṽᵀ with D̃_ω = diag((d̃_1 − ω_1)_+, . . . , (d̃_r − ω_r)_+). X = Ũ D̃ Ṽᵀ is the SVD of X with D̃ = diag(d̃_1, . . . , d̃_r) and d̃_1 ≥ d̃_2 ≥ . . . ≥ d̃_r.\n\nAlgorithm 1 summarizes the Hierarchical Adaptive Soft Thresholded (HAST) procedure to converge to a local minimum of the objective (8). This algorithm admits the soft-thresholded SVD operator as a special case when a = bλ and b = β → ∞. 
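One EM iteration — the E-step weights ω_i = (a + 1)/(b + d*_i) followed by the weighted soft-thresholded SVD (11) — can be sketched as follows (the function name and σ² handling are ours):

```python
import numpy as np

def hast_step(X, d_star, a, b, sigma2=1.0):
    """One HAST iteration: weights w_i = (a+1)/(b + d*_i), then weighted shrinkage."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    w = (a + 1.0) / (b + d_star)           # nondecreasing when d* is sorted decreasingly
    d_new = np.maximum(d - sigma2 * w, 0.0)
    return U @ np.diag(d_new) @ Vt, d_new

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 20))
d0 = np.linalg.svd(X, compute_uv=False)    # initialize d* with the singular values of X
Z1, d1 = hast_step(X, d0, a=2.0, b=2.0)
```

Because the singular values returned by the SVD are sorted in decreasing order, the weights satisfy the order constraint (10) automatically.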
Figure 2 shows the thresholding rule applied to the singular values of X for the HAST algorithm (a = b = β, with β = 2 and β = 0.1) and the soft-thresholded SVD for λ = 1. The bias term, which is equal to λ for the nuclear norm, goes to zero as d̃_i goes to infinity.\n\nSetting of the hyperparameters and initialization of the EM algorithm. In the experiments, we have set b = β and a = λβ, where λ and β are tuning parameters that can be chosen by cross-validation. As λ is the mean value of the regularization parameter γ_i, we initialize the algorithm with the soft-thresholded SVD with parameter σ²λ. It is possible to estimate the hyperparameter σ within the EM algorithm, as described in the supplementary material. In our experiments, we found the results not very sensitive to the setting of σ, and set it to 1.\n\nAlgorithm 1 Hierarchical Adaptive Soft Thresholded (HAST) algorithm for low-rank estimation of complete matrices\nInitialize Z^(0). At iteration t ≥ 1:\n• For i = 1, . . . , r, compute the weights ω_i^(t) = (a + 1)/(b + d_i^(t−1))\n• Set Z^(t) = S_{σ²ω^(t)}(X)\n• If (L(Z^(t−1)) − L(Z^(t)))/L(Z^(t−1)) < ε then return Ẑ = Z^(t)\n\n3 Matrix completion\n\nWe now show how the EM algorithm derived in the previous section can be adapted to the case where only a subset of the entries is observed. It relies on imputing missing values, similarly to the EM algorithm for SVD with missing data; see e.g. [10, 13].\n\nConsider that only a subset Ω ⊂ {1, . . . , m} × {1, . . . 
, n} of the entries of the matrix X is observed. Similarly to [7], we introduce the operator P_Ω(X) and its complement P⊥_Ω(X):\n\nP_Ω(X)(i, j) = Xij if (i, j) ∈ Ω, and 0 otherwise;  P⊥_Ω(X)(i, j) = 0 if (i, j) ∈ Ω, and Xij otherwise\n\nAssuming the same prior (6), the MAP estimate is obtained by minimizing\n\nL(Z) = (1/(2σ²)) ‖P_Ω(X) − P_Ω(Z)‖²_F + (a + 1) ∑_{i=1}^{r} log(b + d_i)  (12)\n\nWe now derive the EM algorithm, using latent variables γ and P⊥_Ω(X). The E step is given by (details in the supplementary material)\n\nQ(Z, Z*) = E[log p(P_Ω(X), P⊥_Ω(X), Z, γ)|Z*, P_Ω(X)] = C_4 − (1/(2σ²)) ‖P_Ω(X) + P⊥_Ω(Z*) − Z‖²_F − ∑_{i=1}^{r} E[γ_i|d*_i] d_i\n\nHence at each iteration of the algorithm, one needs to minimize\n\n(1/(2σ²)) ‖X* − Z‖²_F + ∑_{i=1}^{r} ω_i d_i  (13)\n\nwhere ω_i = E[γ_i|d*_i] and X* = P_Ω(X) + P⊥_Ω(Z*) is the observed matrix, completed with entries in Z*. We now have a complete-matrix problem. As mentioned in the previous section, the minimum of (13) is obtained with a weighted soft-thresholded SVD. Algorithm 2 gives the resulting iterative procedure for matrix completion with the hierarchical adaptive spectral penalty.\n\nAlgorithm 2 Hierarchical Adaptive Soft Impute (HASI) algorithm for matrix completion\nInitialize Z^(0). At iteration t ≥ 1:\n• For i = 1, . . . 
, r, compute the weights ω_i^(t) = (a + 1)/(b + d_i^(t−1))\n• Set Z^(t) = S_{σ²ω^(t)}(P_Ω(X) + P⊥_Ω(Z^(t−1)))\n• If (L(Z^(t−1)) − L(Z^(t)))/L(Z^(t−1)) < ε then return Ẑ = Z^(t)\n\nRelated algorithms. Algorithm 2 admits the Soft-Impute algorithm of [4] as a special case when a = λb and b = β → ∞. In this case, one obtains at each iteration ω_i^(t) = λ for all i. On the contrary, when β < ∞, our algorithm adaptively updates the weights so as to penalize higher singular values less heavily. Some authors have proposed related one-step adaptive spectral penalty algorithms [14, 11, 12]. However, in these procedures the weights have to be chosen by some external procedure, whereas in our case they are iteratively adapted.\n\nInitialization. The objective function (12) is in general not convex, and different initializations may lead to different modes. As in the complete case, we suggest setting a = λb and b = β, and initializing the algorithm with the Soft-Impute algorithm with regularization parameter σ²λ.\n\nScaling. Similarly to the Soft-Impute algorithm, the computationally demanding part of Algorithm 2 is S_{σ²ω^(t)}(P_Ω(X) + P⊥_Ω(Z^(t−1))), which requires calculating a low-rank truncated SVD. For large matrices, one can resort to the PROPACK algorithm [15, 16] as described in [4]. 
This sophisticated linear algebra algorithm can efficiently compute the truncated SVD of the “sparse + low-rank” matrix\n\nP_Ω(X) + P⊥_Ω(Z^(t−1)) = [P_Ω(X) − P_Ω(Z^(t−1))] (sparse) + Z^(t−1) (low rank)\n\nand can thus handle large matrices, as shown in [4].\n\n4 Experiments\n\n4.1 Simulated data\n\nWe first evaluate the performance of the proposed approach on simulated data. Our simulation setting is similar to that of [4]. We generate Gaussian matrices A and B, respectively of size m × q and n × q with q ≤ r, so that the matrix Z = ABᵀ is of low rank q. A Gaussian noise of variance σ² is then added to the entries of Z to obtain the matrix X. The signal-to-noise ratio is defined as SNR = √(var(Z)/σ²). We set m = n = 100 and σ = 1. We run all the algorithms with a precision ε = 10^{-9} and a maximum number of tmax = 200 iterations (initialization included for HASI). We compute err, the relative error between the estimated matrix Ẑ and the true matrix Z in the complete case, and err_Ω⊥ in the incomplete case, where\n\nerr = ‖Ẑ − Z‖²_F / ‖Z‖²_F  and  err_Ω⊥ = ‖P⊥_Ω(Ẑ) − P⊥_Ω(Z)‖²_F / ‖P⊥_Ω(Z)‖²_F\n\nFor the HASP penalty, we set a = λβ and b = β. We compute the solutions over a grid of 50 values of the regularization parameter λ, linearly spaced from λ_0 to 0, where λ_0 = ‖P_Ω(X)‖_2 is the largest singular value of the input matrix X padded with zeros. This is done for three different values β = 1, 10, 100. 
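A compact sketch of the HASI iterations of Algorithm 2, as used in these experiments, is given below (a simplified implementation under our own assumptions: a fixed iteration count instead of the relative-decrease stopping rule, a full SVD instead of PROPACK, and σ = 1):

```python
import numpy as np

def hasi(X, mask, a, b, n_iter=100, sigma2=1.0):
    """Hierarchical Adaptive Soft-Impute sketch: impute, reweight, shrink, repeat."""
    Z = np.zeros_like(X)
    d_prev = np.linalg.svd(np.where(mask, X, 0.0), compute_uv=False)
    for _ in range(n_iter):
        X_star = np.where(mask, X, Z)               # P_Omega(X) + P_Omega^perp(Z)
        U, d, Vt = np.linalg.svd(X_star, full_matrices=False)
        w = (a + 1.0) / (b + d_prev)                # adaptive weights from previous iterate
        d_prev = np.maximum(d - sigma2 * w, 0.0)    # weighted soft-thresholding
        Z = U @ np.diag(d_prev) @ Vt
    return Z

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 3))
B = rng.standard_normal((30, 3))
Z_true = A @ B.T
mask = rng.random(Z_true.shape) < 0.7               # 70% of entries observed
Z_hat = hasi(Z_true, mask, a=20.0, b=20.0, n_iter=50)
```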
We use the same grid to obtain the regularization path for the other algorithms.\n\nComplete case. We first consider that the observed matrix is complete, with SNR = 1 and q = 10. The HAST algorithm 1 is compared to the soft-thresholded (ST) and hard-thresholded (HT) SVD. Results are reported in Figure 4(a). The HASP penalty provides a bridge/tradeoff between the nuclear norm and the rank penalty. For example, the value β = 10 shows a minimum at the true rank q = 10, as HT does, but with a lower error when the rank is overestimated.\n\nFigure 4: Test error w.r.t. the rank obtained by varying the value of the regularization parameter λ. Results on simulated data are given for (a) a complete matrix with SNR = 1, (b) 50% missing entries and SNR = 1, and (c) 80% missing entries and SNR = 10.\n\nIncomplete case. We then consider the matrix completion problem and remove uniformly at random a given percentage of the entries in X. We compare the HASI algorithm to the Soft-Impute, Soft-Impute+ and Hard-Impute algorithms of [4] and to the MMMF algorithm of [17]. Results, averaged over 50 replications, are reported in Figures 4(b-c) for a true rank q = 5, with (b) 50% of missing data and SNR = 1 and (c) 80% of missing data and SNR = 10. Similar behavior is observed, with the HASI algorithm attaining a minimum at the true rank q = 5. We then conduct the same experiments, but remove 20% of the observed entries as a validation set to estimate the regularization parameters (λ, β) for HASI, and λ for the other methods. We estimate Z on the whole observed matrix, and use the unobserved entries as a test set. Results on the test error and estimated ranks over 50 replications are reported in Figure 5. For 50% missing data, HASI is shown to outperform the other methods. For 80% missing data, HASI and Hard-Impute provide the best performance. In both cases, HASI is able to recover the true rank of the matrix very accurately.\n\nFigure 5: Boxplots of the test error and ranks obtained over 50 replications on simulated data: (a) test error, SNR = 1, 50% missing; (b) ranks, SNR = 1, 50% missing; (c) test error, SNR = 10, 80% missing; (d) ranks, SNR = 10, 80% missing.\n\nTable 1: Results on the Jester and MovieLens datasets\n\n          | Jester 1    | Jester 2    | Jester 3    | MovieLens 100k | MovieLens 1M\n          | 24983 × 100 | 23500 × 100 | 24938 × 100 | 943 × 1682     | 6040 × 3952\n          | 27.5% miss. | 27.3% miss. | 75.3% miss. | 93.7% miss.    | 95.8% miss.\nMethod    | NMAE  Rank  | NMAE  Rank  | NMAE  Rank  | NMAE  Rank     | NMAE  Rank\nMMMF      | 0.161  96   | 0.162  95   | 0.183  58   | 0.195  50      | 0.169  30\nSoft Imp  | 0.161  100  | 0.162  100  | 0.184  78   | 0.197  156     | 0.176  30\nSoft Imp+ | 0.169  11   | 0.171  14   | 0.184  33   | 0.197  108     | 0.189  30\nHard Imp  | 0.158  6    | 0.159  7    | 0.181  4    | 0.190  7       | 0.175  8\nHASI      | 0.153  100  | 0.153  100  | 0.174  30   | 0.187  35      | 0.172  27\n\n4.2 Collaborative filtering examples\n\nWe now compare the different methods on several benchmark datasets. We first consider the Jester datasets [18]. The three datasets¹ contain one hundred jokes, with user ratings between -10 and +10. We randomly select two ratings per user as a test set, and two other ratings per user as a validation set to select the parameters λ and β. 
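The Normalized Mean Absolute Error used to compare the methods can be written directly (the function name and the small example data are ours):

```python
import numpy as np

def nmae(X, Z_hat, test_mask):
    """Normalized Mean Absolute Error over the test entries Omega_test."""
    diff = np.abs(X - Z_hat)[test_mask]
    return diff.mean() / (X.max() - X.min())

# Jester-style ratings lie in [-10, 10], so the normalizer is 20 (illustrative data)
X = np.array([[10.0, -10.0], [5.0, 0.0]])
Z_hat = np.array([[8.0, -10.0], [5.0, 2.0]])
mask = np.array([[True, False], [False, True]])
print(nmae(X, Z_hat, mask))  # mean absolute error 2 over a range of 20, i.e. 0.1
```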
The results are computed over four values β = 1000, 100, 10, 1. We compare the results of the different methods using the Normalized Mean Absolute Error (NMAE)\n\nNMAE = (1/card(Ω_test)) ∑_{(i,j)∈Ω_test} |Xij − Ẑij| / (max(X) − min(X))\n\nwhere Ω_test is the test set. The mean numbers of iterations for the Soft-Impute, Hard-Impute and HASI (initialization included) algorithms are respectively 9, 76 and 76. Computations for the HASI algorithm take approximately 5 hours on a standard computer. The results, averaged over 10 replications (with almost no variability observed), are presented in Table 1. The HASI algorithm provides very good performance on the different Jester datasets, with lower NMAE than the other methods.\n\nFigure 6 shows the NMAE as a function of the rank. Low values of β exhibit a bimodal behavior with two modes, at low rank and at full rank. The high value β = 1000 is unimodal and outperforms Soft-Impute at any particular rank.\n\nFigure 6: NMAE on the test set of the (a) Jester 1 and (b) Jester 3 datasets w.r.t. the rank obtained by varying the value of the regularization parameter λ. The curves obtained on the Jester 2 dataset are hardly distinguishable from (a) and hence are not displayed due to space limitations.\n\n¹Jester datasets can be downloaded from http://goldberg.berkeley.edu/jester-data/\n\nSecond, we conducted the same comparison on two MovieLens datasets², which contain ratings of movies by users. We randomly select 20% of the entries as a test set, and the remaining entries are split between a training set (80%) and a validation set (20%). For all the methods, we stop the regularization path as soon as the estimated rank exceeds rmax = 100. 
This is a practical consideration: given that the computations for high ranks demand more time and memory, we are interested in restricting ourselves to low-rank solutions. Table 1 presents the results, averaged over 5 replications. For the MovieLens 100k dataset, HASI provides a better NMAE than the other methods with a low-rank solution. For the larger MovieLens 1M dataset, the precision, maximum number of iterations and maximum rank are decreased to ε = 10^{-6}, tmax = 100 and rmax = 30. On this dataset, MMMF provides the best NMAE at maximum rank. HASI provides the second best performance with a slightly lower rank.\n\n5 Conclusion\n\nThe proposed class of methods has been shown to provide good results compared to several alternative low-rank matrix completion methods. It provides a bridge between nuclear norm and rank regularization algorithms. Although the related optimization problem is not convex, experiments show that initializing the algorithm with the Soft-Impute algorithm of [4] provides very satisfactory results. In this paper, we have focused on a gamma mixture of exponentials, as it leads to a simple and interpretable expression for the weights. It is however possible to generalize the results presented here by using a three-parameter generalized inverse Gaussian prior distribution (see e.g. [19]) for the regularization parameters γ_i, thus offering an additional degree of freedom. Derivations of the weights are provided in the supplementary material. Additionally, it is possible to derive an EM algorithm for low-rank matrix completion for binary matrices; details are also provided in the supplementary material.\n\nWhile we focus on point estimation in this paper, it would be of interest to investigate a fully Bayesian approach and derive a Gibbs sampler or variational algorithm to approximate the posterior distribution, and compare to other fully Bayesian approaches to matrix completion [20, 21].\n\nAcknowledgments\n\nF.C. 
acknowledges the support of the European Commission under the Marie Curie Intra-European Fellowship Programme. The contents reflect only the authors' views and not the views of the European Commission.\n\n²MovieLens datasets can be downloaded from http://www.grouplens.org/node/73.\n\nReferences\n\n[1] N. Srebro, J.D.M. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, volume 17, pages 1329–1336. MIT Press, 2005.\n\n[2] E.J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.\n\n[3] E.J. Candès and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.\n\n[4] R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research, 11:2287–2322, 2010.\n\n[5] M. Fazel. Matrix rank minimization with applications. PhD thesis, Stanford University, 2002.\n\n[6] E.J. Candès, M.B. Wakin, and S.P. Boyd. Enhancing sparsity by reweighted l1 minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, 2008.\n\n[7] J.F. Cai, E.J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.\n\n[8] Anthony Lee, Francois Caron, Arnaud Doucet, and Chris Holmes. A hierarchical Bayesian framework for constructing sparsity-inducing priors. arXiv preprint arXiv:1009.1914, 2010.\n\n[9] M. Fazel, H. Hindi, and S.P. Boyd. 
Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices. In Proceedings of the American Control Conference, 2003, volume 3, pages 2156–2162. IEEE, 2003.\n\n[10] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, pages 1–38, 1977.\n\n[11] S. Gaïffas and G. Lecué. Weighted algorithms for compressed sensing and matrix completion. arXiv preprint arXiv:1107.1638, 2011.\n\n[12] Kun Chen, Hongbo Dong, and Kung-Sik Chan. Reduced rank regression via adaptive nuclear norm penalization. Biometrika, 100(4):901–920, 2013.\n\n[13] N. Srebro and T. Jaakkola. Weighted low-rank approximations. In NIPS, volume 20, page 720, 2003.\n\n[14] F. Bach. Consistency of trace norm minimization. The Journal of Machine Learning Research, 9:1019–1048, 2008.\n\n[15] R.M. Larsen. Lanczos bidiagonalization with partial reorthogonalization. Technical report, DAIMI PB-357, 1998.\n\n[16] R.M. Larsen. PROPACK: software for large and sparse SVD calculations. Available online, http://sun.stanford.edu/rmunk/PROPACK, 2004.\n\n[17] J. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning, pages 713–719. ACM, 2005.\n\n[18] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: a constant time collaborative filtering algorithm. Information Retrieval, 4(2):133–151, 2001.\n\n[19] Z. Zhang, S. Wang, D. Liu, and M.I. Jordan. EP-GIG priors and applications in Bayesian sparse learning. The Journal of Machine Learning Research, 13:2031–2061, 2012.\n\n[20] M. Seeger and G. Bouchard. Fast variational Bayesian inference for non-conjugate matrix factorization models. In Proceedings of AISTATS, 2012.\n\n[21] S. Nakajima, M. Sugiyama, S. D. 
Babacan, and R. Tomioka. Global analytic solution of fully-observed variational Bayesian matrix factorization. Journal of Machine Learning Research, 14:1–37, 2013.\n", "award": [], "sourceid": 475, "authors": [{"given_name": "Adrien", "family_name": "Todeschini", "institution": "INRIA"}, {"given_name": "François", "family_name": "Caron", "institution": "University of Oxford"}, {"given_name": "Marie", "family_name": "Chavent", "institution": "Université de Bordeaux II & INRIA"}]}