{"title": "Scalable Robust Matrix Factorization with Nonconvex Loss", "book": "Advances in Neural Information Processing Systems", "page_first": 5061, "page_last": 5070, "abstract": "Robust matrix factorization (RMF), which uses the $\\ell_1$-loss, often outperforms standard matrix factorization using the $\\ell_2$-loss, particularly when outliers are present. The state-of-the-art RMF solver is the RMF-MM algorithm, which, however, cannot utilize data sparsity. Moreover, sometimes even the (convex) $\\ell_1$-loss is not robust enough. In this paper, we propose the use of nonconvex loss to enhance robustness. To address the resultant difficult optimization problem, we use majorization-minimization (MM) optimization and propose a new MM surrogate. To improve scalability, we exploit data sparsity and optimize the surrogate via its dual with the accelerated proximal gradient algorithm. The resultant algorithm has low time and space complexities and is guaranteed to converge to a critical point. Extensive experiments demonstrate its superiority over the state-of-the-art in terms of both accuracy and scalability.", "full_text": "Scalable Robust Matrix Factorization with\n\nNonconvex Loss\n\nQuanming Yao1,2, James T. Kwok2\n\n14Paradigm Inc. Beijing, China\n\nyaoquanming@4paradigm.com, jamesk@cse.ust.hk\n\n2Department of Computer Science and Engineering,\n\nHong Kong University of Science and Technology, Hong Kong\n\nAbstract\n\nMatrix factorization (MF), which uses the (cid:96)2-loss, and robust matrix factorization\n(RMF), which uses the (cid:96)1-loss, are sometimes not robust enough for outliers.\nMoreover, even the state-of-the-art RMF solver (RMF-MM) is slow and cannot\nutilize data sparsity. In this paper, we propose to improve robustness by using\nnonconvex loss functions. The resultant optimization problem is dif\ufb01cult. 
To improve efficiency and scalability, we use majorization-minimization (MM) and optimize the MM surrogate by running the accelerated proximal gradient algorithm on its dual problem. Data sparsity can also be exploited. The resultant algorithm has low time and space complexities, and is guaranteed to converge to a critical point. Extensive experiments show that it outperforms the state-of-the-art in terms of both accuracy and speed.

1 Introduction

Matrix factorization (MF) is a fundamental tool in machine learning, and an important component in many applications such as computer vision [1, 38], social networks [37] and recommender systems [30]. The square loss has been commonly used in MF [8, 30]. This implicitly assumes Gaussian noise, and is sensitive to outliers. Eriksson and van den Hengel [12] proposed robust matrix factorization (RMF), which uses the $\ell_1$-loss and obtains much better empirical performance. However, the resultant nonconvex nonsmooth optimization problem is much more difficult.

Most RMF solvers are not scalable [6, 12, 22, 27, 40]. The current state-of-the-art solver is RMF-MM [26], which is based on majorization-minimization (MM) [20, 24]. In each iteration, a convex nonsmooth surrogate is optimized. RMF-MM is advantageous in that it has theoretical convergence guarantees, and demonstrates fast empirical convergence [26]. However, it cannot utilize data sparsity. This is problematic in applications such as structure from motion [23] and recommender systems [30], where the data matrices, though large, are often sparse.

Though the $\ell_1$-loss used in RMF is more robust than the $\ell_2$-loss, it may still not be robust enough against outliers. Recently, better empirical performance has been obtained in total-variation image denoising by using the $\ell_0$-loss instead [35], and in sparse coding by using the capped-$\ell_1$ loss [21].
A similar observation has also been made for the $\ell_1$-regularizer in sparse learning and low-rank matrix learning [16, 38, 41]. To alleviate this problem, various nonconvex regularizers have been introduced. Examples include the Geman penalty [14], Laplace penalty [34], log-sum penalty (LSP) [9], minimax concave penalty (MCP) [39], and the smoothly-clipped-absolute-deviation (SCAD) penalty [13]. These regularizers are similar in shape to Tukey's biweight function in robust statistics [19], which flattens out for large values. Empirically, they achieve much better performance than $\ell_1$.

In this paper, we propose to improve the robustness of RMF by using these nonconvex functions (instead of $\ell_1$ or $\ell_2$) as the loss function. The resultant optimization problem is difficult, and existing RMF solvers cannot be used. As in RMF-MM, we rely on the more flexible MM optimization technique, and a new MM surrogate is proposed. To improve scalability, we transform the surrogate to its dual and then solve it with the accelerated proximal gradient (APG) algorithm [2, 32]. Data sparsity can also be exploited in the design of the APG algorithm. As for convergence analysis, the proof techniques of RMF-MM cannot be used, as the loss is no longer convex. Instead, we develop new proof techniques based on the Clarke subdifferential [10], and show that convergence to a critical point can be guaranteed. Extensive experiments on both synthetic and real-world data sets demonstrate the superiority of the proposed algorithm over the state-of-the-art in terms of both accuracy and scalability.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Notation. For a scalar $x$, $\mathrm{sign}(x) = 1$ if $x > 0$, $0$ if $x = 0$, and $-1$ otherwise. For a vector $x$, $\mathrm{Diag}(x)$ constructs a diagonal matrix $X$ with $X_{ii} = x_i$. For a matrix $X$, $\|X\|_F = (\sum_{i,j} X_{ij}^2)^{1/2}$ is its Frobenius norm, $\|X\|_1 = \sum_{i,j} |X_{ij}|$ is its $\ell_1$-norm, and $\mathrm{nnz}(X)$ is the number of nonzero elements in $X$. For a square matrix $X$, $\mathrm{tr}(X) = \sum_i X_{ii}$ is its trace. For two matrices $X, Y$, $X \odot Y$ denotes their element-wise product. For a smooth function $f$, $\nabla f$ is its gradient. For a convex $f$, $G \in \partial f(X) = \{U : f(Y) \ge f(X) + \mathrm{tr}(U^\top (Y - X)) \ \forall Y\}$ is a subgradient.

2 Related Work

2.1 Majorization-Minimization

Majorization-minimization (MM) is a general technique to make difficult optimization problems easier [20, 24]. Consider a function $h(X)$ which is hard to optimize. Let the iterate at the $k$th MM iteration be $X^k$. The next iterate is generated as $X^{k+1} = X^k + \arg\min_X f^k(X; X^k)$, where $f^k$ is a surrogate that is optimized instead of $h$. A good surrogate should have the following properties [24]: (i) $h(X^k + X) \le f^k(X; X^k)$ for any $X$; (ii) $0 = \arg\min_X (f^k(X; X^k) - h(X^k + X))$ and $h(X^k) = f^k(0; X^k)$; and (iii) $f^k$ is convex in $X$. MM only guarantees that the objectives obtained in successive iterations are non-increasing; it does not guarantee convergence of $\{X^k\}$ [20, 24].

2.2 Robust Matrix Factorization (RMF)

In matrix factorization (MF), the data matrix $M \in \mathbb{R}^{m \times n}$ is approximated by $UV^\top$, where $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$, and $r \ll \min(m, n)$ is the rank. In applications such as structure from motion (SfM) [1] and recommender systems [30], some entries of $M$ may be missing. In general, the MF problem can be formulated as: $\min_{U,V} \frac{1}{2} \|W \odot (M - UV^\top)\|_F^2 + \frac{\lambda}{2}(\|U\|_F^2 + \|V\|_F^2)$, where $W \in \{0,1\}^{m \times n}$ contains indices of the observed entries in $M$ (with $W_{ij} = 1$ if $M_{ij}$ is observed, and $0$ otherwise), and $\lambda \ge 0$ is a regularization parameter. The $\ell_2$-loss is sensitive to outliers.
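This sensitivity of the $\ell_2$-loss (and the relative robustness of the $\ell_1$-loss) can already be seen in the scalar case: fitting a single constant to data containing one gross outlier. The following is an illustrative sketch, not taken from the paper.

```python
import numpy as np

# Fit a single constant c to the data under each loss.
# argmin_c sum_i (c - a_i)^2 is the mean; argmin_c sum_i |c - a_i| is the median.
data = np.array([1.0, 1.1, 0.9, 1.0, 100.0])  # clean values near 1, one outlier

c_l2 = data.mean()      # l2 fit: dragged towards the outlier
c_l1 = np.median(data)  # l1 fit: stays with the inliers

print(c_l2, c_l1)  # prints 20.8 1.0
```

The $\ell_2$ fit is pulled an order of magnitude away from the inliers by a single corrupted entry, while the $\ell_1$ fit is unaffected.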
In [11], it is replaced by the $\ell_1$-loss, leading to robust matrix factorization (RMF):

$$\min_{U,V} \|W \odot (M - UV^\top)\|_1 + \frac{\lambda}{2}(\|U\|_F^2 + \|V\|_F^2). \quad (1)$$

Many RMF solvers have been developed [6, 7, 12, 18, 22, 26, 27, 40]. However, as the objective in (1) is neither convex nor smooth, these solvers lack scalability, robustness and/or convergence guarantees. Interested readers are referred to Section 2 of [26] for details.

Recently, the RMF-MM algorithm [26] solves (1) using MM. Let the $k$th iterate be $(U^k, V^k)$. RMF-MM tries to find increments $(\bar{U}, \bar{V})$ that should be added to obtain the target $(U, V)$:

$$U = U^k + \bar{U}, \quad V = V^k + \bar{V}. \quad (2)$$

Substituting into (1), the objective can be rewritten as $H^k(\bar{U}, \bar{V}) \equiv \|W \odot (M - (U^k + \bar{U})(V^k + \bar{V})^\top)\|_1 + \frac{\lambda}{2}\|U^k + \bar{U}\|_F^2 + \frac{\lambda}{2}\|V^k + \bar{V}\|_F^2$. The following Proposition constructs a surrogate $F^k$ of $H^k$ that satisfies properties (i) and (ii) in Section 2.1. Unlike $H^k$, $F^k$ is jointly convex in $(\bar{U}, \bar{V})$.

Proposition 2.1. [26] Let $\mathrm{nnz}(W_{(i,:)})$ (resp. $\mathrm{nnz}(W_{(:,j)})$) be the number of nonzero elements in the $i$th row (resp. $j$th column) of $W$, $\Lambda_r = \mathrm{Diag}(\sqrt{\mathrm{nnz}(W_{(1,:)})}, \ldots, \sqrt{\mathrm{nnz}(W_{(m,:)})})$, and $\Lambda_c = \mathrm{Diag}(\sqrt{\mathrm{nnz}(W_{(:,1)})}, \ldots, \sqrt{\mathrm{nnz}(W_{(:,n)})})$. Then, $H^k(\bar{U}, \bar{V}) \le F^k(\bar{U}, \bar{V})$, where

$$F^k(\bar{U}, \bar{V}) \equiv \|W \odot (M - U^k(V^k)^\top - \bar{U}(V^k)^\top - U^k\bar{V}^\top)\|_1 + \frac{\lambda}{2}\|U^k + \bar{U}\|_F^2 + \frac{\lambda}{2}\|V^k + \bar{V}\|_F^2 + \frac{1}{2}\|\Lambda_r \bar{U}\|_F^2 + \frac{1}{2}\|\Lambda_c \bar{V}\|_F^2. \quad (3)$$

Equality holds iff $(\bar{U}, \bar{V}) = (0, 0)$.

Because of the coupling of $\bar{U}, V^k$ (resp. $U^k, \bar{V}$) in $\bar{U}(V^k)^\top$ (resp. $U^k\bar{V}^\top$) in (3), $F^k$ is still difficult to optimize. To address this problem, RMF-MM uses the LADMPSAP algorithm [25], which is a multi-block variant of the alternating direction method of multipliers (ADMM) [3].

RMF-MM has a space complexity of $O(mn)$, and a time complexity of $O(mnrIK)$, where $I$ is the number of (inner) LADMPSAP iterations and $K$ is the number of (outer) RMF-MM iterations. These grow linearly with the matrix size, and can be expensive on large data sets. Besides, as discussed in Section 1, the $\ell_1$-loss may still be sensitive to outliers.

3 Proposed Algorithm

3.1 Use of a More Robust Nonconvex Loss

In this paper, we improve the robustness of RMF by using a general nonconvex loss instead of the $\ell_1$-loss. Problem (1) is then changed to:

$$\min_{U,V} \dot{H}(U, V) \equiv \sum_{i=1}^m \sum_{j=1}^n W_{ij}\, \phi(|M_{ij} - [UV^\top]_{ij}|) + \frac{\lambda}{2}(\|U\|_F^2 + \|V\|_F^2), \quad (4)$$

where $\phi$ is nonconvex. We make the following assumption on $\phi$:

Assumption 1. $\phi(\alpha)$ is concave, smooth and strictly increasing on $\alpha \ge 0$.

Assumption 1 is satisfied by many nonconvex functions, including the Geman, Laplace and LSP penalties mentioned in Section 1, and slightly modified variants of the MCP and SCAD penalties. Details can be found in Appendix A.
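Since the algorithm developed below is an MM scheme, it may help to see the MM principle of Section 2.1 in its simplest form. The sketch below is illustrative only, not the paper's algorithm: it majorizes the $\ell_1$-loss by a quadratic, $|t| \le t^2/(2|t^k|) + |t^k|/2$ (tight at $t = t^k$), which turns median computation into an iteratively-reweighted least-squares loop whose objective is monotonically non-increasing, exactly as MM guarantees.

```python
import numpy as np

def mm_median(a, x0=0.0, iters=50, eps=1e-8):
    """Minimize h(x) = sum_i |x - a_i| by MM.

    At iterate x_k, each |x - a_i| is majorized by the quadratic
    (x - a_i)^2 / (2 |x_k - a_i|) + |x_k - a_i| / 2, tight at x = x_k.
    The surrogate's minimizer is a weighted mean with weights 1/|x_k - a_i|.
    """
    x, objs = x0, []
    for _ in range(iters):
        objs.append(np.abs(x - a).sum())
        w = 1.0 / np.maximum(np.abs(x - a), eps)  # guard against division by zero
        x = (w * a).sum() / w.sum()               # minimizer of the quadratic surrogate
    return x, objs

a = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
x, objs = mm_median(a)  # objs is non-increasing; x approaches the median (3.0)
```

Note that, as stated in Section 2.1, MM only guarantees monotone decrease of the objective; convergence of the iterates needs a separate argument, which is what Section 3.4 develops for the proposed algorithm.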
Unlike previous papers [16, 38, 41], we use these nonconvex functions as the loss, not as the regularizer. The $\ell_1$-loss also satisfies Assumption 1, and thus (4) includes (1) as a special case.

When the $i$th row of $W$ is zero, the $i$th row of the obtained $U$ is zero because of the $\|U\|_F^2$ regularizer. Similarly, when the $j$th column of $W$ is zero, the corresponding column of $V$ is zero. To avoid such trivial solutions, we assume the following, as in matrix completion [8] and RMF-MM.

Assumption 2. $W$ has no zero row or column.

3.2 Constructing the Surrogate

Problem (4) is difficult to solve, and existing RMF solvers cannot be used as they rely crucially on the $\ell_1$-norm. In this Section, we use the more flexible MM technique as in RMF-MM. However, its surrogate construction scheme cannot be used here: RMF-MM uses the convex $\ell_1$-loss, and only needs to handle the nonconvexity resulting from the product $UV^\top$ in (1). Here, nonconvexity in (4) comes both from the loss and from $UV^\top$.

The following Proposition first obtains a convex upper bound of the nonconvex $\phi$ using a Taylor expansion. An illustration is shown in Figure 1. Note that this upper bound is simply a re-weighted $\ell_1$-loss, with scaling factor $\phi'(|\beta|)$ and offset $\phi(|\beta|) - \phi'(|\beta|)|\beta|$. As one may expect, recovery of the $\ell_1$ makes optimization easier. It is known that the LSP, when used as a regularizer, can be interpreted as re-weighted $\ell_1$ regularization [8]. Thus, Proposition 3.1 includes this as a special case.

Proposition 3.1.
For any given $\beta \in \mathbb{R}$, $\phi(|\alpha|) \le \phi'(|\beta|)|\alpha| + (\phi(|\beta|) - \phi'(|\beta|)|\beta|)$, and equality holds iff $\alpha = \pm\beta$.

[Figure 1: Upper bounds for the various nonconvex penalties (see Table 5 in Appendix A.2): (a) Geman; (b) Laplace; (c) LSP; (d) modified MCP; (e) modified SCAD. Here, $\beta = 1$; $\theta = 2.5$ for SCAD and $\theta = 0.5$ for the others; and $\delta = 0.05$ for MCP and SCAD.]

Given the current iterate $(U^k, V^k)$, we want to find increments $(\bar{U}, \bar{V})$ as in (2). $\dot{H}$ in (4) can be rewritten as $\dot{H}^k(\bar{U}, \bar{V}) \equiv \sum_{i=1}^m \sum_{j=1}^n W_{ij}\, \phi(|M_{ij} - [(U^k + \bar{U})(V^k + \bar{V})^\top]_{ij}|) + \frac{\lambda}{2}\|U^k + \bar{U}\|_F^2 + \frac{\lambda}{2}\|V^k + \bar{V}\|_F^2$. Using Proposition 3.1, we obtain the following convex upper bound for $\dot{H}^k$.

Corollary 3.2. $\dot{H}^k(\bar{U}, \bar{V}) \le b^k + \frac{\lambda}{2}\|U^k + \bar{U}\|_F^2 + \frac{\lambda}{2}\|V^k + \bar{V}\|_F^2 + \|\dot{W}^k \odot (M - U^k(V^k)^\top - \bar{U}(V^k)^\top - U^k\bar{V}^\top - \bar{U}\bar{V}^\top)\|_1$, where $b^k = \sum_{i=1}^m \sum_{j=1}^n W_{ij}(\phi(|M_{ij} - [U^k(V^k)^\top]_{ij}|) - A^k_{ij}|M_{ij} - [U^k(V^k)^\top]_{ij}|)$, $\dot{W}^k = A^k \odot W$, and $A^k_{ij} = \phi'(|M_{ij} - [U^k(V^k)^\top]_{ij}|)$.

The product $\bar{U}\bar{V}^\top$ still couples $\bar{U}$ and $\bar{V}$ together. As $\dot{H}^k$ is similar to $H^k$ in Section 2.2, one may want to reuse Proposition 2.1. However, Proposition 2.1 holds only when $W$ is a binary matrix, while $\dot{W}^k$ here is real-valued. Let $\Lambda^k_r = \mathrm{Diag}(\sqrt{\mathrm{sum}(\dot{W}^k_{(1,:)})}, \ldots, \sqrt{\mathrm{sum}(\dot{W}^k_{(m,:)})})$ and $\Lambda^k_c = \mathrm{Diag}(\sqrt{\mathrm{sum}(\dot{W}^k_{(:,1)})}, \ldots, \sqrt{\mathrm{sum}(\dot{W}^k_{(:,n)})})$. The following Proposition shows that $\dot{F}^k(\bar{U}, \bar{V}) \equiv \|\dot{W}^k \odot (M - U^k(V^k)^\top - \bar{U}(V^k)^\top - U^k\bar{V}^\top)\|_1 + \frac{\lambda}{2}\|U^k + \bar{U}\|_F^2 + \frac{\lambda}{2}\|V^k + \bar{V}\|_F^2 + \frac{1}{2}\|\Lambda^k_r \bar{U}\|_F^2 + \frac{1}{2}\|\Lambda^k_c \bar{V}\|_F^2 + b^k$ can be used as a surrogate.

Proposition 3.3. $\dot{H}^k(\bar{U}, \bar{V}) \le \dot{F}^k(\bar{U}, \bar{V})$, with equality holding iff $(\bar{U}, \bar{V}) = (0, 0)$.

Moreover, it can easily be seen that $\dot{F}^k$ qualifies as a good surrogate in the sense of Section 2.1: (a) $\dot{H}(\bar{U} + U^k, \bar{V} + V^k) \le \dot{F}^k(\bar{U}, \bar{V})$; (b) $(0, 0) = \arg\min_{\bar{U}, \bar{V}} \dot{F}^k(\bar{U}, \bar{V}) - \dot{H}^k(\bar{U}, \bar{V})$ and $\dot{F}^k(0, 0) = \dot{H}^k(0, 0)$; and (c) $\dot{F}^k$ is jointly convex in $(\bar{U}, \bar{V})$.

Remark 3.1. In the special case where the $\ell_1$-loss is used, $\dot{W}^k = W$, $b^k = 0$, $\Lambda^k_r = \Lambda_r$, and $\Lambda^k_c = \Lambda_c$. The surrogate $\dot{F}^k(\bar{U}, \bar{V})$ then reduces to that in (3), and Proposition 3.3 becomes Proposition 2.1.

3.3 Optimizing the Surrogate via APG on the Dual

LADMPSAP, which is used in RMF-MM, can also be used to optimize $\dot{F}^k$. However, the dual variable in LADMPSAP is a dense matrix, and cannot utilize possible sparsity of $W$. Moreover, LADMPSAP converges at a rate of $O(1/T)$ [25], which is slow. In the following, we propose a time- and space-efficient optimization procedure based on running the accelerated proximal gradient (APG) algorithm on the dual of the surrogate optimization problem.
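As an aside, the re-weighted-$\ell_1$ majorizer of Proposition 3.1, on which the surrogate above rests, is easy to verify numerically. The sketch below assumes the LSP form $\phi(\alpha) = \log(1 + \alpha/\theta)$; the exact parameterizations used in the paper are in Table 5 of the appendix, so treat this particular form as an assumption.

```python
import numpy as np

theta = 1.0
phi  = lambda a: np.log(1.0 + a / theta)  # assumed LSP loss (concave, increasing on a >= 0)
dphi = lambda a: 1.0 / (theta + a)        # its derivative phi'(a)

beta = 1.5
alpha = np.linspace(-10.0, 10.0, 2001)

lhs = phi(np.abs(alpha))
# Proposition 3.1: a scaled |alpha| plus an offset, tangent at |alpha| = |beta|
rhs = dphi(abs(beta)) * np.abs(alpha) + (phi(abs(beta)) - dphi(abs(beta)) * abs(beta))

gap = rhs - lhs  # nonnegative everywhere, and ~0 at alpha = +/- beta
```

The bound follows from concavity of $\phi$ on $\alpha \ge 0$: the tangent line at $|\beta|$ lies above the function, and touches it exactly at $|\alpha| = |\beta|$.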
Note that while the primal problem has $O(mn)$ variables, the dual problem has only $\mathrm{nnz}(W)$ variables.

Problem Reformulation. Let $\Omega \equiv \{(i_1, j_1), \ldots, (i_{\mathrm{nnz}(W)}, j_{\mathrm{nnz}(W)})\}$ be the set containing indices of the observed elements in $W$, $H_\Omega(\cdot)$ be the linear operator which maps an $\mathrm{nnz}(W)$-dimensional vector $x$ to the sparse matrix $X \in \mathbb{R}^{m \times n}$ with nonzero positions indicated by $\Omega$ (i.e., $X_{i_t j_t} = x_t$, where $(i_t, j_t)$ is the $t$th element in $\Omega$), and $H_\Omega^{-1}(\cdot)$ be the inverse operator of $H_\Omega$.

Proposition 3.4. The dual problem of $\min_{\bar{U}, \bar{V}} \dot{F}^k(\bar{U}, \bar{V})$ is

$$\min_{x \in \mathcal{W}^k} D^k(x) \equiv \frac{1}{2}\mathrm{tr}((H_\Omega(x)V^k - \lambda U^k)^\top A^k_r (H_\Omega(x)V^k - \lambda U^k)) + \frac{1}{2}\mathrm{tr}((H_\Omega(x)^\top U^k - \lambda V^k)^\top A^k_c (H_\Omega(x)^\top U^k - \lambda V^k)) - \mathrm{tr}(H_\Omega(x)^\top M), \quad (5)$$

where $\mathcal{W}^k \equiv \{x \in \mathbb{R}^{\mathrm{nnz}(W)} : |x_i| \le [\dot{w}^k]_i\}$, $\dot{w}^k = H_\Omega^{-1}(\dot{W}^k)$, $A^k_r = (\lambda I + (\Lambda^k_r)^2)^{-1}$, and $A^k_c = (\lambda I + (\Lambda^k_c)^2)^{-1}$. From the obtained $x$, the primal solution $(\bar{U}, \bar{V})$ can be recovered as $\bar{U} = A^k_r(H_\Omega(x)V^k - \lambda U^k)$ and $\bar{V} = A^k_c(H_\Omega(x)^\top U^k - \lambda V^k)$.

Problem (5) can be solved by the APG algorithm, which has a convergence rate of $O(1/T^2)$ [2, 32] and is faster than LADMPSAP. As $\mathcal{W}^k$ involves only $\ell_1$-type box constraints, the proximal step can be computed in closed form (details are in Appendix B.3) and takes only $O(\mathrm{nnz}(W))$ time.

The complete procedure, which will be called the Robust Matrix Factorization with Nonconvex Loss (RMFNL) algorithm, is shown in Algorithm 1. The surrogate is optimized via its dual in step 4. The primal solution is recovered in step 5, and $(U^k, V^k)$ are updated in step 6.

Exploiting Sparsity.
A direct implementation of APG takes $O(mn)$ space and $O(mnr)$ time per iteration. In the following, we show how these can be reduced by exploiting the sparsity of $W$.

The objective in (5) involves $A^k_r$, $A^k_c$ and $\mathcal{W}^k$, which are all related to $\dot{W}^k$. Recall that $\dot{W}^k$ in Corollary 3.2 is sparse (as $W$ is sparse). Thus, by exploiting sparsity, constructing $A^k_r$, $A^k_c$ and $\mathcal{W}^k$ only takes $O(\mathrm{nnz}(W))$ time and space.

Algorithm 1 Robust matrix factorization using nonconvex loss (RMFNL).
1: initialize $U^1 \in \mathbb{R}^{m \times r}$ and $V^1 \in \mathbb{R}^{n \times r}$;
2: for $k = 1, 2, \ldots, K$ do
3:   compute $\dot{W}^k$ in Corollary 3.2 (only at the observed positions), and $\Lambda^k_r, \Lambda^k_c$;
4:   compute $x^k = \arg\min_{x \in \mathcal{W}^k} D^k(x)$ in Proposition 3.4 using APG;
5:   $\bar{U}^k = A^k_r(H_\Omega(x^k)V^k - \lambda U^k)$, $\bar{V}^k = A^k_c(H_\Omega(x^k)^\top U^k - \lambda V^k)$;
6:   $U^{k+1} = U^k + \bar{U}^k$, $V^{k+1} = V^k + \bar{V}^k$;
7: end for
8: return $U^{K+1}$ and $V^{K+1}$.

In each APG iteration, one has to compute the gradient, the objective, and the proximal step. First, consider the gradient $\nabla D^k(x)$ of the objective, which is equal to

$$H_\Omega^{-1}(A^k_r(H_\Omega(x)V^k - \lambda U^k)(V^k)^\top) + H_\Omega^{-1}(U^k[(U^k)^\top H_\Omega(x) - \lambda(V^k)^\top]A^k_c) - H_\Omega^{-1}(M). \quad (6)$$

The first term can be rewritten as $\hat{g}^k = H_\Omega^{-1}(Q^k(V^k)^\top)$, where $Q^k = A^k_r(H_\Omega(x)V^k - \lambda U^k)$. As $A^k_r$ is diagonal and $H_\Omega(x)$ is sparse, $Q^k$ can be computed as $A^k_r(H_\Omega(x)V^k) - \lambda(A^k_r U^k)$ in $O(\mathrm{nnz}(W)r + mr)$ time, where $r$ is the number of columns in $U^k$ and $V^k$. Let the $t$th element in $\Omega$ be $(i_t, j_t)$. By the definition of $H_\Omega^{-1}(\cdot)$, we have $\hat{g}^k_t = \sum_{q=1}^r Q^k_{i_t q} V^k_{j_t q}$, and this takes $O(\mathrm{nnz}(W)r + mr)$ time. Similarly, computing the second term in (6) takes $O(\mathrm{nnz}(W)r + nr)$ time.
Hence, computing $\nabla D^k(x)$ takes a total of $O(\mathrm{nnz}(W)r + (m + n)r)$ time and $O(\mathrm{nnz}(W) + (m + n)r)$ space (the algorithm is shown in Appendix B.1). Similarly, the objective can be obtained in $O(\mathrm{nnz}(W)r + (m + n)r)$ time and $O(\mathrm{nnz}(W) + (m + n)r)$ space (details are in Appendix B.2). The proximal step takes $O(\mathrm{nnz}(W))$ time and space, as $x \in \mathbb{R}^{\mathrm{nnz}(W)}$. Thus, by exploiting sparsity, the APG algorithm has a space complexity of $O(\mathrm{nnz}(W) + (m + n)r)$ and a per-iteration time complexity of $O(\mathrm{nnz}(W)r + (m + n)r)$. In comparison, LADMPSAP needs $O(mn)$ space and $O(mnr)$ time per iteration. A summary of the complexity results is shown in Figure 2(a).

3.4 Convergence Analysis

In this section, we study the convergence of RMFNL. Note that the proof technique in RMF-MM cannot be used, as it relies on convexity of the $\ell_1$-loss, while $\phi$ in (4) is nonconvex (in particular, Proposition 1 in [26] fails). Moreover, the proof of RMF-MM uses the subgradient. Here, as $\phi$ is nonconvex, we use the Clarke subdifferential [10], which generalizes the subgradient to nonconvex functions (a brief introduction is in Appendix C). The iterates $\{X^k\}$ generated by RMF-MM are guaranteed to achieve a sufficient decrease of the objective $f$ in the following sense [26]: there exists a constant $\gamma > 0$ such that $f(X^k) - f(X^{k+1}) \ge \gamma\|X^k - X^{k+1}\|_F^2$ for all $k$. The following Proposition shows that RMFNL also achieves a sufficient decrease of its objective. Moreover, the generated sequence $\{(U^k, V^k)\}$ is bounded, and thus has at least one limit point.

Proposition 3.5. For Algorithm 1, $\{(U^k, V^k)\}$ is bounded, and has a sufficient decrease on $\dot{H}$.

Theorem 3.6. The limit points of the sequence generated by Algorithm 1 are critical points of (4).

4 Experiments

In this section, we compare the proposed RMFNL with state-of-the-art MF algorithms.
Experiments are performed on a PC with an Intel i7 CPU and 32GB RAM. All the code is in Matlab, with sparse matrix operations implemented in C++. We use the nonconvex loss functions LSP, Geman and Laplace in Table 5 of Appendix A, with $\theta = 1$; and fix $\lambda = 20/(m + n)$ in (1) as suggested in [26].

4.1 Synthetic Data

We first perform experiments on synthetic data, which is generated as $X = UV^\top$ with $U \in \mathbb{R}^{m \times 5}$, $V \in \mathbb{R}^{m \times 5}$, and $m \in \{250, 500, 1000\}$. Elements of $U$ and $V$ are sampled i.i.d. from the standard normal distribution $N(0, 1)$. $X$ is then corrupted to form $M = X + N + S$, where $N$ is a noise matrix from $N(0, 0.1)$, and $S$ is a sparse matrix modeling outliers, with 5% nonzero elements randomly sampled from $\{\pm 5\}$. We randomly draw $10\log(m)/m\%$ of the elements from $M$ as observations, with half of them used for training and the other half for validation. The remaining unobserved elements are used for testing. Note that the larger the $m$, the sparser the observed matrix.

The iterate $(U^1, V^1)$ is initialized with Gaussian random matrices, and the iterative procedure is stopped when the relative change in objective value between successive iterations is smaller than $10^{-4}$. For the subproblems in RMF-MM and RMFNL, iteration is stopped when the relative change in objective value is smaller than $10^{-6}$ or a maximum of 300 iterations is reached. The rank $r$ is set to the ground truth (i.e., 5). For performance evaluation, we follow [26] and use (i) the testing root mean square error, $\mathrm{RMSE} = \sqrt{\|\bar{W} \odot (X - \bar{U}\bar{V}^\top)\|_F^2 / \mathrm{nnz}(\bar{W})}$, where $\bar{W}$ is a binary matrix indicating the positions of the testing elements; and (ii) the CPU time. To reduce statistical variability, results are averaged over five repetitions.

Solvers for Surrogate Optimization.
Here, we compare three solvers for the surrogate optimization in each RMFNL iteration (with the LSP loss and $m = 1000$): (i) LADMPSAP, as in RMF-MM; (ii) APG(dense), which uses APG but without utilizing data sparsity; and (iii) APG in Algorithm 1, which utilizes data sparsity as in Section 3.3. The APG stepsize is determined by line search, and adaptive restart is used for further speedup [32]. Figure 2 shows the convergence in the first RMFNL iteration (results for the other iterations are similar). As can be seen, LADMPSAP is the slowest w.r.t. the number of iterations, as its convergence rate is inferior to that of both APG variants (whose rates are the same). In terms of CPU time, APG is the fastest as it can also utilize data sparsity.

[Figure 2: Convergence of the objective on the synthetic data set (with the LSP loss and $m = 1000$): (a) complexities of the surrogate optimizers; (b) number of iterations; (c) CPU time. The curves for APG(dense) and APG overlap in Figure 2(b).]

Table 1 shows the performance of the whole RMFNL algorithm with different surrogate optimizers.¹ As can be seen, the various nonconvex losses (LSP, Geman and Laplace) lead to similar RMSEs, as has been similarly observed in [16, 38]. Moreover, the different optimizers all obtain the same RMSE.
In terms of speed, APG is the fastest, followed by APG(dense), and LADMPSAP is the slowest. Hence, in the sequel, we only use APG to optimize the surrogate.

Table 1: Performance of RMFNL with different surrogate optimizers (RMSE / CPU time in seconds).

loss    | solver     | m=250 (nnz: 11.04%)     | m=500 (nnz: 6.21%)       | m=1000 (nnz: 3.45%)
LSP     | LADMPSAP   | 0.110±0.004 / 17.0±1.4  | 0.072±0.001 / 195.7±34.7 | 0.45±0.007 / 950.8±138.8
LSP     | APG(dense) | 0.110±0.004 / 12.1±0.6  | 0.073±0.001 / 114.4±18.8 | 0.45±0.007 / 490.1±91.9
LSP     | APG        | 0.110±0.004 / 3.2±0.6   | 0.073±0.001 / 5.5±1.0    | 0.45±0.006 / 24.6±3.2
Geman   | LADMPSAP   | 0.115±0.014 / 20.4±0.8  | 0.074±0.006 / 231.0±36.9 | 0.45±0.007 / 950.8±138.8
Geman   | APG(dense) | 0.115±0.011 / 13.9±1.6  | 0.073±0.002 / 146.9±24.8 | 0.45±0.007 / 490.1±91.9
Geman   | APG        | 0.114±0.009 / 3.1±0.5   | 0.073±0.002 / 8.3±1.1    | 0.45±0.006 / 24.6±3.2
Laplace | LADMPSAP   | 0.110±0.004 / 17.1±1.5  | 0.072±0.001 / 203.4±22.7 | 0.45±0.007 / 950.8±138.8
Laplace | APG(dense) | 0.110±0.004 / 12.1±2.1  | 0.073±0.003 / 120.9±28.9 | 0.45±0.007 / 490.1±91.9
Laplace | APG        | 0.111±0.004 / 2.8±0.4   | 0.074±0.001 / 5.6±1.0    | 0.45±0.006 / 24.6±3.2

Comparison with State-of-the-Art Matrix Factorization Algorithms. The $\ell_2$-loss-based MF algorithms compared include alternating gradient descent (AltGrad) [30], Riemannian preconditioning (RP) [29], scaled alternating steepest descent (ScaledASD) [33], alternating minimization for large-scale matrix imputing (ALT-Impute) [17] and online massive dictionary learning (OMDL) [28].
The $\ell_1$-loss-based RMF algorithms compared include RMF-MM [26], robust matrix completion (RMC) [7] and the Grassmannian robust adaptive subspace tracking algorithm (GRASTA) [18]. Codes are provided by the respective authors. We do not compare with AOPMC [36], which has been shown to be slower than RMC [7].

As can be seen from Table 2, RMFNL produces much lower RMSE than the MF/RMF algorithms, and the RMSEs from the different nonconvex losses are similar. AltGrad, RP, ScaledASD, ALT-Impute and OMDL are very fast because they use the simple $\ell_2$-loss. However, their RMSEs are much higher than those of RMFNL and the RMF algorithms. A more detailed convergence comparison is shown in Figure 3. As can be seen, RMF-MM is the slowest. RMFNL with the different nonconvex losses has similar convergence behavior, and they all converge to a lower testing RMSE much faster than the others.

¹ For all tables in the sequel, the best and comparable results according to the pairwise t-test with 95% confidence are highlighted.

Table 2: Performance of the various matrix factorization algorithms on synthetic data (RMSE / CPU time in seconds).

loss     | algorithm  | m=250 (nnz: 11.04%)     | m=500 (nnz: 6.21%)       | m=1000 (nnz: 3.45%)
$\ell_2$ | AltGrad    | 1.062±0.040 / 1.0±0.6   | 0.950±0.005 / 1.8±0.3    | 0.853±0.010 / 6.0±4.2
$\ell_2$ | RP         | 1.048±0.071 / 0.1±0.1   | 0.953±0.012 / 0.4±0.2    | 0.848±0.009 / 1.1±0.1
$\ell_2$ | ScaledASD  | 1.042±0.066 / 0.2±0.1   | 0.950±0.009 / 0.4±0.3    | 0.847±0.009 / 1.2±0.5
$\ell_2$ | ALT-Impute | 1.030±0.060 / 0.2±0.1   | 0.937±0.010 / 0.3±0.1    | 0.838±0.009 / 1.0±0.2
$\ell_2$ | OMDL       | 1.089±0.055 / 0.1±0.1   | 0.945±0.018 / 0.2±0.1    | 0.847±0.009 / 0.5±0.2
$\ell_1$ | GRASTA     | 0.338±0.033 / 1.5±0.1   | 0.306±0.002 / 2.9±0.3    | 0.244±0.009 / 6.1±0.4
$\ell_1$ | RMC        | 0.226±0.040 / 2.8±1.0   | 0.201±0.001 / 2.7±0.5    | 0.195±0.006 / 4.2±2.5
$\ell_1$ | RMF-MM     | 0.194±0.032 / 13.4±0.6  | 0.145±0.009 / 154.9±12.5 | 0.122±0.004 / 827.7±116.3
LSP      | RMFNL      | 0.110±0.004 / 3.2±0.6   | 0.073±0.001 / 5.5±1.0    | 0.047±0.002 / 14.0±5.2
Geman    | RMFNL      | 0.114±0.004 / 3.1±0.5   | 0.073±0.001 / 8.3±1.1    | 0.047±0.001 / 19.0±4.9
Laplace  | RMFNL      | 0.111±0.004 / 2.8±0.4   | 0.074±0.001 / 5.6±1.0    | 0.047±0.002 / 15.9±6.1

[Figure 3: Convergence of testing RMSE for the various algorithms on synthetic data: (a) m = 250; (b) m = 500; (c) m = 1000.]

4.2 Robust Collaborative Recommendation

In a recommender system, the love/hate attack changes the ratings of selected items to the minimum (hate) or maximum (love) [5]. The love/hate attack is very simple, but can significantly bias the overall prediction. As no love/hate attack data sets are publicly available, we follow [5, 31] and manually add perturbations. Experiments are performed on the popular MovieLens recommender data sets: MovieLens-100K, MovieLens-1M, and MovieLens-10M (some statistics of these data sets are in Appendix E.1). We randomly select 3% of the items from each data set. For each selected item, all its observed ratings are set to either the minimum or the maximum with equal probability. 50% of the observed ratings are used for training, 25% for validation, and the rest for testing. The algorithms in Section 4.1 are compared. To reduce statistical variability, results are averaged over five repetitions.
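The love/hate corruption described above can be sketched as follows. This is illustrative only: the function name is hypothetical, the 3% item fraction and the min/max rating replacement follow the description in the text, and the per-item coin flip between "love" and "hate" is an assumption.

```python
import numpy as np

def love_hate_attack(R, mask, frac_items=0.03, r_min=1.0, r_max=5.0, seed=0):
    """Set all observed ratings of a random fraction of items to the minimum
    ('hate') or maximum ('love') rating, chosen per item with equal probability.

    R: (n_users, n_items) rating matrix; mask: boolean observed-entry matrix.
    Returns the corrupted copy of R and the indices of the attacked items.
    """
    rng = np.random.default_rng(seed)
    n_items = R.shape[1]
    attacked = rng.choice(n_items, size=max(1, int(frac_items * n_items)),
                          replace=False)
    R = R.copy()
    for j in attacked:
        R[mask[:, j], j] = r_min if rng.random() < 0.5 else r_max
    return R, attacked
```

Only the observed entries of the attacked items are modified, so the observation pattern itself is unchanged, matching the setting above where the attack biases ratings rather than adding new ones.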
As in Section 4.1, the testing RMSE and CPU time are used for performance evaluation. Results are shown in Table 3, and Figure 4 shows the convergence of the RMSE. Again, RMFNL with the different nonconvex losses has similar performance and achieves the lowest RMSE. The MF algorithms are fast, but have high RMSEs. GRASTA is not stable, with large RMSE and variance.

4.3 Affine Rigid Structure-from-Motion (SfM)

SfM reconstructs the 3D scene from sparse feature points tracked in $m$ images of a moving camera [23]. Each feature point is projected onto every image plane, and is thus represented by a $2m$-dimensional vector. With $n$ feature points, this leads to a $2m \times n$ matrix. Often, this matrix has missing data (e.g., some feature points may not always be visible) and outliers (arising from feature mismatch). We use the Oxford Dinosaur sequence, which has 36 images and 4,983 feature points. As in [26], we extract three data subsets using feature points observed in at least 5, 6 and 7

Table 3: Performance on the MovieLens data sets (RMSE / CPU time in seconds).
RMF-MM does not converge within 10^4 seconds on the MovieLens-1M and MovieLens-10M data sets, and is thus not reported.

[Table 3 body: testing RMSE and CPU time on MovieLens-100K, MovieLens-1M, and MovieLens-10M, for the ℓ2-loss methods (AltGrad, ScaledASD, ALT-Impute, OMDL), the ℓ1-loss methods (GRASTA, RP, RMC, RMF-MM), and RMFNL with the LSP, Geman, and Laplace losses.]

(a) MovieLens-100K. (b) MovieLens-1M. (c) MovieLens-10M.

Figure 4: Convergence of testing RMSE on the recommendation data sets.

images. These are denoted "D1" (with size 72×932), "D2" (72×557) and "D3" (72×336). The fully observed data matrix can be recovered by rank-4 matrix factorization [12], and so we set r = 4. We compare RMFNL with RMF-MM and its variant (denoted RMF-MM(heuristic)) described in Section 4.2 of [26].
In this variant, the diagonal entries of Λr and Λc are initialized with small values and then increased gradually. It is claimed in [26] that this leads to faster convergence. However, our experimental results show that this heuristic leads to more accurate, but not faster, results. Moreover, its key pitfall is that Proposition 2.1 and the convergence guarantee for RMF-MM no longer hold.

For performance evaluation, as there is no ground truth, we follow [26] and use (i) the mean absolute error (MAE) $\|\bar{W} \odot (\bar{U}\bar{V}^\top - X)\|_1 / \text{nnz}(\bar{W})$, where $\bar{U}$ and $\bar{V}$ are the outputs from the algorithm, and $X$ is the data matrix with observed positions indicated by the binary $\bar{W}$; and (ii) the CPU time. As the various nonconvex penalties have been shown to have similar performance, we only report the LSP here.

Results are shown in Table 4. As can be seen, RMF-MM(heuristic) obtains a lower MAE than RMF-MM, but is still outperformed by RMFNL. RMFNL is also the fastest, though the speedup is not as significant as in the previous sections. This is because the Dinosaur subsets are not very sparse (the percentages of nonzero entries in "D1", "D2" and "D3" are 17.9%, 20.5% and 23.1%, respectively).

Table 4: Performance on the Dinosaur data subsets. CPU time is in seconds.

                               D1                           D2                           D3
                     MAE          CPU time       MAE          CPU time       MAE          CPU time
RMF-MM(heuristic)    0.374±0.031  43.9±3.3       0.381±0.022  25.9±3.1       0.382±0.034  10.8±3.4
RMF-MM               0.442±0.096  26.9±3.4       0.458±0.043  14.9±2.2       0.466±0.072   9.2±2.1
RMFNL                0.323±0.012   8.3±1.9       0.332±0.005   6.8±1.3       0.316±0.006   3.4±1.0

5 Conclusion

In this paper, we improved the robustness of matrix factorization by using a nonconvex loss instead of the commonly used (convex) ℓ1- and ℓ2-losses.
We also improved its scalability by exploiting data sparsity (which RMF-MM cannot) and by using the accelerated proximal gradient algorithm (which is faster than the commonly used ADMM). The space and per-iteration time complexities are greatly reduced. Theoretical analysis shows that the proposed RMFNL algorithm converges to a critical point. Extensive experiments on both synthetic and real-world data sets demonstrate that RMFNL is more accurate and more scalable than the state-of-the-art.

References

[1] R. Basri, D. Jacobs, and I. Kemelmacher. Photometric stereo with general, unknown lighting. International Journal of Computer Vision, 72(3):239–257, 2007.

[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[3] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[5] R. Burke, M. O'Mahony, and N. Hurley. Recommender Systems Handbook. Springer, 2015.

[6] R. Cabral, F. De la Torre, J. Costeira, and A. Bernardino. Unifying nuclear norm and bilinear factorization approaches for low-rank matrix decomposition. In International Conference on Computer Vision, pages 2488–2495, 2013.

[7] L. Cambier and P. Absil. Robust low-rank matrix completion by Riemannian optimization. SIAM Journal on Scientific Computing, 38(5):S440–S460, 2016.

[8] E.J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.

[9] E.J. Candès, M.B. Wakin, and S. Boyd. Enhancing sparsity by reweighted ℓ1 minimization.
Journal of Fourier Analysis and Applications, 14(5-6):877–905, 2008.

[10] F. Clarke. Optimization and Nonsmooth Analysis. SIAM, 1990.

[11] F. De la Torre and M. Black. A framework for robust subspace learning. International Journal of Computer Vision, 54(1):117–142, 2003.

[12] A. Eriksson and A. van den Hengel. Efficient computation of robust low-rank matrix approximations in the presence of missing data using the ℓ1-norm. In International Conference on Computer Vision and Pattern Recognition, pages 771–778, 2010.

[13] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.

[14] D. Geman and C. Yang. Nonlinear image recovery with half-quadratic regularization. IEEE Transactions on Image Processing, 4(7):932–946, 1995.

[15] P. Gong and J. Ye. HONOR: Hybrid optimization for non-convex regularized problems. In Advances in Neural Information Processing Systems, pages 415–423, 2015.

[16] P. Gong, C. Zhang, Z. Lu, J. Huang, and J. Ye. A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In International Conference on Machine Learning, pages 37–45, 2013.

[17] T. Hastie, R. Mazumder, J. Lee, and R. Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research, 16:3367–3402, 2015.

[18] J. He, L. Balzano, and A. Szlam. Incremental gradient on the Grassmannian for online foreground and background separation in subsampled video. In Computer Vision and Pattern Recognition, pages 1568–1575, 2012.

[19] P. Huber. Robust Statistics. Springer, 2011.

[20] D. Hunter and K. Lange. A tutorial on MM algorithms. The American Statistician, 58(1):30–37, 2004.

[21] W. Jiang, F. Nie, and H. Huang. Robust dictionary learning with capped ℓ1-norm.
In International Joint Conference on Artificial Intelligence, pages 3590–3596, 2015.

[22] E. Kim, M. Lee, C. Choi, N. Kwak, and S. Oh. Efficient ℓ1-norm-based low-rank matrix approximations for large-scale problems using alternating rectified gradient method. IEEE Transactions on Neural Networks and Learning Systems, 26(2):237–251, 2015.

[23] J. Koenderink and A. van Doorn. Affine structure from motion. Journal of the Optical Society of America, 8(2):377–385, 1991.

[24] K. Lange, D. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9(1):1–20, 2000.

[25] Z. Lin, R. Liu, and H. Li. Linearized alternating direction method with parallel splitting and adaptive penalty for separable convex programs in machine learning. Machine Learning, 99(2):287–325, 2015.

[26] Z. Lin, C. Xu, and H. Zha. Robust matrix factorization by majorization minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[27] D. Meng, Z. Xu, L. Zhang, and J. Zhao. A cyclic weighted median method for ℓ1 low-rank matrix factorization with missing entries. In AAAI Conference on Artificial Intelligence, pages 704–710, 2013.

[28] A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Dictionary learning for massive matrix factorization. In International Conference on Machine Learning, pages 1737–1746, 2016.

[29] B. Mishra and R. Sepulchre. Riemannian preconditioning. SIAM Journal on Optimization, 26(1):635–660, 2016.

[30] A. Mnih and R. Salakhutdinov. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, pages 1257–1264, 2008.

[31] B. Mobasher, R. Burke, R. Bhaumik, and C. Williams. Toward trustworthy recommender systems: An analysis of attack models and algorithm robustness.
ACM Transactions on Internet Technology, 7(4):23, 2007.

[32] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.

[33] J. Tanner and K. Wei. Low rank matrix completion by alternating steepest descent methods. Applied and Computational Harmonic Analysis, 40(2):417–429, 2016.

[34] J. Trzasko and A. Manduca. Highly undersampled magnetic resonance image reconstruction via homotopic ℓ0-minimization. IEEE Transactions on Medical Imaging, 28(1):106–121, 2009.

[35] M. Yan. Restoration of images corrupted by impulse noise and mixed Gaussian impulse noise using blind inpainting. SIAM Journal on Imaging Sciences, 6(3):1227–1245, 2013.

[36] M. Yan, Y. Yang, and S. Osher. Exact low-rank matrix completion from sparsely corrupted entries via adaptive outlier pursuit. Journal of Scientific Computing, 56(3):433–449, 2013.

[37] J. Yang and J. Leskovec. Overlapping community detection at scale: A nonnegative matrix factorization approach. In Web Search and Data Mining, pages 587–596, 2013.

[38] Q. Yao, J. Kwok, T. Wang, and T. Liu. Large-scale low-rank matrix learning with nonconvex regularizers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[39] C. Zhang. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2):894–942, 2010.

[40] Y. Zheng, G. Liu, S. Sugimoto, S. Yan, and M. Okutomi. Practical low-rank matrix approximation under robust ℓ1-norm. In International Conference on Computer Vision and Pattern Recognition, pages 1410–1417, 2012.

[41] W. Zuo, D. Meng, L. Zhang, X. Feng, and D. Zhang. A generalized iterated shrinkage algorithm for non-convex sparse coding.
In International Conference on Computer Vision, pages 217–224, 2013.