{"title": "Divide-and-Conquer Matrix Factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 1134, "page_last": 1142, "abstract": "This work introduces Divide-Factor-Combine (DFC), a parallel divide-and-conquer framework for noisy matrix factorization. DFC divides a large-scale matrix factorization task into smaller subproblems, solves each subproblem in parallel using an arbitrary base matrix factorization algorithm, and combines the subproblem solutions using techniques from randomized matrix approximation. Our experiments with collaborative filtering, video background modeling, and simulated data demonstrate the near-linear to super-linear speed-ups attainable with this approach. Moreover, our analysis shows that DFC enjoys high-probability recovery guarantees comparable to those of its base algorithm.", "full_text": "Divide-and-Conquer Matrix Factorization\n\nMichael I. Jordan (a, b)\nLester Mackey (a)\nAmeet Talwalkar (a)\n\n(a) Department of Electrical Engineering and Computer Science, UC Berkeley\n(b) Department of Statistics, UC Berkeley\n\nAbstract\n\nThis work introduces Divide-Factor-Combine (DFC), a parallel divide-and-conquer\nframework for noisy matrix factorization. DFC divides a large-scale\nmatrix factorization task into smaller subproblems, solves each subproblem in parallel\nusing an arbitrary base matrix factorization algorithm, and combines the subproblem\nsolutions using techniques from randomized matrix approximation. Our\nexperiments with collaborative filtering, video background modeling, and simulated\ndata demonstrate the near-linear to super-linear speed-ups attainable with\nthis approach. Moreover, our analysis shows that DFC enjoys high-probability\nrecovery guarantees comparable to those of its base algorithm.\n\n1 Introduction\nThe goal in matrix factorization is to recover a low-rank matrix from irrelevant noise and corruption. 
We focus on two instances of the problem: noisy matrix completion, i.e., recovering a low-rank\nmatrix from a small subset of noisy entries, and noisy robust matrix factorization [2, 3, 4], i.e., recovering\na low-rank matrix from corruption by noise and outliers of arbitrary magnitude. Examples\nof the matrix completion problem include collaborative filtering for recommender systems, link prediction\nfor social networks, and click prediction for web search, while applications of robust matrix\nfactorization arise in video surveillance [2], graphical model selection [4], document modeling [17],\nand image alignment [21].\nThese two classes of matrix factorization problems have attracted significant interest in the research\ncommunity. In particular, convex formulations of noisy matrix factorization have been shown to admit\nstrong theoretical recovery guarantees [1, 2, 3, 20], and a variety of algorithms (e.g., [15, 16, 23])\nhave been developed for solving both matrix completion and robust matrix factorization via convex\nrelaxation. Unfortunately, these methods are inherently sequential and all rely on the repeated and\ncostly computation of truncated SVDs, factors that limit the scalability of the algorithms.\nTo improve scalability and leverage the growing availability of parallel computing architectures, we\npropose a divide-and-conquer framework for large-scale matrix factorization. Our framework, entitled\nDivide-Factor-Combine (DFC), randomly divides the original matrix factorization task into\ncheaper subproblems, solves those subproblems in parallel using any base matrix factorization algorithm,\nand combines the solutions to the subproblems using efficient techniques from randomized\nmatrix approximation. 
The inherent parallelism of DFC allows for near-linear to superlinear speed-ups\nin practice, while our theory provides high-probability recovery guarantees for DFC comparable\nto those enjoyed by its base algorithm.\nThe remainder of the paper is organized as follows. In Section 2, we define the setting of noisy matrix\nfactorization and introduce the components of the DFC framework. To illustrate the significant\nspeed-up and robustness of DFC and to highlight the effectiveness of DFC ensembling, we present\nexperimental results on collaborative filtering, video background modeling, and simulated data in\nSection 3. Our theoretical analysis follows in Section 4. There, we establish high-probability noisy\nrecovery guarantees for DFC that rest upon a novel analysis of randomized matrix approximation\nand a new recovery result for noisy matrix completion.\n\nNotation For M \u2208 Rm\u00d7n, we define M(i) as the ith row vector and Mij as the ijth entry.\nIf rank(M) = r, we write the compact singular value decomposition (SVD) of M as\nUM \u03a3M VM\u22a4, where \u03a3M is diagonal and contains the r non-zero singular values of M, and\nUM \u2208 Rm\u00d7r and VM \u2208 Rn\u00d7r are the corresponding left and right singular vectors of M. We\ndefine M+ = VM \u03a3M\u207b\u00b9 UM\u22a4 as the Moore-Penrose pseudoinverse of M and PM = MM+ as the\northogonal projection onto the column space of M. 
We let \u2016\u00b7\u20162, \u2016\u00b7\u2016F, and \u2016\u00b7\u2016\u2217 respectively denote\nthe spectral, Frobenius, and nuclear norms of a matrix and let \u2016\u00b7\u2016 represent the \u21132 norm of a vector.\n2 The Divide-Factor-Combine Framework\nIn this section, we present our divide-and-conquer framework for scalable noisy matrix factorization.\nWe begin by defining the problem setting of interest.\n\n2.1 Noisy Matrix Factorization (MF)\nIn the setting of noisy matrix factorization, we observe a subset of the entries of a matrix M =\nL0 + S0 + Z0 \u2208 Rm\u00d7n, where L0 has rank r \u226a m, n, S0 represents a sparse matrix of outliers of\narbitrary magnitude, and Z0 is a dense noise matrix. We let \u2126 represent the locations of the observed\nentries and P\u2126 be the orthogonal projection onto the space of m \u00d7 n matrices with support \u2126, so\nthat\n\n(P\u2126(M))ij = Mij if (i, j) \u2208 \u2126 and (P\u2126(M))ij = 0 otherwise.\n\nOur goal is to recover the low-rank matrix L0 from P\u2126(M) with error proportional to the noise level\n\u2206 \u225c \u2016Z0\u2016F. We will focus on two specific instances of this general problem:\n\n\u2022 Noisy Matrix Completion (MC): s \u225c |\u2126| entries of M are revealed uniformly without\nreplacement, along with their locations. There are no outliers, so that S0 is identically zero.\n\u2022 Noisy Robust Matrix Factorization (RMF): S0 is identically zero save for s outlier entries\nof arbitrary magnitude with unknown locations distributed uniformly without replacement.\nAll entries of M are observed, so that P\u2126(M) = M.\n\n2.2 Divide-Factor-Combine\nAlgorithms 1 and 2 summarize two canonical examples of the general Divide-Factor-Combine\nframework that we refer to as DFC-PROJ and DFC-NYS. Each algorithm has three simple steps:\n(D step) Divide input matrix into submatrices: DFC-PROJ randomly partitions P\u2126(M) into t l-column\nsubmatrices, {P\u2126(C1), . . . 
, P\u2126(Ct)}\u00b9, while DFC-NYS selects an l-column submatrix,\nP\u2126(C), and a d-row submatrix, P\u2126(R), uniformly at random.\n(F step) Factor each submatrix in parallel using any base MF algorithm: DFC-PROJ performs\nt parallel submatrix factorizations, while DFC-NYS performs two such parallel factorizations.\nStandard base MF algorithms output the low-rank approximations {\u02c6C1, . . . , \u02c6Ct} for\nDFC-PROJ and \u02c6C and \u02c6R for DFC-NYS. All matrices are retained in factored form.\n(C step) Combine submatrix estimates: DFC-PROJ generates a final low-rank estimate \u02c6Lproj by\nprojecting [\u02c6C1, . . . , \u02c6Ct] onto the column space of \u02c6C1, while DFC-NYS forms the low-rank\nestimate \u02c6Lnys from \u02c6C and \u02c6R via the generalized Nystr\u00f6m method. These matrix\napproximation techniques are described in more detail in Section 2.3.\n\n\u00b9 For ease of discussion, we assume that mod(n, t) = 0, and hence, l = n/t. Note that for arbitrary n and\nt, P\u2126(M) can always be partitioned into t submatrices, each with either \u230an/t\u230b or \u2308n/t\u2309 columns.\n\nAlgorithm 1 DFC-PROJ\nInput: P\u2126(M), t\n{P\u2126(Ci)}1\u2264i\u2264t = SAMPCOL(P\u2126(M), t)\ndo in parallel\n    \u02c6C1 = BASE-MF-ALG(P\u2126(C1))\n    ...\n    \u02c6Ct = BASE-MF-ALG(P\u2126(Ct))\nend do\n\u02c6Lproj = COLPROJECTION(\u02c6C1, . . . , \u02c6Ct)\n\nAlgorithm 2 DFC-NYS\u1d43\nInput: P\u2126(M), l, d\nP\u2126(C), P\u2126(R) = SAMPCOLROW(P\u2126(M), l, d)\ndo in parallel\n    \u02c6C = BASE-MF-ALG(P\u2126(C))\n    \u02c6R = BASE-MF-ALG(P\u2126(R))\nend do\n\u02c6Lnys = GENNYSTR\u00d6M(\u02c6C, \u02c6R)\n\u1d43 When Q is a submatrix of M we abuse notation and\ndefine P\u2126(Q) as the corresponding submatrix of P\u2126(M).\n\n2.3 Randomized Matrix Approximations\nOur divide-and-conquer algorithms rely on two methods that generate randomized low-rank approximations\nto an arbitrary matrix M from submatrices of M.\n\nColumn Projection This approximation, introduced by Frieze et al. [7], is derived from column\nsampling of M. We begin by sampling l < n columns uniformly without replacement and let C\nbe the m \u00d7 l matrix of sampled columns. Then, column projection uses C to generate a \u201cmatrix\nprojection\u201d approximation [13] of M as follows:\n\nLproj = CC+M = UC UC\u22a4 M.\n\nIn practice, we do not reconstruct Lproj but rather maintain low-rank factors, e.g., UC and UC\u22a4 M.\nGeneralized Nystr\u00f6m Method The standard Nystr\u00f6m method is often used to speed up large-scale\nlearning applications involving symmetric positive semidefinite (SPSD) matrices [24] and has\nbeen generalized for arbitrary real-valued matrices [8].\nIn particular, after sampling columns to\nobtain C, imagine that we independently sample d < m rows uniformly without replacement. Let\nR be the d \u00d7 n matrix of sampled rows and W be the d \u00d7 l matrix formed from the intersection\nof the sampled rows and columns. Then, the generalized Nystr\u00f6m method uses C, W, and R to\ncompute a \u201cspectral reconstruction\u201d approximation [13] of M as follows:\n\nLnys = CW+R = C VW \u03a3W+ UW\u22a4 R.\n\nAs with Lproj, we store low-rank factors of Lnys, such as C VW \u03a3W+ and UW\u22a4 R.\n2.4 Running Time of DFC\nMany state-of-the-art MF algorithms have \u2126(mnkM) per-iteration time complexity due to the rank-kM\ntruncated SVD performed on each iteration. DFC significantly reduces the per-iteration complexity\nto O(mlkCi) time for Ci (or C) and O(ndkR) time for R. 
The cost of combining the\nsubmatrix estimates is even smaller, since the outputs of standard MF algorithms are returned in factored\nform. Indeed, the column projection step of DFC-PROJ requires only O(mk\u00b2 + lk\u00b2) time for\nk \u225c maxi kCi: O(mk\u00b2 + lk\u00b2) time for the pseudoinversion of \u02c6C1 and O(mk\u00b2 + lk\u00b2) time for matrix\nmultiplication with each \u02c6Ci in parallel. Similarly, the generalized Nystr\u00f6m step of DFC-NYS\nrequires only O(l\u00afk\u00b2 + d\u00afk\u00b2 + min(m, n)\u00afk\u00b2) time, where \u00afk \u225c max(kC, kR). Hence, DFC divides\nthe expensive task of matrix factorization into smaller subproblems that can be executed in parallel\nand efficiently combines the low-rank, factored results.\n\n2.5 Ensemble Methods\nEnsemble methods have been shown to improve performance of matrix approximation algorithms,\nwhile straightforwardly leveraging the parallelism of modern many-core and distributed architectures\n[14]. As such, we propose ensemble variants of the DFC algorithms that demonstrably reduce\nrecovery error while introducing a negligible cost to the parallel running time. For DFC-PROJ-ENS,\nrather than projecting only onto the column space of \u02c6C1, we project [\u02c6C1, . . . , \u02c6Ct] onto the\ncolumn space of each \u02c6Ci in parallel and then average the t resulting low-rank approximations. For\nDFC-NYS-ENS, we choose a random d-row submatrix P\u2126(R) as in DFC-NYS and independently\npartition the columns of P\u2126(M) into {P\u2126(C1), . . . , P\u2126(Ct)} as in DFC-PROJ. After running the\nbase MF algorithm on each submatrix, we apply the generalized Nystr\u00f6m method to each (\u02c6Ci, \u02c6R)\npair in parallel and average the t resulting low-rank approximations. 
Section 3 highlights the empirical\neffectiveness of ensembling.\n3 Experimental Evaluation\nWe now explore the accuracy and speed-up of DFC on a variety of simulated and real-world datasets.\nWe use state-of-the-art matrix factorization algorithms in our experiments: the Accelerated Proximal\nGradient (APG) algorithm of [23] as our base noisy MC algorithm and the APG algorithm of [15] as\nour base noisy RMF algorithm. In all experiments, we use the default parameter settings suggested\nby [23] and [15], measure recovery error via root mean square error (RMSE), and report parallel\nrunning times for DFC. We moreover compare against two baseline methods: APG used on the full\nmatrix M and PARTITION, which performs matrix factorization on t submatrices just like DFC-PROJ\nbut omits the final column projection step.\n3.1 Simulations\nFor our simulations, we focused on square matrices (m = n) and generated random low-rank and\nsparse decompositions, similar to the schemes used in related work, e.g., [2, 12, 25]. We created\nL0 \u2208 Rm\u00d7m as a random product, AB\u22a4, where A and B are m \u00d7 r matrices with independent\nN(0, \u221a(1/r)) entries such that each entry of L0 has unit variance. Z0 contained independent\nN(0, 0.1) entries. In the MC setting, s entries of L0 + Z0 were revealed uniformly at random. In\nthe RMF setting, the support of S0 was generated uniformly at random, and the s corrupted entries\ntook values in [0, 1] with uniform probability. 
For each algorithm, we report the error between L0 and\nthe recovered low-rank matrix, and all reported results are averages over five trials.\n\n[Figure 1 panels: MC (RMSE vs. % revealed entries) with curves Part-10%, Proj-10%, Nys-10%, Proj-Ens-10%, Nys-Ens-10%, Proj-Ens-25%, Base-MC; RMF (RMSE vs. % of outliers) with curves Part-10%, Proj-10%, Nys-10%, Proj-Ens-10%, Nys-Ens-10%, Base-RMF.]\n\nFigure 1: Recovery error of DFC relative to base algorithms.\n\nWe first explored the recovery error of DFC as a function of s, using (m = 10K, r = 10) with\nvarying observation sparsity for MC and (m = 1K, r = 10) with a varying percentage of outliers\nfor RMF. The results are summarized in Figure 1.\u00b2 In both MC and RMF, the gaps in recovery\nbetween APG and DFC are small when sampling only 10% of rows and columns. Moreover, DFC-PROJ-ENS\nin particular consistently outperforms PARTITION and DFC-NYS-ENS and matches the\nperformance of APG for most settings of s.\nWe next explored the speed-up of DFC as a function of matrix size. For MC, we revealed 4% of\nthe matrix entries and set r = 0.001 \u00b7 m, while for RMF we fixed the percentage of outliers to 10%\nand set r = 0.01 \u00b7 m. We sampled 10% of rows and columns and observed that recovery errors\nwere comparable to the errors presented in Figure 1 for similar settings of s; in particular, at all\nvalues of n for both MC and RMF, the errors of APG and DFC-PROJ-ENS were nearly identical.\nOur timing results, presented in Figure 2, illustrate a near-linear speed-up for MC and a superlinear\nspeed-up for RMF across varying matrix sizes. 
Note that the timing curves of the DFC algorithms\nand PARTITION all overlap, a fact that highlights the minimal computational cost of the final matrix\napproximation step.\n\n\u00b2 In the left-hand plot of Figure 1, the lines for Proj-10% and Proj-Ens-10% overlap.\n\n[Figure 2 panels: MC and RMF, each plotting time (s) against matrix size m for Part-10%, Proj-10%, Nys-10%, Proj-Ens-10%, Nys-Ens-10%, and the base algorithm.]\n\nFigure 2: Speed-up of DFC relative to base algorithms.\n\n3.2 Collaborative Filtering\nCollaborative filtering for recommender systems is one prevalent real-world application of noisy\nmatrix completion. A collaborative filtering dataset can be interpreted as the incomplete observation\nof a ratings matrix with columns corresponding to users and rows corresponding to items. The goal\nis to infer the unobserved entries of this ratings matrix. We evaluate DFC on two of the largest\npublicly available collaborative filtering datasets: MovieLens 10M\u00b3 (m = 4K, n = 6K, s > 10M)\nand the Netflix Prize dataset\u2074 (m = 18K, n = 480K, s > 100M). To generate test sets drawn\nfrom the training distribution, for each dataset, we aggregated all available rating data into a single\ntraining set and withheld test entries uniformly at random, while ensuring that at least one training\nobservation remained in each row and column. The algorithms were then run on the remaining\ntraining portions and evaluated on the test portions of each split. The results, averaged over three\ntrain-test splits, are summarized in Table 1. 
Notably, DFC-PROJ, DFC-PROJ-ENS, and DFC-NYS-ENS\nall outperform PARTITION, and DFC-PROJ-ENS performs comparably to APG while\nproviding a nearly linear parallel time speed-up. The poorer performance of DFC-NYS can be in\npart explained by the asymmetry of these problems. Since these matrices have many more columns\nthan rows, MF on column submatrices is inherently easier than MF on row submatrices, and for\nDFC-NYS, we observe that \u02c6C is an accurate estimate while \u02c6R is not.\n\nTable 1: Performance of DFC relative to APG on collaborative filtering tasks.\n\nMethod | MovieLens 10M RMSE | MovieLens 10M Time | Netflix RMSE | Netflix Time\nAPG | 0.8005 | 294.3s | 0.8433 | 2653.1s\nPARTITION-25% | 0.8146 | 77.4s | 0.8451 | 689.1s\nPARTITION-10% | 0.8461 | 36.0s | 0.8492 | 289.2s\nDFC-NYS-25% | 0.8449 | 77.2s | 0.8832 | 890.9s\nDFC-NYS-10% | 0.8769 | 53.4s | 0.9224 | 487.6s\nDFC-NYS-ENS-25% | 0.8085 | 84.5s | 0.8486 | 964.3s\nDFC-NYS-ENS-10% | 0.8327 | 63.9s | 0.8613 | 546.2s\nDFC-PROJ-25% | 0.8061 | 77.4s | 0.8436 | 689.5s\nDFC-PROJ-10% | 0.8272 | 36.1s | 0.8484 | 289.7s\nDFC-PROJ-ENS-25% | 0.7944 | 77.4s | 0.8411 | 689.5s\nDFC-PROJ-ENS-10% | 0.8119 | 36.1s | 0.8433 | 289.7s\n\n\u00b3 http://www.grouplens.org/\n\u2074 http://www.netflixprize.com/\n\n3.3 Background Modeling\nBackground modeling has important practical ramifications for detecting activity in surveillance\nvideo. This problem can be framed as an application of noisy RMF, where each video frame is\na column of some matrix (M), the background model is low-rank (L0), and moving objects and\nbackground variations, e.g., changes in illumination, are outliers (S0). We evaluate DFC on two\nvideos: \u2018Hall\u2019 (200 frames of size 176 \u00d7 144) contains significant foreground variation and was\nstudied by [2], while \u2018Lobby\u2019 (1546 frames of size 168 \u00d7 120) includes many changes in illumination\n(a smaller video with 250 frames was studied by [2]). 
We focused on DFC-PROJ-ENS, due to its\nsuperior performance in previous experiments, and measured the RMSE between the background\nmodel recovered by DFC and that of APG. On both videos, DFC-PROJ-ENS recovered nearly the\nsame background model as the full APG algorithm in a small fraction of the time. On \u2018Hall,\u2019 the\nDFC-PROJ-ENS-5% and DFC-PROJ-ENS-0.5% models exhibited RMSEs of 0.564 and 1.55, quite\nsmall given pixels with 256 intensity values. The associated runtime was reduced from 342.5s for\nAPG to real-time (5.2s for a 13s video) for DFC-PROJ-ENS-0.5%. Snapshots of the results are\npresented in Figure 3. On \u2018Lobby,\u2019 the RMSE of DFC-PROJ-ENS-4% was 0.64, and the speed-up\nover APG was more than 20X, i.e., the runtime was reduced from 16557s to 792s.\n\n[Figure 3 panels: original frame; APG (342.5s); 5% sampled (24.2s); 0.5% sampled (5.2s).]\n\nFigure 3: Sample \u2018Hall\u2019 recovery by APG, DFC-PROJ-ENS-5%, and DFC-PROJ-ENS-0.5%.\n\n4 Theoretical Analysis\nHaving investigated the empirical advantages of DFC, we now show that DFC admits high-probability\nrecovery guarantees comparable to those of its base algorithm.\n4.1 Matrix Coherence\nSince not all matrices can be recovered from missing entries or gross outliers, recent theoretical\nadvances have studied sufficient conditions for accurate noisy MC [3, 12, 20] and RMF [1, 25].\nMost prevalent among these are matrix coherence conditions, which limit the extent to which the\nsingular vectors of a matrix are correlated with the standard basis. Letting ei be the ith column of\nthe standard basis, we define two standard notions of coherence [22]:\nDefinition 1 (\u00b50-Coherence). Let V \u2208 Rn\u00d7r contain orthonormal columns with r \u2264 n. Then the\n\u00b50-coherence of V is:\n\n\u00b50(V) \u225c (n/r) max1\u2264i\u2264n \u2016PV ei\u2016\u00b2 = (n/r) max1\u2264i\u2264n \u2016V(i)\u2016\u00b2.\n\nDefinition 2 (\u00b51-Coherence). Let L \u2208 Rm\u00d7n have rank r. Then, the \u00b51-coherence of L is:\n\n\u00b51(L) \u225c \u221a(mn/r) maxij |ei\u22a4 UL VL\u22a4 ej|.\n\nFor any \u00b5 > 0, we will call a matrix L (\u00b5, r)-coherent if rank(L) = r, max(\u00b50(UL), \u00b50(VL)) \u2264\n\u00b5, and \u00b51(L) \u2264 \u221a\u00b5. Our analysis will focus on base MC and RMF algorithms that express their\nrecovery guarantees in terms of the (\u00b5, r)-coherence of the target low-rank matrix L0. For such\nalgorithms, lower values of \u00b5 correspond to better recovery properties.\n4.2 DFC Master Theorem\nWe now show that the same coherence conditions that allow for accurate MC and RMF also imply\nhigh-probability recovery for DFC. To make this precise, we let M = L0 + S0 + Z0 \u2208 Rm\u00d7n,\nwhere L0 is (\u00b5, r)-coherent and \u2016P\u2126(Z0)\u2016F \u2264 \u2206. We further fix any \u03b5, \u03b4 \u2208 (0, 1] and define A(X)\nas the event that a matrix X is (r\u00b5\u00b2/(1 \u2212 \u03b5/2), r)-coherent. Then, our Thm. 3 provides a generic recovery\nbound for DFC when used in combination with an arbitrary base algorithm. The proof requires a\nnovel, coherence-based analysis of column projection and random column sampling. These results\nof independent interest are presented in Appendix A.\n\nTheorem 3. Choose t = n/l and l \u2265 cr\u00b5 log(n) log(2/\u03b4)/\u03b5\u00b2, where c is a fixed positive constant,\nand fix any c_e \u2265 0. 
Under the notation of Algorithm 1, if a base MF algorithm yields\nP(\u2016C0,i \u2212 \u02c6Ci\u2016F > c_e \u221a(ml) \u2206 | A(C0,i)) \u2264 \u03b4C for each i, where C0,i is the corresponding partition\nof L0, then, with probability at least (1 \u2212 \u03b4)(1 \u2212 t\u03b4C), DFC-PROJ guarantees\n\n\u2016L0 \u2212 \u02c6Lproj\u2016F \u2264 (2 + \u03b5) c_e \u221a(mn) \u2206.\n\nUnder Algorithm 2, if a base MF algorithm yields P(\u2016C0 \u2212 \u02c6C\u2016F > c_e \u221a(ml) \u2206 | A(C)) \u2264 \u03b4C\nand P(\u2016R0 \u2212 \u02c6R\u2016F > c_e \u221a(dn) \u2206 | A(R)) \u2264 \u03b4R for d \u2265 cl\u00b50(\u02c6C) log(m) log(1/\u03b4)/\u03b5\u00b2, then, with\nprobability at least (1 \u2212 \u03b4)\u00b2(1 \u2212 \u03b4C \u2212 \u03b4R), DFC-NYS guarantees\n\n\u2016L0 \u2212 \u02c6Lnys\u2016F \u2264 (2 + 3\u03b5) c_e \u221a(ml + dn) \u2206.\n\nTo understand the conclusions of Thm. 3, consider a typical base algorithm which, when applied to\nP\u2126(M), recovers an estimate \u02c6L satisfying \u2016L0 \u2212 \u02c6L\u2016F \u2264 c_e \u221a(mn) \u2206 with high probability. Thm. 3\nasserts that, with appropriately reduced probability, DFC-PROJ exhibits the same recovery error\nscaled by an adjustable factor of 2 + \u03b5, while DFC-NYS exhibits a somewhat smaller error scaled by\n2 + 3\u03b5.\u2075 The key take-away then is that DFC introduces a controlled increase in error and a controlled\ndecrement in the probability of success, allowing the user to interpolate between maximum speed\nand maximum accuracy. Thus, DFC can quickly provide near-optimal recovery in the noisy setting\nand exact recovery in the noiseless setting (\u2206 = 0), even when entries are missing or grossly\ncorrupted. The next two sections demonstrate how Thm. 3 can be applied to derive specific DFC\nrecovery guarantees for noisy MC and noisy RMF. In these sections, we let \u00afn \u225c max(m, n).\n4.3 Consequences for Noisy MC\nOur first corollary of Thm. 
3 shows that DFC retains the high-probability recovery guarantees of a\nstandard MC solver while operating on matrices of much smaller dimension. Suppose that a base\nMC algorithm solves the following convex optimization problem, studied in [3]:\n\nminimizeL \u2016L\u2016\u2217 subject to \u2016P\u2126(M \u2212 L)\u2016F \u2264 \u2206.\n\nThen, Cor. 4 follows from a novel guarantee for noisy convex MC, proved in the appendix.\nCorollary 4. Suppose that L0 is (\u00b5, r)-coherent and that s entries of M are observed, with locations\n\u2126 distributed uniformly. Define the oversampling parameter\n\n\u03b2s \u225c s(1 \u2212 \u03b5/2) / (32\u00b5\u00b2r\u00b2(m + n) log\u00b2(m + n)),\n\nand fix any target rate parameter 1 < \u03b2 \u2264 \u03b2s. Then, if \u2016P\u2126(M) \u2212 P\u2126(L0)\u2016F \u2264 \u2206 a.s., it suffices\nto choose t = n/l and\n\nl \u2265 max(n\u03b2/\u03b2s + \u221a(n(\u03b2 \u2212 1)/\u03b2s), cr\u00b5 log(n) log(2/\u03b4)/\u03b5\u00b2),\nd \u2265 max(m\u03b2/\u03b2s + \u221a(m(\u03b2 \u2212 1)/\u03b2s), cl\u00b50(\u02c6C) log(m) log(1/\u03b4)/\u03b5\u00b2)\n\nto achieve\n\nDFC-PROJ: \u2016L0 \u2212 \u02c6Lproj\u2016F \u2264 (2 + \u03b5) c_e\u2032 \u221a(mn) \u2206\nDFC-NYS: \u2016L0 \u2212 \u02c6Lnys\u2016F \u2264 (2 + 3\u03b5) c_e\u2032 \u221a(ml + dn) \u2206\n\nwith probability at least\n\nDFC-PROJ: (1 \u2212 \u03b4)(1 \u2212 5t log(\u00afn)\u00afn^(2\u22122\u03b2)) \u2265 (1 \u2212 \u03b4)(1 \u2212 \u00afn^(3\u22122\u03b2))\nDFC-NYS: (1 \u2212 \u03b4)\u00b2(1 \u2212 10 log(\u00afn)\u00afn^(2\u22122\u03b2)),\n\nrespectively, with c as in Thm. 3 and c_e\u2032 a positive constant.\n\n\u2075 Note that the DFC-NYS guarantee requires the number of rows sampled to grow in proportion to \u00b50(\u02c6C),\na quantity always bounded by \u00b5 in our simulations.\n\nNotably, Cor. 4 allows for the fraction of columns and rows sampled to decrease as the oversampling\nparameter \u03b2s increases with m and n. In the best case, \u03b2s = \u0398(mn/[(m + n) log\u00b2(m + n)]), and\nCor. 4 requires only O((n/m) log\u00b2(m + n)) sampled columns and O((m/n) log\u00b2(m + n)) sampled rows. In\nthe worst case, \u03b2s = \u0398(1), and Cor. 4 requires the number of sampled columns and rows to grow\nlinearly with the matrix dimensions. As a more realistic intermediate scenario, consider the setting\nin which \u03b2s = \u0398(\u221a(m + n)) and thus a vanishing fraction of entries are revealed. In this setting,\nonly O(\u221a(m + n)) columns and rows are required by Cor. 4.\n4.4 Consequences for Noisy RMF\nOur next corollary shows that DFC retains the high-probability recovery guarantees of a standard\nRMF solver while operating on matrices of much smaller dimension. Suppose that a base RMF\nalgorithm solves the following convex optimization problem, studied in [25]:\n\nminimizeL,S \u2016L\u2016\u2217 + \u03bb\u2016S\u20161 subject to \u2016M \u2212 L \u2212 S\u2016F \u2264 \u2206,\n\nwith \u03bb = 1/\u221a\u00afn. Then, Cor. 5 follows from Thm. 3 and the noisy RMF guarantee of [25, Thm. 2].\nCorollary 5. Suppose that L0 is (\u00b5, r)-coherent and that the uniformly distributed support set of\nS0 has cardinality s. For a fixed positive constant \u03c1s, define the undersampling parameter\n\n\u03b2s \u225c (1 \u2212 s/(mn))/\u03c1s,\n\nand fix any target rate parameter \u03b2 > 2 with rescaling \u03b2\u2032 \u225c \u03b2 log(\u00afn)/log(m) satisfying 4\u03b2s \u2212\n3/\u03c1s \u2264 \u03b2\u2032 \u2264 \u03b2s. Then, if \u2016M \u2212 L0 \u2212 S0\u2016F \u2264 \u2206 a.s., it suffices to choose t = n/l and\n\nl \u2265 max(r\u00b2\u00b5\u00b2 log\u00b2(\u00afn)/((1 \u2212 \u03b5/2)\u03c1r), 4 log(\u00afn)\u03b2(1 \u2212 \u03c1s\u03b2s)/(m(\u03c1s\u03b2s \u2212 \u03c1s\u03b2\u2032)\u00b2), cr\u00b5 log(n) log(2/\u03b4)/\u03b5\u00b2),\nd \u2265 max(r\u00b2\u00b5\u00b2 log\u00b2(\u00afn)/((1 \u2212 \u03b5/2)\u03c1r), 4 log(\u00afn)\u03b2(1 \u2212 \u03c1s\u03b2s)/(n(\u03c1s\u03b2s \u2212 \u03c1s\u03b2\u2032)\u00b2), cl\u00b50(\u02c6C) log(m) log(1/\u03b4)/\u03b5\u00b2)\n\nto have\n\nDFC-PROJ: \u2016L0 \u2212 \u02c6Lproj\u2016F \u2264 (2 + \u03b5) c_e\u2033 \u221a(mn) \u2206\nDFC-NYS: \u2016L0 \u2212 \u02c6Lnys\u2016F \u2264 (2 + 3\u03b5) c_e\u2033 \u221a(ml + dn) \u2206\n\nwith probability at least\n\nDFC-PROJ: (1 \u2212 \u03b4)(1 \u2212 tcp\u00afn^(\u2212\u03b2)) \u2265 (1 \u2212 \u03b4)(1 \u2212 cp\u00afn^(1\u2212\u03b2))\nDFC-NYS: (1 \u2212 \u03b4)\u00b2(1 \u2212 2cp\u00afn^(\u2212\u03b2)),\n\nrespectively, with c as in Thm. 3 and \u03c1r, c_e\u2033, and cp positive constants.\nNote that Cor. 5 places only very mild restrictions on the number of columns and rows to be sampled.\nIndeed, l and d need only grow poly-logarithmically in the matrix dimensions to achieve high-probability\nnoisy recovery.\n5 Conclusions\nTo improve the scalability of existing matrix factorization algorithms while leveraging the ubiquity\nof parallel computing architectures, we introduced, evaluated, and analyzed DFC, a divide-and-conquer\nframework for noisy matrix factorization with missing entries or outliers. We note that the\ncontemporaneous work of [19] addresses the computational burden of noiseless RMF by reformulating\na standard convex optimization problem to internally incorporate random projections. The\ndifferences between DFC and the approach of [19] highlight some of the main advantages of this\nwork: i) DFC can be used in combination with any underlying MF algorithm, ii) DFC is trivially\nparallelized, and iii) DFC provably maintains the recovery guarantees of its base algorithm, even in\nthe presence of noise.\n\nReferences\n[1] A. Agarwal, S. Negahban, and M. J. Wainwright. 
Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. In International Conference on Machine Learning, 2011.\n\n[2] E. J. Cand\u00e8s, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3):1\u201337, 2011.\n\n[3] E. J. Cand\u00e8s and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925\u2013936, 2010.\n\n[4] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Sparse and low-rank matrix decompositions. In Allerton Conference on Communication, Control, and Computing, 2009.\n\n[5] Y. Chen, H. Xu, C. Caramanis, and S. Sanghavi. Robust matrix completion and corrupted columns. In International Conference on Machine Learning, 2011.\n\n[6] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844\u2013881, 2008.\n\n[7] A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. In Foundations of Computer Science, 1998.\n\n[8] S. A. Goreinov, E. E. Tyrtyshnikov, and N. L. Zamarashkin. A theory of pseudoskeleton approximations. Linear Algebra and its Applications, 261(1-3):1\u201321, 1997.\n\n[9] D. Gross and V. Nesme. Note on sampling without replacing from a finite collection of matrices. CoRR, abs/1001.2738, 2010.\n\n[10] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13\u201330, 1963.\n\n[11] D. Hsu, S. M. Kakade, and T. Zhang. Dimension-free tail inequalities for sums of random matrices. arXiv:1104.1672v3[math.PR], 2011.\n\n[12] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. Journal of Machine Learning Research, 99:2057\u20132078, 2010.\n\n[13] S. Kumar, M. Mohri, and A. Talwalkar. 
On sampling-based approximate spectral decomposition. In International Conference on Machine Learning, 2009.\n\n[14] S. Kumar, M. Mohri, and A. Talwalkar. Ensemble Nystr\u00f6m method. In NIPS, 2009.\n\n[15] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, and Y. Ma. Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix. UIUC Technical Report UILU-ENG-09-2214, 2009.\n\n[16] S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, 128(1-2):321\u2013353, 2011.\n\n[17] K. Min, Z. Zhang, J. Wright, and Y. Ma. Decomposing background topics from keywords by principal component pursuit. In Conference on Information and Knowledge Management, 2010.\n\n[18] M. Mohri and A. Talwalkar. Can matrix coherence be efficiently and accurately estimated? In Conference on Artificial Intelligence and Statistics, 2011.\n\n[19] Y. Mu, J. Dong, X. Yuan, and S. Yan. Accelerated low-rank visual recovery by random projection. In Conference on Computer Vision and Pattern Recognition, 2011.\n\n[20] S. Negahban and M. J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. arXiv:1009.2118v2[cs.IT], 2010.\n\n[21] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma. Rasl: Robust alignment by sparse and low-rank decomposition for linearly correlated images. In Conference on Computer Vision and Pattern Recognition, 2010.\n\n[22] B. Recht. A simpler approach to matrix completion. arXiv:0910.0651v2[cs.IT], 2009.\n\n[23] K. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pacific Journal of Optimization, 6(3):615\u2013640, 2010.\n\n[24] C. K. Williams and M. Seeger. Using the Nystr\u00f6m method to speed up kernel machines. In NIPS, 2000.\n\n[25] Z. Zhou, X. Li, J. Wright, E. J. Cand\u00e8s, and Y. Ma. Stable principal component pursuit. 
arXiv:1001.2363v1[cs.IT], 2010.\n", "award": [], "sourceid": 669, "authors": [{"given_name": "Lester", "family_name": "Mackey", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}, {"given_name": "Ameet", "family_name": "Talwalkar", "institution": null}]}