{"title": "Regularized Weighted Low Rank Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 4059, "page_last": 4069, "abstract": "The classical low rank approximation problem is to find a rank $k$ matrix $UV$ (where $U$ has $k$ columns and $V$ has $k$ rows) that minimizes the Frobenius norm of $A - UV$. Although this problem can be solved efficiently, we study an NP-hard variant of this problem that involves weights and regularization. A previous paper of [Razenshteyn et al. '16] derived a polynomial time algorithm for weighted low rank approximation with constant rank. We derive provably sharper guarantees for the regularized version by obtaining parameterized complexity bounds in terms of the statistical dimension rather than the rank, allowing for a rank-independent runtime that can be significantly faster. Our improvement comes from applying sharper matrix concentration bounds, using a novel conditioning technique, and proving structural theorems for regularized low rank problems.", "full_text": "Regularized Weighted Low Rank Approximation\n\nFrank Ban\n\nUC Berkeley / Google\nfban@berkeley.edu\n\nDavid Woodruff\n\nCarnegie Mellon University\ndwoodruf@cs.cmu.edu\n\nQiuyi (Richard) Zhang\nUC Berkeley / Google\nqiuyi@berkeley.edu\n\nAbstract\n\nThe classical low rank approximation problem is to \ufb01nd a rank k matrix U V\n(where U has k columns and V has k rows) that minimizes the Frobenius norm of\nA U V . Although this problem can be solved ef\ufb01ciently, we study an NP-hard\nvariant of this problem that involves weights and regularization. A previous paper\nof [Razenshteyn et al. \u201916] derived a polynomial time algorithm for weighted low\nrank approximation with constant rank. 
We derive provably sharper guarantees for\nthe regularized version by obtaining parameterized complexity bounds in terms of\nthe statistical dimension rather than the rank, allowing for a rank-independent\nruntime that can be signi\ufb01cantly faster. Our improvement comes from applying\nsharper matrix concentration bounds, using a novel conditioning technique, and\nproving structural theorems for regularized low rank problems.\n\nIntroduction\n\n1\nIn the weighted low rank approximation problem, one is given a matrix M 2 n\u21e5d, a weight matrix\nW 2 n\u21e5d, and an integer parameter k, and would like to \ufb01nd factors U 2 n\u21e5k and V 2 k\u21e5d\nso as to minimize\n\nkW (M U \u00b7 V )k2\n\nF =\n\nW 2\ni,j(Mi,j hUi,\u21e4, V\u21e4,ji)2,\n\nnXi=1\n\ndXj=1\n\nwhere Ui,\u21e4 denotes the i-th row of U and V\u21e4,j denotes the j-th column of V . W.l.o.g., we assume\nn d. This is a weighted version of the classical low rank approximation problem, which is a\nspecial case when Wi,j = 1 for all i and j. One often considers the approximate version of this\nproblem, for which one is given an approximation parameter \" 2 (0, 1) and would like to \ufb01nd\nU 2 n\u21e5k and V 2 k\u21e5d so that\n(1)\nkW (M U \u00b7 V )k2\n\nU02 n\u21e5k,V 02 k\u21e5d kW (M U0 \u00b7 V 0)k2\nF .\n\nF \uf8ff (1 + \")\n\nmin\n\nWeighted low rank approximation extends the classical low rank approximation problem in many\nways. While in principal component analysis, one typically \ufb01rst subtracts off the mean to make\nthe matrix M have mean 0, this does not \ufb01x the problem of differing variances. Indeed, imagine\none of the columns of M has much larger variance than the others. Then in classical low rank\napproximation with k = 1, it could suf\ufb01ce to simply \ufb01t this single high variance column and ignore\nthe remaining entries of M. 
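To make the weighted objective above concrete, here is a minimal numpy sketch of the cost $\|W \circ (M - UV)\|_F^2$; the function name and the random shapes are illustrative only:

```python
import numpy as np

def weighted_cost(W, M, U, V):
    """Weighted Frobenius cost ||W o (M - U V)||_F^2."""
    R = W * (M - U @ V)   # elementwise (Hadamard) weighting of the residual
    return np.sum(R ** 2)

# With all weights equal to 1 this reduces to the classical
# low rank approximation objective ||M - U V||_F^2.
rng = np.random.default_rng(0)
n, d, k = 6, 5, 2
M = rng.standard_normal((n, d))
U = rng.standard_normal((n, k))
V = rng.standard_normal((k, d))
assert np.isclose(weighted_cost(np.ones((n, d)), M, U, V),
                  np.linalg.norm(M - U @ V, 'fro') ** 2)
```
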
Weighted low rank approximation, on the other hand, can correct for this\nby re-weighting each of the entries of M to give them similar variance; this allows for the low rank\napproximation U \u00b7 V to capture more of the remaining data. This technique is often used in gene\nexpression analysis and co-occurrence matrices; we refer the reader to [SJ03] and the Wikipedia\nentry on weighted low rank approximation1. The well-studied problem of matrix completion is\n\n1https://en.wikipedia.org/wiki/Low-rank_approximation#Weighted_low-rank_\n\napproximation_problems\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\falso a special case of weighted low rank approximation, where the entries Wi,j are binary, and has\nnumerous applications in recommendation systems and other settings with missing data.\nUnlike its classical variant, weighted low rank approximation is known to be NP-hard [GG11].\nClassical low rank approximation can be solved quickly via the singular value decomposition, which\nis often sped up with sketching techniques [Woo14, PW15, TYUC17]. However, in the weighted\nsetting, under a standard complexity-theoretic assumption known as the Random Exponential Time\nHypothesis (see, e.g., Assumption 1.3 in [RSW16] for a discussion), there is a \ufb01xed constant \"0 2\n(0, 1) for which any algorithm achieving (1) with constant probability and for \" = \"0, and even\nfor k = 1, requires 2\u2326(r) time, where r is the number of distinct columns of the weight matrix W .\nFurthermore, as shown in Theorem 1.4 of [RSW16], this holds even if W has both at most r distinct\nrows and r distinct columns.\nDespite the above hardness, in a number of applications the parameter r may be small. Indeed, in\na matrix in which the rows correspond to users and the columns correspond to ratings of a movie,\nsuch as in the Net\ufb02ix matrix, one may have a small number of categories of movies. 
In this case,\none may want to use the same column in W for each movie in the same category. It may thus\nmake sense to renormalize user ratings based on the category of movies being watched. Note that\nany number of distinct rows of W is possible here, as different users may have completely different\nratings, but there is just one distinct column of W per category of movie. In some settings one may\nsimultaneously have a small number of distinct rows and a small number of distinct columns. This\nmay occur if say, the users are also categorized into a small number of groups. For example, the\nusers may be grouped by age and one may want to weight ratings of different categories of movies\nbased on age. That is, maybe cartoon ratings of younger users should be given higher weight, while\nhistorical \ufb01lms rated by older users should be given higher weight.\nMotivated by such applications when r is small, [RSW16] propose several parameterized complexity\nalgorithms. They show that in the case that W has at most r distinct rows and r distinct columns,\nthere is an algorithm solving (1) in 2O(k2r/\")poly(n) time. If W has at most r distinct columns but\nany number of distinct rows, there is an algorithm achieving (1) in 2O(k2r2/\")poly(n) time. Note\nthat these bounds imply that for constant k and \", even if r is as large as \u21e5(log n) in the \ufb01rst case,\nand \u21e5(plog n) in the second case, the corresponding algorithm is polynomial time.\nIn [RSW16], the authors also consider the case when the rank of the weight matrix W is at most r,\nwhich includes the r distinct rows and columns, as well as the r distinct column settings above, as\nspecial cases. In this case the authors achieve an nO(k2r/\") time algorithm. 
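As a small illustration of the low-$r$ regime described above, one can build a weight matrix with one distinct column per movie category; the category labels and per-category weights below are made up for the example:

```python
import numpy as np

n_users, n_movies = 5, 6
# hypothetical category of each movie and a weight per category
category = np.array([0, 1, 0, 2, 1, 2])
cat_weight = np.array([1.0, 0.5, 2.0])

# every movie in the same category gets the same column of W,
# so W has (at most) one distinct column per category
W = np.tile(cat_weight[category], (n_users, 1))

distinct_cols = {tuple(W[:, j]) for j in range(n_movies)}
assert len(distinct_cols) == 3   # r = number of categories
```
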
Note that this is only polynomial time if $k$, $r$, and $\varepsilon$ are each fixed constants, and unlike the algorithms for the other two settings, this algorithm is not fixed-parameter tractable, meaning its running time cannot be written as $f(k, r, 1/\varepsilon) \cdot \mathrm{poly}(nd)$, where $f$ is a function independent of $n$ and $d$.
There are also other algorithms for weighted low rank approximation, though they do not have provable guarantees, or they require strong assumptions on the input. There are gradient-based approaches of Shpak [Shp90] and alternating minimization approaches of Lu et al. [LPW97, LA03], which were refined and used in practice by Srebro and Jaakkola [SJ03]. However, none of these has provable guarantees. While there is some work that does have provable guarantees, it makes incoherence assumptions on the low rank factors of $M$, as well as the assumptions that the weight matrix $W$ is spectrally close to the all ones matrix [LLR16] and that there are no 0 weights.

1.1 Our Contributions

The only algorithms with provable guarantees that do not make assumptions on the inputs are slow, and inherently so given the above hardness results. Motivated by this and the widespread use of regularization in machine learning, we propose to look at provable guarantees for regularized weighted low rank approximation. In one version of this problem, where the parameter $r$ corresponds to the rank of the weight matrix $W$, we are given a matrix $M \in \mathbb{R}^{n \times d}$, a weight matrix $W \in \mathbb{R}^{n \times d}$ with rank $r$, and a target integer $k > 0$, and we consider the problem
$$\min_{U \in \mathbb{R}^{n \times k},\, V \in \mathbb{R}^{k \times d}} \|W \circ (UV - M)\|_F^2 + \lambda\|U\|_F^2 + \lambda\|V\|_F^2.$$
Let $U^*, V^*$ minimize $\|W \circ (UV - M)\|_F^2 + \lambda\|U\|_F^2 + \lambda\|V\|_F^2$ and let OPT be the minimum value.

Regularization is a common technique to avoid overfitting and to solve an ill-posed problem.
It has been applied in the context of weighted low rank approximation [DN11], though so far the only such results known for weighted low rank approximation with regularization are heuristic. In this paper we give the first provable bounds, without any assumptions on the input, for regularized weighted low rank approximation.
Importantly, we show that regularization improves our running times for weighted low rank approximation, as specified below. Intuitively, the complexity of regularized problems depends on the "statistical dimension" or "effective dimension" of the underlying problem, which is often significantly smaller than the number of parameters in the regularized setting.
Let $U^*$ and $V^*$ denote the optimal low-rank matrix approximation factors, let $D_{W_{i,:}}$ denote the diagonal matrix with the $i$-th row of $W$ along the diagonal, and let $D_{W_{:,j}}$ denote the diagonal matrix with the $j$-th column of $W$ along the diagonal.
Improving the Exponent: We first show how to improve the $n^{O(k^2 r/\varepsilon)}$ time algorithm of [RSW16] to a running time of $n^{O((s + \log(1/\varepsilon))rk/\varepsilon)}$. Here $s$ is defined to be the maximum statistical dimension of $V^* D_{W_{i,:}}$ and $D_{W_{:,j}} U^*$, over all $i = 1, \ldots, n$ and $j = 1, \ldots, d$, where the statistical dimension of a matrix $M$ is:
Definition 1. Let $\mathrm{sd}_\lambda(M) = \sum_i 1/(1 + \lambda/\sigma_i^2)$ denote the statistical dimension of $M$ with regularizing weight $\lambda$ (here the $\sigma_i$ are the singular values of $M$).
Note that this maximum value $s$ is always at most $k$, and for any $s \geq \log(1/\varepsilon)$ our bound directly improves upon the previous time bound. Our improvement requires us to sketch matrices with $k$ columns down to $s/\varepsilon$ rows, where $s/\varepsilon$ is potentially smaller than $k$.
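The statistical dimension of Definition 1 is computed directly from the singular values; a minimal numpy sketch (the test matrix and $\lambda$ values are arbitrary):

```python
import numpy as np

def statistical_dimension(M, lam):
    """sd_lambda(M) = sum_i 1 / (1 + lam / sigma_i^2)."""
    sigma = np.linalg.svd(M, compute_uv=False)
    sigma = sigma[sigma > 0]   # zero singular values contribute nothing
    return np.sum(1.0 / (1.0 + lam / sigma ** 2))

M = np.diag([100.0, 1.0, 0.1])
# with lam = 0 the statistical dimension equals the rank
assert np.isclose(statistical_dimension(M, 0.0), 3.0)
# one dominant singular value + regularization => sd close to 1
assert statistical_dimension(M, 10.0) < 1.2
```

This makes the point in the text concrete: for a matrix with one dominant singular value, $\mathrm{sd}_\lambda$ can be near 1 even though the rank is much larger.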
This is particularly interesting since most previous provable sketching results for low rank approximation cannot have sketch sizes smaller than the rank, as following the standard analysis would lead to solving a regression problem on a matrix with fewer rows than columns.
Thus, we introduce the notion of an upper and a lower distortion factor ($K_S$ and $\kappa_S$ below) and show that the lower distortion factor satisfies tail bounds only on a smaller-rank subspace of size $s/\varepsilon$, which can be smaller than $k$. Directly following the analysis of [RSW16] would cause the lower distortion factor to be infinite. The upper distortion factor also satisfies tail bounds via a more powerful matrix concentration result not used previously. Furthermore, we apply a novel conditioning technique that conditions on the product of the upper and lower distortion factors on separate subspaces, whereas previous work only conditions on the condition number of a specific subspace.
We next considerably strengthen the above result by showing an $n^{O(r^2(s + \log(1/\varepsilon))^2/\varepsilon^2)}$ time algorithm. This shows that the rank $k$ need not appear in the exponent of the algorithm at all! We do this via a novel projection argument in the objective (sketching on the right), which was not done in [RSW16] and which also improves a previous result for the classical setting in [ACW17]. To gain some perspective on this result, suppose $\varepsilon$ is a large constant, close to 1, and $r$ is a small constant. Then our algorithm runs in $n^{O(s^2)}$ time, as opposed to the algorithm of [RSW16], which runs in $n^{O(k^2)}$ time. We stress that in a number of applications, the effective dimension $s$ may be a very small constant, close to 1, even though the rank parameter $k$ can be considerably larger.
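The distortion factors $K_S$ and $\kappa_S$ of a sketch $S$ on the column space of a matrix $M$ can be estimated numerically: as $v$ ranges over all vectors, $\|SMv\|/\|Mv\|$ ranges over the singular values of $SQ$, where $Q$ is an orthonormal basis for the column space of $M$. A small sketch of this calculation (Gaussian $S$ with variance-$1/\ell$ entries; all sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, ell = 200, 5, 40

M = rng.standard_normal((n, k))
S = rng.standard_normal((ell, n)) / np.sqrt(ell)  # Gaussian sketch

Q, _ = np.linalg.qr(M)                   # orthonormal basis of col(M)
sv = np.linalg.svd(S @ Q, compute_uv=False)
K_S, kappa_S = sv.max(), sv.min()        # upper / lower distortion on col(M)
c_S = max(K_S, 1.0 / kappa_S)            # c_S(M) = max(K_S, 1/kappa_S)

assert 0 < kappa_S <= K_S
```
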
This occurs, for example,\nif there is a single dominant singular value, or if the singular values are geometrically decaying.\nConcretely, it is realistic that k could be \u21e5(log n), while s = \u21e5(1), in which case our algorithm is\nthe \ufb01rst polynomial time algorithm for this setting.\nImproving the Base: We can further optimize by removing our dependence on n in the base. The\nnon-negative rank of a n \u21e5 d matrix A is de\ufb01ned to be the least r such that there exist factors\nU 2 Rn\u21e5r and V 2 Rr\u21e5d where A = U \u00b7 V and both U and V have non-negative entries. By\napplying a novel rounding procedure, if in addition the non-negative rank of W is at most r0, then\nwe can obtain a \ufb01xed-parameter tractable algorithm running in time 2r0r2(s+log(1/\"))2/\"2)poly(n).\nNote that r \uf8ff r0, where r is the rank of W . Note also that if W has at most r distinct rows or\ncolumns, then its non-negative rank is also at most r since we can replace the entries of W with\ntheir absolute values without changing the objective function, while still preserving the property of\nat most r distinct rows and/or columns. Consequently, we signi\ufb01cantly improve the algorithms for\na small number of distinct rows and/or columns of [RSW16], as our exponent is independent of k.\n\n3\n\n\fThus, even if k =\u21e5( n) but the statistical dimension s = O(plog n), for constant r0 and \" our\nalgorithm is polynomial time, while the best previous algorithm would be exponential time.\nWe also give ways, other than non-negative rank, for improving the running time. Supposing that\nthe rank of W is r again, we apply iterative techniques in linear system solving like Richardson\u2019s\nIteration and preconditioning to further improve the running time. 
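For the non-negative-rank observation above: if $W \geq 0$ has $r$ distinct rows, an explicit non-negative factorization $W = YZ$ with inner dimension $r$ can be written down directly, certifying $\mathrm{nnr}(W) \leq r$. A small sketch with a made-up $W$:

```python
import numpy as np

# weight matrix with only r = 2 distinct rows
W = np.array([[1.0, 0.5, 0.5],
              [2.0, 0.1, 3.0],
              [1.0, 0.5, 0.5],
              [2.0, 0.1, 3.0]])

Z, inv = np.unique(W, axis=0, return_inverse=True)  # the r distinct rows
inv = inv.ravel()
Y = np.eye(Z.shape[0])[inv]   # 0/1 selector matrix, n x r

# both factors are non-negative, so nnr(W) <= r
assert np.all(Y >= 0) and np.all(Z >= 0)
assert np.allclose(W, Y @ Z)
```
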
We are able to show that instead\nof an npoly(rs/\") time algorithm, we are able to obtain algorithms that have running time roughly\n(2/)poly(rs/\")poly(n) or (uW /lW )poly(rs/\")poly(n), where 2 is de\ufb01ned to be the maximum\nsingular value of V \u21e4DWi,: and DW:,j U\u21e4, over all i = 1, . . . , n, and j = 1, . . . , d, while uW is\nde\ufb01ned to be the maximum absolute value of an entry of W and lW the minimum absolute value of\nan entry. In a number of settings one may have 2/ = O(1) or uW /lW = O(1) in which case we\nagain obtain \ufb01xed parameter tractable algorithms.\nEmpirical Evaluation: Finally, we give an empirical evaluation of our results. While the main goal\nof our work is to obtain the \ufb01rst algorithms with provable guarantees for regularized weighted low\nrank approximation, we can also use them to guide heuristics in practice. In particular, alternating\nminimization is a common heuristic for weighted low rank approximation. We consider a sketched\nversion of alternating minimization to speed up each iteration. We show that in the regularized case,\nthe dimension of the sketch can be signi\ufb01cantly smaller if the statistical dimension is small, which\nis consistent with our theoretical results.\n\n2 Preliminaries\n\nWe let k\u00b7kF denote the Frobenius norm of a matrix and let be the elementwise matrix multiplication\noperator. We denote x 2 [a, b] y if ay \uf8ff x \uf8ff by. For a matrix M, let Mi,; denote its ith row and\nlet M;,j denote its jth column. For v 2 n, let Dv denote the n \u21e5 n diagonal matrix with its i-th\ndiagonal entry equal to vi. For a matrix M with non-negative Mij, let nnr(M ) denote the non-\nnegative rank of M. Let sr(M ) = kMk2\nF /kMk2 denote the stable rank of M. Let D denote a\ndistribution over r \u21e5 n matrices; in our setting, there are matrices with entries that are Gaussian\nrandom variables with mean 0 and variance 1/r (or r \u21e5 n CountSketch matrices [Woo14]).\nDe\ufb01nition 2. 
For $S$ sampled from a distribution of matrices $\mathcal{D}$ and a matrix $M$ with $n$ rows, let $c_S(M) \geq 1$ denote the smallest (possibly infinite) number such that $\|SMv\|^2 \in [c_S(M)^{-1}, c_S(M)] \cdot \|Mv\|^2$ for all $v$.
Definition 3. For $S$ sampled from a distribution of matrices $\mathcal{D}$ and a matrix $M$, let $K_S(M) \geq 1$ denote the smallest number such that $\|SMv\| \leq K_S(M)\|Mv\|$ for all $v$.
Definition 4. For $S$ sampled from a distribution of matrices $\mathcal{D}$ and a matrix $M$, let $\kappa_S(M) \leq 1$ denote the largest number such that $\|SMv\| \geq \kappa_S(M)\|Mv\|$ for all $v$.
Note that by definition, $c_S(M)$ equals the maximum of $K_S(M)$ and $1/\kappa_S(M)$. We define the condition number of a matrix $A$ to be $c_A = K_A(I)/\kappa_A(I)$.

2.1 Previous Techniques

Building upon the initial framework established in [RSW16], we apply a polynomial system solver to solve weighted regularized low rank approximation to high accuracy. By applying standard sketching guarantees, the number of variables $v$ can be made a polynomial function of $k$, $1/\varepsilon$, and $r$ that is independent of $n$.
Theorem 1 ([Ren92a, Ren92b, BPR96]). Given a real polynomial system $P(x_1, x_2, \ldots, x_v)$ having $v$ variables and $m$ polynomial constraints $f_i(x_1, \ldots, x_v) \,\Delta_i\, 0$, where $\Delta_i \in \{\geq, =, \leq\}$, if $d$ is the maximum degree of all polynomials and $H$ is the maximum bitsize of the coefficients of the polynomials, one can determine if there exists a solution to $P$ in $(md)^{O(v)} \mathrm{poly}(H)$ time.

Intuitively, the addition of regularization requires us to preserve only directions with high spectral weight in order to preserve our low rank approximation well enough. The dimension of the subspace spanned by these important directions is exactly the statistical dimension of the problem, allowing us to sketch to a size less than $k$ that still provably preserves our low rank approximation well enough. In line with this intuition, we use an important lemma from [CNW16].

Lemma 2.1.
Let $A, B$ be matrices with $n$ rows, and let $S$, sampled from $\mathcal{D}$, have $\ell = \Omega(\gamma^{-2}(K + \log(1/\varepsilon)))$ rows and $n$ columns. Then
$$\Pr\left[\|A^T S^T S B - A^T B\| > \gamma \sqrt{(\|A\|^2 + \|A\|_F^2/K)(\|B\|^2 + \|B\|_F^2/K)}\right] < \varepsilon.$$
In particular, if we choose $K > \Omega(\mathrm{sr}(A) + \mathrm{sr}(B))$, then we have, for some small constant $\gamma_0$,
$$\Pr\left[\|A^T S^T S B - A^T B\| > \gamma_0 \|A\|\|B\|\right] < \varepsilon.$$

3 Multiple Regression Sketches

In this section, we prove our main structural theorem, which allows us to sketch regression matrices down to the size of the statistical dimension of the matrices while maintaining provable guarantees. Specifically, to approximately solve a sum of regression problems, we are able to reduce the dimension of the problem to the maximum statistical dimension of the regression matrices.
Theorem 2. Let $M^{(1)}, \ldots, M^{(d)} \in \mathbb{R}^{n \times k}$ and let $b^{(1)}, \ldots, b^{(d)} \in \mathbb{R}^n$ be column vectors. Let $S \in \mathbb{R}^{\ell \times n}$ be sampled from $\mathcal{D}$ with $\ell = \Theta(\frac{1}{\varepsilon}(s + \log(1/\varepsilon)))$ and $s = \max_i\{\mathrm{sd}_\lambda(M^{(i)})\}$. Define $x^{(i)} = \mathrm{argmin}_x \|M^{(i)}x - b^{(i)}\|^2 + \lambda\|x\|^2$ and $y^{(i)} = \mathrm{argmin}_y \|S(M^{(i)}y - b^{(i)})\|^2 + \lambda\|y\|^2$. Then, with constant probability,
$$\sum_{i=1}^d \|M^{(i)}y^{(i)} - b^{(i)}\|^2 + \lambda\|y^{(i)}\|^2 \leq (1 + \varepsilon) \cdot \left(\sum_{i=1}^d \|M^{(i)}x^{(i)} - b^{(i)}\|^2 + \lambda\|x^{(i)}\|^2\right).$$

We note that a simple union bound would incur a factor of $\log(d)$ in the sketching dimension $\ell$. While this might seem mild at first, the algorithms we consider are exponential in $\ell$, implying that we would be unable to derive polynomial time algorithms for weighted low rank approximation even when the input and weight matrix are both of constant rank. Therefore, we need an average-case version of the sketching guarantees to hold; however, this does not follow immediately, since $\ell$ is small and applying Lemma 2.1 naively only gives a probability bound.
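A minimal numerical illustration of the sketched ridge regression guarantee for a single regression problem (a Gaussian sketch stands in for $\mathcal{D}$; the sketch size and data are arbitrary, and no failure-probability analysis is attempted here):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, ell, lam = 500, 10, 60, 1.0

M = rng.standard_normal((n, k))
b = rng.standard_normal(n)
S = rng.standard_normal((ell, n)) / np.sqrt(ell)  # Gaussian sketch

def ridge(A, y, lam):
    # argmin_x ||A x - y||^2 + lam ||x||^2 via the normal equations
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

def cost(x):
    return np.sum((M @ x - b) ** 2) + lam * np.sum(x ** 2)

x_opt = ridge(M, b, lam)          # exact minimizer
y_skt = ridge(S @ M, S @ b, lam)  # minimizer of the sketched objective

# the sketched minimizer is near-optimal for the *original* objective
assert cost(x_opt) <= cost(y_skt) <= 2.0 * cost(x_opt)
```
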
Ultimately, we must condition on the event that a combination of sketching guarantees holds and carefully analyze the expectation in separate cases.

4 Algorithms

In this section, we present a fast algorithm for solving regularized weighted low rank approximation. Our algorithm exploits the structure of low rank approximation as a sum of regression problems and applies the main structural theorem of the previous section to significantly reduce the number of variables in the optimization process. Note that we can write
$$\|W \circ (UV - A)\|_F^2 = \sum_{i=1}^n \|U_{i,:} V D_{W_{i,:}} - A_{i,:} D_{W_{i,:}}\|^2 = \sum_{j=1}^d \|D_{W_{:,j}} U V_{:,j} - D_{W_{:,j}} A_{:,j}\|^2.$$

4.1 Using the Polynomial Solver with Sketching

Now we sample Gaussian sketch matrices $S' \in \mathbb{R}^{d \times \Theta(\frac{s}{\varepsilon}\log(1/\varepsilon))}$ and $S'' \in \mathbb{R}^{\Theta(\frac{s}{\varepsilon}\log(1/\varepsilon)) \times n}$. We let $P^{(i)}$ denote $V D_{W_{i,:}} S'$ and $Q^{(j)}$ denote $S'' D_{W_{:,j}} U$. The matrices $P^{(i)}$ and $Q^{(j)}$ can be encoded using $\Theta(\frac{s + \log(1/\varepsilon)}{\varepsilon} kr)$ variables. For fixed $P^{(i)}$ and $Q^{(j)}$ we can define
$$\tilde{U} = \mathrm{argmin}_{U \in \mathbb{R}^{n \times k}} \sum_{i=1}^n \|U_{i,:} P^{(i)} - A_{i,:} D_{W_{i,:}} S'\|^2 + \lambda\|U_{i,:}\|^2$$
and
$$\tilde{V} = \mathrm{argmin}_{V \in \mathbb{R}^{k \times d}} \sum_{j=1}^d \|Q^{(j)} V_{:,j} - S'' D_{W_{:,j}} A_{:,j}\|^2 + \lambda\|V_{:,j}\|^2,$$
and solving these regression problems in closed form gives
$$\tilde{U}_{i,:} = A_{i,:} D_{W_{i,:}} S' (P^{(i)})^T (P^{(i)}(P^{(i)})^T + \lambda I_k)^{-1}$$
and
$$\tilde{V}_{:,j} = ((Q^{(j)})^T Q^{(j)} + \lambda I_k)^{-1} (Q^{(j)})^T S'' D_{W_{:,j}} A_{:,j},$$
so $\tilde{U}$ and $\tilde{V}$ can be encoded as rational functions over $\Theta(\frac{(s + \log(1/\varepsilon))kr}{\varepsilon})$ variables by Cramer's rule.

Algorithm 1 Regularized Weighted Low Rank Approximation

procedure REGWEIGHTEDLOWRANK($A, W, \lambda, s, k, \varepsilon$)
  Sample Gaussian sketch $S' \in \mathbb{R}^{d \times \Theta(\frac{s}{\varepsilon}\log(1/\varepsilon))}$ from $\mathcal{D}$
  Sample Gaussian sketch $S'' \in \mathbb{R}^{\Theta(\frac{s}{\varepsilon}\log(1/\varepsilon)) \times n}$ from $\mathcal{D}$
  Create matrix variables $P^{(i)}, Q^{(j)} \in \mathbb{R}^{k \times \Theta(\frac{s}{\varepsilon}\log(1/\varepsilon))}$ for $i, j$ from 1 to $r$  ▷ Variables used in the polynomial system solver
  Use Cramer's rule to express $\tilde{U}_{i,:} = A_{i,:} D_{W_{i,:}} S' (P^{(i)})^T (P^{(i)}(P^{(i)})^T + \lambda I_k)^{-1}$ as a rational function of the variables $P^{(i)}$; similarly, $\tilde{V}_{:,j} = ((Q^{(j)})^T Q^{(j)} + \lambda I_k)^{-1} (Q^{(j)})^T S'' D_{W_{:,j}} A_{:,j}$  ▷ $\tilde{U}, \tilde{V}$ are now rational functions of the variables $P, Q$
  Minimize $\|W \circ (\tilde{U}\tilde{V} - A)\|_F^2 + \lambda\|\tilde{U}\|_F^2 + \lambda\|\tilde{V}\|_F^2$ and apply binary search to find $\tilde{U}, \tilde{V}$  ▷ Optimization with the polynomial solver of Theorem 1 in the variables $P, Q$
  return $\tilde{U}, \tilde{V}$

By Theorem 2, we can argue that $\min_{\tilde{U},\tilde{V}} \|W \circ (\tilde{U}\tilde{V} - A)\|_F^2 + \lambda\|\tilde{U}\|_F^2 + \lambda\|\tilde{V}\|_F^2$ is a good approximation to $\|W \circ (U^*V^* - A)\|_F^2 + \lambda\|U^*\|_F^2 + \lambda\|V^*\|_F^2$ with constant probability, and so in particular such a good approximation exists. By using the polynomial system feasibility checkers described in Theorem 1, following similar procedures, and doing binary search, we get a polynomial system with degree $O(nk)$ and $O(\frac{s + \log(1/\varepsilon)}{\varepsilon} kr)$ variables after simplifying, and so our polynomial solver runs in time $n^{O((s + \log(1/\varepsilon))kr/\varepsilon)} \log^{O(1)}(\Delta/\delta)$.

Theorem 3. Given matrices $A, W \in \mathbb{R}^{n \times d}$ and $\varepsilon < 0.1$ such that

1. $\mathrm{rank}(W) = r$,
2. the non-zero entries of $A, W$ are multiples of some $\delta > 0$,
3. all entries of $A, W$ are at most $\Delta$ in absolute value, and
4. $s = \max_{i,j}\{\mathrm{sd}_\lambda(V^* D_{W_{i,:}}), \mathrm{sd}_\lambda(D_{W_{:,j}} U^*)\} < k$,

there is an algorithm to find $U \in \mathbb{R}^{n \times k}, V \in \mathbb{R}^{k \times d}$ in time $n^{O((s + \log(1/\varepsilon))kr/\varepsilon)} \log^{O(1)}(\Delta/\delta)$ such that $\|W \circ (UV - A)\|_F^2 + \lambda\|U\|_F^2 + \lambda\|V\|_F^2 \leq (1 + \varepsilon)\mathrm{OPT}$.

4.2 Removing Rank Dependence

Note that the running time of our algorithm still depends on $k$, the dimension that we are reducing to. To remove this dependence, we prove a structural theorem about low rank approximation of matrices of low statistical dimension.
Theorem 4.
Given matrices A, W in\nmaxi,j{sd(V \u21e4DWi,;), sd(DW;,j U\u21e4)} < k, if we let OPT(k) denote\n\nn\u21e5d and \"< 0.1 such that rank(W ) is r, and letting s equal\n\nmin\n\nU2 n\u21e5k,V 2 k\u21e5d kW (U V A)k2\n\nF + kUk2\n\nF + kV k2\n\nF\n\nthen OPT(O(r(s + log(1/\"))/\")) \uf8ff (1 + \")OPT(k)\nCombining Theorem 3 and Theorem 4, we have our \ufb01nal theorem. We note that this also improves\nrunning time bounds of un-weighted regularized low rank approximation in Section 3 of [ACW17].\n\n6\n\n\fTheorem 5. Given matrices A, W 2 n\u21e5d and \"< 0.1 and the conditions of Theorem 3, there is\nan algorithm to \ufb01nd U 2 n\u21e5k, V 2 k\u21e5d in time nO(r2(s+log(1/\"))2/\"2) logO(1)(/) such that\nkW (U V A)k2\n5 Reducing the Degree of the Solver\n\nF \uf8ff (1 + \")OPT.\n\nF + kV k2\n\nF + kUk2\n\n5.1 Non-negative Weight Matrix and Non-Negative Rank\nUnder the case where W is rank r with only r distinct columns (up to scaling), we are able to\nimprove the running time to poly(n)2r3(s+log(1/\"))2/\"2 by showing that the degree of the solver is\nO(rk) as opposed to O(nk). Speci\ufb01cally, the O(nk) degree comes from clearing the denominator of\nthe rational expressions that come from na\u00a8\u0131vely using and analyzing Cramer\u2019s Rule; in this section,\nwe demonstrate different techniques to avoid the dependence on n. We also show the same running\ntime bound under a more relaxed assumption of non-negative rank, which is always less than or\nequal to the number of distinct columns.\nTheorem 6. Given matrices A, W 2 n\u21e5d and \"< 0.1 and suppose the conditions of Theorem 3\nhold. 
Furthermore, we are given Y, Z 0 such that W = Y Z and Y, ZT has nnr(W ) = r0\ncolumns.\nThen, there is an algorithm to \ufb01nd U 2 n\u21e5k, V 2 k\u21e5d in time poly(n) \u00b7 2O(r0r2(s+log( 1\n\"2 ) \u00b7\nlogO(1) \n\n such that kW (U V A)k2\n\nF \uf8ff (1 + \")OPT.\n\nF + kV k2\n\nF + kUk2\n\n\" ))2 1\n\n5.2 Richardson\u2019s Iteration\nNote that the current polynomial solver uses Cramer\u2019s rule to solve\n\ngiving\n\n\u02dcU = argmin\nU2 n\u21e5k\n\nnXi=1\n\nkUi,;P (i) Ai,;DWi,;S0k2 + kUi,;k2\n\n\u02dcUi,; = Ai,;DWi,;S0(P (i))T (P (i)(P (i))T + Ik)1.\n\nWe want to use Richardson\u2019s iteration instead to avoid rational expressions and the dependence on\nn in the degree that comes from clearing the denominator.\nTheorem 7 (Preconditioned Richardson [CKP+17]). Let A, B be symmetric PSD matrices such that\nker(A) = ker(B) and \u2318A B A. Then, for any b, if x0 = 0 and xi+1 = xi \u2318B1(Axi b),\n\nkxt A1bk \uf8ff \"kA1bk\n\nfor t = \u2326(log(cB/\")/\u2318). Furthermore, we may express xt as a polynomial of degree O(t) in terms\nof the entries of B1 and A.\nTheorem 8. Given matrices A, W 2 n\u21e5d and \"< 0.1 and suppose the conditions of Theorem 3\nhold. Furthermore, let = maxi,j{1(V \u21e4DWi,;), 1(DW;,j U\u21e4)}.\n\u2318\u2318l\nThere is an algorithm to \ufb01nd U 2 n\u21e5k, V 2 k\u21e5d in time poly(n)\u21e3 2\nlogO(1) \n\"2 ), such that kW (U V A)k2\n\n , where l = O((s + log( 1\n\n \u00b7 log\u21e3 (2+)n\n\n(1 + \")OPT + \u2327.\n\nF + kV k2\n\nF + kUk2\n\n\" ))2 r2\n\n\u2327\n\n\u00b7\nF \uf8ff\n\n5.3 Preconditioned GD\nInstead of directly using Richardson\u2019s iteration, we may use a preconditioner \ufb01rst instead. The right\npreconditioner can also be guessed at a cost of increasing the number of variables. Note that multiple\npreconditioners may be used, but for now, we demonstrate the power of a single preconditioner.\nTheorem 9. Given matrices A, W 2 n\u21e5d and \"< 0.1 and suppose the conditions of Theorem 8\nhold. 
Furthermore, 0 < lW \uf8ff| W|\uf8ff uW . Then, there is an algorithm to \ufb01nd U 2 n\u21e5k, V 2\nk\u21e5d in time poly(n) \u00b7\u21e3 uW\n\" ))2 r2\n\"2 ),\nsuch that kW (U V A)k2\n\n , where l = O((s + log( 1\n\n\u2318\u2318l\n\u00b7 logO(1) \nF \uf8ff (1 + \")OPT + \u2327.\n\nlW \u00b7 log\u21e3 (2+)n\n\nF + kV k2\n\nF + kUk2\n\n\u2327\n\n7\n\n\f6 Experiments\n\nThe goal of our experiments was to show that sketching down to the statistical dimension can be\napplied to regularized weighted low rank approximation without sacri\ufb01cing overall accuracy in the\nobjective function, as our theory predicts. We combine sketching with a common practical alter-\nnating minimization heuristic for solving regularized weighted low rank approximation, rather than\nimplementing a polynomial system solver. At each step in the algorithm, we have a candidate U and\nV and we perform a \u201cbest-response\u201d where we either update U to give the best regularized weighted\nlow rank approximation cost for V or we update V to give the best regularized weighted low rank\napproximation cost for U. We used a synthetic dataset and several real datasets (connectus, NIPS,\nlandmark, and language) [DH11, PJST17]. All our experiments ran on a MacBook Pro 2012 with\n8GB RAM and a 2.5GHz Intel Core i5 processor.\n\nFigure 1: Regularized weighted low-rank approximations with = 0.556 for landmark, = 314\nfor NIPS, and = 1 for the synthetic dataset.\n\nFor all datasets, the task was to \ufb01nd a rank k = 50 decomposition of a given matrix A. For the\nexperiments of Figure 1 and Figure 2, we generated dense weight matrices W with the same shape\nas A and with each entry being a 1 with probability 0.8, a 0.1 with probability 0.15, and a 0.01 with\nprobability 0.05. For the experiments of Figure 3, we generated binary weight matrices where each\nentry was 1 with probability 0.9. Note that this setting corresponds to a regularized form of matrix\ncompletion. 
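The two families of weight matrices used in the experiments can be generated in a few lines (probabilities as described above; the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 1000, 500

# dense weights: 1 w.p. 0.8, 0.1 w.p. 0.15, 0.01 w.p. 0.05
W_dense = rng.choice([1.0, 0.1, 0.01], size=(n, d), p=[0.8, 0.15, 0.05])

# binary weights: 1 w.p. 0.9 (a regularized form of matrix completion)
W_binary = (rng.random((n, d)) < 0.9).astype(float)

assert set(np.unique(W_dense)) <= {1.0, 0.1, 0.01}
assert set(np.unique(W_binary)) <= {0.0, 1.0}
```
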
We set the regularization parameter to a variety of values (described in the Figure\ncaptions) to illustrate the performance of the algorithm in different settings.\nFor the synthetic dataset, we generated matrices A with dimensions 10000\u21e51000 by picking random\northogonal vectors as its singular vectors and having one singular value equal to 10000 and making\nthe rest small enough so that the statistical dimension of A would be approximately 2.\nFor the real datasets, we chose the connectus, landmark, and language datasets [DH11] and the\nNIPS dataset [PJST17]. We sampled 1000 rows from each adjacency or word count matrix to form\na matrix B and then let A be the radial basis function kernel of B. We performed three algorithms\non each dataset: Singular Value Decomposition, Alternating Minimization without Sketching, and\nAlternating Minimization with Sketching. We parameterized the experiments by t, the sketch size,\nwhich took values in {10, 15, 20, 25, 30, 35, 40, 45, 50}. For each value of t we generated a weight\nmatrix and either generated a synthetic dataset or sampled a real dataset as described in the above\nparagraphs, then tested our three algorithms.\nFor the SVD, we just took the best rank k approximation to A as given by the top k singular vectors.\nWe used the built-in svd function in numpy\u2019s linear algebra package.\nFor Alternating Minimization without Sketching, we initialized the low rank matrix factors U and\nV to be random subsets of the rows and columns of A respectively, then performed n = 25 steps of\nalternating minimization.\nFor Alternating Minimization with Sketching, we initialized U and V the same way, but performed\nn = 25 best response updates in the sketched space, as in Theorem 3. The sketch S was chosen to\nbe a CountSketch matrix with t. Based on Theorem 5, we calculated a rank t < k approximation of\nA whenever we used a sketch of size t. 
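One sketched best-response update for $U$ can be sketched as follows. This is a simplified variant of the procedure described above: it uses a single Gaussian sketch in place of CountSketch and solves the per-row ridge regressions directly; all names and sizes are illustrative:

```python
import numpy as np

def best_response_U(A, W, V, lam, S):
    """Update each row of U by solving its sketched ridge regression."""
    n, k = A.shape[0], V.shape[0]
    U = np.zeros((n, k))
    for i in range(n):
        # row i minimizes ||u (V D_{W_i,:}) - A_{i,:} D_{W_i,:}||^2 + lam ||u||^2,
        # solved in the sketched space (multiply on the right by S^T)
        Mi = (V * W[i]) @ S.T        # V D_{W_i,:} S'   (k x t)
        bi = (A[i] * W[i]) @ S.T     # A_{i,:} D_{W_i,:} S'  (t,)
        U[i] = np.linalg.solve(Mi @ Mi.T + lam * np.eye(k), Mi @ bi)
    return U

rng = np.random.default_rng(4)
n, d, k, t, lam = 30, 20, 4, 10, 1.0
A = rng.standard_normal((n, d))
W = rng.choice([1.0, 0.1], size=(n, d))
V = rng.standard_normal((k, d))
S = rng.standard_normal((t, d)) / np.sqrt(t)   # sketch of size t
U = best_response_U(A, W, V, lam, S)
assert U.shape == (n, k)
```

The symmetric update for $V$ (sketching on the left) follows the same pattern; alternating the two updates gives the sketched alternating minimization loop.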
We plotted the objective value of the low rank approximation for the connectus, NIPS, and synthetic datasets (the other datasets, as well as a different family of weight matrices, are discussed in the supplementary material) for each value of t and each algorithm in Figure 1. The experiment with the landmark dataset in Figure 1 used a regularization parameter value of λ = 0.556, while the experiments with the NIPS and synthetic datasets used a value of λ = 1. Objective values are given in 1000's in the Frobenius norm.
Both forms of alternating minimization greatly outperform the low rank approximation given by the SVD. Alternating minimization with sketching comes within a factor of 1.5 of alternating minimization without sketching and can sometimes slightly outperform alternating minimization without sketching², showing that performing CountSketch at each best-response step does not result in a critically suboptimal objective value. The runtime of alternating minimization with sketching varies from being around 2 times as fast as alternating minimization without sketching (when the sketch size t = 10) to being around 1.4 times as fast (when the sketch size t = 50). Table 1 shows the runtimes for the non-synthetic experiments of Figure 1.

Figure 2: Regularized weighted low-rank approximations with λ = 2.754 for language, λ = 1 for NIPS, and λ = 1.982 for landmark.

Figure 3: Regularized weighted low-rank approximations with binary weights and λ = 1.

        Runtimes w/ sketching      Runtimes wo/ sketching
t       landmark     NIPS          landmark     NIPS
10      49.1         54.31         126.22       104.5
15      50.33        53.58         113.8        105.75
20      51.8         57.65         119.17       104.28
25      56.43        65.53         121.69       104.35
30      57.34        68.68         123.51       105.42
35      62.66        72.22         129.87       100.5
40      63.48        79.94         123.65       101.75
45      67.73        81.22         109.02       104.93
50      73.11        72.77         100.61       101.77

Table 1: Runtimes in seconds for alternating
minimization with and without sketching.

Acknowledgements: Part of this work was done while D. Woodruff was visiting Google Mountain View as well as the Simons Institute for the Theory of Computing, and was supported in part by an Office of Naval Research (ONR) grant N00014-18-1-2562.

2 See the supplementary material for additional discussion.

References

[ACW17] Haim Avron, Kenneth L. Clarkson, and David P. Woodruff. Sharper bounds for regularized data fitting. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2017, August 16-18, 2017, Berkeley, CA, USA, pages 27:1–27:22, 2017.

[BPR96] Saugata Basu, Richard Pollack, and Marie-Françoise Roy. On the combinatorial and algebraic complexity of quantifier elimination. Journal of the ACM (JACM), 43(6):1002–1045, 1996.

[CD08] Zizhong Chen and Jack J. Dongarra. Condition numbers of Gaussian random matrices. CoRR, abs/0810.0800, 2008.

[CKP+17] Michael B. Cohen, Jonathan Kelner, John Peebles, Richard Peng, Anup B. Rao, Aaron Sidford, and Adrian Vladu. Almost-linear-time algorithms for Markov chains and new spectral primitives for directed graphs. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 410–419. ACM, 2017.

[CNW16] Michael B. Cohen, Jelani Nelson, and David P. Woodruff. Optimal approximate matrix product in terms of stable rank. In Ioannis Chatzigiannakis, Michael Mitzenmacher, Yuval Rabani, and Davide Sangiorgi, editors, 43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016), volume 55 of Leibniz International Proceedings in Informatics (LIPIcs), pages 11:1–11:14, Dagstuhl, Germany, 2016. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.

[DH11] Timothy A. Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Trans. Math.
Softw., 38(1):1:1–1:25, December 2011.

[DN11] Saptarshi Das and Arnold Neumaier. Regularized low rank approximation of weighted data sets. Preprint, 2011.

[GG11] Nicolas Gillis and François Glineur. Low-rank matrix approximation with weights or missing data is NP-hard. SIAM Journal on Matrix Analysis and Applications, 32(4):1149–1165, 2011.

[LA03] Wu-Sheng Lu and Andreas Antoniou. New method for weighted low-rank approximation of complex-valued matrices and its application for the design of 2-D digital filters. In ISCAS (3), pages 694–697, 2003.

[LLR16] Yuanzhi Li, Yingyu Liang, and Andrej Risteski. Recovery guarantee of weighted low-rank approximation via alternating minimization. In International Conference on Machine Learning, pages 2358–2367, 2016.

[LPW97] W.-S. Lu, S.-C. Pei, and P.-H. Wang. Weighted low-rank approximation of general complex matrices and its application in the design of 2-D digital filters. In IEEE Transactions on Circuits and Systems, volume 44, pages 650–655, 1997.

[PJST17] Valerio Perrone, Paul A. Jenkins, Dario Spanò, and Yee Whye Teh. Poisson random fields for dynamic feature models. Journal of Machine Learning Research, 18:127:1–127:45, 2017.

[PW15] Mert Pilanci and Martin J. Wainwright. Randomized sketches of convex programs with sharp guarantees. IEEE Transactions on Information Theory, 61(9):5096–5115, 2015.

[Ren92a] James Renegar. On the computational complexity and geometry of the first-order theory of the reals. Part I: Introduction. Preliminaries. The geometry of semi-algebraic sets. The decision problem for the existential theory of the reals. Journal of Symbolic Computation, 13(3):255–299, 1992.

[Ren92b] James Renegar. On the computational complexity and geometry of the first-order theory of the reals. Part II: The general decision problem. Preliminaries for quantifier elimination.
Journal of Symbolic Computation, 13(3):301–327, 1992.

[RSW16] Ilya P. Razenshteyn, Zhao Song, and David P. Woodruff. Weighted low rank approximations with provable guarantees. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pages 250–263, 2016.

[Shp90] D. Shpak. A weighted-least-squares matrix decomposition method with applications to the design of two-dimensional digital filters. In IEEE Thirty-Third Midwest Symposium on Circuits and Systems, 1990.

[SJ03] Nathan Srebro and Tommi S. Jaakkola. Weighted low-rank approximations. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, pages 720–727, 2003.

[TYUC17] Joel A. Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Practical sketching algorithms for low-rank matrix approximation. SIAM Journal on Matrix Analysis and Applications, 38(4):1454–1485, 2017.

[Woo14] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.