{"title": "Decentralized sketching of low rank matrices", "book": "Advances in Neural Information Processing Systems", "page_first": 10101, "page_last": 10110, "abstract": "We address a low-rank matrix recovery problem where each column of a rank-r matrix X of size (d1,d2) is compressed beyond the point of recovery to size L with L << d1. Leveraging the joint structure between the columns, we propose a method to recover the matrix to within an epsilon relative error in the Frobenius norm from a total of O(r(d_1 + d_2)\\log^6(d_1 + d_2)/\\epsilon^2) observations. This guarantee holds uniformly for all incoherent matrices of rank r. In our method, we propose to use a novel matrix norm called the mixed-norm along with the maximum l2 norm of the columns to design a novel convex relaxation for low-rank recovery that is tailored to our observation model. We also show that our proposed mixed-norm, the standard nuclear norm, and the max-norm are particular instances of convex regularization of low-rankness via tensor norms. Finally, we provide a scalable ADMM algorithm for the mixed-norm based method and demonstrate its empirical performance via large-scale simulations.", "full_text": "Decentralized sketching of low-rank matrices\n\nRakshith S Srinivasa \u2217\nDept. of Electrical and Computer Engineering\nGeorgia Institute of Technology\nAtlanta, GA 30318\nrsrinivasa6@gatech.edu\n\nKiryung Lee\nDept. of Electrical and Computer Engineering\nOhio State University\nColumbus, OH 43210\nlee.8763@osu.edu\n\nMarius Junge\nDept. of Mathematics\nUniversity of Illinois at Urbana-Champaign\nUrbana, IL, 61801\nmjunge@illinois.edu\n\nJustin Romberg\nDept. 
of Electrical and Computer Engineering\nGeorgia Institute of Technology\nAtlanta, GA 30318\njrom@ece.gatech.edu\n\n\u2217This work was supported in part by NSF CCF-1718771 and NSF DMS 18-00872, and in part by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nAbstract\n\nWe address a low-rank matrix recovery problem where each column of a rank-$r$ matrix $X \\in \\mathbb{R}^{d_1 \\times d_2}$ is compressed beyond the point of individual recovery to $\\mathbb{R}^L$ with $L \\ll d_1$. Leveraging the joint structure among the columns, we propose a method to recover the matrix to within an $\\epsilon$ relative error in the Frobenius norm from a total of $O(r(d_1 + d_2) \\log^6(d_1 + d_2)/\\epsilon^2)$ observations. This guarantee holds uniformly for all incoherent matrices of rank $r$. In our method, we propose to use a novel matrix norm called the mixed-norm along with the maximum $\\ell_2$-norm of the columns to design a new convex relaxation for low-rank recovery that is tailored to our observation model. We also show that the proposed mixed-norm, the standard nuclear norm, and the max-norm are particular instances of convex regularization of low-rankness via tensor norms. Finally, we provide a scalable ADMM algorithm for the mixed-norm-based method and demonstrate its empirical performance via large-scale simulations.\n\n1 Introduction\n\nA fundamental structural model for data is that the data points lie close to an unknown subspace, meaning that the matrix created by concatenating the data vectors has low rank. We address a particular low-rank matrix recovery problem where we wish to recover a set of vectors from a low-dimensional subspace after they have been individually compressed (or \u201csketched\u201d). More concretely, let $x_1, \\ldots, x_{d_2}$ be vectors from an unknown $r$-dimensional subspace in $\\mathbb{R}^{d_1}$. We observe the vectors indirectly via linear sketches by corresponding sensing matrices $B_1, \\ldots, B_{d_2} \\in \\mathbb{R}^{d_1 \\times L}$, where $L < d_1$, i.e., the observed measurement vectors are written as\n\n$$y_i = B_i^\\top x_i + z_i, \\quad i = 1, \\ldots, d_2. \\tag{1}$$\n\nAlthough individual recovery of each vector is ill-posed, it is still possible to recover $x_1, \\ldots, x_{d_2}$ jointly by leveraging their mutual structure without knowing the underlying subspace a priori. This indeed results in a low-rank matrix recovery problem with a column-wise observation model.\n\nWe are motivated mainly by large-scale inference problems where data is collected in a distributed network or in a streaming setting. In both cases, it is desired to compress the data to lower the communication overhead. In the first scenario, the data is partitioned according to the network structure and each data point must be compressed without access to the remaining data points. In the second scenario, memory or computational constraints may limit access to a relatively small number of recent data points.\n\nSuch compressive and distributive acquisition schemes arise frequently in numerous real-world applications. In next generation high-resolution astronomical imaging systems, an antenna array may be distributed across a wide geographical area to collect data points that have a high dimension but are also heavily correlated (and so belong to a low-dimensional subspace). Compression at the node level relieves the overhead to transmit data to a central processing unit [1]. In scientific computing, it is common to generate large scale simulation data that has redundancies that manifest as low-rank structures. For example, simulations in a fluid dynamic system generate large state vectors that have low-rank dynamics [2]. 
Our observational model describes a kind of on-the-fly compression, where the states are compressed as the system evolves, resulting in efficient communication and storage.\n\nIn each of these applications, if the underlying low-dimensional subspace were known a priori, then the projection onto that subspace could have implemented an optimal distortion-free linear compression. Alternatively, if the uncompressed data were available, standard Principal Component Analysis (PCA) might have been used to discover the subspace. Unfortunately, neither is the case. Therefore we approach the recovery as sketching without knowing the latent subspace a priori. It can also be interpreted as a blind compressed sensing problem that recovers the data points and the underlying subspace simultaneously from compressed measurements.\n\nThe measurement model in (1) is equivalently rewritten as follows: Let $X_0 \\in \\mathbb{R}^{d_1 \\times d_2}$ be the matrix obtained by concatenating $x_1, \\ldots, x_{d_2}$. It follows that the rank of $X_0$ is at most $r$. The entries of $y_1, \\ldots, y_{d_2}$ then correspond to noisy linear measurements of $X_0$, i.e., for $l = 1, \\ldots, L$ and $i = 1, \\ldots, d_2$, the $l$th entry of $y_i$, denoted by $y_{l,i}$, is written as\n\n$$y_{l,i} = \\langle A_{l,i}, X_0 \\rangle + z_{l,i} \\quad \\text{with} \\quad A_{l,i} = \\frac{1}{\\sqrt{L}} b_{l,i} e_i^\\top, \\tag{2}$$\n\nwhere $z_{l,i}$, $b_{l,i}$, and $e_i$ respectively denote the $l$th entry of $z_i$, the $l$th column of $B_i$, and the $i$th column of the identity matrix of size $d_2$. We propose a convex optimization method to recover $X_0$ from $\\{y_{l,i}\\}$ and provide theoretical analysis when $b_{l,i}$ and $z_{l,i}$ are independent copies of random variables drawn according to $\\mathcal{N}(0, I_{d_1})$ and $\\mathcal{N}(0, \\sigma^2)$ respectively.\n\n1.1 Mixed-norm-based low-rank recovery\n\nLow-rank matrix recovery has been extensively studied (e.g., see [3]). One popular approach is to formulate the recovery as a convex program with various matrix norms such as the nuclear norm [4, 5, 6] and the max norm [7]. 
As we show in Section 2, these two norms together with the new norm we propose below are all particular instances of a unified perspective of low-rank regularization. We propose a convex relaxation of low-rankness by a matrix norm for the recovery from measurements given in (2).\n\nFor a matrix $X$, the maximum $\\ell_2$ column norm is defined as\n\n$$\\|X\\|_{1 \\to 2} = \\max_{j = 1, \\ldots, d_2} \\|X e_j\\|_2, \\tag{3}$$\n\nwhere $e_j$ is the $j$th standard basis vector. This can be interpreted as the operator norm from the vector space $\\ell_1^{d_2}$ to $\\ell_2^{d_1}$. We define the \u201cmixed-norm\u201d of a matrix $X$ as\n\n$$\\|X\\|_{\\mathrm{mixed}} = \\inf_{U, V : U V^\\top = X} \\|U\\|_F \\, \\|V^\\top\\|_{1 \\to 2}. \\tag{4}$$\n\nIndeed the above two norms provide a convex relaxation suitable to the observation model in (2) through the interlacing property given in the following lemma, the proof of which is given in the supplementary material.\n\nLemma 1 Let $X \\in \\mathbb{R}^{d_1 \\times d_2}$ satisfy $\\mathrm{rank}(X) \\le r$. Then\n\n$$\\|X\\|_{1 \\to 2} \\le \\|X\\|_{\\mathrm{mixed}} \\le \\sqrt{r} \\, \\|X\\|_{1 \\to 2}. \\tag{5}$$\n\nBy Lemma 1, the set $\\kappa(\\alpha, R)$ defined by\n\n$$\\kappa(\\alpha, R) = \\{X : \\|X\\|_{1 \\to 2} \\le \\alpha, \\ \\|X\\|_{\\mathrm{mixed}} \\le R\\} \\tag{6}$$\n\ncontains the set of rank-$r$ matrices with column norms bounded by $\\alpha$. We show that the observation model in (2) results in an $\\epsilon$-embedding of the set $\\kappa(\\alpha, R)$ for a total number of measurements $L d_2 \\gtrsim r(d_1 + d_2) \\log^6(d_1 + d_2)/\\epsilon^2$. We consider an estimate $\\widehat{X}$ of $X_0$ given by\n\n$$\\widehat{X} \\in \\operatorname*{argmin}_{X \\in \\kappa(\\alpha, R)} \\sum_{l,i} |y_{l,i} - \\langle A_{l,i}, X \\rangle|^2. \\tag{7}$$\n\nWe have attempted to use the nuclear norm instead of the mixed norm, but this approach did not yield a guarantee at a near optimal sample complexity. 
Furthermore, it also demonstrates worse empirical performance compared to our approach, as shown in Section 3.\n\nAnother appealing property of the mixed-norm is that it can be computed in polynomial time using a semidefinite formulation. This renders our proposed estimator readily implementable using general purpose convex solvers. However, to address scalability, we propose an ADMM based framework. We defer further details on efficient computation to Section 3.\n\n1.2 Main result\n\nOur main result, stated in Theorem 1, provides an upper bound on the Frobenius norm of the error between the estimate $\\widehat{X}$ obtained from solving (7) and the ground truth matrix $X_0$ that holds simultaneously for all matrices $X \\in \\kappa(\\alpha, R)$ rather than for a fixed arbitrary matrix $X_0$. En route to proving our guarantee, we indeed show that $\\sum_{l,i} \\langle A_{l,i}, X \\rangle^2$ is well concentrated around its expectation $\\|X\\|_F^2$ for all $X \\in \\kappa(\\alpha, R)$ and hence, the measurements result in an embedding of the set $\\kappa(\\alpha, R)$ into a low dimension.\n\nTheorem 1 Let $\\kappa(\\alpha, R)$ be defined as in (6). Suppose that the $b_{l,i}$ are drawn independently from $\\mathcal{N}(0, I_{d_1})$, $(z_{l,i})$ are i.i.d. following $\\mathcal{N}(0, \\sigma^2)$, $d = d_1 + d_2$ and $d_2 \\le L d_2 \\le d_1 d_2$. Then, for $R \\le \\alpha \\sqrt{r}$, there exist numerical constants $c_1, c_2$ such that the estimate $\\widehat{X}$ satisfies\n\n$$\\frac{\\|\\widehat{X} - X_0\\|_F^2}{\\|X_0\\|_F^2} \\le c_1 \\cdot \\frac{\\alpha^2}{\\|X_0\\|_F^2 / d_2} \\cdot \\max\\left(1, \\frac{\\sigma \\sqrt{L}}{\\alpha}\\right) \\cdot \\sqrt{\\frac{r(d_1 + d_2) \\log^6 d}{L d_2}} \\tag{8}$$\n\nwith probability at least $1 - 2\\exp(-c_2 R^2 d/\\alpha^2)$ for all $X_0 \\in \\kappa(\\alpha, R)$.\n\nA few remarks are in order:\n\n\u2022 The factor $\\alpha^2 d_2 / \\|X_0\\|_F^2$ is the ratio between the maximum and the average of the squared column $\\ell_2$ norm of the ground truth matrix $X_0$ and represents its degree of incoherence. A ratio close to 1 indicates that the columns have similar $\\ell_2$-norms and results in a lower sample complexity than when the ratio is much larger than 1. This is similar to the dependence on the relative magnitude of each entry in the max-norm-based estimator [7] and the dependence on incoherence in matrix completion problems.\n\n\u2022 The second factor is written as $\\max(1, \\eta)$, where $\\eta = \\sigma \\sqrt{L} / \\alpha$ accounts for the noise level in the measurements. Since we take $L$ measurements per column and the measurement operator is isotropic, $\\alpha^2$ is compared against the corresponding noise variance $\\sigma^2 L$.\n\n\u2022 If the incoherence term is upper-bounded by a constant and the normalized noise level $\\eta$ satisfies $\\eta = \\Omega(1)$, then $\\widehat{X}$ obtained from $O(\\eta^2 r d \\log^6(d) \\epsilon^{-2})$ measurements satisfies $\\|\\widehat{X} - X_0\\|_F^2 \\le \\epsilon \\|X_0\\|_F^2$ with high probability.\n\n\u2022 We conjecture that the corresponding minimax lower bound coincides with (8) except for the maximum of $\\eta$ with 1 and the logarithmic term. In particular, if $\\eta = \\Omega(1)$, then the sample complexity in (8) will be near optimal.\n\n1.3 Related work\n\nThe model in (2) has been studied in the context of compressed principal components estimation [8, 9, 10]. These works studied a specific method that computes the underlying subspace through an empirical covariance estimation. While being guaranteed at a near optimal sample complexity, this approach is inherently limited to the linear observation model. On the other hand, our method is more flexible in terms of its potential extension to nonlinear observation models.\n\nNegahban and Wainwright [11] considered the multivariate linear regression problem where a similar model to (2) arises but with a fixed sensing matrix $A$, i.e., $A_i = A$ for all $i = 1, \\ldots, d_2$. They showed that a nuclear-norm penalized least squares provides robust recovery at a near optimal sample complexity within a logarithmic factor of the degrees of freedom of rank-$r$ matrices. However, their guarantee applies to an arbitrary fixed ground truth matrix and not to all matrices within the model simultaneously. Our aim is to work with an embedding of the model set $\\kappa(\\alpha, R)$ and we obtain a uniform theoretical guarantee over the entire model set at the cost of using different sensing matrices $A_i$\u2019s and incoherence of the matrices.\n\nOur solution approach is partly inspired by earlier works on low-rank matrix completion using the max-norm [12, 13, 7]. The pair of max-norm and $\\ell_\\infty$ norms is used to relax the set of low-rank matrices to a convex model. We generalize this approach to that of using tensor norms (see Section 2) as a proxy for low rank regularization and show that the max-norm and the mixed-norm are particular instances of this general framework. In particular we choose a specific pair of tensor norms in accordance with the structure in the observation model. 
This leads to a new convex relaxation model of low-rankness, a corresponding optimization formulation, algorithm, and its performance guarantee. Finally, we point out that our method of proofs and the technical tools we use to establish our results are significantly different from that of [7].\n\n2 Properties of tensor norms on low-rank matrices\n\nWe interpret a matrix $X \\in \\mathbb{R}^{d_1 \\times d_2}$ as a linear operator from a vector space $\\mathbb{R}^{d_2}$ to another vector space $\\mathbb{R}^{d_1}$. Then let the domain and range spaces be respectively endowed with the $\\ell_p$ norm and the $\\ell_q$ norm. The vector space of all $d_1 \\times d_2$ matrices is then identified as the tensor product of the two Banach spaces, denoted as $\\ell_{p'} \\otimes \\ell_q$ (e.g., [14]), where $1/p + 1/p' = 1$.\n\nA tensor norm is a norm on the algebraic tensor product of two Banach spaces that satisfies the operator ideal property (see e.g., [14, 15]). The main insight driving the unified perspective is that, when we restrict linear operators to those of rank at most $r$, certain tensor norms become equivalent up to a function of $r$. In particular, we consider the injective and projective tensor norms, defined respectively as\n\n$$\\|X\\|_\\vee = \\sup_{u \\in \\mathbb{R}^{d_2}, \\, \\|u\\|_p = 1} \\|X u\\|_q \\tag{9}$$\n\nand\n\n$$\\|X\\|_\\wedge = \\inf \\left\\{ \\sum_k \\|u_k\\|_{p'} \\|v_k\\|_q \\ \\middle| \\ X = \\sum_k v_k u_k^* \\right\\}. \\tag{10}$$\n\nThe pair of the injective and projective norms characterizes the set of low-rank matrices through an interlacing property between them. For example, when $p = q = 2$, it can be easily verified that $\\|X\\|_\\vee = \\|X\\|_2$ and $\\|X\\|_\\wedge = \\|X\\|_*$. It follows from the singular value decomposition that $\\|X\\|_2 \\le \\|X\\|_* \\le r \\|X\\|_2$. In yet another example where $p = 1$, $q = \\infty$, we have $\\|X\\|_\\vee = \\|X\\|_\\infty$. Linial et al. [13] showed that Grothendieck\u2019s inequality implies\n\n$$\\|X\\|_\\infty \\le \\|X\\|_{\\max} \\le \\sqrt{r} \\, \\|X\\|_\\infty, \\tag{11}$$\n\nwhere\n\n$$\\|X\\|_{\\max} = \\inf_{U, V : U V^\\top = X} \\|U^\\top\\|_{1 \\to 2} \\, \\|V^\\top\\|_{1 \\to 2}.$$\n\nIn this case, it has been shown that the max norm is equivalent up to a constant to the projective norm.\n\nFinally, by letting $p = 1$, $q = 2$, we obtain $\\|X\\|_\\vee = \\|X\\|_{1 \\to 2}$ and that the projective norm is equivalent (up to a constant factor) to the mixed norm and the relationship in Lemma 1 holds. Further, it is interesting that unlike many tensor norms, the mixed norm and the max-norm can be computed efficiently in polynomial time, similar to the nuclear norm. As we note in the next section, this enables efficient implementation of mixed-norm-based low-rank recovery programs.\n\n3 Fast algorithm for mixed-norm-based optimization\n\nThe mixed-norm of any matrix $X$ can be computed in polynomial time as\n\n$$\\|X\\|_{\\mathrm{mixed}} = \\min_{W_{11}, W_{22}} \\ \\max\\big(\\mathrm{trace}(W_{11}), \\|\\mathrm{diag}(W_{22})\\|_\\infty\\big) \\quad \\text{s.t.} \\quad \\begin{bmatrix} W_{11} & X \\\\\\\\ X^\\top & W_{22} \\end{bmatrix} \\succeq 0, \\tag{12}$$\n\nwhere $\\mathrm{diag}(W_{22})$ denotes the vector of the diagonal entries of $W_{22}$. Then the optimization routine in (7) can be written as\n\n$$\\begin{aligned} \\underset{W_{11}, W_{22}, X}{\\text{minimize}} \\quad & \\sum_{l,i} |y_{l,i} - \\langle A_{l,i}, X \\rangle|^2 \\\\\\\\ \\text{subject to} \\quad & \\mathrm{trace}(W_{11}) \\le R, \\quad \\|\\mathrm{diag}(W_{22})\\|_\\infty \\le R, \\\\\\\\ & \\|X\\|_{1 \\to 2} \\le \\alpha, \\quad W = \\begin{bmatrix} W_{11} & X \\\\\\\\ X^\\top & W_{22} \\end{bmatrix} \\succeq 0. \\end{aligned} \\tag{13}$$\n\nThe program in (13) is now a constrained convex optimization problem over the cone of positive semidefinite (PSD) matrices.\n\n3.1 ADMM based fast algorithm\n\nThe program in (13) can be implemented using standard convex optimization solvers like SeDuMi [16]. However, this could result in scaling issues, as run times could be prohibitive in higher dimensions. To address this, we propose to use the ADMM based algorithm [17], which breaks down the optimization problem into smaller problems that can be solved efficiently. Our approach is similar to [18], where the positive semidefinite constraint on $W$ in (13) is treated separately from the other constraints. We provide an algorithm for the norm-penalized version of (13). By Lagrangian duality, the penalized version and the constrained version are equivalent when the Lagrangian multipliers $\\lambda_1$ and $\\lambda_2$ are chosen appropriately.\n\nBy introducing an auxiliary variable $T$, it is straightforward to show that the optimization problem (13) is equivalent to\n\n$$\\begin{aligned} \\underset{W, T}{\\text{minimize}} \\quad & \\sum_{l,i} |y_{l,i} - \\langle A_{l,i}, W_{12} \\rangle|^2 + \\lambda_1 \\, \\mathrm{trace}(T_{11}) + \\lambda_2 \\, \\|\\mathrm{diag}(W_{22})\\|_\\infty \\\\\\\\ \\text{subject to} \\quad & \\|W_{12}\\|_{1 \\to 2} \\le \\alpha, \\quad T = W, \\quad T \\succeq 0. \\end{aligned} \\tag{14}$$\n\nIn (14), we carry the constraints on $\\mathrm{trace}(T_{11})$ and $\\|\\mathrm{diag}(W_{22})\\|_\\infty$ to the objective function by using the Lagrangian formulation. Note that other variations are possible, with more or fewer constraints carried over to the objective function. The formulation in (14) is amenable to the ADMM algorithm. The augmented Lagrangian of (14) is given by\n\n$$\\mathcal{L}(T, W, Z) = f(W) + \\lambda_1 \\, \\mathrm{trace}(T_{11}) + \\lambda_2 \\, \\|\\mathrm{diag}(W_{22})\\|_\\infty + \\langle Z, T - W \\rangle + \\frac{\\rho}{2} \\|T - W\\|_F^2 + \\chi_{\\{T \\succeq 0\\}} + \\chi_{\\{\\|W_{12}\\|_{1 \\to 2} \\le \\alpha\\}},$$\n\nwhere $f(W)$ denotes the least squares term in (14), $Z$ is the dual variable, and $\\chi_S$ is the indicator function of the set $S$ given as $\\chi_S(t) = 0$ if $t \\in S$ and $\\chi_S(t) = \\infty$ otherwise. The ADMM algorithm then iterates by alternating among $T$, $W$ and $Z$, as shown in Algorithm 1. While we leave the finer details of the algorithm to the supplementary material, it is worthwhile to note that each step in Algorithm 1 has a unique closed-form solution that allows for scalability to high dimensions.\n\nAlgorithm 1 ADMM algorithm\nInitialize: $T^0, W^0, Z^0$\nwhile not converged do\n    $T^{k+1} = \\operatorname*{argmin}_{T \\succeq 0} \\mathcal{L}(T, W^k, Z^k)$\n    $W^{k+1} = \\operatorname*{argmin}_{\\|W_{12}\\|_{1 \\to 2} \\le \\alpha} \\mathcal{L}(T^{k+1}, W, Z^k)$\n    $Z^{k+1} = Z^k + \\rho (T^{k+1} - W^{k+1})$\nend while\n\n3.2 Experiments\n\nFigure 1: Simulation results comparing the proposed mixed-norm based estimator and the nuclear norm based estimator. The test matrices were of size 1,000 \u00d7 1,000 with rank 5. Each data point is computed as an average of 5 trials. The mixed-norm estimator is able to achieve much lower errors with fewer measurements compared to the nuclear norm estimator.\n\nTo complement our theoretical results, we observe the empirical performance of the mixed-norm-based method in a set of Monte Carlo simulations. Matrices are set to be of size 1,000 \u00d7 1,000 and of rank 5. In our experiments we normalize the columns to have the same energy. We observe the estimation error by varying the degree of compression and the signal-to-noise ratio (SNR). We compare the proposed method to the popular matrix LASSO, which minimizes the least squares loss with a nuclear norm regularizer. 
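As an illustrative aside on Algorithm 1 (a minimal sketch, not the authors' implementation, whose details are in the supplementary material): one building block that could be used to enforce the constraint ‖W12‖1→2 ≤ α is the Euclidean projection onto that set, which rescales every column whose ℓ2 norm exceeds α, and the Z-update is a plain dual ascent step. The code below assumes matrices stored as lists of rows in pure Python; the helper names column_norms, one_to_two_norm, project_columns and dual_update are hypothetical.

```python
import math

def column_norms(M):
    # l2 norm of each column of a matrix stored as a list of rows
    rows, cols = len(M), len(M[0])
    return [math.sqrt(sum(M[i][j] ** 2 for i in range(rows))) for j in range(cols)]

def one_to_two_norm(M):
    # maximum l2 column norm, as in equation (3)
    return max(column_norms(M))

def project_columns(M, alpha):
    # Euclidean projection onto {M : max column l2 norm <= alpha}:
    # columns longer than alpha are rescaled to length alpha, others are kept
    norms = column_norms(M)
    scale = [1.0 if n <= alpha else alpha / n for n in norms]
    return [[M[i][j] * scale[j] for j in range(len(M[0]))] for i in range(len(M))]

def dual_update(Z, T, W, rho):
    # dual ascent step of Algorithm 1: Z <- Z + rho * (T - W)
    return [[Z[i][j] + rho * (T[i][j] - W[i][j]) for j in range(len(Z[0]))]
            for i in range(len(Z))]

# Hypothetical 2 x 2 block: the first column has norm 5, the second about 0.707
W12 = [[3.0, 0.5], [4.0, 0.5]]
P = project_columns(W12, 1.0)
```

In this toy case only the first column is rescaled, so the projected block lies on the boundary of the constraint set. The full W-update in Algorithm 1 also involves the least squares and penalty terms, which this sketch does not cover.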
We used Algorithm 1 to implement the mixed-norm based method. The nuclear norm minimization approach was implemented using the algorithm provided in [19]. Figure 1 shows the obtained simulation results. The estimation error is averaged over 5 trials. The result indicates that the mixed-norm-based estimator outperforms the nuclear-norm-based estimator at both the SNR levels considered.\n\n4 Proof sketch\n\nWe state the key lemmas involved in proving our result, point to the tools we use, and defer finer details to the supplementary material. We begin with the basic optimality condition that relates the estimate $\\widehat{X}$ to the ground truth $X_0$. Let $M = \\widehat{X} - X_0$. By the triangle inequality, we have $M \\in \\kappa(2\\alpha, 2R)$. For notational brevity, we assume from now on that $M \\in \\kappa(\\alpha, R)$. (Neither the main result nor the proofs are affected by this since it only changes numerical constants.)\n\nWe adapt the first step in the analysis framework of the analogous matrix completion problem [7]. By optimality of the solution and (2), we have\n\n$$\\sum_{l,i} \\left( y_{l,i} - \\langle A_{l,i}, \\widehat{X} \\rangle \\right)^2 \\le \\sum_{l,i} \\left( y_{l,i} - \\langle A_{l,i}, X_0 \\rangle \\right)^2. \\tag{15}$$\n\nAfter substituting $\\widehat{X} - X_0$ by $M$ and rearranging the terms, we obtain\n\n$$\\sum_{l,i} \\langle A_{l,i}, M \\rangle^2 \\le 2 \\sum_{l,i} \\langle A_{l,i}, M \\rangle z_{l,i}. \\tag{16}$$\n\nAs in [7], we rely on the stochastic nature of the noise. The proof also relies on the norm-constrained optimization rather than norm-penalized optimization. Our strategy is to obtain a lower bound on the quadratic form $\\sum_{l,i} \\langle A_{l,i}, M \\rangle^2$ in terms of $\\|M\\|_F^2$ and a uniform upper bound on the linear form $\\sum_{l,i} \\langle A_{l,i}, M \\rangle z_{l,i}$ over the set $\\kappa(\\alpha, R)$. We can then bound $\\|M\\|_F^2$ uniformly over the set.\n\n4.1 Lower bound on the quadratic form\n\nWe observe that $\\sum_{l,i} \\langle A_{l,i}, M \\rangle^2$ can be reformulated as a quadratic form in standard Gaussian random variables. Let us define\n\n$$\\xi = \\begin{bmatrix} b_{1,1} \\\\\\\\ \\vdots \\\\\\\\ b_{L,d_2} \\end{bmatrix} \\in \\mathbb{R}^{L d_1 d_2}. \\tag{17}$$\n\nThen it follows that $\\xi \\sim \\mathcal{N}(0, I_{L d_1 d_2})$. Therefore, the left-hand side of (16) is rewritten as\n\n$$\\sum_{l,i} \\langle A_{l,i}, M \\rangle^2 = \\|Q_M \\xi\\|^2, \\quad \\text{where} \\quad Q_M = \\frac{1}{\\sqrt{L}} \\begin{bmatrix} \\widetilde{M}_1^\\top & 0 & \\cdots & 0 \\\\\\\\ 0 & \\widetilde{M}_2^\\top & \\cdots & 0 \\\\\\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\\\\\ 0 & 0 & \\cdots & \\widetilde{M}_{d_2}^\\top \\end{bmatrix}, \\quad \\widetilde{M}_j^\\top = I_L \\otimes (M e_j)^\\top \\in \\mathbb{R}^{L \\times L d_1}. \\tag{18}$$\n\nWe also have\n\n$$\\mathbb{E} \\|Q_M \\xi\\|^2 = \\|M\\|_F^2.$$\n\nWe compute a tail estimate on $\\sup_{M \\in \\kappa(\\alpha, R)} \\|Q_M \\xi\\|_2^2$ by using the results on suprema of chaos processes [20]. They derived a sharp tail estimate on the supremum of a Gaussian quadratic form maximized over a given set $\\mathcal{A}$, which is written as\n\n$$\\sup_{A \\in \\mathcal{A}} \\left| \\|A \\xi\\|^2 - \\mathbb{E} \\|A \\xi\\|^2 \\right|,$$\n\nby using a chaining argument. By adapting their framework, we obtain the following lemma:\n\nLemma 2 Under the assumptions of Theorem 1, if $Q_M$ and $\\xi$ are as defined in (18) and (17), then\n\n$$\\sup_{M \\in \\kappa(\\alpha, R)} \\left| \\frac{\\|Q_M \\xi\\|^2}{d_2} - \\frac{\\|M\\|_F^2}{d_2} \\right| \\le c R \\sqrt{\\frac{d}{L d_2}} \\left( \\alpha + \\frac{R \\sqrt{d}}{\\sqrt{L d_2}} \\log^3(d) \\right) \\log^3 d \\tag{19}$$\n\nwith probability at least $1 - 2\\exp(-c R^2 d/\\alpha^2)$.\n\nFrom Lemma 2, in the regime where $L d_2 > R^2 d/\\alpha^2$, we can obtain\n\n$$\\frac{\\sum_{l,i} \\langle A_{l,i}, M \\rangle^2}{d_2} \\ge \\frac{\\|M\\|_F^2}{d_2} - c R \\alpha \\sqrt{\\frac{d}{L d_2}} \\log^3 d.$$\n\n4.2 Upper bound on the right-hand side of (16)\n\nWe obtain the following uniform upper bound on $\\sum_{l,i} \\langle A_{l,i}, M \\rangle z_{l,i}$:\n\nLemma 3 Under the assumptions of Theorem 1, with probability at least $1 - 2\\exp(-c R^2 d/\\alpha^2)$,\n\n$$\\sup_{M \\in \\kappa(\\alpha, R)} \\frac{\\sum_{l,i} \\langle A_{l,i}, M \\rangle z_{l,i}}{d_2} \\le c (\\sigma \\sqrt{L}) R \\sqrt{\\frac{d}{L d_2}} \\log^3 d. \\tag{20}$$\n\nTo derive Lemma 3, we first express the left-hand side of (20) using a matrix norm. Define\n\n$$|||M||| := \\frac{\\|M\\|_{1 \\to 2}}{\\alpha} \\vee \\frac{\\|M\\|_{\\mathrm{mixed}}}{R}.$$\n\nThen by the definition of $\\kappa(\\alpha, R)$ in (6) it follows that the unit ball $B := \\{M : |||M||| \\le 1\\}$ with respect to $||| \\cdot |||$ coincides with $\\kappa(\\alpha, R)$. Therefore via the Banach space duality, we obtain\n\n$$\\sup_{M \\in \\kappa(\\alpha, R)} \\sum_{l,i} \\langle A_{l,i}, M \\rangle z_{l,i} = \\sup_{M \\in \\kappa(\\alpha, R)} \\Big\\langle \\sum_{l,i} z_{l,i} A_{l,i}, M \\Big\\rangle = \\Big|\\Big|\\Big| \\sum_{l,i} z_{l,i} A_{l,i} \\Big|\\Big|\\Big|_*,$$\n\nwhere $||| \\cdot |||_*$ denotes the dual norm. Then, conditioned on $A_{l,i}$\u2019s, it follows from Theorem 4.7 in [21] that with probability $1 - \\delta$\n\n$$\\Big|\\Big|\\Big| \\sum_{l,i} z_{l,i} A_{l,i} \\Big|\\Big|\\Big|_* \\le \\underbrace{\\mathbb{E}_z \\Big|\\Big|\\Big| \\sum_{l,i} z_{l,i} A_{l,i} \\Big|\\Big|\\Big|_*}_{T_1} + \\underbrace{\\pi \\sqrt{\\frac{\\log(2/\\delta)}{2} \\sup_{M \\in \\kappa(\\alpha, R)} \\sum_{l,i} \\langle A_{l,i}, M \\rangle^2}}_{T_2}. \\tag{21}$$\n\nThe first term $T_1$ is the Gaussian complexity of the sample set $\\{A_{l,i}\\}$ over the function class $\\{\\langle M, \\cdot \\rangle : M \\in \\kappa(\\alpha, R)\\}$. This can be (up to a logarithmic factor of the size of the summation) upper-bounded by the corresponding Rademacher complexity ([22], Equation (4.9)) as\n\n$$T_1 \\le c \\sigma \\sqrt{\\log(L d_2 + 1)} \\ \\mathbb{E}_{(r_{l,i})} \\Big|\\Big|\\Big| \\sum_{l,i} r_{l,i} A_{l,i} \\Big|\\Big|\\Big|_*, \\tag{22}$$\n\nwhere $(r_{l,i})$ is a Rademacher sequence and the expectation is conditioned on $(A_{l,i})$. Then by the symmetry of the standard Gaussian distribution, we obtain\n\n$$\\mathbb{E}_{(r_{l,i})} \\Big|\\Big|\\Big| \\sum_{l,i} r_{l,i} A_{l,i} \\Big|\\Big|\\Big|_* = \\sup_{M \\in \\kappa(\\alpha, R)} \\Big| \\frac{1}{\\sqrt{L}} \\sum_{l,i} \\langle r_{l,i} b_{l,i}, M e_i \\rangle \\Big| = \\underbrace{\\sup_{M \\in \\kappa(\\alpha, R)} \\Big| \\frac{1}{\\sqrt{L}} \\sum_{l,i} \\langle b_{l,i}, M e_i \\rangle \\Big|}_{(\\S)}, \\tag{23}$$\n\nwhere the second equation holds in the sense of distribution.\n\nNote that (\u00a7) is the maximum of linear combinations of Gaussian variables and an upper bound can be obtained using Dudley\u2019s inequality [22]. Once we obtain a tail estimate of (\u00a7), since (\u00a7) no longer depends on the Rademacher sequence $(r_{l,i})$, it can be used to upper-bound $T_1$ through (22) and (23). An upper bound on $T_2$ has been already derived in Lemma 2. Combining these upper estimates on $T_1$ and $T_2$ results in Lemma 3. From the lower bound on $\\sum_{l,i} \\langle A_{l,i}, M \\rangle^2$ and (16), we have\n\n$$\\frac{\\|M\\|_F^2}{d_2} - c \\alpha R \\sqrt{\\frac{d}{L d_2}} \\log^3 d \\le \\frac{1}{d_2} \\sum_{l,i} \\langle A_{l,i}, M \\rangle^2 \\le 2 \\sup_{M \\in \\kappa(\\alpha, R)} \\frac{\\sum_{l,i} \\langle A_{l,i}, M \\rangle z_{l,i}}{d_2}.$$\n\nFrom Lemma 3, we get the following inequality, which then leads to the final result.\n\n$$\\frac{\\|M\\|_F^2}{d_2} \\le c \\log^3 d \\cdot R \\sqrt{\\frac{d}{L d_2}} \\, (\\alpha \\vee \\sigma \\sqrt{L}).$$\n\n4.3 Entropy estimate\n\nParts of the proofs of Lemmas 2 and 3 have been deferred to the supplementary material. Both proofs rely on a key quantity that captures the \u201ccomplexity\u201d of the set $\\kappa(\\alpha, R)$. In particular, using Dudley\u2019s inequality requires an estimate of the entropy number of the set $\\kappa(\\alpha, R)$, which is given by the following lemma.\n\nLemma 4 Let $\\kappa(\\alpha, R)$ be as in (6) and let $B_{1 \\to 2}$ be the unit ball with respect to $\\|\\cdot\\|_{1 \\to 2}$. Then there exists a numerical constant $c$ such that\n\n$$\\int_0^\\infty \\sqrt{\\log N(\\kappa(\\alpha, R), \\eta B_{1 \\to 2})} \\, d\\eta \\le c R \\sqrt{d} \\log^{3/2}(d_1 + d_2). \\tag{24}$$\n\nHere $N(\\kappa(\\alpha, R), \\eta B_{1 \\to 2})$ denotes the covering number of $\\kappa(\\alpha, R)$ with respect to the scaled unit ball $\\eta B_{1 \\to 2}$.\n\nIn Section 2 we introduced the projective tensor norm $\\|\\cdot\\|_\\wedge$. Let $B_\\wedge$ denote the unit ball with respect to the projective tensor norm in $\\ell_\\infty^{d_2} \\otimes \\ell_2^{d_1}$. The injective tensor norm in $\\ell_\\infty^{d_2} \\otimes \\ell_2^{d_1}$ reduces to $\\|\\cdot\\|_{1 \\to 2}$. By its construction, $\\kappa(\\alpha, R)$ is given as the intersection of two norm balls $\\alpha B_{1 \\to 2}$ and $R B_\\wedge$. The proof of Lemma 4 reduces to the computation of the entropy number of the identity map on $\\ell_\\infty^{d_2} \\otimes \\ell_2^{d_1}$ from the Banach space with the projective tensor norm to that with the injective tensor norm. This proof along with a study of the machinery of computing such entropy numbers can be found in a complementary paper [23].\n\n5 Discussion\n\nLow rank modeling is a widely used approach in many machine learning and signal processing tasks. By interpreting low-rankness as a property expressed by tensor norms, we are able to design a practical and sample efficient regularization method that is tailored to the observation model. The proposed method comes with theoretical guarantees and also performs well empirically. Our proposed method can also be implemented efficiently in high dimensions, making it a viable option for performing PCA or low rank recovery in big data scenarios.\n\nReferences\n\n[1] R. Spencer. The square kilometre array: The ultimate challenge for processing big data. In IET Seminar on Data Analytics: Deriving Intelligence and Value from Big Data.\n\n[2] J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher. Streaming low-rank matrix approximation with an application to scientific simulation. arXiv:1902.08651, 2019.\n\n[3] M. A. Davenport and J. Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE J. Sel. Topics Signal Process., 10(4):608\u2013622, June 2016.\n\n[4] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471\u2013501, 2010.\n\n[5] E. J. Cand\u00e8s and B. Recht. Exact matrix completion via convex optimization. Found. of Comp. Math., 9(6):717, 2009.\n\n[6] D. Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inf. Theory, 57(3):1548\u20131566, 2011.\n\n[7] T. T. Cai and W. Zhou. Matrix completion via max-norm constrained optimization. Electronic J. Stat., 10(1):1493\u20131525, 2016.\n\n[8] F. Pourkamali-Anaraki and S. Hughes. 
Memory and computation efficient PCA via very sparse random projections. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 1341\u20131349, Beijing, China, Jun. 2014.\n\n[9] M. Azizyan, A. Krishnamurthy, and A. Singh. Extreme compressive sampling for covariance estimation. arXiv preprint arXiv:1506.00898, 2015.\n\n[10] H. Qi and S. M. Hughes. Invariance of principal components under low-dimensional random projection of the data. In 19th IEEE Int. Conf. Image Process., pages 937\u2013940, Sep. 2012.\n\n[11] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Statist., 39(2):1069\u20131097, 04 2011.\n\n[12] N. Srebro, J. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1329\u20131336. MIT Press, 2005.\n\n[13] N. Linial, S. Mendelson, G. Schechtman, and A. Shraibman. Complexity measures of sign matrices. Combinatorica, 27(4):439\u2013463, 2007.\n\n[14] G. Jameson. Summing and nuclear norms in Banach space theory, volume 8. Cambridge University Press, 1987.\n\n[15] R. A. Ryan. Introduction to tensor products of Banach spaces. Springer Science & Business Media, 2013.\n\n[16] J. F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11(1-4):625\u2013653, 1999.\n\n[17] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends\u00ae in Machine Learning, 3(1):1\u2013122, 2011.\n\n[18] E. X. Fang, H. Liu, K. Toh, and W. Zhou. Max-norm optimization for robust matrix recovery. Mathematical Programming, 167(1):5\u201335, 2018.\n\n[19] K. Toh and S. Yun. 
An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization, 2010.\n\n[20] F. Krahmer, S. Mendelson, and H. Rauhut. Suprema of chaos processes and the restricted isometry property. Communications on Pure and Applied Mathematics, 67(11):1877\u20131904, 2014.\n\n[21] G. Pisier. The volume of convex bodies and Banach space geometry, volume 94. Cambridge University Press, 1999.\n\n[22] M. Ledoux and M. Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.\n\n[23] K. Lee, R. S. Srinivasa, M. Junge, and J. Romberg. Entropy estimates on tensor products of Banach spaces and applications to low-rank recovery. In Proceedings of 13th International Conference on Sampling Theory and Applications (SampTA), Bordeaux, France, July 2019.", "award": [], "sourceid": 5343, "authors": [{"given_name": "Rakshith Sharma", "family_name": "Srinivasa", "institution": "Georgia Institute of Technology"}, {"given_name": "Kiryung", "family_name": "Lee", "institution": "Ohio state university"}, {"given_name": "Marius", "family_name": "Junge", "institution": "University of Illinois"}, {"given_name": "Justin", "family_name": "Romberg", "institution": "Georgia Institute of Technology"}]}