{"title": "Nonconvex Low-Rank Tensor Completion from Noisy Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1863, "page_last": 1874, "abstract": "We study a completion problem of broad practical interest: the reconstruction of a low-rank symmetric tensor from highly incomplete and randomly corrupted observations of its entries. While a variety of prior work has been dedicated to this problem, prior algorithms either are computationally too expensive for large-scale applications, or come with sub-optimal statistical guarantees. Focusing on ``incoherent'' and well-conditioned tensors of a constant CP rank, we propose a two-stage nonconvex algorithm --- (vanilla) gradient descent following a rough initialization --- that achieves the best of both worlds. Specifically, the proposed nonconvex algorithm faithfully completes the tensor and retrieves all low-rank tensor factors within nearly linear time, while at the same time enjoying near-optimal statistical guarantees (i.e.~minimal sample complexity and optimal $\\ell_2$ and $\\ell_{\\infty}$ statistical accuracy). The insights conveyed through our analysis of nonconvex optimization might have implications for other tensor estimation problems.", "full_text": "Nonconvex Low-Rank Symmetric Tensor Completion\n\nfrom Noisy Data\n\nChangxiao Cai\n\nPrinceton University\n\nGen Li\n\nTsinghua University\n\nH. Vincent Poor\n\nPrinceton University\n\nYuxin Chen\n\nPrinceton University\n\nAbstract\n\nWe study a completion problem of broad practical interest: the reconstruction\nof a low-rank symmetric tensor from highly incomplete and randomly corrupted\nobservations of its entries. While a variety of prior work has been dedicated to\nthis problem, prior algorithms either are computationally too expensive for large-\nscale applications, or come with sub-optimal statistical guarantees. 
Focusing on \u201cincoherent\u201d and well-conditioned tensors of a constant CP rank, we propose a two-stage nonconvex algorithm \u2014 (vanilla) gradient descent following a rough initialization \u2014 that achieves the best of both worlds. Specifically, the proposed nonconvex algorithm faithfully completes the tensor and retrieves individual tensor factors within nearly linear time, while at the same time enjoying near-optimal statistical guarantees (i.e. minimal sample complexity and optimal $\ell_2$ and $\ell_\infty$ statistical accuracy). The insights conveyed through our analysis of nonconvex optimization might have implications for other tensor estimation problems.\n\n1 Introduction\n\n1.1 Tensor completion from noisy entries\n\nEstimation of low-complexity models from highly incomplete observations is a fundamental task that spans a diverse array of engineering applications. Arguably one of the most extensively studied problems of this kind is matrix completion, where one wishes to recover a low-rank matrix given only partial entries [21, 14]. Moving beyond matrix-type data, a natural higher-order generalization is low-rank tensor completion, which aims to reconstruct a low-rank tensor when the vast majority of its entries are unseen. 
There is certainly no shortage of applications that motivate the investigation of tensor completion, examples including seismic data analysis [44, 24], visual data in-painting [47, 46], medical imaging [25, 58, 19], and multi-dimensional harmonic retrieval [13, 72], to name just a few.\n\nFor the sake of clarity, we phrase the problem formally before we proceed, focusing on a simple model that already captures the intrinsic difficulty of tensor completion in many aspects.1 Imagine we are asked to estimate a symmetric order-three tensor2 $T^\star \in \mathbb{R}^{d\times d\times d}$ from a few noisy entries\n\n$T_{j,k,l} = T^\star_{j,k,l} + E_{j,k,l}, \qquad \forall (j,k,l) \in \Omega$,   (1)\n\nwhere $T_{j,k,l}$ is the observed noisy entry at location $(j,k,l)$, $E_{j,k,l}$ stands for the associated noise, and $\Omega \subseteq \{1,\cdots,d\}^3$ is a symmetric index subset to sample from. For notational simplicity, we set $T = [T_{j,k,l}]_{1\le j,k,l\le d}$ and $E = [E_{j,k,l}]_{1\le j,k,l\le d}$, with $T_{j,k,l} = E_{j,k,l} = 0$ for any $(j,k,l) \notin \Omega$. We adopt a random sampling model such that each index $(j,k,l)$ ($j \le k \le l$) is included in $\Omega$ independently with probability $p$. In addition, we know a priori that the unknown tensor $T^\star \in \mathbb{R}^{d\times d\times d}$ is a superposition of $r$ rank-one tensors (often termed a canonical polyadic (CP) decomposition if $r$ is minimal)\n\n$T^\star = \sum_{i=1}^r u^\star_i \otimes u^\star_i \otimes u^\star_i$, or more concisely, $T^\star = \sum_{i=1}^r u^{\star\otimes 3}_i$,   (2)\n\nwhere each $u^\star_i \in \mathbb{R}^d$ represents one of the $r$ factors. The primary question is: can we hope to faithfully estimate $T^\star$, as well as the factors $\{u^\star_i\}_{1\le i\le r}$, from the partially revealed entries (1)?\n\n1We focus on symmetric order-3 tensors primarily for simplicity of presentation. Many of our findings naturally extend to the more general case with asymmetric tensors of possibly higher order.\n2Here, a tensor $T \in \mathbb{R}^{d\times d\times d}$ is said to be symmetric if $T_{j,k,l} = T_{k,j,l} = T_{k,l,j} = T_{l,k,j} = T_{j,l,k} = T_{l,j,k}$.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1.2 Computational and statistical challenges\n\nEven though tensor completion conceptually resembles matrix completion in various ways, it is considerably more challenging than the matrix counterpart. This is perhaps not surprising, given that a plethora of natural tensor problems are all notoriously hard [32]. As a notable example, while matrix completion is often efficiently solvable under nearly minimal sample complexity [8, 29], all polynomial-time algorithms developed so far for tensor completion \u2014 even in the noise-free case \u2014 require a sample size at least exceeding the order of $rd^{3/2}$. This is substantially larger than the degrees of freedom (i.e. $rd$) underlying the model (2). In fact, it is widely conjectured that there exists a large computational barrier away from the information-theoretic sampling limits [4].\n\nWith this fundamental gap in mind, the current paper focuses on the regime (in terms of the sample size) that enables reliable tensor completion in polynomial time. A variety of algorithms have been proposed that enjoy some sort of theoretical guarantees in (at least part of) this regime, including but not limited to spectral methods [50], the sum-of-squares hierarchy [4, 53], nonconvex algorithms [36, 67], and also convex relaxation (based on proper unfolding) [25, 64, 34, 57, 47, 51, 28]. 
While these are all polynomial-time algorithms, most of the computational complexities supported by prior theory remain prohibitively high when dealing with large-scale tensor data. The only exception is the unfolding-based spectral method, which, however, fails to achieve exact recovery even when the noise vanishes. This leads to a critical question that this paper aims to explore:\n\nQ1: Is there any linear-time algorithm that is guaranteed to work for tensor completion?\n\nGoing beyond such computational concerns, one might naturally wonder whether it is also possible for a fast algorithm to achieve a nearly un-improvable statistical accuracy in the presence of noise. Towards this end, intriguing stability guarantees have been established for the sum-of-squares hierarchy in the noisy settings [4], although this paradigm is computationally prohibitive for large-scale data. The recent work [68] came up with a two-stage algorithm (i.e. spectral method followed by tensor power iterations) for noisy tensor completion. Its estimation accuracy, however, falls short of achieving exact recovery in the absence of noise. This gives rise to another question of fundamental importance:\n\nQ2: Can we achieve near-optimal statistical accuracy without compromising computational efficiency?\n\n1.3 A two-stage nonconvex algorithm\n\nTo address the above-mentioned challenges, a first impulse is to resort to the least squares formulation\n\n$\min_{u_1,\cdots,u_r \in \mathbb{R}^d} \; \sum_{(j,k,l)\in\Omega} \Big( \big[\sum_{i=1}^r u_i^{\otimes 3}\big]_{j,k,l} - T_{j,k,l} \Big)^2$,   (3)\n\nor more concisely (up to proper re-scaling),\n\n$\min_{U \in \mathbb{R}^{d\times r}} \; f(U) := \frac{1}{6p} \Big\| \mathcal{P}_\Omega\Big( \sum_{i=1}^r u_i^{\otimes 3} - T \Big) \Big\|_F^2$,   (4)\n\nif we take $U := [u_1, \ldots, u_r] \in \mathbb{R}^{d\times r}$. 
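As a concrete (if naive, fully dense) illustration, the objective (4) can be evaluated in a few lines of NumPy. This sketch is our own, not the authors' implementation, and the Boolean mask encoding $\Omega$ is an assumption of the illustration:

```python
import numpy as np

def cp_tensor(U):
    """Form sum_i u_i (x) u_i (x) u_i for U = [u_1, ..., u_r] of shape (d, r)."""
    return np.einsum("ia,ja,ka->ijk", U, U, U)

def loss(U, T_obs, mask, p):
    """f(U) = (1/6p) * || P_Omega( sum_i u_i^{(x)3} - T ) ||_F^2.

    T_obs is the observed tensor (zero outside Omega) and mask is a
    symmetric Boolean array encoding Omega.
    """
    residual = mask * (cp_tensor(U) - T_obs)
    return np.sum(residual ** 2) / (6.0 * p)
```

With full observations ($p = 1$, all-ones mask) and no noise, the loss vanishes exactly at the ground-truth factors.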
Here, we denote by $\mathcal{P}_\Omega(T)$ the orthogonal projection of any tensor $T$ onto the subspace of tensors which vanish outside of $\Omega$. This optimization problem, however, is highly nonconvex, resulting in computational intractability in general.\n\nFortunately, not all nonconvex problems are as daunting as they may seem. For example, recent years have seen a flurry of activity in low-rank matrix factorization via nonconvex optimization, which achieves optimal statistical and computational efficiency at once [55, 39, 41, 35, 9, 12, 62, 20, 18, 11, 76, 49, 65, 78]. Motivated by this strand of work, we propose to solve (4) via a two-stage nonconvex paradigm, presented below in reverse order. The procedure is summarized in Algorithms 1-3.\n\nGradient descent (GD). Arguably one of the simplest optimization algorithms is gradient descent, which adopts a gradient update rule\n\n$U^{t+1} = U^t - \eta_t \nabla f(U^t), \qquad t = 0, 1, \cdots$   (5)\n\nwhere $\eta_t$ is the learning rate. The main computational burden in each iteration lies in gradient evaluation, which, in this case, can be performed in time proportional to that taken to read the data.\n\nAlgorithm 1 Gradient descent for nonconvex tensor completion\n1: Input: observed entries $\{T_{j,k,l} \mid (j,k,l) \in \Omega\}$, sampling rate $p$, number of iterations $t_0$.\n2: Generate an initial estimate $U^0 \in \mathbb{R}^{d\times r}$ via Algorithm 2.\n3: for $t = 0, 1, \ldots, t_0 - 1$ do\n4: $U^{t+1} = U^t - \eta_t \nabla f(U^t) = U^t - \eta_t \frac{1}{p} \mathcal{P}_\Omega\big( \sum_{i=1}^r (u^t_i)^{\otimes 3} - T \big) \times^{\mathrm{seq}}_1 U^t \times^{\mathrm{seq}}_2 U^t$, where $\times^{\mathrm{seq}}_1$ and $\times^{\mathrm{seq}}_2$ are defined in Section 1.5.\n\nAlgorithm 2 Spectral initialization for nonconvex tensor completion\n1: Input: sampling set $\Omega$, observed entries $\{T_{i,j,k} \mid (i,j,k) \in \Omega\}$, sampling rate $p$.\n2: Let $U\Lambda U^\top$ be the rank-$r$ eigen-decomposition of\n\n$B := \mathcal{P}_{\mathrm{off\text{-}diag}}(AA^\top)$,   (6)\n\nwhere $A = \mathrm{unfold}(p^{-1}T)$ is the mode-1 matricization of $p^{-1}T$, and $\mathcal{P}_{\mathrm{off\text{-}diag}}(Z)$ extracts out the off-diagonal entries of $Z$.\n3: Output: initial estimate $U^0 \in \mathbb{R}^{d\times r}$ from $U \in \mathbb{R}^{d\times r}$ using Algorithm 3.\n\nDespite the simplicity of this algorithm, two critical issues stand out and might significantly affect its efficiency, which we shall bear in mind throughout the algorithmic and theoretical development.\n\n(i) Local stationary points and initialization. As is well known, GD is guaranteed to find an approximate local stationary point, provided that the learning rates do not exceed the inverse Lipschitz constant of the gradient [5]. There exist, however, local stationary points (e.g. saddle points or spurious local minima) that might fall short of the desired statistical properties. This requires us to properly avoid such undesired points, while retaining computational efficiency. To address this issue, one strategy is to first identify a rough initial guess within a local region surrounding the global solution, which often helps rule out bad local minima. 
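To make the update in Algorithm 1 concrete, here is a minimal dense NumPy sketch of one gradient step. This is our own illustration, not the authors' code; it exploits the symmetry of $\Omega$ and of the residual, under which the three mode-wise contributions to the gradient coincide:

```python
import numpy as np

def grad_f(U, T_obs, mask, p):
    """Gradient of f: (1/p) * P_Omega( sum_i u_i^{(x)3} - T ) x1^seq U x2^seq U.

    Valid because the mask (Omega) and the residual are symmetric, so the
    three partial-derivative terms with respect to each u_i are identical.
    """
    residual = mask * (np.einsum("ia,ja,ka->ijk", U, U, U) - T_obs)
    return np.einsum("jkl,ja,ka->la", residual, U, U) / p

def gd_step(U, T_obs, mask, p, eta):
    """One vanilla gradient-descent iterate U^{t+1} = U^t - eta * grad f(U^t)."""
    return U - eta * grad_f(U, T_obs, mask, p)
```

At the ground truth with full noiseless observations the gradient vanishes, so gradient descent stays put, as expected of a global minimizer.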
As a side remark, while careful initialization might not be crucial for several matrix recovery cases [45, 15, 27], it does seem to be critical in various tensor problems [56]. We shall elucidate this point in the full version [7].\n\n(ii) Learning rates and regularization. Learning rates play a pivotal role in determining the convergence properties of GD. The challenge, however, is that the loss function (4) is overall not sufficiently smooth (i.e. its gradient often has a very large Lipschitz constant), and hence generic optimization theory recommends a pessimistically slow update rule (i.e. an extremely small learning rate) so as to guard against over-shooting. This, however, slows down the algorithm significantly, thus destroying the main computational advantage of GD (i.e. low per-iteration cost). With this issue in mind, prior literature suggests carefully designed regularization steps (e.g. proper projection, regularized loss functions) in order to improve the geometry of the optimization landscape [67]. By contrast, we argue that one is allowed to take a constant learning rate \u2014 which is as aggressive as it can possibly be \u2014 even without enforcing any regularization procedures.\n\nInitialization. Motivated by the above-mentioned issue (i), we develop a procedure that guarantees a reasonable initial estimate. In a nutshell, the proposed procedure consists of two steps:\n(a) Estimate the subspace spanned by the $r$ tensor factors $\{u^\star_i\}_{1\le i\le r}$ via a spectral method;\n(b) Disentangle individual tensor factors from this subspace estimate.\n\nThe computational complexity of the proposed initialization is linear-time (i.e. $O(pd^3)$) when $r = O(1)$. Note, however, that these steps are more complicated to describe. We postpone the details to Section 2 and intuitions to [7]. 
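As a preview of step (b), the core subroutine of Algorithm 3 (RETRIEVE-ONE-TENSOR-FACTOR, detailed in Section 2.2) can be sketched in a few lines. This dense NumPy sketch is our own illustration and the variable names are assumptions:

```python
import numpy as np

def retrieve_one_factor(T_obs, p, U, rng):
    """Collapse p^{-1} T along a random direction of span(U) and extract the
    leading singular vector, plus a size estimate and the spectral gap that
    the pruning step later uses to rank candidates."""
    g = rng.standard_normal(U.shape[0])
    theta = U @ (U.T @ g)                                # theta = U U^T g
    M = np.einsum("ijk,k->ij", T_obs, theta) / p         # M = p^{-1} T x3 theta
    W, s, _ = np.linalg.svd(M)
    nu = W[:, 0]
    if np.einsum("ijk,i,j,k->", T_obs, nu, nu, nu) < 0:  # enforce <T, nu^{(x)3}> >= 0
        nu = -nu
    lam = np.einsum("ijk,i,j,k->", T_obs, nu, nu, nu) / p
    return nu, lam, s[0] - s[1]      # factor estimate is lam^{1/3} * nu
```

On a fully observed rank-1 tensor this returns the factor direction and its size exactly, which is the intuition behind the restart-and-prune scheme.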
The readers can catch a glimpse of these procedures in Algorithms 2-3.\n\n1.4 Main results\n\nEncouragingly, the proposed nonconvex algorithm provably achieves the best of both worlds \u2014 in terms of statistical accuracy and computational efficiency \u2014 for a broad class of problem instances.\n\nAlgorithm 3 Retrieval of low-rank tensor factors from a given subspace estimate.\n1: Input: sampling set $\Omega$, observed entries $\{T_{i,j,k} \mid (i,j,k)\in\Omega\}$, sampling rate $p$, number of restarts $L$, pruning threshold $\epsilon_{\mathrm{th}}$, subspace estimate $U \in \mathbb{R}^{d\times r}$.\n2: for $\tau = 1, \ldots, L$ do\n3: Generate an independent Gaussian vector $g_\tau \sim \mathcal{N}(0, I_d)$.\n4: $(\nu_\tau, \lambda_\tau, \mathrm{spec\text{-}gap}_\tau) \leftarrow$ RETRIEVE-ONE-TENSOR-FACTOR$(T, p, U, g_\tau)$.\n5: Generate $\{(w_1,\lambda_1),\ldots,(w_r,\lambda_r)\} \leftarrow$ PRUNE$\big(\{(\nu_\tau,\lambda_\tau,\mathrm{spec\text{-}gap}_\tau)\}_{\tau=1}^L, \epsilon_{\mathrm{th}}\big)$.\n6: Output: initial estimate $U^0 = [\lambda_1^{1/3} w_1, \ldots, \lambda_r^{1/3} w_r]$.\n\n1: function RETRIEVE-ONE-TENSOR-FACTOR$(T, p, U, g)$\n2: Compute\n\n$\theta = UU^\top g =: \mathcal{P}_U(g)$,   (7a)\n$M = p^{-1} T \times_3 \theta$,   (7b)\n\nwhere $\times_3$ is defined in Section 1.5.\n3: Let $\nu$ be the leading singular vector of $M$ obeying $\langle T, \nu^{\otimes 3}\rangle \ge 0$; set $\lambda = \langle p^{-1}T, \nu^{\otimes 3}\rangle$. return $(\nu, \lambda, \sigma_1(M) - \sigma_2(M))$.\n\nBefore continuing, we note that one cannot hope to recover an arbitrary tensor from highly sub-sampled and arbitrarily corrupted entries. In order to enable provably valid recovery, the present paper focuses on a tractable model by imposing the following assumptions.\n\nAssumption 1.1 (Incoherence and well-conditionedness). The tensor factors $\{u^\star_i\}_{1\le i\le r}$ satisfy\n\n(A1) $\|T^\star\|_\infty \le \sqrt{\mu_0/d^3}\, \|T^\star\|_F$;   (8a)\n(A2) $\|u^\star_i\|_\infty \le \sqrt{\mu_1/d}\, \|u^\star_i\|_2$, for all $1 \le i \le r$;   (8b)\n(A3) $|\langle u^\star_i, u^\star_j\rangle| \le \sqrt{\mu_2/d}\, \|u^\star_i\|_2 \|u^\star_j\|_2$, for all $1 \le i \ne j \le r$;   (8c)\n(A4) $\kappa := \big(\max_i \|u^\star_i\|_2^3\big) \big/ \big(\min_i \|u^\star_i\|_2^3\big) = O(1)$.   (8d)\n\nRemark 1.2. Here, $\mu_0$, $\mu_1$ and $\mu_2$ are termed the incoherence parameters. Assumptions A1, A2 and A3 can be viewed as some sort of incoherence conditions for the tensor. For instance, when $\mu_0$, $\mu_1$ and $\mu_2$ are small, these conditions say that (1) the energy of the tensor $T^\star$ is (nearly) evenly spread across all entries; (2) each factor $u^\star_i$ is de-localized; (3) the factors $\{u^\star_i\}$ are nearly orthogonal to each other. Assumption A4 is concerned with the \u201cwell-conditionedness\u201d of the tensor, meaning that each rank-1 component is of roughly the same strength.\n\nFor notational simplicity, we shall set $\mu := \max\{\mu_0, \mu_1, \mu_2\}$.\n\nAssumption 1.3 (Random noise). Suppose that $E$ is a symmetric random tensor, where $\{E_{j,k,l}\}_{1\le j\le k\le l\le d}$ (cf. 
(1)) are independently generated symmetric sub-Gaussian random variables with mean zero and variance $\mathrm{Var}(E_{j,k,l}) \le \sigma^2$.\n\nIn addition, recognizing that there is a global permutational ambiguity issue (namely, one cannot distinguish $u^\star_1, \cdots, u^\star_r$ from an arbitrary permutation of them), we introduce the following loss metrics to account for this ambiguity:\n\n$\mathrm{dist}_F(U, U^\star) := \min_{\Pi \in \mathrm{perm}_r} \|U\Pi - U^\star\|_F$,   (9a)\n$\mathrm{dist}_\infty(U, U^\star) := \min_{\Pi \in \mathrm{perm}_r} \|U\Pi - U^\star\|_\infty$,   (9b)\n$\mathrm{dist}_{2,\infty}(U, U^\star) := \min_{\Pi \in \mathrm{perm}_r} \|U\Pi - U^\star\|_{2,\infty}$,   (9c)\n\nwhere $\mathrm{perm}_r$ stands for the set of $r \times r$ permutation matrices. For notational simplicity, we also take $\lambda^\star_{\min} := \min_{1\le i\le r} \|u^\star_i\|_2^3$ and $\lambda^\star_{\max} := \max_{1\le i\le r} \|u^\star_i\|_2^3$.\n\n1: function PRUNE$\big(\{(\nu_\tau, \lambda_\tau, \mathrm{spec\text{-}gap}_\tau)\}_{\tau=1}^L, \epsilon_{\mathrm{th}}\big)$\n2: Set $\Theta = \{(\nu_\tau, \lambda_\tau, \mathrm{spec\text{-}gap}_\tau)\}_{\tau=1}^L$.\n3: for $i = 1, \ldots, r$ do\n4: Choose $(\nu_\tau, \lambda_\tau, \mathrm{spec\text{-}gap}_\tau)$ from $\Theta$ with the largest $\mathrm{spec\text{-}gap}_\tau$; set $w_i = \nu_\tau$, $\lambda_i = \lambda_\tau$.\n5: Update $\Theta \leftarrow \Theta \setminus \{(\nu_\tau, \lambda_\tau, \mathrm{spec\text{-}gap}_\tau) \in \Theta : |\langle \nu_\tau, w_i\rangle| > 1 - \epsilon_{\mathrm{th}}\}$.\n6: return $\{(w_1, \lambda_1), \ldots, (w_r, \lambda_r)\}$.\n\nWith these in place, we are ready to present our main results.\n\nTheorem 1.4. Fix an arbitrarily small constant $\delta > 0$. 
Suppose that $r, \kappa, \mu = O(1)$,\n\n$p \ge c_0 \frac{\log^5 d}{d^{3/2}}$ and $\frac{\sigma}{\lambda^\star_{\min}} \sqrt{\frac{d^{3/2}\log^5 d}{p}} \le c_1$,   (10a)\n\nfor some sufficiently large constants $c_0, c_3 > 0$ and some sufficiently small constants $c_1, c_4 > 0$, and set $L = c_3 \log d$ and $\epsilon_{\mathrm{th}} = c_4 \sqrt{(\log d)/d}$. The learning rate $\eta_t \equiv \eta$ is taken to be a constant obeying $0 < \eta \le \lambda^{\star 4/3}_{\min} \big/ \big(32 \lambda^{\star 8/3}_{\max}\big)$. Then with probability at least $1 - \delta$,\n\n$\mathrm{dist}_F(U^t, U^\star) \le \Big( C_1 \rho^t + C_2 \frac{\sigma}{\lambda^\star_{\min}} \sqrt{\frac{d\log d}{p}} \Big) \|U^\star\|_F$,\n$\mathrm{dist}_\infty(U^t, U^\star) \le \mathrm{dist}_{2,\infty}(U^t, U^\star) \le \Big( C_3 \rho^t + C_4 \frac{\sigma}{\lambda^\star_{\min}} \sqrt{\frac{d\log^2 d}{p}} \Big) \|U^\star\|_{2,\infty}$   (10b)\n\nhold simultaneously for all $0 \le t \le t_0 = d^5$. Here, $0 < C_1, C_3, \rho < 1$ and $C_2, C_4 > 0$ are some absolute constants.\n\nProof. The proof of this theorem is built upon a powerful statistical technique \u2014 called the leave-one-out analysis [23, 16, 1, 49, 79, 15, 22, 17, 52]. The proof can be found in our full version [7].\n\nSeveral important implications are as follows. The discussion below assumes $\lambda^\star_{\max} \asymp \lambda^\star_{\min} \asymp 1$ for notational simplicity.\n\u2022 Linear convergence. In the absence of noise, the proposed algorithm converges linearly, namely, it provably attains $\varepsilon$ accuracy within $O(\log(1/\varepsilon))$ iterations. Given the inexpensiveness of each gradient iteration, this algorithm can be viewed as a linear-time algorithm, which can almost be implemented as long as we can read the data.\n\u2022 Near-optimal sample complexity. The fast convergence is guaranteed as soon as the sample size exceeds the order of $d^{3/2}\,\mathrm{poly}\log(d)$. This matches the minimal sample complexity \u2014 modulo some logarithmic factor \u2014 known so far for any polynomial-time algorithm.\n\u2022 Near-optimal statistical accuracy. The proposed algorithm converges geometrically fast to a point with Euclidean error $O\big(\sigma\sqrt{(d\log d)/p}\big)$. This matches the lower bound established in [68, Theorem 5] up to some logarithmic factor.\n\u2022 Entrywise estimation accuracy. In addition to the Euclidean error bound, we have also established an entrywise error bound which, to the best of our knowledge, has not been established in any of the prior works. When $t$ is sufficiently large, the iterates reach an entrywise error bound $O\big(\sigma\sqrt{(\log d)/p}\big)$. This entrywise error bound is about $\sqrt{d}$ times smaller than the above $\ell_2$ norm bound, implying that the estimation errors are evenly spread out across all entries.\n\u2022 Implicit regularization. One appealing feature of our finding is the simplicity of the algorithm. All of the above statistical and computational benefits hold for vanilla gradient descent (when properly initialized). This should be contrasted with prior work (e.g. [67]) that requires extra regularization to stabilize the optimization landscape. In principle, vanilla GD implicitly constrains itself within a region of well-conditioned landscape, thus enabling fast convergence without regularization.\n\u2022 No sample splitting. The theory developed herein does not require fresh samples in each iteration. We note that sample splitting has been frequently adopted in other contexts primarily to simplify analysis. Nevertheless, it typically does not exploit the data in an efficient manner (i.e. 
each data sample is used only once), thus resulting in the need of a much larger sample size in practice.\n\nAs an immediate consequence of Theorem 1.4, we obtain optimal $\ell_\infty$ statistical guarantees for estimating tensor entries, which are previously rarely available (see Table 1). Specifically, let our tensor estimate in the $t$-th iteration be $T^t := \sum_{i=1}^r u^t_i \otimes u^t_i \otimes u^t_i$, where $U^t = [u^t_1, \cdots, u^t_r] \in \mathbb{R}^{d\times r}$.\n\nalgorithm | sample complexity | comput. complexity | $\ell_2$ error (noisy) | $\ell_\infty$ error (noisy) | recovery type (noiseless)\nspectral method + (vanilla) GD (ours) | $d^{1.5}$ | $pd^3$ | $\sigma\sqrt{d/p}$ | $\sigma\sqrt{1/p}$ | exact\nspectral initialization + tensor power method [68] | $d^{1.5}$ | $pd^3$ | $(\|T^\star\|_\infty + \sigma)\sqrt{d}/\sqrt{p}$ | n/a | approximate\nspectral method + GD on manifold [67] | $d^{1.5}$ | poly$(d)$ | n/a | n/a | exact\nspectral method [50] | $d^{1.5}$ | $pd^{1.5}$ | n/a | n/a | approximate\nsum-of-squares [4] | $d^{1.5}$ | $d^{15}$ | $\|T^\star\|_F/\sqrt{d} + \sigma d^{1.5}$ | n/a | approximate\nsum-of-squares [53] | $d^{1.5}$ | $d^{10}$ | n/a | n/a | exact\ntensor nuclear norm minimization [73], [74] | $d^{1.5}$ | NP-hard | n/a | n/a | exact\nTable 1: Comparison with theory for existing methods when $r, \mu, \kappa \asymp 1$ (neglecting log factors).\n\nCorollary 1.5. Fix an arbitrarily small constant $\delta > 0$. 
Instate the assumptions of Theorem 1.4. Then with probability at least $1 - \delta$,\n\n$\|T^t - T^\star\|_F \lesssim \Big( C_1\rho^t + C_2 \frac{\sigma}{\lambda^\star_{\min}} \sqrt{\frac{d\log d}{p}} \Big) \|T^\star\|_F$,   (11a)\n$\|T^t - T^\star\|_\infty \lesssim \Big( C_3\rho^t + C_4 \frac{\sigma}{\lambda^\star_{\min}} \sqrt{\frac{d\log d}{p}} \Big) \|T^\star\|_\infty$,   (11b)\n\nhold simultaneously for all $0 \le t \le t_0 = d^5$. Here, $0 < C_1, C_3, \rho < 1$ and $C_2, C_4 > 0$ are some absolute constants.\n\nWe shall take a moment to discuss the merits of our approach in comparison to prior work. One of the best-known polynomial-time algorithms is the degree-6 level of the sum-of-squares hierarchy, which seems to match the computationally feasible limit in terms of the sample complexity [4]. However, this approach has a well-documented limitation in that it involves solving a semidefinite program of dimensions $d^3 \times d^3$, which requires enormous storage and computation power. Yuan et al. [73, 74] proposed to consider tensor nuclear norm minimization, which provably allows for reduced sample complexity. The issue, however, is that computing the tensor nuclear norm itself is already computationally intractable. The work [50] alleviates this computational burden by resorting to a clever unfolding-based spectral algorithm; it is a nearly linear-time procedure that enables near-minimal sample complexity (among polynomial-time algorithms), although it does not achieve exact recovery even in the absence of noise. The two-stage algorithm developed by [68] \u2014 which is based on spectral initialization followed by tensor power methods \u2014 shares advantages and drawbacks similar to those of [50]. 
The work [36] used tensor power methods for initialization, which, however, requires a large number of restart attempts; see discussions in [7]. Further, [67] proposes a polynomial-time nonconvex algorithm based on gradient descent over the Grassmann manifold (with a properly regularized objective function), which is an extension of the nonconvex matrix completion algorithm proposed by [40, 41] to tensor data. The theory provided in [67], however, does not provide explicit computational complexities. The recent work [59] attempts tensor estimation via a collaborative filtering approach, which, however, does not enable exact recovery even in the absence of noise.\n\n1.5 Notations\n\nBefore proceeding, we gather a few notations that will be used throughout this paper. For any tensors $T, R \in \mathbb{R}^{d\times d\times d}$, the inner product is defined as $\langle T, R\rangle = \sum_{j,k,l} T_{j,k,l} R_{j,k,l}$. The Frobenius norm of $T$ is defined as $\|T\|_F := \sqrt{\langle T, T\rangle}$. For any vectors $u, v \in \mathbb{R}^d$, we define the vector products of a tensor $T \in \mathbb{R}^{d\times d\times d}$ \u2014 denoted by $T \times_3 u \in \mathbb{R}^{d\times d}$ and $T \times_1 u \times_2 v \in \mathbb{R}^d$ \u2014 such that\n\n$[T \times_3 u]_{i,j} := \sum_{1\le k\le d} T_{i,j,k}\, u_k$, for $1 \le i, j \le d$;   (12a)\n$[T \times_1 u \times_2 v]_k := \sum_{1\le i,j\le d} T_{i,j,k}\, u_i v_j$, for $1 \le k \le d$.   (12b)\n\nFor any $U = [u_1, \cdots, u_r] \in \mathbb{R}^{d\times r}$ and $V = [v_1, \cdots, v_r] \in \mathbb{R}^{d\times r}$, we define\n\n$T \times^{\mathrm{seq}}_1 U \times^{\mathrm{seq}}_2 V := [T \times_1 u_i \times_2 v_i]_{1\le i\le r} \in \mathbb{R}^{d\times r}$.   (13)\n\nFurther, $f(n) \lesssim g(n)$ or $f(n) = O(g(n))$ means that $|f(n)/g(n)| \le C_1$ for some constant $C_1 > 0$; $f(n) \gtrsim g(n)$ means that $|f(n)/g(n)| \ge C_2$ for some constant $C_2 > 0$; $f(n) \asymp g(n)$ means that $C_1 \le |f(n)/g(n)| \le C_2$ for some constants $C_1, C_2 > 0$.\n\n2 Initialization\n\nThis section presents formal details of the proposed two-step initialization. Recall that the proposed initialization procedures consist of two steps, which we detail separately.\n\n2.1 Step 1: subspace estimation via a spectral method\n\nThe spectral algorithm is often applied in conjunction with simple \u201cunfolding\u201d (or \u201cmatricization\u201d) to estimate the subspace spanned by the $r$ factors $\{u^\star_i\}_{1\le i\le r}$. This strategy is partly motivated by prior approaches developed for covariance estimation with missing data [48, 50], and has been investigated in detail in [6]. 
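For concreteness, the tensor-vector products (12a), (12b) and (13) from Section 1.5, which both the gradient update and the spectral step below rely on, admit direct one-line implementations. This NumPy sketch is our own, not from the paper:

```python
import numpy as np

def mode3(T, u):
    """[T x3 u]_{i,j} = sum_k T_{i,j,k} u_k  (a d x d matrix)."""
    return np.einsum("ijk,k->ij", T, u)

def mode12(T, u, v):
    """[T x1 u x2 v]_k = sum_{i,j} T_{i,j,k} u_i v_j  (a d-vector)."""
    return np.einsum("ijk,i,j->k", T, u, v)

def seq_prod(T, U, V):
    """T x1^seq U x2^seq V = [T x1 u_i x2 v_i]_{1<=i<=r}  (a d x r matrix)."""
    return np.einsum("ijk,ia,ja->ka", T, U, V)
```

Each call is a single einsum contraction, so the cost is proportional to the number of (nonzero) tensor entries, matching the linear-time claims made for the gradient evaluation.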
For self-containedness, we provide a brief introduction below, and refer the interested reader to [6] for in-depth discussions.\n\nLet\n\n$A = \mathrm{unfold}_{1\times 2}(p^{-1}T) \in \mathbb{R}^{d\times d^2}$, or more concisely $A = \mathrm{unfold}(p^{-1}T) \in \mathbb{R}^{d\times d^2}$,   (14)\n\nbe the mode-1 matricization of $p^{-1}T$ (namely, $p^{-1} T_{i,j,k} = A_{i,(j-1)d+k}$ for any $1 \le i, j, k \le d$) [43]. The rationale of this step is that: under our model, the unfolded matrix $A$ obeys\n\n$\mathbb{E}[A] = \mathrm{unfold}(T^\star) = \sum_{i=1}^r u^\star_i (u^\star_i \otimes u^\star_i)^\top =: A^\star$,   (15)\n\nwhose column space is precisely the span of $\{u^\star_i\}_{1\le i\le r}$. This motivates one to estimate the $r$-dimensional column space of $\mathbb{E}[A]$ from $A$. Towards this, a natural strategy is to look at the principal subspace of $AA^\top$. However, the diagonal entries of $AA^\top$ bear too much influence on the principal directions and need to be properly down-weighed. The current paper chooses to work with the principal subspace of the following matrix that zeros out all diagonal components:\n\n$B := \mathcal{P}_{\mathrm{off\text{-}diag}}(AA^\top)$,   (16)\n\nwhere $\mathcal{P}_{\mathrm{off\text{-}diag}}(Z)$ extracts out the off-diagonal entries of a square matrix $Z$. If we let $U \in \mathbb{R}^{d\times r}$ be an orthonormal matrix whose columns are the top-$r$ eigenvectors of $B$, then $U$ serves as our subspace estimate. See Algorithm 2 for a summary of the procedure, and [6] for in-depth discussions.\n\n2.2 Step 2: retrieval of low-rank tensor factors from the subspace estimate\n\n2.2.1 Procedure\n\nAs it turns out, it is possible to obtain rough (but reasonable) estimates of all low-rank tensor factors $\{u^\star_i\}_{1\le i\le r}$ \u2014 up to global permutation \u2014 given a reliable subspace estimate $U$. 
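The subspace-estimation step just described (eq. (16) and Algorithm 2) takes only a few lines in a dense NumPy sketch. This is our own illustration, not the authors' code; selecting the top-r eigenvectors of B by largest eigenvalue is our reading of the rank-r eigen-decomposition:

```python
import numpy as np

def subspace_estimate(T_obs, p, r):
    """Return a d x r orthonormal basis estimating span{u_1*, ..., u_r*}."""
    d = T_obs.shape[0]
    A = T_obs.reshape(d, d * d) / p        # mode-1 matricization of p^{-1} T
    B = A @ A.T
    np.fill_diagonal(B, 0.0)               # P_off-diag: zero out the diagonal
    eigvals, eigvecs = np.linalg.eigh(B)   # eigenvalues in ascending order
    return eigvecs[:, -r:]                 # top-r eigenvectors of B
```

When the factors are orthonormal and the diagonal of $AA^\top$ happens to be constant, zeroing it out only shifts the spectrum, so the principal subspace is recovered exactly; in general the diagonal removal is what controls the bias caused by random sampling.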
This is in stark\ncontrast to the low-rank matrix recovery case, where there exists some global rotational ambiguity\nthat prevents us from disentangling the r factors of interest.\nWe begin by describing how to retrieve one tensor factor from the subspace estimate \u2014 a procedure\nsummarized in RETRIEVE-ONE-TENSOR-FACTOR(). Let us generate a random vector from the\nprovided subspace U (which has orthonormal columns), that is,\n\n\u03b8 = U U(cid:62)g,\n\ng \u223c N (0, Id).\n\n(17)\nThe rescaled tensor data p\u22121T is then transformed into a matrix via proper \u201cprojection\u201d along this\nrandom direction \u03b8, namely,\n\np T \u00d73 \u03b8 \u2208 Rd\u00d7d.\n\n(18)\nOur estimate for a tensor factor is then given by \u03bb1/3\u03bd, where \u03bd is the leading singular vector of M\n\nobeying (cid:104)T , \u03bd\u22973(cid:105) \u2265 0, and \u03bb is taken as \u03bb =(cid:10)p\u22121T , \u03bd\u22973(cid:11). Informally, \u03bd re\ufb02ects the direction of\n\nM = 1\n\ni that exhibits the largest correlation with the random direction \u03b8, and \u03bb forms an\n\nthe component u(cid:63)\nestimate of the corresponding size (cid:107)u(cid:63)\nA challenge remains, however, as there are oftentimes more than one tensor factors to estimate. To\naddress this issue, we propose to re-run the aforementioned procedure multiple times, so as to ensure\nthat we get to retrieve each tensor factor of interest at least once. We will then apply a careful pruning\nprocedure (i.e. PRUNE()) to remove redundancy.\n\ni (cid:107)2. We shall provide intuition in the full version [7].\n\n7\n\n\fFigure 1: Relative error of the es-\ntimate U t and T t vs. the iteration\ncount t for the noiseless case, where\nd = 100, r = 4, p = 0.1.\n\nFigure 2: Empirical success rate\nvs. sampling rate. Each point is av-\neraged over 100 trials.\n\nFigure 3: Squared relative errors\nvs. SNR for noisy settings. Here,\nd = 100, r = 4, p = 0.1. 
3 Numerical experiments

We carry out a series of numerical experiments to corroborate our theoretical findings. We generate the truth $\mathcal{T}^\star = \sum_{1\le i\le r} u_i^{\star\otimes 3}$ randomly with $u_i^\star \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_d)$. The learning rates, the restart number and the pruning threshold are taken to be $\eta_t \equiv 0.2$, $L = 64$, $\epsilon_{\mathsf{th}} = 0.4$.

We start with the numerical convergence rates of our algorithm in the absence of noise. Set $d = 100$, $r = 4$ and $p = 0.1$. Fig. 1 shows the numerical estimation errors vs. the iteration count $t$ in a typical Monte Carlo trial. Here, four kinds of estimation errors are reported: (1) the relative Euclidean error $\mathsf{dist}_{\mathrm{F}}(U^t, U^\star)/\|U^\star\|_{\mathrm{F}}$; (2) the relative $\|\cdot\|_{2,\infty}$ error $\mathsf{dist}_{2,\infty}(U^t, U^\star)/\|U^\star\|_{2,\infty}$; (3) the relative Frobenius norm error $\|\mathcal{T}^t - \mathcal{T}^\star\|_{\mathrm{F}}/\|\mathcal{T}^\star\|_{\mathrm{F}}$; (4) the relative $\ell_\infty$ error $\|\mathcal{T}^t - \mathcal{T}^\star\|_\infty/\|\mathcal{T}^\star\|_\infty$. Here, we set $\mathcal{T}^t = \sum_{i=1}^{r} u_i^t \otimes u_i^t \otimes u_i^t$ with $U^t = [u_1^t, \cdots, u_r^t]$. For all these metrics, the numerical estimation errors decay geometrically fast.

Next, we study the phase transition (in terms of the success rates for exact recovery) in the noise-free settings. For the sake of comparison, we also report the numerical performance of the tensor power method (TPM) followed by gradient descent. When running the tensor power method, we set the iteration number and the restart number to be 16 and 64, respectively. Set $r = 4$. Each trial is claimed to succeed if the relative $\ell_2$ error obeys $\mathsf{dist}_{\mathrm{F}}(\widehat{U}, U^\star)/\|U^\star\|_{\mathrm{F}} \le 0.01$. Fig. 2 plots the empirical success rates over 100 independent trials. As can be seen, our initialization algorithm outperforms the tensor power method.

The third series of experiments concerns the statistical accuracy of our algorithm.
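As a concrete illustration of this simulation setup, the following sketch generates a random low-rank instance and evaluates the relative Frobenius error metric. This is our own illustration; in particular, the paper observes a symmetric set of entries, which we simplify here to i.i.d. Bernoulli sampling of all entries.

```python
import numpy as np

# Draw r random factors, form T* = sum_i u_i* (x) u_i* (x) u_i*, and sample
# entries with rate p (zero-filled observations), mirroring the setup above.
rng = np.random.default_rng(0)
d, r, p = 100, 4, 0.1
U_star = rng.standard_normal((d, r))                     # u_i* ~ N(0, I_d), i.i.d.
T_star = np.einsum('ir,jr,kr->ijk', U_star, U_star, U_star)
mask = rng.random((d, d, d)) < p                         # Bernoulli(p) sampling
T_obs = T_star * mask                                    # zero-filled observations

def rel_frob_error(T_hat, T_true):
    """Relative Frobenius-norm error ||T_hat - T*||_F / ||T*||_F."""
    return np.linalg.norm(T_hat - T_true) / np.linalg.norm(T_true)
```

The subspace metrics $\mathsf{dist}_{\mathrm{F}}$ and $\mathsf{dist}_{2,\infty}$ additionally require aligning the estimated factors to the truth up to permutation (and sign), which we omit here for brevity.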
Take $t_0 = 100$, $d = 100$, $r = 4$ and $p = 0.1$. Define the signal-to-noise ratio (SNR) to be $\mathsf{SNR} = \|\mathcal{T}^\star\|_{\mathrm{F}}^2 / (d^3\sigma^2)$. We report in Fig. 3 three types of squared relative errors (namely, $\mathsf{dist}_{\mathrm{F}}^2(\widehat{U}, U^\star)/\|U^\star\|_{\mathrm{F}}^2$, $\mathsf{dist}_{2,\infty}^2(\widehat{U}, U^\star)/\|U^\star\|_{2,\infty}^2$ and $\|\widehat{\mathcal{T}} - \mathcal{T}^\star\|_\infty^2/\|\mathcal{T}^\star\|_\infty^2$) vs. the SNR. Here, the SNR varies from 1 to 1000. Figure 3 illustrates that all three types of squared relative errors scale inversely with the SNR, which is consistent with our theory.

4 Discussion

The current paper uncovers the possibility of efficiently and stably completing a low-CP-rank tensor from partial and noisy entries. Perhaps somewhat unexpectedly, despite the high degree of nonconvexity, this problem can be solved to optimal statistical accuracy within nearly linear time. To the best of our knowledge, this intriguing message has not been shown in the prior literature. The insights and analysis techniques developed in this paper might also have implications for other nonconvex algorithms [36, 66, 69, 54, 67, 38, 61, 71, 37, 70] and other tensor recovery problems [2, 3, 63, 33, 60, 26, 75, 42, 77, 31, 10, 30].

Acknowledgements

Y. Chen is supported in part by the AFOSR YIP award FA9550-19-1-0030, by the ONR grant N00014-19-1-2120, by the ARO grant W911NF-18-1-0303, and by the NSF grants CCF-1907661 and IIS-1900140. H. V. Poor is supported in part by the NSF grant DMS-1736417.

References

[1] E. Abbe, J. Fan, K. Wang, and Y. Zhong.
Entrywise eigenvector analysis of random matrices with low expected rank. arXiv preprint arXiv:1709.09565, 2017.

[2] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.

[3] A. Anandkumar, R. Ge, and M. Janzamin. Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates. arXiv preprint arXiv:1402.5180, 2014.

[4] B. Barak and A. Moitra. Noisy tensor completion via the sum-of-squares hierarchy. In Conference on Learning Theory, pages 417–445, 2016.

[5] S. Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

[6] C. Cai, G. Li, Y. Chi, H. V. Poor, and Y. Chen. Subspace estimation from unbalanced and incomplete data matrices: $\ell_{2,\infty}$ statistical guarantees. arXiv preprint arXiv:1910.04267, 2019.

[7] C. Cai, G. Li, H. V. Poor, and Y. Chen. Nonconvex low-rank symmetric tensor completion from noisy data. arXiv preprint arXiv:1911.04436, 2019.

[8] E. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, April 2009.

[9] E. J. Candès, X. Li, and M. Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.

[10] H. Chen, G. Raskutti, and M. Yuan. Non-convex projected gradient descent for generalized low-rank tensor regression. The Journal of Machine Learning Research, 20(1):172–208, 2019.

[11] Y. Chen and E. Candès. The projected power method: An efficient algorithm for joint alignment from pairwise differences. Comm. Pure and Appl. Math., 71(8):1648–1714, 2018.

[12] Y. Chen and E. J. Candès.
Solving random quadratic systems of equations is nearly as easy as solving linear systems. Comm. Pure Appl. Math., 70(5):822–883, 2017.

[13] Y. Chen and Y. Chi. Robust spectral compressed sensing via structured matrix completion. IEEE Transactions on Information Theory, 60(10):6576–6601, 2014.

[14] Y. Chen and Y. Chi. Harnessing structures in big data via guaranteed low-rank matrix estimation: Recent theory and fast algorithms via convex and nonconvex optimization. IEEE Signal Processing Magazine, 35(4):14–31, July 2018.

[15] Y. Chen, Y. Chi, J. Fan, and C. Ma. Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. Mathematical Programming, pages 1–33, 2018.

[16] Y. Chen, J. Fan, C. Ma, and K. Wang. Spectral method and regularized MLE are both optimal for top-K ranking. Annals of Statistics, 47(4):2204–2235, August 2019.

[17] Y. Chen, J. Fan, C. Ma, and Y. Yan. Inference and uncertainty quantification for noisy matrix completion. Accepted to the Proceedings of the National Academy of Sciences (PNAS), 2019.

[18] Y. Chen and M. J. Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025, 2015.

[19] J. Y. Cheng, T. Zhang, M. T. Alley, M. Uecker, M. Lustig, J. M. Pauly, and S. S. Vasanawala. Comprehensive multi-dimensional MRI for the simultaneous assessment of cardiopulmonary anatomy and physiology. Scientific Reports, 7(1):5330, 2017.

[20] Y. Chi, Y. M. Lu, and Y. Chen. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 67(20):5239–5269, 2019.

[21] M. A. Davenport and J. Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4):608–622, 2016.

[22] L. Ding and Y. Chen.
The leave-one-out approach for matrix completion: Primal and dual analysis. arXiv preprint arXiv:1803.07554, 2018.

[23] N. El Karoui. On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probability Theory and Related Fields, pages 1–81, 2015.

[24] G. Ely, S. Aeron, N. Hao, and M. E. Kilmer. 5D and 4D pre-stack seismic data completion using tensor nuclear norm (TNN). In SEG Technical Program Expanded Abstracts 2013, pages 3639–3644. Society of Exploration Geophysicists, 2013.

[25] S. Gandy, B. Recht, and I. Yamada. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27(2):025010, 2011.

[26] R. Ge and T. Ma. On the optimization landscape of tensor decompositions. In Advances in Neural Information Processing Systems, pages 3653–3663, 2017.

[27] D. Gilboa, S. Buchanan, and J. Wright. Efficient dictionary learning with gradient descent. arXiv preprint arXiv:1809.10313, 2018.

[28] D. Goldfarb and Z. Qin. Robust low-rank tensor recovery: Models and algorithms. SIAM Journal on Matrix Analysis and Applications, 35(1):225–253, 2014.

[29] D. Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566, March 2011.

[30] B. Hao, B. Wang, P. Wang, J. Zhang, J. Yang, and W. W. Sun. Sparse tensor additive regression. arXiv preprint arXiv:1904.00479, 2019.

[31] B. Hao, A. Zhang, and G. Cheng. Sparse and low-rank tensor estimation via cubic sketchings. arXiv preprint arXiv:1801.09326, 2018.

[32] C. J. Hillar and L.-H. Lim. Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6):45, 2013.

[33] S. B. Hopkins, T. Schramm, J. Shi, and D. Steurer. Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors.
In Proceedings of the forty-eighth annual ACM Symposium on Theory of Computing, pages 178–191. ACM, 2016.

[34] B. Huang, C. Mu, D. Goldfarb, and J. Wright. Provable models for robust low-rank tensor completion. Pacific Journal of Optimization, 11(2):339–364, 2015.

[35] P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In ACM Symposium on Theory of Computing, pages 665–674, 2013.

[36] P. Jain and S. Oh. Provable tensor factorization with missing data. In Advances in Neural Information Processing Systems, pages 1431–1439, 2014.

[37] T.-Y. Ji, T.-Z. Huang, X.-L. Zhao, T.-H. Ma, and G. Liu. Tensor completion using total variation and low-rank matrix factorization. Information Sciences, 326:243–257, 2016.

[38] H. Kasai and B. Mishra. Low-rank tensor completion: a Riemannian manifold preconditioning approach. In International Conference on Machine Learning, pages 1012–1021, 2016.

[39] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, June 2010.

[40] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.

[41] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. J. Mach. Learn. Res., 11:2057–2078, 2010.

[42] M. E. Kilmer, K. Braman, N. Hao, and R. C. Hoover. Third-order tensors as operators on matrices: A theoretical and computational framework with applications in imaging. SIAM Journal on Matrix Analysis and Applications, 34(1):148–172, 2013.

[43] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[44] N. Kreimer, A. Stanton, and M. D. Sacchi. Tensor completion based on nuclear norm minimization for 5D seismic data reconstruction.
Geophysics, 78(6):V273–V284, 2013.

[45] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.

[46] X. Li, Y. Ye, and X. Xu. Low-rank tensor completion with total variation for visual data inpainting. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[47] J. Liu, P. Musialski, P. Wonka, and J. Ye. Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):208–220, 2013.

[48] K. Lounici. High-dimensional covariance matrix estimation with missing observations. Bernoulli, 20(3):1029–1058, 2014.

[49] C. Ma, K. Wang, Y. Chi, and Y. Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution. Accepted to Foundations of Computational Mathematics, 2018.

[50] A. Montanari and N. Sun. Spectral algorithms for tensor completion. Communications on Pure and Applied Mathematics, 71(11):2381–2425, 2018.

[51] C. Mu, B. Huang, J. Wright, and D. Goldfarb. Square deal: Lower bounds and improved relaxations for tensor recovery. In International Conference on Machine Learning, pages 73–81, 2014.

[52] A. Pananjady and M. J. Wainwright. Value function estimation in Markov reward processes: Instance-dependent $\ell_\infty$-bounds for policy evaluation. arXiv preprint arXiv:1909.08749, 2019.

[53] A. Potechin and D. Steurer. Exact tensor completion with sum-of-squares. In Conference on Learning Theory, pages 1619–1673, 2017.

[54] H. Rauhut, R. Schneider, and Ž. Stojanac. Low rank tensor recovery via iterative hard thresholding. Linear Algebra and its Applications, 523:220–262, 2017.

[55] J. D. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction.
In International Conference on Machine Learning, pages 713–719. ACM, 2005.

[56] E. Richard and A. Montanari. A statistical model for tensor PCA. In Advances in Neural Information Processing Systems, pages 2897–2905, 2014.

[57] B. Romera-Paredes and M. Pontil. A new convex relaxation for tensor completion. In Advances in Neural Information Processing Systems, pages 2967–2975, 2013.

[58] O. Semerci, N. Hao, M. E. Kilmer, and E. L. Miller. Tensor-based formulation and nuclear norm regularization for multienergy computed tomography. IEEE Transactions on Image Processing, 23(4):1678–1693, 2014.

[59] D. Shah and C. L. Yu. Iterative collaborative filtering for sparse noisy tensor estimation. arXiv preprint arXiv:1908.01241, 2019.

[60] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos. Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing, 65(13):3551–3582, 2017.

[61] M. Steinlechner. Riemannian optimization for high-dimensional tensor completion. SIAM Journal on Scientific Computing, 38(5):S461–S484, 2016.

[62] R. Sun and Z.-Q. Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535–6579, 2016.

[63] G. Tang and P. Shah. Guaranteed tensor decomposition: A moment approach. In International Conference on Machine Learning, pages 1491–1500, 2015.

[64] R. Tomioka, K. Hayashi, and H. Kashima. Estimation of low-rank tensors via convex optimization. arXiv preprint arXiv:1010.0789, 2010.

[65] G. Wang, G. B. Giannakis, and Y. C. Eldar. Solving systems of random quadratic equations via truncated amplitude flow. IEEE Transactions on Information Theory, 2017.

[66] W. Wang, V. Aggarwal, and S. Aeron. Tensor completion by alternating minimization under the tensor train (TT) model.
arXiv preprint arXiv:1609.05587, 2016.

[67] D. Xia and M. Yuan. On polynomial time methods for exact low rank tensor completion. arXiv preprint arXiv:1702.06980, 2017.

[68] D. Xia, M. Yuan, and C.-H. Zhang. Statistically optimal and computationally efficient low rank tensor completion from noisy entries. arXiv preprint arXiv:1711.04934, 2017.

[69] Y. Xu, R. Hao, W. Yin, and Z. Su. Parallel matrix factorization for low-rank tensor completion. Inverse Problems & Imaging, 9(2):601–624, 2015.

[70] Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.

[71] Q. Yao. Scalable tensor completion with nonconvex regularization. arXiv preprint arXiv:1807.08725, 2018.

[72] J. Ying, H. Lu, Q. Wei, J.-F. Cai, D. Guo, J. Wu, Z. Chen, and X. Qu. Hankel matrix nuclear norm regularized tensor completion for n-dimensional exponential signals. IEEE Transactions on Signal Processing, 65(14):3702–3717, 2017.

[73] M. Yuan and C.-H. Zhang. On tensor completion via nuclear norm minimization. Foundations of Computational Mathematics, 16(4):1031–1068, 2016.

[74] M. Yuan and C.-H. Zhang. Incoherent tensor norms and their applications in higher order tensor completion. IEEE Transactions on Information Theory, 63(10):6753–6766, 2017.

[75] A. Zhang and D. Xia. Tensor SVD: Statistical and computational limits. IEEE Transactions on Information Theory, 64(11):7311–7338, 2018.

[76] H. Zhang, Y. Zhou, Y. Liang, and Y. Chi. A nonconvex approach for phase retrieval: Reshaped Wirtinger flow and incremental algorithms. The Journal of Machine Learning Research, 18(1):5164–5198, 2017.

[77] Z. Zhang and S. Aeron. Exact tensor completion using t-SVD. IEEE Trans. Signal Processing, 65(6):1511–1526, 2017.

[78] Q. Zheng and J.
Lafferty. Convergence analysis for rectangular matrix completion using Burer-Monteiro factorization and gradient descent. arXiv preprint arXiv:1605.07051, 2016.

[79] Y. Zhong and N. Boumal. Near-optimal bound for phase synchronization. SIAM Journal on Optimization, 2018.