{"title": "A Dual Framework for Low-rank Tensor Completion", "book": "Advances in Neural Information Processing Systems", "page_first": 5484, "page_last": 5495, "abstract": "One of the popular approaches for low-rank tensor completion is to use the latent trace norm regularization. However, most existing works in this direction learn a sparse combination of tensors. In this work, we fill this gap by proposing a variant of the latent trace norm that helps in learning a non-sparse combination of tensors. We develop a dual framework for solving the low-rank tensor completion problem. We first show a novel characterization of the dual solution space with an interesting factorization of the optimal solution. Overall, the optimal solution is shown to lie on a Cartesian product of Riemannian manifolds. Furthermore, we exploit the versatile Riemannian optimization framework for proposing computationally efficient trust region algorithm. The experiments illustrate the efficacy of the proposed algorithm on several real-world datasets across applications.", "full_text": "A dual framework for low-rank tensor completion\n\nMadhav Nimishakavi\u2217, Pratik Jawanpuria\u2020, Bamdev Mishra\u2020\n\n\u2217Indian Institute of Science, India\n\n\u2020 Microsoft, India\n\nmadhav@iisc.ac.in, {pratik.jawanpuria,bamdevm}@microsoft.com\n\nAbstract\n\nOne of the popular approaches for low-rank tensor completion is to use the latent\ntrace norm regularization. However, most existing works in this direction learn a\nsparse combination of tensors. In this work, we \ufb01ll this gap by proposing a variant\nof the latent trace norm that helps in learning a non-sparse combination of tensors.\nWe develop a dual framework for solving the low-rank tensor completion problem.\nWe \ufb01rst show a novel characterization of the dual solution space with an interesting\nfactorization of the optimal solution. Overall, the optimal solution is shown to lie on\na Cartesian product of Riemannian manifolds. 
Furthermore, we exploit the versatile Riemannian optimization framework for proposing a computationally efficient trust-region algorithm. The experiments illustrate the efficacy of the proposed algorithm on several real-world datasets across applications.\n\n1 Introduction\n\nTensors are multidimensional or K-way arrays, which provide a natural way to represent multi-modal data [10, 11]. The low-rank tensor completion problem, in particular, aims to recover a low-rank tensor from a partially observed tensor [2]. This problem has numerous applications in image/video inpainting [27, 26], link prediction [14], and recommendation systems [39], to name a few.\n\nIn this work, we focus on the trace norm regularized low-rank tensor completion problem of the form\n\nmin_{W ∈ R^{n1 × n2 × ... × nK}} ‖W_Ω − Y_Ω‖²_F + (1/λ) R(W),  (1)\n\nwhere Y_Ω ∈ R^{n1 × ... × nK} is a partially observed K-mode tensor whose entries are known only for a subset of indices Ω: (W_Ω)_(i1,...,iK) = W_(i1,...,iK) if (i1, ..., iK) ∈ Ω and 0 otherwise. Here, ‖·‖_F is the Frobenius norm, R(·) is a low-rank promoting regularizer, and λ > 0 is the regularization parameter.\n\nSimilar to the matrix completion problem, the trace norm regularization has been used to enforce the low-rank constraint for the tensor completion problem. The works [41, 42] discuss the overlapped and latent trace norm regularizations for tensors. In particular, [42, 45] show that the latent trace norm enjoys better tensor reconstruction bounds. The latent trace norm regularization learns the tensor as a sparse combination of different tensors. In our work, we empirically motivate the need for learning a non-sparse combination of tensors and propose a variant of the latent trace norm that learns such a non-sparse combination. 
We show a novel characterization of the solution space that allows for a compact storage of the tensor, thereby allowing us to develop scalable optimization formulations. Concretely, we make the following contributions in this paper.\n\n• We propose a novel trace norm regularizer for the low-rank tensor completion problem, which learns a tensor as a non-sparse combination of tensors. In contrast, the more popular latent trace norm regularizer [41, 42, 45] learns a highly sparse combination of tensors. A non-sparse combination helps in capturing information along all the modes.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\n• We propose a dual framework for analyzing the problem formulation. This provides interesting insights into the solution space of the tensor completion problem, e.g., how the solutions along different modes are related, allowing a compact representation of the tensor.\n\n• Exploiting the characterization of the solution space, we develop a fixed-rank formulation. Our optimization problem is on Riemannian spectrahedron manifolds and we propose a computationally efficient trust-region algorithm for our formulation.\n\nNumerical comparisons on real-world datasets for different applications such as video and hyperspectral-image completion, link prediction, and movie recommendation show that the proposed algorithm outperforms state-of-the-art latent trace norm regularized algorithms. The proofs of all the theorems and lemmas and additional experimental details are provided in the longer version of the paper [32]. Our codes are available at https://pratikjawanpuria.com/.\n\n2 Related work\n\nTrace norm regularized tensor completion formulations. The works [27, 42, 37, 34, 9] discuss the overlapped trace norm regularization for tensor learning. 
The overlapped trace norm is motivated as a convex proxy for minimizing the Tucker (multilinear) rank of a tensor. The overlapped trace norm is defined as: R(W) := Σ_{k=1}^{K} ‖W_k‖_*, where W_k is the mode-k matrix unfolding of the tensor W [25] and ‖·‖_* denotes the trace norm regularizer. W_k is an n_k × Π_{j≠k} n_j matrix obtained by concatenating the mode-k fibers (column vectors) of the form W_(i1,...,ik−1,:,ik+1,...,iK) [25].\n\nThe latent trace norm is another convex regularizer used for low-rank tensor learning [41, 43, 42, 45, 17]. In this setting, the tensor W is modeled as a sum of K (unknown) tensors W^(1), ..., W^(K) such that the unfoldings W^(k)_k are low-rank matrices. The latent trace norm is defined as:\n\nR(W) := inf_{Σ_{k=1}^{K} W^(k) = W; W^(k) ∈ R^{n1 × ... × nK}} Σ_{k=1}^{K} ‖W^(k)_k‖_*.  (2)\n\nA variant of the latent trace norm (‖W^(k)_k‖_* scaled by 1/√n_k) is analyzed in [45]. The latent trace norm and its scaled variant achieve better recovery bounds than the overlapped trace norm [42, 45]. Recently, [17] proposed a scalable latent trace norm based Frank-Wolfe algorithm for tensor completion.\n\nThe latent trace norm (2) corresponds to the sparsity-inducing ℓ1-norm penalization across ‖W^(k)_k‖_*. Hence, it learns W as a sparse combination of the W^(k). In case of high sparsity, it may result in selecting only one of the tensors W^(k) as W, i.e., W = W^(k) for some k, in which case W is essentially learned as a low-rank matrix. In several real-world applications, tensor data cannot be mapped to a low-rank matrix structure and they require a higher order structure. Therefore, we propose a regularizer which learns a non-sparse combination of the W^(k). Non-sparse norms have led to better generalization performance in other machine learning settings [12, 38, 22].\n\nWe show the benefit of learning a non-sparse mixture of tensors as against a sparse mixture on two datasets: Ribeira and Baboon (refer to Section 5 for details). Figures 1(a) and 1(b) show the relative sparsity of the optimally learned tensors in the mixture as learned by the ℓ1-regularized latent trace norm based model (2) [42, 45, 17] versus the proposed ℓ2-regularized model (discussed in Section 3). The relative sparsity for each W^(k) in the mixture is computed as ‖W^(k)‖_F / Σ_j ‖W^(j)‖_F. In both the datasets, our model learns a non-sparse combination of tensors, whereas the latent trace norm based model learns a highly skewed mixture of tensors.\n\nFigure 1: (a) & (b) Relative sparsity of each tensor in the mixture of tensors for the Ribeira and Baboon datasets. Our proposed formulation learns an ℓ2-norm based non-sparse combination of tensors; (c) & (d) show that the proposed non-sparse combination obtains better generalization performance on both the datasets.\n\nThe proposed non-sparse tensor combination also leads to better generalization performance, as can be observed in Figures 1(c) and 1(d). In the particular case of the Baboon dataset, the latent trace norm essentially learns W as a low-rank matrix (W = W^(3)) and consequently obtains poor generalization.\n\nOther tensor completion formulations. Other approaches for low-rank tensor completion include tensor decomposition methods like Tucker and CP [25, 10, 11]. They generalize the notion of the singular value decomposition of matrices to tensors. Recently, [26] exploited the Riemannian geometry of fixed multilinear rank to learn the factor matrices and the core tensor. They propose a computationally efficient non-linear conjugate gradient method for optimization over manifolds of tensors of fixed multilinear rank. 
[24] further propose an efficient preconditioner for low-rank tensor learning with the Tucker decomposition. [49] propose a Bayesian probabilistic CP model for performing tensor completion. Tensor completion algorithms based on the tensor tubal-rank have been recently proposed in [48, 28].\n\n3 Non-sparse latent trace norm and duality\n\nWe propose the following formulation for learning the low-rank tensor W:\n\nmin_{W^(k) ∈ R^{n1 × ... × nK}} ‖(Σ_k W^(k))_Ω − Y_Ω‖²_F + Σ_k (1/λ_k) ‖W^(k)_k‖²_*,  (3)\n\nwhere W = Σ_k W^(k) is the learned tensor. It should be noted that the proposed regularizer in (3) employs the ℓ2-norm over ‖W^(k)_k‖_*. In contrast, the latent trace norm regularizer (2) has the ℓ1-norm over ‖W^(k)_k‖_*.\n\nWhile the existing tensor completion approaches [24, 17, 26, 27, 42, 37] mostly discuss a primal formulation similar to (1), we propose a novel dual framework for our analysis. The use of a dual framework for learning low-rank matrices [46, 20], multi-task problems [33, 21, 19], etc., often leads to novel insights into the solution space of the primal problem.\n\nWe begin by discussing how to obtain the dual formulation of (3). Later, we explain how the insights from the dual framework motivate us to propose a novel fixed-rank formulation. 
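To make the objects in (3) concrete, here is a small numpy sketch (our own illustration, not the authors' code) of a mode-k unfolding and of evaluating the proposed ℓ2-style penalty Σ_k (1/λ_k)‖W^(k)_k‖²_*. The fiber ordering below follows numpy's default reshape, which may differ from the convention of [25]; the trace norm is unaffected by the column order.

```python
import numpy as np

def unfold(W, k):
    # Mode-k unfolding W_k: an n_k x (prod_{j != k} n_j) matrix whose
    # columns are the mode-k fibers of W. The column order is numpy's
    # default, which may differ from [25]; the trace norm ignores it.
    return np.moveaxis(W, k, 0).reshape(W.shape[k], -1)

def proposed_regularizer(parts, lambdas):
    # sum_k (1 / lambda_k) * ||W^(k)_k||_*^2, the l2-style penalty in (3);
    # parts[k] plays the role of the component tensor W^(k).
    return sum((1.0 / lam) * np.linalg.norm(unfold(Wk, k), 'nuc') ** 2
               for k, (Wk, lam) in enumerate(zip(parts, lambdas)))

rng = np.random.default_rng(0)
parts = [rng.standard_normal((4, 5, 6)) for _ in range(3)]  # W^(1), W^(2), W^(3)
W = sum(parts)                                              # W = sum_k W^(k)
print(unfold(W, 1).shape)                                   # (5, 24)
print(proposed_regularizer(parts, [1.0, 1.0, 1.0]) > 0.0)   # True
```

Replacing the squared trace norms by their unsquared sum recovers the ℓ1-style latent penalty of (2) on the same objects.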
As a first step, we exploit the following variational characterization of the trace norm studied in [3, Theorem 4.1]. Given X ∈ R^{d×T}, the following result holds:\n\n‖X‖²_* = min_{Θ ∈ P^d, range(X) ⊆ range(Θ)} ⟨Θ†, XX^⊤⟩,  (4)\n\nwhere P^d denotes the set of d × d positive semi-definite matrices with unit trace, Θ† denotes the pseudo-inverse of Θ, range(Θ) = {Θz : z ∈ R^d}, and ⟨·,·⟩ is the inner product. The expression for the optimal Θ* is Θ* = √(XX^⊤)/trace(√(XX^⊤)) [3], and hence the ranks of Θ and X are equal at optimality. Thus, (4) implicitly transfers the low-rank constraint on X (due to the trace norm) to an auxiliary variable Θ ∈ P^d. It is well known that a positive semi-definite matrix Θ with the unit trace constraint implies an ℓ1-norm constraint on the eigenvalues of Θ, leading to low-rankedness of Θ. The result (4) has also been recently employed to obtain new factorization insights for structured low-rank matrices [20].\n\nUsing the result (4) in (3) leads to K auxiliary matrices, one Θ_k ∈ P^{nk} corresponding to every W^(k)_k (the mode-k matrix unfolding of the tensor W^(k)). It should also be noted that the Θ_k ∈ P^{nk} are low-rank matrices. We now present the following theorem that states an equivalent minimax formulation of (3).\n\nTheorem 1 An equivalent minimax formulation of the problem (3) is\n\nmin_{Θ1 ∈ P^{n1}, ..., ΘK ∈ P^{nK}} max_{Z ∈ C} ⟨Z, Y_Ω⟩ − (1/4)‖Z‖²_F − Σ_k (λ_k/2) ⟨Θ_k, Z_k Z_k^⊤⟩,  (5)\n\nwhere Z is the dual tensor variable corresponding to the primal problem (3) and Z_k is the mode-k unfolding of Z. The set C := {Z ∈ R^{n1 × ... × nK} : Z_(i1,...,iK) = 0, (i1, ..., iK) ∉ Ω} constrains Z to be a sparse tensor with |Ω| non-zero entries. Let {Θ*_1, ..., Θ*_K, Z*} be the optimal solution of (5). The optimal solution of (3) is given by W* = Σ_k W^(k)*, where W^(k)* = λ_k (Z* ×_k Θ*_k) ∀k and ×_k denotes the tensor-matrix multiplication along mode k.\n\nAlgorithm 1 Proposed Riemannian trust-region algorithm for (7).\nInput: Y_Ω, rank (r1, ..., rK), regularization parameter λ, and tolerance ε.\nInitialize: u ∈ M.\nrepeat\n1: Compute the gradient ∇_u ℓ for (7) as given in Lemma 1.\n2: Compute the search direction which minimizes the trust-region sub-problem. It makes use of ∇_u ℓ and its directional derivative presented in Lemma 1 for (7).\n3: Update u with the retraction step to maintain strict feasibility on M. Specifically, for the spectrahedron manifold, U_k ← (U_k + V_k)/‖U_k + V_k‖_F, where V_k is the search direction.\nuntil ‖∇_u ℓ‖_F < ε.\nOutput: u*\n\nRemark 1: Theorem 1 shows that the optimal solutions W^(k)* for all k in (3) are completely characterized by a single sparse tensor Z* and K low-rank positive semi-definite matrices {Θ*_1, ..., Θ*_K}. It should be noted that such a novel relationship of the W^(k)* (for all k) with each other is not evident from the primal formulation (3).\n\nWe next present the following result related to the form of the optimal solution of (3).\n\nCorollary 1 (Representer theorem) The optimal solution of the primal problem (3) admits a representation of the form: W^(k)* = λ_k (Z ×_k Θ_k) ∀k, where Z ∈ C and Θ_k ∈ P^{nk}.\n\nAs discussed earlier in the section, the optimal Θ*_k ∈ P^{nk} is a low-rank positive semi-definite matrix for all k. 
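As a quick numerical sanity check of the characterization (4) (our own illustration with a randomly generated X, not from the paper), the candidate Θ* = √(XX^⊤)/trace(√(XX^⊤)) is indeed a unit-trace positive semi-definite matrix attaining ⟨Θ†, XX^⊤⟩ = ‖X‖²_*:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 9))                 # X in R^{d x T}, d = 6, T = 9

nuc = np.linalg.norm(X, 'nuc')                  # trace (nuclear) norm ||X||_*

# Theta* = sqrt(X X^T) / trace(sqrt(X X^T)): PSD with unit trace, i.e. in P^d.
XXt = X @ X.T
w, V = np.linalg.eigh(XXt)                      # eigenvalues are sigma_i^2 >= 0
sqrt_XXt = (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T
Theta = sqrt_XXt / np.trace(sqrt_XXt)

val = np.trace(np.linalg.pinv(Theta) @ XXt)     # <Theta^dagger, X X^T>
print(abs(np.trace(Theta) - 1.0) < 1e-10)       # True: unit trace
print(abs(val - nuc ** 2) < 1e-6 * nuc ** 2)    # True: matches ||X||_*^2
```

At this optimum, rank(Θ*) equals rank(X), matching the remark after (4) that the low-rank constraint on X is transferred to Θ.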
In spite of the low-rankness of the optimal solution, an algorithm for (5) need not produce intermediate iterates that are low rank. From the perspective of large-scale applications, this observation as well as other computational efficiency concerns discussed below motivate us to exploit a fixed-rank parameterization of Θ_k for all k.\n\nFixed-rank parameterization. We propose to explicitly constrain the rank of Θ_k to r_k as follows:\n\nΘ_k = U_k U_k^⊤, where U_k ∈ S^{nk}_{rk},  (6)\n\nand S^n_r := {U ∈ R^{n×r} : ‖U‖_F = 1}. In large-scale tensor completion problems, it is common to set r_k ≪ n_k, in which case the fixed-rank parameterization (6) of Θ_k has a two-fold advantage. First, the search space dimension drastically reduces from n_k(n_k + 1)/2 − 1, which is quadratic in the tensor dimensions, to n_k r_k − 1 − r_k(r_k − 1)/2, which is linear in the tensor dimensions [23]. Second, enforcing the constraint U_k ∈ S^{nk}_{rk} costs O(n_k r_k), which is linear in the tensor dimensions and is computationally much cheaper than enforcing Θ_k ∈ P^{nk}, which costs O(n_k³).\n\nEmploying the proposed fixed-rank parameterization (6), we propose a scalable tensor completion dual formulation.\n\nFixed-rank dual formulation. The formulation is obtained by employing the parameterization (6) directly in (5). We subsequently solve the resulting problem as a minimization problem as follows:\n\nmin_{u ∈ S^{n1}_{r1} × ... × S^{nK}_{rK}} g(u),  (7)\n\nwhere u = (U_1, ..., U_K) and g : S^{n1}_{r1} × ... × S^{nK}_{rK} → R is the function\n\ng(u) := max_{Z ∈ C} ⟨Z, Y_Ω⟩ − (1/4)‖Z‖²_F − Σ_k (λ_k/2) ‖U_k^⊤ Z_k‖²_F.  (8)\n\nIt should be noted that though (7) is a non-convex problem in u, the optimization problem in (8) is strongly convex in Z for a given u and has a unique solution.\n\n4 Optimization algorithm\n\nThe optimization problem (7) is of the form\n\nmin_{x ∈ M} ℓ(x),  (9)\n\nwhere ℓ : M → R is a smooth loss and M := S^{n1}_{r1} × ... × S^{nK}_{rK} × C is the constraint set.\n\nIn order to propose numerically efficient algorithms for optimization over M, we exploit the particular structure of the set S^n_r, which is known as the spectrahedron manifold [23]. The spectrahedron manifold has the structure of a compact Riemannian quotient manifold [23]. Consequently, optimization on the spectrahedron manifold is handled in the Riemannian optimization framework. This allows us to exploit the rotational invariance of the constraint ‖U‖_F = 1 naturally. The Riemannian manifold optimization framework embeds the constraint ‖U‖_F = 1 into the search space, thereby translating the constrained optimization problem into an unconstrained optimization problem over the spectrahedron manifold. The Riemannian framework generalizes a number of classical first- and second-order (e.g., the conjugate gradient and trust-region algorithms) Euclidean algorithms to manifolds and provides concrete convergence guarantees [13, 1, 36, 47, 35]. The work [1], in particular, shows a systematic way of implementing trust-region (TR) algorithms on quotient manifolds. A full list of optimization-related ingredients and their matrix characterizations for the spectrahedron manifold S^n_r is in the supplementary material. 
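To illustrate the geometry behind (6) (a minimal sketch of our own; the actual implementation relies on the Manopt toolbox), a point U ∈ S^n_r yields a unit-trace positive semi-definite matrix Θ = UU^⊤ of rank at most r, and the simple retraction quoted in Algorithm 1 renormalizes an updated iterate back onto the manifold:

```python
import numpy as np

def retract(U, V):
    # Retraction used in Algorithm 1 for the spectrahedron manifold:
    # U <- (U + V) / ||U + V||_F, restoring the unit Frobenius norm.
    W = U + V
    return W / np.linalg.norm(W)

rng = np.random.default_rng(2)
n, r = 8, 2
U = rng.standard_normal((n, r))
U /= np.linalg.norm(U)           # a point of S^n_r = {U : ||U||_F = 1}

Theta = U @ U.T                  # fixed-rank parameterization (6)
print(abs(np.trace(Theta) - 1.0) < 1e-12)    # True: trace(UU^T) = ||U||_F^2 = 1
print(np.linalg.matrix_rank(Theta) <= r)     # True: rank at most r

U_new = retract(U, 0.1 * rng.standard_normal((n, r)))   # V: a search direction
print(abs(np.linalg.norm(U_new) - 1.0) < 1e-12)         # True: back on S^n_r
```

This makes the two-fold advantage above tangible: the iterate stores n·r numbers instead of an n × n PSD matrix, and feasibility is restored by a single O(nr) normalization.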
Overall, the constraint set M is endowed with a Riemannian structure.\n\nWe implement the Riemannian TR (second-order) algorithm for (9). To this end, we require the notions of the Riemannian gradient (the first-order derivative of the objective function on the manifold), the Riemannian Hessian along a search direction (the covariant derivative of the Riemannian gradient along a tangential direction on the manifold), and the retraction operator which ensures that we always stay on the manifold (i.e., maintain strict feasibility). The Riemannian gradient and Hessian notions require computations of the standard (Euclidean) gradient ∇_x ℓ(x) and the directional derivative of this gradient along a given search direction v, denoted by D∇_x ℓ(x)[v]. The expressions of both for (7) are given in Lemma 1.\n\nLemma 1 Let Ẑ be the optimal solution of the convex problem (8) at u ∈ S^{n1}_{r1} × ... × S^{nK}_{rK}. Let ∇_u g denote the gradient of g(u) at u, D∇_u g[v] denote the directional derivative of the gradient ∇_u g along v ∈ R^{n1×r1} × ... × R^{nK×rK}, and Ż_k be the directional derivative of Z_k along v at Ẑ_k. Then, ∇_u g = (−λ_1 Ẑ_1 Ẑ_1^⊤ U_1, ..., −λ_K Ẑ_K Ẑ_K^⊤ U_K) and D∇_u g[v] = (−λ_1 A_1, ..., −λ_K A_K), where A_k = Ẑ_k Ẑ_k^⊤ V_k + 2 symm(Ż_k Ẑ_k^⊤) U_k and symm(Δ) = (Δ + Δ^⊤)/2.\n\nA key requirement in Lemma 1 is to efficiently solve (8) for a given u = (U_1, ..., U_K). It should be noted that (8) has a closed-form sparse solution, which is equivalent to solving the linear system\n\nẐ_Ω + Σ_k λ_k (Ẑ_Ω ×_k U_k U_k^⊤)_Ω = Y_Ω.  (10)\n\nSolving the linear system (10) in a single step is computationally expensive (it involves the use of Kronecker products, vectorization of a sparse tensor, and a matrix inversion). Instead, we use an iterative solver that exploits the sparsity in the variable Z and the factorization form U_k U_k^⊤ efficiently. Similarly, given Ẑ and v, Ż can be computed by solving\n\nŻ_Ω + Σ_k λ_k (Ż_Ω ×_k U_k U_k^⊤)_Ω = −Σ_k λ_k (Ẑ_Ω ×_k (V_k U_k^⊤ + U_k V_k^⊤))_Ω.  (11)\n\nThe Riemannian TR algorithm solves a Riemannian trust-region sub-problem in every iteration [1, Chapter 7]. The TR sub-problem is a second-order approximation of the objective function in a neighborhood, the solution to which does not require inverting the full Hessian of the objective function. It makes use of the gradient ∇_x ℓ and its directional derivative along a search direction. The TR sub-problem is approximately solved with an iterative solver, e.g., the truncated conjugate gradient algorithm. The TR sub-problem outputs a potential update candidate for x, which is then accepted or rejected based on the amount of decrease in the function ℓ. Algorithm 1 summarizes the key steps of the TR algorithm for solving (9).\n\nComputational complexity: the per-iteration computational complexity of Algorithm 1 scales linearly with the number of known entries of Y_Ω, denoted by |Ω|. In particular, the per-iteration computational cost depends on the following ingredients.\n\n• U_k^⊤ Z_k: it involves multiplication of the n_k × r_k matrix U_k with the mode-k unfolding of a sparse Z with |Ω| non-zero entries. This costs O(|Ω| r_k). 
It should be noted that although the dimension of Z_k is n_k × Π_{i=1, i≠k}^{K} n_i, only a maximum of |Ω| columns have non-zero entries. We exploit this property of Z_k and have a compact memory storage of U_k^⊤ Z_k.\n\n• Computing the solution Ẑ of the linear system (10): an iterative solver for (10) requires computing the left-hand side of (10) for a given candidate Z. This costs O(|Ω| Σ_k r_k).\n\n• Computation of g(u): it relies on the solution of (10) and then explicitly computing the objective function in (8). This costs O(|Ω| Σ_k r_k + K|Ω|).\n\n• ∇_u g(u): it requires the computation of terms like Ẑ_k (Ẑ_k^⊤ U_k), which costs O(|Ω| Σ_k r_k).\n\n• Computing the solution Ż_k of the linear system (11): similar to (10), (11) is solved with an iterative solver. The computational cost of solving (11) requires computing terms like U_k^⊤ Z_k and U_k^⊤ Ż_k, which costs O(|Ω| r_k). It should be noted that both Ż and Ẑ share the same sparsity pattern.\n\n• D∇_u g(u)[v]: it costs O(|Ω| Σ_k r_k).\n\n• Retraction on S^{nk}_{rk}: it projects a matrix of size n_k × r_k onto the set S^{nk}_{rk}, which costs O(n_k r_k).\n\n• S^{nk}_{rk} manifold-related ingredients cost O(n_k r_k² + r_k³).\n\nOverall, the per-iteration computational complexity of our algorithm is O(m(|Ω| Σ_k r_k + Σ_k n_k r_k² + Σ_k r_k³)), where m is the number of iterations needed to solve (10) and (11) approximately. The memory cost for our algorithm is O(|Ω| + Σ_k n_k r_k). We observe that both the computational and memory costs scale linearly with the number of observed entries |Ω|, which makes our algorithm scalable to large datasets.\n\nConvergence. The Riemannian TR algorithms come with rigorous convergence guarantees. [1] discuss the rate of convergence analysis of manifold algorithms, which directly applies in our case. For trust regions, the global convergence to a first-order critical point is discussed in [1, Section 7.4.1] and the local convergence to local minima is discussed in [1, Section 7.4.2]. From an implementation perspective, we follow the existing approaches [26, 24, 17] and bound the number of TR iterations.\n\nNumerical implementation. Our algorithm is implemented using the Manopt toolbox [7] in Matlab, which has an off-the-shelf generic TR implementation.\n\nTable 1: Summary of the baseline low-rank tensor completion algorithms.\n\nTrace norm regularized algorithms:\nFFW: Scaled latent trace norm + Frank-Wolfe optimization + basis size reduction\nHard: Scaled overlapped trace norm + proximal gradient\nHaLRTC: Scaled overlapped trace norm + ADMM\nLatent: Latent trace norm + ADMM\n\nOther algorithms:\nTopt: Fixed multilinear rank + conjugate gradients (CG)\nBayesCP: Bayesian CP algorithm with rank tuning\ngeomCG: Riemannian CG + fixed multilinear rank\nRprecon: Riemannian CG with preconditioning + fixed multilinear rank\nT-svd: Tensor tubal-rank + ADMM\n\n5 Experiments\n\nWe evaluate the generalization performance and efficiency of our proposed TR algorithm against state-of-the-art algorithms in several tensor completion applications.\n\nTrace norm regularized algorithms. Scaled latent trace norm regularized algorithms such as FFW [17] and Latent [42], and overlapped trace norm based algorithms such as HaLRTC [27] and Hard [37] are the closest to our approach. 
FFW is a recently proposed state-of-the-art large-scale tensor completion algorithm. Table 1 summarizes the trace norm regularized baseline algorithms.\n\nWe denote our algorithm as TR-MM (Trust-Region algorithm for the MiniMax tensor completion formulation). We set λ_k = λ n_k ∀k in (7). Hence, we tune only one hyper-parameter λ, from the set {10^-3, 10^-2, ..., 10^3}, via five-fold cross-validation on the training data.\n\nVideo and image completion\n\nWe work with the following datasets for predicting missing values in multi-media data: a) Ribeira is a hyperspectral image [16] of size 1017 × 1340 × 33, where each slice represents the image measured at a particular wavelength. We re-size it to 203 × 268 × 33 [37, 26, 24]; b) Tomato is a video sequence dataset [27, 8] of size 242 × 320 × 167; and c) Baboon is an RGB image [49], modeled as a 256 × 256 × 3 tensor. Following [24], we train on a random sample of 10% of the entries and test on another 10% of the entries for all the three datasets. Each experiment is repeated ten times.\n\nTable 2: Generalization performance across several applications: hyperspectral-image/video/image completion, movie recommendation, and link prediction. Our algorithm, TR-MM, performs significantly better than the other trace norm based algorithms and obtains the best overall performance. The symbol '−' denotes that the dataset is too large for the algorithm to generate a result.\n\nDataset | TR-MM | FFW | Rprecon | geomCG | Hard | Topt | HaLRTC | Latent | T-svd | BayesCP\nRibeira (RMSE) | 0.067 | 0.088 | 0.083 | 0.156 | 0.114 | 0.127 | 0.095 | 0.087 | 0.064 | 0.154\nTomato (RMSE) | 0.041 | 0.045 | 0.052 | 0.052 | 0.060 | 0.102 | 0.202 | 0.042 | 0.046 | 0.103\nBaboon (RMSE) | 0.121 | 0.133 | 0.128 | 0.128 | 0.126 | 0.130 | 0.247 | 0.459 | 0.146 | 0.159\nML10M (RMSE) | 0.840 | 0.895 | 0.831 | 0.844 | − | − | − | − | − | −\nYouTube (subset) (AUC) | 0.957 | 0.954 | 0.941 | 0.941 | 0.954 | 0.941 | 0.783 | 0.945 | 0.941 | 0.950\nYouTube (full) (AUC) | 0.932 | 0.929 | 0.926 | 0.926 | − | − | − | − | − | −\nFB15k-237 (AUC) | 0.823 | 0.764 | 0.821 | 0.785 | − | − | − | − | − | −\n\nFigure 2: (a) Evolution of the test RMSE on Ribeira; (b) & (c) Evolution of the test AUC on FB15k-237 and YouTube (full), respectively. Our algorithm, TR-MM, obtains the best generalization performance in all the three datasets. In addition, TR-MM converges to a good solution in fairly quick time; (d) Variation of the test AUC as the amount of training data changes on FB15k-237. TR-MM performs significantly better than the baselines when the amount of training data is small.\n\nResults. Table 2 reports the root mean squared error (RMSE) on the test set, averaged over ten splits. Our algorithm, TR-MM, obtains the best results, outperforming the other trace norm based algorithms on all the three datasets. Figure 2(a) shows the trade-off between the test RMSE and the training time of all the algorithms on Ribeira. 
It can be observed that TR-MM converges to the lowest RMSE at a significantly faster rate compared to the other baselines. It is evident from the results that learning a non-sparse mixture of tensors, as done by the proposed algorithm, helps in achieving better generalization performance compared to the algorithms that learn a sparse mixture of tensors.\n\nLink prediction\n\nThe aim in the link prediction setting is to predict missing or new links in knowledge graphs, social networks, etc. We consider the FB15k-237 and YouTube datasets, discussed below.\n\nFB15k-237: this is a subset of the FB15k dataset [6, 44], containing facts of the form subject-predicate-object (RDF) triples from the Freebase knowledge graph. FB15k-237 contains 14 541 entities and 237 relationships. The task is to predict the relationships (from a given set of relations) between a pair of entities in the knowledge graph. It has 310 116 observed relationships (links) between pairs of entities, which are the positive samples. In addition, 516 606 negative samples are generated following the procedure described in [6]. We model this task as a 14 541 × 14 541 × 237 tensor completion problem. Y_(a,b,c) = 1 implies that relationship b exists between entity a and entity c, and Y_(a,b,c) = 0 implies otherwise. We keep 80% of the observed entries for training and the remaining 20% for testing.\n\nYouTube: this is a link prediction dataset [40] having 5 types of interactions between 15 088 users. The task is to predict the interaction (from a given set of interactions) between a pair of users. We model it as a 15 088 × 15 088 × 5 tensor completion problem. 
All the entries are known in this case. We randomly sample 0.8% of the data for training [17] and another 0.8% for testing.\n\nTable 3: Rank sets at which the proposed TR-MM algorithm and the Tucker decomposition based tensor completion algorithms (Rprecon, geomCG, Topt) achieve the best results across datasets. It should be noted that the notion of rank in trace norm regularized approaches (such as TR-MM) differs from the Tucker rank.\n\nDataset | TR-MM rank | Tucker rank\nRibeira | (5, 5, 5) | (15, 15, 6)\nTomato | (10, 10, 10) | (15, 15, 15)\nBaboon | (4, 4, 3) | (4, 4, 3)\nML10M | (20, 10, 1) | (4, 4, 4)\nYouTube (subset) | (3, 3, 1) | (5, 5, 5)\nYouTube (full) | (3, 3, 1) | (5, 5, 5)\nFB15k-237 | (20, 20, 1) | (5, 5, 5)\n\nIt should be noted that Hard, HaLRTC, and Latent do not scale to the full FB15k-237 and YouTube datasets as they need to store the full tensor in memory. Hence, we follow [17] to create a subset of the YouTube dataset of size 1509 × 1509 × 5, in which the 1509 users with the largest number of links are chosen. We randomly sample 5% of the data for training and another 5% for testing.\n\nEach experiment is repeated on ten random train-test splits. Following [29, 17], the generalization performance for the link prediction task is measured by computing the area under the ROC curve on the test set (test AUC) for each algorithm.\n\nResults. Table 2 reports the average test AUC on the YouTube (subset), YouTube (full), and FB15k-237 datasets. The TR-MM algorithm achieves the best performance in all the link prediction tasks. This shows that the non-sparse mixture of tensors learned by TR-MM helps in achieving better performance. 
Figures 2(b) and 2(c) plot the trade-off between the test AUC and the training time for FB15k-237 and YouTube, respectively. We observe that TR-MM is the fastest to converge to a good AUC, taking only a few iterations.
We also conduct experiments to evaluate the performance of the different algorithms in the challenging scenario where little training data is available. On the FB15k-237 dataset, we vary the size of the training data from 20% to 80% of the observed entries, keeping the remaining 20% of the observed entries as the test set. Figure 2(d) plots the results of this experiment. We observe that TR-MM does significantly better than the baselines in data-scarce regimes.
Movie recommendation
We evaluate the algorithms on the MovieLens10M (ML10M) dataset [18]. This is a movie recommendation task: predict the ratings given to movies by various users. MovieLens10M contains 10 000 054 ratings of 10 681 movies given by 71 567 users. Following [24], we split the time into 7-day wide bins, forming a tensor of size 71 567 × 10 681 × 731. For our experiments, we generate ten random train-test splits, where 80% of the observed entries is kept for training and the remaining 20% for testing.
Results. Table 2 reports the average test RMSE on this task. It can be observed that our algorithm, TR-MM, outperforms the state-of-the-art scaled latent trace norm based algorithm FFW.
Results compared to other baseline algorithms
In addition to the trace norm based algorithms, we also compare against algorithms that model the tensor via Tucker decomposition with fixed multilinear ranks: Rprecon [24], geomCG [26], and Topt [15]. Rprecon and geomCG are large-scale state-of-the-art algorithms in this multilinear framework. We also compare against the tensor tubal-rank based algorithm T-svd [48] and the CP decomposition based algorithm BayesCP [49].
Table 1 summarizes these baselines.
As can be observed from Table 2, TR-MM obtains better overall generalization performance than the above discussed baselines. In the movie recommendation problem, Rprecon achieves better results than TR-MM. It should be noted that Topt, T-svd, and BayesCP do not scale to large datasets.
Rank of solutions of the TR-MM algorithm
Table 3 shows the rank sets at which the proposed TR-MM and the Tucker decomposition based tensor completion algorithms (Rprecon, geomCG, Topt) achieve their best results across datasets. The latent trace norm based algorithms (TR-MM, FFW, Latent) model tensor completion by approximating the input tensor as a combination of tensors, each of which is constrained to be low-rank along a given mode. In contrast, Tucker decomposition based algorithms model tensor completion as a factorization problem with a given Tucker rank (also known as the multilinear rank).

Table 4: Results of the outlier robustness experiments. Our algorithm, TR-MM, is more robust to outliers than the competing baselines. The symbol '−' denotes that the dataset is too large for the algorithm to generate a result.

                   x     TR-MM  FFW    Rprecon  geomCG  Hard   Topt   HaLRTC  Latent  T-svd  BayesCP
Ribeira (RMSE)     0.05  0.081  0.157  0.095    0.258   0.142  0.169  0.121   0.103   0.146  0.201
                   0.10  0.111  0.172  0.112    0.373   0.158  0.188  0.135   0.120   0.182  0.204
FB15k-237 (AUC)    0.05  0.803  0.734  0.794    0.764   −      −      −       −       −      −
                   0.10  0.772  0.711  0.765    0.739   −      −      −       −       −      −
Due to this fundamental difference in modeling, the notion of rank in the TR-MM algorithm differs from the multilinear rank of Tucker decomposition based algorithms.
Results on outlier robustness
We also evaluate TR-MM and the baselines considered in Table 1 for outlier robustness on the hyperspectral image completion and link prediction problems. In the Ribeira dataset, we add standard Gaussian noise (N(0, 1)) to a randomly selected fraction x of the entries in the training set. The minimum and maximum entry values in the (original) Ribeira dataset are 0.01 and 2.09, respectively. In the FB15k-237 dataset, we flip a randomly selected fraction x of the entries in the training set, i.e., a link is removed if present and vice versa. We experiment with x = 0.05 and x = 0.10.
The results are reported in Table 4. We observe that our algorithm, TR-MM, obtains the best generalization performance and, hence, is the most robust to outliers. We also observe that trace norm regularized algorithms are relatively more robust to outliers than Tucker decomposition, CP decomposition, and tensor tubal-rank based algorithms.

6 Discussion and conclusion

In this paper, we introduce a novel regularizer for the low-rank tensor completion problem which learns the tensor as a non-sparse combination of K tensors, where K is the number of modes. Existing works [41, 42, 45, 17] learn a sparse combination of tensors, essentially learning the tensor as a low-rank matrix and losing higher order information in the available data. Hence, we recommend learning a non-sparse combination of tensors in the trace norm regularized setting, especially since K is typically a small integer in most real-world applications. In our experiments, we observe better generalization performance with the proposed regularization.
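To make the distinction precise, the latent trace norm of [41, 42] and the non-sparse variant studied here can be contrasted schematically as follows; this is our reading of the regularizer appearing in the estimator of Lemma 2 below, where W_k denotes the k-th component tensor and W_k^{(k)} its mode-k unfolding.

```latex
% Latent trace norm [41, 42]: an l1-type sum of trace norms over the
% K components, which drives most components to zero (sparse mixture).
\|\mathcal{W}\|_{\text{latent}}
  = \inf_{\mathcal{W}_1 + \cdots + \mathcal{W}_K = \mathcal{W}}
    \ \sum_{k=1}^{K} \big\| \mathcal{W}_k^{(k)} \big\|_{*}

% Non-sparse variant (cf. Lemma 2): squaring each trace norm acts as an
% l2-type combination over modes, keeping all K components active.
R(\mathcal{W})
  = \inf_{\mathcal{W}_1 + \cdots + \mathcal{W}_K = \mathcal{W}}
    \ \sum_{k=1}^{K} \big\| \mathcal{W}_k^{(k)} \big\|_{*}^{2}
```

This mirrors the l1-versus-l2 distinction familiar from sparse versus uniform multiple kernel learning [12, 38].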
Theoretically, we provide the following result on the reconstruction error in the context of recovering an unknown tensor W* from a noisy observation (a similar result on the latent trace norm is presented in [42]).

Lemma 2 Let W* be the true tensor to be recovered from the observed Y, which is obtained as Y = W* + E, where E ∈ R^{n_1 × ... × n_K} is the noise tensor. Assume that the regularization constant λ satisfies λ ≤ 1/(∑_{k=1}^{K} ‖E_k‖²_∞)^{1/2}. Then the estimator

    Ŵ = argmin_W ( (1/2)‖Y − W‖²_F + (1/λ) ∑_k ‖W_k^{(k)}‖²_* )

satisfies the inequality ‖Ŵ − W*‖_F ≤ (2/λ)√(min_k n_k). When the noise approaches zero, i.e., E → 0, the right-hand side also approaches zero.

We present a dual framework to analyze the proposed tensor completion formulation. This leads to a novel fixed-rank formulation, for which we exploit the Riemannian framework to develop a scalable trust-region algorithm. In our experiments, our algorithm TR-MM obtains better generalization performance and is more robust to outliers than state-of-the-art low-rank tensor completion algorithms. In the future, optimization algorithms for the proposed formulation can be developed for online or distributed frameworks. Recent works [4, 5, 30, 31] have explored optimization over Riemannian manifolds in such learning settings.

Acknowledgement

Most of this work was done when MN (as an intern), PJ, and BM were at Amazon.

References

[1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.

[2] E. Acar, D. M. Dunlavy, T. G. Kolda, and M. Mørup. Scalable tensor factorizations for incomplete data. Chemometrics and Intelligent Laboratory Systems, 106(1):41–56, 2011.

[3] A. Argyriou, T.
Evgeniou, and M. Pontil. Multi-task feature learning. In NIPS, 2006.

[4] L. Balzano, R. Nowak, and B. Recht. Online identification and tracking of subspaces from highly incomplete information. In the 48th Annual Allerton Conference on Communication, Control, and Computing, 2010.

[5] S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.

[6] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, 2013.

[7] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 15(Apr):1455–1459, 2014.

[8] C. F. Caiafa and A. Cichocki. Stable, robust, and super fast reconstruction of tensors using multi-way projections. IEEE Transactions on Signal Processing, 63(3):780–793, 2014.

[9] H. Cheng, Y. Yu, X. Zhang, E. Xing, and D. Schuurmans. Scalable and sound low-rank tensor learning. In AISTATS, 2016.

[10] A. Cichocki, H.-A. Phan, Q. Zhao, N. Lee, I. Oseledets, and D. P. Mandic. Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions. Foundations and Trends in Machine Learning, 9(4–5):249–429, 2017.

[11] A. Cichocki, H.-A. Phan, Q. Zhao, N. Lee, I. Oseledets, and D. P. Mandic. Tensor networks for dimensionality reduction and large-scale optimization: Part 2 applications and future perspectives. Foundations and Trends in Machine Learning, 9(6):431–673, 2017.

[12] C. Cortes, M. Mohri, and A. Rostamizadeh. L2 regularization for learning kernels. In UAI, 2009.

[13] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.

[14] B. Ermiş, E. Acar, and A. T. Cemgil.
Link prediction in heterogeneous data via generalized coupled tensor factorization. In KDD, 2015.

[15] M. Filipović and A. Jukić. Tucker factorization with missing data with application to low-n-rank tensor completion. Multidimensional Systems and Signal Processing, 2015.

[16] D. H. Foster, S. M. C. Nascimento, and K. Amano. Information limits on neural identification of colored surfaces in natural scenes. Visual Neuroscience, 21(3):331–336, 2004. URL: https://personalpages.manchester.ac.uk/staff/d.h.foster/.

[17] X. Guo, Q. Yao, and J. T. Kwok. Efficient sparse low-rank tensor completion using the Frank-Wolfe algorithm. In AAAI, 2017. URL: https://github.com/quanmingyao/FFWTensor.

[18] F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 5(4):19:1–19:19, 2015. URL: http://files.grouplens.org/datasets/movielens/ml-10m-README.html.

[19] P. Jawanpuria, M. Lapin, M. Hein, and B. Schiele. Efficient output kernel learning for multiple tasks. In NIPS, 2015.

[20] P. Jawanpuria and B. Mishra. A unified framework for structured low-rank matrix learning. In ICML, 2018.

[21] P. Jawanpuria and J. S. Nath. Multi-task multiple kernel learning. In SDM, 2011.

[22] P. Jawanpuria, M. Varma, and J. S. Nath. On p-norm path following in multiple kernel learning for non-linear feature selection. In ICML, 2014.

[23] M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010.

[24] H. Kasai and B. Mishra. Low-rank tensor completion: a Riemannian manifold preconditioning approach. In ICML, 2016. URL: https://bamdevmishra.in/codes/tensorcompletion/.

[25] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[26] D.
Kressner, M. Steinlechner, and B. Vandereycken. Low-rank tensor completion by Riemannian optimization. BIT Numerical Mathematics, 54(2):447–468, 2014. URL: anchp.epfl.ch/geomCG.

[27] J. Liu, P. Musialski, P. Wonka, and J. Ye. Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):208–220, 2013. URL: http://www.cs.rochester.edu/u/jliu/code/TensorCompletion.zip.

[28] X.-Y. Liu, S. Aeron, V. Aggarwal, and X. Wang. Low-tubal-rank tensor completion using alternating minimization. In SPIE Conference on Defense and Security, 2016.

[29] L. Lü and T. Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150–1170, 2011.

[30] M. Meghawanshi, P. Jawanpuria, A. Kunchukuttan, H. Kasai, and B. Mishra. McTorch, a manifold optimization library for deep learning. In the NeurIPS workshop on Machine Learning Open Source Software, 2018.

[31] B. Mishra, H. Kasai, P. Jawanpuria, and A. Saroop. A Riemannian gossip approach to subspace learning on Grassmann manifold. Machine Learning, 2019 (to appear).

[32] M. Nimishakavi, P. Jawanpuria, and B. Mishra. A dual framework for low-rank tensor completion. Technical report, arXiv preprint arXiv:1712.01193, 2017.

[33] T. K. Pong, P. Tseng, S. Ji, and J. Ye. Trace norm regularization: Reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization, 20(6):3465–3489, 2010.

[34] B. Romera-Paredes, H. Aung, N. Bianchi-Berthouze, and M. Pontil. Multilinear multitask learning. In ICML, 2013.

[35] H. Sato and T. Iwai. A new, globally convergent Riemannian conjugate gradient method. Optimization: A Journal of Mathematical Programming and Operations Research, 64(4):1011–1031, 2013.

[36] H. Sato, H. Kasai, and B. Mishra. Riemannian stochastic variance reduced gradient.
Technical report, arXiv preprint arXiv:1702.05594, 2017.

[37] M. Signoretto, Q. T. Dinh, L. D. Lathauwer, and J. A. K. Suykens. Learning with tensors: a framework based on convex optimization and spectral regularization. Machine Learning, 94(3):303–351, 2014.

[38] T. Suzuki. Unifying framework for fast learning rate of non-sparse multiple kernel learning. In NIPS, 2011.

[39] P. Symeonidis, A. Nanopoulos, and Y. Manolopoulos. Tag recommendations based on tensor dimensionality reduction. In RecSys, 2008.

[40] L. Tang, X. Wang, and H. Liu. Uncovering groups via heterogeneous interaction analysis. In ICDM, 2009. URL: http://leitang.net/heterogeneous_network.html.

[41] R. Tomioka, K. Hayashi, and H. Kashima. Estimation of low-rank tensors via convex optimization. Technical report, arXiv preprint arXiv:1010.0789, 2010.

[42] R. Tomioka and T. Suzuki. Convex tensor decomposition via structured Schatten norm regularization. In NIPS, 2013. URL: http://tomioka.dk/softwares/.

[43] R. Tomioka, T. Suzuki, K. Hayashi, and H. Kashima. Statistical performance of convex tensor decomposition. In NIPS, 2011.

[44] K. Toutanova, D. Chen, P. Pantel, H. Poon, P. Choudhury, and M. Gamon. Representing text for joint embedding of text and knowledge bases. In EMNLP, 2015. URL: http://kristinatoutanova.com/.

[45] K. Wimalawarne, M. Sugiyama, and R. Tomioka. Multitask learning meets tensor factorization: task imputation via convex optimization. In NIPS, 2014.

[46] Y. Xin and T. Jaakkola. Primal-dual methods for sparse constrained matrix completion. In AISTATS, 2012.

[47] H. Zhang, S. J. Reddi, and S. Sra. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In NIPS, 2016.

[48] Z. Zhang, G. Ely, S. Aeron, N. Hao, and M. E. Kilmer. Novel methods for multilinear data completion and de-noising based on tensor-svd. In CVPR, 2014. URL: http://www.ece.tufts.edu/ shuchin/software.html.

[49] Q. Zhao, L. Zhang, and A. Cichocki. Bayesian CP factorization of incomplete tensors with automatic rank determination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1751–1763, 2015. URL: https://github.com/qbzhao/BCPF.