{"title": "Exact and Stable Recovery of Pairwise Interaction Tensors", "book": "Advances in Neural Information Processing Systems", "page_first": 1691, "page_last": 1699, "abstract": "Tensor completion from incomplete observations is a problem of significant practical interest. However, it is unlikely that there exists an efficient algorithm with provable guarantee to recover a general tensor from a limited number of observations. In this paper, we study the recovery algorithm for pairwise interaction tensors, which has recently gained considerable attention for modeling multiple attribute data due to its simplicity and effectiveness. Specifically, in the absence of noise, we show that one can exactly recover a pairwise interaction tensor by solving a constrained convex program which minimizes the weighted sum of nuclear norms of matrices from $O(nr\\log^2(n))$ observations. For the noisy cases, we also prove error bounds for a constrained convex program for recovering the tensors. Our experiments on the synthetic dataset demonstrate that the recovery performance of our algorithm agrees well with the theory. In addition, we apply our algorithm on a temporal collaborative filtering task and obtain state-of-the-art results.", "full_text": "Exact and Stable Recovery of Pairwise Interaction\n\nTensors\n\nShouyuan Chen Michael R. Lyu\n{sychen,lyu,king}@cse.cuhk.edu.hk\n\nThe Chinese University of Hong Kong\n\nIrwin King\n\nZenglin Xu\n\nPurdue University\n\nxu218@purdue.edu\n\nAbstract\n\nTensor completion from incomplete observations is a problem of signi\ufb01cant prac-\ntical interest. However, it is unlikely that there exists an ef\ufb01cient algorithm with\nprovable guarantee to recover a general tensor from a limited number of obser-\nvations. In this paper, we study the recovery algorithm for pairwise interaction\ntensors, which has recently gained considerable attention for modeling multiple\nattribute data due to its simplicity and effectiveness. 
Speci\ufb01cally, in the absence\nof noise, we show that one can exactly recover a pairwise interaction tensor by\nsolving a constrained convex program which minimizes the weighted sum of nu-\nclear norms of matrices from O(nr log2(n)) observations. For the noisy cases,\nwe also prove error bounds for a constrained convex program for recovering the\ntensors. Our experiments on the synthetic dataset demonstrate that the recovery\nperformance of our algorithm agrees well with the theory. In addition, we apply\nour algorithm on a temporal collaborative \ufb01ltering task and obtain state-of-the-art\nresults.\n\n1\n\nIntroduction\n\nMany tasks of recommender systems can be formulated as recovering an unknown tensor (multi-\nway array) from a few observations of its entries [17, 26, 25, 21]. Recently, convex optimization\nalgorithms for recovering a matrix, which is a special case of tensor, have been extensively studied\n[7, 22, 6]. Moreover, there are several theoretical developments that guarantee exact recovery of\nmost low-rank matrices from partial observations using nuclear norm minimization [8, 5]. These\nresults seem to suggest a promising direction to solve the general problem of tensor recovery.\nHowever, there are inevitable obstacles to generalize the techniques for matrix completion to tensor\nrecovery, since a number of fundamental computational problems of matrix is NP-hard in their\ntensorial analogues [10]. For instance, H\u02daastad showed that it is NP-hard to compute the rank of a\ngiven tensor [9]; Hillar and Lim proved the NP-hardness to decompose a given tensor into sum of\nrank-one tensors even if a tensor is fully observed [10]. The existing evidence suggests that it is\nvery unlikely that there exists an ef\ufb01cient exact recovery algorithm for general tensors with missing\nentries. 
Therefore, it is natural to ask whether it is possible to identify a useful class of tensors for which we can devise an exact recovery algorithm.\n\nIn this paper, we focus on pairwise interaction tensors, which have recently demonstrated strong performance in several recommendation applications, e.g. tag recommendation [19] and sequential data analysis [18]. Pairwise interaction tensors are a special class of general tensors, which directly model the pairwise interactions between different attributes. Take movie recommendation as an example: to model a user's ratings for movies varying over time, a pairwise interaction tensor assumes that each rating is determined by three factors: the user's inherent preference on the movie, the movie's trending popularity and the user's varying mood over time. Formally, a pairwise interaction tensor assumes that each entry Tijk of a tensor T of size n1 × n2 × n3 is given by\n\nTijk = ⟨u_i^(a), v_j^(a)⟩ + ⟨u_j^(b), v_k^(b)⟩ + ⟨u_k^(c), v_i^(c)⟩, for all (i, j, k) ∈ [n1] × [n2] × [n3], (1)\n\nwhere {u_i^(a)}_{i∈[n1]}, {v_j^(a)}_{j∈[n2]} are r1-dimensional vectors, {u_j^(b)}_{j∈[n2]}, {v_k^(b)}_{k∈[n3]} are r2-dimensional vectors, and {u_k^(c)}_{k∈[n3]}, {v_i^(c)}_{i∈[n1]} are r3-dimensional vectors, respectively.¹\n\nThe existing recovery algorithms for pairwise interaction tensors use local optimization methods, which do not guarantee the recovery performance [18, 19]. In this paper, we design efficient recovery algorithms for pairwise interaction tensors with rigorous guarantees. 
More speci\ufb01cally, in the\nabsence of noise, we show that one can exactly recover a pairwise interaction tensor by solving a\nconstrained convex program which minimizes the weighted sum of nuclear norms of matrices from\nO(nr log2(n)) observations, where n = max{n1, n2, n3} and r = max{r1, r2, r3}. For noisy\ncases, we also prove error bounds for a constrained convex program for recovering the tensors.\nIn the proof of our main results, we reformulated the recovery problem as a constrained matrix\ncompletion problem with a special observation operator. Previously, Gross et al. [8] have showed\nthat the nuclear norm heuristic can exactly recover low rank matrix from a suf\ufb01cient number of\nobservations of an orthogonal observation operator. We note that the orthogonality is critical to their\nargument. However, the observation operator, in our case, turns out to be non-orthogonal, which\nbecomes a major challenge in our proof. In order to deal with the non-orthogonal operator, we have\nsubstantially extended their technique in our proof. We believe that our technique can be generalized\nto handle other matrix completion problem with non-orthogonal observation operators.\nMoreover, we extend existing singular value thresholding method to develop a simple and scalable\nalgorithm for solving the recovery problem in both exact and noisy cases. Our experiments on the\nsynthetic dataset demonstrate that the recovery performance of our algorithm agrees well with the\ntheory. Finally, we apply our algorithm on a temporal collaborative \ufb01ltering task and obtain state-\nof-the-art results.\n\n2 Recovering pairwise interaction tensors\n\nIn this section, we \ufb01rst introduce the matrix formulation of pairwise interaction tensors and specify\nthe recovery problem. Then we discuss the suf\ufb01cient conditions on pairwise interaction tensors\nfor which an exact recovery would be possible. 
After that, we formulate the convex program for solving the recovery problem and present our theoretical results on the sample bounds for achieving an exact recovery. In addition, we also show that a quadratically constrained convex program is stable for the recovery from noisy observations.\n\nA matrix formulation of pairwise interaction tensors. The original formulation of pairwise interaction tensors by Rendle et al. [19] is given by Eq. (1), in which each entry of a tensor is the sum of inner products of feature vectors. We can reformulate Eq. (1) more concisely using matrix notation. In particular, we can rewrite Eq. (1) as follows:\n\nTijk = Aij + Bjk + Cki, for all (i, j, k) ∈ [n1] × [n2] × [n3], (2)\n\nwhere we set Aij = ⟨u_i^(a), v_j^(a)⟩, Bjk = ⟨u_j^(b), v_k^(b)⟩, and Cki = ⟨u_k^(c), v_i^(c)⟩ for all (i, j, k). Clearly, matrices A, B and C are rank r1, r2 and r3 matrices, respectively.\n\nWe call a tensor T ∈ R^{n1×n2×n3} a pairwise interaction tensor, denoted as T = Pair(A, B, C), if T obeys Eq. (2). We note that this concise definition is equivalent to the original one. In the rest of this paper, we will exclusively use the matrix formulation of pairwise interaction tensors.\n\nRecovery problem. Suppose we have partial observations of a pairwise interaction tensor T = Pair(A, B, C). We write Ω ⊆ [n1] × [n2] × [n3] for the set of indices of the m observed entries. In this work, we shall assume Ω is sampled uniformly from the collection of all sets of size m. Our goal is to recover the matrices A, B, C, and therefore the entire tensor T, from exact or noisy observations of {Tijk}_{(i,j,k)∈Ω}.\n\nBefore we proceed to the recovery algorithm, we first discuss when the recovery is possible.\n\nRecoverability: uniqueness. 
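As a concrete illustration of Eq. (2) (ours, not part of the original paper), the tensor T = Pair(A, B, C) can be assembled by broadcasting in a few lines of NumPy; the function name `pair` and the toy sizes are our own:

```python
import numpy as np

def pair(A, B, C):
    """Assemble a pairwise interaction tensor T = Pair(A, B, C) with
    T[i, j, k] = A[i, j] + B[j, k] + C[k, i], as in Eq. (2)."""
    n1, n2 = A.shape
    n2b, n3 = B.shape
    n3b, n1b = C.shape
    assert (n1, n2, n3) == (n1b, n2b, n3b), "shapes must chain as n1xn2, n2xn3, n3xn1"
    # Broadcast each matrix along its missing mode and sum.
    return (A[:, :, None]        # shape (n1, n2, 1)
            + B[None, :, :]      # shape (1, n2, n3)
            + C.T[:, None, :])   # shape (n1, 1, n3), from C of shape (n3, n1)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 6))
C = rng.standard_normal((6, 4))
T = pair(A, B, C)
```

The broadcast places A on the (i, j) modes, B on the (j, k) modes and the transpose of C on the (i, k) modes, so each entry satisfies the defining identity of Eq. (2).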
The original recovery problem for pairwise interaction tensors is ill-posed due to a uniqueness issue. In fact, for any pairwise interaction tensor T = Pair(A, B, C), we can construct infinitely many different sets of matrices A′, B′, C′ such that Pair(A, B, C) = Pair(A′, B′, C′). For example, we have Tijk = Aij + Bjk + Cki = (Aij + δai) + Bjk + (Cki − δai), where δ ≠ 0 can be any non-zero constant and a is an arbitrary non-zero vector of size n1. Now, we can construct A′, B′ and C′ by setting A′_ij = Aij + δai, B′_jk = Bjk and C′_ki = Cki − δai. It is clear that T = Pair(A′, B′, C′).\n\nThis ambiguity prevents us from recovering A, B, C even if T is fully observed, since it is entirely possible to recover A′, B′, C′ instead of A, B, C based on the observations. In order to avoid this obstacle, we construct a set of constraints such that, given any pairwise interaction tensor Pair(A, B, C), there exist unique matrices A′, B′, C′ satisfying the constraints and obeying Pair(A, B, C) = Pair(A′, B′, C′). Formally, we prove the following proposition.\n\n¹For simplicity, we only consider three-way tensors in this paper.\n\nProposition 1. 
For any pairwise interaction tensor T = Pair(A, B, C), there exist unique A′ ∈ SA, B′ ∈ SB, C′ ∈ SC such that Pair(A, B, C) = Pair(A′, B′, C′), where we define SA = {M ∈ R^{n1×n2} : 1ᵀM = (1/n2)(1ᵀM1)1ᵀ}, SB = {M ∈ R^{n2×n3} : 1ᵀM = 0ᵀ} and SC = {M ∈ R^{n3×n1} : 1ᵀM = 0ᵀ}.\n\nWe point out that there is a natural connection between the uniqueness issue and the "bias" components, a quantity of much attention in the field of recommender systems [13]. Due to lack of space, we defer the detailed discussion of this connection and the proof of Proposition 1 to the supplementary material.\n\nRecoverability: incoherence. It is easy to see that recovering a pairwise interaction tensor T = Pair(A, 0, 0) is equivalent to recovering the matrix A from a subset of its entries. Therefore, the recovery problem of pairwise interaction tensors subsumes the matrix completion problem as a special case. Previous studies have confirmed that the incoherence condition is an essential requirement on the matrix in order to guarantee a successful recovery. This condition can be stated as follows.\n\nLet M = UΣVᵀ be the singular value decomposition of a rank-r matrix M. We call the matrix M (µ0, µ1)-incoherent if M satisfies:\n\nA0. For all i ∈ [n1] and j ∈ [n2], we have (n1/r) Σ_{k∈[r]} U²_ik ≤ µ0 and (n2/r) Σ_{k∈[r]} V²_jk ≤ µ0.\n\nA1. The maximum entry of UVᵀ is bounded by µ1 √(r/(n1n2)) in absolute value.\n\nIt is well known that the recovery is possible only if the matrix is (µ0, µ1)-incoherent for bounded µ0, µ1 (i.e., µ0 and µ1 are poly-logarithmic with respect to n). 
Since the matrix completion problem is reducible to the recovery problem for pairwise interaction tensors, our theoretical result will inherit the incoherence assumptions on the matrices A, B, C.\n\nExact recovery in the absence of noise. We first consider the scenario where the observations are exact. Specifically, suppose we are given m observations {Tijk}_{(i,j,k)∈Ω}, where Ω is sampled uniformly at random from [n1] × [n2] × [n3]. We propose to recover the matrices A, B, C, and therefore the tensor T = Pair(A, B, C), using the following convex program:\n\nminimize_{X∈SA, Y∈SB, Z∈SC} √n3 ‖X‖* + √n1 ‖Y‖* + √n2 ‖Z‖*\nsubject to Xij + Yjk + Zki = Tijk, (i, j, k) ∈ Ω, (3)\n\nwhere ‖M‖* denotes the nuclear norm of the matrix M, which is the sum of the singular values of M, and SA, SB, SC are defined in Proposition 1.\n\nWe show that, under the incoherence conditions, the above nuclear norm minimization method successfully recovers a pairwise interaction tensor T when the number of observations m is O(nr log² n), with high probability.\n\nTheorem 1. Let T ∈ R^{n1×n2×n3} be a pairwise interaction tensor T = Pair(A, B, C) with A ∈ SA, B ∈ SB, C ∈ SC as defined in Proposition 1. Without loss of generality, assume that 9 ≤ n1 ≤ n2 ≤ n3. Suppose we observe m entries of T with the locations sampled uniformly at random from [n1] × [n2] × [n3], and also suppose that each of A, B, C is (µ0, µ1)-incoherent. Then, there exists a universal constant C such that if\n\nm > C max{µ1², µ0} n3 r β log²(6n3),\n\nwhere r = max{rank(A), rank(B), rank(C)} and β > 2 is a parameter, the minimizing solution X, Y, Z for program Eq. 
(3) is unique and satisfies X = A, Y = B, Z = C with probability at least 1 − 6 log(6n3) n3^{2−β} − 3n3^{2−β}.\n\nStable recovery in the presence of noise. We now move to the case where the observations are perturbed by noise with bounded energy. In particular, our noisy model assumes that we observe\n\nT̂ijk = Tijk + σijk, for all (i, j, k) ∈ Ω, (4)\n\nwhere σijk is a noise term, which may be deterministic or stochastic. We assume σ has bounded energy on Ω, specifically that ‖PΩ(σ)‖F ≤ ε1 for some ε1 > 0, where PΩ(·) denotes the restriction on Ω. Under this assumption on the observations, we derive an error bound for the following quadratically-constrained convex program, which recovers T from the noisy observations:\n\nminimize_{X∈SA, Y∈SB, Z∈SC} √n3 ‖X‖* + √n1 ‖Y‖* + √n2 ‖Z‖*\nsubject to ‖PΩ(Pair(X, Y, Z)) − PΩ(T̂)‖F ≤ ε2. (5)\n\nTheorem 2. Let T = Pair(A, B, C) and A ∈ SA, B ∈ SB, C ∈ SC. Let Ω be the set of observations as described in Theorem 1. Suppose we observe T̂ijk for (i, j, k) ∈ Ω as defined in Eq. (4), and also assume that ‖PΩ(σ)‖F ≤ ε1 holds. Denote the reconstruction error of the optimal solution X, Y, Z of the convex program Eq. (5) as E = Pair(X, Y, Z) − T. Also assume that ε1 ≤ ε2. Then, we have\n\n‖E‖* ≤ 5 √(2rn1n2² / (8β log(n1))) (ε1 + ε2),\n\nwith probability at least 1 − 6 log(6n3) n3^{2−β} − 3n3^{2−β}.\n\nThe proofs of Theorem 1 and Theorem 2 are available in the supplementary material.\n\nRelated work. Rendle et al. 
[19] proposed pairwise interaction tensors as a model used for tag rec-\nommendation. In a subsequent work, Rendle et al. [18] applied pairwise interaction tensors in the\nsequential analysis of purchase data. In both applications, their methods using pairwise interaction\ntensor demonstrated excellent performance. However, their algorithms are prone to local optimal\nissues and the recovered tensor might be very different from its true value. In contrast, our main re-\nsults, Theorem 1 and Theorem 2, guarantee that a convex program can exactly or accurately recover\nthe pairwise interaction tensors from O(nr log2(n)) observations. In this sense, our work can be\nconsidered as a more effective way to recover pairwise interaction tensors from partial observations.\nIn practice, various tensor factorization methods are used for estimating missing entries of tensors\n[12, 20, 1, 26, 16]. In addition, inspired by the success of nuclear norm minimization heuristics in\nmatrix completion, several work used a generalized nuclear norm for tensor recovery [23, 24, 15].\nHowever, these work do not guarantee exact recovery of tensors from partial observations.\n\n.\n\n3\n\n3 Scalable optimization algorithm\n\nThere are several possible methods to solving the optimization problems Eq. (3) and Eq. (5). For\nsmall problem sizes, one may reformulate the optimization problems as semi-de\ufb01nite programs and\nsolve them using interior point method. The state-of-the-art interior point solvers offer excellent\naccuracy for \ufb01nding the optimal solution. However, these solvers become prohibitively slow for\npairwise interaction tensors larger than 100 \u00d7 100 \u00d7 100. In order to apply the recover algorithms\non large scale pairwise interaction tensors, we use singular value thresholding (SVT) algorithm\nproposed recently by Cai et al. 
[3], which is a first-order method with promising performance for solving nuclear norm minimization problems.\n\nWe first discuss the SVT algorithm for solving the exact completion problem Eq. (3). For convenience, we reformulate the original optimization objective Eq. (3) as follows:\n\nminimize_{X∈SA, Y∈SB, Z∈SC} ‖X‖* + ‖Y‖* + ‖Z‖*\nsubject to Xij/√n3 + Yjk/√n1 + Zki/√n2 = Tijk, (i, j, k) ∈ Ω, (6)\n\nwhere we have incorporated the coefficients on the nuclear norm terms into the constraints. It is easy to see that the recovered tensor is given by Pair(n3^{−1/2}X, n1^{−1/2}Y, n2^{−1/2}Z), where X, Y, Z is the optimal solution of Eq. (6). Our algorithm solves a slightly relaxed version of the reformulated objective Eq. (6):\n\nminimize_{X∈SA, Y∈SB, Z∈SC} τ(‖X‖* + ‖Y‖* + ‖Z‖*) + (1/2)(‖X‖F² + ‖Y‖F² + ‖Z‖F²)\nsubject to Xij/√n3 + Yjk/√n1 + Zki/√n2 = Tijk, (i, j, k) ∈ Ω. (7)\n\nIt is easy to see that Eq. (7) is closely related to Eq. (6) and the original problem Eq. (3), as the relaxed problem converges to the original one as τ → ∞. Therefore, by selecting a large value of the parameter τ, a minimizing solution to Eq. (7) nearly minimizes Eq. (3).\n\nOur algorithm iteratively minimizes Eq. (7) and produces a sequence of matrices {Xᵏ, Yᵏ, Zᵏ} converging to the optimal solution (X, Y, Z) that minimizes Eq. (7). We begin with several definitions. 
For observations Ω = {(ai, bi, ci) | i ∈ [m]}, let the operators PΩA : R^{n1×n2} → R^m, PΩB : R^{n2×n3} → R^m and PΩC : R^{n3×n1} → R^m represent the influence of X, Y, Z on the m observations. In particular,\n\nPΩA(X) = (1/√n3) Σ_{i=1}^m X_{aibi} δi, PΩB(Y) = (1/√n1) Σ_{i=1}^m Y_{bici} δi, and PΩC(Z) = (1/√n2) Σ_{i=1}^m Z_{ciai} δi,\n\nwhere δi denotes the i-th standard basis vector of R^m. It is easy to verify that PΩA(X) + PΩB(Y) + PΩC(Z) = PΩ(Pair(n3^{−1/2}X, n1^{−1/2}Y, n2^{−1/2}Z)). We also denote by P*ΩA the adjoint operator of PΩA, and similarly define P*ΩB and P*ΩC. Finally, for a matrix X of size n1 × n2, we define center(X) = X − (1/n1)11ᵀX as the column centering operator that removes the mean of each of the n2 columns, i.e., 1ᵀ center(X) = 0ᵀ.\n\nStarting with y⁰ = 0 and k = 1, our algorithm iteratively computes\n\nStep (1). Xᵏ = shrinkA(P*ΩA(y^{k−1}), τ), Yᵏ = shrinkB(P*ΩB(y^{k−1}), τ), Zᵏ = shrinkC(P*ΩC(y^{k−1}), τ);\n\nStep (2e). eᵏ = PΩ(T) − PΩ(Pair(n3^{−1/2}Xᵏ, n1^{−1/2}Yᵏ, n2^{−1/2}Zᵏ)), yᵏ = y^{k−1} + δeᵏ.\n\nHere shrinkA is a shrinkage operator defined as follows:\n\nshrinkA(M, τ) ≜ argmin_{M̃∈SA} (1/2)‖M̃ − M‖F² + τ‖M̃‖*. (8)\n\nThe shrinkage operators shrinkB and shrinkC are defined similarly, except that they require M̃ to belong to SB and SC, respectively. 
We note that our definition of the shrinkage operators shrinkA, shrinkB and shrinkC is slightly different from that of the original SVT algorithm [3], where M̃ is unconstrained. We can show that our constrained versions of the shrinkage operators can also be calculated using singular value decompositions of column centered matrices.\n\nLet the SVD of the column centered matrix center(M) be center(M) = UΣVᵀ, Σ = diag({σi}). We can prove that the shrinkage operator shrinkB is given by\n\nshrinkB(M, τ) = U diag({σi − τ}+)Vᵀ, (9)\n\nwhere s+ is the positive part of s, that is, s+ = max{0, s}. Since the subspace SC is structurally identical to SB, it is easy to see that the calculation of shrinkC is identical to that of shrinkB. The computation of shrinkA is a little more complicated. We have\n\nshrinkA(M, τ) = U diag({σi − τ}+)Vᵀ + (1/√(n1n2)) ({δ − τ}+ + {δ + τ}−) 11ᵀ, (10)\n\nwhere UΣVᵀ is still the SVD of center(M), δ = (1/√(n1n2)) 1ᵀM1 is a constant and s− = min{0, s} is the negative part of s. The algorithm iterates between Step (1) and Step (2e) and produces a series of (Xᵏ, Yᵏ, Zᵏ) converging to the optimal solution of Eq. (7). The iterative procedure terminates when the training error is small enough, namely, ‖eᵏ‖F ≤ ε. We refer interested readers to [3] for a convergence proof of the SVT algorithm.\n\nThe optimization problem for noisy completion, Eq. (5), can be solved in a similar manner. We only need to modify Step (2e) to incorporate the quadratic constraint of Eq. 
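As an illustrative sketch (ours, not the authors' code), the constrained shrinkage operator of Eq. (9) amounts to column centering followed by soft-thresholding of the singular values; the names `center` and `shrink_B` are our own:

```python
import numpy as np

def center(M):
    # Column centering: subtract each column's mean, so that 1^T center(M) = 0^T.
    return M - M.mean(axis=0, keepdims=True)

def shrink_B(M, tau):
    """Constrained singular value shrinkage, Eq. (9): an SVD of the
    column-centered matrix followed by soft-thresholding of the singular
    values. The result lies in S_B, i.e. its columns sum to zero."""
    U, s, Vt = np.linalg.svd(center(M), full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

Because 1ᵀ center(M) = 0ᵀ, every left singular vector with a non-zero singular value has zero column sum, so the soft-thresholded reconstruction stays inside SB.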
(5) as follows:\n\nStep (2n). eᵏ = PΩ(T̂) − PΩ(Pair(n3^{−1/2}Xᵏ, n1^{−1/2}Yᵏ, n2^{−1/2}Zᵏ)), [yᵏ; sᵏ] = PK([y^{k−1}; s^{k−1}] + δ[eᵏ; −ε]),\n\nwhere PΩ(T̂) is the vector of noisy observations and the cone projection operator PK can be explicitly computed by\n\nPK : (x, t) → (x, t) if ‖x‖ ≤ t; ((‖x‖ + t)/(2‖x‖))(x, ‖x‖) if −‖x‖ ≤ t ≤ ‖x‖; (0, 0) if t ≤ −‖x‖.\n\nBy iterating between Step (1) and Step (2n) and selecting a sufficiently large τ, the algorithm generates a sequence of {Xᵏ, Yᵏ, Zᵏ} that converges to a nearly optimal solution of the noisy completion program Eq. (5) [3]. We have also included a detailed description of both algorithms in the supplementary material.\n\nAt each iteration, we need to compute one singular value decomposition and perform a few elementary matrix additions. We can see that, for each iteration k, Xᵏ vanishes outside of ΩA = {(ai, bi)} and is therefore sparse. Similarly, Yᵏ and Zᵏ are also sparse matrices. Previously, we showed that the computation of the shrinkage operators requires an SVD of a column centered matrix center(M) = M − (1/n1)11ᵀM, which is the sum of a sparse matrix and a rank-one matrix. Clearly, a matrix-vector multiplication of the form center(M)v can be computed in time O(n + m). This enables the use of Lanczos-based SVD implementations, for example PROPACK [14] and SVDPACKC [2], which only need a subroutine for calculating matrix-vector products. In our implementation, we developed a customized version of SVDPACKC for computing the shrinkage operators. 
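The cone projection PK used in Step (2n) has the closed form above; a direct transcription (our sketch, with the hypothetical name `proj_cone`) is:

```python
import numpy as np

def proj_cone(x, t):
    """Projection P_K onto the second-order cone {(x, t) : ||x|| <= t},
    following the three cases of Step (2n)."""
    nx = np.linalg.norm(x)
    if nx <= t:
        return x, t                       # (x, t) is already inside the cone
    if t <= -nx:
        return np.zeros_like(x), 0.0      # projects to the apex of the cone
    c = (nx + t) / (2.0 * nx)             # middle case: -||x|| <= t <= ||x||
    return c * x, c * nx
```

The middle case scales x and replaces t by the scaled norm, which is the standard Euclidean projection onto a second-order cone.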
Further, for an appropriate choice of τ, {Xᵏ, Yᵏ, Zᵏ} turn out to be low-rank matrices, which matches the observations for the original SVT algorithm [3]. Hence, the storage cost of Xᵏ, Yᵏ, Zᵏ can be kept low and we only need to perform a partial SVD to get the first r singular vectors. The estimated rank r is gradually increased during the iterations using a method similar to that suggested in [3, Section 5.1.1]. In sum, the overall complexity per iteration of the recovery algorithm is O(r(n + m)).\n\n4 Experiments\n\nPhase transition in exact recovery. We investigate how the number of measurements affects the success of exact recovery. In this simulation, we fixed n1 = 100, n2 = 150, n3 = 200 and r1 = r2 = r3 = r. We tested a variety of choices of (r, m), and for each choice of (r, m) we repeated the procedure 10 times. Each time, we randomly generated A ∈ SA, B ∈ SB, C ∈ SC of rank r. We generated A ∈ SA by sampling two factor matrices UA ∈ R^{n1×r}, VA ∈ R^{n2×r} with i.i.d. standard Gaussian entries and setting A = PSA(UA VAᵀ), where PSA is the orthogonal projection onto the subspace SA. Matrices B ∈ SB and C ∈ SC were sampled in a similar way. We uniformly sampled a subset Ω of m entries and revealed them to the recovery algorithm. We deemed A, B, C successfully recovered if (‖A‖F + ‖B‖F + ‖C‖F)^{−1}(‖X − A‖F + ‖Y − B‖F + ‖Z − C‖F) ≤ 10^{−3}, where X, Y and Z are the recovered matrices. Finally, we set the parameters τ, δ of the exact recovery algorithm by τ = 10√(n1n2n3) and δ = 0.9m(n1n2n3)^{−1}.\n\nFigure 1 shows the results of these experiments. The x-axis is the ratio between the number of measurements m and the degrees of freedom d = r(n1 + n2 − r) + r(n2 + n3 − r) + r(n3 + n1 − r). Note that a value of the x-axis smaller than one corresponds to a case where there is an infinite number of solutions satisfying the given entries. The y-axis is the rank r of the synthetic matrices. The color of each grid cell indicates the empirical success rate: white denotes exact recovery in all 10 experiments, and black denotes failure in all experiments.\n\nFigure 1: Phase transition with respect to rank and degrees of freedom. Left: m/d ∈ [1, 5]. Right: m/d ∈ [1.5, 3.0].\n\nFrom Figure 1 (Left), we can see that the algorithm succeeded almost certainly when the number of measurements is 2.5 times the degrees of freedom or larger, for most parameter settings. We also observe that, near the boundary m/d ≈ 2.5, there is a relatively sharp phase transition. To verify this phenomenon, we repeated the experiments but varied m/d only between 1.5 and 3.0 with finer steps. The results in Figure 1 (Right) show that the phase transition continued to be sharp at this higher resolution.\n\nStability of recovery from noisy data. In this simulation, we show the recovery performance with respect to noisy data. Again, we fixed n1 = 100, n2 = 150, n3 = 200 and r1 = r2 = r3 = r, and tested against different choices of (r, m). For each choice of (r, m), we sampled the ground truth A, B, C using the same method as in the previous simulation. We generated Ω uniformly at random. For each entry (i, j, k) ∈ Ω, we simulated the noisy observation T̂ijk = Tijk + εijk, where εijk is a zero-mean Gaussian random variable with variance σn². Then, we revealed {T̂ijk}_{(i,j,k)∈Ω} to the noisy recovery algorithm and collected the recovered matrices X, Y, Z. 
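For reference (our own helper, not from the paper), the degrees-of-freedom count d used on the x-axis of Figure 1 is straightforward to compute:

```python
def degrees_of_freedom(n1, n2, n3, r):
    # d = r(n1+n2-r) + r(n2+n3-r) + r(n3+n1-r): the parameter count of
    # three rank-r matrices of sizes n1 x n2, n2 x n3 and n3 x n1.
    return r * (n1 + n2 - r) + r * (n2 + n3 - r) + r * (n3 + n1 - r)

# Simulation sizes n1, n2, n3 = 100, 150, 200; the sampling ratio m/d then
# determines how many observations m to reveal for a given rank r.
d = degrees_of_freedom(100, 150, 200, 10)
```

Each rank-r matrix of size p × q contributes r(p + q − r) parameters, which is the dimension of the manifold of rank-r matrices.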
The error of the recovery result is measured by (‖X − A‖F + ‖Y − B‖F + ‖Z − C‖F)/(‖A‖F + ‖B‖F + ‖C‖F). We tested the algorithm with a range of noise levels, and for each configuration of (r, m, σn²) we repeated the experiments 10 times and recorded the mean and standard deviation of the relative error.\n\nTable 1: Simulation results for noisy data.\n\n(a) Fix r = 20, m = 5d; the noise level varies.\nnoise level | relative error\n0.1 | 0.1020 ± 0.0005\n0.2 | 0.1972 ± 0.0007\n0.3 | 0.2877 ± 0.0011\n0.4 | 0.3720 ± 0.0015\n0.5 | 0.4524 ± 0.0015\n\n(b) Fix r = 20, noise level 0.1; m varies.\nobservations m | relative error\nm = 3d | 0.1445 ± 0.0008\nm = 4d | 0.1153 ± 0.0006\nm = 5d | 0.1015 ± 0.0004\nm = 6d | 0.0940 ± 0.0007\nm = 7d | 0.0920 ± 0.0011\n\n(c) Fix m = 5d, noise level 0.1; r varies.\nrank r | relative error\n10 | 0.1134 ± 0.0006\n20 | 0.1018 ± 0.0007\n30 | 0.0973 ± 0.0037\n40 | 0.1032 ± 0.0212\n50 | 0.1520 ± 0.0344\n\nWe present the results of these experiments in Table 1. From the results in Table 1(a), we can see that the error in the solution is proportional to the noise level. Table 1(b) indicates that the recovery is not reliable when we have too few observations, while the performance of the algorithm is much more stable once the number of observations reaches roughly four times the degrees of freedom. Table 1(c) shows that the recovery error is not affected much by the rank, as the number of observations scales with the degrees of freedom in our setting.\n\nTemporal collaborative filtering. In order to demonstrate the performance of pairwise interaction tensors in real world applications, we conducted experiments on the Movielens dataset. 
The Movie-\nLens dataset contains 1,000,209 ratings from 6,040 users and 3,706 movies from April, 2000 and\nFebruary, 2003. Each rating from Movielens dataset is accompanied with time information provided\nin seconds. We transformed each timestamp into its corresponding calendar month. We randomly\nselect 10% ratings as test set and use the rest of the ratings as training set. In the end, we obtained\na tensor T of size 6040 \u00d7 3706 \u00d7 36, in which the axes corresponded to user, movie and times-\ntamp respectively, with 0.104% observed entries as the training set. We applied the noisy recovery\nalgorithm on the training set. Following previous studies which applies SVT algorithm on movie\nrecommendation datasets [11], we used a pre-speci\ufb01ed truncation level r for computing SVD in\neach iteration, i.e., we only kept top r singular vectors. Therefore, the rank of recovered matrices\nare at most r.\n\n7\n\n\fWe evaluated the prediction performance in terms of root mean squared error (RMSE). We com-\npared our algorithm with noisy matrix completion method using standard SVT optimization algo-\nrithm [3, 4] to the same dataset while ignore the time information. Here we can regard the noisy\nmatrix completion algorithm as a special case of the recover a pairwise interaction tensor of size\n6040 \u00d7 3706 \u00d7 1, i.e., the time information is ignored. We also noted that the training tensor had\nmore than one million observed entries and 80 millions total entries. This scale made a number of\ntensor recovery algorithms, for example Tucker decomposition and PARAFAC [12], impractical to\napply on the dataset. In contrast, our recovery algorithm took 2430 seconds to \ufb01nish on a standard\nworkstation for truncation level r = 100.\nThe experimental result is shown in Figure 2. 
The empirical results in Figure 2(a) suggest that, by incorporating the temporal information, the pairwise interaction tensor recovery algorithm consistently outperformed the matrix completion method. Interestingly, for most parameter settings in Figure 2(b), our algorithm recovered a rank-2 matrix Y representing the change of movie popularity over time and a rank-15 matrix Z encoding the change of user interests over time. The improvement in prediction performance is likely due to the meaningful signal provided by the recovered matrices Y and Z. Finally, we note that our algorithm achieves an RMSE of 0.858 when the truncation level is set to 50, slightly outperforming the RMSE of 0.861 (as reported in Figure 7 of [26]) obtained by 30-dimensional Bayesian Probabilistic Tensor Factorization (BPTF) on the same dataset, where the authors predict the ratings by factorizing a 6040 × 3706 × 36 tensor with the BPTF method. We attribute the performance gain to the modeling flexibility of pairwise interaction tensors and the learning guarantees of our algorithm.

Figure 2: Empirical results on the MovieLens dataset. (a) Comparison of RMSE at different truncation levels. MC: matrix completion algorithm. RPIT: recovery algorithm for pairwise interaction tensors. (b) Ranks of the recovered matrices X, Y, Z: r1 = rank(X), r2 = rank(Y), r3 = rank(Z).

5 Conclusion

In this paper, we proved rigorous guarantees for convex programs for the recovery of pairwise interaction tensors with missing entries, both in the absence and in the presence of noise. We designed a scalable optimization algorithm for solving the convex programs. We supplemented our theoretical results with simulation experiments and a real-world application to movie recommendation.
In the noiseless case, simulations showed that exact recovery almost always succeeds when the number of observations is a constant multiple of the degrees of freedom, which agrees asymptotically with the theoretical result. In the noisy case, the simulation results confirmed that the stable recovery algorithm reliably recovers pairwise interaction tensors from noisy observations. Our results on the temporal movie recommendation application demonstrated that, by incorporating the temporal information, our algorithm outperforms conventional matrix completion and achieves state-of-the-art results.

Acknowledgments

This work was fully supported by the Basic Research Program of Shenzhen (Project No. JCYJ20120619152419087 and JC201104220300A), and the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK 413212 and CUHK 415212).

References
[1] Evrim Acar, Daniel M. Dunlavy, Tamara G. Kolda, and Morten Mørup. Scalable tensor factorizations for incomplete data. Chemometrics and Intelligent Laboratory Systems, 106(1):41–56, 2011.
[2] M. Berry et al. SVDPACKC (version 1.0) user's guide. Technical report, University of Tennessee, 1993 (revised October 1996).
[3] Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
[4] Emmanuel J. Candès and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
[5] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
[6] A. Evgeniou and Massimiliano Pontil. Multi-task feature learning.
2007.
[7] Maryam Fazel, Haitham Hindi, and Stephen P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In American Control Conference, 2001.
[8] David Gross, Yi-Kai Liu, Steven T. Flammia, Stephen Becker, and Jens Eisert. Quantum state tomography via compressed sensing. Physical Review Letters, 105(15):150401, 2010.
[9] Johan Håstad. Tensor rank is NP-complete. Journal of Algorithms, 11(4):644–654, 1990.
[10] Christopher Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. Journal of the ACM, 2013.
[11] Prateek Jain, Raghu Meka, and Inderjit Dhillon. Guaranteed rank minimization via singular value projection. In NIPS, 2010.
[12] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[13] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
[14] Rasmus Munk Larsen. PROPACK: software for large and sparse SVD calculations. Available online, 2004.
[15] Ji Liu, Przemyslaw Musialski, Peter Wonka, and Jieping Ye. Tensor completion for estimating missing values in visual data. In ICCV, 2009.
[16] Ian Porteous, Evgeniy Bart, and Max Welling. Multi-HDP: a non-parametric Bayesian model for tensor factorization. In AAAI, 2008.
[17] Steffen Rendle, Leandro Balby Marinho, Alexandros Nanopoulos, and Lars Schmidt-Thieme. Learning optimal ranking with tensor factorization for tag recommendation. In SIGKDD, 2009.
[18] Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. Factorizing personalized Markov chains for next-basket recommendation. In WWW, 2010.
[19] Steffen Rendle and Lars Schmidt-Thieme. Pairwise interaction tensor factorization for personalized tag recommendation. In ICDM, 2010.
[20] Amnon Shashua and Tamir Hazan. Non-negative tensor factorization with applications to statistics and computer vision.
In ICML, 2005.
[21] Yue Shi, Alexandros Karatzoglou, Linas Baltrunas, Martha Larson, Alan Hanjalic, and Nuria Oliver. TFMAP: optimizing MAP for top-n context-aware recommendation. In SIGIR, 2012.
[22] Nathan Srebro, Jason D. M. Rennie, and Tommi Jaakkola. Maximum-margin matrix factorization. In NIPS, 2005.
[23] Ryota Tomioka, Kohei Hayashi, and Hisashi Kashima. Estimation of low-rank tensors via convex optimization. arXiv preprint arXiv:1010.0789, 2010.
[24] Ryota Tomioka, Taiji Suzuki, Kohei Hayashi, and Hisashi Kashima. Statistical performance of convex tensor decomposition. In NIPS, 2011.
[25] Jason Weston, Chong Wang, Ron Weiss, and Adam Berenzweig. Latent collaborative retrieval. In ICML, 2012.
[26] Liang Xiong, Xi Chen, Tzu-Kuo Huang, Jeff Schneider, and Jaime G. Carbonell. Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In SDM, 2010.