{"title": "A New Convex Relaxation for Tensor Completion", "book": "Advances in Neural Information Processing Systems", "page_first": 2967, "page_last": 2975, "abstract": "We study the problem of learning a tensor from a set of linear measurements. A prominent methodology for this problem is based on the extension of trace norm regularization, which has been used extensively for learning low rank matrices, to the tensor setting. In this paper, we highlight some limitations of this approach and propose an alternative convex relaxation on the Euclidean unit ball. We then describe a technique to solve the associated regularization problem, which builds upon the alternating direction method of multipliers. Experiments on one synthetic dataset and two real datasets indicate that the proposed method improves significantly over tensor trace norm regularization in terms of estimation error, while remaining computationally tractable.", "full_text": "A New Convex Relaxation for Tensor Completion\n\nBernardino Romera-Paredes\nDepartment of Computer Science\n\nand UCL Interactive Centre\nUniversity College London\n\nMalet Place, London WC1E 6BT, UK\n\nMassimiliano Pontil\n\nDepartment of Computer Science and\n\nCentre for Computational Statistics\n\nand Machine Learning\n\nUniversity College London\n\nB.RomeraParedes@cs.ucl.ac.uk\n\nMalet Place, London WC1E 6BT, UK\n\nm.pontil@cs.ucl.ac.uk\n\nAbstract\n\nWe study the problem of learning a tensor from a set of linear measurements.\nA prominent methodology for this problem is based on a generalization of trace\nnorm regularization, which has been used extensively for learning low rank ma-\ntrices, to the tensor setting. In this paper, we highlight some limitations of this\napproach and propose an alternative convex relaxation on the Euclidean ball. We\nthen describe a technique to solve the associated regularization problem, which\nbuilds upon the alternating direction method of multipliers. 
Experiments on one synthetic dataset and two real datasets indicate that the proposed method improves significantly over tensor trace norm regularization in terms of estimation error, while remaining computationally tractable.

1 Introduction

In recent years, there has been growing interest in the problem of learning a tensor from a set of linear measurements, such as a subset of its entries; see [9, 17, 22, 23, 25, 26, 27] and references therein. This methodology, which is also referred to as tensor completion, has been applied to various fields, ranging from collaborative filtering [15], to computer vision [17], and medical imaging [9], among others. In this paper, we propose a new method for tensor completion, which is based on a convex regularizer that encourages low rank tensors, and we develop an algorithm for solving the associated regularization problem.

Arguably the most widely used convex approach to tensor completion is based upon the extension of trace norm regularization [24] to that context. This involves computing the average of the trace norm of each matricization of the tensor [16]. A key insight behind using trace norm regularization for matrix completion is that this norm provides a tight convex relaxation of the rank of a matrix defined on the spectral unit ball [8]. Unfortunately, the extension of this methodology to the more general tensor setting presents some difficulties. In particular, we shall prove in this paper that the tensor trace norm is not a tight convex relaxation of the tensor rank.

The above negative result stems from the fact that the spectral norm, used to compute the convex relaxation for the trace norm, is not an invariant property of the matricization of a tensor. This observation leads us to take a different route and study afresh the convex relaxation of tensor rank on the Euclidean ball. 
We show that this relaxation is tighter than the tensor trace norm, and we describe a technique to solve the associated regularization problem. This method builds upon the alternating direction method of multipliers and a subgradient method to compute the proximity operator of the proposed regularizer. Furthermore, we present numerical experiments on one synthetic dataset and two real-life datasets, which indicate that the proposed method improves significantly over tensor trace norm regularization in terms of estimation error, while remaining computationally tractable.

The paper is organized in the following manner. In Section 2, we describe the tensor completion framework. In Section 3, we highlight some limitations of the tensor trace norm regularizer and present an alternative convex relaxation for the tensor rank. In Section 4, we describe a method to solve the associated regularization problem. In Section 5, we report on our numerical experience with the proposed method. Finally, in Section 6, we summarize the main contributions of this paper and discuss future directions of research.

2 Preliminaries

In this section, we begin by introducing some notation and then proceed to describe the learning problem. We denote by ℕ the set of natural numbers and, for every k ∈ ℕ, we define [k] = {1, ..., k}. Let N ∈ ℕ and let¹ p₁, ..., p_N ≥ 2. An N-order tensor 𝒲 ∈ ℝ^{p₁×···×p_N} is a collection of real numbers (𝒲_{i₁,...,i_N} : i_n ∈ [p_n], n ∈ [N]). Boldface Euler scripts, e.g. 𝒲, will be used to denote tensors of order higher than two. Vectors are 1-order tensors and will be denoted by lower case letters, e.g. x or a; matrices are 2-order tensors and will be denoted by upper case letters, e.g. W. If x ∈ ℝ^d then for every r ≤ s ≤ d, we define x_{r:s} := (x_i : r ≤ i ≤ s). We also use the notation p_min = min{p₁, ..., p_N} and p_max = max{p₁, ..., p_N}.

A mode-n fiber of a tensor 𝒲 is a vector composed of the elements of 𝒲 obtained by fixing all indices but one, corresponding to the n-th mode. This notion is a higher order analogue of columns (mode-1 fibers) and rows (mode-2 fibers) for matrices. The mode-n matricization (or unfolding) of 𝒲, denoted by W_{(n)}, is a matrix obtained by arranging the mode-n fibers of 𝒲 so that each of them is a column of W_{(n)} ∈ ℝ^{p_n×J_n}, where J_n := ∏_{k≠n} p_k. Note that the ordering of the columns is not important as long as it is used consistently.

We are now ready to describe the learning problem. We choose a linear operator I : ℝ^{p₁×···×p_N} → ℝ^m, representing a set of linear measurements obtained from a target tensor 𝒲⁰ as y = I(𝒲⁰) + ξ, where ξ is some disturbance noise. Tensor completion is an important example of this setting; in this case the operator I returns the known elements of the tensor. That is, we have I(𝒲⁰) = (𝒲⁰_{i₁(j),...,i_N(j)} : j ∈ [m]), where, for every j ∈ [m] and n ∈ [N], the index i_n(j) is a prescribed integer in the set [p_n]. Our aim is to recover the tensor 𝒲⁰ from the data (I, y). To this end, we solve the regularization problem

$$\min\big\{\, \|y - I(\mathcal{W})\|_2^2 + \gamma\, R(\mathcal{W}) \;:\; \mathcal{W} \in \mathbb{R}^{p_1\times\cdots\times p_N} \big\} \qquad (1)$$

where γ is a positive parameter which may be chosen by cross validation. The role of the regularizer R is to encourage solutions 𝒲 which have a simple structure in the sense that they involve a small number of "degrees of freedom". A natural choice is to consider the average of the rank of the tensor's matricizations. Specifically, we consider the combinatorial regularizer

$$R(\mathcal{W}) = \frac{1}{N}\sum_{n=1}^{N} \operatorname{rank}\big(W_{(n)}\big). \qquad (2)$$

Finding a convex relaxation of this regularizer has been the subject of recent works [9, 17, 23]. They all agree to use the sum of nuclear norms as a convex proxy of R. This is defined as the average of the trace norm of each matricization of 𝒲, that is,

$$\|\mathcal{W}\|_{\mathrm{tr}} = \frac{1}{N}\sum_{n=1}^{N} \big\|W_{(n)}\big\|_{\mathrm{tr}} \qquad (3)$$

where ‖W_{(n)}‖_tr is the trace (or nuclear) norm of matrix W_{(n)}, namely the ℓ₁-norm of the vector of singular values of matrix W_{(n)} (see, e.g. [14]). Note that in the particular case of 2-order tensors, functions (2) and (3) coincide with the usual notion of rank and trace norm of a matrix, respectively.

A rationale behind the regularizer (3) is that the trace norm is the tightest convex lower bound to the rank of a matrix on the spectral unit ball, see [8, Thm. 1]. This lower bound is given by the convex envelope of the function

$$\Psi(W) = \begin{cases} \operatorname{rank}(W), & \text{if } \|W\|_\infty \le 1 \\ +\infty, & \text{otherwise} \end{cases} \qquad (4)$$

where ‖·‖_∞ is the spectral norm, namely the largest singular value of W. The convex envelope can be derived by computing the double conjugate of Ψ. This is defined as

$$\Psi^{**}(W) = \sup\big\{ \langle W, S\rangle - \Psi^{*}(S) \;:\; S \in \mathbb{R}^{p_1\times p_2} \big\} \qquad (5)$$

where Ψ* is the conjugate of Ψ, namely Ψ*(S) = sup{⟨W, S⟩ − Ψ(W) : W ∈ ℝ^{p₁×p₂}}. Note that Ψ is a spectral function, that is, Ψ(W) = ψ(σ(W)) where ψ : ℝ₊^d → ℝ denotes the associated symmetric gauge function. Using von Neumann's trace theorem (see e.g. [14]) it is easily seen that Ψ*(S) is also a spectral function. That is, Ψ*(S) = ψ*(σ(S)), where

$$\psi^{*}(\sigma) = \sup\big\{ \langle \sigma, w\rangle - \psi(w) \;:\; w \in \mathbb{R}^{d}_{+} \big\}, \quad \text{with } d := \min(p_1, p_2).$$

We refer to [8] for a detailed discussion of these ideas. 

¹For simplicity we assume that p_n ≥ 2 for every n ∈ [N], otherwise we simply reduce the order of the tensor without loss of information.
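As a concrete aside, the mode-n matricization defined in Section 2 can be sketched in a few lines of numpy (the helper name `unfold` is ours, and this is only one possible column-ordering convention); the final loop checks the fact, used later in Section 3, that the Euclidean (Frobenius) norm is invariant under matricization:

```python
import numpy as np

def unfold(W, n):
    """Mode-n matricization W_(n): the mode-n fibers of W become the columns.

    Rows are indexed by the n-th mode (size p_n); the columns follow one fixed
    ordering convention, which, as noted in the text, is immaterial as long as
    it is used consistently.
    """
    return np.moveaxis(W, n, 0).reshape(W.shape[n], -1)

W = np.arange(24.0).reshape(2, 3, 4)   # a 3-order tensor in R^{2x3x4}
# mode-2 unfolding has shape p_2 x (p_1*p_3) = (3, 8)
assert unfold(W, 1).shape == (3, 8)

# The Euclidean (Frobenius) norm is invariant under matricization, while the
# spectral norm generally is not -- the observation driving Section 3.
for n in range(3):
    assert np.isclose(np.linalg.norm(unfold(W, n)), np.linalg.norm(W))
```

By contrast, the spectral norms of the different unfoldings generally differ, which is precisely the obstacle for extending the matrix argument to tensors.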
We will use this equivalence between spectral and gauge functions repeatedly in the paper.

3 Alternative Convex Relaxation

In this section, we show that the tensor trace norm is not a tight convex relaxation of the tensor rank R in equation (2). We then propose an alternative convex relaxation for this function.

Note that due to the composite nature of the function R, computing its convex envelope is a challenging task and one needs to resort to approximations. In [22], the authors note that the tensor trace norm ‖·‖_tr in equation (3) is a convex lower bound to R on the set

$$G_\infty := \big\{ \mathcal{W} \in \mathbb{R}^{p_1\times\cdots\times p_N} \;:\; \big\|W_{(n)}\big\|_\infty \le 1,\ \forall n \in [N] \big\}.$$

The key insight behind this observation is summarized in Lemma 4, which we report in Appendix A. However, the authors of [22] leave open the question of whether the tensor trace norm is the convex envelope of R on the set G_∞. In the following, we will prove that this question has a negative answer by showing that there exists a convex function Ω ≠ ‖·‖_tr which underestimates the function R on G_∞ and such that for some tensor 𝒲 ∈ G_∞ it holds that Ω(𝒲) > ‖𝒲‖_tr.

To describe our observation we introduce the set

$$G_2 := \big\{ \mathcal{W} \in \mathbb{R}^{p_1\times\cdots\times p_N} \;:\; \|\mathcal{W}\|_2 \le 1 \big\}$$

where ‖·‖₂ is the Euclidean norm for tensors, that is,

$$\|\mathcal{W}\|_2^2 := \sum_{i_1=1}^{p_1} \cdots \sum_{i_N=1}^{p_N} \big(\mathcal{W}_{i_1,\ldots,i_N}\big)^2.$$

We will choose

$$\Omega(\mathcal{W}) = \Omega_\alpha(\mathcal{W}) := \frac{1}{N}\sum_{n=1}^{N} \omega^{**}_\alpha\big(\sigma\big(W_{(n)}\big)\big) \qquad (6)$$

where ω**_α is the convex envelope of the cardinality of a vector on the ℓ₂-ball of radius α and we will choose α = √p_min. 
Note, by Lemma 4 stated in Appendix A, that for every α > 0, the function Ω_α is a convex lower bound of the function R on the set αG₂.

Below, for every vector s ∈ ℝ^d we denote by s↓ the vector obtained by reordering the components of s so that they are non increasing in absolute value, that is, |s↓₁| ≥ ··· ≥ |s↓_d|.

Lemma 1. Let ω**_α be the convex envelope of the cardinality on the ℓ₂-ball of radius α. Then, for every x ∈ ℝ^d such that ‖x‖₂ = α, it holds that ω**_α(x) = card(x).

This lemma is proved in Appendix B. The function ω**_α resembles the norm developed in [1], which corresponds to the convex envelope of the indicator function of the cardinality of a vector in the ℓ₂ ball. The extension of its application to tensors is not straightforward though, as it is required to specify beforehand the rank of each matricization.

The next lemma provides, together with Lemma 1, a sufficient condition for the existence of a tensor 𝒲 ∈ G_∞ at which the regularizer in equation (6) is strictly larger than the tensor trace norm.

Lemma 2. If N ≥ 3 and p₁, ..., p_N are not all equal to each other, then there exists 𝒲 ∈ ℝ^{p₁×···×p_N} such that: (a) ‖𝒲‖₂ = √p_min, (b) 𝒲 ∈ G_∞, (c) min_{n∈[N]} rank(W_{(n)}) < max_{n∈[N]} rank(W_{(n)}).

The proof of this lemma is presented in Appendix C. We are now ready to formulate the main result of this section.

Proposition 3. Let p₁, ..., p_N ∈ ℕ, let ‖·‖_tr be the tensor trace norm in equation (3) and let Ω_α be the function in equation (6) for α = √p_min. If p_min < p_max, then there are infinitely many tensors 𝒲 ∈ G_∞ such that Ω_α(𝒲) > ‖𝒲‖_tr. Moreover, for every 𝒲 ∈ G₂, it holds that Ω₁(𝒲) ≥ ‖𝒲‖_tr.

Proof. 
By construction Ω_α(𝒲) ≤ R(𝒲) for every 𝒲 ∈ αG₂. Since G_∞ ⊂ αG₂, Ω_α is a convex lower bound for the tensor rank R on the set G_∞ as well. The first claim now follows by Lemmas 1 and 2. Indeed, all tensors obtained following the process described in the proof of Lemma 2 (in Appendix C) have the property that

$$\|\mathcal{W}\|_{\mathrm{tr}} = \frac{1}{N}\sum_{n=1}^{N} \big\|\sigma\big(W_{(n)}\big)\big\|_1 = \frac{1}{N}\Big(p_{\min}(N-1) + \sqrt{p_{\min}^2 + p_{\min}}\Big) < \frac{1}{N}\big(p_{\min}(N-1) + p_{\min} + 1\big) = \Omega(\mathcal{W}) = R(\mathcal{W}).$$

Furthermore there are infinitely many such tensors which satisfy this claim (see Appendix C).

With respect to the second claim, given that ω**₁ is the convex envelope of the cardinality card on the Euclidean unit ball, ω**₁(σ) ≥ ‖σ‖₁ for every vector σ such that ‖σ‖₂ ≤ 1. Consequently,

$$\Omega_1(\mathcal{W}) = \frac{1}{N}\sum_{n=1}^{N} \omega^{**}_1\big(\sigma\big(W_{(n)}\big)\big) \ge \frac{1}{N}\sum_{n=1}^{N} \big\|\sigma\big(W_{(n)}\big)\big\|_1 = \|\mathcal{W}\|_{\mathrm{tr}}.$$

The above result stems from the fact that the spectral norm is not an invariant property of the matricization of a tensor, whereas the Euclidean (Frobenius) norm is. This observation leads us to further study the function Ω_α.

4 Optimization Method

In this section, we explain how to solve the regularization problem associated with the regularizer (6). For this purpose, we first recall the alternating direction method of multipliers (ADMM) [4], which was conveniently applied to tensor trace norm regularization in [9, 22].

4.1 Alternating Direction Method of Multipliers (ADMM)

To explain ADMM we consider a more general problem comprising both tensor trace norm regularization and the regularizer we propose,

$$\min_{\mathcal{W}} \Big\{ E(\mathcal{W}) + \gamma \sum_{n=1}^{N} \Psi\big(W_{(n)}\big) \Big\} \qquad (7)$$

where E(𝒲) is an error term such as ‖y − I(𝒲)‖₂² and Ψ is a convex spectral function. It is defined, for every matrix A, as

$$\Psi(A) = \psi(\sigma(A))$$

where ψ is a gauge function, namely a function which is symmetric and invariant under permutations. In particular, if ψ is the ℓ₁ norm then problem (7) corresponds to tensor trace norm regularization, whereas if ψ = ω**_α it implements the proposed regularizer.

Problem (7) poses some difficulties because the terms under the summation are interdependent, due to the different matricizations of 𝒲 having the same elements rearranged in a different way. In order to overcome this difficulty, the authors of [9, 22] proposed to use ADMM as a natural way to decouple the regularization term appearing in problem (7). This strategy is based on the introduction of N auxiliary tensors, B₁, ..., B_N ∈ ℝ^{p₁×···×p_N}, so that problem (7) can be reformulated as²

$$\min_{\mathcal{W},\mathcal{B}_1,\ldots,\mathcal{B}_N} \Big\{ \frac{1}{\gamma}\, E(\mathcal{W}) + \sum_{n=1}^{N} \Psi\big(B_{n(n)}\big) \;:\; \mathcal{B}_n = \mathcal{W},\ n\in[N] \Big\}. \qquad (8)$$

The corresponding augmented Lagrangian (see e.g. [4, 5]) is given by

$$L(\mathcal{W},\mathcal{B},\mathcal{A}) = \frac{1}{\gamma}\, E(\mathcal{W}) + \sum_{n=1}^{N} \Big( \Psi\big(B_{n(n)}\big) - \langle \mathcal{A}_n,\, \mathcal{W} - \mathcal{B}_n \rangle + \frac{\beta}{2}\, \|\mathcal{W} - \mathcal{B}_n\|_2^2 \Big), \qquad (9)$$

where ⟨·,·⟩ denotes the scalar product between tensors, β is a positive parameter and A₁, ..., A_N ∈ ℝ^{p₁×···×p_N} are the Lagrange multipliers associated with the constraints in problem (8).

ADMM is based on the following iterative scheme:

$$\mathcal{W}^{[i+1]} \leftarrow \operatorname*{argmin}_{\mathcal{W}}\; L\big(\mathcal{W}, \mathcal{B}^{[i]}, \mathcal{A}^{[i]}\big) \qquad (10)$$

$$\mathcal{B}_n^{[i+1]} \leftarrow \operatorname*{argmin}_{\mathcal{B}_n}\; L\big(\mathcal{W}^{[i+1]}, \mathcal{B}, \mathcal{A}^{[i]}\big) \qquad (11)$$

$$\mathcal{A}_n^{[i+1]} \leftarrow \mathcal{A}_n^{[i]} - \beta\big(\mathcal{W}^{[i+1]} - \mathcal{B}_n^{[i+1]}\big). \qquad (12)$$

Step (12) is straightforward, whereas step (10) is described in [9]. Here we focus on the step (11) since this is the only problem which involves function Ψ. 
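To make the iterative scheme concrete for tensor completion, where E(𝒲) = ‖y − I(𝒲)‖₂² and I selects observed entries, here is a minimal numpy sketch using singular-value soft thresholding as the proximity step, i.e. the tensor trace norm case of Ψ. All function names and parameter values are ours, not the authors' implementation; swapping `prox_trace` for the proximity operator of ω**_α (Section 4.2) would give the proposed method.

```python
import numpy as np

def unfold(T, n):
    # mode-n matricization (Section 2)
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def fold(M, n, shape):
    # inverse of unfold: rebuild the tensor from its mode-n matricization
    moved = (shape[n],) + tuple(s for i, s in enumerate(shape) if i != n)
    return np.moveaxis(M.reshape(moved), 0, n)

def prox_trace(X, t):
    # prox of t*||.||_tr: soft-threshold the singular values
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def admm_complete(W_obs, mask, gamma=1e-4, beta=1.0, iters=100):
    """ADMM-style tensor completion sketch; mask is a 0/1 tensor of observed entries."""
    shape, N = W_obs.shape, W_obs.ndim
    W = np.zeros(shape)
    B = [np.zeros(shape) for _ in range(N)]
    A = [np.zeros(shape) for _ in range(N)]
    for _ in range(iters):
        # W-step: with E(W) = ||mask*(W - W_obs)||^2 the subproblem is an
        # elementwise quadratic, solved in closed form
        num = (2.0 / gamma) * mask * W_obs + sum(beta * B[n] + A[n] for n in range(N))
        den = (2.0 / gamma) * mask + N * beta
        W = num / den
        # B-step: prox of (1/beta)*Psi at the unfolding of W - A_n/beta
        for n in range(N):
            X = unfold(W - A[n] / beta, n)
            B[n] = fold(prox_trace(X, 1.0 / beta), n, shape)
        # dual update
        for n in range(N):
            A[n] = A[n] - beta * (W - B[n])
    return W
```

With a small γ the data-fit term dominates, so the recovered tensor matches the observed entries closely while the prox step promotes low mode-n ranks.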
We restate step (11) with more explanatory notation as

$$\operatorname*{argmin}_{B_{n(n)}} \Big\{ \Psi\big(B_{n(n)}\big) - \big\langle A_{n(n)},\, W_{(n)} - B_{n(n)} \big\rangle + \frac{\beta}{2}\, \big\|W_{(n)} - B_{n(n)}\big\|_2^2 \Big\}.$$

By completing the square in the right hand side, the solution of this problem is given by

$$\hat B_{n(n)} = \operatorname{prox}_{\frac{1}{\beta}\Psi}(X) := \operatorname*{argmin}_{B_{n(n)}} \Big\{ \frac{1}{\beta}\, \Psi\big(B_{n(n)}\big) + \frac{1}{2}\, \big\|B_{n(n)} - X\big\|_2^2 \Big\},$$

where X = W_{(n)} − (1/β) A_{n(n)}. By using properties of proximity operators (see e.g. [2, Prop. 3.1]) we know that if ψ is a gauge function then

$$\operatorname{prox}_{\frac{1}{\beta}\Psi}(X) = U_X\, \operatorname{diag}\Big(\operatorname{prox}_{\frac{1}{\beta}\psi}\big(\sigma(X)\big)\Big)\, V_X^{\top},$$

where U_X and V_X are the orthogonal matrices formed by the left and right singular vectors of X, respectively. If we choose ψ = ‖·‖₁ the associated proximity operator is the well-known soft thresholding operator, that is, prox_{(1/β)‖·‖₁}(σ) = v, where the vector v has components

$$v_i = \operatorname{sign}(\sigma_i)\,\Big(|\sigma_i| - \frac{1}{\beta}\Big)_{+}.$$

On the other hand, if we choose ψ = ω**_α, we need to compute prox_{(1/β)ω**_α}. In the next section, we describe a method to accomplish this task.

4.2 Computation of the Proximity Operator

To compute the proximity operator of the function (1/β) ω**_α we will use several properties of proximity calculus. First, we use the formula (see e.g. [7]) prox_{g*}(x) = x − prox_g(x) for g* = (1/β) ω**_α. Next we use a property of conjugate functions from [21, 13], which states that g(·) = (1/β) ω*_α(β·). Finally, by the scaling property of proximity operators [7], we have that prox_g(x) = (1/β) prox_{βω*_α}(βx).

²The somewhat cumbersome notation B_{n(n)} denotes the mode-n matricization of tensor B_n, that is, B_{n(n)} = (B_n)_{(n)}.

Algorithm 1 Computation of prox_{βω*_α}(y)
Input: y ∈ ℝ^d, α, β > 0.
Output: ŵ ∈ ℝ^d.
Initialization: initial step τ₀ = 1/2, initial and best found solution w⁰ = ŵ = P_S(y) ∈ ℝ^d.
for t = 1, 2, ... do
    τ ← τ₀ / √t
    Find k such that k ∈ argmax { α‖w^{t−1}_{1:r}‖₂ − r : 0 ≤ r ≤ d }
    w̃_{1:k} ← w^{t−1}_{1:k} − τ ( (1 + αβ / ‖w^{t−1}_{1:k}‖₂) w^{t−1}_{1:k} − y_{1:k} )
    w̃_{k+1:d} ← w^{t−1}_{k+1:d} − τ ( w^{t−1}_{k+1:d} − y_{k+1:d} )
    w^t ← P̃_S(w̃)
    if h(w^t) < h(ŵ) then ŵ ← w^t
    if "Stopping Condition = True" then terminate
end for

It remains to compute the proximity operator of a multiple of the function ω*_α in equation (13), that is, for any β > 0, y ∈ S, we wish to compute

$$\operatorname{prox}_{\beta\omega^{*}_\alpha}(y) = \operatorname*{argmin}_{w} \big\{\, h(w) \;:\; w \in S \,\big\}$$

where we have defined S := {w ∈ ℝ^d : w₁ ≥ ··· ≥ w_d ≥ 0} and

$$h(w) = \frac{1}{2}\,\|w - y\|_2^2 + \beta \max_{0 \le r \le d} \big\{ \alpha\, \|w_{1:r}\|_2 - r \big\}.$$

In order to solve this problem we employ the projected subgradient method, see e.g. [6]. It consists in applying two steps at each iteration. First, it advances along a negative subgradient of the current solution; second, it projects the resultant point onto the feasible set S. In fact, according to [6], it is sufficient to compute an approximate projection, a step which we describe in Appendix D. 
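A hedged numpy sketch of the projected subgradient method of Algorithm 1 (function names ours): in place of the approximate projection P̃_S of Appendix D, we use an exact projection onto S, computed by pool-adjacent-violators (decreasing isotonic regression) followed by clipping at zero.

```python
import numpy as np

def project_S(v):
    # Euclidean projection onto S = {w : w_1 >= ... >= w_d >= 0}: decreasing
    # isotonic regression by pool-adjacent-violators, then clipping at zero.
    blocks = []  # list of [sum, count] pools with non-increasing means
    for x in v:
        blocks.append([float(x), 1])
        # merge while the left pool's mean is smaller than the right pool's
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] < blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = np.concatenate([np.full(c, s / c) for s, c in blocks])
    return np.maximum(out, 0.0)

def h_val(w, y, alpha, beta):
    # objective h(w) = 0.5*||w - y||^2 + beta * max_r (alpha*||w_{1:r}|| - r)
    d = len(w)
    best = max(alpha * np.linalg.norm(w[:r]) - r for r in range(d + 1))
    return 0.5 * np.sum((w - y) ** 2) + beta * best

def prox_beta_omega_star(y, alpha, beta, iters=500):
    # Algorithm 1: subgradient step with tau = tau0/sqrt(t), then projection,
    # keeping the best iterate found so far.
    d = len(y)
    w = project_S(y)
    w_hat, h_hat = w.copy(), h_val(w, y, alpha, beta)
    for t in range(1, iters + 1):
        tau = 0.5 / np.sqrt(t)
        k = int(np.argmax([alpha * np.linalg.norm(w[:r]) - r for r in range(d + 1)]))
        g = w - y
        nrm = np.linalg.norm(w[:k])
        if k > 0 and nrm > 0:
            g[:k] = (1.0 + alpha * beta / nrm) * w[:k] - y[:k]
        w = project_S(w - tau * g)
        ht = h_val(w, y, alpha, beta)
        if ht < h_hat:
            w_hat, h_hat = w.copy(), ht
    return w_hat

y = np.array([3.0, 2.0, 0.5])
w = prox_beta_omega_star(y, alpha=1.0, beta=0.5)
assert np.all(np.diff(w) <= 1e-9) and np.all(w >= 0)   # output lies in S
```

Since the best iterate is retained, the returned point can never be worse than the plain projection P_S(y) it starts from.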
To compute a subgradient of h at w, we first find any integer k such that k ∈ argmax_{0≤r≤d} { α‖w_{1:r}‖₂ − r }. Then, we calculate a subgradient g of the function h at w by the formula

$$g_i = \begin{cases} \Big(1 + \dfrac{\alpha\beta}{\|w_{1:k}\|_2}\Big)\, w_i - y_i, & \text{if } i \le k, \\[4pt] w_i - y_i, & \text{otherwise.} \end{cases}$$

Now we have all the ingredients to apply the projected subgradient method, which is summarized in Algorithm 1. In our implementation we stop the algorithm when an update of ŵ has not been made for more than 10² iterations.

5 Experiments

We have conducted a set of experiments to assess whether there is any advantage in using the proposed regularizer over the tensor trace norm for tensor completion³. First, we have designed a synthetic experiment to evaluate the performance of both approaches under controlled conditions. Then, we have tried both methods on two real-data tensor completion problems. In all cases, we have used a validation procedure to tune the hyper-parameter γ, present in both approaches, among the values {10^j : j = −7, −6, ..., 1}. In our proposed approach there is one further hyper-parameter, α, to be specified. It should take the value of the Euclidean norm of the underlying tensor. Since this is unknown, we propose to use the estimate

$$\hat\alpha = \sqrt{\; \|w\|_2^2 + \big(\operatorname{mean}(w)^2 + \operatorname{var}(w)\big)\Big(\prod_{i=1}^{N} p_i - m\Big) \;},$$

where m is the number of known entries and w ∈ ℝ^m contains their values. 
This estimator assumes that each value in the tensor is sampled from N(mean(w), var(w)), where mean(w) and var(w) are the average and the variance of the elements in w.

³The code is available at http://romera-paredes.com/code/tensor-completion

Figure 1: Synthetic dataset: (Left) Root Mean Squared Error (RMSE) of tensor trace norm and the proposed regularizer, as a function of log σ². (Right) Running time execution (in seconds) for different sizes p of the tensor.

5.1 Synthetic Dataset

We have generated a 3-order tensor 𝒲⁰ ∈ ℝ^{40×20×10} by the following procedure. First we generated a tensor 𝒲 with ranks (12, 6, 3) using the Tucker decomposition (see e.g. [16])

$$\mathcal{W}_{i_1,i_2,i_3} = \sum_{j_1=1}^{12} \sum_{j_2=1}^{6} \sum_{j_3=1}^{3} C_{j_1,j_2,j_3}\, M^{(1)}_{i_1,j_1}\, M^{(2)}_{i_2,j_2}\, M^{(3)}_{i_3,j_3}, \qquad (i_1,i_2,i_3) \in [40]\times[20]\times[10],$$

where each entry of the Tucker decomposition components is sampled from the standard Gaussian distribution N(0, 1). We then created the ground truth tensor 𝒲⁰ by the equation

$$\mathcal{W}^0_{i_1,i_2,i_3} = \frac{\mathcal{W}_{i_1,i_2,i_3} - \operatorname{mean}(\mathcal{W})}{\sqrt{N}\,\operatorname{std}(\mathcal{W})} + \xi_{i_1,i_2,i_3}$$

where mean(𝒲) and std(𝒲) are the mean and standard deviation of the elements of 𝒲, N here denotes the total number of elements of 𝒲, and the ξ_{i₁,i₂,i₃} are i.i.d. Gaussian random variables with zero mean and variance σ². We have randomly sampled 10% of the elements of the tensor to compose the training set, 45% for the validation set, and the remaining 45% for the test set. After repeating this process 20 times, we report the average results in Figure 1 (Left). 
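The generation procedure above can be replicated in a few lines of numpy (a sketch: the rng seed, variable names, and the particular σ² value are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = (40, 20, 10), (12, 6, 3)

# Tucker decomposition: Gaussian core C and factor matrices M^(1), M^(2), M^(3)
C = rng.standard_normal(r)
M = [rng.standard_normal((p[n], r[n])) for n in range(3)]
W = np.einsum('abc,ia,jb,kc->ijk', C, M[0], M[1], M[2])

# each mode-n matricization has rank at most r_n, e.g. mode-1 rank <= 12
assert np.linalg.matrix_rank(W.reshape(40, -1)) <= 12

# centre, rescale by sqrt(#elements)*std, and add Gaussian noise of variance sigma^2
sigma2 = 1e-3
n_el = W.size
W0 = (W - W.mean()) / (np.sqrt(n_el) * W.std()) + np.sqrt(sigma2) * rng.standard_normal(p)

# 10% / 45% / 45% split of the entries into train / validation / test
perm = rng.permutation(n_el)
n_tr, n_val = int(0.10 * n_el), int(0.45 * n_el)
train_idx = perm[:n_tr]
val_idx = perm[n_tr:n_tr + n_val]
test_idx = perm[n_tr + n_val:]
```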
Having conducted a paired t-test for each value of σ², we conclude that the visible differences in the performances are highly significant, always obtaining p-values less than 0.01 for σ² ≤ 10⁻².

Furthermore, we have conducted an experiment to test the running time of both approaches. We have generated tensors 𝒲⁰ ∈ ℝ^{p×p×p} for different values of p ∈ {20, 40, ..., 200}, following the same procedure as outlined above. The results are reported in Figure 1 (Right). For low values of p, the ratio between the running time of our approach and that of the trace norm regularization method is quite high. For example, for the lowest value tried in this experiment, p = 20, this ratio is 22.661. However, as the volume of the tensor increases, the ratio quickly decreases. For example, for p = 200, the running time ratio is 1.9113. These outcomes are expected because when p is low, the most demanding routine in our method is the one described in Algorithm 1, where each iteration is of order O(p) and O(p²) in the best and worst case, respectively. However, as p increases the singular value decomposition routine, which is common to both methods, becomes the most demanding because it has a time complexity O(p³) [10]. Therefore, we can conclude that even though our approach is slower than the trace norm based method, this difference becomes much smaller as the size of the tensor increases.

5.2 School Dataset

The first real dataset we have tried is the Inner London Education Authority (ILEA) dataset. It is composed of examination marks ranging from 0 to 70, of 15362 students who are described by a set of attributes such as school and ethnic group. Most of these attributes are categorical, and therefore we can think of exam mark prediction as a tensor completion problem where each of the modes corresponds to a categorical attribute. 
In particular, we have used the following attributes: school (139), gender (2), VR-band (3), ethnic (11), and year (3), leading to a 5-order tensor 𝒲 ∈ ℝ^{139×2×3×11×3}.

Figure 2: Root Mean Squared Error (RMSE) of tensor trace norm and the proposed regularizer, as a function of the training set size m, for the ILEA dataset (Left) and the Ocean video (Right).

We have selected randomly 5% of the instances to make the test set and another 5% of the instances for the validation set. From the remaining instances, we have randomly chosen m of them for several values of m. This procedure has been repeated 20 times and the average performance is presented in Figure 2 (Left). There is a distinguishable improvement of our approach with respect to tensor trace norm regularization for values of m > 7000. To check whether this gap is significant, we have conducted a set of paired t-tests in this regime. In all these cases we obtained a p-value below 0.01.

5.3 Video Completion

In the second real-data experiment we have performed a video completion test. Any video can be treated as a 4-order tensor: "width" × "height" × "RGB" × "video length", so we can use tensor completion algorithms to rebuild a video from a few inputs, a procedure that can be useful for compression purposes. In our case, we have used the Ocean video, available at [17]. This video sequence can be treated as a tensor 𝒲 ∈ ℝ^{160×112×3×32}. 
We have randomly sampled m tensor elements as training data, 5% of them as validation data, and the remaining ones composed the test set. After repeating this procedure 10 times, we present the average results in Figure 2 (Right). The proposed approach is noticeably better than the tensor trace norm in this experiment. This outcome is strongly supported by the paired t-tests which we ran for each value of m, always obtaining p-values below 0.01; for the cases m > 5 × 10⁴, we obtained p-values below 10⁻⁶.

6 Conclusion

In this paper, we proposed a convex relaxation for the average of the rank of the matricizations of a tensor. We compared this relaxation to a commonly used convex relaxation in the context of tensor completion, which is based on the trace norm. We proved that this second relaxation is not tight and argued that the proposed convex regularizer may be advantageous. Our numerical experience indicates that our method consistently improves in terms of estimation error over tensor trace norm regularization, while being computationally comparable on the range of problems we considered. In the future it would be interesting to study methods to speed up the computation of the proximity operator of our regularizer and to investigate its utility in tensor learning problems beyond tensor completion, such as multilinear multitask learning [20].

Acknowledgements

We wish to thank Andreas Argyriou, Raphael Hauser, Charles Micchelli and Marco Signoretto for useful comments. A valuable contribution was made by one of the anonymous referees. Part of this work was supported by EPSRC Grants EP/H017178/1 and EP/H027203/1 and Royal Society International Joint Project 2012/R2.

References

[1] A. Argyriou, R. Foygel and N. Srebro. Sparse prediction with the k-support norm. Advances in Neural Information Processing Systems 25, pages 1466–1474, 2012.

[2] A. Argyriou, C.A. Micchelli, M. Pontil, L. 
Shen and Y. Xu. Efficient first order methods for linear composite regularizers. arXiv:1104.1436, 2011.

[3] R. Bhatia. Matrix Analysis. Springer Verlag, 1997.

[4] D.P. Bertsekas and J.N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.

[5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[6] S. Boyd, L. Xiao, and A. Mutapcic. Subgradient methods, Stanford University, 2003.

[7] P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering (H. H. Bauschke et al., Eds), pages 185–212, Springer, 2011.

[8] M. Fazel, H. Hindi, and S. Boyd. A rank minimization heuristic with application to minimum order system approximation. Proc. American Control Conference, Vol. 6, pages 4734–4739, 2001.

[9] S. Gandy, B. Recht, and I. Yamada. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27(2), 2011.

[10] G. H. Golub and C. F. Van Loan. Matrix Computations. 3rd Edition. Johns Hopkins University Press, 1996.

[11] Z. Harchaoui, M. Douze, M. Paulin, M. Dudik, and J. Malick. Large-scale image classification with trace-norm regularization. IEEE Conference on Computer Vision & Pattern Recognition (CVPR), pages 3386–3393, 2012.

[12] J-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms, Part I. Springer, 1996.

[13] J-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms, Part II. Springer, 1993.

[14] R.A. Horn and C.R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 2005.

[15] A. Karatzoglou, X. Amatriain, L. Baltrunas, and N. Oliver. 
Multiverse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering. Proc. 4th ACM Conference on Recommender Systems, pages 79–86, 2010.

[16] T.G. Kolda and B.W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[17] J. Liu, P. Musialski, P. Wonka, and J. Ye. Tensor completion for estimating missing values in visual data. Proc. 12th International Conference on Computer Vision (ICCV), pages 2114–2121, 2009.

[18] Y. Nesterov. Gradient methods for minimizing composite objective functions. ECORE Discussion Paper, 2007/96, 2007.

[19] B. Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413–3430, 2011.

[20] B. Romera-Paredes, H. Aung, N. Bianchi-Berthouze and M. Pontil. Multilinear multitask learning. Proc. 30th International Conference on Machine Learning (ICML), pages 1444–1452, 2013.

[21] N. Z. Shor. Minimization Methods for Non-differentiable Functions. Springer, 1985.

[22] M. Signoretto, Q. Tran Dinh, L. De Lathauwer, and J.A.K. Suykens. Learning with tensors: a framework based on convex optimization and spectral regularization. Machine Learning, to appear.

[23] M. Signoretto, R. Van de Plas, B. De Moor, and J.A.K. Suykens. Tensor versus matrix completion: a comparison with application to spectral data. IEEE Signal Processing Letters, 18(7):403–406, 2011.

[24] N. Srebro, J. Rennie and T. Jaakkola. Maximum margin matrix factorization. Advances in Neural Information Processing Systems (NIPS) 17, pages 1329–1336, 2005.

[25] R. Tomioka, K. Hayashi, and H. Kashima. Estimation of low-rank tensors via convex optimization. arXiv:1010.0789, 2010.

[26] R. Tomioka and T. Suzuki. Convex tensor decomposition via structured Schatten norm regularization. arXiv:1303.6370, 2013.

[27] R. Tomioka, T. Suzuki, K. Hayashi, and H. Kashima. 
Statistical performance of convex tensor decomposition. Advances in Neural Information Processing Systems (NIPS) 24, pages 972–980, 2011.", "award": [], "sourceid": 1357, "authors": [{"given_name": "Bernardino", "family_name": "Romera-Paredes", "institution": "UCL"}, {"given_name": "Massimiliano", "family_name": "Pontil", "institution": "UCL"}]}