{"title": "Low-Rank Regression with Tensor Responses", "book": "Advances in Neural Information Processing Systems", "page_first": 1867, "page_last": 1875, "abstract": "This paper proposes an efficient algorithm (HOLRR) to handle regression tasks where the outputs have a tensor structure. We formulate the regression problem as the minimization  of a least square criterion under a multilinear rank constraint, a difficult  non convex problem.  HOLRR computes efficiently an approximate solution of this problem, with solid theoretical guarantees. A kernel extension is also presented. Experiments on synthetic and real data show that HOLRR computes accurate solutions while being computationally very competitive.", "full_text": "Low-Rank Regression with Tensor Responses\n\nGuillaume Rabusseau and Hachem Kadri\nAix Marseille Univ, CNRS, LIF, Marseille, France\n\n{firstname.lastname}@lif.univ-mrs.fr\n\nAbstract\n\nThis paper proposes an e\ufb03cient algorithm (HOLRR) to handle regression\ntasks where the outputs have a tensor structure. We formulate the regression\nproblem as the minimization of a least square criterion under a multilinear\nrank constraint, a di\ufb03cult non convex problem. HOLRR computes e\ufb03ciently\nan approximate solution of this problem, with solid theoretical guarantees.\nA kernel extension is also presented. Experiments on synthetic and real data\nshow that HOLRR computes accurate solutions while being computationally\nvery competitive.\n\n1 Introduction\n\nRecently, there has been an increasing interest in adapting machine learning and statistical\nmethods to tensors. Data with a natural tensor structure are encountered in many scienti\ufb01c\nareas including neuroimaging [30], signal processing [4], spatio-temporal analysis [2] and\ncomputer vision [16]. Extending multivariate regression methods to tensors is one of the\nchallenging task in this area. 
Most existing works extend linear models to the multilinear\nsetting and focus on the tensor structure of the input data (e.g. [24]). Little has been done\nhowever to investigate learning methods for tensor-structured output data.\nWe consider a multilinear regression task where outputs are tensors; such a setting can occur\nin the context of e.g. spatio-temporal forecasting or image reconstruction. In order to leverage\nthe tensor structure of the output data, we formulate the problem as the minimization of\na least squares criterion subject to a multilinear rank constraint on the regression tensor.\nThe rank constraint enforces the model to capture low-rank structure in the outputs and to\nexplain dependencies between inputs and outputs in a low-dimensional multilinear subspace.\nUnlike previous work (e.g. [22, 24, 27]) we do not rely on a convex relaxation of this di\ufb03cult\nnon-convex optimization problem. Instead we show that it is equivalent to a multilinear sub-\nspace identi\ufb01cation problem for which we design a fast and e\ufb03cient approximation algorithm\n(HOLRR), along with a kernelized version which extends our approach to the nonlinear\nsetting (Section 3). Our theoretical analysis shows that HOLRR provides good approximation\nguarantees. Furthermore, we derive a generalization bound for the class of tensor-valued\nregression functions with bounded multilinear rank (Section 3.3). Experiments on synthetic\nand real data are presented to validate our theoretical \ufb01ndings and show that HOLRR\ncomputes accurate solutions while being computationally very competitive (Section 4).\nProofs of all results stated in the paper can be found in supplementary material A.\n\nRelated work. The problem we consider is a generalization of the reduced-rank regression\nproblem (Section 2.2) to tensor structured responses. 
Reduced-rank regression has its roots\nin statistics [10] but it has also been investigated by the neural network community [3];\nnon-parametric extensions of this method have been proposed in [18] and [6]. In the context\nof multi-task learning, a linear model using a tensor-rank penalization of a least squares\ncriterion has been proposed in [22] to take into account the multi-modal interactions between\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\ftasks. They propose an approach relying on a convex relaxation of the multlinear rank\nconstraint using the trace norms of the matricizations, and a non-convex approach based\non alternating minimization. Nonparametric low-rank estimation strategies in reproducing\nkernel Hilbert spaces (RKHS) based on a multilinear spectral regularization have been\nproposed in [23, 24]. Their method is based on estimating the regression function in the\ntensor product of RKHSs and is naturally adapted for tensor covariates. A greedy algorithm\nto solve a low-rank tensor learning problem has been proposed in [2] in the context of\nmultivariate spatio-temporal data analysis. The linear model they assume is di\ufb00erent from\nthe one we propose and is speci\ufb01cally designed for spatio-temporal data. A higher-order\nextension of partial least squares (HOPLS) has been proposed in [28] along with a kernel\nextension in [29]. While HOPLS has the advantage of taking the tensor structure of the\ninput into account, the questions of approximation and generalization guarantees were not\naddressed in [28]. The generalization bound we provide is inspired from works on matrix\nand tensor completion [25, 19].\n\n2 Preliminaries\n\nWe begin by introducing some notations. For any integer k we use [k] to denote the set of\nintegers from 1 to k. We use lower case bold letters for vectors (e.g. v \u2208 Rd1), upper case bold\nletters for matrices (e.g. 
M \u2208 Rd1\u00d7d2) and bold calligraphic letters for higher order tensors\n(e.g. T \u2208 Rd1\u00d7d2\u00d7d3). The identity matrix will be written as I. The ith row (resp. column)\nof a matrix M will be denoted by Mi,: (resp. M:,i). This notation is extended to slices of a\ntensor in the straightforward way. If v \u2208 Rd1 and v\u2032 \u2208 Rd2, we use v \u2297 v\u2032 \u2208 Rd1\u00b7d2 to denote\nthe Kronecker product between vectors, and its straightforward extension to matrices and\ntensors. Given a matrix M \u2208 Rd1\u00d7d2, we use vec(M) \u2208 Rd1\u00b7d2 to denote the column vector\nobtained by concatenating the columns of M.\n\n2.1 Tensors and Tucker Decomposition\n\nWe first recall basic definitions of tensor algebra; more details can be found in [13]. A tensor\nT \u2208 Rd1\u00d7\u00b7\u00b7\u00b7\u00d7dp can simply be seen as a multidimensional array (T i1,\u00b7\u00b7\u00b7 ,ip : in \u2208 [dn], n \u2208 [p]).\nThe mode-n fibers of T are the vectors obtained by fixing all indices except the nth one,\ne.g. T :,i2,\u00b7\u00b7\u00b7 ,ip \u2208 Rd1. The nth mode matricization of T is the matrix having the mode-n\nfibers of T for columns and is denoted by T(n) \u2208 Rdn\u00d7d1\u00b7\u00b7\u00b7dn\u22121dn+1\u00b7\u00b7\u00b7dp. The vectorization of\na tensor is defined by vec(T ) = vec(T(1)). The inner product between two tensors S and T\n(of the same size) is defined by $\\langle S, T \\rangle = \\langle \\mathrm{vec}(S), \\mathrm{vec}(T) \\rangle$ and the Frobenius norm is defined\nby $\\|T\\|_F^2 = \\langle T, T \\rangle$. In the following T always denotes a tensor of size d1 \u00d7 \u00b7\u00b7\u00b7 \u00d7 dp.\nThe mode-n matrix product of the tensor T and a matrix X \u2208 Rm\u00d7dn is a tensor denoted\nby T \u00d7n X. It is of size d1 \u00d7 \u00b7\u00b7\u00b7 \u00d7 dn\u22121 \u00d7 m \u00d7 dn+1 \u00d7 \u00b7\u00b7\u00b7 \u00d7 dp and is defined by the\nrelation Y = T \u00d7n X \u21d4 Y(n) = XT(n). 
The mode-n vector product of the tensor T and\na vector v \u2208 Rdn is a tensor de\ufb01ned by T \u2022n v = T \u00d7n v> \u2208 Rd1\u00d7\u00b7\u00b7\u00b7\u00d7dn\u22121\u00d7dn+1\u00d7\u00b7\u00b7\u00b7\u00d7dp.\nThe mode-n rank of T is the dimension of the space spanned by its mode-n \ufb01bers, that is\nrankn(T ) = rank(T(n)). The multilinear rank of T , denoted by rank(T ), is the tuple of\nmode-n ranks of T : rank(T ) = (R1,\u00b7\u00b7\u00b7 , Rp) where Rn = rankn(T ) for n \u2208 [p]. We will\nwrite rank(T ) \u2264 (S1,\u00b7\u00b7\u00b7 , Sp) whenever rank1(T ) \u2264 S1, rank2(T ) \u2264 S2,\u00b7\u00b7\u00b7 , rankp(T ) \u2264 Sp.\nThe Tucker decomposition decomposes a tensor T into a core tensor G transformed by\nan orthogonal matrix along each mode: (i) T = G \u00d71 U1 \u00d72 U2 \u00d73 \u00b7\u00b7\u00b7 \u00d7p Up, where\nG \u2208 RR1\u00d7R2\u00d7\u00b7\u00b7\u00b7\u00d7Rp, Ui \u2208 Rdi\u00d7Ri and U>\ni Ui = I for all i \u2208 [p]. The number of parameters\ninvolved in a Tucker decomposition can be considerably smaller than d1d2 \u00b7\u00b7\u00b7 dp. We have\nthe following identities when matricizing and vectorizing a Tucker decomposition: T(n) =\nUnG(n)(Up \u2297\u00b7\u00b7\u00b7\u2297Un+1 \u2297Un\u22121 \u2297\u00b7\u00b7\u00b7\u2297U1)> and vec(T ) = (Up \u2297Up\u22121 \u2297\u00b7\u00b7\u00b7\u2297U1)vec(G).\nIt is well known that T admits the Tucker decomposition (i) i\ufb00 rank(T ) \u2264 (R1,\u00b7\u00b7\u00b7 , Rp)\n(see e.g. [13]). Finding an exact Tucker decomposition can be done using the higher-order\nSVD algorithm (HOSVD) introduced by [5]. 
Although finding the best approximation of\nmultilinear rank (R1,\u00b7\u00b7\u00b7 , Rp) of a tensor T is a difficult problem, the truncated HOSVD\nalgorithm provides good approximation guarantees and often performs well in practice.\n\n2.2 Low-Rank Regression\nMultivariate regression is the task of recovering a function f : Rd \u2192 Rp from a set of input-\noutput pairs $\\{(x^{(n)}, y^{(n)})\\}_{n=1}^N$ sampled from the model with an additive noise y = f(x) + \u03b5,\nwhere \u03b5 is the error term. To solve this problem, the ordinary least squares (OLS) approach\nassumes a linear dependence between input and output data and boils down to finding\na matrix W \u2208 Rd\u00d7p that minimizes the squared error $\\|XW - Y\\|_F^2$, where X \u2208 RN\u00d7d\nand Y \u2208 RN\u00d7p denote the input and the output matrices. To prevent overfitting and to\navoid numerical instabilities a ridge regularization term (i.e. $\\gamma\\|W\\|_F^2$) is often added to the\nobjective function, leading to the regularized least squares (RLS) method. It is easy to see\nthat the OLS/RLS approach in the multivariate setting is equivalent to performing p linear\nregressions for each scalar output $\\{y_j\\}_{j=1}^p$ independently. Thus it performs poorly when the\noutputs are correlated and the true dimension of the response is less than p. Low-rank\nregression (or reduced-rank regression) addresses this issue by solving the rank penalized\nproblem $\\min_{W \\in \\mathbb{R}^{d \\times p}} \\|XW - Y\\|_F^2 + \\gamma\\|W\\|_F^2$ s.t. rank(W) \u2264 R for a given integer R. The\nrank constraint was first proposed in [1], whereas the term reduced-rank regression was\nintroduced in [10]. Adding a ridge regularization was proposed in [18]. In the rest of the\npaper we will refer to this approach as low-rank regression (LRR). 
For more description\nand discussion of reduced-rank regression, we refer the reader to the books [21] and [11].\n\n3 Low-Rank Regression for Tensor-Valued Functions\n\n3.1 Problem Formulation\nWe consider a multivariate regression task where the input is a vector and the response has\na tensor structure. Let f : Rd0 \u2192 Rd1\u00d7d2\u00d7\u00b7\u00b7\u00b7\u00d7dp be the function we want to learn from a\nsample of input-output data $\\{(x^{(n)}, Y^{(n)})\\}_{n=1}^N$ drawn from the model Y = f(x)+E, where E\nis an error term. We assume that f is linear, that is f(x) = W \u20221 x for some regression tensor\nW \u2208 Rd0\u00d7d1\u00d7\u00b7\u00b7\u00b7\u00d7dp. The vectorization of this relation leads to $\\mathrm{vec}(f(x)) = W_{(1)}^\\top x$, showing\nthat this model is equivalent to the standard multivariate linear model. One way to tackle\nthis regression task would be to vectorize each output sample and to perform a standard\nlow-rank regression on the data $\\{(x^{(n)}, \\mathrm{vec}(Y^{(n)}))\\}_{n=1}^N \\subset \\mathbb{R}^{d_0} \\times \\mathbb{R}^{d_1 \\cdots d_p}$. A major drawback\nof this approach is that the tensor structure of the output is lost in the vectorization step.\nThe low-rank model tries to capture linear dependencies between components of the output\nbut it ignores higher level dependencies that could be present in a tensor-structured output.\nFor illustration, if the output is a matrix encoding the samples of d1 continuous\nvariables at d2 different time steps, one could expect structural relations between the d1 time\nseries, e.g. linear dependencies between the rows of the output matrix.\n\nLow-rank regression for tensor responses. To overcome the limitation described above\nwe propose an extension of the low-rank regression method for tensor-structured responses\nby enforcing low multilinear rank of the regression tensor W. 
Let $\\{(x^{(n)}, Y^{(n)})\\}_{n=1}^N \\subset \\mathbb{R}^{d_0} \\times \\mathbb{R}^{d_1 \\times d_2 \\times \\cdots \\times d_p}$ be a training sample of input/output data drawn from the model\nY = W \u20221 x + E where W is assumed of low multilinear rank. Considering the framework\nof empirical risk minimization, we want to find a low-rank regression tensor W minimizing\nthe loss on the training data. To avoid numerical instabilities and to prevent overfitting\nwe add a ridge regularization to the objective function, leading to the minimization of\n$\\sum_{n=1}^N \\ell(W \\bullet_1 x^{(n)}, Y^{(n)}) + \\gamma\\|W\\|_F^2$ w.r.t. the regression tensor W subject to the constraint\nrank(W) \u2264 (R0, R1,\u00b7\u00b7\u00b7 , Rp) for some given integers R0, R1,\u00b7\u00b7\u00b7 , Rp and where \u2113 is a loss\nfunction. In this paper, we consider the squared error loss between tensors defined by\n$\\mathcal{L}(T, \\hat{T}) = \\|T - \\hat{T}\\|_F^2$. Using this loss we can rewrite the minimization problem as\n\n$$\\min_{W \\in \\mathbb{R}^{d_0 \\times d_1 \\times \\cdots \\times d_p}} \\|W \\times_1 X - Y\\|_F^2 + \\gamma\\|W\\|_F^2 \\quad \\text{s.t.} \\quad \\mathrm{rank}(W) \\le (R_0, R_1, \\cdots, R_p), \\qquad (1)$$\n\nwhere the input matrix X \u2208 RN\u00d7d0 and the output tensor Y \u2208 RN\u00d7d1\u00d7\u00b7\u00b7\u00b7\u00d7dp are defined by\n$X_{n,:} = (x^{(n)})^\\top$, $Y_{n,:,\\cdots,:} = Y^{(n)}$ for n = 1,\u00b7\u00b7\u00b7 , N (Y is the tensor obtained by stacking the\noutput tensors along the first mode).\n\nFigure 1: Image reconstruction from noisy measurements: Y = W \u20221 x + E where W is a\ncolor image (RGB). Each image is labeled with the algorithm and the rank parameter.\n\nLow-rank regression function. Let W\u2217 be a solution of problem (1). It follows from\nthe multilinear rank constraint that W\u2217 = G \u00d71 U0 \u00d72 \u00b7\u00b7\u00b7 \u00d7p+1 Up for some core tensor\nG \u2208 RR0\u00d7\u00b7\u00b7\u00b7\u00d7Rp and orthogonal matrices Ui \u2208 Rdi\u00d7Ri for 0 \u2264 i \u2264 p. 
The regression function\nf\u2217 : x 7\u2192 W\u2217 \u20221 x can thus be written as f\u2217 : x 7\u2192 G \u00d71 x>U0 \u00d72 \u00b7\u00b7\u00b7 \u00d7p+1 Up.\nThis implies several interesting properties. First, for any x \u2208 Rd0 we have f\u2217(x) = T x \u00d71\nU1 \u00d72 \u00b7\u00b7\u00b7 \u00d7p Up with T x = G \u20221 U>\n0 x, which implies rank(f\u2217(x)) \u2264 (R1,\u00b7\u00b7\u00b7 , Rp), that is\nthe image of f\u2217 is a set of tensors with low multilinear rank. Second, the relation between\nx and Y = f\u2217(x) is explained in a low dimensional subspace of size R0 \u00d7 R1 \u00d7 \u00b7\u00b7\u00b7 \u00d7 Rp.\nIndeed one can decompose the mapping f\u2217 into the following steps: (i) project x in RR0 as\n0 x, (ii) perform a low-dimensional mapping \u00afY = G \u20221 \u00afx, (iii) project back into the\n\u00afx = U>\noutput space to get Y = \u00afY \u00d71 U1 \u00d72 \u00b7\u00b7\u00b7 \u00d7p Up.\nTo give an illustrative intuition on the di\ufb00erences between matrix and multilinear rank\nregularization we present a simple experiment1 in Figure 1. We generate data from the model\nY = W \u20221 x + E where the tensor W \u2208 R3\u00d7m\u00d7n is a color image of size m \u00d7 n encoded\nwith three color channels RGB. The components of both x and E are drawn from N (0, 1).\nThis experiment allows us to visualize the tensors returned by RLS, LRR and our method\nHOLRR that enforces low multilinear rank of the regression function. First, this shows\nthat the function learned by vectorizing the outputs and performing LRR does not enforce\nany low-rank structure. This is well illustrated in (Figure 1) where the regression tensors\nreturned by HOLRR-(3,1,1) are clearly of low-rank while the ones returned by LRR-1 are\nnot. This also shows that taking into account the low-rank structure of the model allows\none to better eliminate the noise when the true regression tensor is of low rank (Figure 1,\nleft). 
However, if the ground truth model does not have a low-rank structure, enforcing low\nmultilinear rank leads to underfitting for low values of the rank parameter (Figure 1, right).\n\n3.2 Higher-Order Low-Rank Regression and its Kernel Extension\nWe now propose an efficient algorithm to tackle problem (1). We first show that the ridge\nregularization term in (1) can be incorporated in the data fitting term. Let $\\tilde{X} \\in \\mathbb{R}^{(N+d_0) \\times d_0}$\nand $\\tilde{Y} \\in \\mathbb{R}^{(N+d_0) \\times d_1 \\times \\cdots \\times d_p}$ be defined by $\\tilde{X}^\\top = (X^\\top \\mid \\sqrt{\\gamma} I)$ and $\\tilde{Y}_{(1)}^\\top = (Y_{(1)}^\\top \\mid 0)$. It is\neasy to check that the objective function in (1) is equal to $\\|W \\times_1 \\tilde{X} - \\tilde{Y}\\|_F^2$. Minimization\nproblem (1) is then equivalent to\n\n$$\\min_{G \\in \\mathbb{R}^{R_0 \\times R_1 \\times \\cdots \\times R_p},\\ U_i \\in \\mathbb{R}^{d_i \\times R_i} \\text{ for } 0 \\le i \\le p} \\|W \\times_1 \\tilde{X} - \\tilde{Y}\\|_F^2 \\quad \\text{s.t.} \\quad W = G \\times_1 U_0 \\cdots \\times_{p+1} U_p,\\ U_i^\\top U_i = I \\text{ for all } i. \\qquad (2)$$\n\nWe now show that this minimization problem can be reduced to finding p + 1 projection\nmatrices onto subspaces of dimension R0, R1,\u00b7\u00b7\u00b7 , Rp. We start by showing that the core\ntensor G solution of (2) is determined by the factor matrices U0,\u00b7\u00b7\u00b7 , Up.\n\n1 An extended version of this experiment is presented in supplementary material B.\n\nTheorem 1. 
For given orthogonal matrices U0,\u00b7\u00b7\u00b7 , Up the tensor G that minimizes (2) is\ngiven by $G = \\tilde{Y} \\times_1 (U_0^\\top \\tilde{X}^\\top \\tilde{X} U_0)^{-1} U_0^\\top \\tilde{X}^\\top \\times_2 U_1^\\top \\times_3 \\cdots \\times_{p+1} U_p^\\top$.\nIt follows from Theorem 1 that problem (1) can be written as\n\n$$\\min_{U_i \\in \\mathbb{R}^{d_i \\times R_i},\\ 0 \\le i \\le p} \\|\\tilde{Y} \\times_1 \\Pi_0 \\times_2 \\cdots \\times_{p+1} \\Pi_p - \\tilde{Y}\\|_F^2 \\qquad (3)$$\n\nsubject to $U_i^\\top U_i = I$ for all i, $\\Pi_0 = \\tilde{X} U_0 (U_0^\\top \\tilde{X}^\\top \\tilde{X} U_0)^{-1} U_0^\\top \\tilde{X}^\\top$, $\\Pi_i = U_i U_i^\\top$ for i \u2265 1.\nNote that \u03a00 is the orthogonal projection onto the space spanned by the columns of $\\tilde{X} U_0$\nand \u03a0i is the orthogonal projection onto the column space of Ui for i \u2265 1. Hence solving\nproblem (1) is equivalent to finding p + 1 low-dimensional subspaces U0,\u00b7\u00b7\u00b7 , Up such that\nprojecting $\\tilde{Y}$ onto the spaces $\\tilde{X} U_0$, U1,\u00b7\u00b7\u00b7 , Up along the corresponding modes yields a tensor close to $\\tilde{Y}$.\nHOLRR algorithm. Since solving problem (3) for the p + 1 projections simultaneously is\na difficult non-convex optimization problem we propose to solve it independently for each\nprojection. This approach has the benefits of both being computationally efficient and providing\ngood theoretical approximation guarantees (see Theorem 2). The following proposition gives\nthe analytic solutions of (3) when each projection is considered independently.\nProposition 1. 
For 0 \u2264 i \u2264 p, using the definition of \u03a0i in (3), the optimal solution\nof $\\min_{U_i \\in \\mathbb{R}^{d_i \\times R_i}} \\|\\tilde{Y} \\times_{i+1} \\Pi_i - \\tilde{Y}\\|_F^2$ s.t. $U_i^\\top U_i = I$ is given by the top Ri eigenvectors of\n$(\\tilde{X}^\\top \\tilde{X})^{-1} \\tilde{X}^\\top \\tilde{Y}_{(1)} \\tilde{Y}_{(1)}^\\top \\tilde{X}$ if i = 0 and $\\tilde{Y}_{(i+1)} \\tilde{Y}_{(i+1)}^\\top$ otherwise.\nThe results from Theorem 1 and Proposition 1 can be rewritten in terms of the original input\nmatrix X and output tensor Y using the identities $\\tilde{X}^\\top \\tilde{X} = X^\\top X + \\gamma I$, $\\tilde{Y} \\times_1 \\tilde{X}^\\top = Y \\times_1 X^\\top$\nand $\\tilde{Y}_{(i)} \\tilde{Y}_{(i)}^\\top = Y_{(i)} Y_{(i)}^\\top$ for any i \u2265 2. The overall Higher-Order Low-Rank Regression\nprocedure (HOLRR) is summarized in Algorithm 1. Note that the Tucker decomposition\nof the solution returned by HOLRR could be a good initialization point for an Alternating\nLeast Squares method. However, studying the theoretical and experimental properties of this\napproach is beyond the scope of this paper and is left for future work.\n\nHOLRR Kernel Extension. We now design a kernelized version of the HOLRR algorithm\nby analyzing how it would be instantiated in a feature space. We show that all the steps\ninvolved can be performed using the Gram matrix of the input data without having to\nexplicitly compute the feature map. Let \u03c6 : Rd0 \u2192 RL be a feature map and let \u03a6 \u2208 RN\u00d7L\nbe the matrix with rows $\\phi(x^{(n)})^\\top$ for n \u2208 [N]. The higher-order low-rank regression problem\nin the feature space boils down to the minimization problem\n\n$$\\min_{W \\in \\mathbb{R}^{L \\times d_1 \\times \\cdots \\times d_p}} \\|W \\times_1 \\Phi - Y\\|_F^2 + \\gamma\\|W\\|_F^2 \\quad \\text{s.t.} \\quad \\mathrm{rank}(W) \\le (R_0, R_1, \\cdots, R_p). \\qquad (4)$$\n\nFollowing the HOLRR algorithm, one needs to compute the top R0 eigenvectors of the L \u00d7 L\nmatrix $(\\Phi^\\top \\Phi + \\gamma I)^{-1} \\Phi^\\top Y_{(1)} Y_{(1)}^\\top \\Phi$. 
The following proposition shows that this can be done\nusing the Gram matrix $K = \\Phi \\Phi^\\top$ without explicitly knowing the feature map \u03c6.\nProposition 2. If \u03b1 \u2208 RN is an eigenvector with eigenvalue \u03bb of the matrix\n$(K + \\gamma I)^{-1} Y_{(1)} Y_{(1)}^\\top K$, then $v = \\Phi^\\top \\alpha \\in \\mathbb{R}^L$ is an eigenvector with eigenvalue \u03bb of the\nmatrix $(\\Phi^\\top \\Phi + \\gamma I)^{-1} \\Phi^\\top Y_{(1)} Y_{(1)}^\\top \\Phi$.\nLet A be the top R0 eigenvectors of the matrix $(K + \\gamma I)^{-1} Y_{(1)} Y_{(1)}^\\top K$. When working with\nthe feature map \u03c6, it follows from the previous proposition that line 1 in Algorithm 1 is\nequivalent to choosing $U_0 = \\Phi^\\top A \\in \\mathbb{R}^{L \\times R_0}$, while the updates in line 3 stay the same.\nThe regression tensor W \u2208 RL\u00d7d1\u00d7\u00b7\u00b7\u00b7\u00d7dp returned by this algorithm is then equal to\n$W = Y \\times_1 P \\times_2 U_1 U_1^\\top \\times_3 \\cdots \\times_{p+1} U_p U_p^\\top$, where $P = \\Phi^\\top A (A^\\top \\Phi (\\Phi^\\top \\Phi + \\gamma I) \\Phi^\\top A)^{-1} A^\\top \\Phi \\Phi^\\top$.\nIt is easy to check that P can be rewritten as $P = \\Phi^\\top A (A^\\top K (K + \\gamma I) A)^{-1} A^\\top K$.\n\nSuppose now that the feature map \u03c6 is induced by a kernel k : Rd0 \u00d7 Rd0 \u2192 R. 
The\nprediction for an input vector x is then given by W \u20221 x = C \u20221 kx where the nth component\nof kx \u2208 RN is $\\langle \\phi(x^{(n)}), \\phi(x) \\rangle = k(x^{(n)}, x)$ and the tensor C \u2208 RN\u00d7d1\u00d7\u00b7\u00b7\u00b7\u00d7dp is defined by\n$C = G \\times_1 A \\times_2 U_1 \\times_3 \\cdots \\times_{p+1} U_p$, with $G = Y \\times_1 (A^\\top K (K + \\gamma I) A)^{-1} A^\\top K \\times_2 U_1^\\top \\times_3 \\cdots \\times_{p+1} U_p^\\top$.\nNote that C has multilinear rank (R0,\u00b7\u00b7\u00b7 , Rp), hence the low multilinear rank constraint on\nW in the feature space translates into the low rank structure of the coefficient tensor C.\nLet H be the reproducing kernel Hilbert space associated with the kernel k. 
The overall procedure\nfor kernelized HOLRR is summarized in Algorithm 2. This algorithm returns the tensor\nC \u2208 RN\u00d7d1\u00d7\u00b7\u00b7\u00b7\u00d7dp defining the regression function $f : x \\mapsto C \\bullet_1 k_x = \\sum_{n=1}^N k(x, x^{(n)}) C^{(n)}$,\nwhere $C^{(n)} = C_{n,:,\\cdots,:} \\in \\mathbb{R}^{d_1 \\times \\cdots \\times d_p}$.\n\nAlgorithm 1 HOLRR\nInput: X \u2208 RN\u00d7d0, Y \u2208 RN\u00d7d1\u00d7\u00b7\u00b7\u00b7\u00d7dp, rank (R0, R1,\u00b7\u00b7\u00b7 , Rp) and regularization parameter \u03b3.\n1: U0 \u2190 top R0 eigenvectors of $(X^\\top X + \\gamma I)^{-1} X^\\top Y_{(1)} Y_{(1)}^\\top X$\n2: for i = 1 to p do\n3:   Ui \u2190 top Ri eigenvectors of $Y_{(i+1)} Y_{(i+1)}^\\top$\n4: end for\n5: M \u2190 $(U_0^\\top (X^\\top X + \\gamma I) U_0)^{-1} U_0^\\top X^\\top$\n6: G \u2190 $Y \\times_1 M \\times_2 U_1^\\top \\times_3 \\cdots \\times_{p+1} U_p^\\top$\n7: return $G \\times_1 U_0 \\times_2 \\cdots \\times_{p+1} U_p$\n\nAlgorithm 2 Kernelized HOLRR\nInput: Gram matrix K \u2208 RN\u00d7N, Y \u2208 RN\u00d7d1\u00d7\u00b7\u00b7\u00b7\u00d7dp, rank (R0, R1,\u00b7\u00b7\u00b7 , Rp) and regularization parameter \u03b3.\n1: A \u2190 top R0 eigenvectors of $(K + \\gamma I)^{-1} Y_{(1)} Y_{(1)}^\\top K$\n2: for i = 1 to p do\n3:   Ui \u2190 top Ri eigenvectors of $Y_{(i+1)} Y_{(i+1)}^\\top$\n4: end for\n5: M \u2190 $(A^\\top K (K + \\gamma I) A)^{-1} A^\\top K$\n6: G \u2190 $Y \\times_1 M \\times_2 U_1^\\top \\times_3 \\cdots \\times_{p+1} U_p^\\top$\n7: return $C = G \\times_1 A \\times_2 U_1 \\times_3 \\cdots \\times_{p+1} U_p$\n\n3.3 Theoretical Analysis\nComplexity analysis. HOLRR is a polynomial time algorithm; more precisely, it has a\ntime complexity in $O((d_0)^3 + N((d_0)^2 + d_0 d_1 \\cdots d_p) + \\max_{i \\ge 0} R_i (d_i)^2 + N d_1 \\cdots d_p \\max_{i \\ge 1} d_i)$.\nIn comparison, LRR has a time complexity in $O((d_0)^3 + N((d_0)^2 + d_0 d_1 \\cdots d_p) + (N + R)(d_1 \\cdots d_p)^2)$.\nSince the complexity of HOLRR only has a linear dependence on the\nproduct of the output dimensions instead of a quadratic one for LRR, we can conclude\nthat HOLRR will be more efficient than LRR when the output dimensions d1,\u00b7\u00b7\u00b7 , dp are\nlarge. It is worth mentioning that the method proposed in [22] to solve a convex relaxation\nof problem (2) is an iterative algorithm that needs to compute SVDs of matrices of size\ndi \u00d7 d1 \u00b7\u00b7\u00b7 di\u22121di+1 \u00b7\u00b7\u00b7 dp for each 0 \u2264 i \u2264 p at each iteration; it is thus computationally more\nexpensive than HOLRR. Moreover, since HOLRR only relies on simple linear algebra tools,\nreadily available methods could be used to further improve the speed of the algorithm, e.g.\nrandomized-SVD [8] and random feature approximation of the kernel function [12, 20].\n\nApproximation guarantees. It is easy to check that problem (1) is NP-hard since it\ngeneralizes the problem of fitting a Tucker decomposition [9]. The following theorem shows\nthat HOLRR is a (p + 1)-approximation algorithm for this problem. This result generalizes\nthe approximation guarantees provided by the truncated HOSVD algorithm for the problem\nof finding the best low multilinear rank approximation of an arbitrary tensor.\nTheorem 2. 
Let W\u2217 be a solution of problem (1) and let W be the regression tensor\nreturned by Algorithm 1. If L : Rd0\u00d7\u00b7\u00b7\u00b7\u00d7dp \u2192 R denotes the objective function of (1) w.r.t.\nW then L(W) \u2264 (p + 1)L(W\u2217).\nGeneralization Bound. The following theorem gives an upper bound on the excess-risk\nfor the function class $\\mathcal{F} = \\{x \\mapsto W \\bullet_1 x : \\mathrm{rank}(W) \\le (R_0, \\cdots, R_p)\\}$ of tensor-valued\nregression functions with bounded multilinear rank. Recall that the expected loss of a\nhypothesis h \u2208 F w.r.t. the target function f\u2217 is defined by $R(h) = \\mathbb{E}_x[\\mathcal{L}(h(x), f^*(x))]$ and\nits empirical loss by $\\hat{R}(h) = \\frac{1}{N} \\sum_{n=1}^N \\mathcal{L}(h(x^{(n)}), f^*(x^{(n)}))$.\nTheorem 3. Let L : Rd1\u00d7\u00b7\u00b7\u00b7\u00d7dp \u2192 R be a loss function satisfying\n$\\mathcal{L}(A, B) = \\frac{1}{d_1 \\cdots d_p} \\sum_{i_1, \\cdots, i_p} \\ell(A_{i_1, \\cdots, i_p}, B_{i_1, \\cdots, i_p})$ for some loss function \u2113 : R \u2192 R+ bounded by M. Then\nfor any \u03b4 > 0, with probability at least 1 \u2212 \u03b4 over the choice of a sample of size N, the following\ninequality holds for all h \u2208 F:\n\n$$R(h) \\le \\hat{R}(h) + M \\sqrt{2D \\log\\left(\\frac{4e(p+2) d_0 d_1 \\cdots d_p}{\\max_{i \\ge 0} d_i}\\right) \\log(N) / N} + M \\sqrt{\\log\\left(\\frac{1}{\\delta}\\right) / (2N)},$$\n\nwhere $D = R_0 R_1 \\cdots R_p + \\sum_{i=0}^p R_i d_i$.\n\nProof. 
(Sketch) The complete proof is given in the supplementary material.\nIt relies on bounding the pseudo-dimension of the class of real-valued functions\n$\\tilde{\\mathcal{F}} = \\{(x, i_1, \\cdots, i_p) \\mapsto (W \\bullet_1 x)_{i_1, \\cdots, i_p} : \\mathrm{rank}(W) = (R_0, \\cdots, R_p)\\}$. We show that the pseudo-dimension\nof $\\tilde{\\mathcal{F}}$ is upper bounded by $(R_0 R_1 \\cdots R_p + \\sum_{i=0}^p R_i d_i) \\log\\left(\\frac{4e(p+2) d_0 d_1 \\cdots d_p}{\\max_{i \\ge 0} d_i}\\right)$. This\nis done by leveraging the following result originally due to [26]: the number of sign patterns\nof r polynomials, each of degree at most d, over q variables is at most $(4edr/q)^q$ for all\nr > q > 2 [25, Theorem 2]. The rest of the proof consists in showing that the risks (resp.\nempirical risks) of hypotheses in $\\mathcal{F}$ and $\\tilde{\\mathcal{F}}$ are closely related and invoking standard\ngeneralization error bounds in terms of the pseudo-dimension [17, Theorem 10.6].\n\nNote that generalization bounds based on the pseudo-dimension for multivariate regression\nwithout low-rank constraint would involve a term in $O(\\sqrt{d_0 d_1 \\cdots d_p})$. In contrast, the bound\nfrom the previous theorem only depends on the product of the output dimensions in a term\nbounded by $O(\\sqrt{\\log(d_1 \\cdots d_p)})$. In some sense, taking into account the low multilinear rank\nof the hypothesis allows us to significantly reduce the dependence on the output dimensions\nfrom $O(\\sqrt{d_0 \\cdots d_p})$ to $O(\\sqrt{(R_0 \\cdots R_p + \\sum_i R_i d_i)(\\sum_i \\log(d_i))})$.\n\n4 Experiments\n\nIn this section, we evaluate HOLRR on both synthetic and real-world datasets. Our\nexperimental results are for tensor-structured output regression problems on which we report\nroot mean-squared errors (RMSE) averaged across all the outputs. 
We compare HOLRR\nwith the following methods: regularized least squares RLS, low-rank regression LRR\ndescribed in Section 2.2, a multilinear approach based on tensor trace norm regularization\nADMM [7, 22], a nonconvex multilinear multitask learning approach MLMT-NC [22], a\nhigher-order extension of partial least squares HOPLS [28] and the greedy tensor approach\nfor multivariate spatio-temporal analysis Greedy [2].\nFor experiments with kernel algorithms we use the readily available kernelized RLS and the\nLRR kernel extension proposed in [18]. Note that ADMM, MLMT-NC and Greedy only\nconsider a linear dependency between inputs and outputs. The greedy tensor algorithm\nproposed in [2] is developed specially for spatio-temporal data and the implementation\nprovided by the authors is restricted to third-order tensors. Although MLMT-NC is\nperhaps the closest algorithm to ours, we applied it only to simulated data. This is because\nMLMT-NC is computationally very expensive and becomes intractable for large data sets.\nAverage running times are reported in supplementary material B.\n\n4.1 Synthetic Data\nWe generate both linear and nonlinear data. Linear data is drawn from the model Y =\nW \u20221 x + E where W \u2208 R10\u00d710\u00d710\u00d710 is a tensor of multilinear rank (6, 4, 4, 8) drawn at\nrandom, x \u2208 R10 is drawn from N (0, I), and each component of the error tensor E is drawn\nfrom N (0, 0.1). Nonlinear data is drawn from Y = W\u20221(x\u2297x)+E where W \u2208 R25\u00d710\u00d710\u00d710\nis of rank (5, 6, 4, 2) and x \u2208 R5 and E are generated as above. Hyper-parameters for all\nalgorithms are selected using 3-fold cross-validation on the training data.\nThese experiments have been carried out for different sizes of the training data set; 20 trials\nhave been executed for each size. The average RMSEs on a test set of size 100 for the 20\ntrials are reported in Figure 2. 
We see that the HOLRR algorithm clearly outperforms the other\nmethods on the linear data. The MLMT-NC method achieved the second best performance; it is\nhowever much more computationally expensive (see Table 1 in supplementary material B).\nOn the nonlinear data LRR achieves good performance but HOLRR is still significantly\nmore accurate, especially with small training datasets.\n\nFigure 2: Average RMSE as a function of the training set size: (left) linear data, (middle)\nnonlinear data, (right) for different values of the rank parameter.\n\nTable 1: RMSE on forecasting task.\n\nData set    ADMM    Greedy  HOPLS   HOLRR   K-HOLRR (poly)  K-HOLRR (rbf)\nCCDS        0.8448  0.8325  0.8147  0.8096  0.8275          0.7913\nFoursquare  0.1407  0.1223  0.1224  0.1227  0.1223          0.1226\nMeteo-UK    0.6140  \u2212       0.625   0.5971  0.6107          0.5886\n\nTo see how sensitive HOLRR is w.r.t. the choice of the multilinear rank, we carried out a\nsimilar experiment comparing HOLRR performances for different values of the rank parameter,\nsee Fig. 2 (right). In this experiment, the rank of the tensor W used to generate the data is\n(2, 2, 2, 2) while the input and output dimensions and the noise level are the same as above.\n\n4.2 Real Data\nWe evaluate our algorithm on a forecasting task on the following real-world data sets:\nCCDS: the comprehensive climate data set is a collection of climate records of North America\nfrom [15]. The data set contains monthly observations of 17 variables such as Carbon dioxide\nand temperature spanning from 1990 to 2001 across 125 observation locations.\nFoursquare: the Foursquare data set [14] contains users\u2019 check-in records in the Pittsburgh\narea categorized by different venue types such as Art & University. It records the number of\ncheck-ins by 121 users in each of the 15 categories of venues over 1200 time intervals.\nMeteo-UK: The data set is collected from the meteorological office of the UK2. 
It contains\nmonthly measurements of 5 variables in 16 stations across the UK from 1960 to 2000.\nThe forecasting task consists in predicting all variables at times t + 1, ..., t + k from their\nvalues at times t \u2212 2, t \u2212 1 and t. The first two real data sets were used in [2] with k = 1 (i.e.\noutputs are matrices). We consider here the same setting for these two data sets. For the\nthird data set we consider higher-order output tensors by setting k = 5. The output tensors\nare thus of size 17 \u00d7 125, 15 \u00d7 121 and 16 \u00d7 5 \u00d7 5, respectively, for the three data sets.\nFor all the experiments, we use 90% of the available data for training and 10% for testing.\nAll hyper-parameters are chosen by cross-validation. The average test RMSEs over 10 runs\nare reported in Table 1 (running times are reported in Table 1 in supplementary material B).\nWe see that HOLRR and K-HOLRR outperform the other methods on the CCDS data set\nwhile being orders of magnitude faster for the kernelized version (0.61s vs. 75.47s for Greedy\nand 235.73s for ADMM on average). On the Foursquare data set HOLRR performs as well as\nGreedy, and on the Meteo-UK data set K-HOLRR gets the best results with the RBF kernel\nwhile being much faster than ADMM (1.66s vs. 40.23s on average).\n\n5 Conclusion\n\nWe proposed a low-rank multilinear regression model for tensor-structured output data. We\ndeveloped a fast and efficient algorithm to tackle the multilinear rank penalized minimization\nproblem and provided theoretical guarantees. Experimental results showed that capturing\nlow-rank structure in the output data can help to improve tensor regression performance.\n\n2 http://www.metoffice.gov.uk/public/weather/climate-historic/\n\nAcknowledgments\nWe thank Fran\u00e7ois Denis and the reviewers for their helpful comments and suggestions. This\nwork was partially supported by ANR JCJC program MAD (ANR-14-CE27-0002).\n\nReferences\n[1] T. W. Anderson. 
Estimating linear restrictions on regression coefficients for multivariate normal distributions. Annals of Mathematical Statistics, 22:327\u2013351, 1951.\n[2] M. T. Bahadori, Q. R. Yu, and Y. Liu. Fast multivariate spatio-temporal analysis via low rank tensor learning. In NIPS, 2014.\n[3] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53\u201358, 1989.\n[4] A. Cichocki, R. Zdunek, A. H. Phan, and S. I. Amari. Nonnegative Matrix and Tensor Factorizations. Wiley, 2009.\n[5] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253\u20131278, 2000.\n[6] R. Foygel, M. Horrell, M. Drton, and J. D. Lafferty. Nonparametric reduced rank regression. In NIPS, 2012.\n[7] S. Gandy, B. Recht, and I. Yamada. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27(2):025010, 2011.\n[8] N. Halko, P. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217\u2013288, 2011.\n[9] C. J. Hillar and L. Lim. Most tensor problems are NP-hard. JACM, 60(6):45, 2013.\n[10] A. J. Izenman. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2):248\u2013264, 1975.\n[11] A. J. Izenman. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer-Verlag, New York, 2008.\n[12] P. Kar and H. Karnick. Random feature maps for dot product kernels. In AISTATS, 2012.\n[13] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455\u2013500, 2009.\n[14] X. Long, L. Jin, and J. Joshi. Exploring trajectory-driven local geographic topics in Foursquare. In UbiComp, 2012.\n[15] A. C. Lozano, H. Li, A. 
Niculescu-Mizil, Y. Liu, C. Perlich, J. Hosking, and N. Abe. Spatial-temporal causal modeling for climate change attribution. In KDD, 2009.\n[16] H. Lu, K. N. Plataniotis, and A. Venetsanopoulos. Multilinear Subspace Learning: Dimensionality Reduction of Multidimensional Data. CRC Press, 2013.\n[17] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.\n[18] A. Mukherjee and J. Zhu. Reduced rank ridge regression and its kernel extensions. Statistical Analysis and Data Mining, 4(6):612\u2013622, 2011.\n[19] M. Nickel and V. Tresp. An analysis of tensor models for learning on structured data. In Machine Learning and Knowledge Discovery in Databases, pages 272\u2013287. Springer, 2013.\n[20] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.\n[21] G. C. Reinsel and R. P. Velu. Multivariate reduced-rank regression: theory and applications. Lecture Notes in Statistics. Springer, 1998.\n[22] B. Romera-Paredes, M. H. Aung, N. Bianchi-Berthouze, and M. Pontil. Multilinear multitask learning. In ICML, 2013.\n[23] M. Signoretto, L. De Lathauwer, and J. K. Suykens. Learning tensors in reproducing kernel Hilbert spaces with multilinear spectral penalties. arXiv preprint arXiv:1310.4977, 2013.\n[24] M. Signoretto, Q. T. Dinh, L. De Lathauwer, and J. K. Suykens. Learning with tensors: a framework based on convex optimization and spectral regularization. Mach. Learn., 1\u201349, 2013.\n[25] N. Srebro, N. Alon, and T. S. Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In NIPS, 2004.\n[26] H. E. Warren. Lower bounds for approximation by nonlinear manifolds. Transactions of the American Mathematical Society, 133(1):167\u2013178, 1968.\n[27] K. Wimalawarne, M. Sugiyama, and R. Tomioka. Multitask learning meets tensor factorization: task imputation via convex optimization. In NIPS, 2014.\n[28] Q. Zhao, C. F. Caiafa, D. P. 
Mandic, Z. C. Chao, Y. Nagasaka, N. Fujii, L. Zhang, and A. Cichocki. Higher-order partial least squares (HOPLS). IEEE Trans. on Pattern Analysis and Machine Intelligence, 35(7):1660\u20131673, 2012.\n[29] Q. Zhao, G. Zhou, T. Adal\u0131, L. Zhang, and A. Cichocki. Kernel-based tensor partial least squares for reconstruction of limb movements. In ICASSP, 2013.\n[30] H. Zhou, L. Li, and H. Zhu. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540\u2013552, 2013.\n", "award": [], "sourceid": 1028, "authors": [{"given_name": "Guillaume", "family_name": "Rabusseau", "institution": "Aix-Marseille University"}, {"given_name": "Hachem", "family_name": "Kadri", "institution": "Aix-Marseille University"}]}