{"title": "Boosted Sparse and Low-Rank Tensor Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 1009, "page_last": 1018, "abstract": "We propose a sparse and low-rank tensor regression model to relate a univariate outcome to a feature tensor, in which each unit-rank tensor from the CP decomposition of the coefficient tensor is assumed to be sparse. This structure is both parsimonious and highly interpretable, as it implies that the outcome is related to the features through a few distinct pathways, each of which may only involve subsets of feature dimensions. We take a divide-and-conquer strategy to simplify the task into a set of sparse unit-rank tensor regression problems. To make the computation efficient and scalable, for the unit-rank tensor regression, we propose a stagewise estimation procedure to efficiently trace out its entire solution path. We show that as the step size goes to zero, the stagewise solution paths converge exactly to those of the corresponding regularized regression. The superior performance of our approach is demonstrated on various real-world and synthetic examples.", "full_text": "Boosted Sparse and Low-Rank Tensor Regression\n\nLifang He\n\nWeill Cornell Medicine\n\nlifanghescut@gmail.com\n\nKun Chen\u2217\n\nUniversity of Connecticut\nkun.chen@uconn.edu\n\nWanwan Xu\n\nUniversity of Connecticut\nwanwan.xu@uconn.edu\n\nJiayu Zhou\n\nMichigan State Universtiy\ndearjiayu@gmail.com\n\nFei Wang\n\nWeill Cornell Medicine\n\nfew2001@med.cornell.edu\n\nAbstract\n\nWe propose a sparse and low-rank tensor regression model to relate a univariate\noutcome to a feature tensor, in which each unit-rank tensor from the CP decom-\nposition of the coef\ufb01cient tensor is assumed to be sparse. This structure is both\nparsimonious and highly interpretable, as it implies that the outcome is related\nto the features through a few distinct pathways, each of which may only involve\nsubsets of feature dimensions. 
We take a divide-and-conquer strategy to simplify the task into a set of sparse unit-rank tensor regression problems. To make the computation efficient and scalable, for the unit-rank tensor regression, we propose a stagewise estimation procedure to efficiently trace out its entire solution path. We show that as the step size goes to zero, the stagewise solution paths converge exactly to those of the corresponding regularized regression. The superior performance of our approach is demonstrated on various real-world and synthetic examples.

1 Introduction

Regression analysis is commonly used for modeling the relationship between a predictor vector x ∈ R^I and a scalar response y. Typically a good regression model can achieve two goals: accurate prediction on future responses and parsimonious interpretation of the dependence structure between y and x [Hastie et al., 2009]. As a general setup, it fits M training samples {(x_m, y_m)}_{m=1}^M via minimizing a regularized loss, i.e., a loss L(·) plus a regularization term Ω(·), as follows:

  min_w (1/M) Σ_{m=1}^M L(⟨x_m, w⟩, y_m) + λ Ω(w),   (1)

where w ∈ R^I is the regression coefficient vector, ⟨·,·⟩ is the standard Euclidean inner product, and λ > 0 is the regularization parameter. For example, the sum of squared loss with ℓ1-norm regularization leads to the celebrated LASSO approach [Tibshirani, 1996], which performs sparse estimation of w and thus has implicit feature selection embedded therein.

In many modern real-world applications, the predictors/features are represented more naturally as higher-order tensors, such as videos and Magnetic Resonance Imaging (MRI) scans. In this case, if we want to predict a response variable for each tensor, a naive approach is to perform linear regression on the vectorized data (e.g., by stretching the tensor element by element).
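To make this naive baseline concrete, here is a minimal sketch of solving the vectorized problem (1) with squared loss and an ℓ1 penalty via proximal gradient descent (ISTA). The function names and toy data are our own illustrative assumptions; the paper's experiments use MATLAB's glmnet for the LASSO.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def vectorized_lasso(X_tensors, y, lam, n_iter=1000):
    """Naive baseline for (1): flatten each predictor tensor and run a
    LASSO via proximal gradient (ISTA), ignoring all tensor structure."""
    X = np.stack([x.ravel() for x in X_tensors])     # (M, prod of I_n)
    M = X.shape[0]
    w = np.zeros(X.shape[1])
    step = M / np.linalg.norm(X, 2) ** 2             # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / M                 # gradient of the loss
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Toy usage: 20 samples of 2x3 predictor tensors, one active coefficient
rng = np.random.default_rng(0)
Xs = [rng.standard_normal((2, 3)) for _ in range(20)]
w_true = np.zeros(6); w_true[0] = 1.0
y = np.array([x.ravel() @ w_true for x in Xs])
w_hat = vectorized_lasso(Xs, y, lam=0.01)
```

Note that `w_hat` carries no notion of which entries were neighbors in the original 2×3 grid, which is exactly the shortcoming discussed next.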
However, it completely ignores the multidimensional structure of the tensor data, such as the spatial coherence of the voxels. This motivates the tensor regression framework [Yu and Liu, 2016, Zhou et al., 2013], which treats each observation as a tensor X and learns a tensor coefficient W via regularized model fitting:

  min_W (1/M) Σ_{m=1}^M L(⟨X^m, W⟩, y^m) + λ Ω(W).   (2)

*Corresponding Author

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

When ℓ1-norm regularization is used, this formulation is essentially equivalent to (1) via vectorization. To effectively exploit the structural information of X^m, we can impose a low-rank constraint on W in Problem (2). Some authors achieved this by fixing the CANDECOMP/PARAFAC (CP) rank of W a priori. For example, Su et al. [2012] assumed W to be rank-1. Since the rank-1 constraint is too restrictive, Guo et al. [2012] and Zhou et al. [2013] imposed a rank-R constraint in a tensor decomposition model, but none of these methods considered adding a sparsity constraint as well to enhance model interpretability. Wang et al. [2014] imposed a restrictive rank-1 constraint on W and also applied an elastic net regularization [Zou and Hastie, 2005]. Tan et al. [2012] imposed a rank-R constraint and applied ℓ1-norm regularization to the factor matrices to also promote sparsity in W. Signoretto et al. [2014] applied the trace norm (nuclear norm) for low-rank estimation of W, and Song and Lu [2017] imposed a combination of the trace norm and the ℓ1-norm on W. Bengua et al. [2017] showed that the trace norm may not be appropriate for capturing the global correlation of a tensor, as it provides only the mean of the correlation between a single mode (rather than a few modes) and the rest of the tensor.
In all the above sparse and low-rank tensor models, the sparsity is imposed on W itself, which does not necessarily lead to sparsity of the decomposed matrices.

In this paper, we propose a sparse and low-rank tensor regression model in which the unit-rank tensors from the CP decomposition of the coefficient tensor are assumed to be sparse. This structure is both parsimonious and highly interpretable, as it implies that the outcome is related to the features through a few distinct pathways, each of which may only involve subsets of feature dimensions. We take a divide-and-conquer strategy to simplify the task into a set of sparse unit-rank tensor factorization/regression problems (SURF) of the form

  min_W (1/M) Σ_{m=1}^M (y^m − ⟨X^m, W⟩)² + λ∥W∥_1,  s.t.  rank(W) ≤ 1.

To make the solution process efficient for the SURF problem, we propose a boosted/stagewise estimation procedure to efficiently trace out its entire solution path. We show that as the step size goes to zero, the stagewise solution paths converge exactly to those of the corresponding regularized regression. The effectiveness and efficiency of our proposed approach are demonstrated on various real-world datasets as well as under various simulation setups.

2 Preliminaries on Tensors

We start with a brief review of some necessary preliminaries on tensors; more details can be found in [Kolda and Bader, 2009]. We denote scalars by lowercase letters, e.g., a; vectors by boldface lowercase letters, e.g., a; matrices by boldface uppercase letters, e.g., A; and tensors by calligraphic letters, e.g., A. We denote their entries by a_i, a_{i,j}, a_{i,j,k}, etc., depending on the number of dimensions. Indices are denoted by lowercase letters spanning the range from 1 to the uppercase letter of the index, e.g., n = 1,···,N.
Each entry of an Nth-order tensor A ∈ R^{I_1×···×I_N} is indexed by N indices {i_n}_{n=1}^N, and each i_n indexes the n-mode of A. Specifically, −n denotes every mode except n.

Definition 1 (Inner Product). The inner product of two tensors A, B ∈ R^{I_1×···×I_N} is the sum of the products of their entries, defined as ⟨A, B⟩ = Σ_{i_1=1}^{I_1} ··· Σ_{i_N=1}^{I_N} a_{i_1,···,i_N} b_{i_1,···,i_N}. It follows immediately that the Frobenius norm of A is defined as ∥A∥_F = √⟨A, A⟩. The ℓ1 norm of a tensor is defined as ∥A∥_1 = Σ_{i_1=1}^{I_1} ··· Σ_{i_N=1}^{I_N} |A_{i_1,···,i_N}|.

Definition 2 (Tensor Product). Let a^(n) ∈ R^{I_n} be a length-I_n vector for each n = 1, 2,···,N. The tensor product of {a^(n)}, denoted by A = a^(1) ∘ ··· ∘ a^(N), is an (I_1 × ··· × I_N)-tensor whose entries are given by A_{i_1,···,i_N} = a^(1)_{i_1} ··· a^(N)_{i_N}. We call A a rank-1 tensor or a unit-rank tensor.

Definition 3 (CP Decomposition). Every tensor A ∈ R^{I_1×···×I_N} can be decomposed as a weighted sum of rank-1 tensors with a suitably large R as

  A = Σ_{r=1}^R σ_r · a_r^(1) ∘ ··· ∘ a_r^(N),   (3)

where σ_r ∈ R, a_r^(n) ∈ R^{I_n} and ∥a_r^(n)∥_2 = 1.

Definition 4 (Tensor Rank). The tensor rank of A, denoted by rank(A), is the smallest number R such that the equality (3) holds.

Definition 5 (n-mode Product).
The n-mode product of a tensor A ∈ R^{I_1×···×I_N} by a vector u ∈ R^{I_n}, denoted by A ×_n u, is an (I_1 × ··· × I_{n−1} × I_{n+1} × ··· × I_N)-tensor whose entries are given by (A ×_n u)_{i_1,...,i_{n−1},i_{n+1},...,i_N} = Σ_{i_n=1}^{I_n} a_{i_1,...,i_N} u_{i_n}.

3 Sparse and Low-Rank Tensor Regression

3.1 Model Formulation

For an Nth-order predictor tensor X^m ∈ R^{I_1×···×I_N} and a scalar response y^m, m = 1,···,M, we consider the regression model of the form

  y^m = ⟨X^m, W⟩ + ε^m,   (4)

where W ∈ R^{I_1×···×I_N} is an Nth-order coefficient tensor, and ε^m is a random error term of zero mean. Without loss of generality, the intercept is set to zero by centering the response and standardizing the predictors as Σ_{m=1}^M y^m = 0; Σ_{m=1}^M X^m_{i_1,···,i_N} = 0 and Σ_{m=1}^M (X^m_{i_1,···,i_N})²/M = 1 for i_n = 1,···,I_n. Our goal is to estimate W with M i.i.d. observations {(X^m, y^m)}_{m=1}^M.

To reduce the complexity of the model and leverage the structural information in X^m, we assume the coefficient tensor W to be both low-rank and sparse. Specifically, we assume W can be decomposed via CP decomposition as W = Σ_{r=1}^R W_r = Σ_{r=1}^R σ_r w_r^(1) ∘ ··· ∘ w_r^(N), where each rank-1 tensor W_r is possibly sparse, or equivalently, the vectors in its representation w_r^(n), n = 1,...,N, are possibly sparse, for all r = 1,...,R.

Here we impose sparsity on the rank-1 components from the CP decomposition, rather than on W itself [Chen et al., 2012, Tan et al., 2012].
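To make the notation of Definitions 1-5 and the CP-structured coefficient concrete, here is a small NumPy sketch (a hypothetical toy example; the variable names are ours, not the paper's):

```python
import numpy as np

def outer_rank1(vectors):
    """Tensor product a(1) o ... o a(N) (Definition 2): a rank-1 tensor."""
    t = vectors[0]
    for v in vectors[1:]:
        t = np.tensordot(t, v, axes=0)   # repeated outer product
    return t

def mode_n_vec_product(A, u, n):
    """n-mode product A x_n u with a vector (Definition 5): contracts
    away mode n, leaving an order-(N-1) tensor."""
    return np.tensordot(A, u, axes=([n], [0]))

# A CP-style sum of two sparse rank-1 terms (cf. Definition 3)
w1 = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 2.0])]
w2 = [np.array([0.0, 0.0, 1.0]), np.array([1.0, 0.0])]
W = outer_rank1(w1) + outer_rank1(w2)    # shape (3, 2), tensor rank <= 2

X = np.arange(6.0).reshape(3, 2)
inner = np.sum(X * W)                    # <X, W> from Definition 1; equals 6.0
```

Each rank-1 term here touches only a subset of the coordinates of each mode, which is exactly the "sparse pathway" structure imposed on the W_r above.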
This adaptation can be more beneficial in multiple ways: 1) It integrates a finer sparsity structure into the CP decomposition, which enables direct control of component-wise sparsity; 2) It leads to an appealing model interpretation and feature grouping: the outcome is related to the features through a few distinct pathways, each of which may only involve subsets of feature dimensions; 3) It leads to a more flexible and parsimonious model, as it requires fewer parameters to recover the within-decomposition sparsity of a tensor than existing methods that impose sparsity on the tensor itself, thus improving the generalizability of the model.

A straightforward way of conducting model estimation is to solve the following optimization problem:

  min_{σ_r, w_r^(n)} (1/M) Σ_{m=1}^M (y^m − ⟨X^m, Σ_{r=1}^R σ_r w_r^(1) ∘ ··· ∘ w_r^(N)⟩)² + Σ_{r=1}^R Σ_{n=1}^N λ_{r,n} ∥w_r^(n)∥_1,
  s.t.  ∥w_r^(n)∥_1 = 1, n = 1,···,N, r = 1,···,R,   (5)

where λ_{r,n}, r = 1,···,R, n = 1,···,N, are regularization parameters. This problem is very difficult to solve because: 1) The CP rank R needs to be pre-specified; 2) As the CP decomposition may not be unique, the pursuit of its within-decomposition sparsity is highly non-convex and the problem may suffer from parameter identifiability issues [Mishra et al., 2017]; 3) The estimation may involve many regularization parameters, for which the tuning becomes very costly.

3.2 Divide-and-Conquer: Sequential Pursuit for Sparse Tensor Decomposition

We propose a divide-and-conquer strategy to recover the sparse CP decomposition. Our approach is based on the sequential extraction method (a.k.a.
deflation) [Phan et al., 2015, Mishra et al., 2017], which seeks one unit-rank tensor at a time and then deflates to find further unit-rank tensors from the residuals. This has proved to be a rapid and effective way of partitioning the tensor decomposition task. Specifically, we sequentially solve the following sparse unit-rank problem:

  Ŵ_r = argmin_{W_r} (1/M) Σ_{m=1}^M (y_r^m − ⟨X^m, W_r⟩)² + λ_r ∥W_r∥_1 + α ∥W_r∥_F²,  s.t.  rank(W_r) ≤ 1,   (6)

where r is the sequential number of the unit-rank terms and y_r^m is the current residual of the response, with

  y_r^m := y^m if r = 1, and y_r^m := y_{r−1}^m − ⟨X^m, Ŵ_{r−1}⟩ otherwise,

where Ŵ_{r−1} is the estimated unit-rank tensor in the (r−1)-th step, with tuning done by, e.g., cross validation. The final estimator is then obtained as Ŵ(R) = Σ_{r=1}^R Ŵ_r. Here, to improve the convexity of the problem and its numerical stability, we have used the elastic net [Zou and Hastie, 2005] penalty form instead of the LASSO, which is critical to ensure the convergence of the optimization solution. The accuracy of the solution can be controlled simply by adjusting the values of λ_r and α. Since we mainly focus on sparse estimation, we fix α > 0 as a small constant in numerical studies.

As each W_r is of unit rank, it can be decomposed as W_r = σ_r w_r^(1) ∘ ··· ∘ w_r^(N) with σ_r ≥ 0 and ∥w_r^(n)∥_1 = 1, n = 1,···,N. It is then clear that ∥W_r∥_1 = ∥σ_r w_r^(1) ∘ ··· ∘ w_r^(N)∥_1 = σ_r Π_{n=1}^N ∥w_r^(n)∥_1 = σ_r. That is, the sparsity of a unit-rank tensor directly leads to the sparsity of its components.
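The deflation loop around (6) can be sketched as below. This is an illustrative stand-in rather than the paper's SURF solver: each unit-rank subproblem is replaced by a crude unpenalized least-squares fit followed by a rank-1 SVD projection (matrix predictors only, no ℓ1 penalty), just to show the sequential extraction of unit-rank terms from response residuals.

```python
import numpy as np

def crude_unit_rank_fit(Xs, y):
    """Hypothetical stand-in for one unit-rank subproblem: ordinary least
    squares on the vectorized predictors, then projection onto the nearest
    rank-1 matrix via SVD (no sparsity penalty)."""
    Xmat = np.stack([x.ravel() for x in Xs])
    w, *_ = np.linalg.lstsq(Xmat, y, rcond=None)
    W = w.reshape(Xs[0].shape)
    U, s, Vt = np.linalg.svd(W)
    return s[0] * np.outer(U[:, 0], Vt[0])          # best rank-1 approximation

def sequential_deflation(Xs, y, R):
    """Divide-and-conquer pursuit: fit one unit-rank term at a time,
    deflate the response residual, and sum the unit-rank terms."""
    residual = y.astype(float).copy()
    W_hat = np.zeros_like(Xs[0])
    for _ in range(R):
        Wr = crude_unit_rank_fit(Xs, residual)
        residual = residual - np.array([np.sum(x * Wr) for x in Xs])
        W_hat = W_hat + Wr
    return W_hat

# Toy usage: recover a rank-2 coefficient matrix from noiseless data
rng = np.random.default_rng(1)
U, s, Vt = np.linalg.svd(rng.standard_normal((4, 3)))
W_true = U[:, :2] @ np.diag(s[:2]) @ Vt[:2]         # exactly rank 2
Xs = [rng.standard_normal((4, 3)) for _ in range(40)]
y = np.array([np.sum(x * W_true) for x in Xs])
W_est = sequential_deflation(Xs, y, R=2)
```

In this noiseless, overdetermined toy the two deflation rounds recover the two unit-rank terms exactly; the actual method replaces the crude fit with the sparse, elastic-net-penalized subproblem (6).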
This allows us to kill multiple birds with one stone: by simply pursuing the element-wise sparsity of the unit-rank coefficient tensor with only one tuning parameter λ, solving (6) can simultaneously produce a set of sparse factor coefficients ŵ_r^(n) for n = 1,···,N.

With this sequential pursuit strategy, the general problem boils down to a set of sparse unit-rank estimation problems, for which we develop a novel stagewise/boosting algorithm.

4 Fast Stagewise Unit-Rank Tensor Factorization (SURF)

For simplicity, we drop the index r and write the generic form of the problem in (6) as

  Ŵ = argmin_W (1/M) Σ_{m=1}^M (y^m − ⟨X^m, W⟩)² + λ∥W∥_1 + α∥W∥_F²,  s.t.  rank(W) ≤ 1.   (7)

Let W = σ w^(1) ∘ ··· ∘ w^(N), where σ ≥ 0, ∥w^(n)∥_1 = 1, and the factors w^(n), n = 1,···,N, are identifiable up to sign flipping. Let y = [y^1,···,y^M] ∈ R^M and X = [X^1,···,X^M] ∈ R^{I_1×···×I_N×M}. Then (7) can be reformulated as

  min_{σ, w^(n)} (1/M) ∥y − σ X ×_1 w^(1) ×_2 ··· ×_N w^(N)∥_2² + λσ Π_{n=1}^N ∥w^(n)∥_1 + ασ² Π_{n=1}^N ∥w^(n)∥_2²,
  s.t.  σ ≥ 0, ∥w^(n)∥_1 = 1, n = 1,···,N.   (8)

Before diving into the stagewise/boosting algorithm, we first consider an alternating convex search (ACS) approach [Chen et al., 2012, Minasian et al., 2014], which appears to be natural for solving (7) with any fixed tuning parameter.
Specifically, we alternately optimize with respect to one block of variables (σ, w^(n)) with the others fixed. For each block (σ, w^(n)), the relevant constraints are σ ≥ 0 and ∥w^(n)∥_1 = 1, but the objective function in (8) depends on (σ, w^(n)) only through their product σ w^(n), so both constraints can be avoided by optimizing with respect to σ w^(n). Let ŵ^(n) = σ w^(n) and Z^(−n) = X ×_1 w^(1) ×_2 ··· ×_{n−1} w^(n−1) ×_{n+1} ··· ×_N w^(N); the subproblem then boils down to

  min_{ŵ^(n)} (1/M) ∥y^T − Z^(−n)T ŵ^(n)∥_2² + α β^(−n) ∥ŵ^(n)∥_2² + λ ∥ŵ^(n)∥_1,   (9)

where β^(−n) = Π_{l≠n} ∥w^(l)∥_2². Once we obtain the solution ŵ^(n), we can set σ = ∥ŵ^(n)∥_1 and w^(n) = ŵ^(n)/σ to satisfy the constraints whenever ŵ^(n) ≠ 0. If ŵ^(n) = 0, then w^(n) is no longer identifiable; we then set σ = 0 and terminate the algorithm. Since each subproblem is strongly convex when α > 0 and the generated solution sequence is uniformly bounded, one can show that ACS converges to a coordinate-wise minimum point of (8) with a properly chosen initial value [Mishra et al., 2017]. The optimization procedure then needs to be repeated over a grid of tuning parameter values to obtain the solution paths of the parameters and to locate the optimal sparse solution along the paths.

Inspired by the biconvex structure of (7) and the stagewise algorithm for the LASSO [Zhao and Yu, 2007, Vaughan et al., 2017], we develop a fast stagewise unit-rank tensor factorization (SURF) algorithm to trace out the entire solution paths of (7) in a single run.
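As background for what follows, the classic forward-stagewise (epsilon-boosting) idea that SURF generalizes can be sketched in a few lines for the plain vector LASSO. This is an illustrative toy of ours: it only takes forward steps, whereas SURF also competes proposals across the N tensor modes, takes backward corrections, and tracks λ along the path.

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=2000):
    """At each step, nudge by +/-eps the single coefficient whose coordinate
    direction most decreases the squared loss (i.e., the predictor most
    correlated with the current residual)."""
    M, I = X.shape
    w = np.zeros(I)
    for _ in range(n_steps):
        grad = 2.0 * X.T @ (X @ w - y) / M     # gradient of the squared loss
        j = int(np.argmax(np.abs(grad)))       # steepest coordinate proposal
        w[j] -= eps * np.sign(grad[j])         # small forward increment
    return w

# Toy usage: a sparse 4-dimensional coefficient vector
rng = np.random.default_rng(3)
X = rng.standard_normal((50, 4))
w_true = np.array([1.5, 0.0, -2.0, 0.0])
y = X @ w_true
w_est = forward_stagewise(X, y)
```

The iterates trace out an implicit regularization path: early stopping yields sparse, heavily shrunk coefficients, and the step size eps controls the granularity of that path, mirroring the role of ε in SURF.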
The main idea of a stagewise procedure is to build a model from scratch, gradually increasing the model complexity in a sequence of simple learning steps. For instance, in stagewise estimation for linear regression, a forward step searches for the best predictor to explain the current residual and updates its coefficient by a small increment, and a backward step, on the contrary, may decrease the coefficient of a selected predictor to correct for greedy selection whenever necessary. Due to the biconvex or multi-convex structure of our objective function, efficient stagewise estimation remains possible: the only catch is that, when we determine which coefficient to update at each iteration, we always get N competing proposals from the N different tensor modes, rather than just one proposal as in the case of the LASSO.

To simplify the notation, the objective function (9) is re-arranged into a standard LASSO,

  min_{ŵ^(n)} (1/M) ∥ŷ − Ẑ^(−n) ŵ^(n)∥_2² + λ∥ŵ^(n)∥_1,   (10)

using the augmented data ŷ = (y, 0)^T and Ẑ^(−n) = (Z^(−n), √(αβ^(−n)M) I)^T, where I is the identity matrix of size I_n. We write the objective function of (10) as Γ(ŵ^(n); λ) = J(ŵ^(n)) + λΩ(ŵ^(n)), and use (σ, {w^(n)}) to denote all the variables when necessary. The structure of our stagewise algorithm is presented in Algorithm 1.

Algorithm 1 Fast Stagewise Unit-Rank Tensor Factorization (SURF)
Input: Training data D, a small step size ε > 0 and a small tolerance parameter ξ.
Output: Solution paths of (σ, {w^(n)}).
1: Initialization: take a forward step with
   ({î_1,···,î_N}, ŝ) = argmin_{{i_1,···,i_N}, s=±ε} J(s 1_{i_1}, 1_{i_2},···,1_{i_N}), and
   σ_0 = ε, w_0^(1) = sign(ŝ) 1_{î_1}, w_0^(n) = 1_{î_n} (n ≠ 1), λ_0 = (J({0}) − J(σ_0, {w_0^(n)}))/ε.   (11)
   Set the active index sets I_0^(n) = {î_n} for n = 1,···,N; t = 0.
2: repeat
3:   Backward step: (n̂, î_n̂) := argmin_{n, i_n ∈ I_t^(n)} J(ŵ_t^(n) + s_{i_n} 1_{i_n}), where s_{i_n} = −sign(ŵ_{t,i_n}^(n)) ε.
     If Γ(ŵ_t^(n̂) + s_{î_n̂} 1_{î_n̂}; λ_t) − Γ(ŵ_t^(n̂); λ_t) ≤ −ξ, then
     σ_{t+1} = ∥ŵ_t^(n̂) + s_{î_n̂} 1_{î_n̂}∥_1, w_{t+1}^(n̂) = (ŵ_t^(n̂) + s_{î_n̂} 1_{î_n̂})/σ_{t+1}, w_{t+1}^(−n̂) = w_t^(−n̂), λ_{t+1} = λ_t,
     I_{t+1}^(n) := I_t^(n̂) \ {î_n̂} if w_{t+1,î_n̂}^(n̂) = 0, and I_t^(n) otherwise.   (12)
4:   else Forward step: (n̂, î_n̂, ŝ_{î_n̂}) := argmin_{n, i_n, s=±ε} J(ŵ_t^(n) + s 1_{i_n}),
     σ_{t+1} = ∥ŵ_t^(n̂) + ŝ_{î_n̂} 1_{î_n̂}∥_1, w_{t+1}^(n̂) = (ŵ_t^(n̂) + ŝ_{î_n̂} 1_{î_n̂})/σ_{t+1}, w_{t+1}^(−n̂) = w_t^(−n̂),
     λ_{t+1} = min[λ_t, (J(σ_t, {w_t^(n)}) − J(σ_{t+1}, {w_{t+1}^(n)}) − ξ)/(Ω(σ_{t+1}, {w_{t+1}^(n)}) − Ω(σ_t, {w_t^(n)}))],
     I_{t+1}^(n) := I_t^(n) ∪ {î_n̂} if n = n̂, and I_t^(n) otherwise.   (13)
5:   Set t = t + 1.
6: until λ_t ≤ 0
It can be viewed as a boosting procedure that builds up the solution gradually through forward steps (line 4) and backward steps (line 3).³ The initialization step is solved explicitly (see Lemma 1 below). At each subsequent iteration, the parameter update takes the form ŵ^(n) = ŵ^(n) + s 1_{i_n} in either the forward or the backward direction, where 1_{i_n} is a length-I_n vector of all zeros except for a 1 in the i_n-th coordinate, s = ±ε, and ε is the pre-specified step size controlling the fineness of the grid. The algorithm also keeps track of the tuning parameter λ. Intuitively, the selection of the index (n, i_n) and the increment s is guided by minimizing the penalized loss function with the current λ subject to a constraint on the step size. Compared to the standard stagewise LASSO, the main difference here is that we need to select the "best" triple (n, i_n, s) over all the dimensions across all tensor modes.

Problems (12) and (13) can be solved efficiently. By expansion of J(ŵ^(n) + s 1_{i_n}), we have

  J(ŵ^(n) + s 1_{i_n}) = (1/M) (∥ê^(n)∥_2² − 2s ê^(n)T Ẑ^(−n) 1_{i_n} + ε² ∥Ẑ^(−n) 1_{i_n}∥_2²),

where ê^(n) = ŷ − Ẑ^(−n) ŵ^(n) is a constant at each iteration.

²As stated in [Zhao and Yu, 2007], ξ is implementation-specific but not necessarily a user parameter. In all experiments, we set ξ = ε²/2.
³Boosting amounts to combining a set of "weak learners" to build one strong learner, and is connected to stagewise and regularized estimation methods. SURF is a boosting method since each backward/forward step is in essence finding a weak learner to incrementally improve the current learner (thus generating a path of solutions).
Then the solution at each forward step is

  (n̂, î_n̂) := argmax_{n, i_n} 2|ê^(n)T Ẑ^(−n) 1_{i_n}| − ε Diag(Ẑ^(−n)T Ẑ^(−n))^T 1_{i_n},   ŝ = sign(ê^(n̂)T Ẑ^(−n̂) 1_{î_n̂}) ε,

and at each backward step is

  (n̂, î_n̂) := argmin_{n, i_n ∈ I^(n)} 2 sign(ŵ_{i_n}^(n)) ê^T Ẑ^(−n) 1_{i_n} + ε Diag(Ẑ^(−n)T Ẑ^(−n))^T 1_{i_n},

where Diag(·) denotes the vector formed by the diagonal elements of a square matrix, and I^(n) ⊂ {1,···,I_n} is the n-mode active index set at the current iteration.

Computational Analysis. In Algorithm 1, the most time-consuming parts are the calculations of ê^(n)T Ẑ^(−n) and Diag(Ẑ^(−n)T Ẑ^(−n)), involved in both forward and backward steps. We can further write these as ê^(n)T Ẑ^(−n) = (Z^(−n) e − αβ^(−n) M ŵ^(n))^T and Diag(Ẑ^(−n)T Ẑ^(−n)) = Diag(Z^(−n) Z^(−n)T) + αβ^(−n) M 1^(n), where e = y^T − Z^(−n)T ŵ^(n) is a constant during each iteration but varies from one iteration to the next, and 1^(n) is a length-I_n vector of all ones. At each iteration, the computational cost is dominated by the update of Z^(−n) (n ≠ n̂), which can be obtained by

  Z_{t+1}^(−n) = (1/σ_{t+1}) (σ_t Z_t^(−n) + Z_t^(−n,−n̂) ×_n̂ ŝ_{î_n̂} 1_{î_n̂}),   (14)

where (−n,−n̂) denotes every mode except n and n̂; this greatly reduces the computation. Therefore, when n ≠ n̂, the updates of Z^(−n), e, Z^(−n)e and Diag(Z^(−n)T Z^(−n)) require O(M Π_{s≠n,n̂} I_s + 3M I_n), O(M I_n̂), O(M I_n) and O(M I_n) operations, respectively; when n = n̂, we only need to additionally update Z^(−n̂)e, which requires O(M I_n̂) operations.
Overall, the computational complexity of our approach is O(M Σ_{n≠n̂} (Π_{s≠n,n̂} I_s + 5I_n) + 2M I_n̂) per iteration. In contrast, the ACS algorithm has to be run for each fixed λ, and within each such problem it requires O(M Π_{n=1}^N I_n) per iteration [da Silva et al., 2015].

5 Convergence Analysis

We provide a convergence analysis for our stagewise algorithm in this section. All detailed proofs are given in Appendix A. Specifically, Lemmas 1 and 2 below justify the validity of the initialization.

Lemma 1. Let X be the (N+1)-mode matricization of X. Denote X = [x_1,···,x_I], where each x_i is a column of X; then λ_max = (2/M) max{|x_i^T y|; i = 1,···,I}. Moreover, letting i* = argmax_i |x_i^T y| and (i*_1,···,i*_N) represent its corresponding indices in tensor space, the initial non-zero solution of (11), denoted (σ, {w^(n)}), is given by σ = ε, w^(1) = sign(x_{i*}^T y) 1_{i*_1}, w^(n) = 1_{i*_n}, ∀n = 2,···,N, where 1_{i*_n} is a vector of all 0's except for a 1 in the i*_n-th coordinate.

Lemma 2. If there exist s and i_n with |s| = ε, n = 1,···,N, such that Γ(s 1_{i_1}, 1_{i_2},···,1_{i_N}; λ) ≤ Γ({0}; λ), it must be true that λ ≤ λ_0.

Lemma 3 shows that the backward step always performs a coordinate descent update of fixed size ε, each time along the steepest coordinate direction within the current active set, until descent becomes impossible subject to a tolerance level ξ. Also, the forward step performs coordinate descent when λ_{t+1} = λ_t.
Lemma 4 shows that when λ changes, the penalized loss for the previous λ can no longer be improved subject to a tolerance level ξ. Thus ε controls the granularity of the paths, and ξ is a convergence threshold in optimizing the penalized loss with any fixed tuning parameter. Together they enable a convenient trade-off between computational efficiency and estimation accuracy.

Lemma 3. For any t with λ_{t+1} = λ_t, we have Γ(σ_{t+1}, {w_{t+1}^(n)}; λ_{t+1}) ≤ Γ(σ_t, {w_t^(n)}; λ_{t+1}) − ξ.

Lemma 4. For any t with λ_{t+1} < λ_t, we have Γ(ŵ_t^(n) + s_{i_n} 1_{i_n}; λ_t) > Γ(ŵ_t^(n); λ_t) − ξ.

Lemmas 3 and 4 prove the following convergence theorem.

Theorem 1. For any t such that λ_{t+1} < λ_t, we have (σ_t, {w_t^(n)}) → (σ(λ_t), {w̃^(n)(λ_t)}) as ε, ξ → 0, where (σ(λ_t), {w̃^(n)(λ_t)}) denotes a coordinate-wise minimum point of Problem (7).

Table 1: Compared methods.
α and λ are regularization parameters; R is the CP rank.

Methods | Input Data Type | Regularization | Rank Explored | Hyperparameters
LASSO | Vector | ℓ1 (w) | — | λ
ENet | Vector | ℓ1/ℓ2 (w) | — | α, λ
Remurs | Tensor | Nuclear/ℓ1 (W) | — | λ1, λ2
orTRR | Tensor | ℓ2 (W^(n)) | Optimized | α, R
GLTRM | Tensor | ℓ1/ℓ2 (W^(n)) | Fixed | α, λ, R
ACS | Tensor | ℓ1/ℓ2 (W_r) | Increased | α, λ, R
SURF | Tensor | ℓ1/ℓ2 (W_r) | Increased | α, λ, R

6 Experiments

We evaluate the effectiveness and efficiency of our method SURF through numerical experiments on both synthetic and real data, and compare it with various state-of-the-art regression methods, including LASSO, Elastic Net (ENet), Regularized multilinear regression and selection (Remurs) [Song and Lu, 2017], optimal CP-rank Tensor Ridge Regression (orTRR) [Guo et al., 2012], the Generalized Linear Tensor Regression Model (GLTRM) [Zhou et al., 2013], and a variant of our method with Alternating Convex Search (ACS) estimation. Table 1 summarizes the properties of all methods. All methods are implemented in MATLAB and executed on a machine with a 3.50GHz CPU and 256GB RAM. For LASSO and ENet we use the MATLAB package glmnet from [Friedman et al., 2010]. For GLTRM, we solve the regularized CP tensor regression simultaneously for all R factors based on the TensorReg toolbox [Zhou et al., 2013]. We follow [Kampa et al., 2014] to arrange the test and training sets in the ratio of 1:5. The hyperparameters of all methods are optimized using 5-fold cross validation on the training set, with ranges α ∈ {0.1, 0.2,···,1}, λ ∈ {10⁻³, 5×10⁻³, 10⁻², 5×10⁻²,···, 5×10², 10³}, and R ∈ {1, 2,···,50}.
Specifically, for GLTRM, ACS, and SURF, we simply set α = 1. For LASSO, ENet and ACS, we generate a sequence of 100 values of λ to cover the whole path. For fairness, the number of iterations for all compared methods is fixed to 100. All cases are run 50 times and the average results on the test set are reported. Our code is available at https://github.com/LifangHe/SURF.

Synthetic Data. We first use synthetic data to examine the performance of our method in different scenarios, with varying step sizes, sparsity levels, numbers of features, and sample sizes. We generate the data as follows: y = ⟨X, W⟩ + ε, where ε is a random noise generated from N(0, 1), and X ∈ R^{I×I}. We generate M samples {X_m}_{m=1}^M from N(0, Σ), where Σ is a covariance matrix in which the correlation coefficient between features x_{i,j} and x_{p,q} is defined as 0.6^√((i−p)²+(j−q)²). We generate the true support as W = Σ_{r=1}^R σ_r w_r^(1) ∘ w_r^(2), where each w_r^(n) ∈ R^I, n = 1, 2, is a column vector with N(0, 1) i.i.d. entries normalized to unit ℓ1 norm, and the scalars σ_r are defined by σ_r = 1/r. To impose sparsity on W, we set S% of its entries (chosen uniformly at random) to zero. When studying one of the factors, the other factors are fixed to M = 500, I = 16, R = 50, S = 80.

Figure 1: Comparison of solution paths of SURF (solid line) and ACS (dashed line) with step sizes (a) ε = 0.01, (b) ε = 0.1, (c) ε = 0.5 on synthetic data. The path of estimates W for each λ is treated as a function of t = ∥W∥_1. Note that when the step size is small, the SURF path is almost indiscernible from the ACS path.

In a first step we analyze the critical parameter ε for our method. This parameter controls how closely SURF approximates the ACS paths.
Figure 1 shows the solution paths of our method and of ACS under both large and small step sizes. As the plots show, a smaller step size leads to a closer approximation of the ACS solutions. In Figure 2, we also plot the averaged prediction error with standard-deviation bars (left y-axis) and CPU execution time in minutes (right y-axis) against the step size. The figure shows that the choice of step size affects both computational speed and root mean-squared prediction error (RMSE): the smaller the step size, the more accurate the regression result, but the longer the run time. In both figures, the moderate step size ε = 0.1 offers a good trade-off between accuracy and computation, so in the following we fix ε = 0.1.

Figure 2: Results for different values of the step size ε.

Next, we examine the performance of our method under varying sparsity levels of W. For this purpose, we compare the prediction error (RMSE) and running time (log min) of all methods on the synthetic data. Figures 3(a)-(b) show the results for S ∈ {60, 80, 90, 95, 98} on synthetic 2D data, where S% is the sparsity level of the true W. As the plots show, SURF yields slightly better predictions than the existing methods when the true W is sparser. Moreover, as shown in Figure 3(c), larger step sizes produce much sparser coefficients for SURF, which explains why some curves in Figure 1(c) have no values as ‖W‖₁ grows.

Furthermore, we compare the prediction error and running time (log min) of all methods with an increasing number of features.
Figures 4(a)-(b) show the results for I ∈ {8, 16, 32, 64} on synthetic 2D data. Overall, SURF gives better predictions at a lower computational cost. In particular, SURF and ACS have very similar prediction quality, which matches our theoretical result on the solutions of SURF versus ACS. SURF achieves better predictions than the other tensor methods, indicating the effectiveness of the structured sparsity in the unit-rank tensor decomposition itself. In terms of running time, as the number of features increases, SURF is significantly faster than the other methods.

Finally, Figure 5 shows the results with an increasing number of samples. Overall, SURF again gives better predictions at a lower computational cost. Notably, the results of Remurs and orTRR change little as the number of samples increases, which may be due to early stopping in the iterative search for an optimized solution.

Figure 3: Results with increasing sparsity level (S%) of the true W on synthetic 2D data, (a) test error and (b) running time, and (c) sparsity of W versus step size ε for SURF.

Figure 4: Results with increasing number of features on synthetic 2D data, (a) test error and (b) running time, and (c) running time on real 3D MRI data of 240 × 175 × 176 features with fixed hyperparameters (without cross-validation).

Real Data.
We also examine the performance of our method on a real medical image dataset, including both DTI and MRI images, obtained from the Parkinson's Progression Markers Initiative (PPMI) database⁴ with 656 human subjects.

Figure 5: Results with increasing number of samples on synthetic 2D data: (a) test error and (b) running time.

Table 2: Performance comparison over different DTI datasets. Column 2 indicates the metrics used: RMSE, Sparsity of Coefficients (SC), and CPU execution time (in mins). The results are averaged over 50 random trials, with both the mean values and standard deviations (mean ± std.).

Dataset  | Metric | LASSO     | ENet      | Remurs      | orTRR     | GLTRM       | ACS         | SURF
DTIfact  | RMSE   | 2.94±0.34 | 2.92±0.32 | 2.91±0.32   | 3.48±0.21 | 3.09±0.35   | 2.81±0.24   | 2.81±0.23
         | SC     | 0.99±0.01 | 0.97±0.01 | 0.66±0.13   | 0.00±0.00 | 0.90±0.10   | 0.92±0.02   | 0.95±0.01
         | Time   | 6.4±0.3   | 46.6±4.6  | 161.3±9.3   | 27.9±5.6  | 874.8±29.6  | 60.8±24.4   | 1.7±0.2
DTIrk2   | RMSE   | 3.18±0.36 | 3.16±0.42 | 2.97±0.30   | 3.76±0.44 | 3.26±0.46   | 2.90±0.31   | 2.91±0.32
         | SC     | 0.99±0.01 | 0.95±0.03 | 0.37±0.09   | 0.00±0.00 | 0.91±0.06   | 0.93±0.02   | 0.94±0.01
         | Time   | 5.7±0.3   | 42.4±2.9  | 155.0±10.7  | 10.2±0.1  | 857.4±22.5  | 63.0±21.6   | 5.2±0.8
DTIsl    | RMSE   | 3.06±0.34 | 2.99±0.34 | 2.93±0.27   | 3.56±0.41 | 3.14±0.39   | 2.89±0.38   | 2.87±0.35
         | SC     | 0.98±0.01 | 0.95±0.01 | 0.43±0.17   | 0.00±0.00 | 0.87±0.03   | 0.90±0.03   | 0.93±0.02
         | Time   | 5.8±0.3   | 45.0±1.0  | 163.6±9.0   | 7.5±0.9   | 815.4±6.5   | 66.3±44.9   | 1.5±0.1
DTItl    | RMSE   | 3.20±0.40 | 3.21±0.59 | 2.84±0.35   | 3.66±0.35 | 3.12±0.32   | 2.82±0.33   | 2.83±0.32
         | SC     | 0.99±0.01 | 0.96±0.03 | 0.44±0.13   | 0.00±0.00 | 0.86±0.03   | 0.90±0.02   | 0.91±0.02
         | Time   | 5.5±0.2   | 42.3±1.4  | 159.6±7.6   | 26.6±3.1  | 835.8±9.9   | 96.7±43.2   | 3.8±0.5
Combined | RMSE   | 3.02±0.37 | 2.89±0.41 | 2.81±0.31   | 3.33±0.27 | 3.26±0.45   | 2.79±0.31   | 2.78±0.29
         | SC     | 0.99±0.00 | 0.97±0.01 | 0.34±0.22   | 0.00±0.00 | 0.91±0.19   | 0.97±0.01   | 0.99±0.00
         | Time   | 8.8±0.6   | 71.9±2.3  | 443.5±235.7 | 48.4±8.0  | 1093.6±49.7 | 463.2±268.1 | 6.2±0.5
We parcel the brain into 84 regions and extract four types of connectivity matrices. Our goal is to predict the Montreal Cognitive Assessment (MoCA) score of each subject. Details of the data processing are presented in Appendix B. We use three metrics to evaluate performance: root mean squared prediction error (RMSE), which measures the deviation between the ground-truth responses and the predicted values in out-of-sample testing; sparsity of coefficients (SC), which is the same as the S% defined in the synthetic data analysis (i.e., the percentage of zero entries in the estimated coefficients); and CPU execution time. Table 2 shows the results of all methods on both the individual and the combined datasets. Again, we observe that SURF gives better predictions at a lower computational cost, along with good sparsity. In particular, paired t-tests showed that on all five real datasets, the RMSE of our approach is significantly lower, and its SC significantly higher, than those of the Remurs and GLTRM methods. This indicates that the performance gain of our approach over the other low-rank-plus-sparse methods is indeed significant. Figure 4(c) reports the running time (log min) of all methods on the PPMI MRI images of 240 × 175 × 176 voxels each, which clearly demonstrates the efficiency of our approach.

Acknowledgements

This work is supported by NSF grants IIS-1716432 (Wang), IIS-1750326 (Wang), IIS-1718798 (Chen), DMS-1613295 (Chen), IIS-1749940 (Zhou), and IIS-1615597 (Zhou); ONR grants N00014-18-1-2585 (Wang) and N00014-17-1-2265 (Zhou); and Michael J. Fox Foundation grant number 14858 (Wang). Lifang He's research is supported in part by 1R01AI130460. Data used in the preparation of this article were obtained from the Parkinson's Progression Markers Initiative (PPMI) database (http://www.ppmi-info.org/data). For up-to-date information on the study, visit http://www.ppmi-info.org.
PPMI – a public-private partnership – is funded by the Michael J. Fox Foundation for Parkinson's Research and funding partners, including AbbVie, Avid, Biogen, Bristol-Myers Squibb, Covance, GE, Genentech, GlaxoSmithKline, Lilly, Lundbeck, Merck, Meso Scale Discovery, Pfizer, Piramal, Roche, Sanofi, Servier, TEVA, UCB, and Golub Capital.

⁴ http://www.ppmi-info.org/data

References

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2009.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1):267–288, 1996.

Rose Yu and Yan Liu. Learning from multiway data: Simple and efficient tensor regression. In International Conference on Machine Learning, pages 373–381, 2016.

Hua Zhou, Lexin Li, and Hongtu Zhu. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540–552, 2013.

Ya Su, Xinbo Gao, Xuelong Li, and Dacheng Tao. Multivariate multilinear regression. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(6):1560–1573, 2012.

Weiwei Guo, Irene Kotsia, and Ioannis Patras. Tensor learning for regression. IEEE Transactions on Image Processing, 21(2):816–827, 2012.

Fei Wang, Ping Zhang, Buyue Qian, Xiang Wang, and Ian Davidson. Clinical risk prediction with multilinear sparse logistic regression. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 145–154. ACM, 2014.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society: Series B, 67(2):301–320, 2005.

Xu Tan, Yin Zhang, Siliang Tang, Jian Shao, Fei Wu, and Yueting Zhuang. Logistic tensor regression for classification. In International Conference on Intelligent Science and Intelligent Data Engineering, pages 573–581. Springer, 2012.

Marco Signoretto, Quoc Tran Dinh, Lieven De Lathauwer, and Johan A. K. Suykens. Learning with tensors: a framework based on convex optimization and spectral regularization. Machine Learning, 94(3):303–351, 2014.

Xiaonan Song and Haiping Lu. Multilinear regression for embedded feature selection with application to fMRI analysis. In AAAI, pages 2562–2568, 2017.

Johann A. Bengua, Ho N. Phien, Hoang Duong Tuan, and Minh N. Do. Efficient tensor completion for color image and video recovery: Low-rank tensor train. IEEE Transactions on Image Processing, 26(5):2466–2479, 2017.

Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

Kun Chen, Kung-Sik Chan, and Nils Chr. Stenseth. Reduced rank stochastic regression with a sparse singular value decomposition. Journal of the Royal Statistical Society: Series B, 74(2):203–221, 2012.

Aditya Mishra, Dipak K. Dey, and Kun Chen. Sequential co-sparse factor regression. Journal of Computational and Graphical Statistics, pages 814–825, 2017.

Anh-Huy Phan, Petr Tichavský, and Andrzej Cichocki. Tensor deflation for CANDECOMP/PARAFAC, Part I: Alternating subspace update algorithm. IEEE Transactions on Signal Processing, 63(22):5924–5938, 2015.

Arin Minasian, Shahram ShahbazPanahi, and Raviraj S. Adve. Energy harvesting cooperative communication systems. IEEE Transactions on Wireless Communications, 13(11):6118–6131, 2014.

Peng Zhao and Bin Yu. Stagewise lasso.
Journal of Machine Learning Research, 8(Dec):2701–2726, 2007.

Gregory Vaughan, Robert Aseltine, Kun Chen, and Jun Yan. Stagewise generalized estimating equations with grouped variables. Biometrics, 73:1332–1342, 2017.

Alex Pereira da Silva, Pierre Comon, and André Lima Ferrer de Almeida. Rank-1 tensor approximation methods and application to deflation. arXiv preprint arXiv:1508.05273, 2015.

Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

Kittipat Kampa, S. Mehta, Chun-An Chou, Wanpracha Art Chaovalitwongse, and Thomas J. Grabowski. Sparse optimization in feature selection: application in neuroimaging. Journal of Global Optimization, 59(2-3):439–457, 2014.