{"title": "Multivariate Regression with Calibration", "book": "Advances in Neural Information Processing Systems", "page_first": 127, "page_last": 135, "abstract": "We propose a new method named calibrated multivariate regression (CMR) for fitting high dimensional multivariate regression models. Compared to existing methods, CMR calibrates the regularization for each regression task with respect to its noise level so that it is simultaneously tuning insensitive and achieves an improved finite-sample performance. Computationally, we develop an efficient smoothed proximal gradient algorithm which has a worst-case iteration complexity $O(1/\\epsilon)$, where $\\epsilon$ is a pre-specified numerical accuracy. Theoretically, we prove that CMR achieves the optimal rate of convergence in parameter estimation. We illustrate the usefulness of CMR by thorough numerical simulations and show that CMR consistently outperforms other high dimensional multivariate regression methods. We also apply CMR on a brain activity prediction problem and find that CMR is as competitive as the handcrafted model created by human experts.", "full_text": "Multivariate Regression with Calibration*\n\nHan Liu\nDepartment of Operations Research and Financial Engineering\nPrinceton University\n\nLie Wang\nDepartment of Mathematics\nMassachusetts Institute of Technology\n\nTuo Zhao\u2020\nDepartment of Computer Science\nJohns Hopkins University\n\nAbstract\n\nWe propose a new method named calibrated multivariate regression (CMR) for fitting high dimensional multivariate regression models. Compared to existing methods, CMR calibrates the regularization for each regression task with respect to its noise level so that it is simultaneously tuning insensitive and achieves an improved finite-sample performance. 
Computationally, we develop an efficient smoothed proximal gradient algorithm which has a worst-case iteration complexity $O(1/\\epsilon)$, where $\\epsilon$ is a pre-specified numerical accuracy. Theoretically, we prove that CMR achieves the optimal rate of convergence in parameter estimation. We illustrate the usefulness of CMR by thorough numerical simulations and show that CMR consistently outperforms other high dimensional multivariate regression methods. We also apply CMR on a brain activity prediction problem and find that CMR is as competitive as the handcrafted model created by human experts.\n\n1 Introduction\n\nGiven a design matrix $X \\in \\mathbb{R}^{n \\times d}$ and a response matrix $Y \\in \\mathbb{R}^{n \\times m}$, we consider a multivariate linear model $Y = XB^0 + Z$, where $B^0 \\in \\mathbb{R}^{d \\times m}$ is an unknown regression coefficient matrix and $Z \\in \\mathbb{R}^{n \\times m}$ is a noise matrix [1]. For a matrix $A = [A_{jk}] \\in \\mathbb{R}^{d \\times m}$, we denote $A_{j*} = (A_{j1}, ..., A_{jm}) \\in \\mathbb{R}^m$ and $A_{*k} = (A_{1k}, ..., A_{dk})^T \\in \\mathbb{R}^d$ to be its $j$th row and $k$th column respectively. We assume that all $Z_{i*}$\u2019s are independently sampled from an $m$-dimensional Gaussian distribution with mean 0 and covariance matrix $\\Sigma \\in \\mathbb{R}^{m \\times m}$.\n\nWe can represent the multivariate linear model as an ensemble of univariate linear regression models: $Y_{*k} = XB^0_{*k} + Z_{*k}$, $k = 1, ..., m$. Then we get a multi-task learning problem [3, 2, 26]. Multi-task learning exploits shared common structure across tasks to obtain improved estimation performance. In the past decade, significant progress has been made towards designing a variety of modeling assumptions for multivariate regression.\n\nA popular assumption is that all the regression tasks share a common sparsity pattern, i.e., many $B^0_{j*}$\u2019s are zero vectors. Such a joint sparsity assumption is a natural extension of that for univariate linear regressions. 
Similar to the $L_1$-regularization used in Lasso [23], we can adopt group regularization to obtain a good estimator of $B^0$ [25, 24, 19, 13]. Besides the aforementioned approaches, there are other methods that aim to exploit the covariance structure of the noise matrix $Z$ [7, 22]. For instance, [22] assume that all $Z_{i*}$\u2019s follow a multivariate Gaussian distribution with a sparse inverse covariance matrix $\\Omega = \\Sigma^{-1}$. They propose an iterative algorithm to estimate sparse $B^0$ and $\\Omega$ by maximizing the penalized Gaussian log-likelihood. Such an iterative procedure is effective in many applications, but the theoretical analysis is difficult due to its nonconvex formulation.\n\n* The authors are listed in alphabetical order. This work is partially supported by the grants NSF IIS1408910, NSF IIS1332109, NSF Grant DMS-1005539, NIH R01MH102339, NIH R01GM083084, and NIH R01HG06841.\n\n\u2020 Tuo Zhao is also affiliated with Department of Operations Research and Financial Engineering at Princeton University.\n\nIn this paper, we assume an uncorrelated structure for the noise matrix $Z$, i.e., $\\Sigma = \\mathrm{diag}(\\sigma_1^2, \\sigma_2^2, \\ldots, \\sigma_{m-1}^2, \\sigma_m^2)$. Under this setting, we can efficiently solve the resulting estimation problem with a convex program as follows\n\n$$\\widehat{B} = \\operatorname*{argmin}_{B} \\frac{1}{\\sqrt{n}}\\|Y - XB\\|_F^2 + \\lambda\\|B\\|_{1,p}, \\qquad (1.1)$$\n\nwhere $\\lambda > 0$ is a tuning parameter, and $\\|A\\|_F = \\sqrt{\\sum_{j,k} A_{jk}^2}$ is the Frobenius norm of a matrix $A$. Popular choices of $p$ include $p = 2$ and $p = \\infty$: $\\|B\\|_{1,2} = \\sum_{j=1}^d \\sqrt{\\sum_{k=1}^m B_{jk}^2}$ and $\\|B\\|_{1,\\infty} = \\sum_{j=1}^d \\max_{1 \\le k \\le m} |B_{jk}|$. Computationally, the optimization problem in (1.1) can be efficiently solved by some first order algorithms [11, 12, 4].\n\nThe problem with the uncorrelated noise structure is amenable to statistical analysis. Let $\\sigma_{\\max} = \\max_k \\sigma_k$. Under suitable conditions on the noise and design matrices, if we choose $\\lambda = 2c \\cdot \\sigma_{\\max}(\\sqrt{\\log d} + m^{1-1/p})$, for some $c > 1$, then the estimator $\\widehat{B}$ in (1.1) achieves the optimal rates of convergence$^1$ [13], i.e., there exists some universal constant $C$ such that with high probability, we have\n\n$$\\frac{1}{\\sqrt{m}}\\|\\widehat{B} - B^0\\|_F \\le C \\cdot \\sigma_{\\max}\\left(\\sqrt{\\frac{s \\log d}{nm}} + \\sqrt{\\frac{s m^{1-2/p}}{n}}\\right),$$\n\nwhere $s$ is the number of rows with non-zero entries in $B^0$. However, the estimator in (1.1) has two drawbacks: (1) All the tasks are regularized by the same tuning parameter $\\lambda$, even though different tasks may have different $\\sigma_k$\u2019s. Thus more estimation bias is introduced to the tasks with smaller $\\sigma_k$\u2019s to compensate the tasks with larger $\\sigma_k$\u2019s. In other words, these tasks are not calibrated. (2) The tuning parameter selection involves the unknown parameter $\\sigma_{\\max}$. This requires tuning the regularization parameter over a wide range of potential values to get a good finite-sample performance.\n\nTo overcome the above two drawbacks, we formulate a new convex program named calibrated multivariate regression (CMR). The CMR estimator is defined to be the solution of the following convex program:\n\n$$\\widehat{B} = \\operatorname*{argmin}_{B} \\|Y - XB\\|_{2,1} + \\lambda\\|B\\|_{1,p}, \\qquad (1.2)$$\n\nwhere $\\|A\\|_{2,1} = \\sum_k \\sqrt{\\sum_j A_{jk}^2}$ is the nonsmooth $L_{2,1}$ norm of a matrix $A = [A_{jk}] \\in \\mathbb{R}^{d \\times m}$. This is a multivariate extension of the square-root Lasso [5]. Similar to the square-root Lasso, the tuning parameter selection of CMR does not involve $\\sigma_{\\max}$. Moreover, the $L_{2,1}$ loss function can be viewed as a special example of the weighted least square loss, which calibrates each regression task (see more details in \u00a72). Thus CMR adapts to different $\\sigma_k$\u2019s and achieves better finite-sample performance than the ordinary multivariate regression estimator (OMR) defined in (1.1).\n\nSince both the loss and penalty functions in (1.2) are nonsmooth, CMR is computationally more challenging than OMR. 
To efficiently solve CMR, we propose a smoothed proximal gradient (SPG) algorithm with an iteration complexity $O(1/\\epsilon)$, where $\\epsilon$ is the pre-specified accuracy of the objective value [18, 4]. Theoretically, we provide sufficient conditions under which CMR achieves the optimal rates of convergence in parameter estimation. Numerical experiments on both synthetic and real data show that CMR universally outperforms existing multivariate regression methods. For a brain activity prediction task, prediction based on the features selected by CMR significantly outperforms that based on the features selected by OMR, and is even competitive with that based on the handcrafted features selected by human experts.\n\n1 The rate of convergence is optimal when $p = 2$, i.e., the regularization function is $\\|B\\|_{1,2}$.\n\nNotations: Given a vector $v = (v_1, \\ldots, v_d)^T \\in \\mathbb{R}^d$, for $1 \\le p \\le \\infty$, we define the $L_p$-vector norm of $v$ as $\\|v\\|_p = \\left(\\sum_{j=1}^d |v_j|^p\\right)^{1/p}$ if $1 \\le p < \\infty$ and $\\|v\\|_p = \\max_{1 \\le j \\le d} |v_j|$ if $p = \\infty$. Given two matrices $A = [A_{jk}]$ and $C = [C_{jk}] \\in \\mathbb{R}^{d \\times m}$, we define the inner product of $A$ and $C$ as $\\langle A, C \\rangle = \\sum_{j=1}^d \\sum_{k=1}^m A_{jk}C_{jk} = \\mathrm{tr}(A^T C)$, where $\\mathrm{tr}(A)$ is the trace of a matrix $A$. We use $A_{*k} = (A_{1k}, ..., A_{dk})^T$ and $A_{j*} = (A_{j1}, ..., A_{jm})$ to denote the $k$th column and $j$th row of $A$. Let $\\mathcal{S}$ be some subspace of $\\mathbb{R}^{d \\times m}$, we use $A_{\\mathcal{S}}$ to denote the projection of $A$ onto $\\mathcal{S}$: $A_{\\mathcal{S}} = \\operatorname*{argmin}_{C \\in \\mathcal{S}} \\|C - A\\|_F^2$. Moreover, we define the Frobenius and spectral norms of $A$ as $\\|A\\|_F = \\sqrt{\\langle A, A \\rangle}$ and $\\|A\\|_2$, the largest singular value of $A$. In addition, we define the matrix block norms as $\\|A\\|_{2,1} = \\sum_{k=1}^m \\|A_{*k}\\|_2$, $\\|A\\|_{2,\\infty} = \\max_{1 \\le k \\le m} \\|A_{*k}\\|_2$, $\\|A\\|_{1,p} = \\sum_{j=1}^d \\|A_{j*}\\|_p$, and $\\|A\\|_{\\infty,q} = \\max_{1 \\le j \\le d} \\|A_{j*}\\|_q$, where $1 \\le p \\le \\infty$ and $1 \\le q \\le \\infty$. It is easy to verify that $\\|A\\|_{2,\\infty}$ is the dual norm of $\\|A\\|_{2,1}$. 
Let $1/\\infty = 0$, then if $1/p + 1/q = 1$, $\\|A\\|_{\\infty,q}$ and $\\|A\\|_{1,p}$ are also dual norms of each other.\n\n2 Method\n\nWe solve the multivariate regression problem by the following convex program,\n\n$$\\widehat{B} = \\operatorname*{argmin}_{B} \\|Y - XB\\|_{2,1} + \\lambda\\|B\\|_{1,p}. \\qquad (2.1)$$\n\nThe only difference between (2.1) and (1.1) is that we replace the $L_2$-loss function by the nonsmooth $L_{2,1}$-loss function. The $L_{2,1}$-loss function can be viewed as a special example of the weighted square loss function. More specifically, we consider the following optimization problem,\n\n$$\\widehat{B} = \\operatorname*{argmin}_{B} \\sum_{k=1}^m \\frac{1}{\\sigma_k\\sqrt{n}}\\|Y_{*k} - XB_{*k}\\|_2^2 + \\lambda\\|B\\|_{1,p}, \\qquad (2.2)$$\n\nwhere $\\frac{1}{\\sigma_k\\sqrt{n}}$ is a weight assigned to calibrate the $k$th regression task. Without prior knowledge on $\\sigma_k$\u2019s, we use the following replacement of $\\sigma_k$\u2019s,\n\n$$\\widetilde{\\sigma}_k = \\frac{1}{\\sqrt{n}}\\|Y_{*k} - XB_{*k}\\|_2, \\quad k = 1, ..., m. \\qquad (2.3)$$\n\nBy plugging (2.3) into the objective function in (2.2), each summand reduces to $\\|Y_{*k} - XB_{*k}\\|_2$, and we get (2.1). In other words, CMR calibrates different tasks by solving a penalized weighted least square program with weights defined in (2.3).\n\nThe optimization problem in (2.1) can be solved by the alternating direction method of multipliers (ADMM) with a global convergence guarantee [20]. However, ADMM does not take full advantage of the problem structure in (2.1). For example, even though the $L_{2,1}$ norm is nonsmooth, it is nondifferentiable only when a task achieves exact zero residual, which is unlikely in applications. In this paper, we apply the dual smoothing technique proposed by [18] to obtain a smooth surrogate function so that we can avoid directly evaluating the subgradient of the $L_{2,1}$ loss function. Thus we gain computational efficiency like other smooth loss functions.\n\nWe consider the Fenchel\u2019s dual representation of the $L_{2,1}$ loss:\n\n$$\\|Y - XB\\|_{2,1} = \\max_{\\|U\\|_{2,\\infty} \\le 1} \\langle U, Y - XB \\rangle. \\qquad (2.4)$$\n\nLet $\\mu > 0$ be a smoothing parameter. 
The smooth approximation of the $L_{2,1}$ loss can be obtained by solving the following optimization problem\n\n$$\\|Y - XB\\|_\\mu = \\max_{\\|U\\|_{2,\\infty} \\le 1} \\langle U, Y - XB \\rangle - \\frac{\\mu}{2}\\|U\\|_F^2, \\qquad (2.5)$$\n\nwhere $\\|U\\|_F^2$ is the proximity function. Due to the fact that $\\|U\\|_F^2 \\le m\\|U\\|_{2,\\infty}^2$, we obtain the following uniform bound by combining (2.4) and (2.5),\n\n$$\\|Y - XB\\|_{2,1} - \\frac{m\\mu}{2} \\le \\|Y - XB\\|_\\mu \\le \\|Y - XB\\|_{2,1}. \\qquad (2.6)$$\n\nFrom (2.6), we see that the approximation error introduced by the smoothing procedure can be controlled by a suitable $\\mu$. Figure 2.1 shows several two-dimensional examples of the $L_2$ norm smoothed by different $\\mu$\u2019s. The optimization problem in (2.5) has a closed form solution $\\widehat{U}^B$ with $\\widehat{U}^B_{*k} = (Y_{*k} - XB_{*k})/\\max\\{\\|Y_{*k} - XB_{*k}\\|_2, \\mu\\}$. The next lemma shows that $\\|Y - XB\\|_\\mu$ is smooth in $B$ with a simple form of gradient.\n\n[Figure 2.1, panels: (a) $\\mu = 0$; (b) $\\mu = 0.1$; (c) $\\mu = 0.25$; (d) $\\mu = 0.5$]\n\nFigure 2.1: The $L_2$ norm ($\\mu = 0$) and its smooth surrogates with $\\mu = 0.1, 0.25, 0.5$. A larger $\\mu$ makes the approximation more smooth, but introduces a larger approximation error.\n\nLemma 2.1. For any $\\mu > 0$, $\\|Y - XB\\|_\\mu$ is a convex and continuously differentiable function in $B$. In addition, $G^\\mu(B)$, the gradient of $\\|Y - XB\\|_\\mu$ w.r.t. $B$, has the form\n\n$$G^\\mu(B) = \\frac{\\partial\\left(\\langle \\widehat{U}^B, Y - XB \\rangle - \\mu\\|\\widehat{U}^B\\|_F^2/2\\right)}{\\partial B} = -X^T\\widehat{U}^B. \\qquad (2.7)$$\n\nMoreover, let $\\gamma = \\|X\\|_2^2$, then we have that $G^\\mu(B)$ is Lipschitz continuous in $B$ with the Lipschitz constant $\\gamma/\\mu$, i.e., for any $B', B'' \\in \\mathbb{R}^{d \\times m}$,\n\n$$\\|G^\\mu(B') - G^\\mu(B'')\\|_F = \\|X^T(\\widehat{U}^{B'} - \\widehat{U}^{B''})\\|_F \\le \\frac{1}{\\mu}\\|X^T X(B' - B'')\\|_F \\le \\frac{\\gamma}{\\mu}\\|B' - B''\\|_F.$$\n\nLemma 2.1 is a direct result of Theorem 1 in [18] and implies that $\\|Y - XB\\|_\\mu$ has good computational structure. 
Therefore we apply the smooth proximal gradient algorithm to solve the smoothed version of the optimization problem as follows,\n\n$$\\widetilde{B} = \\operatorname*{argmin}_{B} \\|Y - XB\\|_\\mu + \\lambda\\|B\\|_{1,p}. \\qquad (2.8)$$\n\nWe then adopt the fast proximal gradient algorithm to solve (2.8) [4]. To derive the algorithm, we first define three sequences of auxiliary variables $\\{A^{(t)}\\}$, $\\{V^{(t)}\\}$, and $\\{H^{(t)}\\}$ with $A^{(0)} = H^{(0)} = V^{(0)} = B^{(0)}$, a sequence of weights $\\{\\theta_t = 2/(t + 1)\\}$, and a nonincreasing sequence of step-sizes $\\{\\eta_t > 0\\}$. For simplicity, we can set $\\eta_t = \\mu/\\gamma$. In practice, we use the backtracking line search to dynamically adjust $\\eta_t$ to boost the performance. At the $t$th iteration, we first take $V^{(t)} = (1 - \\theta_t)B^{(t-1)} + \\theta_t A^{(t-1)}$. We then consider a quadratic approximation of $\\|Y - XH\\|_\\mu$ as\n\n$$Q(H, V^{(t)}, \\eta_t) = \\|Y - XV^{(t)}\\|_\\mu + \\langle G^\\mu(V^{(t)}), H - V^{(t)} \\rangle + \\frac{1}{2\\eta_t}\\|H - V^{(t)}\\|_F^2.$$\n\nConsequently, let $\\widetilde{H}^{(t)} = V^{(t)} - \\eta_t G^\\mu(V^{(t)})$, we take\n\n$$H^{(t)} = \\operatorname*{argmin}_{H} Q(H, V^{(t)}, \\eta_t) + \\lambda\\|H\\|_{1,p} = \\operatorname*{argmin}_{H} \\frac{1}{2\\eta_t}\\|H - \\widetilde{H}^{(t)}\\|_F^2 + \\lambda\\|H\\|_{1,p}. \\qquad (2.9)$$\n\nWhen $p = 2$, (2.9) has a closed form solution $H^{(t)}_{j*} = \\widetilde{H}^{(t)}_{j*} \\cdot \\max\\{1 - \\eta_t\\lambda/\\|\\widetilde{H}^{(t)}_{j*}\\|_2, 0\\}$. More details about other choices of $p$ in the $L_{1,p}$ norm can be found in [11] and [12]. To ensure that the objective value is nonincreasing, we choose\n\n$$B^{(t)} = \\operatorname*{argmin}_{B \\in \\{H^{(t)}, B^{(t-1)}\\}} \\|Y - XB\\|_\\mu + \\lambda\\|B\\|_{1,p}. \\qquad (2.10)$$\n\nAt last, we take $A^{(t)} = B^{(t-1)} + \\frac{1}{\\theta_t}(H^{(t)} - B^{(t-1)})$. The algorithm stops when $\\|H^{(t)} - V^{(t)}\\|_F \\le \\varepsilon$, where $\\varepsilon$ is the stopping precision.\n\nThe numerical rate of convergence of the proposed algorithm with respect to the original optimization problem (2.1) is presented in the following theorem.\n\nTheorem 2.2. 
Given a pre-specified accuracy $\\epsilon$ and let $\\mu = \\epsilon/m$, after $t = 2\\sqrt{m\\gamma}\\|B^{(0)} - \\widehat{B}\\|_F/\\epsilon - 1 = O(1/\\epsilon)$ iterations, we have $\\|Y - XB^{(t)}\\|_{2,1} + \\lambda\\|B^{(t)}\\|_{1,p} \\le \\|Y - X\\widehat{B}\\|_{2,1} + \\lambda\\|\\widehat{B}\\|_{1,p} + \\epsilon$.\n\nThe proof of Theorem 2.2 is provided in Appendix A.1. This result achieves the minimax optimal rate of convergence over all first order algorithms [18].\n\n3 Statistical Properties\n\nFor notational simplicity, we define a re-scaled noise matrix $W = [W_{ik}] \\in \\mathbb{R}^{n \\times m}$ with $W_{ik} = Z_{ik}/\\sigma_k$, where $\\mathbb{E}Z_{ik}^2 = \\sigma_k^2$. Thus $W$ is a random matrix with all entries having mean 0 and variance 1. We define $G^0$ to be the gradient of $\\|Y - XB\\|_{2,1}$ at $B = B^0$. It is easy to see that\n\n$$G^0_{*k} = -\\frac{X^T Z_{*k}}{\\|Z_{*k}\\|_2} = -\\frac{X^T W_{*k}\\sigma_k}{\\|W_{*k}\\sigma_k\\|_2} = -\\frac{X^T W_{*k}}{\\|W_{*k}\\|_2}$$\n\ndoes not depend on the unknown quantities $\\sigma_k$ for all $k = 1, ..., m$. $G^0_{*k}$ serves as an important pivotal quantity in our analysis. Moreover, our analysis exploits the decomposability of the $L_{1,p}$ norm [17]. More specifically, we assume that $B^0$ has $s$ rows with nonzero entries and define\n\n$$\\mathcal{S} = \\{C \\in \\mathbb{R}^{d \\times m} \\mid C_{j*} = 0 \\text{ for all } j \\text{ such that } B^0_{j*} = 0\\}, \\qquad (3.1)$$\n\n$$\\mathcal{N} = \\{C \\in \\mathbb{R}^{d \\times m} \\mid C_{j*} = 0 \\text{ for all } j \\text{ such that } B^0_{j*} \\neq 0\\}. \\qquad (3.2)$$\n\nNote that we have $B^0 \\in \\mathcal{S}$ and the $L_{1,p}$ norm is decomposable with respect to the pair $(\\mathcal{S}, \\mathcal{N})$, i.e., $\\|A\\|_{1,p} = \\|A_{\\mathcal{S}}\\|_{1,p} + \\|A_{\\mathcal{N}}\\|_{1,p}$.\n\nThe next lemma shows that when $\\lambda$ is suitably chosen, the solution to the optimization problem in (2.1) lies in a restricted set.\n\nLemma 3.1. Let $B^0 \\in \\mathcal{S}$ and $\\widehat{B}$ be the optimum to (2.1), and $1/p + 1/q = 1$. We denote the estimation error as $\\widehat{\\Delta} = \\widehat{B} - B^0$. If $\\lambda \\ge c\\|G^0\\|_{\\infty,q}$ for some $c > 1$, we have\n\n$$\\widehat{\\Delta} \\in \\mathcal{M}_c := \\left\\{\\Delta \\in \\mathbb{R}^{d \\times m} \\mid \\|\\Delta_{\\mathcal{N}}\\|_{1,p} \\le \\frac{c + 1}{c - 1}\\|\\Delta_{\\mathcal{S}}\\|_{1,p}\\right\\}. \\qquad (3.3)$$\n\nThe proof of Lemma 3.1 is provided in Appendix B.1. To prove the main result, we also need to assume that the design matrix $X$ satisfies the following condition.\n\nAssumption 3.1. Let $B^0 \\in \\mathcal{S}$, then there exist positive constants $\\kappa$ and $c > 1$ such that\n\n$$\\kappa \\le \\min_{\\Delta \\in \\mathcal{M}_c \\setminus \\{0\\}} \\frac{\\|X\\Delta\\|_F}{\\sqrt{n}\\|\\Delta\\|_F}.$$\n\nAssumption 3.1 is the generalization of the restricted eigenvalue conditions for analyzing univariate sparse linear models [17, 15, 6]. Many common examples of random design satisfy this assumption [13, 21].\n\nNote that Lemma 3.1 is a deterministic result of the CMR estimator for a fixed $\\lambda$. Since $G^0$ is essentially a random matrix, we need to show that $\\lambda \\ge c\\|G^0\\|_{\\infty,q}$ holds with high probability to deliver a concrete rate of convergence for the CMR estimator in the next theorem.\n\nTheorem 3.2. We assume that each column of $X$ is normalized as $m^{1/2 - 1/p}\\|X_{*j}\\|_2 = \\sqrt{n}$ for all $j = 1, ..., d$. Then for some universal constant $c_0$ and large enough $n$, taking\n\n$$\\lambda = \\frac{2c(m^{1-1/p} + \\sqrt{\\log d})}{\\sqrt{1 - c_0}}, \\qquad (3.4)$$\n\nwith probability at least $1 - 2\\exp(-2\\log d) - 2\\exp(-nc_0^2/8 + \\log m)$, we have\n\n$$\\frac{1}{\\sqrt{m}}\\|\\widehat{B} - B^0\\|_F \\le \\frac{16c\\sigma_{\\max}}{\\kappa^2(c - 1)}\\sqrt{\\frac{1 + c_0}{1 - c_0}}\\left(\\sqrt{\\frac{sm^{1-2/p}}{n}} + \\sqrt{\\frac{s\\log d}{nm}}\\right).$$\n\nThe proof of Theorem 3.2 is provided in Appendix B.2. Note that when we choose $p = 2$, the column normalization condition is reduced to $\\|X_{*j}\\|_2 = \\sqrt{n}$. Meanwhile, the corresponding error bound is further reduced to\n\n$$\\frac{1}{\\sqrt{m}}\\|\\widehat{B} - B^0\\|_F = O_P\\left(\\sqrt{\\frac{s}{n}} + \\sqrt{\\frac{s\\log d}{nm}}\\right),$$\n\nwhich achieves the minimax optimal rate of convergence presented in [13]. See Theorem 6.1 in [13] for more technical details. From Theorem 3.2, we see that CMR achieves the same rates of convergence as the noncalibrated counterpart, but the tuning parameter $\\lambda$ in (3.4) does not involve $\\sigma_k$\u2019s. Therefore CMR not only calibrates all the regression tasks, but also makes the tuning parameter selection insensitive to $\\sigma_{\\max}$.\n\n4 Numerical Simulations\n\nTo compare the finite-sample performance between the calibrated multivariate regression (CMR) and ordinary multivariate regression (OMR), we generate a training dataset of 200 samples. 
More specifically, we use the following data generation scheme: (1) Generate each row of the design matrix $X_{i*}$, $i = 1, ..., 200$, independently from a 800-dimensional normal distribution $N(0, \\Sigma)$ where $\\Sigma_{jj} = 1$ and $\\Sigma_{j\\ell} = 0.5$ for all $\\ell \\neq j$. (2) Let $k = 1, \\ldots, 13$, set the regression coefficient matrix $B^0 \\in \\mathbb{R}^{800 \\times 13}$ as $B^0_{1k} = 3$, $B^0_{2k} = 2$, $B^0_{4k} = 1.5$, and $B^0_{jk} = 0$ for all $j \\neq 1, 2, 4$. (3) Generate the random noise matrix $Z = WD$, where $W \\in \\mathbb{R}^{200 \\times 13}$ with all entries of $W$ independently generated from $N(0, 1)$, and $D$ is either of the following matrices\n\n$$D_I = \\sigma_{\\max} \\cdot \\mathrm{diag}\\left(2^{0/4}, 2^{-1/4}, \\cdots, 2^{-11/4}, 2^{-12/4}\\right) \\in \\mathbb{R}^{13 \\times 13},$$\n\n$$D_H = \\sigma_{\\max} \\cdot I \\in \\mathbb{R}^{13 \\times 13}.$$\n\nWe generate a validation set of 200 samples for the regularization parameter selection and a testing set of 10,000 samples to evaluate the prediction accuracy.\n\nIn numerical experiments, we set $\\sigma_{\\max} = 1$, 2, and 4 to illustrate the tuning insensitivity of CMR. The regularization parameter $\\lambda$ of both CMR and OMR is chosen over a grid $\\Lambda = \\{2^{40/4}\\lambda_0, 2^{39/4}\\lambda_0, \\cdots, 2^{-17/4}\\lambda_0, 2^{-18/4}\\lambda_0\\}$, where $\\lambda_0 = \\sqrt{\\log d} + \\sqrt{m}$. The optimal regularization parameter $\\widehat{\\lambda}$ is determined by the prediction error as $\\widehat{\\lambda} = \\operatorname*{argmin}_{\\lambda \\in \\Lambda} \\|\\widetilde{Y} - \\widetilde{X}\\widehat{B}^\\lambda\\|_F^2$, where $\\widehat{B}^\\lambda$ denotes the obtained estimate using the regularization parameter $\\lambda$, and $\\widetilde{X}$ and $\\widetilde{Y}$ denote the design and response matrices of the validation set.\n\nSince the noise level $\\sigma_k$\u2019s are different in regression tasks, we adopt the following three criteria to evaluate the empirical performance: Pre. Err. $= \\frac{1}{10000}\\|Y - X\\widehat{B}\\|_F^2$, Adj. Pre. Err. $= \\frac{1}{10000m}\\|(Y - X\\widehat{B})D^{-1}\\|_F^2$, and Est. Err. $= \\frac{1}{m}\\|\\widehat{B} - B^0\\|_F^2$, where $X$ and $Y$ denote the design and response matrices of the testing set.\n\nAll simulations are implemented by MATLAB using a PC with Intel Core i5 3.3GHz CPU and 16GB memory. 
CMR is solved by the proposed smoothing proximal gradient algorithm, where we set the stopping precision $\\varepsilon = 10^{-4}$, the smoothing parameter $\\mu = 10^{-4}$. OMR is solved by the monotone fast proximal gradient algorithm, where we set the stopping precision $\\varepsilon = 10^{-4}$. We set $p = 2$, but the extension to arbitrary $p > 2$ is straightforward.\n\nWe first compare the smoothed proximal gradient (SPG) algorithm with the ADMM algorithm (the detailed derivation of ADMM can be found in Appendix A.2). We adopt the backtracking line search to accelerate both algorithms with a shrinkage parameter $\\alpha = 0.8$. We set $\\sigma_{\\max} = 2$ for the adopted multivariate linear models. We conduct 200 simulations. The results are presented in Table 4.1. The SPG and ADMM algorithms attain similar objective values, but SPG is up to about 4 times faster than ADMM. Both algorithms also achieve similar estimation errors.\n\nWe then compare the statistical performance between CMR and OMR. Tables 4.2 and 4.3 summarize the results averaged over 200 replicates. In addition, we also present the results of the oracle estimator, which is obtained by solving (2.2), since we know the true values of $\\sigma_k$\u2019s. Note that the oracle estimator is only for comparison purpose, and it is not a practical estimator. Since CMR calibrates the regularization for each task with respect to $\\sigma_k$, CMR universally outperforms OMR, and achieves almost the same performance as the oracle estimator when we adopt the scale matrix $D_I$ to generate the random noise. Meanwhile, when we adopt the scale matrix $D_H$, where all $\\sigma_k$\u2019s are the same, CMR and OMR achieve similar performance. This further implies that CMR can be a safe replacement of OMR for multivariate regressions.\n\nIn addition, we also examine the optimal regularization parameters for CMR and OMR over all replicates. We visualize the distribution of all 200 selected $\\widehat{\\lambda}$\u2019s using the kernel density estimator. In particular, we adopt the Gaussian kernel, and the kernel bandwidth is selected based on the 10-fold cross validation. Figure 4.1 illustrates the estimated density functions. The horizontal axis corresponds to the rescaled regularization parameter as $\\log\\left(\\frac{\\widehat{\\lambda}}{\\sqrt{\\log d} + \\sqrt{m}}\\right)$. We see that the optimal regularization parameters of OMR significantly vary with different $\\sigma_{\\max}$. In contrast, the optimal regularization parameters of CMR are more concentrated. This is consistent with our claimed tuning insensitivity.\n\nTable 4.1: Quantitative comparison of the computational performance between SPG and ADMM with the noise matrices generated using $D_I$. The results are averaged over 200 replicates with standard errors in parentheses. SPG and ADMM attain similar objective values, but SPG is up to about 4 times faster than ADMM.\n\n$\\lambda$\tAlgorithm\tTiming (second)\tObj. Val.\tNum. Ite.\tEst. Err.\n$2\\lambda_0$\tSPG\t2.8789(0.3141)\t508.21(3.8498)\t493.26(52.268)\t0.1213(0.0286)\n$2\\lambda_0$\tADMM\t8.4731(0.8387)\t508.22(3.7059)\t437.7(37.4532)\t0.1215(0.0291)\n$\\lambda_0$\tSPG\t3.2633(0.3200)\t370.53(3.6144)\t565.80(54.919)\t0.0819(0.0205)\n$\\lambda_0$\tADMM\t11.976(1.460)\t370.53(3.4231)\t600.94(74.629)\t0.0822(0.0233)\n$0.5\\lambda_0$\tSPG\t3.7868(0.4551)\t297.24(3.6125)\t652.53(78.140)\t0.1399(0.0284)\n$0.5\\lambda_0$\tADMM\t18.360(1.9678)\t297.25(3.3863)\t1134.0(136.08)\t0.1409(0.0317)\n\nTable 4.2: Quantitative comparison of the statistical performance between CMR and OMR with the noise matrices generated using $D_I$. The results are averaged over 200 simulations with the standard errors in parentheses. CMR universally outperforms OMR, and achieves almost the same performance as the oracle estimator.\n\n$\\sigma_{\\max}$\tMethod\tPre. Err.\tAdj. Pre. Err.\tEst. Err.\n1\tOracle\t5.8759(0.0834)\t1.0454(0.0149)\t0.0245(0.0086)\n1\tCMR\t5.8761(0.0673)\t1.0459(0.0123)\t0.0249(0.0071)\n1\tOMR\t5.9012(0.0701)\t1.0581(0.0162)\t0.0290(0.0091)\n2\tOracle\t23.464(0.3237)\t1.0441(0.0148)\t0.0926(0.0342)\n2\tCMR\t23.465(0.2598)\t1.0446(0.0121)\t0.0928(0.0279)\n2\tOMR\t23.580(0.2832)\t1.0573(0.0170)\t0.1115(0.0365)\n4\tOracle\t93.532(0.8843)\t1.0418(0.0962)\t0.3342(0.1255)\n4\tCMR\t93.542(0.9794)\t1.0421(0.0118)\t0.3346(0.1063)\n4\tOMR\t94.094(1.0978)\t1.0550(0.0166)\t0.4125(0.1417)\n\nTable 4.3: Quantitative comparison of the statistical performance between CMR and OMR with the noise matrices generated using $D_H$. The results are averaged over 200 simulations with the standard errors in parentheses. CMR and OMR achieve similar performance.\n\n$\\sigma_{\\max}$\tMethod\tPre. Err.\tAdj. Pre. Err.\tEst. Err.\n1\tCMR\t13.565(0.1408)\t1.0435(0.0108)\t0.0599(0.0164)\n1\tOMR\t13.697(0.1554)\t1.0486(0.0142)\t0.0607(0.0128)\n2\tCMR\t54.171(0.5771)\t1.0418(0.0110)\t0.2252(0.0649)\n2\tOMR\t54.221(0.6173)\t1.0427(0.0118)\t0.2359(0.0821)\n4\tCMR\t215.98(2.104)\t1.0384(0.0101)\t0.80821(0.25078)\n4\tOMR\t216.19(2.391)\t1.0394(0.0114)\t0.81957(0.31806)\n\n[Figure 4.1, panels: (a) The noise matrices are generated using $D_I$ (curves: Oracle, CMR, and OMR for $\\sigma_{\\max} = 1, 2, 4$); (b) The noise matrices are generated using $D_H$ (curves: CMR and OMR for $\\sigma_{\\max} = 1, 2, 4$)]\n\nFigure 4.1: The distributions of the selected regularization parameters using the kernel density estimator. The numbers in the parentheses are $\\sigma_{\\max}$\u2019s. The optimal regularization parameters of OMR are more spread out across different $\\sigma_{\\max}$ than those of CMR and the oracle estimator.\n\n5 Real Data Experiment\n\nWe apply CMR on a brain activity prediction problem which aims to build a parsimonious model to predict a person\u2019s neural activity when seeing a stimulus word. 
As is illustrated in Figure 5.1, for a given stimulus word, we first encode it into an intermediate semantic feature vector using some corpus statistics. We then model the brain\u2019s neural activity pattern using CMR. Creating such a predictive model not only enables us to explore new analytical tools for the fMRI data, but also helps us to gain deeper understanding on how human brain represents knowledge [16].\n\n[Figure 5.1, panels: (a) illustration of the data collection procedure; (b) model for predicting fMRI brain activity pattern]\n\nFigure 5.1: An illustration of the fMRI brain activity prediction problem [16]. (a) To collect the data, a human participant sees a sequence of English words and their images. The corresponding fMRI images are recorded to represent the brain activity patterns; (b) To build a predictive model, each stimulus word is encoded into intermediate semantic features (e.g. the co-occurrence statistics of this stimulus word in a large text corpus). These intermediate features can then be used to predict the brain activity pattern.\n\nOur experiments involve 9 participants, and Table 5.1 summarizes the prediction performance of different methods on these participants. 
We see that the prediction based on the features selected by CMR significantly outperforms that based on the features selected by OMR, and is as competitive as that based on the handcrafted features selected by human experts. But due to the space limit, we present the details of the real data experiment in the technical report version.\n\nTable 5.1: Prediction accuracies of different methods (higher is better). CMR outperforms OMR for 8 out of 9 participants, and outperforms the handcrafted basis words for 6 out of 9 participants.\n\nMethod\tP. 1\tP. 2\tP. 3\tP. 4\tP. 5\tP. 6\tP. 7\tP. 8\tP. 9\nCMR\t0.840\t0.794\t0.861\t0.651\t0.823\t0.722\t0.738\t0.720\t0.780\nOMR\t0.803\t0.789\t0.801\t0.602\t0.766\t0.623\t0.726\t0.749\t0.765\nHandcraft\t0.822\t0.776\t0.773\t0.727\t0.782\t0.865\t0.734\t0.685\t0.819\n\n6 Discussions\n\nA related method is the square-root sparse multivariate regression [8]. They solve the convex program with the Frobenius loss function and $L_{1,p}$ regularization function\n\n$$\\widehat{B} = \\operatorname*{argmin}_{B} \\|Y - XB\\|_F + \\lambda\\|B\\|_{1,p}. \\qquad (6.1)$$\n\nThe Frobenius loss function in (6.1) makes the regularization parameter selection independent of $\\sigma_{\\max}$, but it does not calibrate different regression tasks. Note that we can rewrite (6.1) as\n\n$$(\\widehat{B}, \\widehat{\\sigma}) = \\operatorname*{argmin}_{B, \\sigma} \\frac{1}{\\sqrt{nm}\\,\\sigma}\\|Y - XB\\|_F^2 + \\lambda\\|B\\|_{1,p} \\quad \\text{s.t.} \\quad \\sigma = \\frac{1}{\\sqrt{nm}}\\|Y - XB\\|_F. \\qquad (6.2)$$\n\nSince $\\sigma$ in (6.2) is not specific to any individual task, it cannot calibrate the regularization. Thus it is fundamentally different from CMR.\n\nReferences\n\n[1] T. W. Anderson. An introduction to multivariate statistical analysis. Wiley New York, 1958.\n[2] Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research, 6(11):1817\u20131853, 2005.\n[3] J. Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149\u2013198, 2000.\n[4] A. Beck and M. Teboulle. 
Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, 18(11):2419\u20132434, 2009.\n[5] A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791\u2013806, 2011.\n[6] Peter J. Bickel, Yaacov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, 37(4):1705\u20131732, 2009.\n[7] L. Breiman and J. H. Friedman. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B, 59(1):3\u201354, 2002.\n[8] Florentina Bunea, Johannes Lederer, and Yiyuan She. The group square-root lasso: Theoretical properties and fast algorithms. IEEE Transactions on Information Theory, 60:1313\u20131325, 2013.\n[9] Iain M. Johnstone. Chi-square oracle inequalities. Lecture Notes-Monograph Series, pages 399\u2013418, 2001.\n[10] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer, 2011.\n[11] H. Liu, M. Palatucci, and J. Zhang. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 649\u2013656. ACM, 2009.\n[12] J. Liu and J. Ye. Efficient $\\ell_1/\\ell_q$ norm regularization. Technical report, Arizona State University, 2010.\n[13] K. Lounici, M. Pontil, S. Van De Geer, and A. B. Tsybakov. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 39(4):2164\u20132204, 2011.\n[14] N. Meinshausen and P. B\u00fchlmann. Stability selection. Journal of the Royal Statistical Society: Series B, 72(4):417\u2013473, 2010.\n[15] Nicolai Meinshausen and Bin Yu. Lasso-type recovery of sparse representations for high-dimensional data. 
The Annals of Statistics, 37(1):246\u2013270, 2009.\n[16] T. M. Mitchell, S. V. Shinkareva, A. Carlson, K. M. Chang, V. L. Malave, R. A. Mason, and M. A. Just. Predicting human brain activity associated with the meanings of nouns. Science, 320(5880):1191\u20131195, 2008.\n[17] Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. Statistical Science, 27(4):538\u2013557, 2012.\n[18] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127\u2013152, 2005.\n[19] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional multivariate regression. The Annals of Statistics, 39(1):1\u201347, 2011.\n[20] Hua Ouyang, Niao He, Long Tran, and Alexander Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, pages 80\u201388, 2013.\n[21] Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated gaussian designs. The Journal of Machine Learning Research, 11(8):2241\u20132259, 2010.\n[22] A. J. Rothman, E. Levina, and J. Zhu. Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 19(4):947\u2013962, 2010.\n[23] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267\u2013288, 1996.\n[24] B. A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3):349\u2013363, 2005.\n[25] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 68(1):49\u201367, 2005.\n[26] Jian Zhang. A probabilistic framework for multi-task learning. 
PhD thesis, Carnegie Mellon University, Language Technologies Institute, School of Computer Science, 2006.", "award": [], "sourceid": 120, "authors": [{"given_name": "Han", "family_name": "Liu", "institution": "Princeton University"}, {"given_name": "Lie", "family_name": "Wang", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Tuo", "family_name": "Zhao", "institution": "Princeton University and Johns Hopkins University"}]}