{"title": "Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 3185, "page_last": 3193, "abstract": "Multiple-output regression models require estimating multiple functions, one for each output. To improve parameter estimation in such models, methods based on structural regularization of the model parameters are usually needed. In this paper, we present a multiple-output regression model that leverages the covariance structure of the functions (i.e., how the multiple functions are related with each other) as well as the conditional covariance structure of the outputs. This is in contrast with existing methods that usually take into account only one of these structures. More importantly, unlike most of the other existing methods, none of these structures need be known a priori in our model, and are learned from the data. Several previously proposed structural regularization based multiple-output regression models turn out to be special cases of our model. Moreover, in addition to being a rich model for multiple-output regression, our model can also be used in estimating the graphical model structure of a set of variables (multivariate outputs) conditioned on another set of variables (inputs). Experimental results on both synthetic and real datasets demonstrate the effectiveness of our method.", "full_text": "Simultaneously Leveraging Output and Task\nStructures for Multiple-Output Regression\n\nPiyush Rai\u2020\n\nDept. of Computer Science\nUniversity of Texas at Austin\n\nAustin, TX\n\nAbhishek Kumar\u2020\n\nDept. of Computer Science\n\nUniversity of Maryland\n\nCollege Park, MD\n\nHal Daum\u00b4e III\n\nDept. of Computer Science\n\nUniversity of Maryland\n\nCollege Park, MD\n\npiyush@cs.utexas.edu\n\nabhishek@cs.umd.edu\n\nhal@umiacs.umd.edu\n\nAbstract\n\nMultiple-output regression models require estimating multiple parameters, one for\neach output. 
Structural regularization is usually employed to improve parameter\nestimation in such models. In this paper, we present a multiple-output regression\nmodel that leverages the covariance structure of the latent model parameters as\nwell as the conditional covariance structure of the observed outputs. This is in\ncontrast with existing methods that usually take into account only one of these\nstructures. More importantly, unlike some of the other existing methods, none of\nthese structures need be known a priori in our model, and are learned from the\ndata. Several previously proposed structural regularization based multiple-output\nregression models turn out to be special cases of our model. Moreover, in addition\nto being a rich model for multiple-output regression, our model can also be used in\nestimating the graphical model structure of a set of variables (multivariate outputs)\nconditioned on another set of variables (inputs). Experimental results on both\nsynthetic and real datasets demonstrate the effectiveness of our method.\n\nIntroduction\n\n1\nMultivariate response prediction, also known as multiple-output regression [3] when the responses\nare real-valued vectors, is an important problem in machine learning and statistics. The goal in\nmultiple-output regression is to learn a model for predicting K > 1 real-valued responses (the\noutput) from D predictors or features (the input), given a training dataset consisting of N input-\noutput pairs. Multiple-output prediction is also an instance of the problem of multitask learning [5,\n10] where predicting each output is a task and all the tasks share the same input data. Multiple-\noutput regression problems are encountered frequently in various application domains. 
For example, in computational biology [11], we often want to predict the gene-expression levels of multiple genes based on a set of single nucleotide polymorphisms (SNPs); in econometrics [17], we often want to predict future stock prices using relevant macro-economic variables and past stock prices as inputs; in geostatistics, we are often interested in jointly predicting the concentration levels of different heavy metal pollutants [9]; and so on.

One distinguishing aspect of multiple-output regression is that the outputs are often related to each other via some underlying (and often a priori unknown) structure. A part of this can be captured by imposing a relatedness structure among the regression coefficients (e.g., the weight vectors in a linear regression model) of all the outputs. We refer to the relatedness structure among the regression coefficients as task structure. However, there can still be some structure left in the outputs that is not explained by the regression coefficients alone. This can be due to the limited expressive power of our chosen hypothesis class (e.g., the linear predictors considered in this paper). The residual structure that is left once we condition on the inputs will be referred to as output structure here. This can also be seen as the covariance structure of the output noise. 
It is therefore desirable to simultaneously learn and leverage both the output structure and the task structure in multiple-output regression models for improved parameter estimation and prediction accuracy.

†Contributed equally

Although some of the existing multiple-output regression models have attempted to incorporate such structures [17, 11, 13], most of these models are restrictive in the sense that (1) they usually exploit only one of the two structures (output structure or task structure, but not both), and (2) they assume availability of prior information about such structures, which may not always be available. For example, Multivariate Regression with Covariance Estimation (MRCE) [17] is a recently proposed method which learns the output structure (in the form of the covariance matrix of the correlated noise across the multiple outputs) along with the regression coefficients (i.e., the weight vector) for predicting each output. However, MRCE does not explicitly model the relationships among the regression coefficients of the multiple tasks and therefore fails to account for the task structure. More recently, [14] proposed an extension of the MRCE model that allows weighting the individual entries of the regression coefficients and the entries of the output (inverse) covariance matrix, but otherwise this model has essentially the same properties as MRCE. Among other works, Graph-guided Fused Lasso (GFlasso) [11] incorporates task structure to some degree by assuming that the regression coefficients of all the outputs have similar sparsity patterns. This amounts to assuming that all the outputs share almost the same set of relevant features. However, GFlasso assumes that the output graph structure is known, which is rarely true in practice. 
Some other methods such as [13] take into account the task structure by imposing structural sparsity on the regression coefficients of the multiple tasks, but again assume that the output structure is known a priori and/or is of a specific form. In [22], the authors proposed a multitask learning model that explicitly models the task structure as a task covariance matrix, but this model does not take into account the output structure, which is important in multiple-output regression problems.

In this paper, we present a multiple-output regression model that allows leveraging both output structure and task structure without assuming a priori knowledge of either. In our model, both output structure and task structure are learned from the data, along with the regression coefficients for each task. Specifically, we model the output structure using the (inverse) covariance matrix of the correlated noise across the multiple outputs, and the task structure using the (inverse) covariance matrix of the regression coefficients of the multiple tasks being learned in the model. By explicitly modeling and learning the output structure and task structure, our model also addresses the limitations of existing models that typically assume certain specific types of output structures (e.g., a tree [13]) or task structures (e.g., shared sparsity [11]). 
In particular, a model with a task relatedness structure based on shared sparsity of the task weight vectors may not be appropriate in many real applications where all the features are important for prediction and the true task structure is at a higher level (e.g., the weight vectors of some tasks are closer to each other than to others). Apart from providing a flexible way of learning multiple-output regression, our model can also be used for the problem of conditional inverse covariance estimation of the (multivariate) outputs that depend on another set of input variables, an important problem that has been gaining significant attention recently [23, 15, 20, 4, 7, 6].

2 Multiple-Output Regression

In multiple-output regression, each input is associated with a vector of responses and the goal is to learn the input-output relationship given some training data consisting of input-output pairs. Formally, given an N × D input matrix X = [x1, . . . , xN]⊤ and an N × K output matrix Y = [y1, . . . , yN]⊤, the goal in multiple-output regression is to learn the functional relationship between the inputs xn ∈ R^D and the outputs yn ∈ R^K. For a linear regression model, we write:

yn = W⊤xn + b + ϵn, ∀n = 1, . . . , N    (1)

Here W = [w1, . . . , wK] denotes the D × K matrix where wk denotes the regression coefficients of the k-th output, b = [b1, . . . , bK]⊤ ∈ R^K is a vector of bias terms for the K outputs, and ϵn = [ϵn1, . . . , ϵnK]⊤ ∈ R^K is a vector consisting of the noise for each of the K outputs. The noise is typically assumed to be Gaussian with zero mean and uncorrelated across the K outputs.

Standard parameter estimation for Equation 1 involves maximizing the (penalized) log-likelihood of the model, or equivalently minimizing the (regularized) loss function over the training data:

arg min_{W,b} tr((Y − XW − 1b⊤)(Y − XW − 1b⊤)⊤) + λ R(W)    (2)

where tr(·) denotes the matrix trace, 1 an N × 1 vector of all 1s, and R(W) the regularizer on the weight matrix W consisting of the regression weight vectors of all the outputs. For a choice of R(W) = tr(W⊤W) (the squared ℓ2 norm, equivalent to assuming independent, zero-mean Gaussian priors on the weight vectors), solving Equation 2 amounts to solving K independent regression problems, and this solution ignores any correlations among the outputs or among the weight vectors.

3 Multiple-Output Regression with Output and Task Structures

To take into account both the conditional output covariance and the covariance among the weight vectors W = [w1, . . . , wK], we assume a full covariance matrix Ω of size K × K on the output noise distribution to capture the conditional output covariance, and a structured prior distribution on the weight matrix W that induces structural regularization of W. We place the following prior distribution on W:

p(W) ∝ ∏_{k=1}^{K} Nor(wk | 0, I_D) · MN_{D×K}(W | 0_{D×K}, I_D ⊗ Σ)    (3)

where MN_{D×K}(M, A ⊗ B) denotes the matrix-variate normal distribution with M ∈ R^{D×K} being its mean, A ∈ R^{D×D} its row-covariance matrix, and B ∈ R^{K×K} its column-covariance matrix. Here ⊗ denotes the Kronecker product. In this prior distribution, the Nor(wk | 0, I_D) factors regularize the weight vectors wk individually, and the MN_{D×K}(W | 0_{D×K}, I_D ⊗ Σ) term couples the K weight vectors, allowing them to share statistical strength.

To derive our objective function, we start by writing down the likelihood of the model, for a set of N i.i.d. 
observations:

∏_{n=1}^{N} p(yn | xn, W, b) = ∏_{n=1}^{N} Nor(yn | W⊤xn + b, Ω)    (4)

In the above, a diagonal Ω would imply that the K outputs are all conditionally independent of each other. In this paper, we assume a full Ω, which will allow us to capture the conditional output correlations.

Combining the prior on W and the likelihood, we can write down the posterior distribution of W:

p(W | X, Y, b, Ω, Σ) ∝ p(W) ∏_{n=1}^{N} p(yn | xn, W, b)
= ∏_{k=1}^{K} Nor(wk | 0, I_D) · MN_{D×K}(W | 0_{D×K}, I_D ⊗ Σ) · ∏_{n=1}^{N} Nor(yn | W⊤xn + b, Ω)

Taking the log of the above and simplifying the resulting expression, we can then write the negative log-posterior of W as (ignoring the constants):

tr((Y − XW − 1b⊤) Ω⁻¹ (Y − XW − 1b⊤)⊤) + N log |Ω| + tr(WW⊤) + tr(W Σ⁻¹ W⊤) + D log |Σ|

where 1 denotes an N × 1 vector of all 1s. Note that in the term tr(W Σ⁻¹ W⊤), the inverse covariance matrix Σ⁻¹ plays the role of coupling pairs of weight vectors, and therefore controls the amount of sharing between any pair of tasks. The task covariance matrix Σ as well as the conditional output covariance matrix Ω will be learned from the data. For reasons that will become apparent later, we parameterize our model in terms of the inverse covariance matrices Ω⁻¹ and Σ⁻¹ instead of the covariance matrices. 
With this parameterization, the negative log-posterior becomes:

tr((Y − XW − 1b⊤) Ω⁻¹ (Y − XW − 1b⊤)⊤) − N log |Ω⁻¹| + tr(WW⊤) + tr(W Σ⁻¹ W⊤) − D log |Σ⁻¹|    (5)

The objective function in Equation 5 naturally imposes positive-definite constraints on the inverse covariance matrices Ω⁻¹ and Σ⁻¹. In addition, we will impose sparsity constraints (via an ℓ1 penalty) on Ω⁻¹ and Σ⁻¹. Sparsity on these parameters is appealing in this context for two reasons: (1) sparsity leads to improved, robust estimates [19, 8] of Ω⁻¹ and Σ⁻¹, and (2) sparsity supports the notion that the output correlations and the task correlations tend to be sparse [21, 4, 8] – not all pairs of outputs are related (given the inputs and other outputs), and likewise not all task pairs (and therefore the corresponding weight vectors) are related. Finally, we will also introduce regularization hyperparameters to control the trade-off between data fit and model complexity. Parameter estimation in the model involves minimizing the negative log-posterior, which is equivalent to minimizing the (regularized) loss function. The minimization problem is given as

arg min_{W, b, Σ⁻¹, Ω⁻¹} tr((Y − XW − 1b⊤) Ω⁻¹ (Y − XW − 1b⊤)⊤) − N log |Ω⁻¹| + λ tr(WW⊤) + λ1 tr(W Σ⁻¹ W⊤) − D log |Σ⁻¹| + λ2 ||Ω⁻¹||1 + λ3 ||Σ⁻¹||1    (6)

where ||A||1 denotes the sum of absolute values of the matrix A. Note that by replacing the regularizer tr(WW⊤) with a sparsity-inducing regularizer on the individual weight vectors w1, . . . , wK, one can also learn Lasso-like sparsity [19] in the regression weights. 
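As a concrete illustration, the full objective in Equation 6 is straightforward to evaluate numerically. Below is a minimal sketch assuming NumPy; the function name and the toy dimensions are illustrative, not part of the paper:

```python
import numpy as np

def mrots_objective(X, Y, W, b, Omega_inv, Sigma_inv, lam, lam1, lam2, lam3):
    """Value of the regularized objective in Equation 6 (up to additive constants)."""
    N = X.shape[0]
    D, K = W.shape
    R = Y - X @ W - b              # N x K residual matrix, b broadcast over rows
    obj = np.trace(R @ Omega_inv @ R.T)            # data fit under correlated noise
    obj -= N * np.linalg.slogdet(Omega_inv)[1]     # -N log|Omega^{-1}|
    obj += lam * np.trace(W @ W.T)                 # ridge penalty on the weights
    obj += lam1 * np.trace(W @ Sigma_inv @ W.T)    # task-structure coupling term
    obj -= D * np.linalg.slogdet(Sigma_inv)[1]     # -D log|Sigma^{-1}|
    obj += lam2 * np.abs(Omega_inv).sum()          # l1 penalty on Omega^{-1}
    obj += lam3 * np.abs(Sigma_inv).sum()          # l1 penalty on Sigma^{-1}
    return obj
```

With Ω⁻¹ = Σ⁻¹ = I, the data-fit and weight-penalty terms collapse to the squared loss of K independent ridge regressions, consistent with the discussion after Equation 2.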
In this exposition, however,\nwe consider \u21132 regularization on the regression weights and let the tr(W\u03a3\u22121W\u22a4) term capture the\nsimilarity between the weights of two tasks by learning the task inverse covariance matrix \u03a3\u22121. The\nabove cost function is not jointly convex in the variables but is individually convex in each variable\nwhen others are \ufb01xed. We adopt an alternating optimization strategy that was empirically observed\nto converge in all our experiments. More details are provided in the experiments section. Finally,\nalthough it is not the main goal of this paper, since our model provides an estimate of the inverse\ncovariance structure \u2126\u22121 of the outputs conditioned on the inputs, it can also be used for the more\ngeneral problem of estimating the conditional inverse covariance [23, 15, 20, 4, 7] of a set of vari-\nables y = {y1, . . . , yK} conditioned on another set of variables x = {x1, . . . , xD}, given paired\nsamples of the form {(x1, y1), . . . , (xN , yN )}.\n\n3.1 Special Cases\n\nIn this section, we show that our model subsumes/generalizes some previously proposed models for\nmultiple-output regression. 
Some of these include:

• Multivariate Regression with Covariance Estimation (MRCE-ℓ2): With the task inverse covariance matrix Σ⁻¹ = I_K and the bias term set to zero, our model results in the ℓ2-regularized-weights variant of the MRCE model [17], which would be equivalent to minimizing the following objective:

arg min_{W, Ω⁻¹} tr((Y − XW) Ω⁻¹ (Y − XW)⊤) + λ tr(WW⊤) − N log |Ω⁻¹| + λ2 ||Ω⁻¹||1

• Multitask Relationship Learning for Regression (MTRL): With the output inverse covariance matrix Ω⁻¹ = I_K and the sparsity constraint on Σ⁻¹ dropped, our model results in the regression version of the multitask relationship learning model proposed in [22]. Specifically, the corresponding objective function would be:

arg min_{W, Σ⁻¹} tr((Y − XW)(Y − XW)⊤) + λ tr(WW⊤) + λ1 tr(W Σ⁻¹ W⊤) − D log |Σ⁻¹|

In [22], the − log |Σ⁻¹| term is dropped since the authors solve their cost function in terms of Σ and this term is concave in Σ. A constraint of tr(Σ) = 1 was introduced in its place to restrict the complexity of the model. We keep the log | · | term in our cost function since we parameterize our model in terms of Σ⁻¹, and − log |Σ⁻¹| is convex in Σ⁻¹.

3.2 Optimization

We take an alternating optimization approach to solve the optimization problem given by Equation 6. Each sub-problem in the alternating optimization steps is convex. The matrices Σ and Ω are initialized to I in the beginning. The bias vector b is initialized to (1/N) Y⊤1.

Optimization w.r.t. 
W when Ω⁻¹, Σ⁻¹ and b are fixed:
Given Ω⁻¹, Σ⁻¹, b, the matrix W consisting of the regression weight vectors of all the tasks can be obtained by solving the following optimization problem:

arg min_W tr((Y − XW − 1b⊤) Ω⁻¹ (Y − XW − 1b⊤)⊤) + λ tr(WW⊤) + λ1 tr(W Σ⁻¹ W⊤)    (7)

The estimate Ŵ is given by solving the following system of linear equations w.r.t. W:

[(Ω⁻¹ ⊗ X⊤X) + ((λ1 Σ⁻¹ + λ I_K) ⊗ I_D)] vec(W) = vec(X⊤(Y − 1b⊤) Ω⁻¹)    (8)

It is easy to see that with Ω and Σ set to the identity, the model becomes equivalent to solving K regularized independent linear regression problems.

Optimization w.r.t. b when Ω⁻¹, Σ⁻¹ and W are fixed:
Given Ω⁻¹, Σ⁻¹, W, the bias vector b for all the K outputs can be obtained by solving the following optimization problem:

arg min_b tr((Y − XW − 1b⊤) Ω⁻¹ (Y − XW − 1b⊤)⊤)    (9)

The estimate b̂ is given by b̂ = (1/N)(Y − XW)⊤1.

Optimization w.r.t. Σ⁻¹ when Ω⁻¹, W and b are fixed:
Given Ω⁻¹, W, b, the task inverse covariance matrix Σ⁻¹ can be estimated by solving the following optimization problem:

arg min_{Σ⁻¹} λ1 tr(W Σ⁻¹ W⊤) − D log |Σ⁻¹| + λ3 ||Σ⁻¹||1    (10)

It is easy to see that the above is an instance of the standard inverse covariance estimation problem with sample covariance (λ1/D) W⊤W, and can be solved using standard tools for inverse covariance estimation. 
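Returning to the W update, the Kronecker-structured system in Equation 8 can be assembled and solved directly when D·K is small. A sketch assuming NumPy (vec(·) stacks columns, hence the Fortran-order reshapes; names are illustrative):

```python
import numpy as np

def update_W(X, Y, b, Omega_inv, Sigma_inv, lam, lam1):
    """Solve Equation 8:
    [(Omega^{-1} kron X'X) + ((lam1*Sigma^{-1} + lam*I_K) kron I_D)] vec(W)
        = vec(X'(Y - 1 b') Omega^{-1}).
    """
    N, D = X.shape
    K = Y.shape[1]
    A = (np.kron(Omega_inv, X.T @ X)
         + np.kron(lam1 * Sigma_inv + lam * np.eye(K), np.eye(D)))
    rhs = (X.T @ (Y - b) @ Omega_inv).reshape(-1, order="F")   # vec of the RHS
    return np.linalg.solve(A, rhs).reshape(D, K, order="F")
```

With Ω⁻¹ = Σ⁻¹ = I_K, the system is block-diagonal and the solution reduces to K independent ridge regressions with regularization λ + λ1, matching the remark after Equation 8.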
We use the graphical Lasso procedure [8] to solve Equation 10 to estimate Σ⁻¹:

Σ̂⁻¹ = gLasso((λ1/D) W⊤W, λ3)    (11)

If we assume Σ⁻¹ to be non-sparse, we can drop the ℓ1 penalty on Σ⁻¹ from Equation 10. However, the solution for Σ⁻¹ will not be defined (when K > D) or will overfit (when K is of the same order as D). To avoid this, we add a regularizer of the form λ tr(Σ⁻¹) to Equation 10. This can be seen as imposing a matrix-variate Gaussian prior on Σ^{-1/2} with both row and column covariance matrices equal to I to make the solution well defined. In the previous case of sparse Σ⁻¹, the solution was well defined because of the sparsity prior on Σ⁻¹. The optimization problem for Σ⁻¹ is then given as

arg min_{Σ⁻¹} λ1 tr(W Σ⁻¹ W⊤) − D log |Σ⁻¹| + λ tr(Σ⁻¹)    (12)

Equation 12 admits a closed-form solution, which is given by ((λ1 W⊤W + λI)/D)⁻¹. For the non-sparse Σ⁻¹ case, we keep the parameter λ the same as the hyperparameter for the term tr(WW⊤) in Equation 6.

Optimization w.r.t. Ω⁻¹ when Σ⁻¹, W and b are fixed:
Given Σ⁻¹, W, b, the output inverse covariance matrix Ω⁻¹ can be estimated by solving the following optimization problem:

arg min_{Ω⁻¹} tr((Y − XW − 1b⊤) Ω⁻¹ (Y − XW − 1b⊤)⊤) − N log |Ω⁻¹| + λ2 ||Ω⁻¹||1    (13)

It is again easy to see that the above problem is an instance of the standard inverse covariance estimation problem with sample covariance (1/N)(Y − XW − 1b⊤)⊤(Y − XW − 1b⊤), and can be solved using standard tools for inverse covariance estimation. 
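The closed-form solution of Equation 12 can be sanity-checked numerically: the non-sparse objective is convex in Σ⁻¹, so its value at Σ = (λ1 W⊤W + λI)/D should lower-bound the value at nearby positive-definite points. A sketch assuming NumPy (illustrative names):

```python
import numpy as np

def update_Sigma_inv_nonsparse(W, lam1, lam):
    """Closed-form minimizer of Equation 12: Sigma = (lam1*W'W + lam*I)/D."""
    D, K = W.shape
    return np.linalg.inv((lam1 * (W.T @ W) + lam * np.eye(K)) / D)

def sigma_objective(W, Sigma_inv, lam1, lam):
    """Non-sparse objective of Equation 12 as a function of Sigma^{-1}."""
    D = W.shape[0]
    return (lam1 * np.trace(W @ Sigma_inv @ W.T)
            - D * np.linalg.slogdet(Sigma_inv)[1]
            + lam * np.trace(Sigma_inv))
```

Small symmetric perturbations of the returned matrix should never decrease `sigma_objective`, which confirms the stationarity condition λ1 W⊤W − D Σ + λI = 0.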
We use the graphical Lasso procedure [8] to solve Equation 13 to estimate Ω⁻¹:

Ω̂⁻¹ = gLasso((1/N)(Y − XW − 1b⊤)⊤(Y − XW − 1b⊤), λ2)    (14)

4 Experiments

In this section, we evaluate our model by comparing it with several relevant baselines on both synthetic and real-world datasets. Our main set of results is on multiple-output regression problems, on which we report mean-squared errors averaged across all the outputs. However, since our model also provides an estimate of the conditional inverse covariance structure Ω⁻¹ of the outputs, in Section 4.3 we provide experimental results on the structure recovery task as well. We compare our method with the following baselines:

• Independent regressions (RLS): This baseline learns a regularized least squares (RLS) regression model for each output, without assuming any structure among the weight vectors or among the outputs. This corresponds to our model with Σ = I_K and Ω = I_K. The weight vector of each individual problem is ℓ2-regularized with a hyperparameter λ.

• Curds and Whey (C&W): The predictor in Curds and Whey [3] takes the form Wcw = Wrls U Λ U⁻, where Wrls denotes the regularized least squares predictor, the columns of the matrix U are the projection directions for the responses Y obtained from canonical correlation analysis (CCA) of X and Y, and U⁻ denotes the Moore-Penrose pseudoinverse of U. The diagonal matrix Λ contains the shrinkage factors for each CCA projection direction.

• Multi-task Relationship Learning (MTRL): This method leverages task relationships by assuming a matrix-variate prior on the weight matrix W [22]. 
We chose this baseline because of its flexibility in modeling the task relationships by "discovering" how the weight vectors are related (via Σ⁻¹), rather than assuming a specific structure on them such as shared sparsity [16], a low-rank assumption [2], etc. However, MTRL in the multiple-output regression setting cannot take into account the output structure. It is therefore a special case of our model in which we assume the output inverse covariance matrix Ω⁻¹ = I. The MTRL approach proposed in [22] does not have a sparsity penalty on Σ⁻¹. We experimented with both sparse and non-sparse variants of MTRL and report the better of the two results here.

• Multivariate Regression with Covariance Estimation (MRCE-ℓ2): This baseline is the ℓ2-regularized variant of the MRCE model [17]. MRCE leverages output structure by assuming a full noise covariance in multiple-output regression and learning it along with the weight matrix W from the data. MRCE, however, cannot take into account the task structure because it cannot capture the relationships among the columns of W. It is therefore a special case of our model in which we assume the task inverse covariance matrix Σ⁻¹ = I. We do not compare with the original ℓ1-regularized MRCE [17] to ensure a fair comparison by keeping all the models non-sparse in the weight vectors.

In the experiments, we refer to our model as MROTS (Multiple-output Regression with Output and Task Structures). We experiment with two variants of our proposed model, one without a sparsity-inducing penalty on the task coupling matrix Σ⁻¹ (called MROTS-I), and the other with the sparse penalty on Σ⁻¹ (called MROTS-II). The hyperparameters are selected using four-fold cross-validation. Both MTRL and MRCE-ℓ2 have two hyperparameters each, and these are selected by searching on a two-dimensional grid. 
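Putting the updates of Section 3.2 together, the alternating scheme can be sketched end-to-end. The simplified variant below assumes NumPy and substitutes the non-sparse closed-form updates for the graphical lasso steps (with a small ridge keeping the residual covariance invertible); all names, defaults, and shapes are illustrative rather than the paper's exact procedure:

```python
import numpy as np

def fit_mrots_nonsparse(X, Y, lam=1e-3, lam1=0.1, n_iter=15, eps=1e-3):
    """Alternating optimization for W, b, Sigma^{-1}, Omega^{-1}
    (simplified non-sparse sketch of Section 3.2)."""
    N, D = X.shape
    K = Y.shape[1]
    Omega_inv = np.eye(K)          # output inverse covariance, initialized to I
    Sigma_inv = np.eye(K)          # task inverse covariance, initialized to I
    b = Y.mean(axis=0)             # b initialized to (1/N) Y' 1
    W = np.zeros((D, K))
    for _ in range(n_iter):
        # W-step: solve the Kronecker-structured system of Equation 8
        A = (np.kron(Omega_inv, X.T @ X)
             + np.kron(lam1 * Sigma_inv + lam * np.eye(K), np.eye(D)))
        rhs = (X.T @ (Y - b) @ Omega_inv).reshape(-1, order="F")
        W = np.linalg.solve(A, rhs).reshape(D, K, order="F")
        # b-step: closed form b = (1/N)(Y - XW)' 1 (Equation 9)
        b = (Y - X @ W).mean(axis=0)
        # Sigma-step: non-sparse closed form of Equation 12
        Sigma_inv = np.linalg.inv((lam1 * (W.T @ W) + lam * np.eye(K)) / D)
        # Omega-step: inverse of the (slightly ridged) residual covariance,
        # standing in for the graphical lasso update of Equation 14
        R = Y - X @ W - b
        Omega_inv = np.linalg.inv(R.T @ R / N + eps * np.eye(K))
    return W, b, Sigma_inv, Omega_inv
```

Each sub-problem is convex and is solved exactly (or, for the Ω-step, approximately because of the ridge), mirroring the convergence behavior reported for the full procedure in the experiments.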
For the proposed model with non-sparse \u03a3\u22121, we \ufb01x the\nhyperparameter \u03bb in Equations 6 and 12 as 0.001 for all the experiments. This is used to ensure\nthat the task inverse covariance matrix estimate \u02c6\u03a3\u22121 exists and is robust when number of response\nvariables K is of the same order or larger than the input dimension D. The other two parameters\n\u03bb1 and \u03bb2 are selected using cross-validation. For sparse \u03a3\u22121 case, we use the same values of \u03bb1\nand \u03bb2 that were selected for non-sparse case, and only the third parameter \u03bb3 is selected by cross-\nvalidation. This procedure avoids a potentially expensive search over a three dimensional grid. The\nhyperparameter \u03bb in Equation 6 is again \ufb01xed at 0.001.\n\n4.1 Synthetic data\n\nWe describe the process for synthetic data generation here. First, we generate a random positive\nde\ufb01nite matrix \u03a3\u22121 which will act as the task inverse covariance matrix. Next, a matrix V of size\nD\u00d7K is generated with each entry sampled from a zero mean and 1/D variance normal distribution.\nWe compute the square-root S of \u03a3 (= SS, where S is also a symmetric positive de\ufb01nite matrix),\nand S is used to generate the \ufb01nal weight matrix W as W = VS. It is clear that for a W generated\nin this fashion, we will have E[WT W] = SS = \u03a3. This process generates W such that its\ncolumns (and therefore the weight vectors for different outputs) are correlated. A bias vector b of\nsize K is generated randomly from a zero mean unit variance normal distribution. Then we generate\na sparse random positive de\ufb01nite matrix \u2126\u22121 that acts as the conditional inverse covariance matrix\non output noise making the outputs correlated (given the inputs). Next, input samples are generated\ni.i.d. from a normal distribution and the corresponding multivariate output variables are generated\nas yi = Wxi + b + \u01ebi, \u2200i = 1, 2, . . . 
, N, where ϵi is the correlated noise vector randomly sampled from a zero-mean normal distribution with covariance matrix Ω.

We generate three sets of synthetic data using the above process to gauge the effectiveness of the proposed model under varying circumstances: (i) D = 20, K = 10 and non-sparse Σ⁻¹, (ii) D = 10, K = 20 and non-sparse Σ⁻¹, and (iii) D = 10, K = 20 and sparse Σ⁻¹. We also experiment with a varying number of training samples (N = 20, 30, 40 and 50).

Method    | Synth data I | Synth data II | Synth data III | Paper I | Paper II | Gene data
RLS       | 37.29        | 3.22          | 3.94           | 1.08    | 1.04     | 1.92
C&W       | 37.14        | 21.88         | 7.06           | 1.08    | 1.08     | 1.51
MTRL      | 34.45        | 3.12          | 3.86           | 1.07    | 1.03     | 1.24
MRCE-ℓ2   | 29.84        | 3.08          | 3.92           | 1.36    | 1.03     | 1.55
MROTS-I   | 26.65        | 2.61          | 3.75           | 0.90    | 1.03     | 1.18
MROTS-II  | 25.90        | 2.60          | 3.55           | 0.90    | 1.03     | 1.20

Table 1: Prediction error (MSE) on synthetic and real datasets. RLS: Independent regression, C&W: Curds and Whey model [3], MTRL: Multi-task relationship learning [22], MRCE-ℓ2: The ℓ2-regularized version of MRCE [17], MROTS-I: our model without the sparse penalty on Σ⁻¹, MROTS-II: our model with the sparse penalty on Σ⁻¹. Best results are highlighted in bold fonts.

[Figure 1 appears here: panels (a) Synthetic data I and (b) Synthetic data II plot mean square error against the number of training samples for RLS, C&W, MTRL, MRCE-ℓ2, MROTS-I and MROTS-II; panels (c) Synthetic data I and (d) Paper data I plot the MSE and the objective value against iterations.]

Figure 1: (a) and (b): Mean Square Error with varying number of training samples, (c) and (d): Mean Square Error and the value of the Objective function with increasing iterations for the proposed method.

4.2 Real data

We also evaluate our model on the following real-world multiple-output regression datasets:

• Paper datasets: These are two multivariate multiple-response regression datasets from the paper industry [1]. The first dataset has 30 samples, with each sample having 9 features and 32 outputs. The second dataset has 29 samples (after ignoring one sample with missing response variables), each having 9 features and 13 outputs. We take 15 samples for training and the remaining samples for testing.

• Genotype dataset: This dataset has genotypes as input variables and phenotypes, or observed traits, as output variables [12]. The number of genotypes (features) is 25 and the number of phenotypes (outputs) is 30. We have a total of 100 samples in this dataset and we split it equally into training and test data.

The results on the synthetic and real-world datasets are shown in Table 1. For the synthetic datasets, the reported results are with 50 training samples. Independent linear regression performs the worst on all synthetic datasets. MRCE-ℓ2 performs better than MTRL on the first and second synthetic datasets while MTRL is better on the third dataset. This mixed behavior of MRCE-ℓ2 and MTRL supports our motivation that both task structure (i.e., relationships among weight vectors) and output structure are important in multiple-output regression. 
Both MTRL and MRCE-ℓ2 are special cases of our model, with the former ignoring the output structure (captured by Ω⁻¹) and the latter ignoring the weight vector relationships (captured by Σ⁻¹). Both variants of our model (MROTS-I and MROTS-II) perform significantly better than the compared baselines. The improvement with the sparse Σ⁻¹ variant is more prominent on the third dataset, which is generated with a sparse Σ⁻¹ (5.33% relative reduction in MSE), than on the first two datasets (2.81% and 0.3% relative reduction in MSE). However, in our experiments, the sparse Σ⁻¹ variant (MROTS-II) always performed better than or as well as the non-sparse variant on all synthetic and real datasets, which suggests that explicitly encouraging zero entries in Σ⁻¹ leads to better estimates of the task relationships (by avoiding spurious correlations between weight vectors). This can potentially improve the prediction performance. Finally, we also note that the Curds & Whey method [3] performs significantly worse than RLS on Synthetic data II and III. C&W uses CCA to project the response matrix Y to a lower, min(D, K)-dimensional space, learns min(D, K) predictors there, and then projects them back to the original K-dimensional space. This procedure may end up throwing away relevant information from the responses if K is much higher than D. These empirical results suggest that C&W may adversely affect the prediction performance when the number of response variables K is higher than the number of explanatory variables D (K = 2D in these cases).

On the real-world datasets too, our model performs better than or on par with the compared baselines. Both MROTS-I and MROTS-II perform significantly better than the other baselines on the first Paper dataset (9 features and 32 outputs per sample). 
All models perform almost similarly on the second Paper dataset (9 features and 13 outputs per sample), which could be due to the absence of a strong task or output structure in this data. C&W does not perform well on either Paper dataset, which might be due to the reason discussed earlier. On the genotype-phenotype prediction task too, both our models achieve better average mean squared errors than the other baselines, with both variants performing roughly comparably.

We also evaluate our model's performance with a varying number of training examples and compare with the other baselines. Figures 1(a) and 1(b) show the plots of mean squared error vs. the number of training examples for the first two synthetic datasets. We do not plot C&W for Synthetic data II since it performs worse than RLS. On the first synthetic dataset, the performance gain of our model is more pronounced when the number of training examples is small. For the second synthetic dataset, we retain a similar performance gain over the other models as the number of training examples is increased beyond 20. The MSE numbers for the first synthetic dataset are higher than those obtained for the second because of a difference in the magnitude of the error covariances used in generating the datasets.

We also examine the convergence properties of our method. Figures 1(c) and 1(d) show the plots of average MSE and the value of the objective function (given by Equation 6) with an increasing number of iterations on the first synthetic dataset and the first Paper dataset. The plots show that our alternating optimization procedure converges in roughly 10–15 iterations.

4.3 Covariance structure recovery

Although not the main goal of the paper, we examine the learned inverse covariance matrix of the outputs (given the inputs) as a sanity check on the proposed model.
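The comparison underlying this sanity check (estimating the output precision matrix after conditioning on the inputs, versus from the raw outputs) can be sketched with scikit-learn's graphical lasso; the plain least-squares step below is only a stand-in for the learned regression weights, and the sample size and penalty are illustrative:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Illustrative sizes: K = 5 responses, D = 3 predictors (as in the text).
N, D, K = 1000, 3, 5
X = rng.normal(size=(N, D))
W = 2.0 * rng.normal(size=(D, K))  # regression weights

# Correlated output noise with a known precision (inverse covariance).
L = np.tril(rng.normal(size=(K, K)), k=-1) + np.eye(K)
true_prec = L @ L.T
noise = rng.multivariate_normal(np.zeros(K), np.linalg.inv(true_prec), size=N)
Y = X @ W + noise

# (1) Graphical lasso on the raw outputs, ignoring the predictors:
# this estimates the marginal precision, contaminated by X @ W.
prec_raw = GraphicalLasso(alpha=0.05).fit(Y).precision_

# (2) Regress out the inputs first, then fit on the residuals:
# this targets the conditional precision of the outputs given the inputs.
W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ W_hat
prec_cond = GraphicalLasso(alpha=0.05).fit(resid).precision_

# Frobenius-norm error of each estimate against the true precision;
# conditioning on the inputs should give the closer estimate.
err_raw = np.linalg.norm(prec_raw - true_prec)
err_cond = np.linalg.norm(prec_cond - true_prec)
```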
To better visualize, we generate a dataset with 5 responses and 3 predictors using the same process as described in Sec. 4.1. The figure on the right shows the true conditional inverse covariance matrix Ω⁻¹ (top), the matrix Ω̂⁻¹ learned with MROTS (middle), and the precision matrix learned with the graphical lasso ignoring the predictors (bottom). Taking the regression weights into account results in a better estimate of the true covariance matrix. We observed similar results for MRCE-ℓ2, which also takes the predictors into account while learning the inverse covariance, although the MROTS estimates were closer to the ground truth in terms of the Frobenius norm.

5 Related Work

Apart from the prior works discussed in Section 1, our work has connections to some other works, which we discuss in this section. Recently, Sohn & Kim [18] proposed a model for jointly estimating the weight vector for each output and the covariance structure of the outputs. However, they assume a shared sparsity structure on the weight vectors, an assumption that may be restrictive in some problems. Some other works on conditional graphical model estimation [20, 4] are based on similar structural sparsity assumptions. In contrast, our model does not assume any specific structure on the weight vectors and, by explicitly modeling the covariance structure of the weight vectors, learns the appropriate underlying structure from the data.

6 Future Work and Conclusion

We have presented a flexible model for multiple-output regression that takes into account the covariance structure of the outputs and the covariance structure of the underlying prediction tasks. Our model does not require a priori knowledge of these structures and learns them from the data. It leads to improved accuracies on multiple-output regression tasks. Our model can be extended in several ways.
For example, one possibility is to model nonlinear input-output relationships by kernelizing the model along the lines of [22].

References

[1] M. Aldrin. Moderate projection pursuit regression for multivariate response data. Computational Statistics and Data Analysis, 21, 1996.

[2] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In NIPS, 2007.

[3] L. Breiman and J. H. Friedman. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society, Series B (Methodological), pages 3–54, 1997.

[4] T. Cai, H. Li, W. Liu, and J. Xie. Covariate adjusted precision matrix estimation with an application in genetical genomics. Biometrika, 2011.

[5] R. Caruana. Multitask learning. Machine Learning, 28, 1997.

[6] J. Cheng, E. Levina, P. Wang, and J. Zhu. Sparse Ising models with covariates. arXiv:1209.6342v1, 2012.

[7] S. Ding, G. Wahba, and J. X. Zhu. Learning higher-order graph structure with features by structure penalty. In NIPS, 2011.

[8] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

[9] P. Goovaerts. Geostatistics for Natural Resources Evaluation. Oxford University Press, 1997.

[10] T. Heskes. Empirical Bayes for learning to learn. In ICML, 2000.

[11] S. Kim, K. Sohn, and E. P. Xing. A multivariate regression approach to association analysis of a quantitative trait network.

[12] S. Kim and E. P. Xing. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genetics, 2009.

[13] S. Kim and E. P. Xing. Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping. Annals of Applied Statistics, 2012.

[14] W. Lee and Y. Liu. Simultaneous multiple response regression and inverse covariance matrix estimation via penalized Gaussian maximum likelihood. Journal of Multivariate Analysis, 2012.

[15] H. Liu, X. Chen, J. Lafferty, and L. Wasserman. Graph-valued regression. In NIPS, 2010.

[16] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Union support recovery in high-dimensional multivariate regression. In NIPS, 2010.

[17] A. J. Rothman, E. Levina, and J. Zhu. Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 2010.

[18] K. A. Sohn and S. Kim. Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization. In AISTATS, 2012.

[19] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 1996.

[20] J. Yin and H. Li. A sparse conditional Gaussian graphical model for analysis of genetical genomics data. The Annals of Applied Statistics, 2011.

[21] Y. Zhang and J. Schneider. Learning multiple tasks with a sparse matrix-normal penalty. In NIPS, 2010.

[22] Y. Zhang and D. Yeung. A convex formulation for learning task relationships in multi-task learning. In UAI, 2010.

[23] S. Zhou, J. Lafferty, and L. Wasserman. Time varying undirected graphs. Machine Learning Journal, 2010.
", "award": [], "sourceid": 1458, "authors": [{"given_name": "Piyush", "family_name": "Rai", "institution": null}, {"given_name": "Abhishek", "family_name": "Kumar", "institution": null}, {"given_name": "Hal", "family_name": "Daume", "institution": null}]}