{"title": "Block Variable Selection in Multivariate Regression and High-dimensional Causal Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 1486, "page_last": 1494, "abstract": "We consider multivariate regression problems involving high-dimensional predictor and response spaces. To efficiently address such problems, we propose a variable selection method, Multivariate Group Orthogonal Matching Pursuit, which extends the standard Orthogonal Matching Pursuit technique to account for arbitrary sparsity patterns induced by domain-specific groupings over both input and output variables, while also taking advantage of the correlation that may exist between the multiple outputs. We illustrate the utility of this framework for inferring causal relationships over a collection of high-dimensional time series variables. When applied to time-evolving social media content, our models yield a new family of causality-based influence measures that may be seen as an alternative to PageRank. Theoretical guarantees, extensive simulations and empirical studies confirm the generality and value of our framework.", "full_text": "Block Variable Selection in Multivariate Regression\n\nand High-dimensional Causal Inference\n\nAur\u00b4elie C. Lozano, Vikas Sindhwani\n\nIBM T.J. Watson Research Center,\n\n1101 Kitchawan Road,\n\nYorktown Heights NY 10598,USA\n\n{aclozano,vsindhw}@us.ibm.com\n\nAbstract\n\nWe consider multivariate regression problems involving high-dimensional predic-\ntor and response spaces. To ef\ufb01ciently address such problems, we propose a vari-\nable selection method, Multivariate Group Orthogonal Matching Pursuit, which\nextends the standard Orthogonal Matching Pursuit technique. This extension ac-\ncounts for arbitrary sparsity patterns induced by domain-speci\ufb01c groupings over\nboth input and output variables, while also taking advantage of the correlation that\nmay exist between the multiple outputs. 
Within this framework, we then formulate\nthe problem of inferring causal relationships over a collection of high-dimensional\ntime series variables. When applied to time-evolving social media content, our\nmodels yield a new family of causality-based in\ufb02uence measures that may be seen\nas an alternative to the classic PageRank algorithm traditionally applied to hyper-\nlink graphs. Theoretical guarantees, extensive simulations and empirical studies\ncon\ufb01rm the generality and value of our framework.\n\n1 Introduction\n\nThe broad goal of supervised learning is to effectively learn unknown functional dependencies be-\ntween a set of input variables and a set of output variables, given a \ufb01nite collection of training\nexamples. This paper is at the intersection of two key topics that arise in this context.\n\nThe \ufb01rst topic is Multivariate Regression [4, 2, 24] which generalizes basic single-output regression\nto settings involving multiple output variables with potentially signi\ufb01cant correlations between them.\nApplications of multivariate regression models include chemometrics, econometrics and computa-\ntional biology. Multivariate Regression may be viewed as the classical precursor to many modern\ntechniques in machine learning such as multi-task learning [15, 16, 1] and structured output pre-\ndiction [18, 10, 22]. These techniques are output-centric in the sense that they attempt to exploit\ndependencies between output variables to learn joint models that generalize better than those that\ntreat outputs independently.\n\nThe second topic is that of sparsity [3], variable selection and the broader notion of regulariza-\ntion [20]. The view here is input-centric in the following speci\ufb01c sense. 
In very high dimensional\nproblems where the number of input variables may exceed the number of examples, the only hope\nfor avoiding over\ufb01tting is via some form of aggressive capacity control over the family of dependen-\ncies being explored by the learning algorithm. This capacity control may be implemented in various\nways, e.g., via dimensionality reduction, input variable selection or regularized risk minimization.\nEstimation of sparse models that are supported on a small set of input variables is a highly active\nand very successful strand of research in machine learning. It encompasses l1 regularization (e.g.,\nLasso [19]) and matching pursuit techniques [13] which come with theoretical guarantees on the\nrecovery of the exact support under certain conditions. Particularly pertinent to this paper is the\n\n1\n\n\fnotion of group sparsity. In many problems involving very high-dimensional datasets, it is natural\nto enforce the prior knowledge that the support of the model should be a union over domain-speci\ufb01c\ngroups of features. For instance, Group Lasso [23] extends Lasso, and Group-OMP [12, 9] extends\nmatching pursuit techniques to this setting.\n\nIn view of these two topics, we consider here very high dimensional problems involving a large\nnumber of output variables. We address the problem of enforcing sparsity via variable selection\nin multivariate linear models where regularization becomes crucial since the number of parameters\ngrows not only with the data dimensionality but also the number of outputs. Our approach is guided\nby the following desiderata: (a) performing variable selection for each output in isolation may be\nhighly suboptimal since the input variables which are relevant to (a subset of) the outputs may only\nexhibit weak correlation with each individual output. 
It is also desirable to leverage information\non the relatedness between outputs, so as to guide the decision on the relevance of a certain input\nvariable to a certain output, using additional evidence based on the relevance to related outputs. (b)\nIt is desirable to take into account any grouping structure that may exist between input and output\nvariables. In the presence of noisy data, inclusion decisions made at the group level may be more\nrobust than those at the level of individual variables.\n\nTo ef\ufb01ciently satisfy the above desiderata, we propose Multivariate Group Orthogonal Matching\nPursuit (MGOMP) for enforcing arbitrary block sparsity patterns in multivariate regression coef-\n\ufb01cients. These patterns are speci\ufb01ed by groups de\ufb01ned over both input and output variables. In\nparticular, MGOMP can handle cases where the set of relevant features may differ from one re-\nsponse (group) to another, and is thus more general than simultaneous variable selection procedures\n(e.g. S-OMP of [21]), as simultaneity of the selection in MGOMP is enforced within groups of\nrelated output variables rather than the entire set of outputs. MGOMP also generalizes the Group-\nOMP algorithm of [12] to the multivariate regression case. We provide theoretical guarantees on the\nquality of the model in terms of correctness of group variable selection and regression coef\ufb01cient\nestimation. We present empirical results on simulated datasets that illustrate the strength of our\ntechnique.\n\nWe then focus on applying MGOMP to high-dimensional multivariate time series analysis prob-\nlems. Speci\ufb01cally, we propose a novel application of multivariate regression methods with variable\nselection, namely that of inferring key in\ufb02uencers in online social communities, a problem of in-\ncreasing importance with the rise of planetary scale web 2.0 platforms such as Facebook, Twitter,\nand innumerable discussion forums and blog sites. 
We rigorously map this problem to that of infer-\nring causal in\ufb02uence relationships. Using special cases of MGOMP, we extend the classical notion\nof Granger Causality [7] which provides an operational notion of causality in time series analysis,\nto apply to a collection of multivariate time series variables representing the evolving textual con-\ntent of a community of bloggers. The sparsity structure of the resulting model induces a weighted\ncausal graph that encodes in\ufb02uence relationships. While we use blog communities to concretize\nthe application of our models, our ideas hold more generally to a wider class of spatio temporal\ncausal modeling problems. In particular, our formulation gives rise to a new class of in\ufb02uence mea-\nsures that we call GrangerRanks, that may be seen as causality-based alternatives to hyperlink-based\nranking techniques like the PageRank [17], popularized by Google in the early days of the internet.\nEmpirical results on a diverse collection of real-world key in\ufb02uencer problems clearly show the\nvalue of our models.\n\n2 Variable Group Selection in Multivariate Regression\n\nLet us begin by recalling the multivariate regression model, Y = X \u00afA + E, where Y \u2208 Rn\u00d7K\nis the output matrix formed by n training examples on K output variables, X \u2208 Rn\u00d7p is the data\nmatrix whose rows are p-dimensional feature vectors for the n training examples, \u00afA is the p \u00d7 K\nmatrix formed by the true regression coef\ufb01cients one wishes to estimate, and E is the n \u00d7 K error\nmatrix. The row vectors of E, are assumed to be independently sampled from N (0, \u03a3) where \u03a3 is\nthe K \u00d7 K error covariance matrix. 
For simplicity of notation we assume without loss of generality that the columns of X and Y have been centered, so we need not deal with intercept terms.

The negative log-likelihood function (up to a constant) corresponding to the aforementioned model can be expressed as

−l(A, Σ) = tr((Y − XA)^T (Y − XA) Σ^{−1}) − n log|Σ^{−1}|,    (1)

where A is any estimate of Ā, and |·| denotes the determinant of a matrix. The maximum likelihood estimator is the Ordinary Least Squares (OLS) estimator Â_OLS = (X^T X)^{−1} X^T Y, namely, the concatenation of the OLS estimates for each of the K outputs taken separately, irrespective of Σ. This suggests its suboptimality, as the relatedness of the responses is disregarded. The OLS estimator is also known to perform poorly in the case of high-dimensional predictors and/or when the predictors are highly correlated. To alleviate these issues, several methods have been proposed that are based on dimension reduction. Among those, variable selection methods are the most popular, as they lead to parsimonious and interpretable models, which is desirable in many applications. Clearly, however, variable selection in multiple-output regression is particularly challenging in the presence of high-dimensional feature vectors as well as a possibly large number of responses.

In many applications, including the high-dimensional time series analysis and causal modeling settings showcased later in this paper, it is possible to provide domain-specific guidance for variable selection by imposing a sparsity structure on A. Let I = {I_1, ..., I_L} denote the set formed by L (possibly overlapping) groups of input variables, where I_k ⊂ {1, ..., p}, k = 1, ..., L. Let O = {O_1, ..., O_M} denote the set formed by M (possibly overlapping) groups of output variables, where O_k ⊂ {1, ..., K}, k = 1, ..., M.
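The decoupling of the OLS estimator across outputs is easy to verify numerically. The sketch below (ours, not the paper's code) fits the joint estimator and K separate univariate regressions on random data and checks that they coincide, which is precisely why OLS ignores Σ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 30, 4, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, K))

# Joint multivariate OLS: (X^T X)^{-1} X^T Y
A_joint = np.linalg.solve(X.T @ X, X.T @ Y)

# K separate univariate OLS fits, one per output column
A_separate = np.column_stack(
    [np.linalg.lstsq(X, Y[:, k], rcond=None)[0] for k in range(K)]
)

# The two coincide: OLS never looks at the error covariance Sigma
assert np.allclose(A_joint, A_separate)
```

Any estimator built purely from per-output least squares therefore cannot exploit correlated errors, which motivates the Σ-aware loss used by MGOMP below.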
Note that if certain variables do not belong to any group, they may be considered to be groups of size 1. These group definitions specify a block sparsity/support pattern on A. Without loss of generality, we assume that column indices are permuted so that groups span contiguous indices. We now outline a novel algorithm, Multivariate Group Orthogonal Matching Pursuit (MGOMP), that seeks to minimize the negative log-likelihood associated with the multivariate regression model subject to the constraint that the support (set of non-zeros) of the regression coefficient matrix, A, is a union of blocks formed by input and output variable groupings¹.

2.1 Multivariate Group Orthogonal Matching Pursuit

The MGOMP procedure performs greedy pursuit with respect to the loss function

L_C(A) = tr((Y − XA)^T (Y − XA) C),    (2)

where C is an estimate of the precision matrix Σ^{−1}, given as input. Possible estimates include the sample estimate using the residual error obtained from running univariate Group-OMP for each response individually. In addition to leveraging the grouping information via block sparsity constraints, MGOMP is able to incorporate additional information on the relatedness among output variables, as implicitly encoded in the error covariance matrix Σ, noting that the latter is also the covariance matrix of the response Y conditioned on the predictor matrix X. Existing variable selection methods often ignore this information and deal instead with (regularized versions of) the simplified objective tr((Y − XA)^T (Y − XA)), thereby implicitly assuming that Σ = I.

Before outlining the details of MGOMP, we first need to introduce some notation. For any set of output variables O ⊂ {1, . . .
, K}, denote by C_O the restriction of the K × K precision matrix C to the columns corresponding to the output variables in O, and by C_{O,O} the similar restriction to both columns and rows. For any set of input variables I ⊂ {1, ..., p}, denote by X_I the restriction of X to the columns corresponding to the input variables in I. Furthermore, to simplify the exposition, we assume in the remainder of the paper that for each group of input variables I_s ∈ I, X_{I_s} is orthonormalized, i.e., X_{I_s}^T X_{I_s} = I. Denote by A^{(m)} the estimate of the regression coefficient matrix at iteration m, and by R^{(m)} the corresponding matrix of residuals, i.e., R^{(m)} = Y − X A^{(m)}. The MGOMP procedure iterates between two steps: (a) block variable selection and (b) coefficient matrix re-estimation with the selected blocks. We now outline the details of these two steps.

Block Variable Selection: In this step, each block (I_r, O_s) is evaluated with respect to how much its introduction into A^{(m−1)} can reduce the residual loss. Namely, at round m, the procedure selects the block (I_r, O_s) attaining

arg min_{1≤r≤L, 1≤s≤M}  min_{A : A_{v,w}=0, v∉I_r, w∉O_s}  ( L_C(A^{(m−1)} + A) − L_C(A^{(m−1)}) ).

¹We note that we could easily generalize this setting and MGOMP to deal with the more general case where there may be a different grouping structure for each output group, namely for each O_k, we could consider a different set I_{O_k} of input variable groups.

Note that when the minimum attained falls below a threshold ε, the algorithm is stopped.
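The greedy block-selection step can be sketched directly using the equivalent trace criterion of Eq. (3); the group definitions and data below are illustrative placeholders, not the paper's experiments:

```python
import numpy as np

def block_score(X, R, C, I_r, O_s):
    # Score of candidate block (I_r, O_s), following Eq. (3):
    # tr((X_Ir^T R C_Os)^T (X_Ir^T R C_Os) (C_{Os,Os})^{-1})
    G = X[:, I_r].T @ R @ C[:, O_s]
    M = np.linalg.inv(C[np.ix_(O_s, O_s)])
    return np.trace(G.T @ G @ M)

def select_block(X, R, C, input_groups, output_groups):
    # Greedy step: pick the (input group, output group) pair with maximal score
    scores = {(r, s): block_score(X, R, C, I_r, O_s)
              for r, I_r in enumerate(input_groups)
              for s, O_s in enumerate(output_groups)}
    return max(scores, key=scores.get)

# Toy illustration: 2 input groups, 2 output groups, identity precision
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 6))
R = rng.standard_normal((40, 4))   # current residual matrix R^(m-1)
C = np.eye(4)                      # precision estimate (identity here)
input_groups = [[0, 1, 2], [3, 4, 5]]
output_groups = [[0, 1], [2, 3]]
r_star, s_star = select_block(X, R, C, input_groups, output_groups)
```

With C = I and a single all-outputs group this reduces to S-OMP's selection rule, and with singleton output groups it reduces to per-response Group-OMP, matching the special cases listed in Table 1.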
Using standard linear algebra, the block variable selection criterion simplifies to

(r^{(m)}, s^{(m)}) = arg max_{r,s} tr( (X_{I_r}^T R^{(m−1)} C_{O_s})^T (X_{I_r}^T R^{(m−1)} C_{O_s}) (C_{O_s,O_s})^{−1} ).    (3)

From the above equation, it is clear that the relatedness between output variables is taken into account in the block selection process.

Coefficient Re-estimation: Let M^{(m−1)} be the set of blocks selected up to iteration m − 1. The set is now updated to include the selected block of variables (I_{r^{(m)}}, O_{s^{(m)}}), i.e., M^{(m)} = M^{(m−1)} ∪ {(I_{r^{(m)}}, O_{s^{(m)}})}. The regression coefficient matrix is then re-estimated as A^{(m)} = Â_X(M^{(m)}, Y), where

Â_X(M^{(m)}, Y) = arg min_{A ∈ R^{p×K}} L_C(A)  subject to  supp(A) ⊆ M^{(m)}.    (4)

Since certain features are only relevant to a subset of responses, here the precision matrix estimate C comes into play, and the problem cannot be decoupled. However, a closed-form solution for (4) can be derived by recalling the following matrix identities [8]:

tr(M_1^T M_2 M_3 M_4^T) = vec(M_1)^T (M_4 ⊗ M_2) vec(M_3),    (5)
vec(M_1 M_2) = (I ⊗ M_1) vec(M_2),    (6)

where vec denotes matrix vectorization, ⊗ the Kronecker product, and I the identity matrix. From (5), we have

tr((Y − XA)^T (Y − XA) C) = (vec(Y − XA))^T (C ⊗ I_n) (vec(Y − XA)).    (7)

For a set of selected blocks, say M, denote by O(M) the union of the output groups in M. Let C̃ = C_{O(M),O(M)} ⊗ I_n and Ỹ = vec(Y_{O(M)}). For each output group O_s in M, let I(O_s) = ∪_{(I_r,O_s)∈M} I_r. Finally, define X̃ = diag{ I_{|O_s|} ⊗ X_{I(O_s)} , O_s ∈ O(M) }.
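Identity (7), with vec taken column-major, is easy to check numerically; a quick sketch with random matrices (ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, K = 10, 3, 4
X = rng.standard_normal((n, p))
A = rng.standard_normal((p, K))
Y = rng.standard_normal((n, K))
C = np.eye(K) + 0.3 * np.ones((K, K))   # a symmetric positive definite "precision"

E = Y - X @ A
lhs = np.trace(E.T @ E @ C)             # trace form of the loss, Eq. (2)

# vec stacks columns, so flatten in Fortran (column-major) order
vecE = E.flatten(order="F")
rhs = vecE @ np.kron(C, np.eye(n)) @ vecE   # vectorized form, Eq. (7)

assert np.isclose(lhs, rhs)
```

This is the identity that lets the constrained problem (4) be solved as an ordinary generalized least-squares system in the stacked coefficient vector.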
Using (7) and (6), one can show that the non-zero entries of vec(Â_X(M, Y)), namely those corresponding to the support induced by M, are given by

α̂ = ( X̃^T C̃ X̃ )^{−1} ( X̃^T C̃ ) Ỹ,

thus providing a closed-form formula for the coefficient re-estimation step.

To conclude this section, we note that we could also consider performing alternate optimization of the objective in (1) over A and Σ, using MGOMP to optimize over A for a fixed estimate of Σ, and using a covariance estimation algorithm (e.g., Graphical Lasso [5]) to estimate Σ with A fixed.

2.2 Theoretical Performance Guarantees for MGOMP

In this section we show that under certain conditions MGOMP can identify the correct blocks of variables, and we provide an upper bound on the maximum absolute difference between the estimated and true regression coefficients. We assume that the estimate of the error precision matrix, C, is in agreement with the specification of the output groups, namely that C_{i,j} = 0 if i and j belong to different output groups.

For each output variable group O_k, denote by G_good(k) the set formed by the input groups included in the true model for the regressions in O_k, and let G_bad(k) be the set formed by all the input groups that are not included. Similarly, denote by M_good the set formed by the pairs of input and output variable groups included in the true model, and by M_bad the set formed by all the pairs that are not included. Before we can state the theorem, we need to define the parameters that are key to the consistency conditions. Let ρ_X(M_good) = min_{k∈{1,...,M}} inf_α { ‖Xα‖₂² / ‖α‖₂² : supp(α) ⊆ G_good(k) }, namely, ρ_X(M_good) is the minimum over the output groups O_k of the smallest eigenvalue of X_{G_good(k)}^T X_{G_good(k)}.

For each output group O_k, define generally, for any u = {u_1, . . .
, u_{|G_good(k)|}} and v = {v_1, ..., v_{|G_bad(k)|}},

‖u‖_{(2,1)}^{good(k)} = Σ_{G_i ∈ G_good(k)} √( Σ_{j∈G_i} u_j² ),   and   ‖v‖_{(2,1)}^{bad(k)} = Σ_{G_i ∈ G_bad(k)} √( Σ_{j∈G_i} v_j² ).

For any matrix M ∈ R^{|G_good(k)|×|G_bad(k)|}, let

‖M‖_{(2,1)}^{good/bad(k)} = sup_{‖v‖_{(2,1)}^{bad(k)} = 1} ‖Mv‖_{(2,1)}^{good(k)}.

Then we define μ_X(M_good) = max_{k∈{1,...,M}} ‖X_{G_good(k)}^+ X_{G_bad(k)}‖_{(2,1)}^{good/bad(k)}, where X^+ denotes the Moore–Penrose pseudoinverse of X. We are now able to state the consistency theorem.

Theorem 1. Assume that μ_X(M_good) < 1 and 0 < ρ_X(M_good) ≤ 1. For any η ∈ (0, 1/2), if the stopping criterion of MGOMP is such that

ε > (1 / (1 − μ_X(M_good))) √( 2pK ln(2pK/η) )

and min_{k∈{1,...,M}, I_j ∈ G_good(k)} ‖Ā_{I_j,O_k}‖_F ≥ √(8ε) ρ_X(M_good)^{−1}, then, with probability at least 1 − 2η, when the algorithm stops, M^{(m−1)} = M_good and ‖A^{(m−1)} − Ā‖_max ≤ √( 2 ln(2|M_good|/η) ) / ρ_X(M_good).

Proof. The multivariate regression model Y = X Ā + E can be rewritten in an equivalent univariate form with white noise: Ỹ = (I_K ⊗ X) ᾱ + η̃, where ᾱ = vec(Ā), Ỹ = diag{ C_{k,k}^{−1/2} I_n }_{k=1}^{K} vec(Y C^{1/2}), and η̃ is formed by i.i.d. samples from N(0, 1). We can see that applying the MGOMP procedure is equivalent to applying the Group-OMP procedure [12] to the above vectorized regression model, using as grouping structure the one naturally induced by the input-output groups originally considered for MGOMP. The theorem then follows from Theorem 3 in [12] by translating the univariate consistency conditions into their multivariate counterparts via μ_X(M_good) and ρ_X(M_good). Since C is such that C_{i,j} = 0 for any i, j belonging to distinct groups, the entries in Ỹ do not mix components of Y from different output groups, and hence the error covariance matrix does not appear in the consistency conditions.

Note that the theorem can also be re-stated with an alternative condition on the amplitude of the true regression coefficients: min_{k∈{1,...,M}, I_j∈G_good(k)} min_{s∈O_k} ‖Ā_{I_j,s}‖₂ ≥ √(8ε) ρ_X(M_good)^{−1} / √|O_k|, which suggests that the amplitude of the true regression coefficients is allowed to be smaller for MGOMP than for Group-OMP applied to the individual regressions. Intuitively, through MGOMP we are combining information from multiple regressions, thus improving our capability to identify the correct groups.

2.3 Simulation Results

We empirically evaluate the performance of our method against representative variable selection methods, in terms of accuracy of prediction and variable (group) selection. As a measure of variable selection accuracy we use the F1 measure, defined as F1 = 2PR/(P + R), where P denotes the precision and R denotes the recall. To compute the variable group F1 of a variable selection method, we consider a group to be selected if any of the variables in the group is selected. As a measure of prediction accuracy we use the average squared error on a test set. For all the greedy pursuit methods, we consider the "holdout validated" estimates; namely, we select the iteration number that minimizes the average squared error on a validation set. For univariate methods, we consider individual selection of the iteration number for each univariate regression (joint selection of a common iteration number across the univariate regressions led to worse results in the setting considered).
For each setting, we ran 50 runs, each with 50 observations for training, 50 for validation and 50 for testing. We consider an n × p predictor matrix X, where the rows are generated independently according to N_p(0, S), with S_{i,j} = 0.7^{|i−j|}. The n × K error matrix E is generated according to N_K(0, Σ), with Σ_{i,j} = ρ^{|i−j|}, where ρ ∈ {0, 0.5, 0.7, 0.9}. We consider a model with 3rd-order polynomial expansion: [Y_{T_1}, ..., Y_{T_M}] = X[A_{1,T_1}, ..., A_{1,T_M}] + X²[A_{2,T_1}, ..., A_{2,T_M}] + X³[A_{3,T_1}, ..., A_{3,T_M}] + E. Here we abuse notation to denote by X^q the matrix such that (X^q)_{i,j} = (X_{i,j})^q. T_1, ..., T_M are the target groups. For each k, each row of [A_{1,T_k}, ..., A_{3,T_k}] is either all non-zero or all zero, according to Bernoulli draws with success probability 0.1. Then, for each non-zero entry of A_{i,T_k}, independently, we set its value according to N(0, 1). The number of features for X is set to 20. Hence we consider 60 variables grouped into 20 groups corresponding to the 3rd-degree polynomial expansion. The number of regressions is set to 60. We consider 20 regression groups (T_1, . . .
T_20), each of size 3.

Method           | Precision matrix estimate         | (p, L)  | (K, M)  | Parallel runs
OMP [13]         | Not applicable                    | (p, p)  | (1, 1)  | K
Group-OMP [12]   | Not applicable                    | (p, L)  | (1, 1)  | K
S-OMP [21]       | Identity matrix                   | (p, p)  | (K, 1)  | 1
MGOMP(Id)        | Identity matrix                   | (p, L)  | (K, M)  | 1
MGOMP(C)         | Estimate from univariate OMP fits | (p, L)  | (K, M)  | 1
MGOMP(Parallel)  | Identity matrix                   | (p, L)  | (M′, 1) | M

Table 1: Various matching pursuit methods and their corresponding parameters.

Average F1 score:
ρ   | MGOMP (C)     | MGOMP (Id)    | MGOMP(Parallel) | Group-OMP     | OMP
0.9 | 0.863 ± 0.003 | 0.818 ± 0.003 | 0.762 ± 0.003   | 0.646 ± 0.007 | 0.517 ± 0.006
0.7 | 0.850 ± 0.002 | 0.806 ± 0.003 | 0.757 ± 0.003   | 0.631 ± 0.008 | 0.517 ± 0.007
0.5 | 0.850 ± 0.003 | 0.802 ± 0.004 | 0.766 ± 0.004   | 0.641 ± 0.006 | 0.525 ± 0.007
0   | 0.847 ± 0.004 | 0.848 ± 0.004 | 0.783 ± 0.004   | 0.651 ± 0.007 | 0.525 ± 0.007

Average test set squared error:
ρ   | MGOMP (C)     | MGOMP (Id)    | MGOMP(Parallel) | Group-OMP     | OMP
0.9 | 3.009 ± 0.234 | 3.324 ± 0.273 | 4.086 ± 0.169   | 6.165 ± 0.317 | 6.978 ± 0.206
0.7 | 3.114 ± 0.252 | 3.555 ± 0.287 | 4.461 ± 0.159   | 8.170 ± 0.328 | 8.14 ± 0.390
0.5 | 3.117 ± 0.234 | 3.630 ± 0.281 | 4.499 ± 0.288   | 7.305 ± 0.331 | 8.098 ± 0.323
0   | 3.124 ± 0.256 | 3.123 ± 0.262 | 3.852 ± 0.185   | 6.137 ± 0.330 | 7.414 ± 0.331

Table 2: Average F1 score (top) and average test set squared error (bottom) for the models output by variants of MGOMP, Group-OMP and OMP under the settings of Table 1.

A dictionary of the various matching pursuit methods and their corresponding parameters is provided in Table 1.
In the table, note that MGOMP(Parallel) consists in running MGOMP separately for each\nregression group and C set to identity (Using C estimated from univariate OMP \ufb01ts has negligible\nimpact on performance and hence is omitted for conciseness.). The results are presented in Table 2.\n\nOverall, in all the settings considered, MGOMP is superior both in terms of prediction and vari-\nable selection accuracy, and more so when the correlation between responses increases. Note that\nMGOMP is stable with respect to the choice of the precision matrix estimate. Indeed the advantage\nof MGOMP persists under imperfect estimates (Identity and sample estimate from univariate OMP\n\ufb01ts) and varying degrees of error correlation. In addition, model selection appears to be more robust\nfor MGOMP, which has only one stopping point (MGOMP has one path interleaving input variables\nfor various regressions, while GOMP and OMP have K paths, one path per univariate regression).\n\n3 Granger Causality with Block Sparsity in Vector Autoregressive Models\n\n3.1 Model Formulation\n\nWe begin by motivating our main application. The emergence of the web2.0 phenomenon has set in\nplace a planetary-scale infrastructure for rapid proliferation of information and ideas. Social media\nplatforms such as blogs, twitter accounts and online discussion sites are large-scale forums where\nevery individual can voice a potentially in\ufb02uential public opinion. This unprecedented scale of un-\nstructured user-generated web content presents new challenges to both consumers and companies\nalike. Which blogs or twitter accounts should a consumer follow in order to get a gist of the com-\nmunity opinion as a whole? 
How can a company identify bloggers whose commentary can change brand perceptions across this universe, so that marketing interventions can be effectively strategized? The problem of finding key influencers and authorities in online communities is central to any viable information triage solution, and is therefore attracting increasing attention [14, 6]. A traditional approach to this problem would treat it no differently from the problem of ranking web pages in a hyperlinked environment. Seminal ideas such as PageRank [17] and Hubs-and-Authorities [11] were developed in this context, and in fact were even celebrated as bringing a semblance of order to the web. However, the mechanics of opinion exchange and adoption make the problem of inferring authority and influence in social media settings somewhat different from the problem of ranking generic web pages. Consider the following example, which typifies the process of opinion adoption. A consumer is looking to buy a laptop. She initiates a web search for the laptop model and browses several discussion and blog sites where that model has been reviewed. The reviews bring to her attention that, among other nice features, the laptop also has excellent speaker quality. Next she buys the laptop and in a few days herself blogs about it. Arguably, conditional on being made aware of speaker quality in the reviews she had read, she is more likely to herself comment on that aspect without necessarily attempting to find those sites again in order to link to them in her blog. In other words, the actual post content is the only trace that the opinion was implicitly absorbed. Moreover, the temporal order of events in this interaction is indicative of the direction of causal influence.
The input to our model is a collection of multivariate time series, one per blogger, constructed as follows.

We formulate these intuitions rigorously in terms of the notion of Granger causality [7] and then employ MGOMP for its implementation. For scalability, we work with MGOMP(Parallel); see Table 1. Introduced by the Nobel-prize-winning economist Clive Granger, this notion has proven useful as an operational notion of causality in time series analysis. It is based on the intuition that a cause should necessarily precede its effect, and in particular that if a time series variable X causally affects another, Y, then the past values of X should be helpful in predicting the future values of Y, beyond what can be predicted based on the past values of Y alone.

Let B_1, ..., B_G denote a community of G bloggers. With each blogger we associate content variables, which consist of frequencies of words relevant to a topic across time. Specifically, given a dictionary of K words and the time-stamp of each blog post, we record w_i^{k,t}, the frequency of the kth word for blogger B_i at time t. Then, the content of blogger B_i at time t can be represented as B_i^t = [w_i^{1,t}, ..., w_i^{K,t}], and the full input is {B_i^t}_{t=1}^{T} (1 ≤ i ≤ G), where T is the timespan of our analysis. Our key intuition is that authorities and influencers are causal drivers of future discussions and opinions in the community. This may be phrased in the following terms:

Granger Causality: A collection of bloggers is said to influence blogger B_i if their collective past content (blog posts) is predictive of the future content of blogger B_i, with statistical significance, and more so than the past content of blogger B_i alone.

The influence problem can thus be mapped to a variable group selection problem in a vector autoregressive model, i.e., in multivariate regression with G × K responses {B_j^t, j = 1, 2, ..., G} expressed in terms of the variable groups {{B_j^{t−l}}_{l=1}^{d}, j = 1, 2, ..., G}:

[B_1^t, ..., B_G^t] = [B_1^{t−1}, ..., B_G^{t−1}, ..., B_1^{t−d}, ..., B_G^{t−d}] A + E.

We can then conclude that a certain blogger B_i influences blogger B_j if the variable group {B_i^{t−l}}_{l∈{1,...,d}} is selected by the variable selection method for the responses concerning blogger B_j. For each blogger B_j, this can be viewed as an application of a Granger test on B_j against bloggers B_1, B_2, ..., B_G. This induces a directed, weighted graph over bloggers, which we call the causal graph, where edge weights are derived from the underlying regression coefficients. We refer to influence measures on causal graphs as GrangerRanks. For example, GrangerPageRank refers to applying PageRank on the causal graph, while GrangerOutDegree refers to computing out-degrees of nodes as a measure of causal influence.

3.2 Application: Causal Influence in Online Social Communities

Proof of concept: Key Influencers in Theoretical Physics: Drawn from a KDD Cup 2003 task, this dataset is publicly available at: http://www.cs.cornell.edu/projects/kddcup/datasets.html. It consists of the LaTeX sources of all papers in the hep-th portion of the arXiv (http://arxiv.org). In consultation with a theoretical physicist, we did our analysis at a time granularity of 1 month. In total, the data spans 137 months. We created document-term matrices using standard text processing techniques, over a vocabulary of 463 words chosen by running an unsupervised topic model. For each of the 9200 authors, we created a word-time matrix of size 463×137, which is the usage of the topic-specific key words across time. We considered one year, i.e., d = 12 months, as the maximum time lag. Our model produces the causal graph shown in Figure 1, showing influence relationships amongst high-energy physicists.
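The vector autoregressive regression above can be assembled from the raw per-blogger series as follows; this is a minimal sketch, where the function name and array layout are our illustrative choices, not the paper's:

```python
import numpy as np

def lagged_design(B, d):
    """Build the VAR design for Granger group selection.

    B: (T, G*K) array; row t concatenates the K content features of all
       G bloggers at time t.
    Returns (X, Y) with Y[t] = B[t] and X[t] = [B[t-1], ..., B[t-d]],
    so blogger j contributes d lag blocks, which together form its
    candidate variable group.
    """
    T = B.shape[0]
    Y = B[d:]
    X = np.hstack([B[d - l: T - l] for l in range(1, d + 1)])
    return X, Y

# Tiny example: T = 5 time steps, G*K = 2 features, lag d = 2
B = np.arange(10.0).reshape(5, 2)
X, Y = lagged_design(B, 2)
# First response row is B[2]; its predictors are [B[1], B[0]]
```

Running a block variable selection procedure on (X, Y) with these lag-block groups then yields the selected influence edges of the causal graph.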
The table on the right side of Figure 1 lists the top-ranked authors according to GrangerOutDegree (also marked on the graph), GrangerPageRank and Citation Count. The model correctly identifies several leading figures, such as Edward Witten and Cumrun Vafa, as authorities in theoretical physics. In this domain, the number of citations is commonly viewed as a valid measure of authority, given the disciplined scholarly practice of citing prior related work. Thus, we consider citation-count-based ranking as the "ground truth". We also find that GrangerPageRank and GrangerOutDegree have high positive rank correlation with citation counts (0.728 and 0.384 respectively). This experiment confirms that our model agrees with how this community recognizes its authorities.

Figure 1: Causal Graph and top authors in High-Energy Physics according to various measures. The accompanying table of top-ranked authors reads:

GrangerOutDegree      GrangerPageRank       Citation Count
E.Witten              E.Witten              E.Witten
C.Vafa                C.Vafa                N.Seiberg
Alex Kehagias         Alex Kehagias         C.Vafa
Arkady Tseytlin       Arkady Tseytlin       J.M.Maldacena
P.K.Townsend          P.K.Townsend          A.A.Sen
Jacob Sonnenschein    Jacob Sonnenschein    Andrew Strominger
Igor Klebanov         R.J.Szabo             Igor Klebanov
R.J.Szabo             G.Moore               Michael Douglas
G.Moore               Igor Klebanov         Arkady Tseytlin
Michael Douglas       Ian Kogan             L.Susskind

Figure 2: Causal and hyperlink graphs for the Lotus blog dataset ((a) Causal Graph; (b) Hyperlink Graph).

Real application: IBM Lotus Bloggers: We crawled blogs pertaining to the IBM Lotus software brand.
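The GrangerRank measures used in these experiments are standard graph centralities applied to the learned causal graph. The sketch below computes weighted out-degree, a power-iteration PageRank, and a Spearman rank correlation for comparing rankings; it is a minimal illustration with made-up edge weights, not the exact implementations used in the experiments, and the tie-handling in the correlation is simplified (ties broken arbitrarily).

```python
import math

def granger_pagerank(weights, damping=0.85, iters=100):
    # weights: {u: {v: w}} directed, weighted edges of the causal graph.
    nodes = set(weights)
    for nbrs in weights.values():
        nodes |= set(nbrs)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for u in nodes:
            out = weights.get(u, {})
            total = sum(out.values())
            if total == 0:               # dangling node: spread uniformly
                for v in nodes:
                    new[v] += damping * rank[u] / n
            else:                        # distribute rank along weighted edges
                for v, w in out.items():
                    new[v] += damping * rank[u] * w / total
        rank = new
    return rank

def granger_outdegree(weights):
    # Weighted out-degree as a causal influence score.
    return {u: sum(nbrs.values()) for u, nbrs in weights.items()}

def spearman(xs, ys):
    # Spearman rank correlation of two equal-length score lists
    # (no tie correction; ties are broken arbitrarily).
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for pos, i in enumerate(order):
            r[i] = pos
        return r
    rx, ry = ranks(xs), ranks(ys)
    mx = sum(rx) / len(rx)
    my = sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Rank correlations such as those reported against citation counts can then be obtained by feeding two score lists (e.g., GrangerPageRank values and citation counts over the same authors) to spearman.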
Our crawl process ran in conjunction with a relevance classifier that continuously filtered out posts irrelevant to Lotus discussions. Due to lack of space, we omit preprocessing details, which are similar to those of the previous application. In all, this dataset represents a Lotus blogging community of 684 bloggers, each associated with multiple time series describing the frequency of 96 words over a time period of 376 days. We considered one week, i.e., d = 7 days, as the maximum time lag in this application. Figure 2 shows the causal graph learnt by our models on the left, and the hyperlink graph on the right. We notice that the causal graph is sparser than the hyperlink graph. By identifying the most significant causal relationships between bloggers, our causal graphs allow clearer inspection of the authorities and also appear to better expose striking sub-community structures in this blog community. We also computed the correlation between PageRank and out-degrees computed over our causal graph and the hyperlink graph (0.44 and 0.65 respectively). These positive correlations indicate that measures computed on either graph partially capture related latent rankings, while remaining sufficiently different from each other. Our results were also validated by domain experts.

4 Conclusion and Perspectives

We have provided a framework for learning sparse multivariate regression models, where the sparsity structure is induced by groupings defined over both input and output variables. We have shown that extended notions of Granger Causality for causal inference over high-dimensional time series can naturally be cast in this framework. This allows us to develop a causality-based perspective on the problem of identifying key influencers in online communities, leading to a new family of influence measures called GrangerRanks.
We list several directions of interest for future work: optimizing time-lag selection; considering hierarchical group selection to identify pertinent causal relationships not only between bloggers but also between communities of bloggers; incorporating the hyperlink graph in the causal modeling; adapting our approach to produce topic-specific rankings; developing online learning versions; and conducting further empirical studies on the properties of the causal graph in various applications of multivariate regression.

Acknowledgments
We would like to thank Naoki Abe, Rick Lawrence, Estepan Meliksetian, Prem Melville and Grzegorz Swirszcz for their contributions to this work in a variety of ways.

References
[1] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.

[2] Leo Breiman and Jerome H. Friedman. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B, 59(1):3-54, 1997.

[3] M. Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, 2010.

[4] Ildiko E. Frank and Jerome H. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109-135, 1993.

[5] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, July 2008.

[6] M. Gomez-Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. KDD, 2010.

[7] C. Granger. Testing for causality: A personal viewpoint. Journal of Economic Dynamics and Control, 2:329-352, 1980.

[8] D. Harville. Matrix Algebra from a Statistician's Perspective. Springer, 1997.

[9] J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. ICML, 2009.

[10] T. Joachims.
Structured output prediction with support vector machines. In Joint IAPR International Workshops on Structural and Syntactic Pattern Recognition (SSPR) and Statistical Techniques in Pattern Recognition (SPR), pages 1-7, 2006.

[11] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46:668-677, 1999.

[12] A.C. Lozano, G. Swirszcz, and N. Abe. Grouped orthogonal matching pursuit for variable selection and prediction. Advances in Neural Information Processing Systems 22, 2009.

[13] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 1993.

[14] P. Melville, K. Subbian, C. Perlich, R. Lawrence, and E. Meliksetian. A predictive perspective on measures of influence in networks. Workshop on Information in Networks (WIN-10), New York, September 2010.

[15] Charles A. Micchelli and Massimiliano Pontil. Kernels for multi-task learning. In NIPS, 2004.

[16] G. Obozinski, B. Taskar, and M. Jordan. Multi-task feature selection. Technical report, 2006.

[17] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report, Stanford Digital Libraries, 1998.

[18] Elisa Ricci, Tijl De Bie, and Nello Cristianini. Magic moments for structured output prediction. Journal of Machine Learning Research, 9:2803-2846, December 2008.

[19] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1994.

[20] A.N. Tikhonov. Regularization of incorrectly posed problems. Sov. Math. Dokl., 4:1624-1627, 1963.

[21] J.A. Tropp, A.C. Gilbert, and M.J. Strauss. Algorithms for simultaneous sparse approximation: part I: Greedy pursuit. Sig. Proc., 86(3):572-588, 2006.

[22] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun.
Support vector machine learning for interdependent and structured output spaces. In International Conference on Machine Learning (ICML), pages 104-112, 2004.

[23] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49-67, 2006.

[24] Ming Yuan, Ali Ekici, Zhaosong Lu, and Renato Monteiro. Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society, Series B, 69(3):329-346, 2007.