{"title": "Sparse Instrumental Variables (SPIV) for Genome-Wide Studies", "book": "Advances in Neural Information Processing Systems", "page_first": 28, "page_last": 36, "abstract": "This paper describes a probabilistic framework for studying associations between multiple genotypes, biomarkers, and phenotypic traits in the presence of noise and unobserved confounders for large genetic studies. The framework builds on sparse linear methods developed for regression and modified here for inferring causal structures of richer networks with latent variables. The method is motivated by the use of genotypes as ``instruments'' to infer causal associations between phenotypic biomarkers and outcomes, without making the common restrictive assumptions of instrumental variable methods. The method may be used for an effective screening of potentially interesting genotype phenotype and biomarker-phenotype associations in genome-wide studies, which may have important implications for validating biomarkers as possible proxy endpoints for early stage clinical trials. Where the biomarkers are gene transcripts, the method can be used for fine mapping of quantitative trait loci (QTLs) detected in genetic linkage studies. The method is applied for examining effects of gene transcript levels in the liver on plasma HDL cholesterol levels for a sample of sequenced mice from a heterogeneous stock, with $\\sim 10^5$ genetic instruments and $\\sim 47 \\times 10^3$ gene transcripts.", "full_text": "Sparse Instrumental Variables (SPIV) for\n\nGenome-Wide Studies\n\nFelix V. 
Agakov\n\nPublic Health Sciences\nUniversity of Edinburgh\n\nfelixa@aivalley.com\n\nPaul McKeigue\n\nPublic Health Sciences\nUniversity of Edinburgh\npaul.mckeigue@ed.ac.uk\n\nJon Krohn\n\nWTCHG, Oxford\n\njon.krohn@magd.ox.ac.uk\n\nAmos Storkey\n\nSchool of Informatics\nUniversity of Edinburgh\n\na.storkey@ed.ac.uk\n\nAbstract\n\nThis paper describes a probabilistic framework for studying associations between\nmultiple genotypes, biomarkers, and phenotypic traits in the presence of noise and\nunobserved confounders for large genetic studies. The framework builds on sparse\nlinear methods developed for regression and modi\ufb01ed here for inferring causal\nstructures of richer networks with latent variables. The method is motivated by the\nuse of genotypes as \u201cinstruments\u201d to infer causal associations between phenotypic\nbiomarkers and outcomes, without making the common restrictive assumptions of\ninstrumental variable methods. The method may be used for an effective screening\nof potentially interesting genotype-phenotype and biomarker-phenotype associa-\ntions in genome-wide studies, which may have important implications for validat-\ning biomarkers as possible proxy endpoints for early-stage clinical trials. Where\nthe biomarkers are gene transcripts, the method can be used for \ufb01ne mapping of\nquantitative trait loci (QTLs) detected in genetic linkage studies. The method is\napplied for examining effects of gene transcript levels in the liver on plasma HDL\ncholesterol levels for a sample of sequenced mice from a heterogeneous stock,\nwith \u223c 105 genetic instruments and \u223c 47 \u00d7 103 gene transcripts.\n\n1\n\nIntroduction\n\nA problem common to both epidemiology and to systems biology is to infer causal relationships\nbetween phenotypic measurements (biomarkers) and disease outcomes or quantitative traits. 
The problem is complicated by the fact that in large bio-medical studies, the number of possible genetic and environmental causes is very large, which makes it infeasible to conduct exhaustive interventional experiments. Moreover, it is generally impossible to remove the confounding bias due to unmeasured latent variables which influence associations between biomarkers and outcomes. Also, when the biomarkers are mRNA transcript levels, the measurements are known to be quite noisy; additionally, the number of unique candidate causes may exceed the number of observations by several orders of magnitude (the p ≫ n problem). A fundamentally important practical task is to reduce the number of possible causes of a trait to a much more manageable subset of candidates for controlled interventions. Developing an efficient framework for addressing this problem may be fundamental for overcoming bottlenecks in drug development, with possible applications in the validation of biomarkers as causal risk factors, or in developing proxies for clinical trials.\n\nWhether or not causation may be inferred from observational data has been a matter of philosophical debate. Pearl [28] argues that causal assumptions cannot be verified unless one has recourse to experimental control, and that there is nothing in the probability distribution p(x, y) which can tell whether a change in x may have an effect on y. Traditional discussions of causality are largely focused on the question of identifiability, i.e. determining sets of graph-theoretic conditions when a post-intervention distribution p(y|do(x)) may be uniquely determined from a pre-intervention distribution p(y, x, z) [27, 4, 32]. 
If the causal effects are shown to be identifiable, their magnitudes can be obtained by statistical estimation, which for common models often reduces to solving systems of linear equations. In contrast, from the Bayesian perspective, the causality detection problem may be viewed as one of model selection, where a model Mx→y is compared with My→x. The problem is complicated by likelihood-equivalence, where for each setting of parameters of one model there may exist a setting of parameters of the other giving rise to identical likelihoods. However, unless the priors are chosen in such a way that Mx→y and My→x also have identical posteriors, it may be possible to infer the direction of the arrow. The view that the priors of likelihood-equivalent models need not be chosen so as to ensure equivalence of the posteriors contrasts with e.g. [12] (and references therein), but has been defended by MacKay (see [21], Section 35).\n\nIn this paper we leave aside debates about the nature of causality and focus instead on identifying a set of candidate causes for a large partially observed under-determined genetic problem. The approach builds on the instrumental variable methods that were historically used in epidemiological studies, and on approximate Bayesian inference in sparse linear latent variable models. Specific modeling hypotheses are tested by comparing approximate marginal likelihoods of the corresponding direct, reverse, and pleiotropic models with and without latent confounders, where we follow [21] in allowing for flexible priors. 
The approach is largely motivated by the observation that\nindependent variables do not establish a causal relation, while strong unconfounded direct depen-\ndencies retained in the posterior modes even under large sparseness-inducing penalties may indicate\npotential causality and suggest candidates for further controlled experiments.\n\n2 Previous work\n\nInference of causal direction of x on y is to some extent simpli\ufb01ed if we assume existence of an\nauxiliary variable g, such that g\u2019s effect on x may only be causal, and g\u2019s effect on y may only\nbe through x. The idea is exploited in instrumental variable methods [3, 2, 29] which typically\ndeal with low-dimensional linear models, where the strength of the causal effect may be estimated\nas wx\u2192y = cov(g, y)/cov(g, x). Note also that the hypothesized cause-outcome models such as\nMg\u2192x\u2192y and Mg\u2192y\u2192x are no longer Markov-equivalent, i.e.\nit may be possible to select an\nappropriate model via likelihood-based tests. Selecting a plausible instrument g may be dif\ufb01cult in\nsome domains; however, in genetic studies it may be possible to exploit as an instrument a measure\nof genotypic variation. In quantitative genetics, such applications of instrumental variable methods\nhave been termed Mendelian randomization [15, 34]. In accordance with the requirements of the\nclassic instrumental variable methods, it is assumed that effects of the genetic instrument g on the\nbiomarker x are unconfounded, and that effects of the instrument on the outcome y are mediated only\nthrough the biomarker (i.e. there is no pleiotropy) [17, 35]. The former assumption is grounded in the\nlaws of Mendelian genetics and is satis\ufb01ed as long as population strati\ufb01cation has been adequately\ncontrolled. 
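The ratio estimator above can be sketched in a few lines; the simulation and all variable names here are illustrative (not from the paper), but they show why the instrument recovers the causal effect when a hidden confounder biases the naive regression:

```python
import numpy as np

def iv_ratio_estimate(g, x, y):
    """Classic instrumental-variable (Wald) ratio estimate of the effect of x
    on y: w_{x->y} = cov(g, y) / cov(g, x). Valid only under the usual IV
    assumptions (g affects y solely through x; the g-x link is unconfounded)."""
    return np.cov(g, y)[0, 1] / np.cov(g, x)[0, 1]

# Simulated example: g -> x -> y, with a hidden confounder z of x and y.
rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)                         # unobserved confounder
g = rng.integers(0, 3, size=n).astype(float)   # genotype coded 0/1/2
x = 0.8 * g + z + rng.normal(size=n)           # biomarker
y = 0.5 * x - z + rng.normal(size=n)           # outcome; true effect is 0.5

naive = np.cov(x, y)[0, 1] / np.var(x)         # confounded regression slope
iv = iv_ratio_estimate(g, x, y)                # close to the true 0.5
```

Here `naive` is pulled far from 0.5 by the confounder, while `iv` is not, because g is independent of z.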
However, the assumption of no hidden pleiotropy severely restricts the application of this approach, as most genotypic effects on complex traits are not sufficiently well understood to exclude pleiotropy as a possible explanation of an association. Thus the classical instrumental variable argument is limited to biomarkers for which suitable non-pleiotropic instruments exist, and cannot be easily extended to exploit studies with multiple biomarkers and genome-wide data.\n\nA more general approach to exploiting genotypic variation to infer causal relationships between gene transcript levels and quantitative traits has been developed by Schadt et al. [30] and subsequently extended (see e.g. [5]). They relax the assumption of no pleiotropy, but instead compare models with and without pleiotropy by computing standard likelihood-based scores. After filtering to select a set of gene transcripts {xj} that are associated with the trait y, and loci {gi} at which genotypes have effects on transcript levels xj, each possible triad of marker locus gi, transcript xj, and trait y is evaluated to compare three possible models: causal effect of transcript on trait, reverse causation, and a pleiotropic model (see Figure 1 left, (i)–(iii)). The support for these three models is compared by a measure of model fit penalized by complexity: either Akaike's Information Criterion (AIC) [30], or the Bayesian Information Criterion (BIC) [5]. Schadt et al. [30] denote this procedure the "likelihood-based causality model selection" (LCMS) approach. 
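The triad comparison can be sketched as follows; this is an illustrative reimplementation of the idea (Gaussian linear fits scored by AIC), not the code of [30]:

```python
import numpy as np

def gauss_ols_loglik(target, predictors):
    """Maximized Gaussian log-likelihood of an OLS fit of `target` on
    `predictors` (with intercept); returns (loglik, number of parameters)."""
    X = np.column_stack([np.ones_like(target)] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    n = len(target)
    sigma2 = resid @ resid / n                    # MLE of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return loglik, X.shape[1] + 1                 # weights + noise variance

def lcms_aic_scores(g, x, y):
    """AIC for the three triad models: causal g->x->y, reverse g->y->x,
    and pleiotropic (g->x, g->y with x, y conditionally independent)."""
    models = {
        "causal": [gauss_ols_loglik(x, [g]), gauss_ols_loglik(y, [x])],
        "reverse": [gauss_ols_loglik(y, [g]), gauss_ols_loglik(x, [y])],
        "pleiotropic": [gauss_ols_loglik(x, [g]), gauss_ols_loglik(y, [g])],
    }
    return {name: sum(2 * k - 2 * ll for ll, k in parts)
            for name, parts in models.items()}

rng = np.random.default_rng(1)
n = 5000
g = rng.integers(0, 3, n).astype(float)
x = 0.9 * g + rng.normal(size=n)
y = 0.7 * x + rng.normal(size=n)     # data generated by the causal model
scores = lcms_aic_scores(g, x, y)
best = min(scores, key=scores.get)   # lowest AIC wins
```

With data generated by the causal model, the causal factorization receives the lowest AIC; the next section's Figure 1 shows how fragile this can be in practice when the instrument varies.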
Figure 1: Left: (i–iii): Causal, reverse, and pleiotropic models of the LCMS approach [30]; (iv): pleiotropic model with two genetic instruments. Center: Possible arbitrariness of LCMS inference. The histogram shows the difference of the AIC scores for the causal and reverse models for a fixed biomarker and outcome, and various choices of loci from predictive regions. Right: AIC scores of the causal (top) and reverse (bottom) models for each choice of instrument gi (the straight lines link the scores for a fixed choice of gi). Scores were centered relative to those of the pleiotropic model. Biomarker and outcome are liver expressions of Cyp27b1 and plasma HDL measurements for heterogeneous mice. Based on the choice of gi, either causal or reverse explanations are favored.\n\nWhile the LCMS and related methods [30, 5] relax the assumption of no hidden pleiotropy of the classic Mendelian randomization method, they have three key limitations. First, effects of loci and biomarkers on outcomes are not modeled jointly, so widely varying inferences are possible depending on the choice of the triads {gi, xj, y}. Figure 1 (center, right) compares differences in the AIC scores for the causal and reverse models constructed for a fixed biomarker and outcome, and for various choices of the genetic instruments from the predictive region. Depending on the choice of instrument gi, either causal or reverse explanations are favored. A second key limitation is that the LCMS method does not allow for dependencies between multiple biomarkers, measurement noise, or latent variables (such as unobserved confounders of the biomarker-outcome associations). 
Thus, for instance, without allowance for noise in the biomarker measurements, non-zero conditional mutual information I(gi, y|xj) will be interpreted as evidence of pleiotropy or reverse causation even when the relation between the underlying biomarker and outcome is causal. Also, the method is not Bayesian (the BIC score is only a crude approximation to the Bayesian procedure for model selection).\n\nOne extension of the classic instrumental variable methods has been proposed by [4], who described graph-theoretic conditions which need to be satisfied in order for parameters of edges xi → y to be identifiable by solving a system of linear equations; however, they focus on the identifiability problem rather than on addressing a large practical under-determined task with latent variables. For example, their method does not allow for an easy integration of unmeasured confounders with unknown correlations with the intermediate and outcome variables. Another approach to modeling joint effects of genetic loci and biomarkers (gene expressions) was described by [41]. They modeled the expression measurements as three ordered levels, and used a biased greedy search over model structures from multiple starting points to find models with high BIC scores. Though applicable for large-scale studies, the approach does not allow for measurement noise or latent variables (and loses information by using categorical measurements). The vast majority of other recent model selection and structure learning methods from the machine learning literature are also either not easily extended to include latent confounders (e.g. [16], [19], [22]), or applicable only for dealing with relatively low-dimensional problems with abundant data (e.g. 
[33] and references therein).\n\n3 Methods\n\nTo address the problem of causal discovery in large bio-medical studies, we need a unified framework for modeling relations between genotypes, biomarkers, and outcomes that is computationally tractable enough to handle a large number of variables. Our approach extends LCMS and the instrumental variable methods by the joint modeling of effects of genetic loci and biomarkers, and by allowing for both pleiotropic genotypic effects and latent variables that generate couplings between biomarkers and confound the biomarker-outcome associations. It relies on Bayesian modeling of linear associations between the modeled variables, with sparseness-inducing priors on the linear weights.\n\nFigure 2: Left: SPIV structure. Filled/clear nodes correspond to observed/latent variables. Right: log Bayes factor of Mx←z→y and Mx→y as a function of the empirical correlation ρ and of γ1, for n = 100 observations, σ^2_x = σ^2_z = σ^2_y = 1, |x| = |y| = |z| = 1 and γ2 = 0, on the log10 scale. 
For intermediate γ1's and high empirical correlations, there is a strong preference for the causal model.\n\nThe Bayesian framework allows prior biological information to be included if available: for instance, cis-acting genotypic effects on transcript levels are likely to be stronger and less pleiotropic than trans-acting effects on transcript levels. It also offers a rigorous approach to model comparison, and is particularly attractive for addressing under-determined genetics problems (p ≫ n). The method builds on automatic relevance determination approaches (e.g. [20], [25], [37]) and adaptive shrinkage (e.g. [36], [8], [42]). Here it is used in the context of sparse multi-factor instrumental variable analysis in the presence of unobserved confounders, pleiotropy, and noise.\n\nModel Parameterization\n\nOur sparse instrumental variables model (SPIV) is specified with four classes of variables: genotypic and environmental covariates g ∈ R|g|, phenotypic biomarkers x ∈ R|x|, outcomes y ∈ R|y|, and latent factors z1, . . . , z|z|. The dimensionality of the latent factors |z| is fixed at a moderately high value (extraneous dimensions will tend to be pruned under the sparse prior). The latent factors z play two major roles: they represent the shared structure between groups of biomarkers, and confound biomarker-outcome associations. The biomarkers x and outcomes y are specified as hidden variables inferred from noisy observations ˜x ∈ R|˜x| and ˜y ∈ R|˜y| (note that |˜x| = |x|, |˜y| = |y|). The effects of genotype on biomarkers and outcome are assumed to be unconfounded. Pleiotropic effects of genotype (effects on outcome that are not mediated through the phenotypic biomarkers) are accounted for by an explicit parameterization of p(y|g, x, z). 
Graphical representation of the model is shown in Figure 2 (left). It is clear that the SPIV structure extends that of the instrumental variable methods [2, 3, 29] by allowing for the pleiotropic links, and also extends the pleiotropic model of Schadt et al. [30] (Figure 1 left, (iii)) by allowing for multiple instruments and latent variables.\n\nAll the likelihood terms of p(x, ˜x, y, ˜y, z|g) are linear Gaussians with diagonal covariances\n\nx = U^T g + V^T z + e_x,   y = W^T x + W_z^T z + W_g^T g + e_y,   ˜x = Ax + e_˜x,   ˜y = y + e_˜y,   (1)\n\nwhere e_x ∼ N(0, Ψ_x), e_y ∼ N(0, Ψ_y), e_˜x ∼ N(0, Ψ_˜x), e_˜y ∼ N(0, Ψ_˜y), z ∼ N(0, Ψ_z), and W ∈ R|x|×|y|, Wz ∈ R|z|×|y|, Wg ∈ R|g|×|y|, V ∈ R|z|×|x|, U ∈ R|g|×|x| are regression coefficients (factor loadings); for clarity, we assume the data is centered. A ∈ R|x|×|x| has a banded structure (accounting for possible couplings of the neighboring microarray measurements).\n\nPrior Distribution\n\nAll model parameters are specified as random variables with prior distributions. For computational convenience, the variance components of the diagonal covariances Ψ_y, Ψ_˜y, etc. are specified with inverse Gamma priors Γ−1(ai, bi), with hyperparameters ai and bi fixed at values reflecting the prior beliefs about the projection noise (often available to lab technicians collecting trait or biomarker measurements). One way to view the latent confounders z is as missing genotypes or environmental covariates, so that prior variances of the latent factors are peaked at values representative of the empirical variances of the instruments g. 
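A forward simulation of the generative equations (1) may help fix the notation; the dimensions, sparsity pattern, and noise scales below are arbitrary illustrations, and A is taken to be the identity rather than a general banded matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n, dg, dx, dy, dz = 500, 6, 4, 1, 2      # sample size and |g|, |x|, |y|, |z|

# Sparse ground-truth loadings (most entries zero).
U = np.zeros((dg, dx)); U[0, 0], U[1, 1] = 1.0, 0.8   # g -> x
V = 0.5 * rng.normal(size=(dz, dx))                   # z -> x (confounding)
W = np.zeros((dx, dy)); W[0, 0] = 0.7                 # x -> y (causal link)
Wz = 0.5 * rng.normal(size=(dz, dy))                  # z -> y (confounding)
Wg = np.zeros((dg, dy)); Wg[2, 0] = 0.6               # pleiotropic g -> y

G = rng.integers(0, 3, size=(n, dg)).astype(float)    # instruments
Z = rng.normal(size=(n, dz))                          # latent confounders
X = G @ U + Z @ V + 0.3 * rng.normal(size=(n, dx))    # x = U^T g + V^T z + e_x
Y = X @ W + Z @ Wz + G @ Wg + 0.3 * rng.normal(size=(n, dy))
X_obs = X + 0.2 * rng.normal(size=(n, dx))            # noisy ~x (here A = I)
Y_obs = Y + 0.2 * rng.normal(size=(n, dy))            # noisy ~y
```

Only the noisy `X_obs`, `Y_obs`, and the instruments `G` would be observed; `Z` and the clean `X`, `Y` are latent.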
Empirically, the choice of priors on the variance components appears to be relatively unimportant, and other choices may be considered [9].\n\nThe considered choice of a sparseness-inducing prior on parameters W, Wz, Wg, etc. is a product of zero-mean Laplace and zero-mean normal distributions\n\np(w) ∝ ∏_{i=1}^{|w|} L_{w_i}(0, γ1) N_{w_i}(0, γ2),   (2)\n\nwhere L_{w_i}(0, γ1) ∝ exp{−γ1|w_i|} and N_{w_i}(0, γ2) ∝ exp{−γ2 w_i^2}. Due to the heavy tails of the Laplacian L_{w_i}, the prior p(w) is flexible enough to capture large associations even if they are rare. Higher values of γ1 give a stronger tendency to shrink irrelevant weights to zero. It is possible to set different γ1 parameters for different linear weights (e.g. for the cis- and trans-acting effects); however, for clarity of this presentation we shall only use a global parameter γ1. The isotropic Gaussian component with the inverse variance γ2 contributes to the grouping effect (see [42], Theorem 1). The considered family of priors (2) induces better consistency properties [40] than the commonly used Laplacians [36, 9, 39, 26, 31]. It has also been shown [14] that important associations between variables may be recovered even for severely under-determined problems (p ≫ n) common in genetics. The SPIV model with p(w) defined as in (2) generalizes LASSO and elastic net regression [36, 42]. As a special case, it also includes sparse conditional factor analysis. 
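The penalty induced by prior (2) is the familiar elastic-net combination; a minimal sketch of it and of the associated soft-thresholding (proximal) map used in elastic-net-style fitting (function names and constants are illustrative, not from the paper):

```python
import numpy as np

def neg_log_prior(w, gamma1, gamma2):
    """Negative log of prior (2), up to an additive constant:
    gamma1 * sum|w_i| (Laplace part) + gamma2 * sum w_i^2 (Gaussian part)."""
    return gamma1 * np.abs(w).sum() + gamma2 * (w ** 2).sum()

def enet_prox(v, gamma1, gamma2, step=1.0):
    """Proximal operator of the penalty: soft-threshold (which zeroes small
    entries exactly) followed by a multiplicative Gaussian shrinkage."""
    thresholded = np.sign(v) * np.maximum(np.abs(v) - step * gamma1, 0.0)
    return thresholded / (1.0 + 2.0 * step * gamma2)

w = np.array([3.0, 0.05, -1.2, 0.0])
sparse_w = enet_prox(w, gamma1=0.5, gamma2=0.1)   # small entries -> exactly 0
```

The Laplace term is what produces exact zeros; the Gaussian term only rescales, which is the source of the grouping effect mentioned above.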
Other sparse priors\non the weights, such as Student-t, \u201cspike-and-slab\u201d, or inducing Lq<1 penalties tend to result in less\ntractable posteriors even for linear regression [10, 37, 8], which also motivates the choice (2).\n\nSome additional intuition of the in\ufb02uence of the sparse prior on the causal inference may be gained\nby numerically comparing the marginal likelihoods of the Markov-equivalent models with and with-\nout confounders Mx\u2190z\u2192y, Mx\u2192y. (Comparison of these models is of particular importance in\nepidemiology, because while the temporal data may often be available for distinguishing direct and\nreverse models Mx\u2192y and My\u2192x, it is generally dif\ufb01cult to ensure that there is no confounding).\nFigure 2 shows that when the empirical correlations are strong and \u03b31 is at intermediate levels, there\nis a strong preference for a causal model. This is because the alternative model with the confounders\nwill have more parameters, and the weights will need to be larger (and therefore more strongly pe-\nnalized by the prior) in order to lead to the same likelihood (note that for var(x) = var(y) = 1, the\nlikelihood-equivalence is achieved for w = vwz, |w| \u2264 1). Larger values of \u03b31 will tend to strongly\npenalize all the weights, which makes the models largely indistinguishable. Also, as the number of\ngenetic instruments grows, evidence in favor of the causal or pleiotropic model will be less depen-\ndent upon the priors on model parameters. For instance, with two genotypic variables that perturb\na single transcript, the causal model has three adjustable parameters, but the pleiotropic model has\n\ufb01ve (see Figure 1 left, (iv)). 
Where several genotypic variables perturb a single transcript and the causal model fits the data nearly as well as the pleiotropic model, the causal model will have higher marginal likelihood under almost any plausible prior, because the slightly better fit of the pleiotropic model will be outweighed by the penalty imposed by several extra adjustable parameters.\n\nInference\n\nWhile the choice of prior (2) encourages sparse solutions, it makes exact inference of the posterior parameters p(θ|D) analytically intractable. The most efficient approach is based on the maximum-a-posteriori (MAP) treatment ([36], [9]), which reduces to solving the optimization problem\n\nθ_MAP = arg max_θ {log p({˜y}, {˜x} | {g}, θ) + log p(θ)}   (3)\n\nfor the joint parameters θ, where the latent variables have been integrated out. Note that the MAP solution for SPIV may also be easily derived for the semi-supervised case where the biomarker and outcome vectors are only partially observed. Compared to other approximations of inference in sparse linear models based e.g. on sampling or expectation propagation [26, 31], the MAP approximation allows for an efficient handling of very large networks with multiple instruments and biomarkers, and makes it straightforward to incorporate latent confounders. Depending on the choice of the global sparseness and grouping hyperparameters γ1, γ2, the obtained solutions for the weights will tend to be sparse, which is also in contrast to the full inference methods. In high dimensions in particular, the parsimony induced by the point-estimates will facilitate structure discovery and interpretation of the findings.\n\nOne way to optimize (3) is by an EM-like algorithm. 
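The flavor of such an EM-like scheme can be sketched on a stripped-down instance: MAP weights of a single sparse linear regression under prior (2), via the standard iteratively reweighted least-squares bound for the Laplace term. This is illustrative only; the full SPIV updates additionally involve expectations over the latent variables:

```python
import numpy as np

def map_enet_weights(X, y, gamma1, gamma2, sigma2=1.0, n_iter=50, eps=1e-8):
    """EM-like fixed-point iteration for MAP weights of y = Xw + e under the
    Laplace-times-Gaussian prior (2). Each step solves the reweighted system
        w <- (X^T X + sigma2*(gamma1*diag(1/|w|) + gamma2*I))^{-1} X^T y."""
    d = X.shape[1]
    w = np.linalg.lstsq(X, y, rcond=None)[0]      # start from the OLS solution
    for _ in range(n_iter):
        D = np.diag(1.0 / np.maximum(np.abs(w), eps))
        w = np.linalg.solve(
            X.T @ X + sigma2 * (gamma1 * D + gamma2 * np.eye(d)), X.T @ y)
    return w

rng = np.random.default_rng(3)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[0], w_true[3] = 2.0, -1.5                  # only two relevant inputs
y = X @ w_true + 0.1 * rng.normal(size=n)
w_map = map_enet_weights(X, y, gamma1=5.0, gamma2=0.1)
```

After a few iterations the irrelevant weights collapse toward zero while the two relevant ones stay close to their true values.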
For example, the fixed-point update for u_i ∈ R|g| linking biomarker x_i with the vector of instruments g is easily expressed as\n\nu_i^{(t)} = (G^T G + σ^2_{x_i} (γ1 ´U_i^{(t−1)} + γ2 I_{|g|}))^{−1} (G^T ⟨x_i⟩ − G^T ⟨Z⟩ v_i),   (4)\n\nFigure 3: Top: SPIV for artificial datasets. Left/right plots show typical applications for the high and low observation noise (σ^2_˜x = 0.25 and σ^2_˜x = 0.05 respectively). Top and bottom rows of each Hinton diagram correspond to the ground truth and the MAP weights U (1–18), W (19–21), Wg (22–27). 
Bottom: SPIV for a genome-wide study of causal effects on HDL in heterogeneous stock mice. Left/right plots show maximum a-posteriori weights θ_MAP and the mutual information I(x_i, y|e) between the unobserved biomarkers and outcome evaluated from the model at θ_MAP, under the joint Gaussian assumption. A cluster of pleiotropic links on chromosome 1 at about 173 Mbp is consistent with biology. The biomarker with the strongest unconfounded effect on HDL is Cyp27b1. Transcripts that are most predictive of HDL through their links with pleiotropic genetic markers on chromosome 1 are Uap1, Rgs5, Apoa2, and Nr1i3. Parameters γ1,2 have been obtained by cross-validation.\n\nwhere G ∈ Rn×|g| is the design matrix, (´U_i)_{kl} = δ_{kl}/|u_{ki}| ∀k, l ∈ [1, |g|] ∩ Z, x_i ∈ Rn, Z ∈ Rn×|z|, v_i ∈ R|z|, and σ^2_{x_i} = (Ψ_x)_{ii}. The expectations ⟨·⟩ are computed with respect to p(·|{˜x}, {˜y}, {g}), which for (1) are easily expressed in closed form. The rest is expressed analogously, and extensions to the partially observed cases are straightforward. Faster (although more heuristic) alternatives may be used for speeding up the M-step (e.g. [7]). The hyperparameters may be set by cross-validation, marginalized out by specifying a hyper-prior, or set heuristically based on the expected number of links to be retained in the posterior mode. Once a sparse representation is produced by pruning irrelevant dimensions, more computationally intensive inference methods for the full posterior (such as expectation propagation or MCMC) may be used in the resulting lower-dimensional model if needed. After fitting SPIV to data, formal hypothesis tests were performed by comparing the marginal likelihoods of the specific models for the retained instruments, biomarkers, and target outcomes. These were evaluated by the Laplace approximation at θ_MAP (e.g. 
[20]).\n\n4 Results\n\nArtificial data: We applied SPIV to several simulated datasets, and compared specific modeling hypotheses for the biomarkers retained in the posterior modes. The structures were consistent with the generic SPIV model, with all non-zero weights sampled from N(0, 1). Figure 3 (top) shows typical results for the high/low observation noise (∀i, σ^2_˜xi = σ^2_˜y = 0.25/0.05). Note the excellent sign-consistency of the results for the more important factors. Separate simulations showed robustness under multiple EM runs and under- or over-estimation of the true number of confounders. Subsequent testing of the specific modeling hypotheses for the most important factors resulted in the correct discrimination of causal and confounded associations in ≈ 86% of cases.\n\nGenome-wide study of HDL cholesterol in mice: To demonstrate our method for a large-scale practical application, we examined effects of gene transcript levels in the liver on plasma high-density lipoprotein (HDL) cholesterol levels for mice from a heterogeneous stock. The genetic factors influencing HDL in mice have been well explored in biology, e.g. by Valdar et al. [38]. The gene expression data was collected and preprocessed by [13], who have kindly agreed to share a part of their data. Breeding pairs for the stock were obtained at 50 generations after the stock foundation. At each of the 12500 marker loci, genotypes were described by 8-D vectors of expected founder ancestry proportions inferred from the raw marker genotypes by an HMM-based reconstruction method [23]. Mouse-specific covariates included age and sex, which were used to augment the set of genetic instruments. The full set of phenotypic biomarkers consisted of 47429 transcript levels, appropriately transformed and cleaned. Available data included 260 animals. 
Before applying our method, we decreased the dimensionality of the genetic features and RNA expressions by using a combination of seven feature (subset) selection methods, based on applications of filters, greedy (step-wise) regression, sequential approximations of the mutual information between the retained set and the outcome of interest, and applications of regression methods with LASSO and elastic net (EN) shrinkage priors for the genotypes g, observed biomarkers ˜x, and observed HDL measurements ˜y. For the LASSO and EN methods, global hyper-parameters were obtained by 10-fold cross-validation. Note that feature selection is unavoidable for genome-wide studies using gene expressions as biomarkers. Indeed, the considered case of ∼O(10^5) instruments and 47K biomarkers would give rise to ≳ O(10^9) interaction weights, which is expensive to analyze or even keep in memory. After applying subset selection methods, SPIV was typically applied to subsets of data with ∼O(10^5) loci-biomarker interactions.\n\nThe results of the SPIV analysis of this dataset are shown in Figure 3 (bottom). The bottom left plot shows maximum a-posteriori weights θ_MAP computed by running the EM-like optimization procedure to convergence from 20 random initializations. For a model with latent variables and about 30,000 weights, each run took approximately 10 minutes of execution time (only weakly optimized Matlab code, simple desktop). The parameters γ1,2 were obtained by 10-fold CV. Note that only a fraction of the variables remains in the posterior. In this case and for the considered sparseness-inducing priors, no hidden confounders appear to have strong effects on the outcome in the posterior¹. The spikes of the pleiotropic activations in sex chromosome 20 and around chromosome 1 are consistent with the biological knowledge [38]. 
The biomarker with the strongest direct effect on HDL (computed as the mean MAP weight $w_i: x_i \to y$ divided by its standard deviation over multiple runs, where each mean weight exceeds a threshold) is the expression of Cyp27b1 (the gene responsible for vitamin D metabolism). Knockout of the Cyp27b1 gene in mice has been shown to alter body fat stores [24], which might be expected to affect HDL cholesterol levels. Recently it has also been shown that a quantitative trait locus for circulating vitamin D levels in humans includes a gene that codes for the enzyme that synthesizes cholesterol [1]. A subsequent comparison of 18 specific reverse, pleiotropic, and causal models for Cyp27b1, HDL, and the whole vector of retained genetic instruments (known to be causal by definition) showed slightly stronger evidence in favor of the reverse hypothesis without latent confounders (with the ratio of Laplace approximations of the marginal likelihoods of reverse vs. causal models of $\approx 1.95 \pm 0.27$). This is in contrast to the LCMS, where the results are strongly affected by the choice of an instrument (Figure 1, right, shows the results for Cyp27b1, HDL, and the same choice of instruments).

To demonstrate an application to gene fine-mapping studies, Figure 3 (bottom right) shows the approximate mutual information $I(x_i, y \,|\, e = \{\text{age, sex}\})$ between the underlying biomarkers and unobserved HDL levels expressed from the model at $\theta_{MAP}$. The mutual information takes into account not only the strength of the direct effect of $x_i$ on $y$, but also associations with the pleiotropic instruments, strengths of the pleiotropic effects, and dependencies between the instruments.
Under the as-if Gaussian assumption, $I(x_i, y_j \,|\, \theta_{MAP}) = \log(\sigma^2_{y_j}\sigma^2_{x_i}) - \log(\sigma^2_{y_j}\sigma^2_{x_i} - \sigma^4_{y_j x_i})$, where

$$\sigma^2_{y_j} = \|\Sigma^{1/2}_{gg}(U w_j + w_{g_j})\|^2 + \|\Psi^{1/2}_{z}(V w_j + w_{z_j})\|^2 + w_j^T \Psi_x w_j + \Psi_{y_j}, \qquad (5)$$

with the rest expressed analogously. Here $\Sigma_{gg} \in \mathbb{R}^{|g| \times |g|}$ is the empirical covariance of the instruments, and $w_j \in \mathbb{R}^{|x|}$, $w_{z_j} \in \mathbb{R}^{|z|}$, and $w_{g_j} \in \mathbb{R}^{|g|}$ are the MAP weights of the couplings of $y_j$ with the biomarkers, confounders, and genetic instruments respectively. When the outcome is HDL, the majority of predictive transcripts are fine-mapped to a small region on chromosome 1 which includes Uap1, Rgs5, Apoa2, and Nr1i3. The informativeness of these genes about HDL cholesterol cannot be inferred simply from correlations between the measured gene expression and HDL levels; for example, when ranked in accordance with $\rho^2(\tilde{x}_i, \tilde{y} \,|\, \text{age, sex})$, the top 4 genes have the rankings of 838, 961, 6284, and 65 respectively. The findings are also biologically plausible and consistent with high-profile biological literature (with associations between Apoa2 and HDL described in [38], and strong links of Rgs5 to a genomic region strongly associated with metabolic traits discussed in [5], while Nr1i3 and Uap1 are their neighbors on chromosome 1 within $\sim 1$ Mbp). Note that the couplings are via the links with the pleiotropic genetic markers on chromosome 1.

¹The absence of confounder effects in the posterior mode for the considered $\gamma_{1,2}$ is specific to the considered mouse HDL dataset, which shows relatively strong correlations between the measured biomarkers and the outcome. An application of SPIV to proprietary human data for a study of effects of vitamins and calcium levels on colorectal cancer (which we are not yet allowed to publish) showed very strong effects of the latent confounders.
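Eq. (5) can be checked numerically: under an assumed toy linear-Gaussian parameterization (all matrices and dimensions invented for illustration, with $U$, $V$, $\Sigma_{gg}$, $\Psi_z$, $\Psi_x$, $\Psi_{y_j}$ named as in the equation), the analytic variance of $y_j$ agrees with a Monte Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(4)
ng, nx, nz, n = 6, 4, 2, 400_000     # toy sizes; large n only for the Monte Carlo check

A = rng.normal(size=(ng, ng))
Sgg = A @ A.T / ng                   # instrument covariance Sigma_gg (assumed)
U = rng.normal(size=(ng, nx)) * 0.3  # instrument -> biomarker loadings
V = rng.normal(size=(nz, nx)) * 0.3  # confounder -> biomarker loadings
Pz = np.diag(rng.uniform(0.5, 1.5, nz))   # confounder covariance Psi_z
Px = np.diag(rng.uniform(0.1, 0.5, nx))   # biomarker noise covariance Psi_x
Py = 0.2                                  # outcome noise Psi_{y_j}
wj = rng.normal(size=nx) * 0.5            # x -> y_j weights
wgj = rng.normal(size=ng) * 0.2           # direct g -> y_j (pleiotropic) weights
wzj = rng.normal(size=nz) * 0.2           # z -> y_j weights

# Analytic variance of y_j, term by term as in Eq. (5).
a = U @ wj + wgj
b = V @ wj + wzj
var_eq5 = a @ Sgg @ a + b @ Pz @ b + wj @ Px @ wj + Py

# Monte Carlo check: simulate the implied linear-Gaussian model and compare.
g = rng.multivariate_normal(np.zeros(ng), Sgg, size=n)
z = rng.multivariate_normal(np.zeros(nz), Pz, size=n)
x = g @ U + z @ V + rng.normal(size=(n, nx)) * np.sqrt(np.diag(Px))
y = x @ wj + g @ wgj + z @ wzj + rng.normal(scale=np.sqrt(Py), size=n)
print(var_eq5, y.var())
```

The two values agree to within Monte Carlo error, confirming that the four terms of Eq. (5) account for the instrument, confounder, biomarker-noise, and outcome-noise contributions respectively.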
Adjusting for sex and age prior to performing feature selection and inference did not significantly change the results.

The results reported here appear to be stable under different choices of feature selection methods, data adjustments, and algorithm runs. We note, however, that different results may potentially be obtained depending on the choice of animal populations and/or the processing of the biomarker (gene expression) measurements. Details of the data collection, microarray preprocessing, and feature selection, along with the detailed findings for other biomarkers and phenotypic outcomes, will be made available online. Definitive confirmation of these relationships would require gene knockout experiments.

5 Discussion and extensions

In large-scale genetic and bio-medical studies, we face the practical task of reducing a huge set of candidate causes of complex traits to a more manageable subset where experimental control (such as gene knockout experiments or biomarker alterations) may be performed. SPIV performs the screening of interesting biomarker-phenotype and genotype-biomarker-phenotype associations by exploiting maximum a posteriori inference in a sparse linear latent variable model. Additional screening is performed by comparing approximate marginal likelihoods of specific modeling hypotheses, including direct, reverse, and pleiotropic models with and without confounders, which (under the assumption of no "prior equivalence") may serve as an additional test of possible causation [21].
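The marginal-likelihood screening of competing hypotheses can be illustrated on a toy scalar chain $g \to x \to y$. Here BIC-penalized Gaussian log-likelihoods stand in for the paper's Laplace approximations, and the two-model comparison is a simplification of the richer hypothesis sets considered above; all data-generating values are invented.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000
# Data generated from the "causal" chain g -> x -> y (toy coefficients).
g = rng.normal(size=n)
x = 0.8 * g + rng.normal(scale=0.5, size=n)
y = 1.1 * x + rng.normal(scale=0.5, size=n)

def ls_pred(a, b):
    """Least-squares prediction of b from a (slope only; data are zero-mean)."""
    return (a @ b / (a @ a)) * a

def bic_score(target, pred, k):
    """Gaussian maximum log-likelihood minus the BIC penalty for k parameters."""
    s2 = (target - pred).var()
    ll = -0.5 * len(target) * (np.log(2 * np.pi * s2) + 1)
    return ll - 0.5 * k * np.log(len(target))

# Causal hypothesis p(x|g)p(y|x) vs. reverse hypothesis p(y|g)p(x|y).
causal = bic_score(x, ls_pred(g, x), 2) + bic_score(y, ls_pred(x, y), 2)
reverse = bic_score(y, ls_pred(g, y), 2) + bic_score(x, ls_pred(y, x), 2)
print(causal > reverse)
```

The causal factorization matches the true conditional-independence structure ($y \perp g \mid x$), so its approximate marginal likelihood dominates, which is the intuition behind using such ratios as a screening test for causation.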
Intuitively, the approach is motivated by the observation that while independence of variables implies that they are not in a causal relation, a preference for an unconfounded causal model may indicate possible causality and warrant further controlled experiments.

Technically, SPIV may be viewed as an extension of LASSO and elastic net regression which allows for latent variables and pleiotropic dependencies. While being particularly attractive for genetic studies, SPIV or its modifications may potentially be applied to more general structure learning tasks. For example, when applied iteratively, SPIV may be used to guide search over richer model structures, where a greedy search over parent nodes is replaced by a continuous optimization problem which combines subset selection and regression in the presence of latent variables. Other extensions of the framework could involve hybrid (discrete- and real-valued) outcomes with nonlinear/non-Gaussian likelihoods. Also, as mentioned earlier, once sparse representations are produced by the MAP inference, it may be possible to utilize more accurate approximations of the inference applicable to the induced sparse structures [6]. Note also that sparse priors on the linear weights tend to give rise to sparse covariance matrices. A potentially interesting alternative may involve a direct estimation of conditional precision matrices with a sparse group penalty. While SPIV attempts to focus attention on important biomarkers establishing strong direct associations with the phenotypes, modeling of the precisions may be used for filtering out unimportant factors (conditionally) independent of the outcome variables.
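The precision-matrix alternative just mentioned can be sketched with a (non-latent) sparse inverse covariance estimator; here scikit-learn's GraphicalLasso recovers conditional-independence structure on a toy chain graph, which is only a stand-in for the sparse group-penalized estimator discussed in the text.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(5)
# Toy sparse precision matrix: a chain graph over 5 variables, so variables
# 0 and 4 are conditionally independent given the rest.
prec = np.eye(5) + np.diag([0.4] * 4, 1) + np.diag([0.4] * 4, -1)
X = rng.multivariate_normal(np.zeros(5), np.linalg.inv(prec), size=2000)

gl = GraphicalLasso(alpha=0.05).fit(X)
est = gl.precision_
# Small/zero entries in the estimated precision mark conditionally independent
# pairs, i.e. candidates for filtering out.
print(abs(est[0, 4]), abs(est[0, 1]))
```

The estimated precision entry for the non-adjacent pair is driven toward zero while the adjacent pair's entry survives, illustrating how precision modeling filters out factors that are conditionally independent of the variables of interest.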
Our future work will involve a direct estimation of the sparse conditional precision matrix $\Sigma^{-1}_{xyz|g}$ of the biomarkers, outcomes, and unmeasured confounders (given the instruments), through latent variable extensions of the recently proposed graphical LASSO and related methods [11, 18].

The key purpose of this paper is to draw the attention of the machine learning community to the problem of inferring causal relationships between phenotypic measurements and complex traits (disease risks), which may have tremendous implications in epidemiology and systems biology. Our specific approach to the problem is inspired by the ideas of instrumental variable analysis commonly used in epidemiological studies, which we have extended to properly address situations where the genetic variables may be direct causes of the hypothesized outcomes. The sparse instrumental variable framework (SPIV) overcomes limitations of the likelihood-based LCMS methods often used by geneticists, by modeling joint effects of genetic loci and biomarkers in the presence of noise and latent variables. The approach is tractable enough to be used in genetic studies with tens of thousands of variables. It may be used for identifying specific genes associated with phenotypic outcomes, and may have wide applications in the identification of biomarkers as possible targets for interventions, or as proxy endpoints for early-stage clinical trials.

References

[1] J. Ahn, K. Yu, and R. Stolzenberg-Solomon et al. Genome-wide association study of circulating vitamin D levels. Human Molecular Genetics, 2010. Epub ahead of print.
[2] J. D. Angrist, G. W. Imbens, and D. B. Rubin. Identification of causal effects using instrumental variables (with discussion). J. of the Am. Stat. Assoc., 91:444-455, 1996.
[3] R. J. Bowden and D. A. Turkington. Instrumental Variables. Cambridge University Press, 1984.
[4] C. Brito and J. Pearl.
Generalized instrumental variables. In UAI, 2002.
[5] Y. Chen, J. Zhu, and P. Y. Lum et al. Variations in DNA elucidate molecular networks that cause disease. Nature, 452:429-435, 2008.
[6] B. Cseke and T. Heskes. Improving posterior marginal approximations in latent Gaussian models. In AISTATS, 2010.
[7] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Ann. of Stat., 32, 2004.
[8] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. J. of the Am. Stat. Assoc., 96(456):1348-1360, 2001.
[9] M. Figueiredo. Adaptive sparseness for supervised learning. IEEE Trans. on PAMI, 25(9), 2003.
[10] I. E. Frank and J. H. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109-135, 1993.
[11] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 2008.
[12] D. Heckerman, C. Meek, and G. F. Cooper. A Bayesian approach to causal discovery. In C. Glymour and G. F. Cooper, editors, Computation, Causation, and Discovery. MIT, 1999.
[13] G. J. Huang, S. Shifman, and W. Valdar et al. High resolution mapping of expression QTLs in heterogeneous stock mice in multiple tissues. Genome Research, 19(6):1133-40, 2009.
[14] J. Jia and B. Yu. On model selection consistency of the elastic net when p ≫ n. Technical Report 756, UC Berkeley, Department of Statistics, 2008.
[15] M. B. Katan. Apolipoprotein E isoforms, serum cholesterol and cancer. Lancet, i:507-508, 1986.
[16] S. Kim and E. Xing. Statistical estimation of correlated genome associations to a quantitative trait network. PLOS Genetics, 5(8), 2009.
[17] D. A. Lawlor, R. M. Harbord, and J. Sterne et al. Mendelian randomization: using genes as instruments for making causal inferences in epidemiology. Stat. in Medicine, 27:1133-1163, 2008.
[18] E.
Levina, A. Rothman, and J. Zhu. Sparse estimation of large covariance matrices via a nested lasso penalty. The Ann. of App. Stat., 2(1):245-263, 2008.
[19] M. H. Maathuis, M. Kalisch, and P. Bühlmann. Estimating high-dimensional intervention effects from observational data. The Ann. of Stat., 37:3133-3164, 2009.
[20] D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4:415-447, 1992.
[21] D. J. C. MacKay. Information Theory, Inference & Learning Algorithms. Cambridge University Press, 2003.
[22] J. Mooij, D. Janzing, J. Peters, and B. Schoelkopf. Regression by dependence minimization and its application to causal inference in additive noise models. In ICML, 2009.
[23] R. Mott, C. J. Talbot, M. G. Turri, A. C. Collins, and J. Flint. A method for fine mapping quantitative trait loci in outbred animal stocks. Proc. Nat. Acad. Sci. USA, 97:12649-12654, 2000.
[24] C. J. Narvaez and D. Matthews et al. Lean phenotype and resistance to diet-induced obesity in vitamin D receptor knockout mice correlates with induction of uncoupling protein-1. Endocrinology, 150(2), 2009.
[25] R. M. Neal. Bayesian Learning for Neural Networks. Springer, 1996.
[26] T. Park and G. Casella. The Bayesian LASSO. J. of the Am. Stat. Assoc., 103(482), 2008.
[27] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
[28] J. Pearl. Causal inference in statistics: an overview. Statistics Surveys, 3:96-146, 2009.
[29] J. M. Robins and S. Greenland. Identification of causal effects using instrumental variables: comment. J. of the Am. Stat. Assoc., 91:456-458, 1996.
[30] E. E. Schadt, J. Lamb, X. Yang, and J. Zhu et al. An integrative genomics approach to infer causal associations between gene expression and disease. Nature Genetics, 37(7):710-717, 2005.
[31] M. W. Seeger. Bayesian inference and optimal design for the sparse linear model. JMLR, 9, 2008.
[32] I. Shpitser and J. Pearl.
Identification of conditional interventional distributions. In UAI, 2006.
[33] R. Silva, R. Scheines, C. Glymour, and P. Spirtes. Learning the structure of linear latent variable models. JMLR, 7, 2006.
[34] G. D. Smith and S. Ebrahim. Mendelian randomisation: can genetic epidemiology contribute to understanding environmental determinants of disease? Int. J. of Epidemiology, 32:1-22, 2003.
[35] D. C. Thomas and D. V. Conti. Commentary: The concept of Mendelian randomization. Int. J. of Epidemiology, 32, 2004.
[36] R. Tibshirani. Regression shrinkage and selection via the lasso. JRSS:B, 58(1):267-288, 1996.
[37] M. E. Tipping. Sparse Bayesian learning and the RVM. JMLR, 1:211-244, 2001.
[38] W. Valdar, L. C. Solberg, and S. Burnett et al. Genome-wide genetic association of complex traits in heterogeneous stock mice. Nature Genetics, 38:879-887, 2006.
[39] M. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using L1-constrained quadratic programming. IEEE Trans. on Inf. Theory, 55:2183-2202, 2009.
[40] M. Yuan and Y. Lin. On the nonnegative garrote estimator. JRSS:B, 69, 2007.
[41] J. Zhu, M. C. Wiener, and C. Zhang et al. Increasing the power to detect causal associations by combining genotypic and expression data in segregating populations. PLOS Comp. Biol., 3(4):692-703, 2007.
[42] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. JRSS:B, 67(2), 2005.