{"title": "Deep Generalized Method of Moments for Instrumental Variable Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 3564, "page_last": 3574, "abstract": "Instrumental variable analysis is a powerful tool for estimating causal effects when randomization or full control of confounders is not possible. The application of standard methods such as 2SLS, GMM, and more recent variants are significantly impeded when the causal effects are complex, the instruments are high-dimensional, and/or the treatment is high-dimensional. In this paper, we propose the DeepGMM algorithm to overcome this. Our algorithm is based on a new variational reformulation of GMM with optimal inverse-covariance weighting that allows us to efficiently control very many moment conditions. We further develop practical techniques for optimization and model selection that make it particularly successful in practice. Our algorithm is also computationally tractable and can handle large-scale datasets. Numerical results show our algorithm matches the performance of the best tuned methods in standard settings and continues to work in high-dimensional settings where even recent methods break.", "full_text": "Deep Generalized Method of Moments\n\nfor Instrumental Variable Analysis\n\nAndrew Bennett\u21e4\nCornell University\n\nawb222@cornell.edu\n\nNathan Kallus\u21e4\nCornell University\n\nkallus@cornell.edu\n\nTobias Schnabel\u21e4\nMicrosoft Research\ntbs49@cornell.edu\n\nAbstract\n\nInstrumental variable analysis is a powerful tool for estimating causal effects when\nrandomization or full control of confounders is not possible. The application of\nstandard methods such as 2SLS, GMM, and more recent variants are signi\ufb01cantly\nimpeded when the causal effects are complex, the instruments are high-dimensional,\nand/or the treatment is high-dimensional. In this paper, we propose the DeepGMM\nalgorithm to overcome this. 
Our algorithm is based on a new variational reformulation of GMM with optimal inverse-covariance weighting that allows us to efficiently control very many moment conditions. We further develop practical techniques for optimization and model selection that make it particularly successful in practice. Our algorithm is also computationally tractable and can handle large-scale datasets. Numerical results show our algorithm matches the performance of the best tuned methods in standard settings and continues to work in high-dimensional settings where even recent methods break.\n\n1 Introduction\n\nUnlike standard supervised learning that models correlations, causal inference seeks to predict the effect of counterfactual interventions not seen in the data. For example, when wanting to estimate the effect of adherence to a prescription of β-blockers on the prevention of heart disease, supervised learning may overestimate the true effect because good adherence is also strongly correlated with health consciousness and therefore with good heart health [13]. Figure 1 shows a simple example of this type and demonstrates how a standard neural network (in blue) fails to correctly estimate the true treatment response curve (in orange) in a toy example. The issue is that standard supervised learning assumes that the residual between the response and the prediction of interest is independent of the features.\n\nOne approach to account for this is to adjust for all confounding factors that cause the dependence, such as via matching [24, 33] or regression, potentially using neural networks [23, 25, 34]. However, this requires that we actually observe all confounders so that treatment is as-if random after conditioning on observables. 
This would mean that in the β-blocker example, we would need to perfectly measure all latent factors that determine both an individual's adherence decision and their general healthfulness, which is often not possible in practice.\n\nInstrumental variables (IVs) provide an alternative approach to causal-effect identification. If we can find a latent experiment in another variable (the instrument) that influences the treatment (i.e., is relevant) and does not directly affect the outcome (i.e., satisfies exclusion), then we can use this to infer causal effects [3]. In the β-blocker example [13], the authors used co-pay cost as an IV. Because they enable analyzing natural experiments under mild assumptions, IVs have been one of the most widely used tools for empirical research in a variety of fields [2]. An important direction of research\n\n*Alphabetical order.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: A toy example in which standard supervised learning fails to identify the true response function g0(X) = max(X/5, X). Data was generated via Y = g0(X) − 2ε + η, X = Z + 2ε. All other variables are standard normal.\n\nfor IV analysis is to develop methods that can effectively handle complex causal relationships and complex variables like images that necessitate more flexible models like neural networks [21, 28].\n\nIn this paper, we tackle this through a new method called DeepGMM that builds upon the optimally-weighted Generalized Method of Moments (GMM) [17], a widely popular method in econometrics that uses the moment conditions implied by the IV model to efficiently estimate causal parameters. Leveraging a new variational reformulation of the efficient GMM with optimal weights, we develop a flexible framework, DeepGMM, for doing IV estimation with neural networks. 
In contrast to existing approaches, DeepGMM is suited for high-dimensional treatments X and instruments Z, as well as for complex causal and interaction effects. DeepGMM is given by the solution to a smooth game between a prediction function and a critic function. We prove that approximate equilibria provide consistent estimates of the true causal parameters. We find these equilibria using optimistic gradient descent algorithms for smooth game play [15], and give practical guidance on how to choose the parameters of our algorithm and do model validation. In our empirical evaluation, we demonstrate that DeepGMM's performance is on par with or superior to a large number of existing approaches in standard benchmarks and continues to work in high-dimensional settings where other methods fail.\n\n2 Setup and Notation\n\nWe assume that our data is generated by\n\nY = g0(X) + ε,  (1)\n\nwhere the residual ε has zero mean and finite variance, i.e., E[ε] = 0 and E[ε²] < ∞. However, unlike in standard supervised learning, we allow the residual ε and X to be correlated, E[ε | X] ≠ 0, i.e., X can be endogenous, and therefore g0(X) ≠ E[Y | X]. We also assume that we have access to an instrument Z satisfying\n\nE[ε | Z] = 0.  (2)\n\nMoreover, Z should be relevant, i.e., P(X | Z) ≠ P(X). Our goal is to identify the causal response function g0(·) from a parametrized family of functions G = {g(·; θ) : θ ∈ Θ}. Examples are linear functions g(x; θ) = θᵀφ(x) for a feature map φ, neural networks where θ represents the weights, and non-parametric classes with infinite-dimensional θ. For convenience, let θ0 ∈ Θ be such that g0(·) = g(·; θ0). 
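To make the endogeneity in Eqs. (1)–(2) concrete, here is a minimal numpy simulation (our own illustration using the toy design from Figure 1, not code from the paper): the residual Y − g0(X) is strongly correlated with the treatment X but approximately uncorrelated with the instrument Z.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy design from Figure 1: instrument Z, confounding noise eps,
# independent outcome noise eta (all standard normal).
Z = rng.standard_normal(n)
eps = rng.standard_normal(n)
eta = rng.standard_normal(n)

g0 = lambda x: np.maximum(x / 5.0, x)  # true response function
X = Z + 2 * eps                        # treatment: endogenous through eps
Y = g0(X) - 2 * eps + eta              # outcome: residual is -2*eps + eta

resid = Y - g0(X)
print(np.corrcoef(resid, X)[0, 1])  # strongly negative: X is endogenous
print(np.corrcoef(resid, Z)[0, 1])  # near zero: Z satisfies Eq. (2)
```

Here a direct regression of Y on X is biased because E[ε | X] ≠ 0, while the instrument remains valid since E[ε | Z] = 0.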
Throughout, we measure the performance of an estimated response function ĝ by its MSE against the true g0.\n\nNote that if we additionally have some exogenous context variables L, the standard way to model this using Eq. (1) is to include them both in X and in Z as X = (X0, L) and Z = (Z0, L), where X0 is the endogenous variable and Z0 is an IV for it. In the β-blocker example, if we were interested in the heterogeneity of the effect of adherence over demographics, X would include both adherence and demographics whereas Z would include both co-payment and demographics.\n\n2.1 Existing methods for IV estimation\n\nTwo-stage methods. One strategy for identifying g0 is based on noting that Eq. (2) implies\n\nE[Y | Z] = E[g0(X) | Z] = ∫ g0(x) dP(X = x | Z).  (3)\n\nIf we let g(x; θ) = θᵀφ(x), this becomes E[Y | Z] = θ0ᵀ E[φ(X) | Z]. The two-stage least squares (2SLS) method [3, §4.1.1] first fits E[φ(X) | Z] by least-squares regression of φ(X) on Z (with Z possibly transformed) and then estimates θ̂_2SLS as the coefficient in the regression of Y on the fitted E[φ(X) | Z]. This, however, fails when one does not know a sufficient basis φ(x) for g(x; θ0). [14, 29] propose non-parametric methods for expanding such a basis, but such approaches are limited to low-dimensional settings. [21] instead propose DeepIV, which estimates the conditional density P(X = x | Z) by flexible neural-network-parametrized Gaussian mixtures. This may be limited in settings with high-dimensional X and can suffer from the non-orthogonality of MLE under any misspecification, known as the "forbidden regression" issue [3, §4.6.1] (see Section 5 for discussion).\n\nMoment methods. The generalized method of moments (GMM) instead leverages the moment conditions satisfied by θ0. Given functions f1, . . . , fm, Eq. 
(2) implies E[fj(Z)ε] = 0, giving us\n\nψ(f1; θ0) = ··· = ψ(fm; θ0) = 0, where ψ(f; θ) = E[f(Z)(Y − g(X; θ))].  (4)\n\nA usual assumption when using GMM is that the m moment conditions in Eq. (4) are sufficient to uniquely pin down (identify) θ0.² To estimate θ0, GMM considers these moments' empirical counterparts, ψ_n(f; θ) = (1/n) Σᵢ₌₁ⁿ f(Zᵢ)(Yᵢ − g(Xᵢ; θ)), and seeks to make all of them small simultaneously, as measured by their Euclidean norm ‖v‖² = vᵀv:\n\nθ̂_GMM ∈ argmin_{θ∈Θ} ‖(ψ_n(f1; θ), . . . , ψ_n(fm; θ))‖².  (5)\n\nOther vector norms are possible. [28] propose using ‖v‖_∞ and solving the optimization with no-regret learning along with an intermittent jitter to the moment conditions in a framework they call AGMM (see Section 5 for discussion).\n\nHowever, when there are many moments (many fj), using any unweighted vector norm can lead to significant inefficiencies, as we may be wasting modeling resources to make less relevant or duplicate moment conditions small. The optimal combination of moment conditions, yielding minimal-variance estimates, is in fact given by weighting them by their inverse covariance, and it suffices to estimate this covariance consistently. In particular, a celebrated result [17] shows that (with finitely many moments) using the following norm in Eq. (5) will yield minimal asymptotic variance (efficiency) for any consistent estimate θ̃ of θ0:\n\n‖v‖²_θ̃ = vᵀ C_θ̃⁻¹ v, where [C_θ]_jk = (1/n) Σᵢ₌₁ⁿ fj(Zᵢ)fk(Zᵢ)(Yᵢ − g(Xᵢ; θ))².  (6)\n\nExamples of this are the two-step, iterative, and continuously updating GMM estimators [20]. We generically refer to the GMM estimator given in Eq. (5) using the norm given in Eq. (6) as optimally-weighted GMM (OWGMM), or θ̂_OWGMM.\n\nFailure of (OW)GMM with Many Moment Conditions. 
When g(x; θ) is a flexible model such as a high-capacity neural network, many – possibly infinitely many – moment conditions may be needed to identify θ0. However, GMM and OWGMM algorithms fail when we use too many moment conditions. On the one hand, one-step GMM (i.e., Eq. (5) with ‖v‖ = ‖v‖_p, p ∈ [1, ∞]) is saddled with the inefficiency of trying to impossibly control many equally-weighted moments: at the extreme, if we let f1, . . . be all functions of Z with unit square integral, one-step GMM is simply equivalent to the non-causal least-squares regression of Y on X. We discuss this further in Appendix C. On the other hand, we also cannot hope to learn the optimal weighting: the matrix C_θ̃ in Eq. (6) will necessarily be singular, and using its pseudoinverse would mean deleting all but n moment conditions. Therefore, we cannot simply use infinite or even too many moment conditions in GMM or OWGMM.\n\n²This assumption that a finite number of moment conditions uniquely identifies θ is perhaps too strong when θ is very complex, although if true it readily yields statistically efficient methods for estimating θ. However, assuming this is difficult to avoid in practice.\n\n3 Methodology\n\nWe next present our approach. We start by motivating it using a new reformulation of OWGMM.\n\n3.1 Reformulating OWGMM\n\nLet us start by reinterpreting OWGMM. Consider the vector space V of real-valued functions f of Z under the usual operations. Note that, for any θ, ψ_n(f; θ) is a linear operator on V and\n\nC_θ(f, h) = (1/n) Σᵢ₌₁ⁿ f(Zᵢ)h(Zᵢ)(Yᵢ − g(Xᵢ; θ))²\n\nis a bilinear form on V. Now, given any subset F ⊆ V, consider the following objective function:\n\nΨ_n(θ; F, θ̃) = sup_{f∈F} ψ_n(f; θ) − (1/4) C_θ̃(f, f).  (7)\n\nLemma 1. Let ‖v‖_θ̃ be the optimally-weighted norm as in Eq. (6) and let F = span(f1, . . . 
, fm). Then\n\n‖(ψ_n(f1; θ), . . . , ψ_n(fm; θ))‖²_θ̃ = Ψ_n(θ; F, θ̃).\n\nCorollary 1. An equivalent formulation of OWGMM is\n\nθ̂_OWGMM ∈ argmin_{θ∈Θ} Ψ_n(θ; F, θ̃).  (8)\n\nIn other words, Lemma 1 provides a variational formulation of the objective function of OWGMM and Corollary 1 provides a saddle-point formulation of the OWGMM estimate.\n\n3.2 DeepGMM\n\nIn this section, we outline the details of our DeepGMM framework. Given our reformulation above in Eq. (8), our approach is to simply replace the set F with a more flexible set of functions. Namely, we let F = {f(z; τ) : τ ∈ T} be the class of all neural networks of a given architecture with varying weights τ (but not their span). Using a rich class of moment conditions allows us to learn a correspondingly rich g0. We therefore similarly let G = {g(x; θ) : θ ∈ Θ} be the class of all neural networks of a given architecture with varying weights θ.\n\nGiven these choices, we let θ̂_DeepGMM be the minimizer in Θ of Ψ_n(θ; F, θ̃) for any, potentially data-driven, choice θ̃. We discuss choosing θ̃ in Section 4. Since this is no longer closed form, we formulate our algorithm in terms of solving a smooth zero-sum game. Formally, our estimator is defined as:\n\nθ̂_DeepGMM ∈ argmin_{θ∈Θ} sup_{τ∈T} U_θ̃(θ, τ),  (9)\n\nwhere U_θ̃(θ, τ) = (1/n) Σᵢ₌₁ⁿ f(Zᵢ; τ)(Yᵢ − g(Xᵢ; θ)) − (1/(4n)) Σᵢ₌₁ⁿ f²(Zᵢ; τ)(Yᵢ − g(Xᵢ; θ̃))².\n\nSince evaluation is linear, for any θ̃, the game's payoff function U_θ̃(θ, τ) is convex-concave in the functions g(·; θ) and f(·; τ), although it may not be convex-concave in θ and τ, as is usually the case when we parametrize functions using neural networks. Solving Eq. 
(9) can be done with any of a variety of smooth game playing algorithms; we discuss our choice in Section 4. We note that AGMM [28] also formulates IV estimation as a smooth game objective, but without the last regularization term and with the adversary parametrized as a mixture over a finite fixed set of critic functions.³ In our experiments, we found the regularization term to be crucial for solving the game, and we found the use of a flexible neural network critic to be crucial with high-dimensional instruments.\n\n³In their code they also include a jitter step where these critic functions are updated; however, this step is heuristic and is not considered in their theoretical analysis.\n\nNotably, our approach has very few tuning parameters: only the models F and G (i.e., the neural network architectures) and whatever parameters the optimization method uses. In Section 4 we discuss how to select these.\n\nFinally, we highlight that unlike the case for OWGMM as in Lemma 1, our choice of F is not a linear subspace of V. Indeed, per Lemma 1, replacing F with a high- or infinite-dimensional linear subspace simply corresponds to GMM with many or infinite moments, which fails as discussed in Section 2.1 (in particular, we would generically have Ψ_n(θ; F, θ̃) = ∞, unhelpfully). Similarly, enumerating many moment conditions as generated by, say, many neural networks f and plugging these into GMM, whether one-step or optimally weighted, will fail for the same reasons. Instead, our approach is to leverage our variational reformulation in Lemma 1 and replace the class of functions F with a rich (non-subspace) set in this new formulation, which is distinct from GMM and avoids these issues. In particular, as long as F has bounded complexity, even if its ambient dimension may be infinite, we can guarantee the consistency of our approach. 
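Concretely, the empirical payoff U_θ̃(θ, τ) of Eq. (9) is just two sample averages. The numpy sketch below is our own illustration (not the paper's PyTorch implementation); the callables g and f stand in for the two networks, and the data follows the toy design of Figure 1. At g = g0 the moment term is near zero in expectation, so the nonpositive regularization term dominates for any fixed critic:

```python
import numpy as np

def payoff(g, f, g_tilde, X, Z, Y):
    # U_{theta~}(theta, tau) from Eq. (9): the moment term minus the
    # critic's inverse-covariance penalty, both as sample averages.
    moment = np.mean(f(Z) * (Y - g(X)))
    penalty = np.mean(f(Z) ** 2 * (Y - g_tilde(X)) ** 2) / 4.0
    return moment - penalty

# Toy design of Figure 1 with g0(x) = max(x/5, x).
rng = np.random.default_rng(0)
Z = rng.standard_normal(5000)
eps = rng.standard_normal(5000)
X = Z + 2 * eps
g0 = lambda x: np.maximum(x / 5.0, x)
Y = g0(X) - 2 * eps + rng.standard_normal(5000)

# At g = g0, a fixed critic (here sin) earns a negative payoff on average.
print(payoff(g0, np.sin, g0, X, Z, Y))
```

The prediction player minimizes this payoff over g while the critic maximizes it over f, which is the zero-sum game solved by the game-playing algorithms discussed in Section 4.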
Since the last layer in a network is a linear combination of the penultimate one, our choice of F can in some sense be thought of as a union over neural network weights of subspaces spanned by the penultimate layer of nodes.\n\n3.3 Consistency\n\nBefore discussing practical considerations in implementing DeepGMM, we first turn to the theoretical question of what consistency guarantees we can provide about our method if we were to approximately solve Eq. (9). We phrase our results for generic bounded-complexity function classes F, G, not necessarily neural networks.\n\nOur main result depends on the following assumptions, which we discuss after stating the result.\n\nAssumption 1 (Identification). θ0 is the unique θ ∈ Θ satisfying ψ(f; θ) = 0 for all f ∈ F.\n\nAssumption 2 (Bounded complexity). F and G have vanishing Rademacher complexities: for i.i.d. Rademacher signs ξ ∈ {−1, +1}ⁿ,\n\nE sup_{τ∈T} (1/n) Σᵢ₌₁ⁿ ξᵢ f(Zᵢ; τ) → 0,  E sup_{θ∈Θ} (1/n) Σᵢ₌₁ⁿ ξᵢ g(Xᵢ; θ) → 0.\n\nAssumption 3 (Absolutely star shaped). For every f ∈ F and |λ| ≤ 1, we have λf ∈ F.\n\nAssumption 4 (Continuity). For any x, g(x; θ), f(x; τ) are continuous in θ, τ, respectively.\n\nAssumption 5 (Boundedness). Y, sup_{θ∈Θ} |g(X; θ)|, sup_{τ∈T} |f(Z; τ)| are all bounded random variables.\n\nTheorem 2. Suppose Assumptions 1 to 5 hold. Let θ̃_n be any data-dependent sequence with a limit in probability. Let θ̂_n, τ̂_n be any approximate equilibrium in the game Eq. (9), i.e.,\n\nsup_{τ∈T} U_θ̃ₙ(θ̂_n, τ) − o_p(1) ≤ U_θ̃ₙ(θ̂_n, τ̂_n) ≤ inf_{θ∈Θ} U_θ̃ₙ(θ, τ̂_n) + o_p(1).\n\nThen θ̂_n → θ0 in probability.\n\nTheorem 2 proves that approximately solving Eq. 
(9) (with eventually vanishing approximation error) guarantees the consistency of our method. We next discuss the assumptions we made.\n\nAssumption 1 stipulates that the moment conditions given by F are sufficient to identify θ0. Note that, by linearity, the moment conditions given by F are the same as those given by the subspace span(F), so we are actually successfully controlling many or infinite moment conditions, perhaps making the assumption defensible. If we do not assume Assumption 1, the arguments in Theorem 2 easily extend to showing instead that we approach some identified θ that satisfies all moment conditions. In particular, this means that if we parametrize f and g via neural networks, where we can permute the parameter vector θ and obtain an identical function, our result still holds. We formalize this by the following alternative assumption and lemma.\n\nAssumption 6 (Identification of g). Let Θ0 = {θ ∈ Θ : ψ(f; θ) = 0 ∀f ∈ F}. Then for any θ1, θ2 ∈ Θ0 the functions g(·; θ1) and g(·; θ2) are identical.\n\nLemma 2. Suppose Assumptions 2 to 6 hold. Let θ̂_n, τ̂_n be any approximate equilibrium in the game Eq. (9), i.e.,\n\nsup_{τ∈T} U_θ̃ₙ(θ̂_n, τ) − o_p(1) ≤ U_θ̃ₙ(θ̂_n, τ̂_n) ≤ inf_{θ∈Θ} U_θ̃ₙ(θ, τ̂_n) + o_p(1).\n\nThen inf_{θ∈Θ0} ‖θ̂_n − θ‖ → 0 in probability.\n\nAssumption 2 provides that F and G, although potentially infinite and even of infinite ambient dimension, have limited complexity. Rademacher complexity is one way to measure function class complexity [5]. Given a bound (envelope) as in Assumption 5, this complexity can also be reduced to other combinatorial complexity measures such as VC- or pseudo-dimension via chaining [31]. 
[6] studied such combinatorial complexity measures for neural networks.\n\nAssumption 3 is needed to ensure that, for any θ with ψ(f; θ) > 0 for some f, there also exists an f such that ψ(f; θ) > (1/4) C_θ̃(f, f). It trivially holds for neural networks by considering their last layer. Assumption 4 similarly holds trivially and helps ensure that the moment conditions cannot simultaneously arbitrarily approach zero far from their true zero point at θ0. Assumption 5 is a purely technical assumption that can likely be relaxed to require only nice (sub-Gaussian) tail behavior. Its latter two requirements can nonetheless be guaranteed by either bounding weights (equivalently, using weight decay) or applying a bounded activation at the output. We do not find doing this necessary in practice.\n\n4 Practical Considerations in Implementing DeepGMM\n\nSolving the Smooth Zero-Sum Game. In order to solve Eq. (9), we turn to the literature on solving smooth games, which has grown significantly with the recent surge of interest in generative adversarial networks (GANs). In our experiments we use the OAdam algorithm of [15]. For our game objective, we found this algorithm to be more stable than standard alternating descent steps using SGD or Adam.\n\nUsing first-order iterative algorithms for solving Eq. (9) enables us to effectively handle very large datasets. In particular, we implement DeepGMM using PyTorch, which efficiently provides gradients for use in our descent algorithms [30]. As we see in Section 5, this allows us to handle very large datasets with high-dimensional features and instruments where other methods fail.\n\nChoosing θ̃. In Eq. (9), we let θ̃ be any potentially data-driven choice. Since the hope is that θ̃ ≈ θ0, one possible choice is just the solution θ̂_DeepGMM for another choice of θ̃. We can recurse this many times over. 
In practice, to simulate many such iterations on θ̃, we continually update θ̃ as the previous θ iterate over the steps of our game-playing algorithm. Note that θ̃ is nonetheless treated as "constant" and does not enter into the gradient of θ. That is, the second term of U in Eq. (9) has zero partial derivative in θ. Given this approach, we can interpret θ̃ in the premise of Theorem 2 as the final θ̃ at convergence, since Theorem 2 allows θ̃ to be fully data-driven.\n\nHyperparameter Optimization. The only parameters of our algorithm are the neural network architectures for F and G and the optimization algorithm parameters (e.g., learning rate). To tune these parameters, we suggest the following general approach. We form a validation surrogate Ψ̂_n for our variational objective in Eq. (7) by instead taking averages on a validation data set and by replacing F with the pool of all iterates f encountered in the learning algorithm for all hyperparameter choices. We then choose the parameters that maximize this validation surrogate Ψ̂_n. We discuss this process in more detail in Appendix B.1.\n\nEarly Stopping. We further suggest using Ψ̂_n to facilitate early stopping for the learning algorithm. Specifically, we periodically evaluate our iterate θ using Ψ̂_n and return the best evaluated iterate.\n\n5 Experiments\n\nIn this section, we compare DeepGMM against a wide set of baselines for IV estimation. Our implementation of DeepGMM is publicly available at https://github.com/CausalML/DeepGMM.\n\nWe evaluate the various methods on two groups of scenarios: one where X, Z are both low-dimensional and one where X, Z, or both are high-dimensional images. In the high-dimensional scenarios, we use a convolutional architecture in all methods that employ a neural network to accommodate the images. 
We evaluate the performance of an estimated ĝ by MSE against the true g0.\n\nFigure 2: Low-dimensional scenarios (Section 5.1); rows correspond to the sin, step, abs, and linear scenarios. Estimated ĝ in blue; true response g0 in orange.\n\nTable 1: Low-dimensional scenarios: Test MSE averaged across ten runs with standard errors.\n\nScenario | DirectNN | Vanilla2SLS | Poly2SLS | GMM+NN | AGMM | DeepIV | Our Method\nsin | .26 ± .00 | .09 ± .00 | .04 ± .00 | .08 ± .00 | .11 ± .01 | .06 ± .00 | .02 ± .00\nstep | .21 ± .00 | .03 ± .00 | .03 ± .00 | .06 ± .00 | .06 ± .01 | .03 ± .00 | .01 ± .00\nabs | .21 ± .00 | .23 ± .00 | .04 ± .00 | .14 ± .02 | .17 ± .03 | .10 ± .00 | .03 ± .01\nlinear | .09 ± .00 | .00 ± .00 | .00 ± .00 | .06 ± .01 | .03 ± .00 | .04 ± .00 | .01 ± .00\n\nTable 2: High-dimensional scenarios: Test MSE averaged across ten runs with standard errors.\n\nScenario | DirectNN | Vanilla2SLS | Ridge2SLS | GMM+NN | AGMM | DeepIV | Our Method\nMNISTz | .27 ± .01 | .23 ± .00 | .23 ± .00 | .25 ± .02 | – | .11 ± .00 | .07 ± .02\nMNISTx | .19 ± .00 | > 1000 | .19 ± .00 | .28 ± .03 | – | – | .15 ± .02\nMNISTx,z | .25 ± .01 | > 1000 | .39 ± .00 | .24 ± .01 | – | – | .14 ± .02\n\nMore specifically, we use the following baselines:\n\n1. DirectNN: Predicts Y from X using a neural network with standard least squares loss.\n\n2. Vanilla2SLS: Standard two-stage least squares on raw X, Z.\n\n3. Poly2SLS: Both X and Z are expanded via polynomial features, and then 2SLS is done via ridge regressions at each stage. The regularization parameters as well as the polynomial degrees are picked via cross-validation at each stage.\n\n4. GMM+NN: Here, we combine OWGMM with a neural network g(x; θ). We solve Eq. (5) over network weights θ using Adam. We employ optimal weighting, Eq. 
(6), by iterated GMM [20]. We are not aware of any prior work that uses OWGMM to train neural networks.\n\n5. AGMM [28]: Uses the publicly available implementation⁴ of the Adversarial Generalized Method of Moments, which performs no-regret learning on the one-step GMM objective Eq. (5) with norm ‖·‖_∞ and an additional jitter step on the moment conditions after each epoch.\n\n6. DeepIV [21]: We use the latest implementation that was released as part of the econML package.⁵\n\nNote that GMM+NN relies on being provided moment conditions. When Z is low-dimensional, we follow AGMM [28] and expand Z via RBF kernels around 10 centroids returned from a Gaussian Mixture model applied to the Z data. When Z is high-dimensional, we use the moment conditions given by each of its components.⁶\n\n5.1 Low-dimensional scenarios\n\nIn this first group of scenarios, we study the case when both the instrument and the treatment are low-dimensional. Similar to [28], we generated data via the following process:\n\nY = g0(X) + e + δ,  X = Z1 + e + γ,  Z ∼ Uniform([−3, 3]²),  e ∼ N(0, 1),  γ, δ ∼ N(0, 0.1).\n\nIn other words, only the first instrument has an effect on X, and e is the confounder breaking the independence of X and the residual Y − g0(X). We keep this data-generating process fixed, but vary the true response function g0 between the following cases:\n\nsin: g0(x) = sin(x)  step: g0(x) = 1{x ≥ 0}  abs: g0(x) = |x|  linear: g0(x) = x\n\nWe sample n = 2000 points for the train, validation, and test sets each. To avoid numerical issues, we standardize the observed Y values by removing the mean and scaling to unit variance. Hyperparameters used for our method in these scenarios are described in Appendix B.2. We plot the results in Fig. 2. The left column shows the sampled Y plotted against X, with the true g0 in orange. The other columns show in blue the estimated ĝ using various methods. 
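For reference, the design above is straightforward to simulate; the following numpy sketch is our own (not the released benchmark code) and generates one such dataset for a given response function:

```python
import numpy as np

def make_low_dim_scenario(g0, n=2000, seed=0):
    # Only Z[:, 0] affects X; e confounds X and Y; gamma, delta are
    # small independent noise terms, matching the stated design.
    rng = np.random.default_rng(seed)
    Z = rng.uniform(-3, 3, size=(n, 2))
    e = rng.standard_normal(n)
    gamma = rng.normal(0, 0.1, size=n)
    delta = rng.normal(0, 0.1, size=n)
    X = Z[:, 0] + e + gamma
    Y = g0(X) + e + delta
    return Z, X, Y

Z, X, Y = make_low_dim_scenario(np.abs)
# The residual e + delta is correlated with X, so a direct regression
# of Y on X is biased and the IV information in Z is needed.
print(np.corrcoef(Y - np.abs(X), X)[0, 1])
```

Note the second instrument Z[:, 1] is deliberately irrelevant, so methods must also cope with an uninformative moment direction.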
Table 1 shows the corresponding MSE over the test set.\n\nFirst, we note that in each case there is sufficient confounding that the DirectNN regression fails badly, and a method that can use the IV information to remove confounding is necessary.\n\nOur next substantive observation is that our method performs competitively across scenarios, attaining the lowest MSE in each (except linear, where we are beaten just slightly, and only by methods that use a linear model). At the same time, other methods employing neural networks perform well in some scenarios and less well in others. Therefore, we conclude that in the low-dimensional setting, our method is able to adapt to the scenario and compete with the best tuned methods for that scenario.\n\nOverall, we also found that GMM+NN performed well (but not as well as our method). In some sense GMM+NN is a novel method; we are not aware of previous work using (OW)GMM to train a neural network. Whereas GMM+NN needs to be provided moment conditions, our method can be understood as improving further on this by learning the best moment condition over a large class using optimal weighting. AGMM performed similarly well to GMM+NN, which uses the same moment conditions. Aside from the heuristic jitter step implemented in the AGMM code, it is equivalent to one-step GMM, Eq. (5), with the ‖·‖_∞ vector norm in place of the standard ‖·‖₂ norm. Its worse performance than our method can perhaps also be explained by this change and by its lack of optimal weighting.\n\nIn the experiments, the other NN-based method, DeepIV, was consistently outperformed by Poly2SLS across scenarios. This may be related to the computational difficulty of its two-stage procedure, or possibly due to sensitivity of the second stage to errors in the density fitting in the first stage. 
Notably, this is despite the fact that the neural-network-parametrized Gaussian mixture model fit in the first stage is correctly specified, so DeepIV's poorer performance cannot be attributed to the infamous "forbidden regression" issue. Therefore, we might expect that, in more complex scenarios where the first stage is not well specified, DeepIV could be at even more of a disadvantage. In the next section, we also discuss its limitations with high-dimensional X.\n\n5.2 High-dimensional scenarios\n\nWe now move on to scenarios based on the MNIST dataset [26] in order to test our method's ability to deal with structured, high-dimensional X and Z variables. For this group of scenarios, we use the same data-generating process as in Section 5.1 and fix the response function g0 to be abs, but map Z, X, or both X and Z to MNIST images. Let the output of Section 5.1 be X^low, Z^low, let π(x) = round(min(max(1.5x + 5, 0), 9)) be a transformation function that maps inputs to an integer between 0 and 9, and let RandomImage(d) be a function that selects a random MNIST image from the digit class d. The images are 28 × 28 = 784-dimensional. The scenarios are then given as:\n\n⁴https://github.com/vsyrgkanis/adversarial_gmm\n⁵https://github.com/microsoft/EconML\n⁶That is, we use fi(Z) = Zi for i = 1, . . . , dim(Z).\n\n• MNISTz: X ← X^low, Z ← RandomImage(π(Z^low_1)).\n• MNISTx: X ← RandomImage(π(X^low)), Z ← Z^low.\n• MNISTx,z: X ← RandomImage(π(X^low)), Z ← RandomImage(π(Z^low_1)).\n\nWe sampled 20000 points for the training, validation, and test sets and ran each method 10 times with different random seeds. Hyperparameters used for our method in these scenarios are described in Appendix B.2. We report the averaged MSEs in Table 2. We failed to run the AGMM code on any of these scenarios, as it crashed and returned overflow errors. 
The DeepIV code likewise produced NaN outputs in any scenario with a high-dimensional X. Furthermore, because of the size of the examples, we were also unable to run Poly2SLS. Instead, we present Vanilla2SLS and Ridge2SLS, where the latter is Poly2SLS with the degree fixed to be linear. Vanilla2SLS failed to produce reasonable numbers for high-dimensional X because the first-stage regression is ill-posed.

Again, we found that our method performed competitively across scenarios, achieving the lowest MSE in each. In the MNISTZ setting, our method had better MSE than DeepIV. In the MNISTX and MNISTX,Z scenarios, it handily outperformed all other methods. Even if DeepIV had run on these scenarios, it would be at a great disadvantage since it models the conditional distribution over images using a Gaussian mixture. This could perhaps be improved using richer conditional density models like [12, 22], but the forbidden regression issue would remain nonetheless. Overall, these results highlight our method's ability to adapt not only to each low-dimensional scenario but also to high-dimensional scenarios, whether the features, the instrument, or both are high-dimensional, where other methods break. Beyond its competitive performance, our algorithm was tractable and able to run on these large-scale examples where other algorithms broke down computationally.

6 Conclusions

Other related literature and future work. We believe that our approach can also benefit other applications where moment-based models and GMM are used [7, 18, 19]. Moreover, notice that while DeepGMM is related to GANs [16], the adversarial game that we play is structurally quite different. In some sense, the linear part of our payoff function is similar to that of the Wasserstein GAN [4]; therefore our optimization problem might benefit from approaches to approximating the sup player similar to those employed by WGANs.
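To illustrate the flavor of smooth-game optimization involved, here is a minimal sketch of simultaneous gradient descent-ascent on a toy zero-sum objective, U(θ, w) = θw − w²/2, which is linear in the inf player's coupling and strongly concave in the sup player. This is a toy stand-in for intuition only, not the DeepGMM payoff itself.

```python
# Simultaneous gradient descent-ascent on the toy smooth zero-sum game
#   min_theta max_w  U(theta, w) = theta * w - 0.5 * w**2.
# The -0.5*w**2 term makes the sup player's problem strongly concave,
# which damps the rotational dynamics that pure bilinear games exhibit.
def descent_ascent(theta, w, lr=0.1, steps=500):
    for _ in range(steps):
        grad_theta = w        # dU/dtheta: inf player's gradient
        grad_w = theta - w    # dU/dw: sup player's gradient
        # Simultaneous update: both players move using current values.
        theta, w = theta - lr * grad_theta, w + lr * grad_w
    return theta, w

# The unique equilibrium of this game is (theta, w) = (0, 0);
# the coupled updates spiral inward toward it.
theta, w = descent_ascent(2.0, 1.0)
```

In a purely bilinear game (no concave term) the same simultaneous updates would spiral outward, which is one reason optimistic variants such as OAdam [15] are used in practice.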
Another related line of work is methods for learning conditional moment models, either in the context of IV regression or more generally, that are statistically efficient [1, 8–11]. This line of work differs in focus from ours: it concentrates on statistical efficiency, whereas we focus on leveraging work on deep learning and smooth-game optimization to deal with complex, high-dimensional instruments and/or treatments. However, an important direction for future work would be to investigate the possible efficiency of DeepGMM or of efficient modifications thereof. Finally, there has been some prior work connecting GANs and GMM in the context of image generation [32], so another potential avenue of work would be to leverage some of the methodology developed there for our problem of IV regression.

Conclusions. We presented DeepGMM as a way to handle IV analysis with high-dimensional variables and complex relationships. The method is based on a new variational reformulation of GMM with optimal weights, aimed at handling many moments, and is formulated as the solution to a smooth zero-sum game. Our empirical experiments showed that the method is able to adapt to a variety of scenarios, competing with the best-tuned methods in low-dimensional settings and performing well in high-dimensional settings where even recent methods break.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 1846210.

References

[1] C. Ai and X. Chen. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795–1843, 2003.

[2] J. D. Angrist and A. B. Krueger. Instrumental variables and the search for identification: From supply and demand to natural experiments. Journal of Economic Perspectives, 15(4):69–85, 2001.

[3] J. D. Angrist and J.-S. Pischke.
Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, 2008.

[4] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[5] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

[6] P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019.

[7] S. Berry, J. Levinsohn, and A. Pakes. Automobile prices in market equilibrium. Econometrica, pages 841–890, 1995.

[8] R. Blundell, X. Chen, and D. Kristensen. Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica, 75(6):1613–1669, 2007.

[9] G. Chamberlain. Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics, 34(3):305–334, 1987.

[10] X. Chen and T. M. Christensen. Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric IV regression. Quantitative Economics, 9(1):39–84, 2018.

[11] X. Chen and D. Pouzo. Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica, 80(1):277–321, 2012.

[12] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

[13] J. A. Cole, H. Norman, L. B. Weatherby, and A. M. Walker. Drug copayment and adherence in chronic heart failure: Effect on cost and outcomes. Pharmacotherapy: The Journal of Human Pharmacology and Drug Therapy, 26(8):1157–1164, 2006.

[14] S. Darolles, Y. Fan, J.-P. Florens, and E. Renault.
Nonparametric instrumental regression. Econometrica, 79(5):1541–1565, 2011.

[15] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. arXiv preprint arXiv:1711.00141, 2017.

[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, pages 2672–2680, 2014.

[17] L. P. Hansen. Large sample properties of generalized method of moments estimators. Econometrica, pages 1029–1054, 1982.

[18] L. P. Hansen and T. J. Sargent. Formulating and estimating dynamic linear rational expectations models. Journal of Economic Dynamics and Control, 2:7–46, 1980.

[19] L. P. Hansen and K. J. Singleton. Generalized instrumental variables estimation of nonlinear rational expectations models. Econometrica, pages 1269–1286, 1982.

[20] L. P. Hansen, J. Heaton, and A. Yaron. Finite-sample properties of some alternative GMM estimators. Journal of Business & Economic Statistics, 14(3):262–280, 1996.

[21] J. Hartford, G. Lewis, K. Leyton-Brown, and M. Taddy. Deep IV: A flexible approach for counterfactual prediction. In Proceedings of the 34th International Conference on Machine Learning, pages 1414–1423. JMLR.org, 2017.

[22] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.

[23] F. D. Johansson, N. Kallus, U. Shalit, and D. Sontag. Learning weighted representations for generalization across designs. arXiv preprint arXiv:1802.08598, 2018.

[24] N. Kallus. Generalized optimal matching methods for causal inference. arXiv preprint arXiv:1612.08321, 2016.

[25] N. Kallus. DeepMatch: Balancing deep covariate representations for causal inference using adversarial training.
arXiv preprint arXiv:1802.05664, 2018.

[26] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[27] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.

[28] G. Lewis and V. Syrgkanis. Adversarial generalized method of moments. arXiv preprint arXiv:1803.07164, 2018.

[29] W. K. Newey and J. L. Powell. Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565–1578, 2003.

[30] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.

[31] D. Pollard. Empirical Processes: Theory and Applications. In NSF-CBMS Regional Conference Series in Probability and Statistics, pages i–86. JSTOR, 1990.

[32] S. Ravuri, S. Mohamed, M. Rosca, and O. Vinyals. Learning implicit generative models with the method of learned moments. arXiv preprint arXiv:1806.11006, 2018.

[33] D. B. Rubin. Matching to remove bias in observational studies. Biometrics, pages 159–183, 1973.

[34] U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: Generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning, pages 3076–3085, 2017.