NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID 1931: Deep Generalized Method of Moments for Instrumental Variable Analysis

### Reviewer 2

Originality: The theoretical contributions are novel. In spite of the similarity of their approach to AGMM, the formulation of the problem is based on a different objective function that allows optimal reweighting, instead of the unweighted moment conditions.

Quality: The algorithmic, theoretical, and empirical contributions are sound. Though the evaluation is not extensive, it is convincing.

Clarity: The clarity and organization of the paper can be improved:

1) It might be helpful to use different notations for the unweighted and weighted norms. In particular, Lemma 1 would read better without having to refer back to the definition of the weighted norm.
2) Line 102, about the equivalence to non-causal linear regression, requires justification or a reference.
3) Line 79 is not meaningful as stated. There ought to be some relationship between $m$ and the complexity of $\theta$.
4) Theorem 2 says that $\tilde{\theta}_n$ has a limit. Can this be any limit, or must it be $\theta_0$? If it can be arbitrary, that is quite counterintuitive and requires further explanation.
5) The AGMM paper also suggests a way of learning the moment functions via deep networks. What precisely distinguishes DeepGMM needs to be emphasized.
6) Line 190 says that $\tilde{\theta}$ does not enter the gradient with respect to $\theta$. Wouldn't that mean that the optimum over $\theta$ does not depend on $\tilde{\theta}$? Perhaps you just mean that $\tilde{\theta}$ should be treated as a constant when taking the gradient w.r.t. $\theta$; it can still be part of the gradient.
7) Line 224: please elaborate on "When Z is high-dimensional, we use the moment conditions given by each of its components".
8) I could not understand the data for the high-dimensional case. It seems that $X$ is sometimes an image and sometimes a number. Furthermore, defining $g_0$ as abs would mean taking the absolute value of an image. Is that meaningful?

Significance: The paper is an important contribution to the field of causality research, and is likely to be used considering the performance of the algorithm.

---

Post-rebuttal comments: The authors responded adequately to most of my concerns, but they did not clarify comment 8 in my review. Furthermore, I agree with the issues pointed out by the other reviewers regarding the experimental section. I have lowered my score to reflect that.
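To spell out the contrast drawn under Originality (in my own notation, which may differ from the paper's): with empirical moments stacked over test functions $f_1, \dots, f_m$,

$$
g_n(\theta) = \left( \frac{1}{n} \sum_{i=1}^n f_j(z_i)\,\big(y_i - g_\theta(x_i)\big) \right)_{j=1}^{m},
$$

unweighted GMM (as in AGMM) minimizes $\lVert g_n(\theta) \rVert_2^2$, while optimally-weighted GMM minimizes $g_n(\theta)^\top \hat{C}^{-1} g_n(\theta)$, where $\hat{C}$ is an estimate of the covariance of the moment vector.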

### Reviewer 3

I found the paper interesting. In particular, I liked the variational formulation of optimally-weighted generalized method of moments. This formulation is likely useful when the number of moments is large and inverting the covariance matrix is computationally difficult. I also commend the attempt at proving consistency when both the causal response function g and the moment function f are parameterized by neural networks. The empirical results seem promising compared to alternatives, particularly in the high-dimensional case.

What would make this paper stronger is addressing some gaps between Lemma 1 and the proposed DeepGMM. First, the identification assumption is not true for neural network g's (e.g., a permutation of hidden units yields the same output). I know that identification is a standard assumption in the GMM literature, and it is not valid here, but discussing a bit more the richness of $\mathcal{F}$ needed to obtain the correct $\theta$ would be helpful (maybe as part of lines 158-163).

Perhaps a larger issue than the identification condition is understanding whether the $-\frac{1}{4} C(f, f)$ term is needed in the neural-network-$f$ case. From my reading of the proof of Theorem 2, this term is not needed to prove consistency, since by Assumptions 2 and 5 the f's are already well behaved. I would assume that one might obtain similar experimental results with other controls on f (perhaps spectral normalization or gradient penalties), since $-\frac{1}{4} C(f, f)$ is only well motivated in the "finite" f case. (This is somewhat of a minor comment, as there is some motivation from optimal weighting, but I'm not sure the optimal-weighting perspective makes sense in the infinite-f case.)

The experimental results look promising, but I think some key details are missing. First, it is worth stating what the architectures for g and f were. Depending on the architecture for g, it could be that the moment conditions for the baselines in Sections 5.1 and 5.2 were not sufficiently "rich" to recover $\theta$ (I'm thinking mostly of Poly2SLS/Ridge2SLS here). Second, it would be nice to know how many hyperparameters (as defined in B.1) were used to train the network. Third, it would be nice to know other details, such as the learning rate, the number of training steps, and the batch size.

Other minor comments:

1) It would be good to cite the paper "Kernel Instrumental Variable Regression" (https://arxiv.org/abs/1906.00232). (This came out at ICML, so I'm not sure you were aware of the work during submission.)
2) The paper was on the whole well written, but I found a couple of typos:
   - line 144: "implementating" -> "implementing"
   - line 289: double citation of reference 15
   - line 377: I think the equation should be "$-\frac{1}{4} v' C v$" and not "$+\frac{1}{4} v' C v$"
   - lines 381/382: I think you're missing some close parentheses here
3) On related and future work, there is some work on learning implicit generative models using GMM-like techniques (e.g., "Learning Implicit Generative Models with the Method of Learned Moments", https://arxiv.org/abs/1806.11006), and it would be interesting to see a variant of the variational formulation used here.

I hope the authors address the above comments, because I think the paper is promising and has some nice ideas.
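On the sign comment about line 377: the sign matters because, for a positive definite $C$, the identity $v^\top C^{-1} v = \sup_u \big( u^\top v - \tfrac{1}{4} u^\top C u \big)$ holds, attained at $u^* = 2 C^{-1} v$, whereas with $+\tfrac{1}{4}$ the supremum is infinite. A quick numerical sanity check of this identity (my own sketch, not the authors' code; variable names are mine):

```python
import numpy as np

# Check the variational identity behind the optimally-weighted GMM
# reformulation (my notation):
#   v' C^{-1} v = sup_u ( u'v - 1/4 u'Cu ),  attained at u* = 2 C^{-1} v.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
C = A @ A.T + 4.0 * np.eye(4)          # symmetric positive definite
v = rng.standard_normal(4)

lhs = v @ np.linalg.solve(C, v)         # v' C^{-1} v, without forming C^{-1}
u_star = 2.0 * np.linalg.solve(C, v)    # claimed maximizer
opt = u_star @ v - 0.25 * u_star @ C @ u_star
assert np.isclose(lhs, opt)

# No random u should beat the claimed maximizer.
for _ in range(1000):
    u = rng.standard_normal(4)
    assert u @ v - 0.25 * u @ C @ u <= opt + 1e-9
```

Note that the identity is exactly why the variational form avoids inverting the covariance matrix: the inverse appears only implicitly through the supremum over u.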