NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:1842

Reviewer 1

Overall opinion: This is a good paper. The motivation is clear and the method is simple. The paper is meaningful in that it provides a new method to parameterize and learn a curvature in weight space. The analysis makes connections to second-order methods and also shows how the learned curvature operates in terms of previous gradients. Experiments show the advantages of the method.

+ Clearly written. I especially appreciated the crisp introduction to tensor algebra (section 2.1).
+ Analysis shows connections to the Fisher information matrix, and the decomposition of the altered gradient (eq 10) shows how this method implicitly “memorizes” previous train- and val- gradients.
+ Experiments show that meta-curvature outperforms MAML, meta-SGD, MAML++, and layerLR.
- WRN experiments (Table 4) show that meta-curvature is slightly outperformed by LEO, though I believe this can be overlooked as the LEO results involved heavy engineering.
- The paper did not cite and compare to a relevant previous work with a similar motivation: [1] also learns additional parameters for MAML to learn to alter the gradient, and the learned parameters correspond to a learned curvature for each layer.

Minor comments:
- line 75: If I understand correctly, it is n-mode unfolding, matrix multiplication, and then reverse(?) n-mode unfolding. Otherwise, the result would be a 2-order rather than an N-order tensor.
- section 3.2.2: I personally found this, along with figure 1, harder to follow than section 3.2.1.

[1] Gradient-Based Meta-Learning with Learned Layerwise Metric and Subspace, ICML 2018

------------------------ post-rebuttal ----------------------------
The authors have addressed all of my concerns, and the additional experiments highlight the advantage of meta-curvature even more. I maintain my score.
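To illustrate the minor comment on line 75, here is a minimal NumPy sketch of an n-mode product written as unfold, matrix multiply, then fold back. It is not taken from the paper: the helper names `unfold`, `fold` and `mode_n_product` are hypothetical, and the column ordering of the unfolding is one possible convention (chosen so that `fold` inverts `unfold`).

```python
import numpy as np

def unfold(tensor, mode):
    # Mode-n unfolding: bring axis `mode` to the front and flatten the rest,
    # giving a matrix of shape (I_mode, product of the other dimensions).
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def fold(matrix, mode, shape):
    # Inverse of `unfold` for a tensor whose full shape is `shape`.
    front = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(matrix.reshape(front), 0, mode)

def mode_n_product(tensor, matrix, mode):
    # Unfold, multiply along the chosen mode, then fold back so the result
    # keeps the same order (number of axes) as the input tensor.
    new_shape = list(tensor.shape)
    new_shape[mode] = matrix.shape[0]
    return fold(matrix @ unfold(tensor, mode), mode, new_shape)
```

Without the final `fold` step, the result of `matrix @ unfold(tensor, mode)` is only a matrix, which is the point the comment raises.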

Reviewer 2

This submission aims to meta-learn curvature estimates that lead to better generalization than the Hessian or the Fisher information matrix. The paper is well written. One concern: I could not find how equation (10) leads to better generalization. The second term in equation (10) is a weighted gradient on the validation dataset; the weights reflect the similarity between the gradients of the training loss and the test loss, but they do not link to generalization directly.

Reviewer 3

Originality: The paper is quite original; I am not aware of similar studies.

Clarity: Overall, the paper is well written, especially the most challenging sections 3.2 and 4. I have a number of targeted comments, though, that I think should be addressed before publication:

1. Line 34: “They compute local curvatures with training losses and move along the curvatures as far as possible.” I am not sure what the authors mean here; some clarification would be helpful.
2. Line 130: Why is the layer parametrised in this way? If it is a convolutional layer, I would expect a fourth dimension, the number of output channels. Do the authors also consider fully-connected layers? What kind of curvature parametrisation is supposed to be used in that case?
3. Can the authors provide more comments on the representational power of their _parametrizations_ (not gradients), e.g. with comparison to other tensor methods, especially Tensor-Train decomposition?
4. Does batch normalization affect the analysis in section 4 in any way, due to gradient sharing? Can we get rid of batch norm altogether, since the learned curvature might learn an even better normalisation scheme?
5. I was a bit confused by the “train”, “validation” and “test” sets used in section 4. In the standard MAML setup, is the meta-update computed on the “train” set and the initialisation updated based on the loss on the “validation” set (using the paper’s terminology)? If so, I am not sure that “validation” is the best term to use.

Quality: The proposed method seems to be solid. The experimental comparison is also broad enough to claim an empirical contribution.

Significance: The problem of generalisation in few-shot learning is highly topical, and the paper addresses it in a novel, interesting way. Learned curvatures might find applications in other problems too.
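For reference on comment 5, the reading of the standard MAML setup described there can be sketched as follows. This is a toy linear-regression example, not the paper's model and without meta-curvature; the function names and the step size `alpha` are assumptions made for illustration. The adaptation step uses only the task's “train” split, and the gradient with respect to the initialisation is taken from the loss on the “validation” split.

```python
import numpy as np

def loss(w, X, y):
    # Mean squared error with a 1/2 factor, so the gradient has no factor of 2.
    r = X @ w - y
    return 0.5 * np.mean(r ** 2)

def grad(w, X, y):
    # Gradient of `loss` with respect to w.
    return X.T @ (X @ w - y) / len(y)

def maml_meta_grad(w, X_tr, y_tr, X_val, y_val, alpha):
    # Inner step: adapt the initialisation w on the task's "train" split.
    w_adapted = w - alpha * grad(w, X_tr, y_tr)
    # Jacobian of the adapted weights with respect to w; for this quadratic
    # loss it is I - alpha * H_tr, with H_tr the (constant) train Hessian.
    H_tr = X_tr.T @ X_tr / len(y_tr)
    jac = np.eye(len(w)) - alpha * H_tr
    # Outer gradient: chain rule through the inner step, with the loss
    # evaluated on the task's "validation" split.
    return jac.T @ grad(w_adapted, X_val, y_val), w_adapted
```

The held-out meta-test tasks play no role in either step, which is why the term “validation” for the per-task query split can be confusing.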