NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:3181
Title:Adaptive Gradient-Based Meta-Learning Methods

Reviewer 1

I would like to thank the authors for very clearly written paper. I have a few small points which could potentially improve the write-up. Page 3, 1. Generality, “the regret of OGD on m_t convex G-Lipschitz losses has a well-known upper bound of” It may be useful to have the reference to this well-known upper bound.. “The niceness of \hat{R_t(x_t)} makes the analysis tractable” May be the authors mean here the case of the OGD? Would the form of \hat{R_t(x_t)} be nice in general? On the line 108 -> 109, it’s not clear how \hat{\bar{R_{T}} \leq o(T) + min_x 1/T sum_t \hat{R_t} In experimental section, authors test their optimization method against Reptile with Adam. I think another sensible baseline would be Reptile+Meta-SGD, which makes the learning rate learnable, thus making it a more fair comparison to ARUBA which uses the adaptive learning rate. For FedAvg case, it would be interesting to see whether the improvement comes from the meta-learning treatment of the federated learning problem (where we optimize for the initialization) or the ARUBA algorithm itself (e.g. the fact that the learning rate is adaptable).

Reviewer 2

The paper presents a new way to analyze multi-task learning algorithms in the online setting by analysing average regret-upper-bound. The authors introduce a two-level algorithm that runs a mirror descent algorithm for each task and adjusts the parameters of the mirror descent before each new task by running another online learning algorithm on the level of tasks. This task-level algorithm is designed to minimize theoretical regret bounds for each task. The authors prove a general bound on the performance of the presented algorithm and show its corollaries for settings with static and dynamic comparators. A version of the algorithm for different task similarity measures is also presented, as well as, an online-to-batch conversion result for learning-to-learn setting with i.i.d. tasks. In the end, an empirical evaluation is presented that shows an improved performance when ARUBA applied to few-shot classification and federated learning settings. Pros: * A new approach for multi-task learning by analysing average regret-upper-bound * A general performance bound (Theorem 1) that can be applied to a variety of settings * Interesting decomposition of (2) into two online learning problem that can be studied separately * Proof in the static environment case for non-convex Bregman divergencies * Interesting extension with inter-task geometry (Section 4) Cons: * Complicated and overloaded notations: too many versions of regret and bounds with very similar symbols, the sequence of \psi_t for dynamic regret is not defined for notations of Theorem 3.1. * The paper is missing a detailed discussion/examples of the behaviour of V_\Psi, which makes it's hard to judge the sensibility of the proven bounds. AFTER FEEDBACK: I had only minor comments and authors addressed them. My evaluation stays the same and I recommend the paper for acceptance.

Reviewer 3

This paper proposed a new theoretical framework for analysing the regret of gradient based meta learning algorithms and derived a few variants of algorithms to learn the initialisation and learning rate. It provides rigorous analysis about the online regret and how it depends on the similarity of tasks, number of gradient descent steps, number of tasks, etc. It provides a nice tool to select the update rule for the model initialisation and choosing the step size. It also provides an approach to analyse the bound and convergence rate of the excess transfer risk. However, I'm not familiar with the literature on online learning and its application on the meta-learning setup. Therefore, I'm not able to comment on the advantage of the proposed method compared to existing approaches. In the experiment section, the improvement of ARUBA over Adam is relative small for the reptile algorithm while the improvement on the FedAvg algorithm is a lot more significant. Could the author comment what factor leads to the big difference between these two experiments?