Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The authors propose a method of combining multiple sub-policies with continuous action spaces by multiplicative composition (instead of the standard additive model in options, etc.). The sub-policies are pre-trained with imitation learning. MCP shows competitive or much better results than other state-of-the-art hierarchical and latent-space methods on challenging high-dimensional domains (the T-Rex playing soccer!).

Pros:
1) The idea is clearly written, with enough detail for re-implementation.
2) Compelling results on challenging environments.
3) Good baseline comparisons with very recent papers.
4) The analysis of latent-space methods is much appreciated.

Cons:
1) I'm not sure how novel the idea is, as there is a lot of literature on using an ensemble or mixture of experts/dynamics models/policies, etc. That being said, the results are very compelling. A possible area of overlap is the motion-generation literature, which the paper does not discuss; e.g., Riemannian Motion Policies/RMPFlow (Ratliff et al.) are used to combine multiple policies in a modular and hierarchical fashion and to transfer between different spaces. "Discovery of Deep Continuous Options" by Krishnan et al. may also be relevant to discuss.
2) I'm not sure the option-critic baseline is fair. As far as I understand, OC learns the options (they don't need to be hand-defined), whereas in MCP there is a clear separation between pre-training the skills and learning the gating. Perhaps a better baseline would be to pre-train the options in OC (in the same way as MCP) and compare against the full MCP pipeline. Then we could isolate the benefit of the multiplicative composition itself.
3) Do you hand-define the training for the sub-policies (it is stated that there are 8 sub-skills)? For instance, skill #1 is for walking, #2 is using the foot to kick the ball, #3 is for turning, etc.?
4) What are the crucial hyperparameters in MCP? Some insight regarding this would be useful.
5) Since the paper claims to beat recent state-of-the-art methods (e.g., Figure 4) in non-standard environments like the T-Rex dribbling (i.e., it is not an OpenAI Gym task), the authors should release code.
Originality: I'm not very familiar with RL and imitation learning, but the work seems original. It directly addresses a deficiency in existing approaches to learning complex behaviors, such as mixture-of-experts and hierarchical models.

Quality: The quality of the work seems high overall. The explanation of the model is fairly clear, the experiments seem thorough, and there is abundant follow-up analysis suggesting that the model achieves the desired effect, namely learning a set of primitive experts that combine to produce complex behaviors (Figure 7).

Clarity:
- I found the terminology around a primitive's "activation" quite confusing. My understanding is that a primitive being "active" means it contributes to the distribution that is actually sampled from. Under this definition, it makes sense to want multiple primitives active, so that we can leverage the representational power of K models rather than just one, in addition to primitive specialization. On the other hand, when you mention activating a "primitive skill", you seem to suggest performing multiple actions at the same time step. Does this mean the model is allowed to take several actions at each time step? That does not seem to be the case in the studied setting, yet it seems to be used to justify MCP.
- Using the definition of a primitive being active as contributing to the sampling distribution, in the additive model it seems fairly trivial to sample in a way such that multiple primitives are "active": instead of sampling from w, compute the linear combination of the primitive densities and sample from the resulting distribution. If that's the case, this seems like a useful baseline for isolating the effect of the representational power of the particular model used (multiplicatively factored primitives) from the ability to have multiple primitives active.

Significance: Overall the work seems quite significant. The theoretical benefits of the proposed model allow it to learn complex behaviors, which does seem to play out in the experiments. I could see work building directly on the model presented here, and comparing to it as a baseline.
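The additive baseline described above could be sketched as follows (a discretized 1-D illustration with hypothetical primitive means, scales, and gating weights — not the paper's implementation): form the linear combination of the primitive densities and sample from that combined distribution, so every primitive with nonzero weight contributes mass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D Gaussian primitives and gating weights.
mus    = np.array([-1.0, 2.0])
sigmas = np.array([0.5, 0.5])
ws     = np.array([0.3, 0.7])

def gauss(a, mu, sigma):
    """Gaussian density N(a; mu, sigma^2)."""
    return np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Linear combination of the primitive densities: every primitive with
# nonzero weight contributes, i.e. is "active" in the sense above.
grid = np.linspace(-5.0, 7.0, 2001)
density = sum(w * gauss(grid, m, s) for w, m, s in zip(ws, mus, sigmas))
p = density / density.sum()

# Sample actions from the combined (discretized) distribution.
samples = rng.choice(grid, size=10_000, p=p)
```

The grid discretization is only for illustration; any exact mixture sampler would serve the same purpose in an actual baseline.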
Generally speaking, this is a very interesting work with great originality and quality. However, the formulas and principles are not entirely clear. For example, how does a product of Gaussian primitives generate another Gaussian, and must the policies be Gaussian? What is even more puzzling is why the additive model cannot really activate multiple primitives simultaneously the way the multiplicative model can: from the perspective of the formulas, the two differ only in addition versus multiplication. Finally, I believe that learning and composing skills effectively is a very meaningful research direction, and this paper does a significant job.
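On the product-of-Gaussians question, a minimal 1-D numerical sketch (hypothetical parameters; this is the standard weighted product-of-Gaussians identity, not code from the paper): a weighted product of Gaussian densities, prod_i N(mu_i, sigma_i^2)^{w_i}, is proportional to a single Gaussian whose precision is the weighted sum of the primitive precisions.

```python
import numpy as np

def compose_gaussians(mus, sigmas, ws):
    """Weighted multiplicative composition of 1-D Gaussian primitives.

    The weighted product prod_i N(mu_i, sigma_i^2)^{w_i} is proportional to a
    Gaussian whose precision is the weighted sum of primitive precisions, and
    whose mean is the precision-weighted average of the primitive means.
    """
    mus, sigmas, ws = map(np.asarray, (mus, sigmas, ws))
    precision = np.sum(ws / sigmas**2)               # 1 / sigma^2 of composite
    mu = np.sum(ws * mus / sigmas**2) / precision
    return mu, np.sqrt(1.0 / precision)

# Two primitives pulling toward 0 and 2 with equal weight and scale:
mu, sigma = compose_gaussians([0.0, 2.0], [1.0, 1.0], [0.5, 0.5])
# -> mu = 1.0, sigma = 1.0: the composite sits between the primitives.
```

Shifting weight toward one primitive pulls the composite mean toward that primitive, and adding a tighter (smaller-sigma) primitive shrinks the composite variance — which is why sampling from the product lets several primitives shape a single action at once.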