Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
The paper describes a framework that learns, from different expert trajectories, how to decompose a task into smaller sub-goals. These sub-goals are used to define a reward function; imitation learning is first used to learn a policy, followed by reinforcement learning guided by the sub-goals. A module based on one-class learning handles novel states, providing robustness to the system. A nice feature of the system is that it can learn tasks even from sub-optimal trajectories.

One limitation is that the system assumes a sequential ordering of the sub-goals, and all trajectories must start from the same initial state distribution. This means the system cannot handle demonstrations that follow different paths or start from different initial states. The number of partitions is given in advance, and although this number seems not to matter much, in some cases it may hamper the performance of the system. It would be great to have some guidance or procedure to make this selection easier.

The DTW process is not sufficiently clarified. Does the system evaluate DTW for all pairs of trajectories? It is also not clear how the goal assignment is performed for each state after DTW. There seems to be an implicit assumption that all trajectories have a very similar number of states; is that the case?

It is clear that the pre-learning process helps the subsequent RL process; however, it is not clear why the information from sub-goals cannot be used from the beginning, in a single learning process. In the experiments, the trajectories given to the system are not very good, so it is not surprising that the other methods perform poorly; with information from sub-goals, however, this initial guidance proves able to cope even with noisy information. It would be very useful to see how the other methods perform with close-to-optimal trajectories.
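To make my DTW question concrete, my reading of the alignment step would be something like the following minimal sketch (the function name and all details are mine, not the authors'): align two demonstration trajectories by cumulative state distance and recover the warping path, which could then be used to transfer sub-goal labels between trajectories of different lengths.

```python
import numpy as np

def dtw_align(traj_a, traj_b):
    """Classic O(len_a * len_b) dynamic time warping between two state
    trajectories (arrays of shape [T, state_dim]). Returns the alignment
    cost and the warping path as (index_in_a, index_in_b) pairs."""
    na, nb = len(traj_a), len(traj_b)
    cost = np.full((na + 1, nb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, na + 1):
        for j in range(1, nb + 1):
            d = np.linalg.norm(traj_a[i - 1] - traj_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    # Backtrack to recover which state of traj_a maps to which of traj_b.
    path, i, j = [], na, nb
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[na, nb], path[::-1]
```

If this reading is right, running it for all pairs is O(N^2) in the number of demos, which is why I would like the paper to state whether all pairs are aligned or each trajectory is aligned to a single reference.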
After reading the authors' rebuttal, I have chosen to maintain my score for these reasons: I believe that automatically decomposing a problem and being able to use information from noisy trajectories are relevant topics. Although there are some concerns with this paper, and I will not argue strongly for its acceptance, I still think it opens some new lines of research for solving more complex tasks.
Summary: The objective of this paper is to produce an RLfD algorithm that can both learn faster than sparse-reward RL from scratch and exceed expert performance. The main assumption is that all expert trajectories go through the same sequence of abstract steps, which are internal to the expert and implicit in the demos. By training classifiers to segment these trajectories into a sequence of monotonically increasing "goals", the goals can be used as sparse shaping cues. Abstracting over states in this fashion prevents the shaping signal from fitting the idiosyncrasies of the expert, and allows the experimenter to control the granularity of the expert's state-density distribution to follow during training. The main contribution is an EM-style algorithm for learning goal labels from expert trajectories. It works by alternating between (a) assigning a goal label to every state in the dataset such that goals are monotonically increasing over the demos, and (b) fitting an n-way classifier to maximize the log-probability of those labels. The authors also use an explicit anomaly detector to avoid shaping when far from the expert data, which showed a positive effect in their two most challenging tasks.

Comments:

"The poor performance of the ValueReward and AggreVaTeD can be attributed to the imperfect value function learned with a limited number of trajectories. Specifically, with an increase in the trajectory length, the variations in cumulative reward in the initial set of states are quite high. This introduces a considerable amount of error in the estimated value function in the initial states, which in turn traps the agent in some local optima when such value functions are used to guide the learning process." (lines 252-256)

> This argument should apply to the proposed method as well. The proposed method is effectively learning a value-like function on a discretized state space, and using it in the same potential-based way a regular value function would be.
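For reference, the potential-based reading I have in mind is the sketch below (the goal classifier here is a crude stand-in, not the authors' network): take Phi(s) to be the predicted sub-goal index and add the shaping term gamma * Phi(s') - Phi(s). This is exactly the form that leaves the optimal policy unchanged (Ng et al., 1999), and exactly the form in which a learned value function would be used, which is why the local-optima argument should cut both ways.

```python
import numpy as np

def goal_index(state, n_g=4, lo=0.0, hi=1.0):
    """Hypothetical stand-in for the learned goal classifier: buckets the
    first state coordinate into n_g ordered bins over [lo, hi]."""
    frac = np.clip((state[0] - lo) / (hi - lo), 0.0, 1.0 - 1e-9)
    return int(frac * n_g)

def shaping_reward(s, s_next, gamma=0.99, n_g=4):
    """Potential-based shaping with Phi(s) = predicted sub-goal index,
    i.e. a value-like function on a discretized state space."""
    return gamma * goal_index(s_next, n_g) - goal_index(s, n_g)
```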
How do the authors explain the robustness to local optima vs. the baseline methods? These tasks are not especially complex, and I find it suspicious that the authors were unable to get VBRS and AggreVaTeD to work at all.

"We also see that our method performs better than even the optimal expert (as it is only optimal w.r.t. some cost function) used in AntMaze" (line 285)

> Aren't you evaluating on that same cost function? If so, this sounds like an under-fitting issue, and if not, the difference in cost functions should be discussed.

Fig 3) All baselines fail utterly in tasks 1 & 3, which is suspicious. See comment below.

Fig 4a) Why does performance drop as n_d goes from 2 to 5? Why not show n=1? It seems that shaping may not be needed for this task.

Fig 4b) No effect of n_g -> suggests shaping isn't really necessary here and that the main effect comes from pretraining. Both 4a and 4b make a weak case for the proposed method, since the goal sparsity has little effect on learning speed or final performance.

"Although the equipartition labels in Eqn. 2 may have a many-to-one mapping from state to sub-goals, the learned network modeling πφ provides a one-to-one mapping."

> How can a mapping from R^n -> N_k be one-to-one?

It would be interesting to see whether expert performance can be exceeded when training solely from expert demos, i.e., without any environment reward at all. There's reason to believe this is possible, but it wasn't discussed explicitly.

Final comment: The main approach to goal labeling works by iteratively assigning labels to states in ascending order over trajectories, and training a classifier to minimize a cross-entropy loss with respect to those labels. It seems the root issues are the need to preserve the sequential ordering of class labels over time, and the need to preserve the balance of goal labels across the dataset. The narrative and algorithm might benefit from expressing these criteria more explicitly, e.g.
by using ordinal regression to capture the sequentiality of the discrete labels by construction in the model, and using a prior (e.g. uniform) or a KL regularizer on the labels to preserve entropy. This comment is merely a suggestion and did not affect the review.
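To illustrate the suggestion, a minimal cumulative-link (proportional-odds) sketch of ordinal output; the threshold values in the test are arbitrary placeholders, and a trained model would learn a per-state score and the ordered thresholds jointly:

```python
import numpy as np

def ordinal_probs(score, thresholds):
    """Cumulative-link ordinal model: one scalar score per state plus
    ordered thresholds b_1 < ... < b_{K-1}. P(g <= k) = sigmoid(b_k - score),
    so class probabilities are differences of an increasing CDF and the
    argmax label is monotone in the score by construction."""
    cdf = 1.0 / (1.0 + np.exp(-(np.asarray(thresholds) - score)))
    return np.diff(np.concatenate(([0.0], cdf, [1.0])))
```

With this parameterization the sequential ordering of goal labels never has to be enforced by a post-hoc relabeling step, and a KL term between the batch-average label distribution and a uniform prior would handle the balance criterion.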
The paper proposes an algorithm for solving sparse-reward MDPs with expert demonstrations by learning a set of sub-goals from the demonstrations. To learn the sub-goals, the authors train a multi-class classification model that assigns agent states to different clusters. To ensure that states later in a trajectory are not assigned to an earlier sub-goal, they use dynamic time warping to adjust the output of the model and use it to generate new training targets. To deal with the problem that the demonstrations do not cover the entire state space, they use the deep one-class classification algorithm, so that the final algorithm only uses predictions for states near the demonstration trajectories.

The method seems reasonable to me and the results seem solid. The paper is not difficult to follow. However, I have a few concerns and questions regarding the proposed approach, as listed below.

First, it's not clear from the text how important it is to learn a neural network model for the sub-goals. From the results in Figure 7, it seems that the results could be produced by some clustering algorithm as well. For example, what would happen if one used the equipartition labels (Eq. 2) with a nearest-neighbor classifier to specify the sub-goals? Also, although the use of DTW to ensure temporal consistency is interesting, the experiments do not verify whether it is actually necessary.

Second, the sub-goals used in the experiments seem to be the COM positions of the robots (judging from Figure 7), and all the problems have a navigation nature. It's not very clear how well this can generalize to higher-dimensional states, e.g., manipulation or locomotion tasks where the full joint state of the robot is needed. What happens if the expert trajectories solve the problem in different ways, i.e., there are branches in the middle of the trajectory?
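To make the first question concrete, the baseline I have in mind would be roughly the following (function names are mine): equipartition labels in the spirit of Eq. 2, each demo split into n contiguous time segments, paired with a plain 1-nearest-neighbor classifier and no learned network at all.

```python
import numpy as np

def equipartition_labels(traj_len, n_goals):
    """Assign each timestep of a demo to one of n_goals contiguous,
    equally sized, temporally ordered segments."""
    t = np.arange(traj_len)
    return np.minimum((t * n_goals) // traj_len, n_goals - 1)

def nn_goal_classifier(train_states, train_labels):
    """1-nearest-neighbor 'model': just memorized demo states and labels."""
    def predict(state):
        d = np.linalg.norm(train_states - state, axis=1)
        return int(train_labels[np.argmin(d)])
    return predict
```

If such a baseline matched the learned network on the tasks in Figure 7, it would suggest the network is not the important ingredient; if it failed, that would strengthen the paper's case.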
In addition, there has been some existing work that uses self-imitation to deal with sparse-reward problems, which could be applied to the problem of interest here. It would be nice if some comparisons could be done.

Learning Self-Imitating Diverse Policies. Gangwani et al. ICLR 2019.

==================================================

The authors' response has addressed most of my concerns, and the additional experiments regarding sub-goal modeling are well appreciated. I have updated my score.