Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper proposes a mechanism for maintaining distributions over Q-values (called Q-posteriors) by defining the value function (the V-posterior) to be a Wasserstein barycenter of Q-posteriors and defining the TD update to be a Wasserstein barycenter of the current Q-posterior with an estimated posterior based on the value function. These distributions are intended to represent uncertainty about the Q-function and they enable more nuanced definitions of the "optimal" (w.r.t. actions) Q-value, for computing the V-posterior during TD learning, as well as for defining an exploration policy. Contributions seem to be: 1. A means of propagating uncertainty about Q-values via Wasserstein barycenters (Equations 2 & 3). 2. A proof that a modified version of the proposed algorithm is PAC-MDP in the average loss setting (Theorems 5.1 and 5.2). Empirical results: 1. Promising returns on four online RL tasks over tabular domains. 2. Favorable returns on the Asterix game, as compared to the Double DQN baseline. Strengths: 1. The paper is fairly clearly written and easy enough to understand. 2. The idea of propagating uncertainty via Wasserstein barycenters is interesting and suggests several concrete realizations. 3. Preliminary empirical results shows that propagating uncertainty can positively impact performance. Weaknesses: 1. The distinction between modeling uncertainty about the Q-values and modeling stochasticity of the reward (lines 119-121) makes some sense philosophically but the text should make clearer the practical distinction between this and distributional reinforcement learning. 2. It is not explained (Section 5) why the modifications made in Definition 5.1 aren't important in practice. 3. The Atari game result (Section 7.2) is limited to a single game and a single baseline. It is very hard to interpret this. Less major weaknesses: 1. The main text should make it more clear that there are additional experiments in the supplement (and preferably summarize their results). Questions: 1. You define a modified TD learning algorithm in Definition 5.1, for the purposes of theoretical analysis. Why should we use the original proposal (Algorithm 1) over this modified learning algorithm in practice? 2. Does this idea of propagating uncertainty not naturally combine with that of distributional RL, in that stochasticity of the reward might contribute to uncertainty about the Q-value? Typos, etc.: * Line 124, "... when experimenting a transition ..." ---- UPDATE: After reading the rebuttal, I have raised my score. I appreciate that the authors have included additional experiments and have explained further the difference between Definition 5.1 and the algorithm used in practice, as well as the distinction between the current work and distributional RL. I hope that all three of these additions will make their way into the final paper.
Thanks for the rebuttal and the extra results on two more atari games. I agree that this should be considered a theory paper and I'm keeping my score as is as I still think it's appropriate. ******************************************** The main idea of the paper is to construct a new TD update rule for the posterior over state-action values based on the notion of Wasserstein barycenter. This is an original idea which requires some time to digest. Based on this notion of TD learning, the authors derive a Q-learning algorithm which can take advantage of the estimated posterior for guiding exploration. They prove the efficiency of this algorithm in a specific tabular PAC-MDP setting, and they also attempt to scale the algorithm to situations where function approximation is necessary. The paper is very clearly written. Further comments: 1. The explanation of PDQN is not very detailed. There should be a separate algorithm box for PDQN. For example, the update rules considered in the paper are "derived" from gradient descent update rules and are directly tied to the notion of Wasserstein barycenter. DQN uses RMSProp update rule and so the connection to the theoretical results and definitions is unclear. 2. Why not show results for other Atari games? Asterix doesn't seem like the natural first choice for evaluating an algorithm. It seems that there are some extra hyperparameters e.g. related to the choice of the prior. The extra flexibility that stems from tuning these parameters could improve performance due to reasons unrelated to having an access to a good posterior.
-------------------------------------- Post rebuttal ---------------------------------------- I thank authors for providing the rebuttal, specifically for clarifying the relevance of OT to the problem considered and for agreeing to extend the discussion of OT use-cases in RL. I am increasing my score. ---------------------------------------------------------------------------------------------- The paper is rather clearly written. Detailed supplementary material was helpful to clarify several questions I had while reading the main text. It seems that Proposition A.3 requires a bit of a revision for the particle case. For a 1d Wasserstein barycenter to be an average, the particles should be ordered as in the Proposition A.1, but nothing about ordering is stated in A.3. One of the most intriguing properties of OT is its ability to take into account the geometry of the support of the distributions. In this paper OT is used for the value-function posteriors, which have 1d support without any apparent geometric properties. This also leads to quite trivial barycenters. My question is what is the reason to use Wasserstein barycenters over some other way of averaging distributions? Perhaps a more promising direction to incorporate OT into RL is to consider distributions over the state space. In many RL applications state space has interesting geometric properties which seem to be ignored by standard RL algorithms. I am wondering what authors think about such perspective and the premise of OT in RL in general.