Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper studies off-policy actor-critic, where the objective function depends on the on-policy distribution of the target policy (called the counterfactual objective in this work). It is shown that using the target policy's on-policy distribution can be beneficial; the gradient of this objective is then derived, resulting in the generalized actor-critic algorithm. The approach is evaluated on a few OpenAI benchmarks, comparing favorably against two off-policy actor-critic baselines. The novelty is somewhat limited, but the objective and algorithm are new. Overall, the reviewers feel the paper makes some interesting contributions.

Minor/detailed comments:

* Line 105: I am not sure it is standard to refer to c (the density ratio) as covariate shift. The latter (in my opinion) refers to the discrepancy between the training/behavior and testing/target distributions, i.e., the scenario where c \ne 1; it does not refer to c itself.

* The paper claims three contributions. However, the first (Line 45) is a bit over-claimed: the potential mismatch between the target and excursion distributions (and the resulting performance degradation) is already known.
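To make the terminology point concrete, here is a minimal sketch of the distinction between the density ratio c and covariate shift. The distributions below are hypothetical and purely illustrative; c(s) is the ratio of the target policy's state-visitation distribution to the behavior policy's, and covariate shift is the situation where the two distributions differ (c \ne 1 for some states), not the ratio itself.

```python
import numpy as np

# Hypothetical state-visitation distributions over a 3-state space
# (illustrative numbers only, not from the paper).
d_mu = np.array([0.5, 0.3, 0.2])   # behavior policy distribution
d_pi = np.array([0.2, 0.3, 0.5])   # target policy's on-policy distribution

# Density ratio c(s) = d_pi(s) / d_mu(s), used to reweight off-policy samples.
c = d_pi / d_mu

# "Covariate shift" names the discrepancy between the distributions,
# i.e. the fact that c(s) != 1 for some states -- not c itself.
has_covariate_shift = not np.allclose(c, 1.0)
print(c)
print(has_covariate_shift)
```

Under this reading, c is the correction factor one applies *because of* covariate shift, which is the reviewer's objection to the paper's usage on Line 105.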