Paper ID: | 4754 |
---|---|

Title: | Surrogate Objectives for Batch Policy Optimization in One-step Decision Making |

Summary: The main points in the paper are: -- expected reward objective has exponentially many local maxima -- smooth risk and hence, the new loss L(q, r, x) which are both calibrated can be used and L is strongly convex implying a unique global optimum. -- experiments with cost-sensitive classification on MNIST and CIFAR -- batch contextual bandits: generalized reward imputation model, and showing that a particular surrogate loss which is strongly convex (under some imputation models) tends to perform better empirically. Originality: The work is original. Clarity: The paper is clear to read, except some details in the experimental section, on page 4, where the meanings of the risk R(\pi) is not described clearly. Significance and comments: First, in the new objective for contextual bandits, the authors mention that this objective is not the same as the trust-region or proximal objectives used in RL (line 237), but how does this compare with the maximum entropy RL (for example, Harrnoja et.al, Soft Q-learning and Soft actor-critic) objectives with the same policy and value function/reward models? In these maxent RL formulations, an estimator similar to Eqn 12, Page 5 is optimized. I would appreciate some discussion on this connection. Second, in both these cases, can we provide some guarantees on the test-time performance and possibly some guarantees outside of the seen (x, a, r_a) tuples? The model q is supposed to generalize over these tuples, but can we say something concretely about the performance of the policy learned using the proposed objective? Empirically, the proposed objectives perform a bit better than the other baselines and outperform methods like POEM on standard benchmarks. Overall, I like the idea of looking at surrogates for learning from logged data in a bandit setting, but I find some discussion on how this relates to existing works missing -- for example -- regularized MDPs (entropy, or generally, Bregman divergence regularized MDPs). Also, some discussion on how the proposed objective would "generalize" during training, and what considerations we should keep in mind at test time when using their approach would be appreciated. What I mean by this is that some discussion on the guarantees provided on the expected return of the learned policy under an arbitrary context distribution and action distribution of learned policy would be interesting (E_{x \sim p(x), a \sim \pi(x)} [r(x, a)]) and not just the data distribution ((x, a, r_a) \in D).

Originality: I find the idea of using a surrogate objective instead of the expected reward very interesting and novel. the extension to the partially observable setting is interesting as the proposed form finds a common denominator to multiple estimators, but its underlying idea is not novel. Clarity: the paper is overall very well written, and it has a nice flow. Although I am not very familiar with the topic, I certainly enjoyed reading this paper. The only comment is that it might be good to highlight more clearly the form of the convex surrogate. Overall, this seems to me to be a good paper, and my only main concern is the experimental protocol used and the presentation of the results. Specifically: - in Sec 2.2 the decision to show only 100 epochs is arbitrary. I would prefer if you could show the learning curves instead - All results (Fig 1 and 2, and Table 1) should include std or equivalent - Visualizing Train and Test error with the same scale does not really make sense, either use two different scales or separate the plots. - The last sentence of Sec 2.2. is crucial to your finding. I would strongly encourage to replace Fig 1 and 2 with learning curves instead and run them until convergence. Then any difference in optimization landscape will be visible as a change in the integral of the curves. - Similar comments about learning curves apply for Sec 3.6 - It is unclear to me if the reward estimation algorithm is actually evaluated in the experiments. Could you clarify? (it would be nice to include results) - Can you comment on the increased variance demonstrated by Composite on Table 2? Additional comments: - Abstract. "Here we ..." sounds a bit strange since you already list other contributions. - Generally speaking, I find curious that in a paper talking about policy optimization all the experiments consists of classification tasks "reworked" to be decision making. I wonder if it wouldn't be more interesting to use other decision-making benchmarks. - It might also be valuable in Sec 3.6 to run experiments with different data size and distributions other than uniform. - The code was available only for the CIFAR-10 experiments, and not for the MNIST --- Based on the rebuttal, I am increasing my score. The learning curves in the rebuttal might benefit from using log-axis.

Detailed comments: The paper considers policy learning for cost-sensitive classification and contextual bandits problems. The focus is on arguing that directly minimizing the empirical risk is difficult if the policy takes the specific form of \pi(x) = f(q(x)) where x is the context, q an unconstrained model and f the softmax function. To reduce the difficulty, surrogate losses are designed which have the “calibrated” property, which means efficient minimization of the surrogate functions implies efficient minimization of the risk. In the cost-sensitive case, the surrogate is convex. Originality: to the best of my knowledge, the two surrogate functions are original. They connect the cost-sensitive classification and contextual bandit which provide insights to both classes of problems. Theorem 1 is a nice result extending the understanding of the difficulty of optimizing empirical risk. Quality: The main contribution of this paper is theoretical and it is of high quality. One technical question is what the role of the baseline v plays in the theoretical result. It is shown empirically beneficial to have such a shift, could the authors provide some theoretical intuition on why this is the case? Empirically, how is v chosen? Clarity: The paper is mostly clear to read. One thing that can be improved is that the F function, first appeared in Proposition 2, is not defined. Later it is defined in the context for KL divergence but it is not clear whether in Proposition 2 it refers to the KL divergence version or not. Significance: I think the contribution is significant to the community which suggests new surrogate objective for both cost-sensitive classification and contextual bandits. Minor comments: Equation (8) has a typo, the middle expression, inside the parenthesis, should be a minus sign instead of a plus sign. Line 127: \hat{R}(\pi) does not yield the highest test error in MNIST. Am I missing something here?