Mon Dec 4 through Sat Dec 9, 2017, at the Long Beach Convention Center
This paper addresses the problem of off-policy evaluation in combinatorial contextual bandits. A common scenario in many real-world contextual bandit problems is that a significant amount of logged data is available (for example, logged search queries, results, and clicks), but training or evaluating a model online in real time through, e.g., an A/B test is expensive or infeasible. Ideally one would evaluate/train the model off-policy (i.e. on the logged data), but this incurs a significant bias since the data was not generated by the policy being evaluated. The authors show in their experiments that this bias significantly degrades the quality of models trained directly on the logs. They devise a method called the 'pseudo-inverse estimator' to remove the bias and provide theoretical results showing that the method is unbiased under some strong assumptions (although in experiments it appears that these assumptions can be violated). They demonstrate empirically, in carefully crafted synthetic and real-world experiments, that the method outperforms baseline approaches.
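The bias described above is easy to see in miniature. Below is a minimal, hypothetical sketch using plain inverse propensity scoring (IPS) on a context-free bandit — illustrative of the general off-policy idea only, not the paper's pseudo-inverse estimator; all policies and reward numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_actions = 200_000, 3

# True expected reward of each action (context-free for simplicity).
true_reward = np.array([0.2, 0.5, 0.8])

# Logging policy mu favours the worst action; target policy pi favours the best.
mu = np.array([0.7, 0.2, 0.1])
pi = np.array([0.1, 0.2, 0.7])

actions = rng.choice(num_actions, size=n, p=mu)
rewards = rng.binomial(1, true_reward[actions])

# Naively averaging logged rewards estimates the LOGGING policy's value,
# which is a biased estimate of the target policy's value.
naive = rewards.mean()

# IPS reweights each logged sample by pi(a)/mu(a), giving an unbiased
# estimate of the target policy's value from the same logs.
ips = np.mean(rewards * pi[actions] / mu[actions])

true_value = float(pi @ true_reward)  # 0.68
print(f"naive={naive:.3f}  ips={ips:.3f}  true={true_value:.3f}")
```

The naive average lands near the logging policy's value (0.32 here) rather than the target's, which is exactly the gap the paper's estimator is designed to close in the much harder slate setting.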
This paper is very well written and, although dense, is quite clear. The careful explanation of notation is appreciated.
In the figures in the appendices, it looks as though the direct method (DM-tree) outperforms or matches the proposed strategy on a variety of the synthetic problems (particularly on ERR vs. NDCG). The authors justify this by noting that ERR violates the linearity assumption, which seems like a reasonable justification (and it is appreciated that both metrics are included). However, the experiment in Section 4.3 then shows PI doing slightly worse on NDCG and better on ERR, which seems inconsistent with the linearity hypothesis. The authors note that the baseline could be stronger for ERR, which suggests that perhaps the proposed method wouldn't outperform SUP on either metric?
It seems curious that although wIPS is provably unbiased under weaker assumptions than PI, it performs so much worse in experiments. Is this because the variance of the estimator is so high due to the cardinality of S(x)?
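The variance hypothesis is easy to check in a toy setting: with uniform logging over K actions and a deterministic target policy, the IPS weight is K on matching samples and 0 otherwise, so the estimator's variance grows roughly linearly in K — and in the slate setting S(x) is combinatorially large. A hypothetical sketch (generic IPS, not the paper's wIPS; the setup is invented):

```python
import numpy as np

rng = np.random.default_rng(1)

def ips_std(num_actions, n=5_000, trials=200):
    """Empirical std of the IPS value estimate when logging is uniform over
    num_actions and the target policy deterministically plays action 0."""
    estimates = []
    for _ in range(trials):
        a = rng.integers(num_actions, size=n)
        r = rng.binomial(1, 0.5, size=n)               # reward independent of action
        w = np.where(a == 0, float(num_actions), 0.0)  # weight pi(a)/mu(a)
        estimates.append(np.mean(r * w))               # each run targets the same true value, 0.5
    return float(np.std(estimates))

small, big = ips_std(4), ips_std(100)
print(f"std(K=4)={small:.3f}  std(K=100)={big:.3f}")
```

Scaling the action count from 4 to 100 inflates the spread of the estimates by several times, consistent with the cardinality explanation for wIPS's poor empirical showing.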
Overall, this is a very neat paper proposing a tidy solution with compelling theoretical and empirical results and well-thought-out experiments. The empirical results, however, don't convince me that this method is strictly better than the fairly naive (admittedly unsatisfying) approach of directly regressing on the logs with, e.g., a regression tree.
There's a typo on line 116: "in on".
The assumption on rewards on line 113 seems like it could be a strong one. It essentially says that the reward for each action in the slate is independent of all the other actions in the slate. In practice this must almost certainly be violated in most cases (e.g. people's clicks will vary based on the surrounding context). Although I suppose in some cases, e.g. in search where the user is looking for a single specific result, this may be fine.
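For concreteness, my reading of this assumption is that the expected slate-level reward decomposes additively over slots; in my own notation (a paraphrase, not the paper's exact statement), something like:

```latex
% Paraphrase of the assumed reward structure (my notation, not the paper's):
% the expected reward of slate s = (s_1, ..., s_\ell) in context x is a sum of
% per-slot intrinsic values, so slot j's contribution depends only on the
% action s_j placed there, never on the other actions in the slate.
\[
  \mathbb{E}\left[ r \mid x, s \right]
  \;=\; \sum_{j=1}^{\ell} \phi_x(j, s_j)
\]
```

Under this form, any interaction between items in a slate (e.g. a result being less clickable because a near-duplicate sits above it) falls outside the model.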
The notation in the paragraph starting on line 117 is a little confusing to me. The text refers to vectors, but they're indexed into with pairs (given the notation in the rest of the paper, I assume that they are vectors and the pairs map to a vector index?).
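If my reading is right, the indexing is presumably the usual flattening of (slot, action) pairs into coordinates of a length num_slots × num_actions vector, something like the following (hypothetical helper names; this is my guess at the convention, not taken from the paper):

```python
def pair_to_index(slot, action, num_actions):
    """Map a (slot, action) pair to one coordinate of a flat vector of
    length num_slots * num_actions (row-major: slots outer, actions inner)."""
    return slot * num_actions + action

def index_to_pair(index, num_actions):
    """Inverse mapping: recover the (slot, action) pair from a flat index."""
    return divmod(index, num_actions)

# Round trip: slot 2, action 5 with 10 actions lands at coordinate 25.
assert pair_to_index(2, 5, num_actions=10) == 25
assert index_to_pair(25, num_actions=10) == (2, 5)
```

Stating this mapping explicitly in the paper would resolve the ambiguity.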
Interesting paper on ranked list recommendation using contextual bandits.
The related work section is too brief and fails to account for the rich literature in contextual bandits for recommendation.
It is not clear why two different datasets (MSLR-WEB30K and MSLR-WEB10K) are used for the experiments in Sections 4.1 and 4.2. Why not use both datasets in both sections, and also add the other state-of-the-art datasets that are often used in the literature (YLR and Yandex)? This would only strengthen the reliability and validity of the findings.
The paper introduces a new estimator for the evaluation of policies for recommendation engines. The paper is well written. Evaluations are done in two experiments with a reasonably large amount of data.
This paper presents a novel off-policy evaluation procedure for slate recommendations enabling one to optimize a recommendation system given actions and rewards from a distinct policy. The proposed estimator is unbiased when rewards are additive within a slate but achieves good experimental results even without this linearity assumption. The authors provide proofs of unbiasedness as well as variance bounds for the estimator and evaluate its performance on a public search engine dataset as well as proprietary search engine logs.
The paper is well written and provides a thorough theoretical and empirical analysis of the proposed estimator, though the resulting density does come at some cost to comprehensibility. The contribution is significant, as off-policy evaluation is a persistent challenge in many domains including recommendation systems, and these results show significant improvements over existing options.