NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Reviewer 1
*** The author rebuttal has sufficiently clarified my questions about the model assumptions and the statements of A3 and Theorem 1. ***

Originality: The paper considers a model that combines censored feedback with contextual information and non-parametric residuals; the combination of all three has not been studied before. The authors propose a new algorithm and prove that it achieves logarithmic regret.

Quality: Overall the results in the paper seem technically correct, but I have some concerns about the model/assumptions/results:
- It is unclear to me how restrictive the parametric assumption is that the expected log of the valuation is linear in the covariates. Although the authors claim that predictor variables can be arbitrarily transformed, designing the appropriate transformation, i.e. "feature engineering", is a non-trivial task in itself. Additionally, it is unclear how realistic/flexible the logarithm/exponential relationship is, as it forces the valuations to depend exponentially on the weighted covariate distances. If the purpose of the logarithmic form is to keep the valuation non-negative, there are other options for enforcing non-negativity as well.
- Is there a typo in Assumption A3? The first inequality implies that if I take z = z^*, then r(z^*, \theta_0) = r(z^*, \theta) even for \theta \neq \theta_0. Similarly, the second inequality implies that if I take \theta = \theta_0, then r(z^*, \theta_0) = r(z, \theta_0) for any z \neq z^*. Is this a typo, or is the assumption as stated required and potentially restrictive?
- In Theorem 1, does the analysis assume a particular choice of the parameter \gamma, or does it hold for any nonnegative \gamma? Since \gamma is used in the UCB/LCB, should we expect an expression similar to log(T)? I presume that if \gamma were chosen too small, something in the regret bound would break so that you no longer get logarithmic regret.

Clarity: The paper reads clearly.

Significance: The combination of censored feedback with contextual information and non-parametric residuals is a practically relevant setup, although it is unclear how realistic/flexible the specific choice of parametric model is. The algorithm, however, seems quite general and easily extendable to other parametric models.

Misc question:
- In your algorithm, you point out that each price selection checks one or more cells, so that you get free exploration. In fact, you get additional information from knowing that if the user rejected the item at price p, then the user would have rejected the item at all prices higher than p; you therefore get a datapoint even for cells that would have recommended prices higher than the chosen p_t. Symmetrically, if the user bought the item at price p, then the user would have bought the item at all prices lower than p; you therefore get a datapoint for cells that would have recommended prices lower than the chosen p_t (see the sketch at the end of this review). Do you imagine that the algorithm could be improved by further aggregating the data in this way? This may not be a good suggestion, though, as the inclusion of a datapoint in the estimate would then depend on the outcome of Y_t, which may introduce bias.
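To make the aggregation suggestion concrete, here is a minimal sketch of which cells could receive an extra observation at round t; the function and variable names are hypothetical and not taken from the paper, and whether such outcome-dependent inclusion biases the estimator is exactly the open question raised above.

```python
# Hypothetical sketch (not from the paper) of the extra datapoints implied by
# the monotonicity of censored purchase feedback.

def censored_feedback_updates(cell_prices, p_t, sale):
    """Return extra (price, outcome) datapoints implied by the round-t feedback.

    cell_prices : dict mapping a cell id to the price that cell would have posted
    p_t         : the price actually posted in round t
    sale        : True if the buyer purchased at p_t, False otherwise
    """
    updates = {}
    for cell_id, p_cell in cell_prices.items():
        if sale and p_cell <= p_t:
            # A sale at p_t means the valuation is >= p_t >= p_cell, so the buyer
            # would also have bought at the lower price p_cell.
            updates[cell_id] = (p_cell, True)
        elif not sale and p_cell >= p_t:
            # A rejection at p_t means the valuation is < p_t <= p_cell, so the
            # buyer would also have rejected the higher price p_cell.
            updates[cell_id] = (p_cell, False)
    return updates
```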
Reviewer 2
I think the results presented represent solid progress on the problem, and they appear to be new, though there are a number of ways in which they could be improved.

* A major drawback of the DEEP-C algorithm is that it has exponential running time. Given that the authors work in a well-specified linear model, this is a bit surprising. Is there reason to believe that computational intractability should be expected in this setting? This would be good to discuss in the paper, and if it is indeed the case, it would be nice to prove a computational lower bound. (A rough counting argument for the blow-up is sketched after this review.)

* For the regret bound in Theorem 1, the authors mention that regret scaling as $O(\sqrt{n})$ is necessary, so the bound is near-optimal in terms of its dependence on $n$. However, the bound depends on a number of other parameters, and it would be good to discuss whether their dependence can be improved. Even better, it would be great to include a lower bound. E.g., is regret scaling as $d^{11/4}$ to be expected?

* Given that the authors are proposing a new model, they should spend a bit more time justifying the model and its assumptions and showing why it is natural; otherwise the whole paper feels like little more than a technical exercise. The fact that the model generalizes the non-contextual setup of Kleinberg and Leighton (2003) is a plus.

Beyond this, the paper is fairly well-written and easy to follow. Some additional notes:

* You may want to consider citing related work on semiparametric contextual bandits (Greenewald et al. 2017, Krishnamurthy et al. 2018).

* There are multiple places in the paper where the authors use the phrase "we conjecture..." in passing. If these are serious conjectures, you should spend a bit more time elaborating on them and perhaps include an open-problems section; otherwise I would recommend removing them.
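For intuition on the running-time concern, here is a rough counting argument; the uniform grid and the $n^{-1/2}$ resolution are my own reading of the discretization, not figures taken from the paper.

```latex
% Back-of-the-envelope count (not from the paper): a uniform grid over a
% bounded parameter set in R^d with per-coordinate resolution \delta has
\[
  \left(\frac{1}{\delta}\right)^{d}
  \;=\; n^{d/2} \;=\; n^{O(d)}
  \qquad \text{cells when } \delta = n^{-1/2},
\]
% so any method that maintains a separate estimate per cell uses time and
% memory exponential in the dimension d.
```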
Reviewer 3
This is a good paper and I'd like to see it in the program. Both the model and the new algorithm are very valuable for the literature, and they open interesting avenues of further enquiry (e.g., can we get log(T) bounds in this model if the noise distribution is benign enough?). I have two concerns, though:

1. The algorithm runs in exponential time, as there are n^{O(d)} buckets, so it is not very practical unless the dimension is low. There are techniques for obtaining lower computational complexity using the algorithm of Plan and Vershynin, so this point is somewhat addressed in the paper, but that is a departure from the paper's main technique.

2. The point that concerns me more is that the paper seems to be re-deriving known results from stochastic bandits with side information. See, for example, "Leveraging side observations in stochastic bandits" (2012) or the references in https://arxiv.org/pdf/1905.09898.pdf .

I think neither of these is a fatal problem and I still suggest the paper be accepted, but I strongly suggest the authors address point 2 before publication.

----------

Post rebuttal: I remain unconvinced that the literature on bandits with side information cannot be applied here. I understand that the authors prefer their customized approach, but the paper is very nice nevertheless, and tying it to a general framework is useful in the sense that it makes the paper more readable and helps others build upon it. I strongly urge the authors to consider this more carefully. Here are my comments (see also the sketch after this review):

(a) Your algorithm already discretizes the parameter space (even though you have a continuous output). There is no harm in also discretizing the price up to 1/\sqrt{n} (so you incur at most \sqrt{n} more regret), so you can imagine that each cell "suggests" a single price p(z, \theta, x) for each context x. An arm is then a policy, parametrized by (z, \theta), that maps the context to a price.

(b) In the side-observation framework, a feedback graph over the arms is revealed in each period. Here the graph is determined by the context: two policies are linked in that period's graph if they suggest the same price.

(c) The action is, as before, pulling an arm, i.e. selecting a policy and posting the corresponding price.

(d) The graph links together policies that suggest the same price for the given context. Note that the graph changes in each period, but you have a uniform bound on its independence number: the graph is a disjoint union of cliques, one clique per price, so the independence number is at most the number of prices.
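Here is a minimal sketch of the per-round graph construction in (a)-(d); the policy class, price function, and names are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of the reduction to bandits with side observations:
# group the cells/policies into the cliques of round t's feedback graph.
from collections import defaultdict

def build_feedback_cliques(policies, price_fn, x_t, n):
    """Group policies into the cliques of round t's feedback graph.

    policies : iterable of (z, theta) pairs, one per cell of the discretization
    price_fn : maps (z, theta, x) to the price that cell would post
    x_t      : the context revealed in round t
    n        : horizon, used to discretize prices at resolution 1/sqrt(n)
    """
    step = 1.0 / n ** 0.5        # price grid width; adds at most O(sqrt(n)) regret
    cliques = defaultdict(list)  # one clique per distinct (rounded) price
    for z, theta in policies:
        p = round(price_fn(z, theta, x_t) / step) * step
        cliques[p].append((z, theta))
    # Posting any price p reveals the buy/no-buy outcome for every policy in
    # cliques[p]; since the graph is a disjoint union of cliques, its
    # independence number is at most the number of distinct prices, len(cliques).
    return cliques
```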