Paper ID: | 8164 |
---|---|

Title: | Generalization Bounds in the Predict-then-Optimize Framework |

The paper proposes theoretical bounds of the predict and optimize problem, using Natarjan dimensions and margin based bound ideas. The bounds appear novel and contribute towards the analysis of the predict and optimize problem, but unfortunately this isn't my field and difficult to review the novelty and innovation of the paper. The presentation is clear, and even for someone not in the field, the main logic and flow can be followed. One suggestion is it would make it more accessible by providing one or two sentences in each section to give the overall picture of how that section fits with the overall logic and contributions of the paper. This will allow readers, even if not in the area, to get a better sense of the overall picture.

I have read the author rebuttal. This is a good work. I have nothing to add. ==== This paper studies the problem of minimizing a linear function on a compact convex set, where the loss vector is not known but needs to be predicted given a feature vector and a dataset of feature-loss pairs. The problem formulation is not new and can be found in [7]. While [7] focuses on the computational aspect, this submission provides the first (as far as I know) generalization error guarantee. The statistical learning formulation is reasonable. The first generalization error bound in Section 3 is standard. The derivation of the second generalization error bound in Section 4 is sufficiently novel and interesting to me. The presentation is outstanding; all required information is given without obvious redundancy; the technical results are well connected by the paragraphs. As the first learning theoretic analysis of a well motivated problem, the significance of this submission is obvious. In particular, there are many immediate open problems as discussed in Section 5.

(1) In the introduction, the authors claimed that their margin-based approach removes the dependency on the number of classes by exploiting the structure of the SPO loss function. However, I can not see clearly this. As far as I can see, there is a logarithmic dependency for $\ell_1$-norm constrained ball and a sqrt dependency for $\ell_2$-norm constrained ball. (2) the comparison with related work is not clear. The authors should present comparison with existing work more clearly to show how this work advance the state of the art. For example, how the results applied to multi-class classification can be compared with existing works. (3) In Theorem 1, the risks and empirical risks are used but defined in Section 4. (4) It seems that several notations are defined but not used in the main text, e.g., $B_q(\bar{w},r)$. These notations can be defined in the supplementary material if they are used there only. --------------------- After authors' response: I have read the authors' response as well as other reviewer comments. I would like to keep my original score.

This paper considers a learning framework called predict-then-optimize. The problem in this setting is that parameters which are used in making predictions are not necessarily at hand when predictions should be made (costs of taking certain roads at particular moment are needed when a route has to be planned), and should be predicted before optimizing over them (in previous example, costs in the past are known and associated with other features known also at the moment). The interesting part of the framework is, that the learning problem used in learning the costs uses a loss function over error on the decision of the optimizer (SPO loss), instead of a direct error over the learned cost. In this framework, the authors provide several generalization bounds in different settings over a linear objective function, such as when feasible region where problem is solved is either polyhedron or any compact and convex region. They further work with stronger convexity assumptions and in their framework generalize margin guarantees for binary classification, and also give two modified versions of the SPO loss. Originality: The work takes a previously-known learning framework and contributes to it by giving generalization bounds for its nonconvex and noncontinuous loss. Quality: I did not check the proofs. Clarity: The paper is well written and the progression is easy to follow, although some details could be clarified for the non-expert reader (for example R_SPO in Theorem 1) Significance: The theoretical results on the difficult SPO loss can encourage further work with it. Authors also describe avenues for continuing research in other settings where Rademacher complexities could be proven. However SPO loss has not attracted much attention in the machine learning community previously (paper introducing SPO they cite is only in arXiv and not yet published). ---------------------- Edit: I have read the author's response, and stand by my review; I think that this is a good paper. As authors have addressed my questions in their responses, I have increased the score from 7 to 8. The work is clearly original and first contribution of a kind.