Reviews: Multiclass Performance Metric Elicitation

This paper studies the metric elicitation problem, a problem proposed by [7] (AISTATS'19 paper). Unlike the previous work on the binary classification, this paper generalizes to the multi-class setting. The authors defined two types of metrics for multi-class classification, Diagonal Linear Performance Metric (DLPM) and Linear Performance Metric (LPM), and designed algorithms for the two type of metrics respectively. The algorithm for DLPM is relatively simple: just use binary-search like [7] and apply it for (k-1) times. The LPM case is much more complicated - the authors proposed a coordinate-wise binary-search style algorithm based on the geometry of the feasible set. Theoretical guarantees for both algorithms are provided. About the presentation of the main results: I am a little confused by the presentation of the main results. To me, it seems like the most interesting (and practical) performance metric, is the linear-fractional one in Appendix E.2. Why isn't this (perhaps the most important one) presented and highlighted in the main text? Is there any particular reason for it? Disclaimer: I have NOT checked the (very long!) appendix very carefully. About the assumptions: the assumptions in the main text (assumptions 1&2) look standard to me. But in the appendix, I am not sure why assumptions 3&4 are necessary - they seem quite unnatural to me. Since these linear-fractional settings are the more important ones, I hope the authors can explain a bit on this. About the optimality of the bounds: It is nice that the proposed algorithms all have theoretical guarantees. But how good are these algorithms? In [7], there is a simple O(log(1/eps)) lower bound on query complexity for binary classification. However, in this paper, I cannot find any lower bounds. It does seem like the O(k) and O(log(1/eps)) is necessary in theorem 1 (but I still hope that we can have a formal lower bound). But for the dependency on the feedback noise eps_omega, theorem 1 has a \sqrt(eps_omega) overhead on the error, but theorem 2 has a sqrt(q*eps_omega) = O(k* \sqrt(eps_omega)). Note that this term is important since it does not go to zero when eps goes to zero. How necessary is the extra factor of k here? Overall, with the concerns above, I think this paper is marginally below the acceptance threshold. But I am happy to increase my score if the authors can address these issues in the response. ======================== After the author response, I think the authors addressed part of my concerns, so I raised my score by 1pt. However, I agree with reviewer 2 that the writing and experiments can be improved, so I won't increase my score beyond that.

Reviewer 2

The authors consider the setting of [1'] where the goal is to find the implicit performance measure that an expert uses for classification. This performance metric is assumed to be a linear function of the confusion matrix (or occasionally a linear fractional as in [1]) and the goal is to find it (or approximate it) using a few queries. The queries are in the form of comparison queries which given two classifiers (or confusion matrices) outputs the classifier that is better according to the implicit score function. This submission extends the analysis of [1] to the multiclass case which poses new challenges. The algorithm for the diagonal case finds the ratios between one of the diagonal elements (say the first one) and all the other elements. This is sufficient as metric is scale invariant. Finding each of the ratios is a simple binary search problem (though the analysis requires some considerations regarding the geometry of the problem to make sure that the binary search method works). The algorithm for the general (linear) case is based on binary search and the ideas from the derivative-free optimization approach. Some experiments have been done. However, no comparisons with any baselines have been made. The authors can at least compare the method with a method that queries randomly and picks the winner (and again the winner is compared with another random solution). I think there are many other better baselines to compare with. The work of [2] may also be somewhat relevant. [1'] G. Hiranandani, S. Boodaghians, R. Mehta, and O. Koyejo. Performance metric elicitation from pairwise classifier comparisons. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 371–379, 2019. [2] Kane et al, "Active classification with comparison queries" === The author's response did not change my decision. I still think the experimental section is weak and the paper needs some rewrite.

Reviewer 3

# originality Even though metric elicitation itself has been proposed by Hiranandani et al. (2019) in the context of binary classification, there is still a large gap between binary and multiclass classification. I claim that there still exists novelty in terms of multiclass analysis of metric elicitation such as restricted Bayes optimal classifiers. # quality Although I could not check every detail of proofs, the proposed algorithms seem reasonable to me. On the other hand, I want to confirm one point. Why do you query four points in both algorithms 1 and 2 although those algorithms are based on binary search? I have read Appendix A and found that you mentioned this is because it would successfully escape from flat regions. However, I guess we can still construct pathological cases that do not work for algorithms querying four points. For example, assume that we query four times and find that m^a = m^c = m^d = m^e = m^b. Even so, there might be a peak in between, say, m^a and m^c. I guess that phenomena would be difficult to avoid no matter how many query times we increase, as long as the target metric parameterization is pseudo concave. Could you clarify this point? # clarity - Definitions 1 and 2: I still did not get the intuition on why the norms are different. Is it due to the different parameterization of confusion matrices? - In "Parameterization of upper boundary" (l.149): I would prefer adding \nu: [0,1] \to \partial\mathcal{D}^+_{k_1,k_2}. The same to the parameterization in LPM (l.183). - l.242: It would be better to add "1-Lipschitz with respect to the confusion matrices." In addition, why does this hold with high probability, not with probability 1? # significance Multiclass metric elicitation would have a practical impact because it gives one way to choose an appropriate performance metric, which is intuitively quite hard. The extension technique to the multiclass case may help further extensions of metric elicitation to more complicated tasks, such as multi-label and structured predictions. ========== After feedback ========== Thank you for responding to my questions. On the first answer, claiming the unimodality has great importance. In addition, it is interesting to look at Figure 10. However, I still have a concern that we cannot detect the modal no matter how many queries we use, given the illustration in Figure 10. For example, take a look at Figure 4 (right) in Appendix. If we obtained this kind of oracle responses, we are still not aware that the modal is located in between m^a and m^c, or between m^c and m^d. Anyway, since I love the idea presented in the current paper, I hope this part can be made clearer in the future version. Here, I will not change the score.

Paper ID:	4994
Title:	Multiclass Performance Metric Elicitation

Reviewer 1

Reviewer 2

Reviewer 3