Export Reviews, Discussions, Author Feedback and Meta-Reviews

Paper ID:	1237
Title:	Learning with a Wasserstein Loss

Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)

In this paper, the authors propose to use the wasserstein loss for data fitting in multilabel/multiclass learning. This loss can encode specific losses for between classes error and will promote meaningfull errors (such as error between dog races) instead of costly errors

( for instance the recent Google Gorilla/people).

The authors discuss the use of a regularized version of wasserstein distance (mainly for computational reasons).They propose 2 potential extensions that can handle the difference of mass between the data and the predicted values. Short numerical experiments show the interest of the approach.

It is an good paper that propose a novel data fitting term that can encode error cost between classes. Numerical experiments seem encouraging. The rebuttal of the authors clarified a lot of my questions but there are problems that must be addressed before acceptance.

-The use of Wasserstein to encode between class loss is elegant but one can wonder if it's not a bit optimistic. Indeed the loss correspond to

the

affectation between class that minimize the loss. Intuitively, one would like to minimize the maximum loss over the residue |h()-min(h,y)| in order to avoid the worst case scenario (gorilla/human discussed above). Note that using the maximum loss on the residue would also boil down to TV for multiclass classification and 0-ground metric. This should be at least discussed in the introduction (with a figure?) since Wasserstein is a complex tool that is difficult to interpret.

- strategy 4.2.1 seem to just be lost space. If it has been tested and is inferior to the 4.2.2, then the authors should just refer to it in one or two sentences and spend more time detailing the model and the optimization problem.

- Add the demonstration for proposition 4.1 and 4.2 in the supplementary material. A discussion about the convergence of the fixed point iteration is also required since it will not provably converge unless the operation is a contraction mapping.

- Thanks to the rebuttal, we now know what the model of h(x) is (linear+softmax). This should appear clearly in the paper along with an equation giving the final optimization problem. This must be followed by a quick discussion on the chosen optimization algorithm and a discussion about the type of problem (is it convex? I doubt it). Not only

is it necessary for reproducible research but it will clarify the proposal.

- While you will doubtfully have time to add it in the final version, numerical experiments would benefice from comparison to other multilabel approaches (binary relevance for instance). One way to illustrate the strength of the proposed approach is to use a performance measure independent to the wasserstein (unlike K-cost that will obviously be better for wasserstein minimisation). For instance flicker you could use a clustering of the tags and you could measure the error using inter-cluster errors. If the clustering encode semantically similar tags together, your proposed approach will clearly perform better that other divergence loss (l1,l2,KL) since it will promote the semantic relations encoded in the metric.