Paper ID: 2170

Title: A Consistent Regularization Approach for Structured Prediction

The paper presents a framework for structured prediction that accommodates a wide range of loss functions through a "loss trick", representing the losses via embeddings of the outputs in an RKHS. The authors present an elegant framework for optimising the models, which essentially boils down to solving a kernel ridge regression type problem and then computing a pre-image (decoding) to arrive at the prediction. They provide a consistency analysis and a generalization bound for the framework. The method is experimentally compared to alternative approaches on different structured prediction tasks, with good results.
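The two-stage estimator summarized above (a kernel-ridge-regression-type linear system on the inputs, followed by a loss-weighted decoding step over candidate outputs) can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the Gaussian kernel, the 0/1 loss, and all function names are assumptions.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # Pairwise Gaussian kernel between rows of X1 and X2 (illustrative choice).
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def fit_weights(X_train, lam=1e-2):
    # KRR-type training step: the prediction weights alpha(x) come from
    # solving (K + n*lambda*I) alpha = k_x, so we precompute the inverse.
    n = len(X_train)
    K = gaussian_kernel(X_train, X_train)
    return np.linalg.solve(K + n * lam * np.eye(n), np.eye(n))

def predict(x, X_train, Y_train, W, loss, candidates):
    # Decoding / pre-image step: score each candidate output by the
    # alpha-weighted losses against the training outputs, take the argmin.
    k_x = gaussian_kernel(X_train, x[None, :])[:, 0]
    alpha = W @ k_x
    scores = [sum(a * loss(y, yi) for a, yi in zip(alpha, Y_train))
              for y in candidates]
    return candidates[int(np.argmin(scores))]
```

With a 0/1 loss and a finite candidate set this reduces to weighted nearest-prototype classification; the point of the framework is that `loss` can be any structured loss satisfying the paper's Assumption 1.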

In my view, this is a beautiful paper that will advance the field of structured prediction significantly and provides a platform for further development. Nevertheless, the paper should be better related to existing work on vector-valued regression for structured output. A recent related work is the following, but there are others: Céline Brouard, Florence d'Alché-Buc, Marie Szafranski. Input Output Kernel Regression: Supervised and Semi-Supervised Structured Output Prediction with Operator-Valued Kernels. 2015.

The paper is generally well written; I have only a few remarks:

- lines 70-72: you might note already here that this amounts to a ridge regression problem in the output Hilbert space.
- line 74: (7) is generally known as the pre-image (or decoding) problem. It would be good to mention this already here.
- line 129: closing parenthesis missing: (\psi(y,
- line 171: extra full stop: ". [3]".
- line 186: somewhat convoluted grammar here: "of the same order of the generalization bounds...". Please rephrase.
- line 211: T => \ell in the dimension of the label space, I think.
- line 212: i does not appear inside the argument of argmax or argmin.
- lines 223-230: I found this subsection somewhat cryptically written. Please write more to the point. What is the relationship of your results to SVMstruct? Can you cover max-margin models such as SVMstruct and others (e.g. those in Bakir et al. 2007) as well, or are those models outside your theory?
- Figure 1: please use only one of the terms, KRR or KRLS.

2-Confident (read it all; understood it all reasonably well)

This paper focuses on the L2-regularized least-squares approach to structured prediction (also called Kernel Dependency Estimation, KDE). The authors show that for a large class of loss functions (those satisfying Assumption 1), the estimator given by Algorithm 1 is universally consistent (Theorem 4). They also provide a generalization bound (Theorem 5) with a (1/n)^(1/4) convergence rate. The theoretical analysis is supported by the provided experimental results.
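The cited rate can be written schematically as follows (a sketch of the stated result only; here $\mathcal{E}$ denotes the expected risk, $\hat f_n$ the estimator of Algorithm 1, and $C$ a problem-dependent constant — this is not the paper's exact statement with its constants and conditions):

```latex
\mathcal{E}(\hat f_n) - \inf_{f} \mathcal{E}(f) \;\le\; C\, n^{-1/4}
```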

This paper is technically very strong, very well written, nicely presented, and relevant to structured output learning algorithms. However, the authors claim (line 169 and line 220) that these are the first consistency results and generalization bounds for KDE approaches. I believe that this statement is wrong, because risk bounds for KDE were provided by Giguère et al., "Risk Bounds and Learning Algorithms for the Regression Approach to Structured Output Prediction", ICML 2013, JMLR W&CP, vol. 28, pp. 107. In particular, Theorem 4 of that paper is valid for normalized input kernels and any loss function upper-bounded by the loss induced by the output kernel (their eqn 3). Also note that their ridge coefficient implies the use of \lambda \in O(1/n) in Algorithm 1. I believe that the results presented by the authors are stronger than those of the Giguère et al. paper (since they provide a universal consistency result and a coverage of loss functions which is apparently larger?). But the authors should relate their results to those presented in that paper.

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

This paper aims at characterizing a large class of loss functions that allows structured outputs to be naturally embedded in a linear space. Based on this fact, a surrogate loss approach with a regularization term is formulated naturally. Universal consistency and finite sample bounds are proved, characterizing the generalization properties of the proposed methods.

(1) Is the loss defined in Assumption 1 required to be symmetric or non-negative? (2) Universal consistency and generalization bounds have been provided extensively for classical learning problems such as SVMs. My question is: what is the difference and the challenge in your proof for structured learning problems? (3) How should one choose an appropriate kernel in (3) for the \Delta loss?

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

- The authors propose a regularization approach for structured prediction problems. First, they suggest a specific estimator (Alg. 1) and justify this method for a special case (normalized kernels). Then, they extend it to general cases by using surrogate loss and regularization techniques. In the statistical analysis, they prove universal consistency and finite sample bounds characterizing the generalization properties of the proposed method. Finally, they show results on three tasks: ranking movies, digit image reconstruction, and robust estimation.

- As shown in this paper, the main idea is quite similar to the kernel trick of SVMs. So it seems important to choose a proper kernel for practical applications, but there is a lack of explanation of how to do so. I also wonder why the MovieLens 100k dataset is used for the experiments, because as far as I know it is the oldest of the MovieLens datasets and there are better ones containing more samples. Also, I think experiments should be performed on other tasks with more complex datasets, such as natural language processing and multiclass classification, as mentioned in the introduction, so as to show that this method applies well to general structured prediction in practice. Lastly, I found some minor typos in this paper: line 46 (it -> is), line 115 (a colon may be needed between "Functions" and "Most"), line 171 (the period should be removed between "consistent" and "[3]"), line 292 ("s" should be removed from "milds").

1-Less confident (might not have understood significant parts)

The authors generalize the idea of Kernel Dependency Estimation (KDE) to address losses that cannot be written as simple RKHS distances. To this end, the authors propose a surrogate framework and derive their estimator. They are able to prove the consistency of the estimator and give finite sample generalization bounds under standard assumptions. Experiments also demonstrate good results compared to other baselines.

I like this work very much. The proposed method is non-parametric and applicable to a wide variety of losses. Moreover, the authors are able to prove consistency and generalization bounds for this approach. Although I didn't check the proofs in the supplementary material very carefully, they seem to be very solid. The experimental results are also impressive. In my opinion, this work provides a good method for structured prediction and should have significant impact in the field. Some questions:

- The algorithm requires solving a different optimization problem (which may be combinatorial and non-convex) each time a prediction is made. Do you have any ideas about how to do this in general and how to apply it to large-scale datasets?
- In line 470 (in the supplementary material) the authors state that d . g is measurable because it is a combination of measurable functions. But combinations of measurable functions are not necessarily measurable, or did I miss something?
- There are some typos to be corrected, such as the r.h.s. of the inequality in line 129, 'Reimannian' in line 121, etc.

2-Confident (read it all; understood it all reasonably well)

This paper proposes a regularization method for structured output prediction (Kernel Dependency Estimation, KDE) using a surrogate loss framework. The authors provide generalization bounds for their algorithm and also show that the proposed method is universally consistent.

Although this paper is outside my expertise, I find the proposed algorithm, which boils down to a plain kernel ridge regression for structured output prediction, quite interesting. In general the paper is well written, but the proofs and the math could be made more accessible to general machine learning readers. The experiments support the authors' hypothesis. Minor comments: how sensitive was the proposed algorithm to the choice of the hyperparameters? I would also like to see a discussion of similar algorithms for neural networks.

1-Less confident (might not have understood significant parts)