Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The present paper focuses on kernel-based multi-task learning. More specifically, it deals with learning the output kernel defined over multiple tasks. Starting from the vector-valued RKHS formulation of the multi-task learning problem, the authors propose an output kernel learning algorithm that is able to learn both the multi-task learning function and the task dependencies. The authors first provide an optimization formulation of the problem which is jointly convex. Then they show that this optimization problem can be solved without the positive-semidefiniteness constraint on the output kernel, which is computationally costly, and propose the use of a stochastic dual coordinate ascent method to solve it.
Experiments on multi-task data sets and comparison in terms of performance and running time with previous multi-task learning algorithms are provided.
The paper builds upon recent studies on multi-task learning and output kernel learning. In particular, it is closely related to: [1] Dinuzzo et al., Learning output kernels with block coordinate descent, ICML 2011. [2] Zhang & Yeung, A convex formulation for learning task relationships in multi-task learning, UAI 2010.
While these two references are clearly mentioned in the paper, there are other closely related papers which are not discussed and cited. For example: [3] Kadri et al., Multiple operator-valued kernel learning, NIPS 2012. [4] Scalable matrix-valued kernel learning for high-dimensional nonlinear multivariate regression and Granger causality, UAI 2013. [5] Ciliberto et al., Convex learning of multiple tasks and their structure, ICML 2015.
These papers, in particular [4, 5], need to be cited and discussed, and the proposed approach needs to be compared to them from theoretical and experimental perspectives.
The contribution is somewhat limited. The main novelty is the dual formulation of the optimization problem, which does not need the positive-semidefiniteness constraint on the output kernel. However, it should be noted that even in the previous works [1] and [2], where an alternating optimization scheme is adopted, the positive-semidefiniteness constraint is not needed. Incorporating the solution of the minimization over the multi-task learning function into the optimization problem for output kernel learning (without the positive-semidefiniteness constraint) yields a positive semi-definite matrix over tasks. This is clearly mentioned in [1] (see Eq. 7) and [2]. The way this paper is written gives the impression that it is the first to avoid the optimization over S_+^T.
Also, the result about the joint convexity of the output kernel learning optimization problem is not new. This is largely discussed in [5] and also in [2]. It is true that [5] is a newly published paper, but the contribution of the present work would be more visible if it took into account recent results in this field.
In the experiments, it is not clear why the proposed approach is compared to OKL in terms of running time but not in terms of performance. The paper is about output kernel learning, so the performance of the proposed method needs to be compared to that of [1, 4, 5].
Update: The authors did a good job in answering my questions. My main concern is about the presentation of the results and their novelty. I think there is technical novelty and originality in the proposed approach; however, the authors should better explain this and make clear what the novel contributions of the paper are in comparison to what is known.
Q2: Please summarize your review in 1-2 sentences
The paper provides an output kernel learning algorithm that avoids the positive-semidefiniteness constraint on the output kernel. It provides some interesting results and a new formulation of the output kernel learning problem. The writing can be improved, the contribution is somewhat limited, and some closely related references are missing.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This is a well-written paper that introduces a novel method for multi-task learning. The mathematical derivation of the resulting optimization problem and the algorithm used to estimate the parameters are well explained. The usefulness is supported by several experiments on multi-task learning and multi-class classification problems.
However, I have one major concern w.r.t. the method that is introduced. Despite the good performance in experiments, I am not really convinced of the formulation in Eqns. (2) and (3). Those formulations do not allow for a separate regularization mechanism for tasks and parameter vectors of a single task, respectively. More specifically, it is known in the field that two different regularization mechanisms have different effects in multi-task learning. Adding a regularizer for the parameters of a single task prevents overfitting, as in traditional classification or regression settings.
However, including a second regularization mechanism that shrinks models, so that related tasks have similar models, is known to improve the predictive performance (see several references from the reference list). Many existing methods for multi-task learning adopt this principle, in which two different regularization parameters are tuned in a validation step. The method of the authors does not have this flexibility, so theoretically I would expect that it is inferior w.r.t. predictive performance, compared to existing methods. The issue is not at all discussed in the paper, and I think this is an important point. It should be discussed why formulation (2) is a particularly good approach for tackling multi-task learning problems.
Q2: Please summarize your review in 1-2 sentences
This is an interesting paper, but I have one major concern about the proposed methodology (see below).
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The main contribution lies in Section 3, where an efficient optimization procedure is presented to solve a convex subproblem on a PSD cone. Some notation is confusing. For example, the variable 'k' is used to denote the kernel function in Section 2, but in Section 3 it is also used to define p in Theorem 3 and as a subscript in Theorem 4. It would be better to use different symbols in different places.
In the experiments on the SUN397 dataset, the authors are encouraged to compare with other MTL methods instead of just the MTL-SDCA method.
Q2: Please summarize your review in 1-2 sentences
This paper presents a new optimization method for some existing works [7,14] when using some appropriate regularizers for a task covariance matrix.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note, however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all reviewers for their feedback. We highlight our main contributions:
- Using the theory of positive definite kernels, we show that the constraint that the output kernel is positive semi-definite (psd) can be dropped in our multi-task learning formulation, and we derive an unconstrained dual optimization problem (line 212) which we solve efficiently via dual coordinate ascent. We achieve an average speed-up of 7-25x over OKL, depending on the dataset. We characterize the class of regularizers for which dropping the psd constraint is possible (Theorem 5). This is a highly original contribution and could also lead to more efficient algorithms in other domains.
- We show that the p-norm regularizer for the output kernel with small p leads to better prediction performance (Tables 3 and 4) and to better interpretability of the output kernel (Figure 1), as spurious correlations between tasks are eliminated.
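As a generic illustration of the dual coordinate ascent principle mentioned above (this is only a sketch on a toy concave quadratic dual g(a) = b^T a - 0.5 a^T Q a; the paper's actual dual at line 212 is not reproduced in this rebuttal, and the variable names here are ours):

```python
import numpy as np

# Toy concave quadratic dual g(a) = b^T a - 0.5 a^T Q a with Q positive
# definite; coordinate ascent maximizes g exactly in one coordinate per step.
rng = np.random.default_rng(2)
n = 8
B = rng.standard_normal((n, n))
Q = B @ B.T + n * np.eye(n)      # well-conditioned positive definite matrix
b = rng.standard_normal(n)

a = np.zeros(n)
for sweep in range(200):
    for i in range(n):
        # setting dg/da_i = b_i - (Q a)_i = 0 for coordinate i, others fixed
        a[i] = (b[i] - Q[i] @ a + Q[i, i] * a[i]) / Q[i, i]

# the maximizer of g solves Q a = b
print(np.allclose(a, np.linalg.solve(Q, b), atol=1e-6))
```

Each inner update is closed-form because the dual is quadratic in a single coordinate; on this toy problem the sweeps are exactly Gauss-Seidel iterations for Q a = b.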
Detailed Comments: References [R1-R5] below are as used by Reviewer 1. The rest of the references are as in the paper.
Reviewer 1
- "contribution is limited as ... previous work [R1] and [R2] need not the positive semi-definiteness constraint"
[R1, R2, R5] indeed have closed-form expressions for the output kernels in their alternating schemes for fixed weights. However, computing these expressions requires a costly eigenvector decomposition (EVD), at cost O(T^3). [R1] needs to solve a Sylvester equation, which is solved most efficiently using an EVD (the authors of [R1] implement it this way in their code). In contrast, we show that the dual optimization problem is unconstrained and that the output kernel can be obtained by applying a function *elementwise* to all entries of a certain matrix, which requires *no* EVD and is automatically positive semi-definite. Via the primal-dual gap we then have a well-defined stopping criterion, in contrast to the alternating schemes.
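A minimal sketch of why such an elementwise construction can be automatically psd (an illustration of the general principle, not the paper's actual formula, whose details are not given in this rebuttal): by the Schur product theorem, applying a function with a non-negative power series, such as exp, elementwise to a psd matrix yields a psd matrix, with no EVD involved.

```python
import numpy as np

# Hypothetical stand-in for the "certain matrix" built from dual variables:
# any psd matrix will do for the illustration.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
M = (A @ A.T) / A.shape[1]       # psd by construction

Theta = np.exp(M)                # elementwise exp: O(T^2), no EVD
# Schur product theorem: Hadamard powers of a psd matrix are psd, so the
# elementwise exp (their non-negatively weighted sum) is psd as well.
print(np.linalg.eigvalsh(Theta).min() >= -1e-8)
```

The eigendecomposition above is only used to *check* positive semi-definiteness; the construction of Theta itself touches each entry once.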
- "the joint convexity of the output kernel learning problem is not new"
We nowhere claim novelty here. The main step for the joint convexity is Lemma 1, for which we cite [2, R2]. In lines 114-119 we discuss similarities and differences with respect to existing work. However, the complete derivation in this explicit form is not contained anywhere. While [R5] is similar (that paper appeared just a few days before the NIPS deadline), we think that our derivation, which avoids vector-valued RKHS, is easier to understand.
- "why is the proposed approach compared to OKL [R1] in terms of running time but not in performance; performance needs to be compared to [R4, R5]"
For squared loss and a Frobenius norm regularizer on Theta, OKL is the same as our approach denoted FMTL_2-S (p=2 and S for squared loss). As the problem is convex, our solver and theirs yield the same result. We obtained the code from the authors of [R5]. For the OKL/FMTL_2-S formulation, our FMTL_2-S is 4.3 times faster than [R5] with the best hyper-parameter setting and on average 7 times faster over all hyper-parameter values (on the SUN dataset). An updated Fig. 2 with the timings of [R5] is available at: http://postimg.org/image/svdw70mcl/
Regarding prediction, we compare with all the formulations studied in [R5], which are: CMTL [9], OKL [R1], MTRL [R2] and [2]. [2] is a special case of GMTL [8] with 1 group, and we have included this case in our experiments. The approach of [R4] learns both an input and an output kernel. We compare prediction performance with other formulations that learn input and output kernels (MTFL [23], GMTL [8]; see lines 314/315). Moreover, the authors of [R4] told us that code for their method is not available.
Reviewer 3
- "the proposed formulation should have a separate regularization mechanism for every task; this is known to improve predictive performance"
In principle, we agree. We implicitly have a separate regularization for every task via the output kernel (Eq. (4) and line 144). If the output kernel is a diagonal matrix, then the inverse diagonal entries correspond to task-specific regularization parameters. Thus we implicitly tune task-specific regularization parameters when learning the output kernel. Implicit regularization has been used in previous multi-task approaches [14, 23]. We also compare to approaches which explicitly regularize every task [R2, 9, 13] and outperform them.
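A small numerical sketch of this correspondence (assuming the standard trace regularizer tr(W^T Theta^{-1} W) from the vector-valued RKHS literature; the notation here is ours, not the paper's): with a diagonal output kernel, the joint regularizer decomposes into per-task ridge penalties weighted by the inverse diagonal entries.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 4, 5                       # number of tasks, feature dimension
W = rng.standard_normal((T, d))   # row t = weight vector of task t
theta = np.array([0.5, 1.0, 2.0, 4.0])
Theta = np.diag(theta)            # diagonal output kernel

# joint regularizer tr(W^T Theta^{-1} W) ...
joint = np.trace(W.T @ np.linalg.inv(Theta) @ W)
# ... equals a sum of per-task ridge penalties ||w_t||^2 / theta_t
per_task = sum((W[t] @ W[t]) / theta[t] for t in range(T))

print(np.isclose(joint, per_task))
```

So learning the diagonal of the output kernel amounts to tuning one regularization weight 1/theta_t per task.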
Reviewer 4
- "for the SUN397 dataset, the authors should compare with other MTL methods instead of just MTL-SDCA"
MTL-SDCA was chosen as it has state-of-the-art results on the SUN397 dataset and a scalable multiclass implementation. In the meantime we have efficiently implemented MTL [13]. Its accuracies on SUN are: 42.0 (m=5), 47.4 (m=10), 51.7 (m=20), 57.0 (m=50); we are better. The implementation of MTRL [R2] is not efficient in the multi-class setting, which leads to memory problems on SUN. We are implementing a memory-optimized version of MTRL (for multiclass) and will report its results in the final version. Please note that for USPS/MNIST we compare to GMTL and MTRL.