|
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes graph kernels that take smoothed matching into account when computing a structural similarity, i.e., generated by graphlet kernels, Weisfeiler-Lehman subtree kernels, and shortest-path kernels.
Moreover, the paper proposes an extension of the Pitman-Yor process for smoothing subgraph distributions, which originates novel graph kernels.
The empirical evaluation shows that the proposed kernels outperform their non smoothed counterpart. Additionally, they also outperform other graph kernels and are competitive in terms of runtime.
The idea of using NLP smoothing for graph matching is interesting and reasonable. However, the model proposed by the authors is a subcase of a not-so-recent smoothed tree kernels. For example, see:
Danilo Croce, Alessandro Moschitti, and Roberto Basili. Structured lexical similarity via convolution kernels on dependency trees. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1034-1046, Edinburgh, Scotland, UK., July 2011. Association for Computational Linguistics.
which proposed a smoothing function in the partial tree kernel by Moschitti, ECML 2006. Such function is more general than the smoothing function proposed in the submitted paper since it can be any valid kernel similarity.
Note that the smoothed partial tree kernel can be straightforwardly applied to DAG, see:
Aliaksei Severyn, Alessandro Moschitti: Fast Support Vector Machines for Convolution Tree Kernels. Data Mining Knowledge Discovery 25: 325-357, 2012.
Finally, there has been older work using tree kernels with similarities:
Stephan Bloehdorn and Alessandro Moschitti, Exploiting Structure and Semantics for Expressive Text Kernels, Proceeding of the Conference on Information Knowledge and Management, Lisbon, Portugal, 2007
Q2: Please summarize your review in 1-2 sentences
Interesting idea but it seems a subcase of already proposed work. The paper would greatly improve if a comparison and discussion of the missing related work were added.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents a framework, applicable to three families of graph kernels, where smoothing techniques are introduced to create smoothed variants of graph kernels.
The paper is well-organized and easy to follow, however the main results are limited to small accuracy improvements.
+ I would like to see (less vague) statements on how the drawbacks of R-convolution based kernels (cited in the Introduction) are being addressed by the smoothing framework proposed in the paper;
+ Unfortunately, Table 1 shows very little improvement on most datasets, with the exception of WL + MUTAG?
Any comments on why this happens and/or why not on the other cases?
+ Diagonal dominance, power-law behavior, and partial similarities are more likely to occur when graphs contain more diverse/complex information and structure.
For example, when they are labeled with continuous labels.
Details/statistics about the dataset should not have been omitted;
+ Figures 1, 2, 3 are not very informative.
In fact figures 1 and 2 could be combined into a more clear and detailed example with lower k.
Q2: Please summarize your review in 1-2 sentences
This paper presents an incremental work on graph kernels.
Despite being a clever idea, smoothed variants of 3 graph kernels on different datasets do not show significant improvement (Table 1).
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors present a new framework for smoothing R-convolution kernels. They apply this idea to smooth various graph kernels. They present extensive results that demonstrate that smoothed kernels outperform their un-smoothed counterparts in term of accuracy. Finally they compare the new kernels in the presence of edge and label noise.
Pro: 1) The paper is well written, easy to read and the supplementary material is substantial. 2) The empirical results of Table 1 and Table 2 are convincing. 3) I applaud the authors for providing their source code (after acceptance).
Cons: 4) A complexity analysis and a minimum of the run time comparison should be part of the main paper. 5) I did not find the results in the presence of noise very convincing. 5.1) It seems that for many of these datasets, edges are undirected. For example, the protein dataset. For that reason, it makes no sense to flip them. For edge noise, the accuracy difference is almost always the same at 0% noise than at 30% noise. I would suggest trying to remove or to add edges instead of changing their directions. 5.2) I think results when vertices are randomly added or removed would be interesting.
Other comments / minor things: 6) I let to the authors to decide but combining Figure 1 and Figure 2 would make a stunning figure. Also, the last layer of the tree in Figure 2 could be removed, the edges are not visible. 7) Overall I feel the paper could be compressed a little to make place for some of the supplementary material. 8) The long sentence between line 233 and 237 is hard to read (too long). 9) Missing coma at line 267 10) The authors should clarify in the caption of Table 2 that "Smoothed" refers to the best kernel of Table 1. 11) Table 1 clearly does not fit the page.
Q2: Please summarize your review in 1-2 sentences
Well written paper, with good empirical results. A few things to address but nothing that should prevent its acceptance.
Q1:Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 5000 characters. Note
however, that reviewers and area chairs are busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
* Reviewer1
> However, the model
proposed .. This work is similar to ours in the sense that both
methods use the notion of smoothing for structured objects. The
related work section will be rewritten to take this work into
account. However, we believe that our method is fundamentally
different than theirs as follows: they solve the following problem:
given two trees, how similar are they? In contrast, we ask: given
two graphs, can we count sub-graphs and smooth the counts? The
smoothing is achieved by either using a pre-computed tree (graphlet
kernels) or by using a data dependent tree (WL and SP kernels). Once
the counts are smoothed, we discard the trees and compute the kernel
via a dot product of the smoothed counts.
* Reviewer2
> I would like to ... We will make this explicit in the final
version. Here are some brief comments:
1) Dependency between
sub-structures: Our framework considers dependencies between
sub-structures by explicitly encoding them into a DAG. These
dependencies are ignored by many existing R-convolution kernels.
2) Diagonal dominance: A subgraph is not only similar to itself but
also to its neighbors in the DAG. These "diffused" partial
matchings address diagonal dominance.
3) Power-law behavior: We
take away probability mass from sub-structures that occur
frequently but are not informative (top levels of the DAG) and
redistribute it into informative sub-structures (bottom levels).
This prevents small and uninformative subgraphs from dominating
the kernel.
> Unfortunately, Table 1 ... Many of the
benchmark datasets are a decade old, and very well studied. Therefore
one cannot expect dramatic improvements. Also MUTAG is a small
dataset comparing to others (e.g. 188 graphs vs. 4110), thus the
relative improvement can't expected to be similar. However, note that
almost all of our improvements are statistically significant at p
< 0.05 (shown in bold in Table 1).
> Diagonal dominance,
... This is definitely on our agenda for future work. The key
challenge here is to design a smoothing function that can respect
continuous labels.
* Reviewer3
> A complexity
analysis ... We will move this into the main body.
> It
seems that ... All our graphs have undirected edges. By 'flipping' we
meant randomly deleting and adding an edge. We will clarify
this.
> I think results ... If space permits, we will
include these results.
* Reviewer4
> This reviewer
believes ... Unless we misunderstood, the reviewer is suggesting
adding a pseudo-count for smoothing (aka Laplace or more generally
Dirichlet smoothing). What we describe are much more sophisticated
variants of these simple ideas.
* Reviewer5
> I
have to ... In the original SP kernel, the contribution due to the
pair (ABCD, ABCE) = 0. In our case, it is non-zero. This is over and
above the contributions of the sub-paths, as noted by the reviewer.
Longer paths have more discriminative power, but may not match
exactly. By taking the into account, we can improve the similarity
computation.
> Please comment on ... See our response
to Reviewer 1.
> It would be nice ... Our work is
similar to propagation kernels (PK) in the sense that both works are
designed to improve performance upon information loss (e.g. noise
vs. partially labeled graphs). They define graph features that are
counts of similar node-label distributions on the respective graphs
by using LSH. Our framework not only considers node label
distributions, but also explicitly incorporates structural
similarity via the DAG. Moreover, it seems like PK is only applied
to WL kernel, whereas our work is applied to 3 major graph kernel
families.
> Your DAG ... Both kernels exploit the same
property: all subpaths of a shortest-path (SP) are also SPs. The aim
of the GraphHopper work is to cache certain SP computations to speed
up kernel computation. In our case, we use this property to
smooth the counts.
Moreover, notice that their SP kernel do not
compare paths in terms of endpoints and lenghts, but compare nodes
encountered while 'hopping' along SPs. Therefore, our DAG is
different than ours since we build the DAG on SP sub-structures
(e.g. concatenated endpoints and length of the path).
*
Reviewer6
> However, it is ... Please see Appendix 9.4
Table 5 under the columns labeled GK, SP, WL DAG.
DAG
generation costs: 1. GK. One time effort of O(2^k). Empirically ~6
seconds for k=6.
2. SP. Linear processing after the SPs
have been computed.
3. WL. See Appendix, Figure 5 (bottom
figures).
> Also in some ... We conjecture that this is
primarily due to the differences in the noise level present in
different datasets. |
|