Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors set out to solve a problem of the highly promising MDLSTM-RNN architecture, namely the difficulty of parallelising its computations. A novel sequential flow of information is proposed, which arguably makes parallelisation much easier.
Major issues ===
i) MDRNNs are not treated fairly. Fig 1 a) and Fig 2 left are dangerously misleading: they give the impression that it takes four MDRNNs to scan a whole 2d plane. But this is not true; the scanning can continue from the center point to the lower right corner, and the prediction can be made there. Consequently, a single sweep of a MDRNN sees *all* the pixels, a theoretical benefit of MDRNNs over Pyramid RNNs, for which this property is not shown.
ii) The authors make clear that their computational flow, that of a Pyramid, introduces some independencies which make parallelization easier. What they do not do is elaborate on the consequences of this from a statistical point of view--i.e. what assumptions are then made.
E.g. if we look at Fig 1 c), the pixels which are neighbours along the columns are treated as independent. This seems like a big restriction to me and I think it should be discussed in the text--e.g. is it possible to recover this dependency? (see next point)
iii) The C-LSTM layer is not justified, the authors do not give reasons why they do it. I can imagine that it helps to make up for the dependencies that the Pyramid layer is lacking--but the authors should make clear why this is done.
iv) It is not explained very well how the segmentation is actually done. The paper gives the impression that a single pass over a volume forms a single prediction--the output of the softmax. But what if many pixels are to be segmented? Do the authors propose to sweep over the volume once per segmentation label? Further, why are the authors choosing the MSE as a training criterion, as opposed to the negative log-likelihood of the categorical distribution induced by the softmax?
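The two training criteria contrasted here can be sketched as follows; this is a minimal NumPy illustration with invented logits and variable names, not code from the paper under review:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis (the label axis).
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mse_loss(probs, targets):
    # Mean squared error between softmax outputs and one-hot targets,
    # the criterion the authors reportedly use.
    return np.mean((probs - targets) ** 2)

def nll_loss(probs, targets, eps=1e-12):
    # Negative log-likelihood of the categorical distribution,
    # the alternative the review suggests.
    return -np.mean(np.sum(targets * np.log(probs + eps), axis=-1))

# Two hypothetical pixels, three segmentation labels.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2,  0.3]])
targets = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
probs = softmax(logits)
print(mse_loss(probs, targets), nll_loss(probs, targets))
```

Both losses are minimized when the softmax output matches the one-hot target; they differ in how strongly confident mistakes are penalized, which is one reason the choice deserves discussion.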
v) The experiments are very domain specific, while the method is not. I don't see why no results on more standard benchmarks are presented, e.g. Pascal VOC. This would add a lot of value to the paper, as it eases the reader's effort to evaluate the method for applications in different contexts.
Minor issues ===
i) I think that Lucas Theis' work on generative modelling with spatial LSTMs [2] might (!) be of interest for the related work section.
ii) The LSTM explanation is poor. The x symbol is overloaded, it is not clear what dimensionality the different quantities have, \text{} is not used in the subscripts, and the reader has to guess what x is. It should be easy to improve this to reduce the cognitive burden on the reader.
iii) Introduction, "neighbouring" -> "predecessing".
iv) The correct cite for dropout on the non-recurrent connections of RNNs is [1], who did that earlier than Google.
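For reference, the standard LSTM update the notation complaint alludes to is usually written as follows; this is the generic formulation from the literature, not the paper's exact (convolutional) equations:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

where $x_t \in \mathbb{R}^n$ is the input, $h_t, c_t \in \mathbb{R}^m$ are the hidden and cell states, $W_\ast \in \mathbb{R}^{m \times n}$, $U_\ast \in \mathbb{R}^{m \times m}$, and $\odot$ is elementwise multiplication. Stating the dimensionalities explicitly, as done here, is the kind of fix the review asks for.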
[1] Pham, Vu, et al. "Dropout improves recurrent neural networks for handwriting recognition." Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on. IEEE, 2014.
[2] Theis, L., and Bethge, M. "Generative Image Modeling Using Spatial LSTMs." arXiv:1506.03478, 2015. http://arxiv.org/abs/1506.03478/
Q2: Please summarize your review in 1-2 sentences
While the paper addresses an important issue with MD-LSTM, it falls short in some aspects which I do not believe can be fixed in a camera-ready version. In particular, i) some of the architectural decisions are not justified, ii) others are not explained clearly enough, and iii) the experimental validation makes it hard for the reader to place the method in the map of alternatives.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s).
Paper is about applying LSTM-type models to biomedical data. But beyond this, the contribution of the paper is unclear. On one hand it argues that the pyramid-configuration is good because it makes it GPU compatible. But to make this case, the following are needed: (a) showing speedups on GPU vs CPU for this model; (b) demonstration that changing the topology does not significantly impair performance, relative to existing multiD-LSTMs. The paper has neither of these. Alternatively, the paper could be viewed as applying a novel NN approach to bio-medical data and obtaining good results. The issue here is (a) no baseline comparisons are made to existing (and much simpler) NN approaches, like 2D/3D convnets, or indeed the standard (slow) multi-D LSTMs; (b) the results obtained are good, but not comprehensively better than the existing approaches.
Thus I don't think the paper really says anything useful to a machine learning audience. It might be of interest to a biomedical audience (although a critical observer there would also surely want to see what a simple convnet would do).
The pyramid reformulation itself is quite interesting, but without experiments to show that it is an efficient replacement for the existing multi-D LSTMs, it is unclear if it is a good idea or not.
The paper is clearly written. The related work seems fine (given my limited knowledge of the biomedical literature).
The experiments on the various datasets show good numbers. Although it isn't clear, I presume many of the existing approaches are not using deep nets, thus probably represent rather soft targets to beat. This is why the lack of a simple deep convnet baseline is an issue: it may also beat out these previously published methods.
Q2: Please summarize your review in 1-2 sentences
Paper proposes a new topology for multi-D LSTMs, where they are arranged in a pyramid structure making the model GPU compatible. It is applied to a variety of biomedical data, showing good results. The overall point of the paper is unclear: is it about GPUifying multi-D LSTMs, or about beating existing non-NN approaches? The experiments don't compellingly make either case.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s).
This paper provides a parallel implementation of recurrent neural networks, which requires a non-trivial adaptation of existing training algorithms.
Experiments demonstrate the ideas in practice on two medical image analysis problems and show promising results.
The idea appears sound and the experimental results show significant promise. The paper is well presented and the key ideas come across clearly. This is not a topic I am particularly familiar with, so I cannot be certain of the originality, but from the authors' summary of the literature it appears to be original. The contribution is potentially significant, because it enables the training of deep networks in problems where context and recurrence are important, such as the image analysis problems the authors demonstrate with.
Q2: Please summarize your review in 1-2 sentences
A decent and timely contribution. This is a non-trivial solution to a pressing parallelisation problem, enhancing deep learning on problems that require context of unknown scale.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s).
I doubt the LSTM can, as claimed in the abstract, capture "the entire spatio-temporal context of each pixel in a few sweeps through all pixels". The memory is, in practice, never quite large enough!
I applaud the authors for working on a medical dataset. For some comparison to other recent segmentation methods it would be nice though to also compare on more standard benchmark datasets.
Q2: Please summarize your review in 1-2 sentences
This paper introduces a novel multidimensional pyramid LSTM model and applies it to a very interesting task in medical image segmentation.
The model is also parallelizable on GPUs which is a major improvement over previous methods.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all reviewers for their valuable feedback. Below we respond to general and reviewer-specific concerns:
General:
We would like to clarify our results:
- We apply the exact same architecture on both datasets. This is significant since the competing algorithms are very specialized.
- On the EM dataset, we outperform a state-of-the-art deep convolutional neural network (NN) in Rand Error, the most important error measure for this dataset (see 'IDSIA' in Table 1; these authors are experts in NNs and won several competitions). We were not able to apply the SCI post-processing, which is tuned to improve Rand Error further, to our results (the SCI authors did not reply). We still report the SCI post-processing on the output of other methods for completeness ('IDSIA' and 'DIVE').
- On the brain dataset, other teams applied various pre-processing and post-processing steps to adapt their methods specifically for MR image segmentation. Our approach, by contrast, is not optimized for the specific domain, and only simple pre-processing is applied. The organizers of mrbrain13 noted that especially the performance on cerebrospinal fluid was impressive.
- We chose biomedical images since they are varied and challenging, and provide segmented volumetric images. 3D volumetric image segmentation is a challenging task in the biomedical field. The datasets used for our evaluation are large, and the images are very noisy and complex.
Reviewer_1: Major: We appreciate the detailed feedback. We think there is some misunderstanding between segmentation and classification that causes several problems; we try to explain this below.
i) Yes, your example works for a single classification. But we want pixel-wise classifications (segmentation) using the cell outputs at every pixel position. Then at the middle pixel, as shown in Fig 1a, you only get information from part of the image, and have to combine six sweeps in all directions.
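The combination of directional sweeps described here can be sketched as follows; this is a toy NumPy illustration where a cumulative mean stands in for the recurrent C-LSTM computation, and summation is assumed as the combination rule (both are our simplifications, not the paper's exact model):

```python
import numpy as np

def sweep(volume, axis, reverse):
    # Placeholder for one directional sweep over a 3-D volume: each
    # position accumulates context from everything "behind" it along
    # the sweep axis. A cumulative mean stands in for the recurrence.
    v = np.flip(volume, axis) if reverse else volume
    counts = (np.arange(v.shape[axis]) + 1.0).reshape(
        [-1 if a == axis else 1 for a in range(v.ndim)])
    out = np.cumsum(v, axis=axis) / counts
    return np.flip(out, axis) if reverse else out

def combine_sweeps(volume):
    # Six sweeps (two directions per axis of a 3-D volume), combined
    # per pixel so that every pixel receives context from the whole
    # volume -- the point made in the rebuttal.
    return sum(sweep(volume, ax, rev)
               for ax in range(3) for rev in (False, True))

vol = np.random.rand(4, 4, 4)
out = combine_sweeps(vol)   # one combined value per pixel
```

A single sweep gives each pixel only the context behind it; only the per-pixel combination of all six directions covers the full volume, which is why pixel-wise segmentation cannot rely on one sweep alone.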
ii) We discuss the effects of our topology at several places in the paper. All pixels in this system get information from all other pixels, and convolutions model neighbouring pixels over all axes (the arrows in Fig 1c show the convolutions; we will make this more clear).
iii) The C-LSTM layer is not an extra layer; it is the workhorse of PyraMiD-LSTM. It performs the convolution computations shown in Fig 1c, and its outputs are combined as in Fig 3 (the C-LSTM is shown in the drawing).
iv) Please see the answer to i). A good question: we chose MSE after getting better results with it; we will note this in the paper.
v) Please note that the two biomedical datasets are very varied. We agree that more experiments in other domains would be valuable, but it is already an achievement to run deep LSTMs at these scales with high speed on these volumes. There are also not many volumetric segmentation datasets (Pascal is 2d, not 3d).
Minor: We will improve the citations and make the LSTM explanation clearer. A good point: the domains are only defined in the C-LSTM, not the normal LSTM. We will fix this.
Reviewer_2: We agree with the comments.
Reviewer_3: We believe the method is in essence simple, but LSTM equations are known to be complicated. We tried to keep the presentation simple by focusing on clear formulas, but we will try to make it clearer.
Getting state-of-the-art results on these datasets is not really a negative: the competing algorithms are state-of-the-art random forest models (on mrbrain13) and deep neural networks (on membrane segmentation); these are not run-of-the-mill algorithms. These datasets are large, comprehensive, and complicated tasks, so we would not call these results preliminary.
We will fix the CSF acronyms.
Reviewer_4: Good point; the idea of LSTM is to propagate information very far and thus learn to combine information selectively. The last layer has 64 values per pixel, so we hope this allows for enough memory (this was the limit on our current hardware). We are planning to try video data as future work.
Reviewer_5: Please see the general comments above about the comparison with NN approaches and the choice of biomedical datasets.
- GPU vs CPU: Because of our model, we are able to parallelize the convolution operations (with cuDNN); this would not be possible in MD-LSTM. This is much faster, as is well known for convolutions. We will make this argument clearer in the paper.
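The parallelization argument can be illustrated with a toy 2-D scan; the kernel and update rule below are invented for illustration, and a plain NumPy convolution stands in for cuDNN:

```python
import numpy as np

def rowwise_scan(image, kernel):
    # Toy version of the pyramid topology's parallelism: each row
    # depends only on the previous row's hidden state, so the whole
    # recurrent contribution for a row is one 1-D convolution that a
    # GPU can compute for every pixel in the row at once. A classic
    # MD-LSTM instead has a pixel-by-pixel dependency chain within
    # each row, which prevents this batching.
    h = np.zeros(image.shape[1])
    out = np.empty_like(image)
    for r, row in enumerate(image):
        # One convolution over the previous hidden row supplies the
        # recurrent input for all pixels of the current row.
        h = np.tanh(row + np.convolve(h, kernel, mode='same'))
        out[r] = h
    return out

img = np.random.rand(8, 8)
res = rowwise_scan(img, np.array([0.25, 0.5, 0.25]))
```

In 3-D the same idea applies plane by plane instead of row by row, so the sequential depth grows with one dimension of the volume rather than with the number of pixels.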
- PyraMiD-LSTM vs MD-LSTM: This would indeed be a great comparison, but it is prohibitive in computation time. The experiments on these volumetric datasets are simply not possible with current MD-LSTM implementations, but doable with PyraMiD-LSTM.
- The EM images are evaluated against a state-of-the-art convolutional NN; please see the general remarks. The competing algorithms in brain analysis are state-of-the-art computer vision algorithms from big research groups.
Reviewer_6: We agree the method can be of broad interest.