Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper introduces an attention-based recurrent sequence generator for phone recognition. The main contribution of the paper is the introduction of a different normalisation method for the attention vector (which works better for this task) and of a "hybrid" attention method that takes into account the preceding attention vector (which also improves results on this task).
The paper is well written and clear. It has an adequate 'related work' section and a sufficiently thorough experimental section that supports the claims stated at the beginning of the paper.
However, the novelty of the method presented is very limited: it amounts to two small modifications to the attention mechanism of neural machine translation methods.
The paper makes no mention of the computational requirements of the method presented. This is of great importance to the speech recognition community. A method that has access to the whole input sequence at every output step is necessary for machine translation due to the different syntactic rules used by different languages. However, for phone recognition this method is probably overkill, as coarticulation and related effects have only local influence in time. It would be interesting to see this method adapted to a rolling-window approach, which would probably be a lot more computationally efficient.
Q2: Please summarize your review in 1-2 sentences
The paper is sound but the novelty limited.
I recommend acceptance but the paper might find a more interested audience in ICASSP or Interspeech.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper applies the recently proposed attention-based model from machine translation to speech recognition.
My biggest criticism concerns the experimental section's use of the TIMIT dataset. It is not the best choice for the ASR task for several reasons. First, it is (almost) perfectly phonetically balanced (very unusual for real data) and was originally designed for linguistic/dialect studies rather than anything connected with ASR. Second, it is almost perfectly clean read speech without any real mismatch between training and testing conditions. Third, you cannot really draw strong conclusions based on it. As a result, there is a bunch of ASR techniques that work on TIMIT (and on nothing else) and a bunch of techniques that work on everything else (but not on TIMIT, for example, sequence-discriminative training). As your paper tries to address the ASR problem explicitly, I am not sure whether your approach is better than the other CTC approaches proposed to date, or whether it is just yet another TIMIT artefact.
Had you tried your model on anything more challenging (even something like WSJ, as in Alex Graves' ICML paper), I would be totally positive about this work. Reporting on TIMIT is all right provided the findings are further strengthened on a more challenging ASR benchmark.
Q2: Please summarize your review in 1-2 sentences
An interesting paper towards a promising CTC-style acoustic model, but not quite there yet.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper applies attention-based models to speech recognition. The model is extended with location awareness, sharpening and smoothing of the scores, and windowing. These extensions overcome the long-sentence issues of attention-based models and achieve improvements over straightforward applications of such models.
Quality: This paper describes an original attention-based model and mathematically explains the extension to a location-aware score function in Eqs. (5)-(9). The other extensions (sharpening and smoothing) are also described mathematically, with the long-utterance issues in mind. Therefore, the paper has sufficient quality in terms of mathematical/theoretical formulation.
Clarity: The paper clearly describes the attention-based model, the issues with long sentences, and the analysis in the experimental discussion.
Originality: Although the novelty of this paper is incremental, each step of the extensions is reasonably supported by the experimental results and analysis.
Significance: Since the task is relatively small (TIMIT phone recognition), the experimental results are not very significant. Although the proposed method uses a windowing technique, computing \alpha requires long-range computations in training and decoding, and the method does not seem scalable to large-scale (practical) speech recognition tasks.
I summarize the overall pros and cons as follows. Pros: the novel attention-based architecture provides end-to-end speech recognition. Cons: scalability.
Minor comment: 1. P.2, Section 2: L and T must be explicitly defined.
Q2: Please summarize your review in 1-2 sentences
A novel attention-based model with location awareness is applied to speech recognition. By carefully analyzing an issue with long utterances, the extended model obtains further gains over a conventional attention-based model.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We are very grateful to the reviewers for reviewing our submission and for their comments.
First, we would like to clarify the position of this work within machine learning research. Prior to this work, the literature on attention-based RNNs lacked an analysis of the applicability of the approach to long input sequences, and most importantly to sequences longer than those seen during training. Our work fills this important gap: it presents an analysis of the drawbacks of the existing approaches and proposes an effective and generic solution. Thus, we believe that this work belongs at a general machine learning conference like NIPS.
At the same time, our experiments delivered a novel speech recognition architecture that performs on par with existing approaches.
Below we respond directly to the criticisms of our work on the following points:
1. Novelty
We do combine ideas from A. Graves' work on sequence generation and on Neural Turing Machines (NTM), and from D. Bahdanau et al.'s work on machine translation. However, we introduce novel convolutional features that are arguably better than previous location-awareness proposals: they are fully trainable, straightforward to extend to many dimensions, and easy to implement. An attention mechanism with convolutional features does not involve predicting the shift to the next location, a strategy we argue against in subsection 2.1. The convolutional features are also in a sense deep, as opposed to the linear interpolation of content-based and location-based components in the NTM.
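For concreteness, here is a minimal numpy sketch of one step of attention with convolutional location features; the variable names, shapes, and filter width are illustrative assumptions, not our actual implementation:

    import numpy as np

    def location_attention_step(s_prev, h, alpha_prev, W, V, U, F, w, b):
        # s_prev: (n,) decoder state; h: (T, m) encoded inputs;
        # alpha_prev: (T,) previous alignment; F: (n_filt, width) filters.
        # Convolve the previous alignment with each learned filter.
        f = np.stack([np.convolve(alpha_prev, F[k], mode='same')
                      for k in range(F.shape[0])], axis=1)      # (T, n_filt)
        # Scores mix content (decoder state, inputs) and location (f).
        e = np.tanh(s_prev @ W + h @ V + f @ U + b) @ w         # (T,)
        e = e - e.max()                                         # stability
        alpha = np.exp(e) / np.exp(e).sum()                     # new alignment
        return alpha

Because the filters F are learned jointly with the rest of the network, the location features adapt to the data instead of being hand-designed shifts.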
We also believe that the investigation of the failure modes of both the location-aware and location-unaware attention mechanisms, along with the resulting notion of alignment sharpening, is a crucial contribution. The content-based attention mechanism allows for more varied alignments than those produced by other techniques, since alignments need not be monotonic; this brings in other failure modes. We demonstrate that the location-unaware model is able to learn to track its position implicitly but not robustly, which makes it fail in a completely different way from the location-aware model. Knowing how to apply models trained on short sequences to longer ones is important for other users of attention-based models. This need seems to be confirmed, e.g., by a recent contribution from Google - http://arxiv.org/abs/1508.01211v1 - which reports decoding failures on long utterances without proposing any solutions.
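As an illustration of what alignment sharpening means, the following sketch scales the scores by an inverse temperature and optionally keeps only the largest scores before renormalizing; the values of beta and top_k are illustrative assumptions, not tuned settings:

    import numpy as np

    def sharpen(e, beta=2.0, top_k=None):
        # beta > 1 sharpens the softmax; top_k keeps only the k best scores.
        e = beta * e
        if top_k is not None:
            keep = np.argsort(e)[-top_k:]          # indices of k largest scores
            masked = np.full_like(e, -np.inf)
            masked[keep] = e[keep]
            e = masked
        e = e - e.max()                            # numerical stability
        a = np.exp(e)
        return a / a.sum()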
2. Value of experiments conducted on the TIMIT dataset
We chose TIMIT because it is popular, small, and well known. Our main concern was how the attention mechanism would perform when asked to align long sequences and whether it would be able to generalize to even longer ones. The speech-recognition-specific shortcomings of TIMIT, such as its low noise and balanced phoneme set, are less important when the main goal is to provide a difficult benchmark for attention-based RNNs.
That being said, we are currently working with the WSJ and LibriSpeech corpora, which are substantially larger. We are also working on reducing the computational complexity to make the approach more practical; e.g., we apply the windowing from subsection 2.3 during both training and testing. This model has a computational complexity that scales linearly with the length of the output sequence (as opposed to the one used for the TIMIT experiments, whose computational complexity scales with the product of the lengths of the input and output sequences, i.e., roughly quadratically with the length of the output sequence). On WSJ, when trained directly on characters, we achieve a low 7.8% character error rate on the test_eval92 split with no language model used during decoding, and when decoded with the standard bigram WSJ language model the system reaches the performance of CTC models, at about 15% word error rate. Still, no initial alignments and no conventional ASR system are required.
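A sketch of the windowing idea follows; the interface of score_fn, the window half-width, and the choice of centering the window on the peak of the previous alignment are assumptions for illustration:

    import numpy as np

    def windowed_attention_step(score_fn, h, alpha_prev, half_width=75):
        # Evaluate scores only inside a window around the previous
        # alignment's peak: per-step cost is O(window) instead of O(T),
        # so the total decoding cost is linear in the output length.
        T = h.shape[0]
        center = int(np.argmax(alpha_prev))
        lo, hi = max(0, center - half_width), min(T, center + half_width + 1)
        e = score_fn(h[lo:hi])                     # scores inside the window
        e = e - e.max()
        a = np.exp(e) / np.exp(e).sum()
        alpha = np.zeros(T)
        alpha[lo:hi] = a                           # zero weight elsewhere
        return alpha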
The attention-based model also offers many new possibilities for speech processing. For instance, it discards the typical requirement that each phoneme span at least three speech frames. This permits pooling over time, and we have indeed succeeded in reducing the recording length by up to a factor of 16 with very similar decoding performance.
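To illustrate pooling over time, the sketch below mean-pools groups of consecutive frames; the pooling factor and the use of mean pooling are assumptions for illustration (stacking such layers, e.g. two with factor 4, would give a 16x reduction):

    import numpy as np

    def pool_over_time(h, factor=4):
        # h: (T, m) frame-level features; average each group of `factor`
        # consecutive frames, shortening the sequence the attention
        # mechanism must scan. The incomplete tail group is dropped.
        T, m = h.shape
        T_trim = (T // factor) * factor
        return h[:T_trim].reshape(-1, factor, m).mean(axis=1)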