
Submitted by
Assigned_Reviewer_2
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
Overview: This paper looks at the difficult problem of
learning FST models of unaligned input and output sequences. This is an
interesting problem and the approach appears to have merit; the main
drawback of the paper is that some sections are very difficult to
understand.
Major Comments: 1. In the abstract and
introduction: the authors mention that the setting where sequences are not
aligned is more realistic. I believe the authors, but examples of problems
where unaligned sequences are the norm would be welcome. 2. This was a
very dense paper that took me quite a while to understand. In particular
the critical Sections 2.3 and 3.1 are close to impenetrable. I think it
would be helpful to expand these sections while maybe cutting Section 4.1
up to example 4 in order to make the necessary room. 3. The
experimental results do seem to show that the proposed method does very
well at recovering a FST model with the unsupervised approach while
missing alignment information on a simple problem. However, given that in
the introduction the authors claimed that the motivation for learning FST
models of unaligned input and output sequences was biology and natural
language processing (where unaligned input and output sequences are
common), I was a bit disappointed that the experimental results were only
conducted on a very simple toy problem.
Minor Comments: 1.
The figures after the first paragraph (DNA sequences?) need some context
or caption. 2. Line 67 “constraint” → “constrain” 3. It is not
possible to tell apart the red and blue lines in a noncolor printout in
Example 1 . 4. The text in Figure 1 is way too
small. Q2: Please summarize your review in 12
sentences
The paper contains an interesting algorithm to solve
an interesting problem, but is also very dense and difficult to
understand. Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a spectral algorithm for learning
FSTs in an unsupervised manner (i.e. without observing the
inputoutput alignments). This is done by introducing a generalized
Hankel matrix and deriving a result (Prop. 4) that extends a previous
result for the case where the alignments were observed. The Hankel
matrix is estimated by formulating a constrained rank optimization
problem, which is then relaxed to minimizing the nuclear norm.
Overall, this is a nice paper that extends previous work on
spectral learning of FSTs, the novelty being in addressing the case of
unobserved alignments, which complicates the problem quite a bit. The
paper is clearly written and everything is defined and stated
rigorously.
The theoretical properties in 4.1 are nice to have,
but they all refer to the solution of the NPhard problem so are
somewhat useless. Can we claim something about the solution of the
relaxed problem?
In the experiments, it's a little disappointing
that the EM algorithm performed better than the spectral learning
methods (but the same has happened in prior work on spectral
learning).
Minor comments:  In Def. 4, shouldn't the
emptyempty alignment {(eps_+, eps_)} be excluded from \Sigma?  In
Prop. 2, the sum \sum_{s \in \Sigma_+, t \in \Sigma_} should be
\sum_{s \in \Sigma_+^*, t \in \Sigma_^*} (i.e., the Kleene stars are
missing)  line 154: a punctuation sign is missing before "One" 
line 161: "On" > "One"  in def. 8: "\in V_0" < "v \in V_0"
(i.e., v is missing)  in the first paragraph of 3, p(s,t) is used
without being introduced. Are these empirical probabilities?
Q2: Please summarize your review in 12
sentences
This paper presents a spectral algorithm for learning
FSTs in an unsupervised manner (i.e. without observing the
inputoutput alignments). The Hankel matrix is estimated by formulating a
constrained rank optimization problem, which is then relaxed to minimizing
the nuclear norm.
Pros:  extends previous work on spectral
learning of FSTs by addressing the case of unobserved alignments 
some theoretical results regarding the constrained rank problem
Cons:  those theoretical properties are not applicable to the
relaxed problem which is actually solved
Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This is a rather technical paper that introduces a
convex optimizationbased (spectral) formulation for unsupervised learning
of finite state transducers.
Unlike previous spectral approaches,
the proposed algorithm does not require explicit alignment information for
the inputoutput pairs in the training data. This causes difficulties in
that the Hankel matrix typically used in spectral learning is not fully
observed.
The main idea of the paper (which was presented before
in [3], but in the slightly different setup of missing matrix values due
to distribution mismatch), is that instead of using the full matrix,
observable statistics can be computed instead that constrain the
coefficients of the Hankel matrix. For instance, it is known that the
probability of observing a a pair of unaligned inputoutput sequences is
computed by summing over all possible alignments.
The main
technical contributions of the paper are sample complexity bounds as well
as identifiability results for the proposed estimation problem.
The experiments are synthetic and do not reveal a lot about the
practical viability of the proposed approach. An EMbased approach is
shown to outperform the proposed approach in terms of L1 error, but
claimed to be more expensive computationally. Actual numbers regarding the
computational efficiency of the considered algorithms are not given in the
paper.
Quality: The proposed approach is backed up by
extensive theory. Whether or not the approach is useful in practice is
hard to tell from the paper. In this reviewer's opinion, it would have
substantially strengthened the submission if more practical experiments
had been included in the manuscript. Clearly, FSTs are a very practical
construct with numerous applications in NLP and other disciplines, so
there is no good excuse for not including such experiments. As it stands,
I have to assume that the approach either does not scale to practical
amounts of data, or has some other limitations that keep it from being
applied to real data.
Clarity: Owing to the technical nature
of the contents, the manuscript is not an easy read. However, the paper is
wellwritten in general and I recognize that there is no magical way of
presenting formal results in an easily comprehensible manner. In any
case, the presentation is clear enough to allow for implementation of the
learning algorithm, although the experiments section probably lacks
sufficient detail in order to allow for reproduction of the results.
Originality: The proposed approach follows the same main idea
that was already presented in [3]; as such, I consider the present work to
be mostly incremental in nature.
Significance: Unsupervised
training of finite state transducers is an important problem that attracts
a lot of attention in the natural language processing community, and
deservedly so. Significant advances in this area are bound to have a major
impact. However, from the present manuscript, it is not entirely clear
whether the proposed approach represent such an
advance. Q2: Please summarize your review in 12
sentences
This is a highly technical but rather wellwritten
paper that introduces a spectral approach to unsupervised learning of
finite state transducers. The authors derive several interesting
theoretical properties of their approach (such as sample complexity bounds
and identifiability results), but the practical relevance is not
demonstrated convincingly.
Q1:Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for their valuable feedback.
All reviewers have pointed out that our experiments are somewhat
limited. We agree that empirical validation of the method in real
transduction tasks is an important research goal. To achieve that goal we
would first need to scale the optimization so that it can handle larger
problems. This requires an additional engineering effort that we leave
outside of the scope of this contribution. In any case, we believe that
the engineering challenge can be solved effectively, since recently there
have been largescale applications of matrix completion techniques (e.g.
netflix challenge) where the underlying optimization problem is similar in
nature to ours. That been said, doing a good empirical study on a real
task is probably a full paper by itself.
Based on the scores, all
reviewers have judged our work as "Incremental and unlikely to have much
impact". We respectfully disagree with the "incremental" assessment, which
typically has a negative implication in the ML community. It is true that
we study the same family of FST models than Balle et al ECML2011, who
proposed a spectral method. However, the differnces in the assumptions
made by the learning algorithm are far from superficial (as some reviewers
have actually pointed out). Spectral methods until now have been mainly
applied to problems in which the unobserved variables are restricted to be
the hidden states. In this paper we study the more fundamental issue
of missing state and structure, which leads to problems where the Hankel
matrix is not directly observable. The main contribution of the paper is a
new spectral approach that allows the recovery of structure and missing
state from observable statistics. Although superficially this might
seem a small difference, in reality it is a fundamental one. This is
because our approach opens the door for the application of spectral
methods to a wide range of problems in which both state and structure are
unobservable. For example, we are currently working on an application of
this method to learn Weighted Contextfree Grammars from string statistics
only, and the main ingredients of the method remain the same. We are
framing FST and grammar induction as a lowrank matrix optimization, and
as far as we can see this is radically different from any existing method.
Reviewer 1
With regards to clarity, (as the other two
reviewers acknowledged) it is hard to present the main results of the
paper in a way that it will make it easily accessible to a wide audience.
We will try to include more intuitions about the method wherever is
possible, and perhaps follow the suggestion of dropping one example. For
examples of problems where alignments are not available consider: speech
recognition, phonetictransliteration, machine translation, video event
recognition, etc.
Reviewer 2
Reviewer 2 questions the
value of the bounds given for the rank minimization problem (since the
problem itself is NPhard). The theoretical results on the rank
minimization problem prove that the target distributions are identifiable
and that the problem is statistically consistent. Thus it establishes that
the difficulty of learning FSTs is in essence computational. With regards
to the applicability of the bounds to the relaxation: It is known that for
problems satisfying certain properties (RIP or NSP, cf
http://arxiv.org/abs/1205.2081, and http://arxiv.org/abs/1204.1580)) the
solution of the relaxed problem is a solution to the rankminimization
problem. Unfortunately, it is also known that checking for these
properties is NPhard. In practice the nuclear norm relaxation seems to
give very good solutions. Clearly, when the solution of the relaxed
problem happens to be a solution of the rank minimization problem the
bounds apply.
Reviewer 3
With respect to actual costs
of the experiments. We used Matlab, and we used the CVX optimization
package in order to implement the convex optimization. For the comparison
with EM reported in Table (c), the convex optimization took roughly 1400
seconds, while EM took roughly 150 seconds per iteration and required at
least 100 iterations to get to the reported performance. Hence, under our
current implementation, the convex optimization is about 10 times
faster. We can report these figures in the next version. However, we are
reluctant to take any conclusion from these runs, as we have not
identified the most efficient method to solve the convex optimization, and
our implementations have not been optimized either. This engineering
effort would allow us to have a proper empirical study on real datasets.
 