
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper revisits fitting mixtures of Gaussians with gradient methods. A combination of manifold optimization, with a particular linesearch procedure, and a "reparameterized" objective, empirically beats batch EM in the experiments reported.
While I'm not an expert in Riemannian manifolds, the work appears plausible. The work is a reported at a level that's possible to follow, if taking results from cited references on trust.
Plain mixtures of Gaussians may not seem particularly exciting in this world of Bayesian nonparametrics and deep neural networks. However, GMMs are quietly plodding along as important workhorses in applications: they're useful and surprisingly powerful. People often then want to extend GMMs in various ways, e.g., by adding constraints or tying covariances together. But then EM usually doesn't apply. Gradientbased procedures would generalize fine, so ones that work well on the base model could have significant impact on statistical applications.
I would previously have used an unconstrained parameterization based on the Cholesky. While the paper gives reasons to disfavour this approach, it's a shame it's not included in the empirical comparison. Are spurious stationary points an issue in practice? Is the manifold optimizer actually better? Is the main gain here from the "representation" in (2.3); would that also help the Cholesky approach? The paper would be stronger if it answered these questions.
Minor comments:
Line 081: "open a new" > "open new"
Line 138: rely *on* geodesics
The timings in Table 4 would be probably be better as a graph. A graph would give an idea of the comparison more quickly. Precise times are irrelevant, being implementation and machine dependent.
I suggest replacing lines 147149 "At any point... specifically [4]" with: "Geodesics on the manifold from $\Sigma_1$ to $\Sigma_2$ are given by [4]:". Riemannian metrics are not explained in the paper, and don't need to be understood or mentioned to take on trust the geodesic result. (The lack of context, and the nearby definition of `$d$', encourges a misreading of $d$ as a variable, making the trace expression evaluate to `$d^3$' (!).)
I suggest adding the zero mean ("0,") explicitly to the normal density in (2.3) and the line below it. On a quick reading, augmenting the data with a constant in the line above (2.3) seems like a terrible idea, a Gaussian could fit these new data with infinite density. The zeromean constraint is what stops that from happening. Also, the compact Gaussian notation is only defined explicitly including a mean term above (2.1).
I wouldn't call the transformed problem (2.3) a "reparameterization". A different model is being fitted, one that would generate data from a different distribution than the original.
I suggest numbering all equations so other people can refer to them, not just in reviews, but in reading groups and citations.
Q2: Please summarize your review in 12 sentences
Novel gradientbased procedure for fitting GMMs, which appears to beat EM. While I'm slightly unclear on some claims, I'm interested in this paper as potentially really useful in applications.
I have increased my score in response to the rebuttal. The authors promise further clarification and control experiments to compare this work to the obvious Cholesky approach. I think this will be an interesting paper.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
summary: manifold optimization of GMMs
Quality: high (i think, to be determined)
Clarity: high
Originality: seems so to me (though i'm unfamiliar with the manifold optimization literature)
Significance: good good.
i like this paper. my comments are as follows:
1) i don't like tables, and i don't like only seeing means & standard errors, especially for hard optimization problems.
i also don't like seeing only performance times, because i can't tell how well the algorithms performed in terms of the objective.
2) i'd like the benchmarks from UCI to be in the main text. i'd also like to see comparisons with other implementations, eg, if you are using matlab, perhaps the gmmdist set of functions or mixmod, or in R, perhaps mclust.
3) in general, saying "EM algorithm" as your benchmark it is unclear what implementation you are using, which matters a lot.
for example, setting the convergence criteria differently can make the run time quite different, but the estimates quite similar.
therefore, i'd like to see the following:
(a) twodimensional contour plots (or some such), showing both time and performance for each of the numerical examples.
(b) same thing for a set of benchmarks.
(c) some plots showing likelihood (or the surrogate objective) and time as a function of iteration.
one big problem with EM is that often, the loss gets nearly flat, so the last bunch of iterations are useless, and the termination criteria could be set differently to obtain vastly superior performance.
i don't think it matters much what the results are, as long as they are qualitatively consistent with the text. but currently, the manuscript does not provide any evidence that the algorithms actually work.
Q2: Please summarize your review in 12 sentences
developed a modern algorithm for a classic problem, with promising results, however, no accuracy results are provided,
nor details of the benchmark algorithms, nor performance vs iteration, so more details are required for proper assessment.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Learning Gaussian Mixture Models is a very wellstudied problem in statistics and ML. Expectation Maximization (EM) algorithm from 30 years ago still remains the most popular algorithm. The current paper proposes to use manifold optimization (a reasonably wellestablished subarea of optimization) for this problem. Very briefly, think of the optimization problem where the variables are the parameters of the mixture model (i.e., the mean, covariance and weight for each component, where the number of components is fixed), and given these one can write down the expression for the (log) likelihood of the given data. The most important constraint on the variablesand the main source of difficultyis that the covariance matrices are positive semidefinite. One seeks to find parameters that maximize this. As the paper observes, direct use of standard manifold optimization is not effective as the algorithms tend to be too slow. But, as the paper shows, a simple reparametrization of the optimization program plus judicious application of standard manifold optimization algorithms with some tweaking improves the performance greatly: In the reported experiments, their algorithm (the paper has several algorithms; I will focus on the best) converges in time that's comparable and often better than EM, and the final value of the objective functions is generally the same for the two algorithms. (Note that these methods may converge to local maxima, and the question of how far the parameters thus estimated can be from the true parameters is not discussed here.) It of course remains to be seen how widely this holds. But given the importance of GMM, these results are interesting.
Minor: Line 129: "Problem (2.1) in general can require exponential time." This should be phrased more carefully: "Problem (2.1) in general can require number of samples that's exponential in K."
Q2: Please summarize your review in 12 sentences
I think the paper can be accepted on the strength of experimental results.
Q1:Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 5000 characters. Note
however, that reviewers and area chairs are busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for their feedback,
suggestions and encouragement. Before we address the comments
individually, we'd like to emphasize our message:
Matching or
exceeding EM on GMMs using manifold optimization methods is valuable,
especially because the flexibility of these methods opens new algorithmic
avenues.
Reviewer 1
 Reparametrization We agree that
reparametrization may not be fully correct. As shown in Theorem 2.2, the
new objective function is a reparametrization of the original at its
stationary points. To avoid confusion we can call it
modeltransformation.
 Cholesky We apologize that we
extrapolated based on comments in [29] to claim that Cholesky
factorization adds spurious stationary points. In fact, we realized that
our modeltransformation shows that Cholesky does not add any local
minima! (this observation may be new). Curiously, our transformation also
accelerates convergence of Cholesky formulation.
But even the
transformd Cholesky problem is 2X10X slower than our manifold approach.
During the rebuttal period, we could only finish simulations of Cholesky
approach for Table 1. We will update our paper with more complete
results.
Supplement to Table 1 [Time(s)]  CG
Cholesky  LBFGS Cholesky  c=0.2
K=2  23.5+11.9  14.7+6.4  ...... K=5  106.7+39.7  68.6+41.4

c=1 K=2  49.0+17.8  31.8+16.3  ... K=5  279.8+111 
272.1+137 
c=5 K=2  177.6+147.6  117.7+58  ......K=5 
562.1+242.7  606.7+344
Thanks for careful reading, catching
typos, and other suggestions. We will apply all suggestions in the final
version.
Reviewer 2
 1st comment Since the reviewer
suggests contour plots for the benchmark data, perhaps the reviewer thinks
we are using different initializations and showing averaged values on
benchmark data? For this data, we ran the algorithms only once and report
the objective and the performance time. The objective ALL (average
loglikelihood) is the loglikelihood of the data divided by the number of
samples. Since the ALL values can be arbitrary, contour plots may be quite
noisy and not so informative.
Similar to EM, manifold optimization
converges fast in the first few iterations. We will add
time/likelihood/iteration plots as per the reviewer's suggestion in (c).
One concrete such result is the following (corr. to d=35, and K=5;
iteration means one step of EM, or one LL and gradient evaluation for
LBFGS):
Iter  LBFGS  EM  2  25.78  30.18 8 
31.40  31.51 16  32.05  31.80 32  32.07  31.98 64
   32.07
It is true that EM gets slow close to singular
solutions; but the same is true for other optimization methods. We sought
a fair stopping criterion and used a tolerance of 1e6, which seems to be
a standard value in many implementations, including Matlab. For a looser
criterion, in flatter regions LBFGS may stop early and yield smaller LL,
but EM suffers more. For tighter criteria, unsurprisingly EM suffers much
more.
 2nd Comment Computation costs for EM are just the Estep
and Mstep, while the main costs for manifold optimization (esp. for
highD) are computing loglikelihood and gradients. The Estep costs
similar to computing the loglikelihood; the Mstep costs about the same
as the gradient, so mismatch due to implementation is reasonably
small.
We developed a Matlab toolbox for mixture models, and used
it in our simulations; we will add a link to this toolbox. Upon the
reviewer's suggestion, we also tried 'fitgmdist' from Matlab; the timing
results are very similar. The results reported in our paper are very
conservative: indeed, the implementation of the manifold optimization
methods can be substantially improved, which may not be the case for
EM.
We assure that we will try our best to squeeze UCI results into
the main text.
 3rd Comment As noted above, we believe we used
a fair convergence criterion; we can add results with looser and tighter
criteria to offer a more complete picture.
Reviewer
3
Indeed, the methods may converge to local maxima. Avoiding local
maxima is an independent research problem and many empirically effective
approaches for tackling local maxima (eg. splitandmerge algorithm)
exist; these currently focus on EM, but actually apply to our algorithms
too.
Reviewer 4
We believe our paper offers a new set of
algorithmic tools for GMMs. Our method is grounded in optimization, so
extensions to stochastic, parallel, and other regimes are natural
sequels.
Detailed comparisons with Z&W are orthogonal to our
paper: we are proposing a new way to estimate GMM parameters, while
Z&W use GMMs on natural images. We use image data as a realistic
example on which people use GMMs. We have benchmarked our algorithm on
several other data (see also appendix). 
