
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper reduces a broad class of machine learning problems involving latent variables to the problem of finding "anchors" defining the conical hull of the data (via the method of moments). In addition, it proposes a new divide-and-conquer algorithm based on random projections to speed up the search for the anchors.
Overall, I found this an interesting paper presenting significant contributions. However, the presentation could be greatly improved, as it lacks clarity here and there. It looks like this paper was squeezed in a hurry to fit the 8-page limit.
In Definition 2, line 113, isn't the identity matrix only for the case X=Y?
Line 122 is not very precise: "finds the subset of rows in Y that define the same cone as all the rows in X". It's not necessarily the same cone, but a supercone, i.e., one which contains cone(X).
Line 132: "in the center". Which center?
Line 244: why is [4] a "Bayesian learning method"?
Definition 4 is confusing. What do i, j, and s represent?
Figures 3, 4, 5, and 6 are too small, and the font size in the legends is tiny. As they stand, they are unreadable.
A conclusion section is missing.
Q2: Please summarize your review in 1-2 sentences
This paper reduces a broad class of machine learning problems involving latent variables to the problem of finding "anchors" defining the conical hull of the data (via the method of moments). In addition, it proposes a new divide-and-conquer algorithm based on random projections to speed up the search for the anchors. Overall, I found this an interesting paper presenting significant contributions. However, the presentation could be greatly improved, as it lacks clarity here and there. It looks like this paper was squeezed in a hurry to fit the 8-page limit.
Submitted by Assigned_Reviewer_18
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary: This work proposes a new algorithm, "divideandconquer anchoring" (DCA), that solves the problem of selecting a subset A of rows from a matrix Y such that those rows best model another matrix X: X = FY_A. Previous work [29] addresses the simpler setting where X = Y and shows how this setting generalizes nonnegative matrix factorization (NMF). The even more general X != Y setting that this work tackles is shown to generalize many core machine learning problems: Gaussian mixture models (GMMs), hidden Markov models (HMMs), latent Dirichlet allocation (LDA), and subspace clustering (SC). This setting relies on a "general separability assumption", which requires that there exist a subset A of Y's rows such that the rows of X are contained in their cone. Lemma 1 gives a very simple condition for the identifiability of A, which allows DCA to be much more robust to variance in the sample moments than spectral methods. Experimental results on a large number of tasks indicate that DCA is almost always much faster than traditional learning methods (including spectral ones) without losing significant accuracy.
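To make the X = FY_A setting concrete, here is a minimal sketch (our illustration, not the authors' code) of what the general separability assumption buys: once the anchor subset A is known, the nonnegative weights F can be recovered row by row with nonnegative least squares, and a near-zero residual certifies that the rows of X lie in cone(Y_A). The function name `fit_cone` is hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

def fit_cone(X, Y, anchors):
    """Solve min_{F >= 0} ||X - F Y[anchors]||_F row by row via NNLS."""
    Y_A = Y[anchors]                      # rows of Y indexed by the anchor set A
    F = np.zeros((X.shape[0], len(anchors)))
    for i, x in enumerate(X):
        F[i], _ = nnls(Y_A.T, x)          # x ≈ F[i] @ Y_A with F[i] >= 0
    residual = np.linalg.norm(X - F @ Y_A)
    return F, residual

# Toy example: X's rows are nonnegative combinations of Y's first two rows,
# so {0, 1} is a valid anchor set and the residual is essentially zero.
Y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
F_true = np.array([[2.0, 1.0], [0.5, 3.0]])
X = F_true @ Y[:2]
F, res = fit_cone(X, Y, anchors=[0, 1])
```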
Comments:
Clarity, quality: The general idea and algorithm of the paper are clear, but the math in this work is pretty dense. Additional small, illustrative examples and figures (similar to Figure 2) would help make some of the concepts more concrete.
Originality: The work of [29] seems to be the most closely related to this paper. The X != Y difference represents a large delta in terms of theoretical and experimental work though, so this paper is certainly sufficiently original.
Significance: Given the large number of practical machine learning tasks that this minimum conical hull problem encompasses, and its clear performance advantages on them, this work seems very significant.
Supplement: It's somewhat difficult to map back and forth between the document and the supplement, since the theorem/lemma/etc. numbers don't match. When referring to [3] (the supplement) from the main paper, it would be helpful if the authors could use the optional argument of the cite command to indicate more precisely which part of [3] is being referenced: \cite[specific location]{supplement}. Additionally, there is a lot of information in the supplement that isn't represented in the main body of the paper. Perhaps this paper would be better suited to a venue without page limits (such as a journal).
Experiments: Do the authors have plans to release the associated code? How were the s-values (50, 92, 133, 175) selected? Often DCA wins speed-wise but loses slightly accuracy-wise. Can the authors comment further on why this might be the case? Are there knobs that can be turned on the DCA algorithm to alter the speed/accuracy balance? (Besides the s-knob, which doesn't seem to have a huge effect.)
Organization: Since NMF is already covered by [29], it might make sense to leave its explanation and experiments out of the main body of the paper and focus on explaining the other tasks (or the theory) in greater detail.
The variable "c": The variable "c" in the definition of "s" on line 268 is only really defined in the supplement. Even there it's somewhat unclear what c's range is. Can the authors provide a mathematical statement describing the relationship between c and the flatness of the flattest anchor? Is n the maximum value for c?
Projection matrices: Proposition 1 indicates that projection matrices are drawn uniformly from the Grassmanian manifold, and the supplement indicates several other random matrix ensembles that could be drawn from. It would be helpful if the authors could provide a few references detailing how to draw from each of these.
More minor notes:
1) "although spectral method using" > "although spectral methods using"
2) "random drawn low-dimensional" > "randomly drawn low-dimensional"
3) Equation (1) and elsewhere: Use \left and \right to make brackets and parentheses match the height of the math that they enclose.
4) "general models such as matrix factorization and latent variable model" > "general models such as matrix factorization and latent variable models"
5) "most popular models such as GMM and HMM" > "most popular models such as GMMs and HMMs"
6) "the two D matrices in RHS of" > "the two D matrices in the RHS of"
7) "either matrix in LHS of" > "either matrix in the LHS of"
8) "each handled by a solver to minimum conical hull problem" > "each handled by a solver to the minimum conical hull problem"
9) "anchoring algorithm for subproblem on 2D-plane" > "anchoring algorithm for a subproblem on the 2D plane"
10) "Due to the convexity of cone" > "Due to the convexity of cones"
11) "For arbitrary point" > "For an arbitrary point"
12) "w.h.p. solving" > "w.h.p.\ solving"
13) "given two sets of points(rows)" > "given two sets of points (rows)"
14) "under general separability assumption" > "under the general separability assumption"
15) Supplement, line 415: "Lemma ??" > "Lemma 2"
16) "each point on each curve is the result by averaging" > "each point on each curve is the result of averaging"
17) "Because DCA uses separability assumption" > "Because DCA uses the separability assumption"
18) There are many grammatical errors in the supplement besides the ones listed above. It could use a more careful proofreading.
Q2: Please summarize your review in 1-2 sentences
This work proposes a new algorithm, "divide-and-conquer anchoring" (DCA), that solves a problem which generalizes many other core machine learning problems (NMF, GMMs, HMMs, LDA, SC). It is both interesting theoretically and performs well experimentally.
Submitted by Assigned_Reviewer_33
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes to solve the problem of finding the k extremal rays of a minimum conical hull, as an alternative to common algorithms such as EM. To reach a solution, the authors introduce a divide-and-conquer scheme that reduces the original problem to subproblems that are easy to solve. Applied to several models, the method shows competitive performance.
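As a rough illustration of the divide-and-conquer scheme described above (a hedged sketch of the X = Y special case with names of our own invention, not the paper's actual algorithm): each random 2D projection yields a tiny subproblem whose two extreme rays can be found by comparing angles, and the rows selected most often across subproblems are returned as the anchors.

```python
import numpy as np

def extreme_rays_2d(P):
    """Indices of the two rows spanning the 2D cone of the rows of P.
    Angles are measured relative to the mean direction, which lies inside
    the cone, so no wrap-around occurs when the cone spans < 180 degrees."""
    ref = np.arctan2(*P.sum(0)[::-1])
    ang = np.arctan2(P[:, 1], P[:, 0]) - ref
    ang = (ang + np.pi) % (2 * np.pi) - np.pi
    return {int(np.argmin(ang)), int(np.argmax(ang))}

def dca_anchors(Y, k, s, seed=0):
    """Vote for anchor rows of Y across s random 2D subproblems."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(Y.shape[0])
    for _ in range(s):                               # s = O(k log k) subproblems
        M = rng.standard_normal((Y.shape[1], 2))     # random 2D projection
        for i in extreme_rays_2d(Y @ M):
            votes[i] += 1
    return np.argsort(votes)[-k:]                    # k most frequently hit rows
```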
The paper is overall clear. The good experimental results and the simplicity of implementation suggest the usefulness of the proposed method. The supplementary material contains enough detail.
The originality is limited, since it combines several tricks that have been widely used in the optimization and machine learning communities, such as the conical hull formulation and the random projection trick.
Although the work is technically correct, the following points would require clarification:
- The notation should be checked for readability. Some notation is used without being defined. For example, at Line 138 there is no definition of the operator "\otimes"; its definition appears only at Line 191.
- A more extensive literature review on the use of conical hulls for parameter optimization would improve the clarity of this work.
- Concerning the experimental results, more methods should be included in the comparison. For instance, Viterbi (MAP) training for HMMs should be included besides the Baum-Welch (EM) algorithm. Moreover, the computational cost of k-means GMM learning seems odd: in contrast to EM, why does the time cost of k-means stay almost the same as the number of components increases? If some trick is used for k-means, it should be specified.
Q2: Please summarize your review in 1-2 sentences
The work overall is interesting and appears to work well empirically. As it stands, I think the theoretical analysis is incomplete and some missing content is needed.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note, however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We would like to thank all the reviewers for their efforts in reviewing this paper and their constructive comments!
Re Reviewer #1:
Yes, it is not easy to squeeze the 23-page full version into only 8 pages, but we will try our best to make the short version clearer than the current one, remove all the typos, adjust the font sizes in the figures, and add a conclusion section (one is available in our arXiv version). For the current draft, all the missing details are contained in the submitted supplementary material.
Thanks for the comments about lines 113 and 122! Yes, they hold only under the separability assumption and should be changed for the general version. It is worth noting, however, that these two typos do not affect the correctness of the other parts of this paper.
Line 132: in the center of multiple layers of conical hulls.
Line 244: it should be a “Bayes’ learning method”, because it applies Bayes’ rule in formulating the matrix factorization model of the probabilities, and recovers the topic probabilities from the matrix factorization using Bayes’ rule.
Definition 4: i, j, and s are defined in the conditions in (8): s indexes an arbitrary point in X, j indexes an arbitrary point in the subset A of Y, and i indexes an arbitrary point in the complement of the subset A in Y.
Re Reviewer #18:
Thanks for your positive comments about our work and the detailed minor notes! We find them very useful and will use them for the revision. We will also improve the mapping between the short and long versions, as well as the organization. The code for all the different models will be released later.
The number of subproblems s can be selected according to the theoretical bound, i.e., s = O(k log k), and the computational budget. The sequence {50, 92, 133, 175} in the NMF experiment is just a randomly generated increasing sequence of s values showing how performance and speed change as s increases. On NMF, DCA loses slightly in accuracy because it uses a randomized strategy (the others are deterministic), and we evaluate recovery accuracy on the training set. As s goes to infinity, it is expected to achieve the same performance as the others. It is worth noting that we rarely see this slight drawback on the other tasks, which instead evaluate generative performance on a test set. Besides s, you can also adjust the dimension of each subproblem to trade speed for accuracy. In that case, however, the subproblems are hard to solve with our fast anchoring algorithm, and slower iterative algorithms are needed.
Variable “c”: at the beginning of Theorem 3, c/k (where k is the number of anchors) is defined as the lower bound for \alpha - 2\beta, where \alpha and \beta are the angles defined in (27). From Figure 2, a large c comes with a large \alpha and a small \beta, which indicates a less “flat” anchor.
The projection matrix can be a fast random transformation or a random sparse matrix. We will include references to them in a later version.
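For concreteness, here is a sketch of two textbook constructions matching the ensembles mentioned in this exchange (our illustration, not the authors' code): a uniform draw from the Grassmannian obtained from the QR decomposition of a Gaussian matrix, and a sparse sign matrix in the style of database-friendly random projections. Function names and the particular density are our assumptions.

```python
import numpy as np

def grassmannian_projection(p, d, rng):
    """Orthonormal basis of a uniformly random d-dim subspace of R^p.
    QR of an i.i.d. Gaussian matrix yields a subspace distributed
    uniformly (invariantly) over the Grassmannian."""
    G = rng.standard_normal((p, d))
    Q, _ = np.linalg.qr(G)
    return Q                              # p x d, columns orthonormal

def sparse_projection(p, d, rng, density=1 / 3):
    """Sparse random matrix with entries in {-1, 0, +1}, scaled so that
    each entry has variance 1/d (norm-preserving in expectation)."""
    signs = rng.choice([-1.0, 0.0, 1.0], size=(p, d),
                       p=[density / 2, 1 - density, density / 2])
    return signs / np.sqrt(density * d)

rng = np.random.default_rng(0)
Q = grassmannian_projection(100, 2, rng)
S = sparse_projection(100, 2, rng)
```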
Re Reviewer #33:
Thanks for your comments! However, we do not think our work is a simple combination of existing tricks. As we stated in the paper (lines 057-061), our major difference from the standard conical hull problem is that we use the model X = FY_A rather than X = FX_A. This requires us to build new theory for the new model (Definitions 2 & 3, Lemma 1), which allows us to apply the model to a much richer class of learning problems that cannot be addressed by existing conical hull methods. In addition, unlike JL-lemma-based random projection (RP), we use RP to preserve only partial information, so the projections can have ultra-low dimension (1 or 2). This benefits from the new “divide-and-conquer randomization” idea (lines 076-078), which differs from most existing “batch” RP methods. New theory has been built for this idea as well (Proposition 1, Theorem 1, Corollary 1).
We will improve the notation and make it clearer in the final version, given the chance.
We will also include more baseline methods in our empirical comparisons. The computational cost of k-means in our paper is lower than that of standard k-means because we use a faster, publicly available implementation called “litekmeans”.
We reviewed six papers ([4, 5, 12, 15, 20, 29]) about conical/convex hull problems for parameter optimization, all state-of-the-art, and we provided empirical comparisons to most of them. Given the one-page limit on references, and because one of our major contributions is reducing other problems to the conical hull problem rather than only developing a fast conical hull algorithm, we believe these are sufficient. However, we will include more in the long version of the paper.
The optimization problem, as well as the objective of the general conical hull problem, is formally given in (3). The objective is to solve the equation X = FY_A, or equivalently to minimize \|X - FY_A\|, as stated in Section 2. For matrix factorization, this is recovery-error minimization. For latent variable models, the objective is to fit the moments with the model parameters. This is given in (4).
 