NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 1783
Title: Ultrametric Fitting by Gradient Descent

Reviewer 1

The paper introduces a new algorithm to learn hierarchical clusters based on fitting an ultrametric to the data. This is achieved by transforming an optimization problem over the constrained set of ultrametrics into one over an unconstrained set of weighted graphs, with the ultrametric constraint implicitly satisfied by the cost function. The idea of recasting ultrametric fitting in a way conducive to gradient-based methods is novel. It admits many ideas for regularization to modify hierarchical clusters in a way that standard hierarchical clustering approaches (e.g., hierarchical agglomerative clustering) do not allow as easily.

The paper then demonstrates the flexibility of this new formulation by showing how different regularizations and data-fidelity terms can be incorporated into the cost function. For example, noticing that under a vanilla MSE cost small clusters appear high up in the dendrogram, the paper proposes penalizing such occurrences to produce more balanced hierarchies. Another example is the ability to incorporate semi-supervision in the form of specified triplets. These cost functions are then shown to be approximable by differentiable functions and, in most cases, efficiently computable within a gradient descent framework.

Finally, experimental results validate the framework (using the CUCP algorithm as a baseline) for runtime and accuracy, and show that the quality of the clusters is comparable to standard algorithms (Ward and semi-supervised SVM). The paper is well written and organized. It potentially opens new directions for exploring hierarchical clustering algorithms through the new ultrametric fitting approach.
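The triplet-style semi-supervision mentioned above can be sketched as a hinge penalty on the fitted ultrametric: for a triplet (ref, pos, neg), the hierarchy should separate ref from neg at a height no lower than where it separates ref from pos. The function name, matrix representation, and margin below are illustrative assumptions, not the paper's actual formulation.

```python
def triplet_loss(u, triplets, margin=1.0):
    """u: pairwise ultrametric matrix; triplets: list of (ref, pos, neg).
    Penalizes triplets where u[ref][neg] fails to exceed u[ref][pos] + margin.
    (Illustrative sketch; the paper's exact loss may differ.)"""
    loss = 0.0
    for ref, pos, neg in triplets:
        # Hinge: zero when the neg separation is at least margin above the pos one.
        loss += max(0.0, margin + u[ref][pos] - u[ref][neg])
    return loss

# Toy ultrametric: points 0 and 1 merge at height 1, point 2 joins at height 3.
u = [[0.0, 1.0, 3.0],
     [1.0, 0.0, 3.0],
     [3.0, 3.0, 0.0]]
# Triplet (0, 1, 2) is satisfied (1.0 + 1.0 <= 3.0), so its hinge is zero.
satisfied = triplet_loss(u, [(0, 1, 2)])   # -> 0.0
violated = triplet_loss(u, [(0, 2, 1)])    # -> 3.0
```

Because `max(0, ...)` is piecewise differentiable in the ultrametric values, such a term composes naturally with the gradient descent framework the review describes.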

Reviewer 2

The paper proposes an optimization framework for ultrametric fitting in the context of hierarchical clustering. The main idea of the paper is to use a min-max operator directly in a cost function defined for a given edge-weighted graph in order to find the optimal ultrametric for a dissimilarity graph. Two different ultrametric-related cost functions and two different regularizations are used. Experimental evaluations on superpixel adjacency graphs and on datasets from the LIBSVM webpage illustrate the performance of this hierarchical clustering approach compared to the state-of-the-art unsupervised Ward method and a semi-supervised SVM.

Strengths:
- The paper addresses an interesting problem in hierarchical cluster analysis via an optimization framework for ultrametric fitting.
- The use of the min-max operator to solve the optimization problem of finding the best ultrametric has not been studied so far.
- The paper is overall well written, with a good structure. However, some details should be clarified (see below).
- The authors make good use of already existing cost functions for evaluating an ultrametric.
- The experimental part gives a good overview of the comparison to state-of-the-art agglomerative methods.

Weaknesses:
1) Lines 8, 56, 70, 93: usage of the word "equivalent". I would suggest a more cautious use of this word, especially where the equivalence is not verified.
2) A notational distinction between the ultrametric d and the ultrametric u would make their different usages clearer.
3) Line 186 ff.: using the subdominant ultrametric for the cluster-size regularization would make the algorithms part more consistent with the subsequent considerations in the paper.
4) The paper is not sufficiently clear in some respects (see below for a list of questions).

Overall, I have the impression that these weaknesses could be fixed by the final paper submission.
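For concreteness, the min-max operator this review refers to maps an edge-weight matrix to its subdominant ultrametric: u(x, y) is the minimum over all x-y paths of the maximum edge weight along the path. The following is a minimal sketch of that operator on a dense matrix (a Floyd-Warshall-style relaxation); the representation and function name are assumptions for illustration, not the paper's implementation.

```python
import math

def subdominant_ultrametric(w):
    """w: symmetric edge-weight matrix, math.inf for missing edges.
    Returns the minimax path distances, i.e. the largest ultrametric
    below w (the subdominant ultrametric). O(n^3) sketch."""
    n = len(w)
    u = [row[:] for row in w]
    for i in range(n):
        u[i][i] = 0.0
    # Floyd-Warshall variant: relax the max edge along paths through k.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                u[i][j] = min(u[i][j], max(u[i][k], u[k][j]))
    return u

# Example: path graph 0-1-2 with edge weights 1 and 3.
inf = math.inf
w = [[inf, 1.0, inf],
     [1.0, inf, 3.0],
     [inf, 3.0, inf]]
u = subdominant_ultrametric(w)
# The only 0-2 path goes through edge weights 1 and 3, so u[0][2] == 3.0.
```

Composing a cost with this operator is what lets the optimization run over unconstrained edge weights while the output remains a valid ultrametric.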

Reviewer 3

Originality: For the aforementioned contributions, I believe this work provides a creative, unique approach to this problem.

Quality: I believe this paper to be technically sound and a complete work that presents interesting approaches to hierarchical clustering.

Clarity: The paper is well written and clearly explains the approach. However, there were some details that I thought could have been made clearer, in both the presentation and the experiments. Unless I've missed something, I think it would be good to state more clearly the process (and its complexity) of going from the ultrametric fit to the data to a dendrogram. I believe this can be done using the LCA function applied to all pairs of points / the Jacobian of \phi. If there is no room in the body of the paper, I believe it could be included in the supplementary material. I also think the experiments could better show the effectiveness of the approach in constructing high-quality hierarchical clusterings. A clearer experiment would compare their optimization of Dasgupta's cost to alternative approaches, to illustrate which approaches find clusterings with lower cost.

Significance: I believe the work introduces novel methodology and a creative approach in a relatively unexplored space of gradient-based methods for hierarchical clustering. The methodology is clearly expressed. However, I do feel that the experiments of the paper are a bit weak and could be made stronger, as mentioned in the clarity section. The work compares both hierarchical clustering objectives and optimization methods, but it would be stronger if the experiments compared the proposed method to alternative approaches that also optimize the Dasgupta cost function. I also think the paper would benefit from a lengthier discussion of the merits of gradient-based approaches for ultrametric fitting (and perhaps further experiments to support this). The paper does not give theoretical results.
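The ultrametric-to-dendrogram step this review asks about can be sketched directly: since single linkage applied to an ultrametric reproduces it exactly, sorting the pairwise ultrametric values and merging with a union-find recovers the dendrogram (the merge height of two points being the altitude of their LCA). All names below are illustrative assumptions, not the paper's code.

```python
def dendrogram_merges(u):
    """u: symmetric ultrametric matrix over n points.
    Returns a list of (height, root_a, root_b) merges in order of
    increasing height -- a sketch of dendrogram reconstruction."""
    n = len(u)
    parent = list(range(n))

    def find(x):
        # Union-find with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = sorted((u[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    merges = []
    for h, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:  # distinct clusters meet for the first time at height h
            parent[rj] = ri
            merges.append((h, ri, rj))
    return merges

# Three points: {0, 1} merge at height 1.0, then join 2 at height 3.0.
u = [[0.0, 1.0, 3.0],
     [1.0, 0.0, 3.0],
     [3.0, 3.0, 0.0]]
merges = dendrogram_merges(u)  # -> [(1.0, 0, 1), (3.0, 0, 2)]
```

Sorting dominates, so this runs in O(n^2 log n) over the n(n-1)/2 pairs, which gives a concrete answer to the complexity question raised above, at least under these assumptions.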